
Nutanix

NSS 5.x Troubleshooting


Student Guide
January 5, 2018
Version 1.1

© Copyright 2017 Nutanix Inc.


Table of Contents
Module 1 - Introduction to Troubleshooting

Module 2 Tools and Utilities

Module 3 Services and Logs

Module 4 Foundation

Module 5 Hardware Troubleshooting

Module 6 Networking

Module 7 Acropolis File Services Troubleshooting

Module 8 Acropolis Block Services Troubleshooting

Module 9 DR

Module 10 AOS Upgrade Troubleshooting

Module 11 Performance

Appendix A Module 4 Foundation Troubleshooting - Foundation log

Appendix B Module 7 Acropolis File Services Troubleshooting - Minerva log


Copyright
Copyright 2017 Nutanix, Inc.
Nutanix, Inc.
1740 Technology Drive, Suite 150
San Jose, CA 95110
All rights reserved. This product is protected by U.S. and international copyright and intellectual
property laws. Nutanix is a trademark of Nutanix, Inc. in the United States and/or other jurisdictions.
All other marks and names mentioned herein may be trademarks of their respective companies.

License
The provision of this software to you does not grant any licenses or other rights under any Microsoft
patents with respect to anything other than the file server implementation portion of the binaries for
this software, including no licenses or any other rights in any hardware or any devices or software
that are used to communicate with or in connection with this software.
Module 1
Introduction to Troubleshooting
Nutanix Troubleshooting 5.x

Introduction to Troubleshooting
Course Agenda

1. Intro 7. AFS
2. Tools & Utilities 8. ABS
3. Services & Logs 9. DR
4. Foundation 10. AOS Upgrade
5. Hardware 11. Performance
6. Networking

Course Agenda
• This is the Introduction module.
Objectives

After completing this module, you will be able to:

• Explain the five steps in the life cycle of a customer support case.

Objectives
After completing this module, you will be able to:
• Explain the five steps in the life cycle of a customer support case.
Introductions

• Name?
• Company?
• Where?
• IT Experience?

Installing a Nutanix Cluster


4
Scenario-Based Approach

Learn by going through the steps


of a typical Customer Support Call

Use what you


learned in Labs and
Classroom Exercises

Scenario-Based Approach
• This course uses a Scenario-based Approach that lets you practice these steps as
you learn troubleshooting techniques.
o Each scenario begins with a realistic customer call reporting a problem that you
need to troubleshoot.
• You will work through each step of the life cycle of the customer support call to
diagnose the customer's problem.
o As the course progresses, you will need to use the knowledge and skills that you
learned in earlier scenarios to troubleshoot the customer's problem.
The Life Cycle of a Case
Step 1: Isolate and Identify Problems

Life Cycle of a Case Step 1


The Life Cycle of a Case
Step 1: Isolate and Identify Problems
Step 2: Research Problems in Documentation

Life Cycle of a Case Step 2


The Life Cycle of a Case
Step 1: Isolate and Identify Problems
Step 2: Research Problems in Documentation
Step 3: Gather and Analyze Logs
(Steps 2 and 3 may be reversed, depending on the problem.)
8

Life Cycle of a Case Step 3


The Life Cycle of a Case
Step 1: Isolate and Identify Problems
Step 2: Research Problems in Documentation
Step 3: Gather and Analyze Logs
(Steps 2 and 3 may be reversed, depending on the problem.)
Step 4: Troubleshoot Problem
Level 2 Support Tasks: Provide In-depth Analysis, Re-create the Problem
9

Life Cycle of a Case Step 4


The Life Cycle of a Case
Step 1: Isolate and Identify Problems
Step 2: Research Problems in Documentation
Step 3: Gather and Analyze Logs
(Steps 2 and 3 may be reversed, depending on the problem.)
Step 4: Troubleshoot Problem
Level 2 Support Tasks: Provide In-depth Analysis, Re-create the Problem
Step 5: Complete, Document, or Escalate the Case
10

Life Cycle of a Case Step 5


The Life Cycle of a Case
Step 1: Isolate and Identify Problems

11
1: Isolate and Identify Problems

• List Possible Causes
o Ask Case-Framing Questions
o Check Cluster Services Status
o Identify Process Crashes
o Run NCC

12

Step 1: Isolate and Identify Problems


• Begin troubleshooting a problem by narrowing down the list of possible causes.
o Do this by asking case-framing questions, checking the status of cluster services,
identifying process crashes, and running Nutanix Cluster Check.
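A quick sketch of what the first two of those checks look like from a CVM SSH session (both commands are covered in detail in the Tools and Utilities module):
cluster status              # confirms whether all cluster services are UP on every CVM
ncc health_checks run_all   # runs the full set of Nutanix Cluster Check health checks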
2: Research Problems in Documentation
Search Nutanix Support Portal for:
• Solutions, Help, and Advice
• Nutanix Product Documentation
• Knowledge Base Articles (KBs)
• Field Advisories
• Bulletins
• Videos
• Training

13

Step 2: Research Problems in Documentation


• After you have isolated and identified the problem, use the Nutanix support portal to
search for solutions, help, and advice in Nutanix product documentation, Knowledge
Base articles (KBs), field advisories, bulletins, videos, and training.
3: Gather and Analyze Logs

Logs That Relate to Your Issue

14

Step 3: Gather and Analyze Logs


• Gather and analyze logs specific to the type of issue you are troubleshooting.
o NOTE: Steps 2 and 3 may be reversed, depending on the problem you are
troubleshooting.
4: Troubleshoot the Problem

Use the Info you Collected

• Troubleshoot Problem

15

Step 4: Troubleshoot the Problem


• Using the information you gathered in steps 1 - 3, troubleshoot the problem.
o This may involve using tools such as the IPMI web page or looking for Genesis
connectivity issues.
5: Communicate, Document, or Escalate the Case

Tell the Customer About the Solution

• …or Escalate if Necessary
Properly document the case.

16

Step 5: Communicate, Document, or Escalate the Case


• At the end of the case, you need to communicate the problem resolution or
workaround to the customer or escalate the case.
o Either way, you need to properly document the case.
The Enterprise Cloud Company

NSS 5.x Certification Overview

17

• NSS 5.x Certification Overview


NSS 5.x Certification Topics

• Removal of soft-skills questions.


• Addition of Nutanix Configuration section.
• Exclusively AHV Hypervisor.
• Data Protection.
• AOS 5.x Features.
• You are encouraged to have an in-depth understanding of all topics documented in the NSS Exam Blueprint.
18

• NSS 5.x Certification Topics


Written Exam

• The NSS 5.x written exam will consist of a pool of multiple choice
questions. The exam is closed-book and offered in a proctored
environment. The candidate is not required to attend the NSS
Troubleshooting Course before attempting the Written exam.
• BUT it is highly recommended!
• The candidate is required to pass the NSS 5.x Written exam prior
to being offered the NSS 5.x Lab exam.
• The candidate will need to achieve a score of 70% on the Written
exam in order to pass.
19

Written Exam
Lab Exam - Format

• The NSS Lab exam is a four-hour, hands-on exam which requires


you to configure and troubleshoot a series of tasks related to the
Nutanix environment.
• The NSS Lab exam consists of a 2 hour and 30 minute
Configuration section, and a 1 hour and 30 minute
Troubleshooting section.
• Passing score is 80%.
• The candidate will only have access to the Nutanix Support Portal
for documentation reference.
20

Lab Exam – Format


Lab Exam - Configuration Section

• The candidate will be presented with a series of Configuration tasks.


• Some tasks may contain multiple steps.
• No partial credit will be given.
• The candidate is required to achieve a score of 80% or greater on the
Configuration section before moving on to the Troubleshooting section. If a
score of 80% or greater is not reached, you will be required to reschedule
your lab practical exam.
• If the candidate achieves a score of 80% or greater, he/she will be given the
Troubleshooting portion of the exam.

21

Lab Exam - Configuration Section


Lab Exam - Troubleshooting Section

• The Nutanix cluster is pre-configured, but contains faults.


• The candidate will be presented with a series of
Troubleshooting tasks.
• The candidate will be required to identify and fix the issues that
are present in the environment.
• If the candidate achieves a score of 80% or greater on the
Troubleshooting section, he/she will have achieved the NSS
Certification.

22

Lab Exam - Troubleshooting Section


Recertification

• Recertification will be required every 2 years from the


date of original certification.
• A candidate can recertify by passing the NSS written
exam.
• If you fail to recertify and your certification expires, you will need to retake the Written and the Lab exam.

23

Recertification
Lab Equipment

24

Lab Equipment
Thank You!

25

Thank You!
Module 2
Tools and Utilities

Nutanix Troubleshooting 5.x

Module 2 Tools and Utilities


Course Agenda

1. Intro 7. AFS
2. Tools & Utilities 8. ABS
3. Services & Logs 9. DR
4. Foundation 10. Upgrade
5. Hardware 11. Performance
6. Networking

Course Agenda
• This is the Tools and Utilities module.
Objectives

After completing this module, you will be able to:

• Provide a comprehensive view of the primary tools available to troubleshoot issues on a Nutanix cluster and related applications
• Run and interpret the output of Nutanix Cluster Check (NCC)
• Determine which tool is best to use when troubleshooting a Nutanix issue
• Collect a comprehensive log bundle from the cluster to provide to Nutanix Support for analysis

Objectives
• Provide a comprehensive view of the primary tools available to troubleshoot issues
on a Nutanix cluster and related applications.
• Run and interpret the output of Nutanix Cluster Check (NCC).
• Determine which tool is best to use when troubleshooting a Nutanix issue.
• Collect a comprehensive log bundle from the cluster to provide to Nutanix Support for
analysis.
Nutanix Troubleshooting Toolkit

Nutanix Troubleshooting Toolkit


• There are several different tools available that can be used to diagnose a Nutanix
cluster. As it is cumbersome to run each and every tool whenever an issue is
experienced in a Nutanix environment, it is important to be able to differentiate the
significance and usage of each – This will help determine when it is appropriate to
leverage one particular tool or another. In some instances, multiple tools or a
combination of tools will need to be used to completely diagnose the cluster and
resolve the issue. This module will walk you through all of the tools that are available
on the system, and under what circumstances they would be suitable to use.
Tools & Utilities

• NCC/Log Collector
• Cluster Interfaces/CVM Commands (nCLI)
• AHV Commands (aCLI)
• Linux File/Log Analysis Tools
• Hardware Tools
• Performance Tools
• Service Diagnostics Pages
• SRE Tools

Tools & Utilities


• NCC/Log Collector
NCC & Log Collector

NCC & Log Collector


NCC Overview

• Nutanix Cluster Check (NCC) is a framework of scripts that can help diagnose cluster health.
• NCC can be run regardless of cluster state.
• NCC checks are run against the cluster or the nodes, depending on the type of information being retrieved.
• NCC generates a log file on the CVM with the output of the diagnostic commands selected by the user.
• The ‘ncc’ script is located on the CVM at:
/home/nutanix/ncc/bin/ncc
7

NCC Overview
NCC Modules and Plugins

• NCC actions are grouped into Modules and Plugins:


o Modules are logical groups of plugins that can be run as a set.
o Plugins are objects that run the diagnostic commands.

• Each NCC plugin is a test that completes independently of


other plugins. Each test completes with one of the following
status types:
o PASS – The tested aspect of the cluster is healthy.
o FAIL – The tested aspect of the cluster is not healthy.
o WARN – The plugin returned an unexpected value.
o INFO – The plugin returned an expected value, but cannot be evaluated as PASS/FAIL.
8

NCC Modules and Plugins

NCC is a framework of scripts to run various diagnostics commands against the cluster.
These scripts are called Plugins and are grouped by Modules. The scripts can help
diagnose cluster health. The scripts run standard commands against the cluster or
nodes depending on what type of information is being retrieved.

A Module is a logical grouping of Plugins. Some modules have sub modules. Within a
module, administrators have a choice to run all plugins run_all=true or just a specific
plugin out of that module.

The scripts (plugins) run tests against the cluster to report on the overall health.
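As an illustration of the module/plugin split (a sketch that reuses the system_checks module and the cluster_version_check plugin mentioned later in this module; exact names and syntax vary by NCC version):
ncc health_checks system_checks run_all               # run every plugin in the system_checks module
ncc health_checks system_checks cluster_version_check # run a single plugin from that module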
NCC Version

To determine what version of NCC is running on a cluster from the CLI, use the following command:
ncc --ncc_version OR ncc --version

nutanix@NTNX-16SM6B490273-A-CVM:10.1.60.113:~$ ncc --version
3.0.3.2-16e3fe75

NCC Version
• Ensure that each node in the cluster is running the same version of NCC.
Furthermore, Prism Central and each cluster managed by Prism Central are all
required to run the same NCC version.
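One convenient way to confirm this consistency (a sketch that assumes the allssh helper available on the CVMs) is to query the NCC version on every CVM at once:
nutanix@cvm$ allssh ncc --version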
NCC Version (cont’d)

From Prism, the NCC version can be found from either the gear icon in the top right corner -> Upgrade Software -> NCC, or by clicking the user icon in the top right -> About Nutanix.

Verify from the Compatibility Matrix on the Nutanix Support Portal that the NCC version running on the cluster is compatible with the AOS version installed.
10

NCC Version (cont’d)


NCC Version (cont’d)

11

NCC Version (cont’d)


NCC Version (cont’d)

12

NCC Version (cont’d)


NCC Upgrade - Prism

• It is always recommended to run the latest version of NCC against a Nutanix cluster to ensure that the most recent checks are being run.
• NCC can be upgraded via 2 methods:
o Prism Element - Tarball
o Command Line - Shell Script
• If the automatic download option is enabled, the latest available version will be displayed for install. If there are no newer versions available, No available versions for upgrade will be displayed.
14

NCC Upgrade – Prism


NCC Upgrade - Prism

If the version of NCC that is to be applied is downloaded from the Support Portal, it can be uploaded to Prism from the Gear icon -> Upgrade Software -> NCC.
The .json metadata file and .tar.gz binary bundle are required.

15

NCC Upgrade – Prism


NCC Upgrade - Prism

Once the required files have been provided, an ‘Upgrade’ option will be displayed.

The system will confirm that NCC should be upgraded.

16

NCC Upgrade – Prism


NCC Upgrade - Prism

An associated task will be displayed in the Prism task monitor. Similarly, the progress of the NCC upgrade can be monitored from the Upgrade Software window. The upgrade is finished once all associated tasks reach 100% completion.

17

NCC Upgrade – Prism


NCC Upgrade – CLI

• A single installer file can be run from the CLI of one CVM, and NCC will be installed across all CVMs on the cluster.
• After downloading the file from the Support Portal, copy it to the CVM using SCP/SFTP. Ensure that the directory it is copied to exists on all CVMs.
• Ensure that the MD5SUM value of the installer file matches that which is posted on the Support Portal.

18

NCC Upgrade – CLI


• Enter this command at the prompt:
o $ md5sum nutanix-ncc-el6-release-ncc-3.0.4-stable-installer.sh
NCC Upgrade – CLI (cont’d)

Make the installation file executable and run it to install NCC.

19

NCC Upgrade – CLI (cont’d)
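A minimal sketch of that step, reusing the installer file name from the md5sum example above (your downloaded file name will differ):
nutanix@cvm$ chmod +x ./nutanix-ncc-el6-release-ncc-3.0.4-stable-installer.sh
nutanix@cvm$ ./nutanix-ncc-el6-release-ncc-3.0.4-stable-installer.sh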


Running All Health Checks

From CLI:
• To execute all NCC health checks on a system from the CVM CLI, the following command is issued on the CVM. On a 4-node cluster, this takes approximately 5 minutes:
ncc health_checks run_all
• The resulting output of the checks is approximately 70 KB in size; however, this will depend on the return status of checks and those that don’t PASS. The file will be saved to the following directory:
/home/nutanix/data/logs/ncc-output-latest.log
From Prism:
• To execute all NCC health checks on a system from Prism (AOS 5.0+ & NCC 3.0+), navigate to the Health page -> Actions -> Run Checks. Select All checks and click Run.

Running All Health Checks


Running All Health Checks – CLI Output

21

Running All Health Checks – CLI Output


Changing Default Log File Name - CLI

You can elect to change the name of the default output file with the --ncc_plugin_output_history_file flag:

22

Changing Default Log File Name – CLI
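For example (a sketch; the flag name is taken from the slide above and the output path is an arbitrary choice):
ncc health_checks run_all --ncc_plugin_output_history_file=/home/nutanix/data/logs/ncc-output-mycase.log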


Running All Health Checks - Prism

23

Running All Health Checks – Prism


Running Specific Health Checks - CLI

• To execute a single NCC health check on a system from the CVM CLI, specify the module and plugin name:
ncc health_checks system_checks gflags_diff_check

• Multiple Plugins (checks) from the same Module can be run separated by a single comma character:
ncc health_checks system_checks
--plugin_list=cluster_version_check,cvm_reboot_check

• To re-run only those checks that reported a FAIL status, use the --rerun_failing_plugins=True flag:
ncc --rerun_failing_plugins=True
24
Running Specific Health Checks - CLI (cont’d)

Use the --use_npyscreen flag in order to print out a progress bar of the current NCC modules and plugins being run on the CLI:
ncc health_checks run_all --use_npyscreen=true

25
Running Specific Health Checks - PRISM

• To run specific health checks from Prism, select the Specific Checks radio button and search for the check(s) you would like to run.

26
NCC File Utilities

NCC has a built-in file_utils module that does not run any checks, but can help manage files in the cluster:

27
NCC Log Collector

• AOS implements many logs and configuration information files that are
useful for troubleshooting issues and finding out details about a particular
node or cluster.

o A log bundle can be collected from the NCC Log Collector utility to
view these details.

• An NCC Log Collector bundle should be captured for the following:

o Every Performance Case

o Anytime a case is escalated to Engineering

o Any offline Analysis of a Support Case

• Note: The NCC Log Collector does not collect the NCC
health checks – these should be run separately.
28

NCC Log Collector


NCC Log Collector

• To use the NCC Log Collector, the following command is run from a CVM:
ncc log_collector [plugin_name]
• The output of the NCC Log Collector bundle will be stored in:
o /home/nutanix/data/log_collector
o Output of the last 2 collections will be saved
• Available plugin options are:
o run_all – Runs all plugins, collects everything from the last 4 hours by default. This is the recommended plugin to be run to begin troubleshooting cluster-related issues.
o cvm_config – Controller VM configuration
o sysstats – Controller VM system statistics
o cvm_logs – Controller VM logs
o cvm_kernel_logs – Controller VM kernel logs
o alerts – Cluster alerts
o hypervisor_config – Hypervisor configuration
o hypervisor_logs – Hypervisor logs

NCC Log Collector

What NCC Log Collector Gathers


The logs from the following components are collected and individual components can
be specified. By default, all levels of logs are collected (INFO, ERROR, WARNING,
FATAL)
- acropolis
- alerts(alert_manager/alert_tool)
- arithmos
- cassandra(cassandra_init/cassandra_monitor/cassandra/system/ cassandra)
- cerebro
- cluster_health
- connection_splicer
- core_data_path (includes, logs from cassandra, stargate, curator, cerebro, pithos,
zookeeper, chronos/chronos_node_main)
- curator/chronos/chronos_node_main
- dynamic_ring_changer
- genesis
- ha
- hades
- hyperint/hyperint_monitor
- install/finish/svm_boot
- pithos
- prism/prism_monitor/prism_gateway/catalina
- stargate
- stats_aggregator/stats_aggregator_monitor/stats_tool
- zookeeper/zookeeper_config_validator/zookeeper_monitor/
zookeeper_session_closer
- serviceability

The log collector will also collect the alerts and cluster configuration details including:
- cluster_config
- cvm_config
- cvm_kernel_logs
- cvm_logs
- hypervisor_config
- hypervisor_logs
- sysstats
- df
- disk_usage
- fio_stats
- fio_status
- interrupts
- iostat
- ipmi_event
- ipmi_sensor
- lsof
- meminfo
- metadata_disk_usage
- mpstat
- ntpq
- ping_gateway
- ping_hosts
- sar
- top

In most instances, it is ideal to capture ncc log_collector run_all, unless the culprit
component has been identified. The output of the NCC Log Collector bundle will be
stored in the /home/nutanix/data/log_collector directory. Only the output of the last 2
collections will be saved.
NCC Log Collector Flags – Time Period

• By default, Log Collector will collect all logs from the system for the last 4 hours. To specify a time range other than the default, there are a few flags that can be specified.

• ncc log_collector --last_no_of_hours=[1-23] [plugin_name]

• ncc log_collector --last_no_of_days=[1-30] [plugin_name]

• ncc log_collector --start_time=YYYY/MM/DD-HH:MM:SS --end_time=YYYY/MM/DD-HH:MM:SS [plugin_name]
30

NCC Log Collector Flags – Time Period


• NCC Log Collector will capture the specified log bundle for the last 4 hours by
default. In certain scenarios, it may be necessary to tweak this. For instance, if you
only need the last 1 hour of log files, or if you need a more comprehensive collection
of logs. There are flags available that can be specified to configure the last number
of hours (up to 23) or the last number of days (up to 30) to collect logs. Furthermore,
NCC Log Collector can be configured to collect logs with an exact start and end time.
Ensure that the NCC Health Checks are run and captured in conjunction with the
NCC Log Collector as these are not captured by running ncc log_collector run_all.

• The NCC Log Collector will only collect logs if there is more than 10% of available
free space. If you would like to elect to collect a log bundle and forego this
prerequisite, the --force flag can be used:
o ncc log_collector --force=1 run_all.
• The default value of the flag is 0. This should only be changed under the direction
and supervision of Nutanix Support.
NCC Log Collector Flags
• If sensitive cluster information, such as IP Addresses need to be anonymized,
the --anonymize_output=true flag can be specified as part of the log collector
command
ncc log_collector --last_no_of_hours=[1-23]
--anonymize_output=true [plugin_name]
• To specify a particular CVM or list of CVMs to collect logs from, use the --cvm_list flag:
ncc log_collector --cvm_list="10.4.45.54,10.4.45.55" [plugin_name]

31

Tools & Utilities

• NCC/Log Collector
• Cluster Interfaces/CVM Commands (nCLI)
• AHV Commands (aCLI)
• Linux File/Log Analysis Tools
• Hardware Tools
• Performance Tools
• Service Diagnostics Pages
• SRE Tools

32

Tools & Utilities


• Cluster Interfaces/CVM Commands (nCLI)
Prism

Prism
Prism

• Prism provides a management gateway for administrators to configure and monitor an individual Nutanix cluster. This includes the web console, nCLI and REST API.

• Prism runs on every node in the cluster, and like some other components it elects a leader. All requests are forwarded from followers to the leader. If the Prism leader fails, a new one is elected.

34
Prism

• Prism can be accessed using any CVM IP address

• Prism communicates with Zeus for cluster configuration data and Cassandra for statistics to present to the user

• Prism runs on port 9440

35
Prism Central

Prism Central
Prism Central

• Prism Central is a web console used to monitor and manage entities across multiple Nutanix clusters. This separate VM can be run inside or outside of the Nutanix cluster.

• The multi-cluster view runs as a separate, single-node cluster VM and provides features such as the following:
o Single sign-on for all registered clusters
o Summary dashboard across clusters
o Multi-cluster alerts summary and analytics capability
37

Prism Central
Prism Central

38

Prism Central
REST API

REST API
Nutanix REST API

• REST API stands for Representational State Transfer Application Program Interface.
• The Nutanix REST API allows administrators to create scripts to run against their Nutanix clusters.
• HTTP requests are used to obtain cluster information, as well as make configuration changes.
• Command output is returned in JSON format
• JavaScript Object Notation (JSON) is an open-standard file format
• Consists primarily of attribute:value pairs and array data

Nutanix REST API


• The Nutanix REST API allows administrators to create scripts that run system
administration commands against the Nutanix cluster. The API enables the use of HTTP requests to get information about the cluster as well as make changes to the configuration. Output from the commands is returned in JSON format. A complete
list of REST API functions and parameters is available in the REST API Explorer.
Nutanix REST API (cont’d)

• There are 3 versions of the Nutanix REST API
• v1
• v2
• v3 (AOS 5.1.1) *As the architecture of v3 is being re-done, it is not backwards compatible with v1 or v2.
• The Nutanix MGMT API is v0.8
• Code samples are available at:
http://developer.nutanix.com

Nutanix REST API (cont’d)


REST API Explorer

42

REST API Explorer


• Nutanix provides a utility with the Prism web console to help administrators get
started with the REST API. The REST API Explorer displays the parameters and
format for the API calls that can be included in scripts. Sample API calls can be
made to show the type of output you should expect to receive. The REST API
Explorer displays a list of the cluster objects that can be managed by the API.
REST API Explorer

43
REST API Explorer (cont’d)

• Each cluster object has the following options associated with it:
• Show/Hide: expand/collapse object details
• List Operations: show all operations
• Expand Operations: show a detailed view of all operations

• The objects are listed by a relative path that is appended to the base URL:
https://management_ip_addr:9440/PrismGateway/services/rest/v[1,2,3]/api

44
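As an illustration (a sketch, not an official example: it assumes the default admin account and the v1 flavor of the base URL shown above; exact endpoint paths differ between API versions and should be verified in the REST API Explorer):
curl -k -u admin:'<password>' https://management_ip_addr:9440/PrismGateway/services/rest/v1/cluster
The -k flag skips certificate validation on clusters still using the self-signed certificate, and the JSON response can be piped into a tool such as python -m json.tool for readability.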
Nutanix Command Line Interface (nCLI)

Nutanix Command Line Interface (nCLI)


nCLI

• nCLI allows system administration commands to be run against a Nutanix cluster

• nCLI has a few hidden command options that can be run by administrators with the guidance of Nutanix Support
ncli --hidden=true can be used to spawn an nCLI session with hidden commands available

• nCLI can be leveraged for scripting configuration, management and deployment tasks on a Nutanix cluster
46
nCLI – Local Machine

• nCLI can be run, and is preferred to be run, from your local machine
• This requires downloading nCLI from the Prism interface, adding the appropriate folder to the system’s PATH variable, and instantiating a connection to the cluster
ncli -s cvm_ip_address -u username -p password

47
nCLI – CVM

• nCLI can also be run from any CVM in the Nutanix cluster.

• After launching an SSH session into a CVM, complete nCLI commands can be run from the top level command line mode.

• Executing the ncli command will drop the user into the nCLI shell, making it easier to use tab completion if the specific command to be run is unknown.

• Audit logs of ncli commands are available in a hidden file named .nutanix_history in the /home/nutanix directory.

48
nCLI Help

• nCLI offers a help option on three levels to obtain assistance with entities and actions built into the tool

49
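A sketch of the three help levels, using the license entity that appears in the examples on the next slide (confirm the exact syntax in the nCLI reference on the Support Portal):
ncli> help                       # level 1: list all entities
ncli> license help               # level 2: list actions for one entity
ncli> license get-license help   # level 3: show parameters for one action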
nCLI Examples

ncli> cluster version

Cluster Version : euphrates-5.1.0.3-stable


Changeset ID : 5bdc88
Changeset Date : 2017-06-02 13:06:53 -0700

ncli Version : euphrates-5.1.0.1-stable


Changeset ID : 419aa3
Changeset Date : 2017-05-03 19:05:57 -0700

ncli> license get-license

Category : Starter
Sub Category :
Cluster Expiry Date : No Expiry
Use non-compliant feat... : true

50

nCLI Examples
Tools & Utilities

• NCC/Log Collector
• Cluster Interfaces/CVM Commands (nCLI)
• AHV Commands (aCLI)
• Linux File/Log Analysis Tools
• Hardware Tools
• Performance Tools
• Service Diagnostics Pages
• SRE Tools

51

Tools & Utilities


• AHV Commands (aCLI)
aCLI

• The aCLI (Acropolis CLI) provides a command line version of several of the AHV-related options in Prism (such as power on/off VMs, image creation, host details, and so on).

• aCLI can be leveraged for scripting configuration, management and deployment tasks on the AHV hypervisor

• In many instances, aCLI provides the ability to manipulate several entities simultaneously, in a single command

aCLI
• Acropolis provides a command line interface for managing hosts, networks,
snapshots, and VMs. The Acropolis CLI can be accessed from an SSH session into
any CVM in a Nutanix cluster using the acli command at the shell prompt.
Alternatively, complete aCLI commands can be run from outside of the aCLI prompt,
or the shell can be accessed from the 2030 Service Diagnostics Page of the
Acropolis service. aCLI command history can be viewed in the hidden /home/nutanix/.acli_history file on a CVM. Note that this file is only available on hosts that are running the AHV hypervisor, and is persistent across reboots of the CVM.
aCLI Examples

<acropolis> vm.list
Output:
VM name VM UUID
Server01 40a5cde6-2b20-4fa9-a8b4-1dd4cfc72226
<acropolis> vm.clone Server02 clone_from_vm=Server01

<acropolis> image.create testimage source_url=http://example.com/image_iso container=default image_type=kIsoImage
53

aCLI Examples
• aCLI commands can also be run, with the output specified to be in JSON format.
This is again useful in scripting the configuration and management of an AHV
environment.
o nutanix@cvm:~$ acli vm.list
o VM name VM UUID
o Server01 40a5cde6-2b20-4fa9-a8b4-1dd4cfc72226
o nutanix@cvm:~$ acli -o json vm.list
o {"status": 0, "data": [{"name": "Server01", "uuid": "40a5cde6-2b20-4fa9-a8b4-
1dd4cfc72226"}], "error": null}

• The complete aCLI command line reference manual is available on the Nutanix
Support Portal.
Tools & Utilities

• NCC/Log Collector
• Cluster Interfaces/CVM Commands (nCLI)
• AHV Commands (aCLI)
• Linux File/Log Analysis Tools
• Hardware Tools
• Performance Tools
• Service Diagnostics Pages
• SRE Tools

54

Tools & Utilities


• Linux File/Log Analysis Tools
cat

• The cat command is short for concatenate.


• It has 3 major functions: Reading files, Concatenating
files, and Creating files.
• It is primarily used to print the contents of a file to stdout (which can be redirected to a file if needed).

55
more/ less and pipe

• If you need to browse log files without having to edit them, the more and less commands are very useful.

• The more command will print out a file one page at a time. The spacebar can be used to go to the next page.

• The less command is similar to more but it allows you to navigate up/down through the file, and allows for string searches.
• The less +F is similar to tail -f. You can Ctrl-C out and will still be in less to search the file.

• The pipe (|) character is used to connect multiple programs together. The standard output of the first program is used as standard input for the second.

56
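For example, the NCC output file mentioned earlier in this module can be paged and searched without editing it (the path is the default NCC output location given earlier):
less /home/nutanix/data/logs/ncc-output-latest.log
cat /home/nutanix/data/logs/ncc-output-latest.log | grep FAIL | less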
head/tail

• The head command displays the first 10 lines of a file, unless otherwise specified.

• The tail command will display the last part of a file.
tail -n will display the last n lines of a file.
tail -[f|F] will display appended data as the file grows.

57

head/tail
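A short sketch (genesis.out is used here purely as an example CVM log path):
tail -n 50 /home/nutanix/data/logs/genesis.out   # print the last 50 lines
tail -F /home/nutanix/data/logs/genesis.out      # follow new entries as they are appended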
grep

• The grep command searches file(s) for lines containing a match to a given pattern. It is an extremely powerful tool that can assist with analyzing log files in a Nutanix environment

• There are 4 variants to grep available:
• egrep, fgrep, zgrep, and rgrep
58

grep
grep Examples

• To list out all files in a directory containing a particular string:
ls -l | grep <string>

• To search a file for a particular string:
cat <filename> | grep <string> OR
grep <string> <filename>

59

grep Examples
grep Examples (cont’d)

• To display the matched line and a specified number of lines after (-A) or before (-B):
ifconfig | grep -A 4 eth0
ifconfig | grep -B 2 UP

60

grep Examples (cont’d)


grep Examples (cont’d)

To count the number of matches of a specific string:
cat <file> | grep -c <string>

To recursively search for a string in the current directory and all subdirectories:
grep -r <string> *
61

grep Examples (cont’d)


Linux Text Editors

• The Linux OS offers a wide variety of text editors that can be used to edit files directly on a system running Linux, such as a CVM or AHV host.

• Two of the more popular of these editors are: vi and nano

• Both provide the capability to make edits to existing text files on the Nutanix system, for the purpose of updating configuration files, implementing service workarounds, or other file manipulation requirements.

62

Linux Text Editors


IPTables

• IPTables is an application that allows administrators to configure the rules and chains provided by the Linux kernel firewall.
• IPTables runs on the Nutanix CVM by default.
• In the event that certain ports need to be opened, or closed, on the CVM, rules can be added or deleted in order to allow or deny access to certain services.
• An example of a requirement to open up IPTables would be to access a Nutanix service diagnostics page externally.
63

IPTables
Tools & Utilities

• NCC/Log Collector
• Cluster Interfaces/CVM Commands (nCLI)
• AHV Commands (aCLI)
• Linux File/Log Analysis Tools
• Hardware Tools
• Performance Tools
• Service Diagnostics Pages
• SRE Tools

64

Tools & Utilities


• Hardware Tools
Hardware Tools

• Supermicro SUM tool


• ipmitool
• ipmiview
• ethtool
• smbiosDump
• smartctl
• hdparm
• lsiutil

65

Hardware Tools
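The vendor tools in this list are covered in the sections that follow; the generic Linux utilities can be run directly from the CVM, for example (a sketch, with placeholder device and interface names):
sudo smartctl -a /dev/sdX   # SMART health and error counters for a physical disk
ethtool eth0                # link status, speed, and duplex of a network interface
sudo hdparm -I /dev/sdX     # drive identification details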
SuperMicro SUM Tool

SuperMicro SUM Tool


SUM Tool

• The SuperMicro Update Manager (SUM) Tool can be used to manage BIOS and BMC firmware image and configuration updates
• System checks and event log management are also supported

• Runs on Windows, Linux, and FreeBSD OS

• Can be run on the CVM assuming there is connectivity to the IPMI IP
67

SUM Tool
• This tool is not to be distributed to any customers or 3rd parties.

SUM Tool Install


This tool is not to be distributed to any customers or third parties.

The SUM tool can be downloaded internally with the following command:
wget http://uranus.corp.nutanix.com/hardware/vendors/supermicro/tools/sum_1.5.0_Linux_x86_64_20150513.tar.gz

The following Evernote link contains further details and examples around using the SUM tool:
https://www.evernote.com/shard/s28/sh/b9e3e5cd-82cd-4d16-823f-2717d31d394d/efdb0a118ce0afead0df29314e97b6f3
IPMItool

IPMItool
IPMITool

• The Intelligent Platform Management Interface (IPMI) is a standardized computer system interface used for out-of-band management, monitoring, and configuration of systems.

• IPMI functionalities can be accessed via IPMItool, which is part of the OpenIPMI-tools Linux package

70

IPMITool
IPMITool

• Packaged with both the CVM and AHV host

• When using ipmitool on a CVM, you need to specify the host, username and password to establish a remote IPMI session. Other commands can be specified instead of the ‘shell’ option to run them directly:

71

IPMITool
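A sketch of a remote session initiated from a CVM (the IP address and credentials are placeholders):
ipmitool -I lanplus -H <ipmi_ip> -U <ipmi_user> -P <ipmi_password> shell            # interactive IPMI shell
ipmitool -I lanplus -H <ipmi_ip> -U <ipmi_user> -P <ipmi_password> chassis status   # power/chassis state
ipmitool -I lanplus -H <ipmi_ip> -U <ipmi_user> -P <ipmi_password> sel list         # system event log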
IPMITool (cont’d)

• Running the ipmitool from the AHV Host
• When using ipmitool on an AHV host, you do not need to specify the IP Address, Username or Password for access
• Allows IPMI user administration and resetting passwords
ipmitool user list - to list all users by ID
ipmitool user set password <user_id> <password> - to reset an IPMI user password
ipmitool mc reset [warm|cold] - to reset the Management Controller
• Requires login credentials to the AHV Host
SMCIPMITool

SMCIPMITool
SMCIPMITool

• The SMCIPMITool is an out-of-band CLI option for accessing a host’s IPMI interface, offering some of the following key features:
• IPMI Management & Firmware Upgrade
• FRU Management
• Virtual Media Management

• The SMCIPMITool can be accessed on the CVM at the following location:
/home/nutanix/foundation/lib/bin/smcipmitool/
73

SMCIPMITool
• SuperMicro reference:
o https://www.supermicro.com/solutions/SMS_IPMI.cfm
SMCIPMITool Examples

• Launch the SMCIPMITool using the following command:
SMCIPMITool <IPMI IP> <user ID> <password> shell

74

SMCIPMITool Examples
• This tool is not to be distributed to any customers or third parties.
SMCIPMITool Examples

• Verify current running BIOS version:


bios ver

• List configured IPMI users:


user list

75

SMCIPMITool Examples
• This tool is not to be distributed to any customers or third parties.
SMCIPMITool Examples (cont’d)

• List out all SEL entries on a particular host


sel list

76

SMCIPMITool Examples (cont’d)


• This tool is not to be distributed to any customers or third parties.
SuperMicro IPMIView

SuperMicro IPMIView
SuperMicro IPMIView

• IPMIView - A GUI-based application that allows administrators to manage multiple systems via the BMC.

• It is available on Windows and Linux platforms

• Allows for the ability to manage several different nodes via one interface

• Configuration details such as IP and login credentials can be saved to a file for repeated use
78

SuperMicro IPMIView
SuperMicro IPMIView – Login/Connected

79

SuperMicro IPMIView – Login/Connected


SuperMicro IPMIView – BMC Info

80

SuperMicro IPMIView – BMC Info


SuperMicro IPMIView – Device Operations

81

SuperMicro IPMIView – Device Operations


Tools & Utilities

• NCC/Log Collector
• Cluster Interfaces/CVM Commands (nCLI)
• AHV Commands (aCLI)
• Linux File/Log Analysis Tools
• Hardware Tools
• Performance Tools
• Service Diagnostics Pages
• SRE Tools

83

Tools & Utilities


• Performance Tools
collect_perf, Illuminati, & Weather Report

collect_perf, Illuminati, & Weather Report


collect_perf

• The collect_perf utility is included with the CVM from AOS 4.1.1. When a performance issue is being observed in the Nutanix environment, performance data can be collected using collect_perf if the issue cannot be resolved by live troubleshooting or if the issue is not immediately reproducible.

• It is critical that data is collected during a timeframe when the reported performance issue is being experienced – running it outside of this window will just collect normal performance data from the environment and will not be helpful for root cause analysis

• Avoid running the collect_perf utility for an extended amount of time (>3 hours) as it will take a LONG time to download the large bundle from the Nutanix cluster and require additional time for backend processing of the data.

• You should ALWAYS collect logs for any performance-related issue

collect_perf
• Performance issues are often complex in nature. Information gathering is a crucial
first step in troubleshooting a performance issue in order to gain further insight
around the specifics of the environment and to identify when/where the problem
exists. Some of the important factors to note include, but are not limited to:
• What are the performance expectations?
• What part of the environment is the issue isolated to (UVMs, vDisks, hosts,
CVMs)?
• Is the issue constant or intermittent?
• Is the issue reproducible?
• When did the issue start and what is the business impact?

• After gathering environment-related information, the next step would be to run the
collect_perf utility in order to collect performance-related data. The utility should be
run during a timeframe when the performance issue is being observed so the
appropriate dataset can be analyzed to root cause the issue. It is ideal if the
performance issue is reproducible so the collect_perf utility can be executed in
conjunction with a recreate. If the issue is not reproducible, the collect_perf utility
should be run during a timeframe that overlaps with the reported performance
problem. While the tool will run and collect performance data on a stable cluster,
there will not be much to be done by way of identification of an issue.
collect_perf

• To start a collect_perf collection, use the


collect_perf start command:

• To verify the performance collection is running :

86

collect_perf
• The collect_perf utility is generally safe to run on an active Nutanix cluster in
production. There are some additional flags that can be specified in conjunction with
the collect_perf command. You can control the amount of data collected using the –
space_limit parameter. This is set in bytes. There is also a 20% space limit – the
data collector will choose the lower of those two limits. Note that the command flag
must be specified before the start option:
nutanix@NTNX-A-CVM:~$ collect_perf --space_limit=4294967296 start
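As a sketch of a typical run (checking the process table with ps is simply one convenient way to confirm the collection is active; it is not a documented collect_perf option):
nutanix@NTNX-A-CVM:~$ collect_perf start
nutanix@NTNX-A-CVM:~$ ps -ef | grep [c]ollect_perf
nutanix@NTNX-A-CVM:~$ collect_perf stop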
collect_perf

• To stop a collect_perf collection, use the


collect_perf stop command:

• The resulting output file will be stored in /home/nutanix/data/performance as a .tgz file.
87

collect_perf
collect_perf

• crontab can be leveraged to help schedule a collect_perf run if the exact date and time period to gather performance data is known.

• It is important that you are comfortable with editing crontab and that none of the default entries are deleted. In particular, you will need to ensure that you:
1. Are comfortable with editing the crontab using crontab -e
2. Create a start AND a stop entry
3. Do not delete any of the existing default crontab entries.
88

collect_perf
• If there is a particular time when a performance issue impacts the Nutanix
environment, perhaps after hours or some other inconvenient time, it is possible to
schedule the collect_perf utility to run through a crontab entry. You will need to
ensure that you:
• Are comfortable with editing the crontab using crontab -e
• Create a start AND a stop entry
• Do not delete any of the existing default crontab entries.

• The following example would configure the collect_perf script to run between 1:50 am
and 3:30 am on October 22, 2017:

50 01 22 10 * /usr/local/nutanix/bin/collect_perf start
30 03 22 10 * /usr/local/nutanix/bin/collect_perf stop

• Try to avoid running the collect_perf utility for an extended amount of time (greater than 3 hours), as it will take a long time to download the large bundle from the Nutanix cluster and require additional time for backend processing of the data.
collect_perf & Illuminati

Extracting the contents of the collect_perf log file will not result in anything immediately useful; however, there is an internal Nutanix server called Illuminati which parses through the data and provides a simplified view of the performance observed on the cluster.

89

collect_perf & Illuminati


Illuminati Server

• Illuminati is accessible from:
http://illuminati.corp.nutanix.com

• After copying the collect_perf output to a local machine, upload it to Illuminati by selecting Choose File

Illuminati Server
Illuminati Server

Once the file has been successfully uploaded and processed, a link will be provided to access the analyzed dataset

91

Illuminati Server
Illuminati // Weather Report

• The Weather Report is built into the Illuminati server. It runs 4 checks against the collect_perf output:
• How many reads are coming from cold-tier
• Oplog usage
• Medusa latency
• CPU Status

92

Illuminati // Weather Report


Illuminati // Weather Report (cont’d)

Clicking on Full Weather Report will provide more detail on the checks and further insight into the cluster. All checks have independent watermarks that will flag the component for further investigation if exceeded.

93

Illuminati // Weather Report (cont’d)


Illuminati // Weather Report (cont’d)

94

Illuminati // Weather Report (cont’d)


Illuminati // Weather Report – CPU Status

• The CPU Status check is based on the grouped %Used and %Ready counters in esxtop for the CVM. The counters measure the percentage of time spent in that state respectively.

• %Used represents the combined CPU utilization

• %Ready represents the percentage of time spent waiting to get scheduled on a CPU

• Watermarks:
• Warn if the average CPU utilization is greater than 500%, but less than 700%.
• Alert if the average CPU utilization is greater than 700%.
• Warn if the average Ready-time is greater than 24%, but less than 40%
• Alert if the average Ready-time is greater than 40%
95

Illuminati // Weather Report – CPU Status


Illuminati // Weather Report – HDD Status (Cold-Tier Reads)

• When data is sourced from the HDD, known as the cold-tier, response times can be degraded. In an effort to identify this behavior, the HDD Status check provides the percentage of HDD reads. This percentage is based on the read_ops_per_sec counters in h/vars.

• If an increase in HDD reads correlates to a decrease in throughput or increase in response time, it may indicate that the working-set size exceeds the hot tier. The source of the workload needs to be identified.

• Watermarks:
• Warn if more than 25% of disk reads come from HDD
• Alert if more than 50% of disk reads come from HDD
96

Illuminati // Weather Report – HDD Status (Cold-Tier Reads)


Illuminati // Weather Report – Medusa Latency Status

• Medusa is the interface to the metadata layer of AOS and manages several in-memory caches to speed resolution of various objects. When a cache lookup misses, a Cassandra read must be performed, which can result in an increase to response time.

• Watermarks:
• The watermarks for this check are based on how often metadata lookups result in more than 20% of the total latency
• Warn if lookups exceeded the watermark in more than 50% and less than 80% of the samples
• Alert if lookups exceeded the watermark in more than 80% of the samples
97

Illuminati // Weather Report – Medusa Latency Status


Illuminati // Weather Report – Oplog Status

• The purpose of Oplog is to buffer random writes with SSD.
• Each per-vDisk Oplog can consume a certain amount of space up to a shared maximum.
• Once a vDisk’s oplog is full, I/O to that vDisk is stalled.
• This can have a direct impact on the perceived throughput and response time.
• To provide an indication of when this may be occurring, the Oplog Status check compares the per-vDisk Oplog usage to the allowed maximum setting and counts the number of times the Oplog was at 80% capacity.
• Read KB 1983
98

Illuminati // Weather Report – Oplog Status


Tools & Utilities

• NCC/Log Collector
• Cluster Interfaces/CVM Commands (nCLI)
• AHV Commands (aCLI)
• Linux File/Log Analysis Tools
• Hardware Tools
• Performance Tools
• Service Diagnostics Pages
• SRE Tools

99

Tools & Utilities


• Service Diagnostics Pages
Service Diagnostics Pages

• Service Diagnostics Pages exist for many of the major services that run on the CVM.

• They are useful in troubleshooting, once an issue has been narrowed down to a particular component.

• Each page will show related statistics, tasks, and operation data for the respective service it is for.

• It is recommended to have Nutanix Support engaged to assist in interpreting and analyzing any output/values that are displayed.

100

Service Diagnostics Pages


• Most of the major services that run on a Nutanix CVM have a service diagnostics
page that can be navigated to find out more details about the particular services and
the jobs/tasks that it is servicing. In order to access the diagnostics page of a
service, you need to first allow access to the port through the IPTables firewall.
While it is convenient to just disable the IPTables service altogether, it is much more
secure to add a specific rule to accept connections on a particular port number.
Alternatively, the links command can be used to load the diagnostics pages from an
SSH session into the CVM – this just requires a bit more interaction via keyboard
instead of a mouse and web browser. In order to allow access to a specific port
across all CVMs, the following IPTables command can be used:

for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport [port #] -j ACCEPT"; done

• This change can be made persistent using the iptables-save command; however, this
is not recommended to leave as a permanent configuration:

for i in `svmips`; do ssh $i "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2009 -j ACCEPT && sudo iptables-save"; done
• The -D flag can be used to delete these specific rules once access to the ports is no longer needed.

• The IPTables service can be stopped completely with the following command:

sudo service iptables stop

• Once troubleshooting has completed the service should be re-enabled using sudo
service iptables start.

• The links command can be used to launch a service diagnostics page without altering
the current IPTables configuration. links http://localhost:[port #] or links http://0:[port
#] will launch a text mode browser view for the specified port.
Service Diagnostics Pages – Web Browser

101

Service Diagnostics Pages – Web Browser


Service Diagnostics Pages – Text Mode

102

Service Diagnostics Pages – Text Mode


Nutanix Support Portal

Nutanix Support Portal


Nutanix Support Portal

The Nutanix Support Portal is your one stop shop for:
• Software Documentation (Installation & Configuration Guides/Release Notes)
• Hardware Replacement Documentation
• Knowledge Base articles
• Field Advisories
• Best Practice Guides/Tech Notes/White Papers/Reference Architecture Documentation
• Compatibility Matrix
• Software Downloads

104
Accessible from: http://portal.nutanix.com
Software Documentation

• Software Documentation is useful primarily for configuration and initial setup-related issues.

• Administration and Setup guides contain details around the operational theory behind specific features, as well as the correct configuration procedure.

• Reference Architectures/Best Practices Guides/Tech Notes provide insight into expected benchmark values and recommended configuration settings for various features and software integrations.

105

Software Documentation
Knowledge Base

The Knowledge Base is a collection of articles that can be utilized to solve Nutanix-related and integrated application issues.
Articles contain details around configuration tips and procedures, NCC check information, troubleshooting steps for specific issues/scenarios, and more.
If you find that a particular Knowledge Base article is useful or needs clarification or correction, there is a section titled “Submit Feedback on this Article”.

106

Knowledge Base
• The Nutanix Knowledge Base is a collection of articles published and maintained by
Nutanix employees. There are a wide variety of articles that cover everything from
troubleshooting to initial configuration to product- and feature-related details. Specific
KBs can be referenced with the shorthand notation of:
http://portal.nutanix.com/kb/<KB #>. If you believe a particular KB article requires
improvements or further clarify, there is a section titled “Submit feedback on this
article” which will notify the appropriate Nutanix engineers to make the changes that
are specified.
Nutanix NEXT Community

• The Nutanix NEXT Community provides users of Nutanix a forum


to post and answer questions and solutions.
• Nutanix Employees regularly monitor the NEXT community and
often provide guidance on questions.
• “Kudos” are given out to responses that are helpful in solving an
issue, or of high quality content

10 7

Nutanix NEXT Community


• The Nutanix NEXT Community provides a forum where users of Nutanix systems can post questions and solutions. Nutanix Employees are members of the NEXT
community and will provide guidance on questions that are posted by external
individuals. There are a wide variety of top-level topics for questions to be
categorized into. “Kudos” are given out to responses that are helpful in solving an
issue, or of high quality content.
Remote Support

Remote Support
Remote Support

• The Remote Support Tunnel allows Nutanix personnel to perform support tasks remotely on behalf of the customer

• SSH tunnels are established from the external cluster to backend servers internal to Nutanix
• This feature is disabled by default, but can be enabled from Prism or the command line only by the customer.
• Can be enabled for a time window between 1 minute and 24 hours

119

Remote Support
• Remote Support is a feature that allows Nutanix engineers to access a Nutanix
cluster remotely. Customers have the option to enable the Remote Support feature
from their end for a pre-determined time in order to allow Nutanix engineers to
provide remote troubleshooting or proactive support monitoring. Once Remote
Support is enabled at the customer site, one of the CVMs establishes an SSH tunnel
connection to an internal Nutanix server, allowing engineers to access the system.
Remote Support - Enable/Configuration

120

Remote Support - Enable/Configuration


• The Remote Support feature can be enabled from the CLI with the ncli cluster start-
remote-support command. Similarly, to disable it, use the ncli cluster stop-remote-
support command. The feature can be enabled for a specified period of time using
the duration option:

ncli cluster start-remote-support duration=<minutes>

• The status of the Remote Tunnel can be verified from Prism or from the command
line.

• The status of the Remote Tunnel can also be checked via ncli with the following command:

ncli cluster get-remote-support-status

• In order to ensure a successful Remote Support connection, the CVMs must be able
to resolve DNS. Name servers can be configured on the CVMs from within Prism, or
with the ncli cluster add-to-name-servers servers=[comma separated list of name
servers]. If a firewall device exists in the customer’s infrastructure, port 8443 will need
to be opened.
Pulse/Pulse HD

Pulse/Pulse HD
Pulse

Pulse provides diagnostic system data to Nutanix support for the delivery of pro-active Nutanix solutions.
• The Nutanix cluster automatically collects this information with no effect on system performance.
• It is tuned to collect important system-level data and statistics in order to automatically detect issues and help make troubleshooting easier.
• Allows Nutanix to proactively reach out to customers for version specific advisories and alerts.
• Cases from systems with Pulse/Pulse HD enabled are resolved 30% faster.

Pulse collects the following system information:

• System alerts
• Hardware and Software information
• Nutanix processes and CVM information
• Hypervisor details

Pulse collects machine data only and does NOT collect any private customer information.

Pulse
• When Pulse is enabled, Pulse sends a message once every hour to a Nutanix
support server by default. Pulse also collects the most important data like system-
level statistics and configuration information more frequently to automatically detect
issues and help make troubleshooting easier. With this information, Nutanix support
can apply advanced analytics to optimize your implementation and to address
potential problems. Pulse sends messages through ports 80/8443/443, or if this is
not allowed, through your mail server. When logging in to Prism/Prism Central for the
first time after installation or an upgrade, the system checks whether Pulse is
enabled. If it is not, a message appears recommending that you enable Pulse. To
enable Pulse, click Continue in the message and follow the prompts; to continue
without enabling Pulse, check the Disable Pulse (not recommended) box and then
click Continue.
Pulse HD

• Pulse HD was introduced in AOS 4.1 and contains data collection enhancements over Pulse (more time-series granularity).
• Pulse will be slowly phased out in favor of Pulse HD.
• Like Pulse, Pulse HD collects machine data only and does NOT collect any private customer information but provides the same benefits:
• The Nutanix cluster automatically collects this information with no effect on system performance.
• It is tuned to collect important system-level data and statistics in order to automatically detect issues and help make troubleshooting easier.
• Allows Nutanix to proactively reach out to customers for version-specific advisories and alerts.
• Cases from systems with Pulse/Pulse HD enabled are resolved 30% faster.
123

Pulse HD
• Nutanix Pulse HD provides diagnostic system data to Nutanix support teams to
deliver pro-active, context-aware support for Nutanix solutions. The Nutanix cluster
automatically and unobtrusively collects this information with no effect on system
performance. Pulse HD shares only basic system-level information necessary for
monitoring the health and status of a Nutanix cluster. This allows Nutanix support to
understand the customer environment better and is an effective troubleshooting tool
that drives down the time to resolution. Several different tools are available internally
for Nutanix Support to more easily parse and utilize the data collected via Pulse HD.
The following information is collected from the cluster when Pulse HD is enabled.
• Cluster Info
o - Cluster name
o - Uptime
o - NOS version
o - Cluster ID
o - Block serial number
o - HW model
o - Cluster IOPS
o - Cluster latency
o - Cluster memory
• Node / Hardware
o - Model number
o - Serial number
o - CPU - number of cores
o - Memory (size)
o - Hypervisor type
o - Hypervisor version
o - Disk model
o - Disk status
o - Node temperature
o - Network Interface Model
o - SATADOM firmware
o - PSU status
o - Node location
• Storage pool list
o - Name
o - Capacity (logical used capacity and total capacity)
o - IOPS and latency
• Container Info
o - Container Name
o - Capacity (logical used and total)
o - IOPS and latency
o - Replication factor
o - Compression ratio
o - Deduplication ratio
o - Inline or post-process compression
o - Inline deduplication
o - Post-process deduplication
o - Space available
o - Space used
o - Erasure coding and savings
• Controller VM
o - Details of logs, attributes, and configurations of services on each Controller VM
o - Controller VM memory
o - vCPU usage
o - Uptime
o - Network statistics
o - IP addresses
• VM
o - Name
o - VM state
o - vCPU
o - Memory
o - Disk – space available
o - Disk – space used
o - Number of vDisks
o - Name of the container that contains the VM
o - VM operating system
o - IOPS
o - Latency
o - VM protected?
o - Management VM?
o - IO Pattern - Read/ Read Write/Random/sequential
• Disk Status
o - Perf stats and usage
• Hypervisor
o - Hypervisor software and version
o - Uptime
o - Installed VMs
o - Memory usage
o - Attached datastore
• Datastore Information
o - Usage
o - Capacity
o - Name
• Protection Domain (DR)
o - Name
o - Count and names of VMs in each protection domain
• Currently Set Gflags
• BIOS / BMC info
o - Firmware versions
o - Revision / release date
• Disk List
o - Serial Numbers
o - Product part numbers
o - manufacturer
o - Firmware versions
o - Slot Location
o - Disk type
• Domain Fault Tolerance States
• Default Gateway
• SSH key List
• SMTP Configuration
• NTP Configuration
• Alerts
• Click Stream Data
• File Server Data
Pulse HD – Enable from Prism

124

Pulse HD – Enable from Prism


Pulse HD – Enable/Verify from CLI

125

Pulse HD – Enable/Verify from CLI
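The slide screenshot is not reproduced here. As a sketch, Pulse can typically be inspected and toggled from nCLI using the pulse-config entity; the exact subcommand and parameter names are an assumption and vary between AOS versions, so confirm them with ncli help before relying on them:

$ ncli pulse-config get-config            # show whether Pulse/Pulse HD is currently enabled (assumed entity name)
$ ncli pulse-config update enable=true    # enable Pulse (assumed parameter name)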


Pulse HD – Support Portal

• Customers can verify the status of Pulse in their installed base from the Nutanix Support Portal -> My Products -> Installed Base. This will confirm if an asset is sending Pulse messages correctly.

126

Pulse HD – Support Portal


Troubleshooting Pulse HD

• NCC provides a check to troubleshoot cluster and network settings to allow Pulse messages to be sent:

ncc health_checks system_checks auto_support_check

• This check will determine if Pulse is enabled or disabled, and if the appropriate ports are reachable through any firewalls.

127

Troubleshooting Pulse HD
• Look at KB 1585.
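For example, run the check from any CVM; like other NCC checks it reports a status such as PASS, FAIL, or INFO together with the reason (exact output wording varies by NCC version):

$ ncc health_checks system_checks auto_support_check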
Tools & Utilities
• NCC/Log Collector
• Cluster Interfaces/CVM Commands (nCLI)
• AHV Commands (aCLI)
• Linux File/Log Analysis Tools
• Hardware Tools
• Performance Tools
• Service Diagnostics Pages
• SRE Tools

128

Tools & Utilities


• Internal Nutanix Support Tools
SRE Tools

SRE Tools
SRE Tools

• The SRE Tools utility provides for a simple and validated process to install and remove approved support RPMs (telnet and tcpdump) on CVMs and FSVMs.

• The /root/sretools directory contains the following files:
• sreinstall.sh
• sreuninstall.sh
• tcpdump-4.0.0-3.20090921gitdf3cb4.2.el6.x86_64.rpm
• telnet-0.17-48.el6.x86_64.rpm
158

SRE Tools
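A minimal sketch of using the scripts listed on the slide, assuming they are run with root privileges from the /root/sretools directory (the exact invocation may differ by SRE Tools version):

$ sudo bash /root/sretools/sreinstall.sh     # install the approved telnet and tcpdump RPMs
$ sudo bash /root/sretools/sreuninstall.sh   # remove them again when troubleshooting is finished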
SRE Tools Install

159

SRE Tools Install


SRE Tools Uninstall

161

SRE Tools Uninstall


Labs

Module 2
Tools and Utilities

163

Labs
• Module 2 Tools and Utilities.
Thank You!

164

Thank You!
Module 3
Services and Logs
Nutanix Troubleshooting 5.x

Module 3 Services and Logs


Course Agenda

1. Intro 7. AFS
2. Tools & Utilities 8. ABS
3. Services & Logs 9. DR
4. Foundation 10. AOS Upgrade
5. Hardware 11. Performance
6. Networking

Course Agenda
• This is the Services and Logs module.
Objectives

After completing this module, you will be able to:

• Describe the major components and what they do.
• Troubleshoot the components.
• Find and interpret logs of interest.

Objectives
After completing this module, you will be able to:
• Describe the major components and what they do.
• Troubleshoot the components.
• Find and interpret logs of interest.
Components

Components.
Component Relationships

Component Relationships.
• All of the AOS components (services/processes) have dependencies on others.
Some are required, some are optional, depending on features being made available.
• Internal link:
o https://confluence.eng.nutanix.com:8443/display/STK/AOS+Services
• Finding leaders
o https://confluence.eng.nutanix.com:8443/download/attachments/13631678/How%2
0to%20determine%20the%20leaders%20in%20a%20cluster.docx?version=2&mod
ificationDate=1498665487454&api=v2
Component Relationships

Internal link:
https://confluence.eng.nutanix.com:8443/display/STK/AOS+Services
Finding Leaders:
https://confluence.eng.nutanix.com:8443/download/attachments/13631678
/How%20to%20determine%20the%20leaders%20in%20a%20cluster.docx
?version=2&modificationDate=1498665487454&api=v2

Component Relationships.
• All of the AOS components (services/processes) have dependencies on others.
Some are required, some are optional, depending on features being made available.
• Internal link:
o https://confluence.eng.nutanix.com:8443/display/STK/AOS+Services
• Finding leaders
o https://confluence.eng.nutanix.com:8443/download/attachments/13631678/How%2
0to%20determine%20the%20leaders%20in%20a%20cluster.docx?version=2&mod
ificationDate=1498665487454&api=v2
Cluster Components

Genesis, Prism, Zeus, Medusa, Pithos, Stargate, Curator, Acropolis, Uhura, Ergon, Chronos, Cerebro, Insights, Lazan, Minerva, and others.
7

Cluster Components
• There are 30+ components in 5.x.
Some Useful Cluster Commands

$ cluster status
• Returns status of AOS services from all nodes in the cluster.
$ cluster stop
• Stops most of the cluster components on all nodes in the cluster. Makes storage unavailable to the UVMs. Will not execute if UVMs are running.
$ cluster start
• Signals Genesis on all nodes to start any cluster processes not running.
8

Some Useful Cluster Commands


Additional notes:
• Cluster Status - Also shows if a CVM is offline as well as PIDs for each component
on each node.
• Cluster Stop - Will not execute if UVMs are running.
• Cluster Start - This is non-intrusive and will not stop already started services.
• Cluster Destroy – This will destroy the Zeus config and all data in the cluster will be
unrecoverable. Additionally, all data in the cluster will be lost. The CVMs will continue
to exist but they will not be participating in a cluster any longer. The Foundation
service will start broadcasting to announce they are available to join a cluster.
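
As a practical shortcut on clusters with many nodes, the status output can be filtered from any CVM; this is simple grep filtering of the normal output format, so adjust the pattern if the output changes between AOS versions:

$ cluster status | grep -i down      # show only services/CVMs reported as down
$ cluster status | grep -v UP        # alternative: hide everything that is healthy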
Genesis

Genesis
Genesis

Cluster Component Manager.


• Runs on each node.
• Requires Zookeeper to be up and running.
o Independent Process - Does not require the cluster to be configured/running.
• Responsible for initial cluster configuration, gathering node info, starting/stopping other services.
10

Genesis
• Genesis is responsible for the following among other things.
• Assigning the 192.168.5.2 address to the eth1 interface of the Controller VM.
• Ensuring that the node is up to date (release version maintained by Zookeeper).
• Preparing and mounting the physical disks.
• Starting the other services.
Some Useful Genesis Commands

$ genesis status
• Returns status and PIDs for all cluster processes on the node
$ genesis stop <service>
• Stops the specified service
$ cluster start
• Signals Genesis on all nodes to start any cluster processes not running
$ genesis restart
• Restarts the local genesis process
$ cluster restart_genesis
• Restarts genesis on all nodes in the cluster
11

Some Useful Genesis Commands.


• watch –d cluster status to see what process ids are changing – Do as demo…
• Open 2 ssh sessions to a CVM.
• Do watch -d genesis status… let it run for a bit.
• On another ssh session stop a service (genesis stop ergon).
• Go back to the watch session. Notice the task PIDs disappear and it’s highlighted.
• Go back to the other session and enter the command cluster start (genesis start
service doesn’t start the service).
• Go back to the watch session and see the highlighted PIDs.
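
The demo above condenses to the following sequence, run in two SSH sessions on the same CVM (ergon is just an example service):

# Session 1: keep this running and watch the PIDs change
$ watch -d genesis status
# Session 2: stop one service, then ask Genesis on all nodes to start anything not running
$ genesis stop ergon
$ cluster start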
Cluster Status Error Message

If you execute the command cluster status and see the above output, this
means that genesis isn’t running on the local node.
• This may be because the node or CVM was recently rebooted or the genesis service isn’t fully operational.
• If a reboot hasn’t been performed, check the genesis.out log to see why the service isn’t started. Also check for genesis FATAL files to see if the service is crashing.
12

Cluster Status Error Message


Genesis Requirements

Genesis has the following requirements to start:

• All Hypervisors: The Controller VM can reach 192.168.5.1 from 192.168.5.254.
• ESXi/AHV: The Controller VM has password-less ssh access to 192.168.5.1 (the internal interface). If you encounter problems, run the fix_host_ssh script.
• Hyper-V: The Nutanix Host Agent Service is running on the hypervisor, and the authorized certs file has the needed certificates. Run the winsh command to verify that the Nutanix Host Agent Service is running, and then run a PowerShell command to verify that the certificates are correct.
• The command to start is issued by the nutanix user.

NOTE: Genesis must be started using the nutanix user context. If it is started with any other context (such as root) the cluster will be unstable.

13

Genesis Requirements
Genesis Logs

~/data/logs/genesis.out

Common Issues
• Unable to communicate with hypervisor
• Cluster IPs changed without following proper procedure
• Zookeeper crashing or unresponsive

14

Genesis Logs
• /home/nutanix/data/logs is the location of most of the log files.
• Genesis.out is a symlink to the most recent genesis log file and is the one most often
needed. However, there are times when previous versions of the genesis logs can be
useful.
• Some of the details in the genesis logs are:
o Services start and stop.
o Connection problems between nodes.
o Some interactions with the hypervisor.
Genesis.out Contents

https://portal.nutanix.com/#/page/docs/details?t
argetId=Advanced-Admin-AOS-v50:tro-genesis-log-
entries-c.html

15

Genesis.out Contents
• Have students go to
https://portal.nutanix.com/#/page/docs/details?targetId=Advanced-Admin-AOS-
v50:tro-genesis-log-entries-c.html
• …and read the file. It’s about 5 minutes of reading. Text of the page is on the next
couple slides which are hidden. This is the 5.0 version of the file, as of this writing,
there is also a 5.1 version of the file which appears to be very similar if not identical.
These pages are pulled from the Acropolis Advanced Administration Guide.
• As of 5.0 there are 34 services.
Genesis.out contents (cont.)

When checking the status of the cluster services, if any of the services are down, or the Controller VM is reporting Down with no process listing, review the log at /home/nutanix/data/logs/genesis.out to determine why the service did not start, or why Genesis is not properly running.
Check the contents of genesis.out if a Controller VM reports multiple services as DOWN, or if the entire Controller VM status is DOWN.

Under normal conditions, the genesis.out file logs the following messages periodically:
• Unpublishing service Nutanix Controller
• Publishing service Nutanix Controller
• Zookeeper is running as [leader|follower]

Prior to these occasional messages, you should see Starting [n]th service. This is an indicator that all services were successfully started.
16

Ignore INFO messages in genesis.out


Genesis.out contents (cont.)

Possible Errors
• 2017-03-23 19:28:00 WARNING command.py:264 Timeout executing scp -q -o
CheckHostIp=no -o ConnectTimeout=15 -o StrictHostKeyChecking=no -o
TCPKeepAlive=yes -o UserKnownHostsFile=/dev/null -o
PreferredAuthentications=keyboard-interactive,password -o
BindAddress=192.168.5.254 'root@[192.168.5.1]:/etc/resolv.conf'
/tmp/resolv.conf.esx: 30 secs elapsed

• 2017-03-23 19:28:00 ERROR node_dns_ntp_config.py:287 Unable to download ESX


DNS configuration file, ret -1, stdout , stderr

• 2017-03-23 19:28:00 WARNING node_manager.py:2038 Could not load the local ESX
configuration

• 2017-03-23 19:28:00 ERROR node_dns_ntp_config.py:492 Unable to download the


ESX NTP configuration file, ret -1, stdout , stderr

Any of the above messages mean that Genesis was unable to log on to the host using the
17
configured password.
Genesis.out contents (cont.)

Diagnosing a Genesis Failure

Determine the cause of a Genesis failure based on the information available in the log files.

1. Examine the contents of the genesis.out file and locate the stack trace ( indicated by the CRITICAL message type) .
2. Analyze the ERRO R messages immediately preceding the stack trace.
 ...
2017-03-23 19:30:00 INFO node_manager.py:4170 No cached Zeus configuration found.
2017-03-23 19:30:00 INFO hyperv.py:142 Using RemoteShell ...
2017-03-23 19:30:00 INFO hyperv.py:282 Updating NutanixUtils path
2017-03-23 19:30:00 ERROR hyperv.py:290 Failed to update the NutanixUtils path: [Errno 104] Connection reset by peer
2017-03-23 19:30:00 CRITICAL node_manager.py:3559 File "/home/nutanix/cluster/bin/genesis", line 207, in <module>
main(args)
File "/home/nutanix/cluster/bin/genesis", line 149, in main
Genesis().run()
File "/home/nutanix/jita/main/28102/builds/build-danube-4.1.3-stable-release/python-tree/bdist.linux-
x86_64/egg/util/misc/decorators.py", line 40, in wrapper
File "/home/nutanix/jita/main/28102/builds/build-danube-4.1.3-stable-release/python-tree/bdist.linux-
x86_64/egg/cluster/genesis/server.py", line 132, in run
File "/home/nutanix/jita/main/28102/builds/build-danube-4.1.3-stable-release/python-tree/bdist.linux-
x86_64/egg/cluster/genesis/node_manager.py", line 502, in initialize
File "/home/nutanix/jita/main/28102/builds/build-danube-4.1.3-stable-release/python-tree/bdist.linux-
x86_64/egg/cluster/genesis/node_manager.py", line 3559, in discover
...

In the example above, the certificates in AuthorizedCerts.txt were not updated, which means that you failed to connect
18
Stargate

Stargate
Stargate

Runs on all nodes.

Handles all storage I/O to the Nutanix storage containers.
Implements storage protocol interfaces for data access.
• NFS, iSCSI, SMB.
Relocates data between nodes/drives as instructed by Curator.
Replicates data to remote sites.
20

Stargate
• https://confluence.eng.nutanix.com:8443/display/STK/The+IO+Path%3A+A+Stargate
+Story
o This is another source for stargate architecture.
• At the end of this section (Stargate) there are 2 hidden slides that show another
diagram of the read and write orchestration within Stargate.
Stargate Components

[Diagram: Stargate components. The NFS, SMB, and iSCSI adapters feed the Admission Controller, which hands I/O to per-vDisk vDisk Controllers; the vDisk Controllers use the OpLog Store and Unified Cache (memory/SSD) and the Extent Store (SSD/HDD).]
21

Stargate Components.
Adapters.
• iSCSI / NFS / SMB.

Admission Controller.
• Separate queues for user IO and background tasks.
• Responsible for “hosting” individual vdisks.

vDisk Controller.
• One vDisk controller per vDisk.
• Maintains metadata and data caches.
• Looks up and updates metadata.
• Directs I/O requests based on direction (read/write), size, and so on.
• Ops to fix extent groups, migrate extents and extent groups, drain oplog, copy block
map, and so forth.

Oplog.
• Maintains in-memory index of oplog store data.
• Draining optimizes metadata lookup, updates & extent store writes.
• Write log on SSD for faster response to writes.
• Persistent write cache of dirty data that has not yet been drained to Extent Store.
• Pipelines write requests to remote Oplog Store.

Content / Unified Cache.


• Read cache which spans both the CVM’s memory and SSD.

Extent Store.
• Handles extent group read and write requests from vDisk Controller.
• Maintains each extent group as a file on disk.
• Manages all disks in the Controller VM.
• Pipelines write requests to remote Extent Store.
Oplog

22

Oplog
Purpose of the Oplog
• Low write latency.
• Absorb burst random write.
o Write hot data may get overwritten soon, thus prevent cost to write to extent store.
• Coalesce contiguous IO reducing ops to extent store.

• Upon a write, the OpLog is synchronously replicated to another n number of CVM’s


OpLog before the write is acknowledged for data availability purposes.
• All CVM OpLogs partake in the replication and are dynamically chosen based upon
load.
• For sequential workloads, the OpLog is bypassed and the writes go directly to the
extent store.
Unified/Content Cache

23

Unified/Content Cache.
• Upon a read request of data not in the Content Cache (or based upon a particular
fingerprint), the data will be placed into the single-touch pool of the Content Cache
which completely sits in memory (Extent Cache).
o LRU (Least Recently Used) is used until it is evicted from the cache.

• Any subsequent read request will “move” (no data is actually moved, just cache
metadata) the data into the memory portion of the multi-touch pool, which consists of
both memory and SSD. From here there are two LRU cycles:
o One for the in-memory piece upon which eviction will move the data to the SSD
section of the multi-touch pool where a new LRU counter is assigned.
o Any read request for data in the multi-touch pool will cause the data to go to the
peak of the multi-touch pool where it will be given a new LRU counter.
Random Write Sequence

1. Protocol WRITE request
2. Incoming traffic control
3. Check in-memory Oplog index
4. Write to Oplog Store and replicas
5. Receive Oplog Store replica ACK
6. Protocol Write ACK

[Diagram: Random write path through the protocol adapters, Admission Controller, vDisk Controller, OpLog Store (with replication to remote OpLog Stores), and Cassandra.]
24

Random Write Sequence


Notes:

• Between #1 and #2, STARGATE communicates with MEDUSA/CASSANDRA to


translate the WRITE op from inode/filehande:offset:length to
vDiskID:ExtendID:ExtentOffset:length.
• In #2, we only allow a maximum of 128 in-flight (outstanding) ops.
o The rest are put in priority queues of up to 400 entries.
o Beyond this, further incoming requests are dropped and we either reply with an
EJUKEBOK (for NFS) or just let the request time out.
• In both cases, this will trigger a retransmission after some back out period.
• In #4, the OpLog keeps a small index in memory for fast lookups of the contents in
the OpLog Store.
• In #4, the OpLog Store also acts as a WAL (write-ahead log) where in-memory
transactions are backed-up in case replay is necessary.
• In #4, the number of replicas would depend on the Replication Factor (RF). An RF of
two (2) means that data would be written to two (OpLog Store) replicas.
• In #4, all OpLog Store writes are considered replicas. At least one of these replicas
should be local (relative to the vDisk Controller) unless the local OpLog Store doesn’t
have enough free space.
• In #4, no Cassandra/Medusa lookup is necessary – the vDisk controller is already
aware of the location of the OpLog replicas.
• In #5, the vDisk controller waits for RF number of ACKs that the data was
successfully written to RF number of replicas. Otherwise, we cannot ACK the write
properly.
• In #6, the vDisk sends the WRITE ACK directly to the protocol adapter; Admission
Controller isn’t involved in the ACK path.

Questions:
• What are the criteria to write data from the Memory Buffers into the OpLog Store? Is
it outstanding writes to a contiguous region (1.5MiB, according to documentation)
within a specific time or is it from the size of the protocol payload?
Oplog Drain Sequence

1. Check in-memory OpLog inode for outstanding dirty writes
2. Fetch dirty writes into de-staging area and generate async WriteOp
3. Query Cassandra for extent groups
4. Cassandra extent group metadata response
5. Check each extent group for sufficient replicas
6. Commit async WriteOp to ExtentGroup replicas
7. Replicate extents to ExtentGroup replicas and release Oplog space

[Diagram: Oplog drain path from the OpLog Store through the vDisk Controller to the Extent Store, with metadata lookups in Cassandra.]
25

Oplog Drain Sequence


Sequential Write Sequence

1. Protocol WRITE request
2. Incoming traffic control
3. Check memory buffers and generate synchronous WriteOp
4. Query Cassandra for Extent Groups
5. Cassandra Extent Group metadata response
6. Commit async WriteOp to Extent Group replicas
7. Extent Group replication completed ACK
8. Protocol WRITE ACK

26

Sequential Write Sequence


Notes:

• Between #1 and #2, STARGATE communicates with MEDUSA/CASSANDRA to


translate the WRITE op from inode/filehande:offset:length to
vDiskID:ExtendID:ExtentOffset:length.
• In #2, we only allow a maximum of 128 in-flight (outstanding) ops. The rest are put in
priority queues of up to 400 entries. Beyond this, further incoming requests are
dropped and we either reply with an EJUKEBOK (for NFS) or just let the request time
out. In both cases, this will trigger a retransmission after some back out period.
• In #3, we check if there are 1.5MB of outstanding data in a contiguous region (that is,
causing new extents to be created).
• In #6, the vDisk sends the WRITE ACK directly to the protocol adapter; Admission
Controller isn’t involved in the ACK path.

Questions:
• What are the criteria to write data from the Memory Buffers into the OpLog Store? Is
it outstanding writes to a contiguous region (1.5MiB, according to documentation)
within a specific time or is it from the size of the protocol payload?
Cached Read Sequence

1. Protocol READ request
2. Incoming traffic control
3. Check in-memory Oplog index
4. Check if requested data is in the Unified Cache
5. Protocol READ ACK

[Diagram: Cached read path through the protocol adapters, Admission Controller, vDisk Controller, and Unified Cache.]

27

Cached Read Sequence


Oplog Read Sequence

1. Protocol READ request
2. Incoming traffic control
3. Check in-memory Oplog index
4. Fetch data from Oplog Store
5. Protocol READ ACK

[Diagram: Oplog read path through the protocol adapters, Admission Controller, vDisk Controller, and OpLog Store.]

28

Oplog Read Sequence


Extent Read Sequence

1. Protocol READ request
2. Incoming traffic control
3. Check in-memory Oplog index
4. Check if requested data is in the Unified Cache
5. Query Cassandra for Extent Groups
6. Cassandra Extent Group metadata response
7. Retrieve data from Extent Store
8. Store retrieved data in Unified Cache
9. Protocol READ ACK

[Diagram: Extent read path through the protocol adapters, Admission Controller, vDisk Controller, Extent Store, and Unified Cache, with metadata lookups in Cassandra.]
29

Extent Read Sequence


Stargate Diagnostics

~/data/logs/stargate.out
~/data/logs/stargate.{INFO,WARNING,ERROR,FATAL}
~/service iptables stop

RPC – TCP Port 2009

• Accessible on every CVM
• $ links http:0:2009
• $ allssh "sudo iptables -t filter -A WORLDLIST
-p tcp -m tcp --dport 2009 -j ACCEPT"
30

Stargate Diagnostics
• https://portal.nutanix.com/#/page/docs/details?targetId=Advanced-Admin-AOS-
v51:tro-stargate-log-entries-c.html

• Discuss “salt stack blocking the port”.


2009 page

31

2009 page.
2009 page (cont’d)

32

2009 page (cont’d).


SVM / Disk Information.
• Each SVM ID / IP.
• Disk IDs in each tier.
• Space available for Stargate (GB available / GB total).
• If a cell in this table is red, it means that the disk is currently offline. The web console
and nCLI will also show an alert for the disk.

Container / Storage Pool Information


• ID / Name.
• Max Capacity.
• Reservations / Thick Provisioned space.
• Usage (including garbage).
• Available Space.
Hosted Active vDisks – 2009 page

33

Hosted Active vDisks – 2009 page.


Hosted vDisks.
• vDisks active on this node.
• vDisks will migrate to another node if Stargate crashes.
• Measured at the vDisk Controller.

vDisk Data.
• Name.
• Usage / Deduped (calculated during Curator scans – may be stale).
• Perf stats for both oplog and overall vDisk.
• AVG Latency – latency between Oplog and Extent Store - does not measure what’s
experienced by the guest VM.
2009 page – Extent Store

34

2009 page – Extent Store.


Single table split to fit slide.

Raw Disk Statistics.


• How much I/O to both SSD and HDD.
• Latency to physical disks.
Other Stargate Pages of Interest

• /h/gflags: Configuration settings


• /h/traces: Activity traces of ops in components
• /h/threads: Current stack traces of all threads
• /h/vars: Low-level exported statistics, e.g. lock contention

35

Other Stargate Pages of Interest.
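These pages hang off the same 2009 port, so they can be opened with the same links command used earlier; for example (paths taken from the list on the slide):

$ links http:0:2009/h/gflags     # view current Stargate gflag settings
$ links http:0:2009/h/traces     # view activity traces of recent ops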


Anatomy of a Read

36
Anatomy of a Write

37
Curator

Curator.
Curator

Background Process.
• Runs on all the nodes of the cluster.
• Master/slave architecture.

Manages and Distributes Jobs Across the Cluster.
• Self heal the cluster.
o Node/disk failure or removal.
o Fix under-replicated egroups (distributed fsck).
• Improve overall state of the cluster.
o ILM, Disk Balancing, Garbage cleanup, defragmentation.
• Expose interesting stats.
o Snapshot usage, vDisk usage, space savings.

39

Curator.
Curator – Map/ Reduce

Map Tasks and Reduce Tasks.


• Basic task unit which does useful work.
• Run independently on all nodes.
• Curator master decides which tasks run on which nodes.
• Each task may emit key/val pairs for downstream tasks.
• Generate Foreground/Background tasks which change the state of the cluster.

40

Curator – Map/Reduce.
• Mapping and reducing are distinct tasks which are carried out on individual nodes.
o The Curator master will dictate which curator slaves will perform which tasks.
• Mapping is the master determining from the slaves what each node has and the
reducing tasks are manipulating that data.
o Data can be deleted, or more copies made of it, depending on the results of the
mapping tasks.
Curator Scans

Periodically Scan Cluster Metadata.
• Partial.
o 1 hour after previous scan.
o Scan subset of metadata.
• Full.
o 6 hours after previous scan.
o Scan all metadata.
• Selective (introduced in 5.0).
o Scans only relevant data, relative to the curator scan type needed.
o Finishes much faster.
• Dynamic.
o Disk / Node / Block failure.
o ILM Imbalance.
o Disk / Tier Imbalance.
41

Curator Scans
Curator Scans

42

Curator Scans.
Curator ILM

ILM (Information Lifecycle Management).
• Identify cold data (read cold + write cold).
• Down-migrate cold data.
o From fast storage (SSD) to slow storage (HDD, cloud).
• Prevent SSDs from getting full.
43

Curator ILM.
• ILM – Information Lifecycle Manager.
o Used to down migrate data from hot tier to cold tier.
o Up migrations are done by Stargate.
Curator – Other Tasks It Does

Scan metadata for other operations.

• Oplog Deletion.
• Space Saving.
o Compression, Dedup, Erasure Coding.
o Defragmentation.
• NFS Metadata Consistency Checks.
o Requires scanning nfs metadata maps.
• Data Movement.
o Disk Balancing, Block Awareness.
• Act on Logical to Remove Entities.
o vDisk, Container, Storage Pool.
• Copy vblock Metadata.
o Snapshots, clones.
44

Curator – Other Tasks It Does.


Curator – 2010 page

45

Curator – 2010 page.


Curator 2010 page – Map and Reduce Tasks Tables

46

Curator 2010 page – Map and Reduce Tasks Tables.


Curator Diagnostics

~/data/logs/curator.out
~/data/logs/curator.{INFO,WARNING,ERROR,FATAL}
~/service iptables stop

RPC – TCP Port 2010.

• Accessible on every CVM.
• Accessib le on every CVM.
• $ links http:0:2010
• $ allssh "sudo iptables -t filter -A WORLDLIST
-p tcp -m tcp --dport 2010 -j ACCEPT"
47

Curator Diagnostics.
• o Talk about “salt stack blocking the port”
Medusa / Cassandra

Medusa / Cassandra
Medusa / Cassandra

Medusa – Interface between cluster services and Cassandra.


Cassandra – Backend database.

Interface to access cluster metadata stored in Cassandra.


Cassandra is reserved 15 GiB per SSD (up to 4 SSDs).
• Commonly referred to as the metadata disks.
• Older versions had 30 GiB mirrored on only the first 1-2 SSDs.
Quorum-based consensus to choose a value.
• 2/3 nodes must agree for RF2.
• 3/5 nodes must agree for RF3.

49

Medusa / Cassandra.
Cassandra Ring

50

Cassandra Ring.
• Basic ring, 4 nodes in one cluster.
Cassandra Ring Updat ing

51

Cassandra Ring Updating.


• Expanding the cluster by 4 more nodes.
Monitoring Cassandra

Monitor Cassandra Health.

• Track fault tolerance status.
• Checks memory and thread progress status.
• Initiate Cassandra self healing.
Maintains and Updates Configuration and State.
• Cassandra state changes.
• Zeus-based health and configuration monitoring.
• Leadership status for tokens.
• Each node is the leader for its primary token range.
52

Monitoring Cassandra.
Medusa/Cassandra Diagnostics

~/data/logs/cassandra_monitor.{INFO,WARNING,ERROR,FATAL}
~/data/logs/cassandra.out
~/data/logs/cassandra/system.log
~/data/logs/dynamic_ring_changer.out

Print Cassandra Ring Information:


$ nodetool -h 0 ring

53

Medusa/Cassandra Diagnostics.
• https://portal.nutanix.com/#/page/docs/details?targetId=Advanced-Admin-AOS-
v51:tro-cassandra-log-entries-c.html

Node States:
• Normal – All good.
• Forwarding – Data is being moved to other nodes – usually due to node removal.
• Detached – Node has been detached due to unresponsive Cassandra process.

• ncc health_checks cassandra_checks ring_balance_check


Extents and Extent Groups

Extents are logically contiguous data.


Extent groups are physically contiguous stored
data.

54

Extents and Extent Groups.


vdisk_usage_printer -vdisk_id=<VDISK ID>
• The vdisk_usage_printer is used to get detailed information for a vDisk, its extents
and egroups.

• The graphic is taken from the nutanixbible.

Extent
• Key Role: Logically contiguous data.
• Description: An extent is a 1MB piece of logically contiguous data which consists of
n number of contiguous blocks (varies depending on guest OS block size). Extents
are written/read/modified on a sub-extent basis (aka slice) for granularity and
efficiency. An extent’s slice may be trimmed when moving into the cache depending
on the amount of data being read/cached.
Extent Group
• Key Role: Physically contiguous stored data
• Description: An extent group is a 1MB or 4MB piece of physically contiguous stored
data. This data is stored as a file on the storage device owned by the CVM. Extents
are dynamically distributed among extent groups to provide data striping across
nodes/disks to improve performance.
• NOTE: As of 4.0, extent groups can now be either 1MB or 4MB depending on
dedupe.

• There are several commands in the curator_cli that can be used to see vDisk info
such as vdisk chain info or vdisk usage.
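
For example, using the vdisk_usage_printer command shown above (the vDisk ID here is purely illustrative; look up real IDs on the 2009 page or via curator_cli):

$ vdisk_usage_printer -vdisk_id=12345     # prints the extents and egroups backing this vDisk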

Cerebro

Cerebro.
Cerebro

Replication Management.
Master/Slave Architecture.
• Master.
o Task delegation to Cerebro Slaves.
o Coordinating with remote Cerebro Master when remote replication is occurring.
o Determines which data needs to be replicated.
– Delegates replication tasks to the Cerebro Slaves.
• Slaves.
o Tell local Stargate which data to replicate and to where.

56

Cerebro.
Cerebro Diagnostics

~/data/logs/cerebro.out
~/data/logs/cerebro.{INFO,WARNING,ERROR,FATAL}
~/data/logs/cerebro_cli.{INFO,WARNING,ERROR,FATAL}

RPC – TCP Port 2020.


Linked from curator master.
• $ links http:0:2020
• $ allssh "sudo iptables -t filter -A WORLDLIST -p
tcp -m tcp --dport 2020 -j ACCEPT"

57

Cerebro Diagnostics.
• The 2020 page on the Cerebro master shows protection domains (with number of
snapshots, Tx and Rx KB/s, and so forth), remote bandwidth, ping remote latency,
ergon tasks, slave data and Medusa latency.
Acropolis

Acropolis.
Acropolis

…is not AHV.

59

Acropolis.
Acropolis – What It Does

Management.
• Eventually replaces hyperint process.

Virtualization Functionality.
• VM Provisioning.
• Network Management.
• High Availability.
• Scheduler.
• DR.

60

Acropolis – What It Does.


Acropolis Components

[Diagram: Each CVM runs Stats Collection/Publisher and a VNC Proxy; the CVM acting as Acropolis Master additionally runs the Task Executors, Scheduler/HA, and Net Controller.]

61

Acropolis Components.
Acropolis is a distributed service.
• All of the above components run in a single Python thread (with coroutines).
• Acropolis master is responsible for all task execution and resource management.
• We will distribute master responsibilities if we run into scalability limits.
• Each Acropolis instance collects and publishes stats for the local hypervisor.
• Each Acropolis instance hosts a VNC proxy service for VM console access.

Nutanixbible has a good section about the dynamic scheduler, what it does and
how it does it. It also has a good section on placement decisions.
Acropolis Interface

All VM Interaction in Prism
Acropolis CLI on CVMs
• https://portal.nutanix.com/#/page/docs/details?targetId=AMF_Guide-AOS_v4_7:man_acli_c.html
• $ acli
62

Acropolis Interface.
Acropolis Diagnostics

~/data/logs/acropolis.out
~/data/logs/curator.{INFO,WARNING,ERROR,FATAL}
~/data/logs/health_server.log (scheduler log)

RPC – TCP Port 2030

• Accessible on every CVM
• $ links http:0:2030
• $ allssh "sudo iptables -t filter -A WORLDLIST
-p tcp -m tcp --dport 2030 -j ACCEPT"
63

Acropolis Diagnostics.
• Health_server.log – Look for scheduler.py issues.
Acropolis 2030 pages on Acropolis Master
:2030 – Host / Task / Network Information
:2030/sched – Host status / VMs on each host
:2030/tasks – Acropolis tasks
:2030/vms – VM Name / UUID / VNC / Status
:2030/shell – Interactive acli shell

64

Acropolis 2030 pages on Acropolis Master.


Log File Locations

In case you didn’t notice, logs are stored in /home/nutanix/data/logs (Cassandra logs are in the Cassandra directory).
Logs of interest besides the ones shown already would be:
• sysstats (in the /sysstats subdirectory), which includes the following logs…
o netstat (Displays network connections).
o ping (ping_hosts, ping_gateway, and so on).
o sar (Network bandwidth, and so forth).
o iotop (Current IO in real time).
o df (Mounted filesystems).
o zeus_config (Zeus configuration).
65

Log File Locations.


• https://portal.nutanix.com/#/page/docs/details?targetId=Advanced-Admin-AOS-
v50:tro-cvm-logs-r.html
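A hedged example of browsing these logs from a CVM; the exact file names inside sysstats, such as ping_hosts.INFO, vary slightly between AOS versions:

$ ls ~/data/logs/sysstats/                       # list the collected system statistics logs
$ less ~/data/logs/sysstats/ping_hosts.INFO      # example: review historical ping results between hosts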
Labs

Labs.
Labs

Module 3 Services and Logs

67

Labs.
Module 3 Services and Logs.
Thank You!

68

Thank You!
Module 4 Foundation
Nutanix Troubleshooting 5.x

Module 4 Foundation
Course Agenda

1. Intro 7. AFS
2. Tools & Utilities 8. ABS
3. Services & Logs 9. DR
4. Foundation 10. Upgrade
5. Hardware 11. Performance
6. Networking

Course Agenda
• This is the Foundation module.
Objectives

After completing this module, you will be able to:

• Explain what Foundation is and how it is used
• Define the difference between Standalone and CVM-based Foundation
• Describe how to Troubleshoot Foundation
• Move Nodes to other Blocks
• Find and Analyze Logs of Interest

Objectives
After completing this module, you will be able to:
• Explain what Foundation is and how it is used
• Define the difference between Standalone and CVM-based Foundation
• Describe how to Troubleshoot Foundation
• Move Nodes to other Blocks
• Find and Analyze Logs of Interest
This module is designed to provide an overview of Foundation with a focus on
how it can be used in Troubleshooting.
• It also provides information about the different versions of Foundation, the standalone
virtual machine and CVM-based.
o We will discuss when to use both Foundation types and the dependencies for each
type.
• We will cover general troubleshooting techniques for Foundation including
networking, virtual networking, and IP configuration.
• Finally, we will review the Foundation logs and cover how they can be used for
overall troubleshooting.
Foundation

Foundation
What is Foundation?

Nutanix Tool to Image Nodes with Proper AOS Version and Hypervisor
• Nodes and Blocks Ship with Most Current Versions of AOS and AHV
• Foundation Installs Hypervisor and CVM on Nodes
o AHV – Nutanix
o ESXi – VMware
o Hyper-V – Microsoft
o XenServer – Citrix

What is Foundation?
• Foundation is a Nutanix tool to image (Bare Metal) or re-image nodes in the field
with the proper AOS version and Hypervisor.
o When a customer purchases a block of nodes, from the factory the nodes are
imaged with the latest version of the Acropolis Operating System (AOS) and the
latest Acropolis Hypervisor (AHV) version.
• If customers want to use a different hypervisor other than the one that the cluster
ships with (ESXi/HyperV), the nodes need to be imaged and configured using
Foundation.
o Nutanix supports various hypervisors including:
o AHV
o ESXi
o HyperV, and
o Xenserver.
Foundation in the Field

Foundation can also Create the Cluster

• Standalone VM or
• CVM-Based

Foundation in the Field


• Foundation can also create the cluster after imaging or on existing nodes already
imaged.
o During initial installation and configuration a Nutanix Sales Engineer will normally
perform this task in the field at the customer site where the Nutanix cluster will be
installed.
There are 2 types of Foundation:
• Standalone Foundation is a Virtual Machine that can run on any laptop or
computer.
• Standalone Foundation is the tool used to do the setup.
• CVM-based Foundation is on the Controller VM.
Standalone Foundation

Standalone Foundation
Standalone Foundation

Blocks and Nodes:

• Imaged in the field with Standalone Foundation VM
• Latest version is 3.11.2
o Download from Portal Foundation_VM-3.xx.x.tar file
• Tar Ball file includes
o Foundation_VM-version#.ovf
o Foundation_VM-version#-disk1.vmdk

Standalone Foundation
• Standalone Foundation is available for download on the Support Portal.
o The latest version is 3.7.2.
• Download the Standalone Foundation from the Portal Foundation_VM-
3.7.2.tar file
o The standalone Foundation VM is for Nutanix Internal use only, not available for
customers.
• Once downloaded from the Support Portal the tar file needs to be unpackaged.
o After unpackaging the tar file, the virtual machine can be deployed using the ovf
file or creating a new VM and pointing to the vmdk virtual disk.
• Oracle Box or VM Workstation virtualization applications (required) can be installed
on the computer/laptop to run the Foundation virtual machine.
o The virtual machine is Linux-based and needs network access.
• The VM virtual networking in most instances will be bridged from the VM to the
computer/laptop physical NIC (Network Interface Card).
o This will give the Foundation virtual machine the ability to communicate with the
Nutanix nodes on the local network required for imaging.
o By default, the eth0 interface on the Foundation virtual machine is set to DHCP.
Through the console you can log in to the virtual machine with these credentials:
• Username: nutanix
• Password: nutanix/4u
There is also a script on the desktop of the Foundation virtual machine to configure the
IP address settings.
• If you make any changes and save this will cause the network service to be restarted
on the Foundation virtual machine.

[Slides 9-12: Diagrams of the standalone setup - a laptop running the Foundation VM connected through a switch to the Nutanix cluster.]
Standalone Foundation

• Oracle VirtualBox / VM Workstation
• Foundation URL: http://ip_address_Foundation_vm:8000/gui
• AOS Images Location:
o /home/nutanix/foundation/nos
• Hypervisor Images Location:
o /home/nutanix/foundation/isos/hypervisor/esx, hyperv, or kvm
• Copy Images to CVM using SCP or Foundation UI

14

Standalone Foundation
• Once the Foundation virtual machine is configured with the correct IP address you
can then access the Foundation Application at the following URL:
o http://ip_address_Foundation_vm:8000/gui
• The Foundation web User Interface is wizard-based and will walk the customer
through several screens/tabs to perform the node imaging.
o The various tabs will provide block discovery automatically for bare metal (No
hypervisor or CVM running on node) using the MAC addresses from the IPMI
interface, CVM and Hypervisor IP address configurations, subnet mask, DNS,
NTP, and default gateway settings.
• Foundation also provides a tab during the wizard-based deployment to upload the
AOS and hypervisor binaries.
o Scp or winscp can also be used to copy the binaries up to the Foundation Virtual
Machine.
• Copy nutanix_installer_package-version#.tar.gz
o …to the /home/nutanix/foundation/nos folder.
o It is not necessary to download an AHV ISO because the AOS bundle includes an
AHV installation bundle.
• However, you have the option to download an AHV upgrade installation
bundle if you want to install a non-default version of AHV.
• If you intend to install ESXi or Hyper-V, you must provide a supported ESXi or Hyper-
V ISO (Windows 2012R2 Server) image.
o Therefore, download the hypervisor ISO image into the appropriate folder for that
hypervisor.
• ESXi ISO image: /home/nutanix/foundation/isos/hypervisor/esx
• Hyper-V ISO image: /home/nutanix/foundation/isos/hypervisor/hyperv
• For third party hypervisors, reference the download page on the Nutanix Support
Portal for approved releases and the appropriate JSON file.
• Third-party hypervisor software can be obtained from that specific vendor.
o Example ESXi 6.5: Download from the VMware site.
o Example Hyper-V: Download from the Microsoft site.
Standalone Foundation
• Once the Foundation virtual machine is up and running the network interface eth0
has to be configured with the correct IP address Subnet Mask Default Gateway and
DNS Servers.
o By default, eth0 is configured for DHCP, recommended to use a static IP Address.
• After configuring all the IP settings test connectivity with PING to make sure you can
reach the gateway name servers and most of all the IPMI IP addresses configured on
the Nutanix nodes.
o Once the IP connectivity is established the nodes can be successfully imaged from
the Foundation virtual machine.
Standalone Foundation

Bare Metal
• No hypervisor or CVM exists on the node
• No Automatic Discovery via IPv6 Link-Local Address FE80…
Requires IPMI MAC address
How to Obtain IPMI MAC Address –
• Configure IP Address
• Keyboard and Monitor
• Web Browser to IPMI IP Address
• CVM or Hypervisor
• Physical Device
16

Standalone Foundation
• Standalone bare metal imaging is performed using the IPMI Interface.
o The Foundation virtual machine needs access to each nodes IPMI interface to
successfully image.
o Since there is not an existing CVM on the node, automatic discovery via IPv6 link-
local address will not be available.
• The only way to discover the nodes is to obtain the MAC address to the IPMI
interface and manually add the block and nodes by MAC address.
o Foundation can also configure the IPMI IP address, but in most instances you will
console into each node and configure the IPMI with a static IP address ahead of
time or day of installation.
• Then test IP connectivity from the Foundation virtual machine using ping.
o If the Foundation virtual machine does not successfully ping the node’s IPMI
address, go back and review IP settings on the nodes (IPMI interface) and the eth0
interface on the Foundation virtual machine.
• To make things easier, you can cable the nodes IPMI or Shared LOM port and the
Laptop/Computer to a single flat switch and then assign a network ID and IP
addresses to the Laptop/Computer and Nutanix nodes.
o This way there are no VLAN configurations or complications.
o This setup provides basic network connectivity for imaging the nodes.
• Once the nodes are successfully imaged and IP addresses configured for each
node’s CVM and Hypervisor, they can then be moved into the production
network and subnet.
o At this point, proper VLANs must be assigned if applicable to the CVM and
Hypervisor external interfaces.
o To access the console of the node point a web browser to the IP address set for
the IPMI interface, or…
o You can also go into the datacenter and hook up a keyboard and Monitor
physically to the nodes.
Standalone Foundation

IPMI IP Address Default Characteristics

• DHCP
• IPMI Default Port Mode is Failover
o Can use the 10G port for IPMI (platform dependent)
o Share or Failover
Configuring IP Address for Foundation Virtual Machine
• Multihomed Option
17

Standalone Foundation
• Once on the console of the IPMI interface, under Configuration there is Network.
o The network configuration tab is where you can:
• Get the IPMI mac address and
• Obtain or modify existing IP address, gateway, and DNS settings.
• For the LAN interface settings there are three choices:
o Dedicate – Do not use!
o Share
o Failover
• The LAN interface setting should always be set to Failover (Default).
o IPMI port speeds vary depending on platform.
• Some platforms have 100Mb, others the IPMI will be 1Gb.
o All the ports are RJ-45 interfaces.
• The IPMI can failover or share access with one of the LOM (LAN on Motherboard)
ports.
o This allows the use of one port for both IPMI and Data.
• The shared failover LOM Ethernet port can be either 1Gb or 10Gb depending on the
platform.
Standalone Foundation

IPMI Web Interface

MAC Address for Bare Metal
IPv4/IPv6 Settings
VLAN Settings
LAN Interface
18

Standalone Foundation
• Notice in the IPMI web interface on the network settings page is where to configure
the network.
o The MAC address of the IPMI interface required by standalone Foundation for
bare metal imaging and cluster setup.
o The VLAN setting for the IPMI by default is set to 0.
• VLAN 0 means that the interface will be on the whatever the default VLAN that
is set on the network switch port.
• Let’s deal with access mode switch ports first.
o For example, let’s say we have a Cisco Nexus 5K switch.
• When the switch is initialized and the ports are brought online (no shutdown)
they will be configured by default in access mode and the default VLAN will be
VLAN 1.
o If the IPMI port is configured for VLAN 0 then it will use the VLAN that is assigned
to the switch port in access mode.
• When the switch port is in access mode then all changes to the VLAN would be
configured on the switchport and do not change the VLAN ID from 0 in the IPMI
settings.
• Now let’s look at Trunk ports.
o Switch ports can also be configured in trunk mode.
• Why do we want to set a switch port to trunk mode (switchport mode trunk)?
- Changing a switchport to trunk mode allows you to do something on the host
network interfaces called VLAN tagging.
• VLAN tagging allows for creating multiple logical networks from a single
interface on a physical network adapter or a virtual network adapter.
• We can also apply a VLAN tag to the IPMI network interface.
- Now this is where the discussion can get a little confusing.
• If the IPMI port is cabled to a switchport that is configured in trunk mode, it
works slightly differently than if the switchport is in access mode.
• If the IPMI port is cabled to a switchport that is in trunk mode and the IPMI port is
configured for VLAN ID 0 then the switch will determine what VLANs the port will be
on with a setting called (switchport trunk native VLANs 1).
o The default is VLAN ID 1 on Cisco switches.
o Think of VLAN 0 as non-tagged as opposed to tagged.
• If the IPMI port is cabled to a switchport that is in trunk mode and the IPMI port is
configured for any other VLAN ID then 0 then that will be the VLAN ID for the IPMI
port.
o Also, on the LAN interface setting, the default is Failover and there is no need to
change this setting.
Standalone Foundation

Configure IPMI to use Static IP: ipmitool lan set 1 ipsrc static
Configure IPMI IP Address: ipmitool lan set 1 ipaddr [ip address]
Configure IPMI Network Mask: ipmitool lan set 1 netmask [netmask]
Configure IPMI Default Gateway: ipmitool lan set 1 defgw ipaddr [gateway IP]
Configure IPMI VLAN Tag: ipmitool lan set 1 vlan id [####]
Remove IPMI VLAN Tag: ipmitool lan set 1 vlan id off
Show Current IPMI Configuration: ipmitool lan print 1
19

Standalone Foundation
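Putting the commands from the slide together, a typical static IPMI configuration might look like the following when run from a host where ipmitool is available (for example the AHV host); the IP values are purely illustrative:

$ ipmitool lan set 1 ipsrc static
$ ipmitool lan set 1 ipaddr 10.1.1.20
$ ipmitool lan set 1 netmask 255.255.255.0
$ ipmitool lan set 1 defgw ipaddr 10.1.1.1
$ ipmitool lan print 1        # confirm the settings took effect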
Standalone Foundation

Figure: Back panel of NX-1065-G4/G5

20

Standalone Foundation
• This is the backplane of the NX-1065-G4/G5 platform.
o The IPMI port is 1Gb with an RJ-45 connector.
• The LOM Shared IPMI is 1Gb with an RJ-45 connector.
Standalone Foundation

NX-3175-G4

Dedicated IPMI Port


21

Standalone Foundation
• This is the backplane of the NX-3175-G4 platform.
o The IPMI port is 1Gb with an RJ 45 connector.
• The LOM Shared IPMI is 10Gb with an SFP+ connector.
CVM- Based Foundation

CVM-Based Foundation
CVM-based Foundation

• Runs on CVMs to Provide Node Imaging
• Requires a Hypervisor and running CVM
• Nodes Cannot be Part of an Existing Cluster
• Requires Foundation Service
o Verify Service is Running with genesis status command
• To Access: http://CVM_IP_Address:8000
• CVM-based Foundation Requires IPv6 discovery
o Link-Local FE80…
o mDNS

23

CVM-based Foundation
• All of the Binary files and the Foundation Service is now available on the CVMs.
o For the most part, the Foundation that runs directly from the CVMs is the same as
the standalone virtual machine with a few differences.
CVM versus Standalone Foundation

Similarities:
• The Access URL is the same http://CVM_IP or VM_IP:8000.
• The paths for the images and ISOs for AOS and Hypervisors are the same.
• The log path locations are the same.
• Most of the pages in the wizards between the two are the same.

Differences:
• Must have a Hypervisor installed on the node and a healthy CVM virtual machine up and running on the node.
• The node must not be part of an existing cluster.
• Requires IPv6 for node discovery.
• Serially images nodes.
• Have to create a cluster. Standalone can image only and cluster creation is not required.
24

CVM versus Standalone Foundation


Let’s look at some of the similarities and then get to the main differences between
the two versions.
• The Access URL is the same http://CVM_IP or VM_IP:8000.
• The paths for the images and ISOs for AOS and Hypervisors are the same.
• The log path locations are the same.
• Most of the pages in the wizards between the two are the same.
So what are the differences?
• Must have a Hypervisor installed on the node and a healthy CVM virtual machine
up and running on the node.
• The node must not be part of an existing cluster.
o The foundation service will start only if the node is not part of an existing cluster!
• Requires IPv6 for node discovery.
• Serially images nodes.
• Have to create a cluster.
o Standalone can image only and cluster creation is not required.
• Cluster creation can be done at a later time.
• Verify that the Foundation service is started on the CVM with Genesis Status.
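For example, from the CVM you plan to run Foundation on (simple filtering of the genesis status output shown on the next slide):

$ genesis status | grep -i foundation     # should show the foundation service with PIDs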
CVM-based Foundation

Genesis status showing the Foundation service running but the node not part of a cluster:

nutanix@NTNX-16SM13150152-A-CVM:10.30.15.47:~/data/logs$ genesis status
2017-05-04 15:50:44.060695: Services running on this node:
abac: []
acropolis: []
alert_manager: []
aplos: []
aplos_engine: []
arithmos: []
cassandra: []
cerebro: []
chronos: []
cim_service: []
cluster_health: []
curator: []
dynamic_ring_changer: []
ergon: []
foundation: [27512, 27550, 27551]
25
genesis: [26213, 26229, 26253, 26254, 27495]

CVM-based Foundation
• The Foundation service must be running.
o The node cannot be part of an existing cluster.
• If the node is part of an existing cluster then all the other services will be running and
the Foundation service will not.
CVM Foundation Web Interface
10.30.15.47:8000/gui/index.html

Point Your Web Browser to any CVM IP Address, Port 8000

26

CVM Foundation Web Interface


Access to a CVM-based Foundation is the same as access to a standalone Foundation.
• http://CVM_IP_Address:8000
Also in the Foundation wizard there is a page for uploading your AOS and Hypervisor
versions.
• You can also scp or winscp to copy images to the CVM where you will be running
Foundation.
Foundation Tar Ball and ISO Location

o Support Portal Hypervisor Details Page Lists Tested and Qualified Versions
• Third Party Hypervisors require .JSON file During Upload

Location to Upload AOS Software through GUI or SCP: /home/nutanix/foundation/nos/nutanix_installer_package-release-danube-4.7.5.2-stable.tar.gz

AHV Hypervisor Tarball: /home/nutanix/foundation/isos/hypervisor/kvm/host-bundle-el6.nutanix.20160217.2.tar.gz (or host-bundle-el6.nutanix.20160217.tar.gz)
27

Foundation Tar Ball and ISO Location


• The AOS Tarball is bundled with a compatible AHV version.
o If using any other hypervisor except AHV, the ISO must be obtained from that
Vendor; for example, Microsoft or VMware.
• Also visit the Nutanix Support Portal and on the Hypervisor Details page is a list of
tested and qualified versions for ESXi and others.
o For third-party hypervisors there is also the corresponding .JSON file required
during upload with the MD5 checksum, current hypervisor upgrade compatible
versions, NOS compatible versions.
Foundation CVM-Based Node Discovery Page

Troubleshooting IPv6 issues for Discovery to Work
Required for CVM-based Foundation

28

Foundation CVM-Based Node Discovery Page


• Here is the discovery page in the wizard.
o Notice the blue link to rerun discovery.
• Remember that CVM-based Foundation requires IPv6 discovery.
o That means that the CVM interfaces eth0 and eth1 need to be assigned and
running IPv6 addresses.
• By default the CVMs have IPv4 and IPv6 addresses configured on the eth0 and eth1
network interfaces.
o The IPv6-assigned address is the link-local address FE80… which allows IPv6
nodes to communicate with each other on the subnet.
• There is no need for static or automatic assignment for the IPv6 address for the CVM
interfaces.
o To troubleshoot, verify that the CVM’s have IPv6 addresses assigned.
• Use the ifconfig command on the CVM to verify that proper IPv6 addresses have
been set on the network interfaces on the CVM.
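For example (eth0 is the CVM’s external interface; the same check can be repeated for eth1):

$ ifconfig eth0 | grep inet6                  # confirm a link-local fe80:: address is assigned
$ allssh "ifconfig eth0 | grep inet6"         # check all CVMs in one pass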
Foundation CVM-based Define Cluster Page

Let’s create a network problem by typing in the wrong Subnet Mask

29

Foundation CVM-based Define Cluster Page


• In this scenario let’s intentionally misconfigure the subnet mask on the Define
Cluster page.
o The wizard does not provide a pre-check on the Define Cluster page.
• There is a pre-check on the next page Setup Node.
• The Define Cluster page is also where the following settings are configured:
o Cluster Name
o Cluster IP Address or the VIP (Virtual IP) used by the Prism interface
o DNS Servers for the Cluster
o NTP Servers for the Cluster
o Timezone
o NetMask for CVM and Hypervisor
o CVM Memory
Foundation CVM-based Setup Node Page
Foundation CVM-based Setup Node Page


• The Setup Node page is where the IP addresses are set for the CVM’s external
network access and also the hypervisor’s IP addresses for external network
access.
o Once the IP settings are defined, the next button will perform a pre-check to
make sure the IP addresses for the CVMs and hypervisors can reach each
other and the default gateway.
• If there are any connectivity problems between the hypervisor, the CVMs, and the gateway, the wizard will not allow you to proceed until the issues are resolved.
o If the proper VLAN tags are not assigned, connectivity can fail when imaging in production environments.
o You may need to consult the network team for assistance with the VLAN assignments and the IP address scheme.
• Where do you go to change the VLAN ID for the IPMI interface? What is the default VLAN ID for the IPMI interface? (See the example below.)
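As a hedged illustration only (confirm the exact workflow for your platform in the IPMI documentation), the IPMI VLAN can typically be viewed and set with ipmitool from the host; channel 1 is assumed here:

# Show current LAN settings for channel 1, including the 802.1q VLAN ID
ipmitool lan print 1

# Set or clear the VLAN ID on the IPMI interface (example VLAN 100)
ipmitool lan set 1 vlan id 100
ipmitool lan set 1 vlan id off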
Foundation CVM-based
• The Foundation interface does not provide much insight into the actual problem.
• Where to find the real issue? The Foundation logs.
Foundation CVM-based
• When you click Next on the Setup Node page, the wizard performs a pre-check and tests connectivity between all the IP addresses and the gateway.
o If there is a connectivity issue then the wizard will highlight the IP addresses in
question in red.
• The error is not very obvious from the web interface.
o At this point, there could be a number of issues causing the network connectivity
problem.
• To provide better troubleshooting we must access the Foundation Server logs
either:
o On the CVM if CVM-based or
o On the Standalone Virtual Machine.
• Both use the same log path locations and naming conventions.
• Foundation logs are stored in the following path:
o /home/nutanix/data/logs/foundation.out
o This log is for the overall Foundation Service.
o It also tells you where the individual logs for each node are being stored and their
names.
o You may have to examine the foundation.out log on multiple nodes to find the one that indicates where the node log files are stored.
cat foundation.out
tar: Cowardly refusing to create an empty archive
Try `tar --help' or `tar --usage' for more information.
'tar cf /home/nutanix/data/logs/foundation/archive/orphanlog-archive-20170511-
144648.tar -C /home/nutanix/data/logs/foundation/. ' failed; error ignored.
2017-05-11 14:46:48,304 console: Console logger is configured
2017-05-11 14:46:48,304 console: Configuring <class
'foundation.simple_logger.SmartFileLoggerHandler'> for None
2017-05-11 14:46:48,305 console: Configuring <class
'foundation.simple_logger.FoundationLogHandler'> for foundation
2017-05-11 14:46:48,305 console: Configuring <class
'foundation.simple_logger.SessionLogHandler'> for foundation.session
2017-05-11 14:46:48,428 console: Foundation 3.7 started
Foundation CVM-based (cont.)

Foundation CVM-based - (cont)


The foundation.out log tells where the individual node logs will be stored:
2017-05-11 14:47:35,672 console: Log from foundation.session.20170511-144735-
1.node_10.30.15.47 is logged at /home/nutanix/data/logs/foundation/20170511-144735-
1/node_10.30.15.47.log
2017-05-11 14:47:35,673 console: Log from foundation.session.20170511-144735-
1.node_10.30.15.48 is logged at /home/nutanix/data/logs/foundation/20170511-144735-
1/node_10.30.15.48.log
2017-05-11 14:47:35,674 console: Log from foundation.session.20170511-144735-
1.node_10.30.15.49 is logged at /home/nutanix/data/logs/foundation/20170511-144735-
1/node_10.30.15.49.log
2017-05-11 14:47:35,734 console: Log from foundation.session.20170511-144735-
1.cluster_eric is logged at /home/nutanix/data/logs/foundation/20170511-144735-
1/cluster_eric.log
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names

/home/nutanix/data/logs/foundation/YYYYMMDD-HHMMSS-x (where x is an incremental counter starting at 1)
• Here is the folder for the individual foundation logs for each node.
o For troubleshooting, the specific node logs will provide better answers to node
imaging problems.
• When reviewing these files, look for network connectivity issues that could be causing the imaging problems or failures.
o Node log file naming convention: node_cvm_ip_address.log (see the example below)
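As an illustration, a hedged example of locating and following a node-specific log; the session folder name and CVM IPs below are placeholders taken from the sample output above:

# List Foundation imaging sessions, newest first
ls -lt /home/nutanix/data/logs/foundation/

# Follow the log for one node in a given session
tail -f /home/nutanix/data/logs/foundation/20170511-144735-1/node_10.30.15.47.log

# Search all node logs in that session for network-related failures
grep -i "gateway\|unreachable" /home/nutanix/data/logs/foundation/20170511-144735-1/node_*.log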
Foundation CVM-based
The answer is in the Foundation logs for the nodes:
• Cannot reach the default gateway
• The subnet mask is wrong

Foundation CVM-based
• By looking into the individual node log we discover the issue. The subnet mask was incorrectly set to 255.255.255.128, which does not allow the CVM to reach the gateway.
o Subnet mask: 255.255.255.128
o Gateway: 10.30.0.1
o CVM IP address: 10.30.15.47
• So let's examine:
o A 255.255.255.128 subnet mask means we are borrowing one bit in the last octet, so every 128 addresses is a new subnet.
• This divides the last octet (0 to 255) into two subnets:
o 0-127 is one subnet
o 128-255 is another subnet
• Any addresses from 0 through 127 in the last octet are on one subnet, and 128 through 255 are on a separate subnet.
• Examine the IP address on the CVM: 10.30.15.47/255.255.255.128.
o So the CVM is on subnet 10.30.15.0-127.
• The gateway is 10.30.0.1/255.255.240.0.
o 255.255.240.0 is the correct subnet mask for this network, which means every 16 values in the third octet is a different subnet.
• So the gateway is on subnet 10.30.(0-15).xxx.
o With the wrong mask the CVM and the gateway are not on the same subnet and therefore cannot communicate (see the check below).
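One quick way to confirm a mismatch like this, assuming the ipcalc utility is available on the CVM or your workstation (if it is not, the same math can be done by hand as above):

# Network for the CVM with the mistyped mask
ipcalc -n 10.30.15.47 255.255.255.128    # NETWORK=10.30.15.0

# Network for the gateway with the same (wrong) mask
ipcalc -n 10.30.0.1 255.255.255.128      # NETWORK=10.30.0.0

# With the correct /20 mask, both addresses land in the same network
ipcalc -n 10.30.15.47 255.255.240.0      # NETWORK=10.30.0.0
ipcalc -n 10.30.0.1 255.255.240.0        # NETWORK=10.30.0.0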

Healthy Log for a Node After Foundation
Imaging steps:
• Get FRU device description via IPMItool
• Check to see what type of system:
  o Lenovo
  o Software Only
  o Dell
• Prepare NOS package and node-specific Phoenix
• Mount Phoenix ISO and boot into Phoenix

Healthy Log for a Node After Foundation


• Here is a sample of Foundation log for a specific node.
o Notice the directory where the logs are stored for each individual node.
• NOTE: Go to Appendix A, Module 4 Foundation Troubleshooting Foundation
Log, to see the complete Foundation log file.
o Review this with the class.
Moving Nodes to Other Blocks

Moving Nodes to Other Blocks


Moving Nodes to Other Blocks
Change the node position for new hardware platforms (G4 and later)
• Scenarios
  o Nodes show up in the wrong position in Prism
    – Node is physically in position A of the block but shows up in Prism in position D
  o Foundation failed to discover the nodes and bare-metal imaging was used
  o A node was manually imaged through Phoenix
  o An already-imaged node is later moved to a different chassis position

Moving Nodes to other Blocks


Moving Nodes to Other Blocks
Before G4 platforms:
• Modify the field "node_position": "A" in /etc/nutanix/factory_config.json
• Restart genesis
Since G4 platforms:
• Just modifying the factory_config.json file does not update the node location in Prism

Moving Nodes to other Blocks


Before G4 Platforms
• Before G4 platforms, to modify the position the following procedure was required:
o You modify the field "node_position": "A" in /etc/nutanix/factory_config.json
o You issue the genesis restart command.
• Since G4 platforms, just modifying factory_config.json file does not update the node
location in Prism and ncli host ls still displays the following:
ID : 00050df9-1d9c-d30b-0000-00000000XXXX::47009525
IPMI Address : 192.168.52.52
Controller VM Address : 192.168.52.50
Hypervisor Address : 192.168.52.51
Host Status : NORMAL
Oplog Disk Size : 100 GiB (107,374,182,400 bytes) (2.3%)
Under Maintenance Mode :false (-)
Metadata store status: Metadata store enabled on the node
Node Position: Node physical position can't be displayed for this model.
Please refer to Prism UI for this information.
Node Serial (UUID): OM153S02XXXX (87f7f0c8-bb3e-4719-9c54-3bd7031e163f)
Block Serial (Model) : 15SM6013XXXX (NX-3060-G4)
Solution Case A
Case A
Copy an existing hardware_config.json
• You can copy the file from a correctly configured node installed at the same position in a different block
• Easiest, not always applicable

Solution
• In the example below, we have a node that's being displayed in slot D while
physically in slot A of the same chassis.
• We want to make sure this node shows up correctly in slot A of the hardware diagram
in Prism.
Case A (easiest, not always applicable):
• Copy an existing hardware_config.json file from a correctly configured node installed
at the same position in a different block.
Requirements
• You have access to a hardware_config.json file from the exact same node model as the one you are going to modify (e.g., you can use the file from an A node in a different chassis).
Procedure
• Move the original hardware_config.json file aside:
o <Wrong_positioned_node_IP>:~$ sudo mv /etc/nutanix/hardware_config.json /etc/nutanix/hardware_config.json.ORIG
• Copy /etc/nutanix/hardware_config.json from a node that is already in the desired location to the node you want to correct (these two nodes should be in the same position of another block):
o <Wrong_positioned_node_IP>:~$ sudo sh -c 'rsync -av nutanix@Well_positioned_node_IP:/etc/nutanix/hardware_config.json /etc/nutanix/hardware_config.json'
• Restart genesis:
o <Wrong_positioned_node_IP>:~$ genesis restart

Solution Case B
Case B
Manually modify the hardware_config.json file
• The JSON file structure is used by the Prism GUI
• Provides configuration information about the system
• Make a backup copy of this file before you make modifications

Moving Nodes to other Blocks - 3


Case B (always applicable):
• Manually modify hardware_config.json file.
Requirements
• This procedure has been tested only on four-node blocks.
• Corresponding values have to be confirmed in case of single/double-node blocks.
Procedure
• The hardware_config.json file structure is used by the Prism GUI to get all configuration information about the system.
o The field value to modify in case of node replacement is node_number:
• Slot A corresponds to value 1
• Slot B corresponds to value 2
• Slot C corresponds to value 3
• Slot D corresponds to value 4
• For both nodes and disks:
o Slot A means the first position on the x-axis
o Slot B means the second position on the x-axis, and so on
• cell_x
- Slot A corresponds to value 0
- Slot B corresponds to value 6
- Slot C corresponds to value 12
- Slot D corresponds to value 18
• Depending on whether you are working with a 2U or a 4U block, you also have to modify the cell_y value in the location section:
o Lower nodes (slot A level) correspond to value 1
o Higher nodes (slot D level) correspond to value 0
For example:
• NX-3060-G4 node C
• "node_number": 3, "location": {"access_plane": 2, "cell_x": 6, "width": 4, "cell_y": 1,
"height": 1 },
Foundation Logs Of Interest

Foundation Logs Of Interest


Foundation Logs of Interest
/home/nutanix/data/logs/foundation.out
• Tells you where each individual node-specific Foundation log file resides.
• Most likely this will be on the CVM where the Foundation web page was launched.
/home/nutanix/data/logs/foundation/20170504-230052-8/node_10.30.15.48.log
• Folder name format: date-time-try_counter
• Node log format: node_CVM_IP.log

Foundation Logs of Interest


Labs

Labs
Thank You!

46

Thank You!
Module 5
Hardware Troubleshooting
Nutanix Troubleshooting 5.x

Module 5 Hardware Troubleshooting


Objectives
After completing this module, you will be able to:
• Describe and Troubleshoot the Physical Hardware Components of a Nutanix Cluster
• Perform Failed Component Diagnosis
• Explain the Role of Disk Manager (Hades)
• Repair Single SSD Drive Failures
• Apply Firmware Updates using Lifecycle Manager LCM
• Hardware Troubleshooting Logs of Interest
• Describe and Perform a Node Removal in a Nutanix Cluster
  o How to Perform a Node Removal
  o Node Removal Process
  o Troubleshooting and Monitoring Node Removal
• Describe and Perform a Disk Removal on a Node
  o How to Perform a Disk Removal in a Nutanix Cluster
  o Disk Removal Process
  o Troubleshooting and Monitoring Disk Removal

Objectives
After completing this module, you will be able to:
• Describe and Troubleshoot the Physical Hardware Components of a Nutanix Cluster
• Perform Failed Component Diagnosis
• Explain the Role of Disk Manager (Hades)
• Repair Single SSD Drive Failures
• Apply Firmware Updates using Lifecycle Manager LCM
• Hardware Troubleshooting Logs of Interest
• Describe and Perform a Node Removal in a Nutanix Cluster
o How to Perform a Node Removal
o Node Removal Process
o Troubleshooting and Monitoring Node Removal
• Describe and Perform a Disk Removal on a Node
o How to Perform a Disk Removal in a Nutanix Cluster
o Disk Removal Process
o Troubleshooting and Monitoring Disk Removal
Components

Components
Describe & Troubleshoot Nutanix Cluster Hardware Components
Model Numbering: NX-ABCD-GE (example: NX-3460-G4)
A = Product family series (1 = entry – 1 SSD + 2 HDD; 3 = balanced – 2 SSDs + 2/4 HDDs; 6 = capacity – 1-2 SSDs + 4/5 HDDs; 7 = GPU node; 8 = Ent Apps; 9 = all flash)
B = Number of nodes being sold in the chassis (some will always be 1, others will be 1, 2, 3, or 4)
C = Chassis form factor – nodes/rack units (5 = 1N-2U; 6 = 4N-2U; 7 = 1N-1U)
D = Drive form factor (0 = 2.5" disks; 5 = 3.5" disks)
GE = CPU generation (4 = Haswell; 5 = Broadwell)

Describe & Troubleshoot Nutanix Cluster Hardware Components


Nutanix provides the NX line of hardware to run the Nutanix Clusters.

The NX line of hardware consists of Blocks (chassis) and Nodes (x86 servers). There are
different Models to choose from including entry level platforms such as the 1000 series or the
All Flash 9000 series platform. Nutanix also provides a concept called storage heavy nodes in
the 6000 series. See the Nutanix Support Portal for details and specifications on all Nutanix NX
Series Platforms.

The chassis that the NX hardware ships in is called a Block. All of the Nutanix NX platform
chassis are standard 2U in width. Chassis come in several variations for node configuration.
Some chassis have four slots to contain up to a max of 4 nodes in the Chassis. Examples would
be the NX 1000 and NX 3000 series. Storage heavy nodes will have two slots for 2 nodes.
Nutanix even has single node backup only targets which only contain a single slot in the
chassis and one node.

The compute resources in the chassis are the nodes. Each node is a separate x86 hardware
server that runs the hypervisor, CVM, and UVM's. Nodes can be referred to as sleds or blades.
Each sled slides into a slot in the chassis and connects to the midplane for power and disk
connectivity. All chassis have disk shelves built in for storage. All nodes run Intel Processors
and come in different socket, core, Network, and memory configurations. The 8000 series
nodes support expansion GPU cards for GPU intensive workloads.

On the top of all the chassis, you will see the Nutanix Branding label that confirms this is
Nutanix hardware. On the side is where the chassis serial no. is located. This is the serial
number in the factory_config.json file for node position and naming. The serial number will
also be in the name of the CVM in the SSH shell.
Failed Component Diagnosis
SATA SSD Metadata Drive
SATA HDD or SATA SSD Data Drive
Node
Chassis or Node Fan
Memory
Power Supply
Chassis

Failed Component Diagnosis


SATA SSD Drive Failure – Failure Indicators
Failure indicators – if a Nutanix node experiences an SSD drive failure, the console of the Controller VM shows the following:
• Repeated ext4 filesystem errors
• The words "hard reset failed kernel panic"
• I/O errors on the /dev/sda device
Failure indicators:
• An alert was received on the Nutanix UI that a SATA SSD was marked offline
• There are errors in the Cassandra log file indicating an issue reading metadata
SATA SSD Drive Failure – Failure Indicators


SATA SSD Drive Failure – What to Do Next
If the Controller VM shows errors on the console:
• Power down the Controller VM.
• Reseat the SSD drive.
• If that does not resolve the issue, replace the drive.
If an alert was received that the disk was marked offline, run the following command:
• ncli> disk list id=xx – where xx is the disk ID in the alert, to verify the location for replacement
If there are errors indicating metadata problems, replace the disk.

SATA SSD Drive Failure – What to Do Next


• What to do next if you encounter SATA SSD Drive Failures.

• SSD drives failures can be one of the most challenging issues to fix in a Nutanix
Cluster. This is especially true for Nutanix single SSD node systems. The SATA SSD
drive is where the Controller Virtual Machine lives. The Controller Virtual Machine
configuration files and data are stored on the SSD drives.

• Nutanix nodes ship with either a single or dual SSD drive configuration.

• For dual configurations, the CVM configuration files partition is mirrored for fault
tolerance. If only one SSD drive is lost, the Controller Virtual Machine should not
experience downtime. Note: The Controller VM will reboot if there is a SSD drive
failure. Replace the failed SSD drive and it should automatically be brought back into
the mirror and re-synced.

• For single SSD boot drive configurations, there is no mirrored drive so Controller
Virtual Machine failure will occur. Controller Virtual Machine failure will cause a loss in
the local Stargate access. HA will provide failover for the hypervisor by pointing to a
remote Controller virtual machine for Stargate Services. The Controller virtual machine
has to be rebuilt.
• To replace the single SSD drive requires manual intervention. We will review the
detailed steps next.
Repair Single SSD Drive Failure Systems
Step 1: Gather information needed to fix the failed component
  $ ~/serviceability/bin/breakfix.py --boot --ip=cvm_ip_addr
Step 2: Remove the failed metadata drive
Step 3: Replace the failed metadata drive
Step 4: Log on to another CVM and start the repair
  $ single_ssd_repair -s cvm_ip
Step 5: Check the SSD repair status
  $ ssd_repair_status

Repair Single SSD Drive Failure Systems


• The Nutanix Controller VM files are stored on the SSD drives of a Nutanix node.
Nodes that have two SSD drives mirror the CVM partition for fault tolerance and
redundancy. If one of the SSD drives fails, replace, re-create, and the mirroring
software will fix the SSD replacement drive.

• Systems that have a single SSD drive cannot rely on the mirroring software to fix the
failed drive. For these systems you must follow the following procedures to repair
single SSD drive failures.

• Gather information about the CVM on the failed SSD drive node.

$ ~/serviceability/bin/breakfix.py --boot --ip=cvm_ip_addr


(cvm_ip_addr – IP Address of failed CVM)

• Next remove the failed SSD drive and replace with replacement drive.
• Log on to any other CVM in the cluster and run the following command to start the
SSD drive repair.

$ single_ssd_repair –s cvm_ip

• The repair process will partition format and rebuild the CVM on the node where the
SSD drive failed. ssd_repair_status can be used to track the status of the repair
process.

nutanix@cvm$ ssd_repair_status
Sample output:

SSD repair state machine completed


=== Single SSD Repair Info ===
{ "curr_state": "**FINALIZED**",
"final_status": "Completed",
"final_timestamp": "Sun, 03 Jan 2016 18:54:52",
"host_ip": "10.5.20.191",
"host_type": 3,
"message": null,
"prev_state": "sm_post_rescue_steps",
"start_timestamp": "Sun, 03 Jan 2016 18:48:38",
"svm_id": 4, "svm_ip": "10.5.20.195",
"task": "3c46363a-67ac-4827-a154-b20b281f49bb" }
Troubleshooting Single SSD Drive Failure System Repairs
If the SSD repair fails:
• Clean the fourth partition of the boot drive
  $ sudo /cluster/bin/clean_disks -p /dev/sdX4
• Restart Hades
  $ sudo /usr/local/nutanix/bootstrap/bin/hades restart
• Restart Genesis
  $ genesis restart

Troubleshooting Single SSD Drive Failure System Repairs


Failed Component Diagnosis
SATA SSD Metadata Drive
SATA HDD or SATA SSD Data Drive
Node
Chassis or Node Fan
Memory
Power Supply
Chassis

11

Failed Component Diagnosis


SATA HDD or SATA SSD Data Drive Failure
Failure indicators:
• An alert was received on the Nutanix UI that a disk was marked offline.
• The Stargate process marks the disk offline if I/O stops for more than 20 seconds.
  o An alert is generated.
Next steps:
• Run the command ncli> disk ls id=xx to get the disk location to replace.

SATA HDD or SATA SSD Data Drive Failure


• Sample Output:

ncli> disk ls id=xx (where xx is the ID shown in the alert) and verify the following:
Storage Tier: SSD-SATA
Location: [2-6]
Online: false
SATA HDD or SATA SSD Data Drive – SATA Controller Failure
Multiple drives marked offline during the same timeframe:
• The SATA controller on the node may be bad.
Verify by running the smartctl command on the disk, checking for any errors:
• nutanix@cvm$ sudo smartctl -a /dev/sdc
Sample output:
• SMART overall-health self-assessment test result: PASSED
• SMART Error Log Version: 1
• No Errors Logged
If there are no errors, replace the node and mark the disks online:
• nutanix@cvm$ ncli -h true (-h is for hidden commands)
• ncli> disk mark-online id=disk_id

SATA HDD or SATA SSD Data Drive – SATA Controller Failure


Failed Component Diagnosis

SATA SSD Metadata Drive


SATA HDD or SATA SSD Data Drive
Node
Chassis or Node Fan
Memory
Power Supply
Chassis

14

Failed Component Diagnosis


Node Failure
Failure indicators:
• No green light on the power button when trying to power up
• One of the on-board NIC ports is not working
• A diagnosed memory failure / actual memory slot failure
• Multiple HDDs go offline with no drive errors reported
Next steps:
• Node does not power on: reseat the node. If that does not resolve the issue, replace the node.
• NIC port not working: troubleshoot NIC issues. If the NIC is still not working, replace the node.
• Any other failure indicators: replace the node.

Node Failure
Failed Component Diagnosis
SATA SSD Metadata Drive
SATA HDD or SATA SSD Data Drive
Node
Chassis or Node Fan
Memory
Power Supply
Chassis

16

Failed Component Diagnosis


Chassis or Node Fan Failure
Failure indicators:
• An alert was received on the Nutanix UI that a fan had stopped, or is running too fast or slow.
• Running ipmitool sel list from the AHV hypervisor shows fan errors for Fan1 or Fan2.
• Running ipmitool sensor list from the AHV hypervisor shows 0 RPM for either Fan1 or Fan2, and there are temperature alerts.
Next steps:
• Replace the chassis or node fan.
• If the fan speed still reports 0 RPM, this is a PDB sensor issue; replace the chassis.

Chassis or Node Fan Failure


Failed Component Diagnosis

SATA SSD Metadata Drive


SATA HDD or SATA SSD Data Drive
Node
Chassis or Node Fan
Memory
Power Supply
Chassis

18

Failed Component Diagnosis


Memory Failure
Failure indicators:
• NCC returns a correctable ECC error.
• Running ipmitool sel list on the AHV hypervisor shows an uncorrectable memory error for a DIMM.
• Not all memory is detected; for example, the system has 128 GB and the host shows 120 GB.
Next steps:
• Correctable/uncorrectable ECC errors in the IPMI event log: replace the DIMM.
• Undetected memory: replace the DIMM.
• Run the following command to determine the location of the DIMM that is not detected:
  root@ahv# virsh sysinfo | egrep "size|locator"

Memory Failure
Failed Component Diagnosis

SATA SSD Metadata Drive


SATA HDD or SATA SSD Data Drive
Node
Chassis or Node Fan
Memory
Power Supply
Chassis

20

Failed Component Diagnosis


Power Supply Failure
Failure indicators:
• The Nutanix UI shows a power supply alert.
• The ipmitool event log shows a power supply failure.
• The alert light on the front of the node flashes at 4-second intervals.
• The alert light on the power supply unit is orange.
• All nodes in a block fail (both power supplies failed).
Next steps:
• Run /home/nutanix/serviceability/bin/breakfix.py to determine which PSU failed and replace it.

Power Supply Failure


Failed Component Diagnosis

SATA SSD Metadata Drive


SATA HDD or SATA SSD Data Drive
Node
Chassis or Node Fan
Memory
Power Supply
Chassis

22

Failed Component Diagnosis


Chassis Failure
Failure indicators:
• Errors or failures on multiple drives, and replacing the drives does not resolve the issues.
• Errors or failures on a single drive, and replacing the drive does not resolve the issues.
• PDB sensor failure, but the actual component did not fail.
• Physical damage to the chassis.
Next steps:
• Replace the chassis.

Chassis Failure
SATA SSD Boot Drive Failure – Rescue Shell
Log in to the rescue shell to diagnose issues with the boot device.
Steps to enable the rescue shell:
• Create the ISO image svmrescue.iso
  nutanix@cvm$ cd ~/data/installer/*version*   (AOS version)
  nutanix@cvm$ ./make_iso.sh svmrescue
• Launch the recovery shell
  o Shut down the Controller VM if it is still running.

SATA SSD Boot Drive Failure – Rescue Shell


Disk Manager (Hades)

Disk Manager (Hades)


Explain the Role of Disk Manager (Hades)
• What is the Self-Monitoring, Analysis, and Reporting Technology (SMART) tool?
• Stargate marks disks offline if disk I/O is experiencing delayed response times.
• The Hades service automates the smartctl drive self-tests.
• Logs to troubleshoot Disk Manager (Hades):
  o Hades logs all of its actions in /home/nutanix/data/logs/hades.out
  o Stargate log "marked offline" entries
  o smartctl log

Explain the Role of Disk Manager (Hades)


• Self Monitoring, Analysis and Reporting is a technology built into Disk Drives and
Solid State Drives to predict drive failures. By performing various self-tests on the
drive, SMART can sense and provide early warnings of a disk drive failure. This
allows for administrators to be proactive to migrate data and replace the disk before
the drive fails.

• Note: Drive failures are not as severe in a Nutanix Cluster due to the Replication
Factor. With RF2 there is always another copy on another node in the cluster. With
RF3 there are always 2 copies on 2 other nodes in the cluster.

• Best practice is to watch for drive failures and replace as soon as possible.

• smartctl is the command used to perform the self-tests on the drive.

• Each node has a CVM running on it that provides access to the disks for that node.
Each node’s local disks are placed into one large Nutanix Storage Pool, or what is
known as the Acropolis Distributed Storage Fabric (ADSF).

• The Service on the node responsible for all reads and writes to the direct attached
disks is Stargate. Stargate monitors the I/O times. If drives are having delayed I/O
times, then Stargate will mark the drive offline.
• Although slow I/O times can be a very good indicator of a drive going bad, it’s
sometimes not enough information to prove the drive is failing or failed. Smartctl self-
tests can be run on the offline drive to provide more extensive testing.

• After the drive is marked offline by the Stargate service, the Hades Disk Manager
service automatically removes the drive from the data path and runs smartctl self-
tests against the drive. The smartctl self-tests provide a more thorough testing of the
drive.

• If the drive passes the smartctl self-tests, the drive is marked back online and
returned to the data path.

• If the drive fails the smartctl self-tests or Stargate marks the disk offline 3 times within
an hour (regardless of smartctl results) the drive is removed from the cluster.

• Hades logs all actions to the following logfile:

/home/nutanix/data/logs/hades.out

• This logfile will provide information to assist when troubleshooting drive failure issues.
Any drives marked offline by stargate will be tested automatically using smartctl by
the Hades service.

• Hades also performs other disk operations besides disk monitoring such as disk
adding, formatting, and removal. The Hades logfile will be helpful to assist with any
issues for these operations as well.

• There is also a smartctl log that can be examined for more detailed information about
the smartctl drive self-tests being performed and the diagnosis on the drive that is
failed or failing.

• To view drives marked offline search the stargate logfile for “marked offline”
entries. Review the time stamp of occurrence then review the smartctl log for the
corresponding drive self-tests for the prediction good or bad and detailed information
on the drive.

• There will also be entries in the Hades logfile specifying the actions run by Hades to test the offline drive marked by Stargate (see the example commands below).
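For example, a hedged sketch of pulling those entries from a CVM; the log paths are the standard ones referenced above:

# Find Stargate events where a disk was taken out of the data path
grep -i "marked offline" /home/nutanix/data/logs/stargate.*

# Review what Hades did with that disk afterward
less /home/nutanix/data/logs/hades.out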
Explain the Role of Disk Manager (Hades)
Prism – Hardware – Disk page
• The summary for the selected disk appears below the table.

Explain the Role of Disk Manager (Hades)


• Disk page in Prism with summary information for all disk in the cluster. This page also
includes performance metrics for each disk.
Explain the Role of Disk Manager (Hades)
Gather the smartctl log to find out if a drive is failing or failed:
  $ sudo smartctl -T permissive -a [device]
  $ df -h   (to get the device name of the drive)
How to check the status of Hades:
  $ sudo /usr/local/nutanix/bootstrap/bin/hades status
How to restart the Hades service:
  $ sudo /usr/local/nutanix/bootstrap/bin/hades restart
How to run smartctl tests manually:
  $ sudo smartctl --test=short|long /dev/sdc
  $ sudo smartctl -X   (to abort a test)

Explain the Role of Disk Manager (Hades)


• Sample Output:

$sudo smartctl -T permissive -a /dev/sda – SCSI DISK a

The log indicates that the drive is failing:


=== START OF READ SMART DATA SECTION === SMART overall-health self-
assessment test result: FAILED! Drive failure expected in less than 24 hours.
SAVE ALL DATA. Failed Attributes: ID# ATTRIBUTE_NAME FLAG VALUE
WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 5
Reallocated_Sector_Ct 0x0033 001 001 005 Pre-fail Always FAILING_NOW
2005

• Run the following command to check the status or restart the Hades service:

$sudo /usr/local/nutanix/bootstrap/bin/hades status


$sudo /usr/local/nutanix/bootstrap/bin/hades restart
How to run smartctl drive tests manually:

$sudo smartctl --test=short|long /dev/sdc

Short Test
The goal of the short test is the rapid identification of a defective hard drive.
Therefore, a maximum run time for the short test is 2 min. The test checks the disk
by dividing it into three different segments. The following areas are tested:

~ Electrical Properties: The controller tests its own electronics, and since this is
specific to each manufacturer, it cannot be explained exactly what is being tested. It
is conceivable, for example, to test the internal RAM, the read/write circuits or the
head electronics.
~ Mechanical Properties: The exact sequence of the servos and the positioning mechanism to be tested is also specific to each manufacturer.
~ Read/Verify: It reads a certain area of the disk and verifies certain data; the size and position of the region that is read is also specific to each manufacturer.

Long Test
The long test was designed as the final test in production and is the same as the
short test with two differences. The first: there is no time restriction and in the Read/
Verify segment the entire disk is checked and not just a section. The Long test can,
for example, be used to confirm the results of the short tests.
Explain the Role of Disk Manager (Hades)
What happens to drives that fail smartctl or Stargate:
1. The disk is marked for removal within the cluster Zeus configuration.
2. The disk is unmounted.
3. The cluster automatically begins to create new replicas of any data that is stored on the disk.
4. The failed disk is marked as a tombstoned disk to prevent it from being used again without manual intervention.
5. The red LED of the disk is turned on to provide a visual indication of the failure.
It is safe to remove the physical drive once the red LED is on.

Explain the Role of Disk Manager (Hades)


• The Stargate service will mark drives offline if disk I/O delays are occurring. Again,
you can review the local Stargate logfile to see the marked offline events for drives.

• If Stargate marks a drive offline 3 times within 1 hour, the drive is marked as failed
and removed from the cluster.

• The detailed steps above show the drive removal process.

• If not, Hades will run self-tests on the drive to verify it is failed or failing. If the drive
self-tests fail Hades will mark the drive failed and remove from the cluster. If the drive
passes the drive self-tests it is brought back online.

• If the drive has failed, a replacement drive needs to be sent to the customer. The
physical drive replacement can be performed by Authorized Support Partners or by
the customer. The Support Portal has step-by-step documentation for drive
replacement procedures.

• The above steps will also occur in the following 2 scenarios:


• A disk is manually marked for removal using NCLI or Prism.
• A disk is pulled out of a running system.
How to bring back a disk that was marked as failed by Hades (see the hedged example below):
• Run edit-hades and change is_mounted: False to is_mounted: True.
• Remove the date/time entries for when Hades thought the disk was bad.
• Restart Genesis on that local CVM.
• Stop and restart Stargate on that local CVM.
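A minimal sketch of those recovery steps on the local CVM, assuming a senior resource has confirmed the disk is actually healthy; bouncing Stargate with genesis stop followed by cluster start is a common pattern, but treat it as an assumption and confirm the current procedure before running it:

# Edit the Hades proto: set is_mounted back to True and clear the offline timestamps
edit-hades

# Restart Genesis on this CVM
genesis restart

# Stop and restart Stargate on this CVM only
genesis stop stargate
cluster start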
Explain the Role of Disk Manager (Hades)
What happens when I insert a new replacement disk?
• It is automatically added if there is only one storage pool.
• Otherwise, manually partition and format it from the Prism – Hardware – Disk page.
What happens if I re-insert the original disk or a previously used disk?
• The disk will not be automatically added (it has existing data).
• Repartition and format it.
• This will also remove the disk from the tombstone list.

Explain the Role of Disk Manager (Hades)


• After replacing the failed disk with a replacement disk, Hades should automatically
repartition and format the drive and bring it online if the cluster is using a single
Storage Pool.

• If the replacement drive does not come online automatically, use Prism to manually
prepare the new drive. In Prism on the On the Hardware – Diagram page select the
new drive and then click Repartition and Format.

• After re-inserting the existing disk, or any disk that has old data on it, the newly-
inserted disk will need to be manually repartitioned and formatted in Prism. The disk
cannot be used until these tasks are performed.
Explain the Role of Disk Manager (Hades)
Prism – Hardware – Diagram page

Explain the Role of Disk Manager (Hades)


Explain the Role of Disk Manager (Hades)
How to review Hades historical information for disks:
  $ edit-hades <cvm_id> print   (use ncli host ls to obtain the cvm_id)
Hades prints a disk inventory with the following information:
• slot_list – disk model, serial number, and firmware
• offline_timestamp_list – each time the disk was marked offline
• test_list – smartctl tests run by Hades
slot_list {
  location: 1
  disk_present: true
  disk {
    serial: "BTHC512104MM800NGN"
    model: "INTEL SSDSC2BX800G4"
    is_mounted: true
    current_firmware_version: "0140"
    target_firmware_version: "0140"
    is_boot_disk: true
    vendor_info: "Not Available"

Explain the Role of Disk Manager (Hades)


• The edit-hades <cvm_id> print command will list the disk inventory for a CVM.

• This can be helpful to examine disks being marked offline by Stargate and the
outcome of the drive self-tests performed by Hades. The inventory will show each
time a drive has been marked offline and the smartctl tests performed on the drive.
Sample Output:
slot_list {
location: 1
disk_present: true
disk {
serial: "BTHC512104MM800NGN"
model: "INTEL SSDSC2BX800G4"
is_mounted: true
current_firmware_version: "0140"
target_firmware_version: "0140"
is_boot_disk: true
vendor_info: "Not Available"
}
}
slot_list {
location: 2
disk_present: true
disk {
serial: "Z1X6QR49"
model: "ST2000NM0033-9ZM175"
is_mounted: true
current_firmware_version: "SN04"
target_firmware_version: "SN04"
vendor_info: "Not Available"
}
Lifecycle Manager (LCM)

Lifecycle Manager (LCM)


Apply Firmware Updates using Lifecycle Manager LCM
• Architectural overview
• Performing inventories and upgrades
• Roadmap
• Troubleshooting
• Known issues & caveats
• Opening ONCALLs for LCM
• Additional resources

Apply Firmware Updates using Lifecycle Manager LCM


LCM Architecture
The ability to perform certain types of upgrades is dependent on the versions of 3 components:
1. LCM-AOS Interface: part of the AOS release
2. LCM Framework (example: 1.0 & 1.1)*
3. LCM Update Modules*
* Can be released on a faster schedule than AOS and upgraded out-of-band.
• Note: These do not get updated with AOS; only the Interface does.
Foundation software is potentially a dependency since a subset of its functions can be called by LCM, but this should be publicized if that is the case in the future.

LCM Architecture
LCM Inventory & Upgrade Workflow
Inventory:
• The user clicks on the Inventory tab inside the LCM page.
• The LCM URL is set to the Nutanix portal by default. If this needs to be changed, the customer can change it by clicking on Advanced Settings.
• The user clicks on Options -> Perform Inventory to perform an inventory.
• Progress can be tracked through the LCM page or Task view.
Update:
• If an update exists for an entity, it shows up on the Available updates page.
• The user can choose to update all entities by clicking on the Update All button.
• The user can choose to update individual entities by clicking on Define.

LCM Inventory & Upgrade Workflow


LCM Inventory & Upgrade Workflow
(Screenshot: the LCM Inventory page, showing the Options and Available Updates / Inventory tabs and inventoried entity tiles — Cluster software component, BIOS, BMC, Disk, HBA, and NIC — each with an entity count and a last-updated time.)

LCM Inventory & Upgrade Workflow


LCM Inventory & Upgrade Workflow
(Screenshot: the Available Updates > Host Boot Devices page, showing SATADOM-SL 3ME devices on two hosts with installed version S161 and an available update to S170, no dependencies found, along with the Update All, Update Selected, and Advanced Settings options.)

LCM Inventory & Upgrade Workflow


LCM Inventory & Upgrade Workflow
• LCM will automatically change the URL from the 1.0 repository to the 1.1 path after running an Inventory:
  http://download.nutanix.com/lcm/1.1
LCM Inventory & Upgrade Workflow
More on updates:
• LCM is designed to automatically account for dependencies and upgrade paths for firmware.
• Future updates will be tagged "Critical", "Recommended", or "Optional".
• LCM will group together and prioritize upgrades based on lowest "impact":
  1. CVM (example: LSI SAS 3008 HBA, LCM Framework)
  2. Host (example: NIC driver update)
  3. Phoenix (example: BIOS, SATADOM)
• Information displayed in Prism comes from the Insights DB.

LCM Inventory & Upgrade Workflow


LCM Feature Roadmap
v1.0 (Jan 2017):
• Initial release of LCM (requires AOS 5.0)
• Detects hardware on NX platforms only ("Inventory" operation)
• LCM Framework can be self-updated without the AOS version changing
• Note: Some updates are hypervisor type and version dependent – check the Release Notes
v1.1 (May 2017) and v1.2 (June 2017):
• NX platform: SATADOM update, disk update, NVMe update, HBA update, disk inventory, NIC inventory
• XC platform: inventory and update modules – disks, HBA, NIC, PSU, SATADOM update
• LCM Framework: the LCM Framework URL to be identical across future releases
LCM Future (2H 2017):
• NX platform: support remaining hardware updates
• XC platform: support remaining hardware updates
• HX support (inventory & update)
• Software support initiated: NCC, AOS, Hypervisor, Foundation, AFS
• LCM on Prism Central to allow for upgrade consolidation
• Option to allow 'auto upgrades' of the LCM Framework
• Software-only support initiated (SDK for LCM): Cisco UCS series, HPE

LCM Feature Roadmap


• BMC, BIOS, and iDRAC updates for NX and XC are most likely coming in LCM 1.3.
• NCC Upgrade possibly in LCM 1.3.

Describe the following plans:


• Complete all Hardware Detection and Firmware Update 1-click workflows in PE.
o NX, XC, HX as priority
• Start to incorporate Software Upgrade workflows.
o NCC, Foundation initially, followed by AOS and Hypervisor.
• LCM in Prism Central.
SDK for LCM.
o Enable support for Cisco UCS, HPE Proliant and framework for future software-
only solutions.
• 1-Click Consolidation in Prism Central for all Upgrade tasks.
LCM Troubleshooting
Logs
• For pre-check, inventory, and other LCM framework-related issues, look at the genesis.out file on the LCM leader.
  o Use "grep -i lcm" to see only the relevant entries (see the example below).
• For LCM module issues, look at ~/data/logs/lcm_ops.out.
• For module download issues, look at ~/data/logs/lcm_wget.out.
• To see the status of LCM tasks outside of Prism, use the Ergon page:
  o http://<cvm-ip>:2090/tasks
  o Possible states: Queued, Running, Succeeded, Failed, Aborted
NOTE: LCM logs are not gathered by NCC log_collector until NCC 3.1.

LCM Troubleshooting
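Putting those points together, a hedged example of checking the logs on the LCM leader CVM (identifying the leader is covered on the next slide); the paths are the standard ones listed above:

# On the LCM leader CVM: show only LCM-related lines from the Genesis log
grep -i lcm /home/nutanix/data/logs/genesis.out | tail -50

# Module and download issues
tail -50 /home/nutanix/data/logs/lcm_ops.out
tail -50 /home/nutanix/data/logs/lcm_wget.out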
LCM Troubleshoot ing

Identifying the LCM leader

1. zkls /appliance/logical/pyleaders/lcm_leader on any node should have as many lines


as there are nodes in the cluster. For example:

nutanix@NTNX-A-CVM:xx.yy.zz.81:~$ zkls /appliance/logical/pyleaders/lcm_leader/


n_0000000004
n_0000000005
n_0000000006
n_0000000007

2. zkcat /appliance/logical/pyleaders/lcm_leader/<first_entry> to get the LCM leader


IP. For example:

nutanix@NTNX-A-CVM:xx.yy.zz.81:~$ zkcat
/appliance/logical/pyleaders/lcm_leader/n_0000000004
xx.yy.zz.82
NOTE: The LCM leader will move to other CVMs during some types of upgrades.

LCM Troubleshooting
LCM Troubleshooting
Additional Info
• Modules are downloaded to ~/software_downloads/lcm/ and then copied to all CVMs in the cluster.
• When LCM tasks fail they will not automatically restart. Place the cluster into a "clean" state by performing the following to resume:
  1. zkrm /appliance/logical/lcm/operation
     (Always consult with a Senior SRE before making Zookeeper edits.)
  2. cluster restart_genesis
  3. In Prism: Perform LCM Inventory
  4. In Prism: Perform the LCM update as required

LCM Troubleshooting
LCM Known Issues and Caveats
• The LCM Framework must be the same version on all CVMs.
• The LCM Framework doesn't get updated on nodes added through Expand Cluster until AOS 5.1.1. When using LCM on clusters where the framework is already on the 1.1 version, check the md5sum of the egg file to ensure that all CVMs are using the same version:
  nutanix@NTNX-A-CVM:$ allssh md5sum ~/cluster/lib/py/nutanix_lcm_framework.egg
• Always run a fresh Inventory operation after break-fix operations and upgrades of any type. Updates available through earlier inventories persist even if hardware/software has changed.
• Avoid manually cancelling stuck Ergon tasks. Gather logs and consult with a local LCM specialist, as an ENG/ONCALL may be required.
• Alerts are not currently suppressed during LCM upgrades.

LCM Known Issues and Caveats


Node Removal

Node Removal
Describe and Perform a Node Removal in a Nutanix Cluster
• Overview of node removal in a Nutanix cluster
• How to perform a node removal
• Node removal process
• Troubleshooting and monitoring node removal
Describe and Perform a Node Removal in a Nutanix Cluster


Overview of Node Removal in a Nutanix Cluster
• A Nutanix cluster is made up of multiple physical nodes.
• Nodes can be added to a Nutanix cluster dynamically.
• Nodes can be removed from a Nutanix cluster (non-disruptive).
  o Reasons: hardware maintenance, joining a different cluster, and so on.
• All UVMs and data have to be migrated to other nodes in the cluster before removal.
• VM migrations can leverage the vMotion/Live Migration features of the underlying hypervisor to perform them online.
• After UVM migration and data migration the node is prepared.
  o It can now be physically removed from the block.
• Node removal can be performed in nCLI or Prism.

Overview of Node Removal in a Nutanix Cluster


• A Nutanix cluster is made up of Nodes. The nodes are servers with the AHV
hypervisors installed and the Controller Virtual Machine (CVM). The Controller Virtual
Machine provides the local storage disk access to the hypervisor and user virtual
machines.

• In a Nutanix cluster, nodes can be added dynamically for scale-out performance.


When you add nodes to the cluster there is no downtime.

• Nodes can be removed from the cluster dynamically also. Why would you want to
remove a node from a cluster?
o Maintenance, replacement, to move it to another cluster, and other reasons.

• When a node is marked for removal from the cluster, all user virtual machines and
data must be migrated to other nodes in the cluster. Until the data is migrated
throughout the existing nodes in the cluster the node cannot be removed.

• User virtual machines will be migrated to other existing nodes in the cluster using a
feature in AHV called Live Migration. In AHV, Live Migration is enabled by default.
• Live Migration moves a virtual machine from one node (hypervisor) to another. The
Live Migration of the User Virtual Machine is performed online requiring no downtime.
All virtual machines except for the CVM must be migrated off the node first. Time
consumption will depend on how many VMs need to be migrated. More VMs means
the operation will take more time.

• After all the virtual machines are migrated off the node, then the data on the disks
need to be replicated to other nodes in the cluster for access and redundancy
requirements. This data consists of Cassandra, Zookeeper, Oplog and Extent Store
data. Time consumption will depend on the amount of data stored on the disks.

• Once all the data is replicated from the disks the node is removed from the cluster.
How to Perform a Node Removal
Make sure there are at least 4 nodes in the cluster (minimum 3 nodes required).
nCLI method:
• In nCLI run "host list" to obtain the IP address and ID of the node to prepare for removal from the cluster.
• In nCLI run "host remove-start id=<ID>" using the ID of the node from the "host list" output.
• host list and host get-remove-status will show MARKED_FOR_REMOVAL_BUT_NOT_DETACHABLE, which means data migration has started and verifies the node is being prepared for removal.
The node cannot be physically removed from the block until data migration completes.
Hypervisor caveats:
• Acropolis hypervisor – automatically performs live migrations of UVMs during node removal
• ESXi hypervisor – manual migration of UVMs
• Hyper-V hypervisor – manual migration of UVMs

How to Perform a Node Removal


How to Perform a Node Removal (cont.)
Node removal using Prism:
• In Prism under Hardware, open the Diagram view.
• Click the node in the diagram to remove, then click the X Remove Host link.

How to Perform a Node Removal (cont.)


How to Perform a Node Removal (cont.)
Node removal using Prism:
• In Prism under Hardware, open the Diagram view.
• Click the node to remove.
• Click the X Remove Host link.

How to Perform a Node Removal (cont.)


Node Removal Process
The node removal process prepares the node for physical removal from the block.
• All UVMs have to be migrated off the host (automatic with AHV).
  o Administrators will have to manually migrate UVMs for Hyper-V/vSphere.
• All data must be migrated to existing nodes in the cluster.
  o If the node is a Zookeeper node, that role must be migrated to another node.
  o Cassandra metadata must be migrated.
  o Curator oplog and extent store data must be migrated.
• Data migration will take the most time; it depends on the amount of data to be migrated.

Node Removal Process


Troubleshooting and Monitoring Node Removal
Verify the node removal process:
• In nCLI run the following commands:
  host list id="id of host"
  host get-remove-status
• These commands show that the node has been marked for removal.
  o No information on progress (the customer is waiting).
  o This status will show until data migration is complete.

Troubleshooting and Monitoring Node Removal


Troubleshooting and Monitoring Node Removal
• Monitor progress in Prism on the Tasks page.
• It shows a progress bar with percentage complete for the node and disks.

Troubleshooting and Monitoring Node Removal


Troubleshooting and Monitoring Node Removal
Before a node can be removed, migration of data has to take place:
• Cassandra data has to be migrated.
• If the node is a Zookeeper node, that functionality and data have to be migrated.
• Curator has to migrate the extent store and oplog.
The zeus_config_printer command allows insight into the migration of Cassandra, Zookeeper, and Curator during node removal.
Under node_list, find the node and check its properties:
• Node status: kToBeRemoved
• Node removal ACK in decimal notation for data migration details (OP codes)
During node removal these OP codes (the node removal ACK value) tell you what has migrated and what is still migrating during the removal process.
There are 4 OP codes (bits) to review in node removal: 0x1111.
• The values for these OP codes will be either "0" or "1":
  o 0 means the task is not completed.
  o 1 means the task has completed.

Troubleshooting and Monitoring Node Removal


Troubleshooting and Monitoring Node Removal
The node removal ACK is stored in decimal notation (use a calculator to convert it to hexadecimal).
Node removal ACK 0x1111, from left to right: P1 = Acropolis, P2 = Zookeeper, P3 = Cassandra, P4 = Curator.
• The first position (P1) after the 0x is for the Acropolis hypervisor.
  o It will always be set to 1 if the node is not running AHV; if it is running AHV, it is set to 0 until all UVMs are migrated.
  o AHV will migrate the UVMs automatically.
  o After all UVMs are migrated, the bit is set to 1.
• The second position (P2) is for Zookeeper.
  o It will always be set to 1 if the node is not a Zookeeper node, and 0 if it is a Zookeeper node.
  o Zookeeper runs on either three or five nodes depending on the FT level of the cluster (FT1=RF2 or FT2=RF3). Note: A minimum of 5 nodes is required for FT2.
  o After all Zookeeper data is migrated, the bit is set to 1.
• The third position (P3) is for Cassandra.
  o It always starts at 0; Cassandra runs on all nodes of the cluster.
  o After migration of the Cassandra metadata, the bit is set to 1.
• The fourth position (P4) is for Curator.
  o It always starts at 0.
  o After extent store and oplog migration, the bit is set to 1.
• When all OP codes (bits) are set to 1, the node is no longer in the cluster and can be removed, joined to another cluster, etc. (A worked conversion example follows.)
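A quick sketch of the conversion, using the node_removal_ack value from the sample output on the next slide (4352); printf is assumed to be available in the CVM's bash shell:

# Convert the decimal node_removal_ack value to hexadecimal
printf '0x%x\n' 4352     # prints 0x1100

# Reading 0x1100 left to right (P1 P2 P3 P4):
#   P1 Acropolis = 1  -> UVM migration done (or not an AHV node)
#   P2 Zookeeper = 1  -> Zookeeper migrated (or not a Zookeeper node)
#   P3 Cassandra = 0  -> metadata migration still in progress
#   P4 Curator   = 0  -> extent store / oplog migration still in progress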
Troubleshooting and Monitoring Node Removal
Sample output from zeus_config_printer (you can pipe the output to less, for example: zeus_config_printer | less).
The node_removal_ack value is in decimal notation and must be converted to hexadecimal notation.
node_list {
  service_vm_id: 8
  service_vm_external_ip: "10.63.4.70"
  node_status: kToBeRemoved
  node_removal_ack: 4352
  hypervisor_key: "10.63.5.70"
  management_server_name: "10.63.5.70"
  hypervisor {
    address_list: "10.63.5.70"
  }
  ipmi {
    address_list: "10.63.2.70"
  }
  uuid: "69469c2f-5dd6-423f-ad21-6c193c4a07f7"
  rackable_unit_id: 24
  node_position: 3
  cassandra_status: kDataDeletedNowInPreLimbo
  software_version: "el6-release-euphrates-5.0.1-stable-c8b72a2e1c5fc48e6a2df45aa8ecbb09f02a77cb"
  node_serial: "ZM162S035393"
  hardware_config: "/appliance/physical/hardware_configs/18844811b901cf8b6d5e1df3c29623e6"
  management_server_id: 21
  cassandra_status_history {
    cassandra_status: kToBeDetachedFromRing
    state_change_source: kNodeRemoval
    svm_id_source_of_status_change: 7
    cassandra_status_change_timestamp: 1487971849018960

Troubleshooting and Monitoring Node Removal


Troubleshooting and Monitoring Node Removal
Curator will create jobs (scans) to migrate data from the disks to existing nodes in the cluster.
View the Curator jobs to view these scans: http://controllervm_ip:2010
• SSH to a Controller VM (PuTTY).
• Run the following command: links
• Type in the URL http://0:2010 (the 0 indicates the local CVM).
• On the Curator status page, find the Master.
• Click the Curator Master IP address to view the Curator jobs.
• Review the Curator jobs and scans with reason "ToRemove" for more detail (see the example below).
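For reference, the two steps can be combined into a single command from the CVM; this is a sketch and assumes the text-mode links browser is available on the CVM:

# Open the local Curator status page directly
links http://0:2010

# Or dump the page non-interactively to find the Master
links -dump http://0:2010 | grep -i master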

Troubleshooting and Monitoring Node Removal


Troubleshooting and Monitoring Node Removal
• The Curator status page will show which CVM is the Master.
• You have to connect to the Master to view the scan jobs for node removal.

Troubleshooting and Monitoring Node Removal


Troubleshooting and Monitoring Node Removal
Scroll down and review the jobs with reason "ToRemove".

Troubleshooting and Monitoring Node Removal


Disk Removal

Disk Removal
Describe and Perform a Disk Removal on a Node
• Overview of disk removal
• How to perform a disk removal in a Nutanix cluster
• Disk removal process
• Troubleshooting and monitoring disk removal

Describe and Perform a Disk Removal on a Node


Overview of the Disk Removal
• Each Nutanix node in the cluster has its own associated disks to store data.
• Nodes can contain hard disk drives (HDD), solid state drives (SSD), or both.
  o Hybrid nodes will have both, for example: 2 SSDs and 4 HDDs.
  o All-flash nodes will have just SSD drives.
• Nodes can sustain disk failures and remain operational.
• For containers with RF2 (2 copies of all data) you can lose 1 disk per node and the node will remain operational.
  o Requires 3 nodes minimum.
• For containers with RF3 (3 copies of all data) you can lose 2 disks per node and the node will remain operational.
  o Requires 5 nodes minimum.
• Before a disk can be removed from a node the data has to be migrated.
  o If the disk has failed then of course there will be no data migration.
  o You still have to perform a disk removal for failed drives before you can physically remove them.
• Disks that fail need to be removed and replaced immediately!

Overview of the Disk Removal


Overview of the Disk Removal (cont.)
Types of drive failures and their impact:
• Boot drive failure
  o Each Controller VM boots from the SATA-SSD.
    – The Controller VM will eventually fail.
    – The hypervisor does not use the boot drive.
    – Guest VMs continue to run.
• Metadata drive failure (always SSD)
  o The metadata drive holds the oplog.
  o Used as a persistent data tier.
  o Used to store the Cassandra database.
• Data drive failure
  o Cold-tier data is stored on the HDDs.
These failed drives need to be removed and replaced using disk removal.
Disk removal can be performed in nCLI or Prism Element.

Overview of the Disk Removal (cont.)


How to Perform a Disk Removal in a Nutanix Cluster
nCLI method – in nCLI run:
• disk list – to obtain the disk "ID" (for example, 0005464c-247c-31ea-0000-00000000b7a0::415025)
• disk remove-start id="disk ID" – using the disk ID from the disk list output
• disk get-remove-status – to verify data migration

How to Perform a Disk Removal in a Nutanix Cluster


How to Perform a Disk Removal in a Nutanix Cluster
• In Prism under Hardware, open the Diagram view.
• Click the disk in the diagram to remove, then click the X Remove Disk link.

How to Perform a Disk Removal in a Nutanix Cluster


Troubleshooting and Monitoring

Troubleshooting and Monitoring


Troubleshooting and Monitoring
Before a disk can be removed, migration of data has to take place:
• The HDD's extent group (cold-tier) data must be migrated.
• The SSD's Cassandra metadata and oplog must be migrated.
  o Asterix stores metadata on both SSDs.
    – Removal of one SSD only has to migrate half the data.
The zeus_config_printer command allows insight into the migration of HDD/SSD data.
Under disk_list, find the disk and check its properties:
• to_remove: true
• data_migration_status: in decimal notation, for data migration details (OP codes)
During disk removal these OP codes (the data_migration_status value) tell you what has migrated and what is still migrating during the disk removal process.
There are 3 OP codes (bits) to review in disk removal: 0x111.
• The values for these OP codes will be either 0 or 1:
  o 0 means the task is not completed.
  o 1 means the task has completed.

Troubleshooting and Monitoring


Troubleshooting and Monitoring Disk
Removal
data_migration_status is reported in decimal notation (use a calculator to convert it to hexadecimal).

data_migration_status 0x111 – position P1 = Cassandra metadata, P2 = oplog, P3 = Extent Store.


From left to right:
• The first position (P1) after the 0x is for Cassandra metadata.
  o It is always set to 1 for an HDD.
  o After all Cassandra metadata is migrated, the bit is set to 1.
• The second position (P2) is for the oplog.
  o It is always set to 1 for an HDD.
  o After all oplog data is migrated, the bit is set to 1.
• The third position (P3) is for the Extent Store.
  o It is initially set to 0.
  o After migration of the Extent Store data, it is set to 1.
• When all OP codes (bits) are set to 1, the disk has migrated all of its data and can now be physically removed.
30

Troubleshooting and Monitoring Disk Removal
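• As a quick check, the decimal value can be converted to hexadecimal straight from the CVM shell instead of a calculator. For example, assuming zeus_config_printer reported data_migration_status: 273 (an illustrative value), the conversion below shows 0x111 – all three bits are set, so the disk can be physically removed:

nutanix@cvm:~$ printf '0x%x\n' 273
0x111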


Troubleshooting and Monitoring
Sample output from zeus_config_printer
You can pipe the output to less. EXAMPLE: zeus_config_printer | less
In the sample output (screenshot), the callouts highlight the following fields of the disk_list entry:
• Marked for removal (to_remove: true)
• Data migration status in decimal notation (data_migration_status)
• Contains metadata (true/false)
• OPLOG
Notice the next disk in the disk_list:
• This drive is an HDD (“DAS-SATA”).
• There is no metadata or OPLOG data on it.
31

Troubleshooting and Monitoring
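• Because the screenshot cannot be reproduced in text, the following is an abridged, illustrative sketch of a disk_list entry as shown by zeus_config_printer (field names and values are assumptions and will vary by AOS version and cluster):

disk_list {
  disk_id: 415025
  storage_tier: "SSD-SATA"
  contains_metadata: true
  to_remove: true
  data_migration_status: 273
}

• To pull out just the removal-related fields, the output can also be filtered, for example: zeus_config_printer | egrep "disk_id|to_remove|data_migration_status"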


Labs

Module 5 Hardware Troubleshooting

32

Labs
• Module 5 Hardware Troubleshooting
Thank You!

33

Thank You!

Troubleshooting and Monitoring Node Removal


• Once you get logged on to the Support Portal, there are many Nutanix-related things
you can do including view documentation, open/view cases, or download software or
tools. A good suggestion is to explore this page when time permits.
Module 6
Networking
Nutanix Troubleshooting 5.x

Module 6 Networking
Course Agenda

1. Intro 7. AFS
2. Tools & Utilities 8. ABS
3. Services & Logs 9. DR
4. Foundation 10. AOS Upgrade
5. Hardware 11. Performance
6. Networking

Course Agenda
• This is the Networking module.
Objectives

After completing this module, you will be able to:


• Explain basic networking concepts.
• Gain an understanding and practice of IP Subnetting.
• Describe the functionality of Open vSwitch (OVS) in a
Nutanix environment.
• Troubleshoot OVS issues on an AHV host.
• Become familiar with Wireshark.

Objectives
After completing this module, you will be able to:
• Explain basic networking concepts.
• Gain an understanding and practice of IP Subnetting.
• Describe the functionality of Open vSwitch (OVS) in a Nutanix environment.
• Troubleshoot OVS issues on an AHV host.
• Become familiar with Wireshark.
Networking Basics

Networking Basics
Networking Overview

• A network is a set of connected devices.


• Devices connected to each other as part of a network have
the ability to share information with each other.
• Different types of networks include:
o LANs
o WANs
o MANs
o Wireless LANs/WANs
• A network topology is a detailed, graphical representation
of the devices, cables, and any other peripherals of a
network.
5

Networking Overview
OSI Model

Layer   Name                 Example Protocols


7       Application Layer    HTTP, FTP, DNS, SNMP, Telnet
6       Presentation Layer   SSL, TLS
5       Session Layer        NetBIOS, PPTP
4       Transport Layer      TCP, UDP
3       Network Layer        IP, ARP, ICMP, IPSec
2       Data Link Layer      PPP, ATM, Ethernet
1       Physical Layer       Ethernet, USB, Bluetooth, IEEE 802.11
6

OSI Model
vLANs

• A Virtual LAN is a logical segmentation of a physical


network that can be segmented based on a number of
factors such as department requirements or internal groups.
• Logically segmenting a network into multiple vLANs limits
the broadcast domain communication to the hosts in their
subnet – a Layer 3 device is required to communicate
between vLANs.
• vLANs are identified by a number from 1 to 4094. Note that
some of the vLANs in this range are reserved for specific
technologies/protocols.

vLANs
vLAN Tagging

• In order to identify which vLAN a packet belongs to, a


vLAN Tag is used. The vLAN ID is inserted into the
Ethernet frame header (802.1Q tag).
• On a trunk interface, which is capable of carrying
traffic for multiple vLANs, the vLAN Tag allows
switches to identify where packets should be
transmitted.
• The native vLAN, vLAN 1 by default, is always sent as
untagged traffic – the contents of the frame are not
modified to include a vLAN ID.
8

vLAN Tagging
Subnetting

• Subnetting is the process of logically dividing an


IP network into 2 or more networks.
• The subnet mask is used to identify whether a
host is on the local or a remote subnet.
• The act of “borrowing bits” from the host range
and adding them to the subnet bits is used to
divide a network into smaller, logical networks.

Subnetting
Subnetting Example
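A short worked example (illustrative addressing): borrowing 2 bits from the host portion of 192.168.10.0/24 (mask 255.255.255.0) gives a /26 mask (255.255.255.192) and four subnets of 62 usable hosts each:

192.168.10.0/26   – hosts 192.168.10.1 to 192.168.10.62
192.168.10.64/26  – hosts 192.168.10.65 to 192.168.10.126
192.168.10.128/26 – hosts 192.168.10.129 to 192.168.10.190
192.168.10.192/26 – hosts 192.168.10.193 to 192.168.10.254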
Interface Speeds

• Within a Nutanix system, 1G or 10G interfaces are


leveraged to transmit traffic to other nodes/the
upstream network.
• It is important to understand the actual speed
achieved by these interfaces in order to achieve
optimal performance.
• 1 Gigabit Ethernet has a maximum transfer rate of 125
MegaBytes/second.
• 10 Gigabit Ethernet is the equivalent of 1250
MegaBytes/second.
11

Interface Speeds
ethtool

• The ethtool command can be used to obtain the


speed, as well as other characteristics and statistics, of
an interface.
• Running ethtool <interface_name> will display the
speed of an interface, as well as the duplex,
transceiver type, and other details.
• ethtool -S <interface_name> will display counter
statistics associated with the specified interface.
• ethtool -i <interface_name> will display the
installed driver on the NIC.
12

Ethtool
ethtool

13

Ethtool
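• The slide shows a screenshot of ethtool output; abridged output from an AHV host looks roughly like the following (illustrative values – interface names, speeds, and drivers will differ in your environment):

[root@ahv ~]# ethtool eth2
Settings for eth2:
        Speed: 10000Mb/s
        Duplex: Full
        Port: FIBRE
        Link detected: yes
[root@ahv ~]# ethtool -i eth2
driver: ixgbe
version: 4.0.1-k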
Networking Configuration Files

• Both the CVM and AHV host store


networking information in the
following file:
/etc/sysconfig/network-
scripts/ifcfg-<interface_name>
• Linux file editors can be used to
modify these files for the purposes
of changing IP addresses, subnet
masks, gateways, or any other
interface-related IP configuration.

14

Networking Configuration Files
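• A minimal, illustrative sketch of such a file (on an AHV host the management IP normally lives on the br0 interface, and the real file contains additional OVS-related keys; the addresses below are examples only):

[root@ahv ~]# cat /etc/sysconfig/network-scripts/ifcfg-br0
DEVICE=br0
ONBOOT=yes
BOOTPROTO=none
IPADDR=10.30.15.50
NETMASK=255.255.255.0
GATEWAY=10.30.15.1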


OVS Architecture

OVS Architecture
What is Open vSwitch (OVS)?

• Open vSwitch is a production-quality open source


software switch designed to be used as a switch in
virtualized server environments.
• Forwards traffic between VMs on the same host, and
between VMs and the physical network.
• Open vSwitch can run on any Linux-based
virtualization platform.
• Written primarily in platform-independent C.

16

What is Open vSwitch (OVS)?


Is Open vSwitch like VMware DVS/Cisco
1000V?

• Sort of…
o DVS and N1K are distributed virtual switches – a centralized
way to monitor the network state of VMs spread across
many hosts. Open vSwitch is not distributed – a separate
instance runs on each physical host.
• Open vSwitch includes tools such as ovs-ofctl and
ovs-vsctl that developers can script and extend to
provide distributed switch capabilities.
• Open vSwitch provides the virtual networking on the
AHV nodes in a Nutanix cluster.
17

Is Open vSwitch like VMware DVS/Cisco 1000V?


Main Components of OpenvSwitch

• ovs-vswitchd: A daemon that implements the switch, along


with a companion Linux kernel module for flow-based
switching.
• ovsdb-server: A lightweight database server that ovs-
vswitchd queries to obtain its configuration.
• ovs-dpctl: A tool for configuring the switch kernel module.
• ovs-vsctl: A utility for querying and updating the
configuration of ovs-vswitchd.
• ovs-appctl: A utility that sends commands to running
OpenvSwitch daemons.

18

Main Components of OpenvSwitch


OpenvSwitch Architecture

19

OpenvSwitch Architecture
OVS Basic Configuration

OVS Basic Configuration


Bridge Commands

• Create a new bridge:


o [root@NTNX-Block-1-A ~]# ovs-vsctl add-br br0

• Delete a bridge and all of its ports:


o [root@NTNX-Block-1-A ~]# ovs-vsctl del-br br0

• List all existing bridges:


o [root@NTNX-Block-1-A ~]# ovs-vsctl list-br
o br0

21

Bridge Commands
Port Commands

List all ports on a bridge:


• [root@NTNX-Block-1-A ~]# ovs-vsctl list-ports br0
• bond0
• br0-arp
• br0-dhcp
• Vnet0

Add a port to a bridge:


• [root@NTNX-Block-1-A ~]# ovs-vsctl add-port br0 eth0

Delete a port on a bridge:


• [root@NTNX-Block-1-A ~]# ovs-vsctl del-port br0 eth0

Add a port to a bridge as an access port:


• [root@NTNX-Block-1-A ~]# ovs-vsctl add-port br0 eth0 tag=9

22

Port Commands
Port Commands (cont’d)

Configure an already added port as an access port:


• [root@NTNX-Block-1-A ~]# ovs-vsctl set port
eth0 tag=9

Print out the bridge that contains a specific port:


• [root@NTNX-Block-1-A ~]# ovs-vsctl port-to-br bond0
• br0

23

Port Commands (cont’d)


List all interfaces within a bridge:
• [root@NTNX-Block-1-A ~]# ovs-vsctl list-ifaces br0
• br0-arp
• br0-dhcp
• eth0
• eth1
• eth2
• eth3
• Vnet0

Print out the bridge that contains a specific interface:


• [root@NTNX-Block-1-A ~]# ovs-vsctl iface-to-br eth0
• br0
• [root@NTNX-Block-1-A ~]# ovs-vsctl iface-to-br bond0
• ovs-vsctl: no interface named bond0

*Note: a port resides on the bridge; an interface is a network device attached to a port.

24
Interface Commands
OVS on Nutanix

26

OVS on Nutanix
• Each hypervisor hosts an OVS instance, and all OVS instances combine to form a
single logical switch.
OVS on Nutanix
Related Log Files

• /var/log/messages
• /var/log/openvswitch/ovs-vswitchd.log

28

Related Log Files


OVS versus VMware vSwitch

29

OVS versus VMware vSwitch


manage_ovs Script

• Nutanix provides a utility, manage_ovs, which is


installed on each CVM and can be used to manage the
Open vSwitch configuration on the AHV host. The
manage_ovs script is essentially a wrapper for ovs-
vsctl commands run on the AHV host in the
background.
• The manage_ovs utility was introduced in AOS 4.1.
• See the --helpshort output for details on usage.

30

manage_ovs Script
nutanix@cvm:~$ manage_ovs --helpshort

USAGE: manage_ovs [flags] <action>

Where <action> is one of the following:

show_bridges: Shows a list of the uplink bridges.


show_interfaces: Shows a list of host physical interfaces.
show_uplinks: Shows the current uplink configuration for the OVS bridge.
update_uplinks: Updates the uplink configuration for the OVS bridge.
enable_bridge_chain: Enables bridge chaining on the host.
disable_bridge_chain: Disables bridge chaining on the host.

The update_uplinks action requires the --interfaces flag, which indicates the
desired set of uplinks for the OVS bridge. The script will remove any existing
uplinks from the bridge, and replace them with the specified set of uplinks on
a single bonded port.

flags:
/usr/local/nutanix/cluster/bin/manage_ovs:
--bond_name: Bond name to use
(default: 'bond0')
--bridge_name: Openvswitch on which to operate
(default: '')
--[no]dry_run: Just print what would be done instead of doing it
(default: 'false')
--[no]enable_vlan_splinters: Enable vLAN splintering on uplink interfaces
(default: 'true')
--[no]force: Reconfigure the bridge even if the set of uplinks has not changed
(default: 'false')
-?,--[no]help: show this help
--[no]helpshort: show usage only for this module
--[no]helpxml: like --help, but generates XML output
--host: Host on which to operate
(default: '192.168.5.1')
--interfaces: Comma-delimited list of interfaces to configure as bridge uplinks, or a
keyword: all, 10g, 1g
--mtu: Maximum transmission unit
(an integer)
--num_arps: Number of gratuitous ARPs to send on the bridge interface after updating
uplinks
(default: '3')
(an integer)
--[no]prevent_network_loop: Enables network loop prevention when bridge chain is
enabled.
(default: 'false')
--[no]require_link: Require that at least one uplink has link status
(default: 'true')
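A common usage pattern, shown here with illustrative values, is to review the current uplink configuration and then keep only the 10G interfaces in the bond (the actions and flags used are those listed in the --helpshort output above):

nutanix@cvm:~$ manage_ovs show_uplinks
nutanix@cvm:~$ manage_ovs --bridge_name br0 --bond_name bond0 --interfaces 10g update_uplinks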
Packet Analysis Tools

Packet Analysis Tools


Packet Analysis Tools

• Wireshark
• tcpdump

32

Packet Analysis Tools


Wireshark – What Is It?

• Wireshark is a network packet analyzer that can


capture packets and display the associated packet
data in as much detail as possible.
• Wireshark is one of the best open source packet
analyzers available today.
• Wireshark provides the ability to drill down into
protocol-specific details of network traffic, all the way
down to the payload contents.

33

Wireshark – What Is It?


Wireshark – When Should I Use It?

• Wireshark’s capabilities can be leveraged to:


o Troubleshoot network problems.
o Examine security problems.
o Debug protocol implementations.
o Learn network protocol functionalities.

34

Wireshark – When Should I Use It?


Wireshark – Starting a Capture

When starting a capture in Wireshark, an interface


needs to be selected. Ensure that you are capturing
traffic on the interface on which you expect to receive
traffic.

35

Wireshark – Starting a Capture


Wireshark – Filtering Traffic

• After a capture is started in Wireshark, every packet


that is transmitted or received by the specified
interface will be displayed. This is cumbersome to
parse through; however, with the application of traffic
filters, only the data that applies to a certain protocol
or IP can be displayed.
• Traffic filters can be applied at the top of the capture
window. Traffic filters can be combined with the && or
|| operators, which specify AND or OR, respectively.

36

Wireshark – Filtering Traffic
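• A couple of illustrative display filters (the address and port are examples only):
o ip.addr == 10.30.15.47 && tcp.port == 445 – show only SMB traffic to or from that host.
o arp || icmp – show only ARP and ICMP packets.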


Wireshark – Filtering Traffic (cont’d)

37

Wireshark – Filtering Traffic (cont’d)


tcpdump

• tcpdump is a Linux traffic monitoring tool that can be


run from the command line of a Nutanix CVM or AHV
node.
• Like many other Linux commands, there are a variety
of flags and options available to fine-tune the
tcpdump command.

38

tcpdump
tcpdump – Starting a Capture

• A tcpdump capture can be started by running the


following command: sudo tcpdump -i <interface>
• By default, the captured traffic will be displayed to
standard out.
• In many cases, it is easier to save the monitored traffic
to a file and copy it off for later review within
Wireshark or another visual packet capture tool:
sudo tcpdump -i <interface> -w traffic.pcap

39

tcpdump – Starting a capture
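• For example, to capture only traffic between a CVM and one SMB client into a file that can be opened later in Wireshark (illustrative addresses; standard tcpdump capture-filter syntax):

nutanix@cvm:~$ sudo tcpdump -i eth0 -w /home/nutanix/traffic.pcap host 10.30.15.47 and port 445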


Labs

Labs
Labs

Module 6 Networking

41

Labs
Thank You!

42

Thank You!
Module 7
Acropolis File Services
Troubleshooting
Nutanix Troubleshooting 5.x

Module 7 Acropolis File Services


Course Agenda

1. Intro 7. AFS
2. Tools & NCC 8. ABS
3. Services & Logs 9. DR
4. Foundation 10. Upgrade
5. Hardware 11. Performance
6. Networking

Course Agenda
• This is the AFS module.
Objectives

After completing this module, you will be able to:


• Describe Acropolis File Services
• Discuss Acropolis File Services features and
requirements
• Provide an overview of Microsoft Active Directory
dependencies
• Describe how to implement Acropolis File Services
• Troubleshoot Acropolis File Services
• List logs of interest

Objectives

• In this Module you will learn about Acropolis File Services. We will discuss
Acropolis File Services Features and requirements. We will discuss how to setup
and implement Acropolis File Services. We will also be covering general
troubleshooting guidelines and techniques for Acropolis Files Services
including log file review.

• Acropolis File Services allows customers to implement Windows-based File


Services for SMB access to shares over the network. A File Server is created to
host the shares and take full advantage of the Acropolis Distributed Storage
Fabric (ADSF).

• We will have a quick discussion on Active Directory and its role and requirements
for Acropolis File Services. Acropolis Files Services depends on Microsoft Active
Directory for user authentication to provide secure file access. Discuss the role
DNS and NTP play in Active Directory and AFS file services.

• We will cover how to implement Acropolis File Services in the Prism interface.
We will talk about where to obtain the AFS software from the Nutanix Support
Portal at the following URL:
http://portal.nutanix.com under Downloads - Acropolis File Services (AFS).
o
• Please refer to the documentation and release notes for new features in the release,
installation instructions, and resolved/known issues.

• Upload the AFS software to the Cluster in Prism and launch the AFS Setup
Wizard on the File Server page. Go through the Setup Wizard and create an
Instance of AFS on the cluster.

• During the AFS Setup we will be examining the log files on the CVMs to review the
Setup process and the detailed steps. Learn how to use these CVM log files when
encountering AFS Setup issues. The AFS log files contain errors and warnings as
to why Setup is failing. We will learn how to look for errors in the AFS logfiles to
assist in troubleshooting the setup of AFS.

• After Setup, these same logfiles on the CVMs, and another set of log files on the
AFS UVMs, can be used in troubleshooting AFS issues. These logfiles can assist
to identify problems and resolve File Services issues. The CVM and FSVM
logfiles will provide in-depth troubleshooting information to help identify and solve
issues with Acropolis File Services.
Acropolis File Services (AFS)

Acropolis File Services (AFS)


W hat is Acropolis File Services?

What is Acropolis File Services?


• Acropolis File Services (AFS) allows the ability to create SMB file shares for Windows
desktops and servers running in a physical or virtual environment.
o Acropolis File Services is implemented through a set of clustered user virtual
machines running across separate nodes of the cluster.
o AFS gives you the ability to dynamically scale up and scale out as needed.
• A minimum of 3 user Virtual Machines are required to set up and implement AFS.
o The user VMs create an instance of a Windows File Server cluster for fault
tolerance and data redundancy (RF2 or RF3) using Acropolis Distributed
Storage Fabric (ADSF).
• Acropolis File Services AFS allows for the setup of multiple AFS instances on the
cluster.
o Each AFS instance is within the cluster.
o The FSVMs that make up an instance cannot span clusters.
• The Acropolis File Services are primarily used for SMB access to user home
directories and profiles.
o AFS eliminates the need for third-party file servers.
Acropolis File Services Features &
Requirements
High Availability
• Uses Acropolis Block Services
• VM failures handled by assuming different VM resources
Networking
• Internal Network
o Storage network
o FSVM to CVM for block access (iSCSI)
– External Data Services IP
• External Network
o Client access
o AD authentication
o DNS/NTP access
• FSVM IP Address Requirements
6

Acropolis File Services Features & Requirements


• AFS provides two levels of high availability. Stargate path failures through Acropolis
Block Services. VM failures by assuming different VM resources.

• Acropolis File Services uses two networks:


o External and
o Internal.

• The External network is for client access to the shares hosted on the FSVMs.
o The external network is also used by the FSVMs for Active Directory
authentication, NTP, and Name Server access.

• The Internal network provides the FSVMs with iSCSI Initiator access to the External
Data Services IP which leverages the high availability features of Acropolis Block
Services (ABS).

• AFS virtual machines use volume groups and disks for file server storage. A new
storage container is built named Nutanix_afsservername_ctr to store the disks in the
volume groups.
FSVM IP Address requirements:
• The External network requires one IP address per FSVM (N).
o The Internal network requires one IP address per FSVM plus one extra IP
address for internal communications (N+1).
• If there are 3 FSVMs N=3, 3 external IP addresses are required, 4 internal IP
addresses = N+1 are required (7 total). The IP addresses do not have to be
contiguous. The internal and external network can be the same or different vLANs. If
the two networks are on different vLANs make sure that the proper vLAN tags
are configured.
Acropolis File Services Features & Requirements

 Cluster Storage Container


• NutanixManagementShare
• Used by File Services for file storage

 File Shares
• Home Shares
o Default permissions – DA = Full access, DU = Read only
• General Purpose Shares
o Default permissions – DA = Full access, DU = Full access

 ABE – Access Based Enumeration


7

Acropolis File Services Features & Requirements


• AOS creates a new storage container named NutanixManagementShare. The
storage container is used by AFS for file storage. Do not delete this storage
container, even if AFS is not installed.

• File Shares are folders that can be accessed over the network. In AFS there are
two types of shares:
o Home Shares and
o General Purpose Shares.

Home Shares
• Home shares are a repository for a user’s personal files. By default a home share
is created for each file server during setup. The share is distributed at the top level
directories within the home folder share.

• For example, in the home share when creating directories for each user user1, user2,
user3 and so on they are automatically distributed across the file server virtual
machines hosting the file server. The User1 directory would be hosted on FSVM1,
User2 on FSVM2, and user3 FSVM3... DFS client referrals built into Windows
machines will connect the user to the hosting file server virtual machine.
Home Share Default Permissions:
• Domain administrator: Full access
• Domain User: Read only
• Creator Owner: Full access (inherited only)

General Purpose Shares


• A general purpose share is a repository for files to share with a group of users.
General purpose shares do not get distributed across multiple virtual machines of the
file server. General purpose shares are distributed at the share level.

• For example, when creating a share on the file server named share1, by default it will
be a general purpose share and will be placed on only one FSVM. When the next
share is created share2, this share will be placed on another FSVM. All the folders
and files created within the share are stored on only one FSVM.

• General Purpose default permissions:


• Domain administrator: Full access
• Domain User: Full access
• Creator Owner: Full access (inherited only)

• Access to the shares are controlled by Share level Access Control Lists (ACLs) and
folder and file level Access Control Lists (ACLs). The Share level permissions are
basically roles with several permissions included. The Share level permissions
provide secure access to users and groups over the network. If the user does
not have share permissions, they will not be able to access the share. Share
permissions are Full Control, Read, or Change.

• In AFS, permissions can also be set at the folder and file level. This is referred to as
the local NTFS permissions. There is a more advanced set of Access Control List
(ACL) permissions at the folder or file level. It is recommended to use the folder/file
NTFS permissions to control access. To set or modify the local NTFS permissions,
use the Security tab for the properties of the folder/file or the cacls.exe Windows
command line tool.

• Both Share level and folder/file level permissions have to be satisfied for access. If
the permissions defined are different between the two, permissions at the Share level
and permissions at the File/folder level, then the most restrictive of the two apply.

• Let’s say that at the Share level you have full control, but at the File level you have
read-only.
• What is the most restrictive of the two:
o Share level permissions or File/folder level?
• In this case, the most restrictive of full control Share level and Read file level would
be read.
• Access-Based Enumeration (ABE) is a feature introduced in Windows Server 2003.
Prior to Windows Server 2003, when creating a share of an existing folder, anyone
who had permissions to the Share, even if it was read-only at the top level, could
actually view all the files in the shared folder regardless of the local user’s NTFS
permissions. This would allow users to see files in the Share without having local
NTFS permissions.

• Access-based enumeration, when enabled, honors the local NTFS permissions from
the Shares perspective. If the user has access to the share, with ABE enabled, they
must also have local NTFS permissions to the files and folders in the share. The user
needs minimum Read permissions (local NTFS permissions) to see any folders/files
in the share. In AFS, it supports enabling/disabling ABE at the share level.
Acropolis File Services Features &
Requirements

Self Service Restore Snapshot Schedules
o Default: Hourly – 24, Nightly – 7, Weekly – 4, Monthly – 3


The schedule uses crontab – view it for general troubleshooting
• sudo crontab -l -u root
• Crontab schedules can be viewed from any FSVM
o Any FSVM for the default “Home” share
o The owner FSVM for “general purpose” shares
– afs -share sharename get_owner
File Server Leader
• afs get_leader command
Log entries for auto snapshot troubleshooting will be present in the /home/log/messages
file on the FSVM where the share is owned
• afs -share sharename get_owner
8

Acropolis File Services Features & Requirements


• AFS also provides snapshots for file-based recovery. The feature is called Self
Service Restore. The Self Service Restore leverages the Windows Previous
Versions Client to restore individual files from the file server. The end user or help
desk can perform the Windows Previous Versions restore.

• In the AOS 5.1 release, the schedules can now be configured for hourly, nightly,
weekly, and monthly with different retentions for each. Only one occurrence of each
schedule type is allowed per file server. The snapshot scheduling is for the whole file
server all shares. You cannot have different snapshot schedules at the Share level.
• The default schedule starting in AOS 5.1 is:
o 24 hourly
o 7 Daily
o 4 Weekly
o 3 monthly

• Acropolis File Services snapshots are scheduled from the AFS Leader using crontab.
To troubleshoot we can look at the crontab file on the FSVMs to check the schedule
that is stored in the file /var/spool/cron/root. Do not attempt to modify the files in the
cron directory. Use the crontab command to view.
• The following command afs get_leader will get the leader FSVM. SSH to the AFS
Leader FSVM and review the crontab schedules.
o Note: You can view the crontab schedules from any FSVM.

• The following are crontab commands to review the snapshot schedules on the
FSVM.
o sudo crontab -u root –l (u=user, l=list) – The Schedules are listed for the root user.

Sample Output:
• 0 0 1 * * /usr/sbin/zfs-auto-snapshot --quiet --syslog --label=monthly --prefix=afs-auto-
snap --keep=3 --interval=1 //
• 0 * * * * /usr/sbin/zfs-auto-snapshot --quiet --syslog --label=hourly --prefix=afs-auto-
snap --keep=24 --interval=1 //
• 0 0 * * * /usr/sbin/zfs-auto-snapshot --quiet --syslog --label=daily --prefix=afs-auto-
snap --keep=7 --interval=1 //
• 0 0 * * Sun /usr/sbin/zfs-auto-snapshot --quiet --syslog --label=weekly --prefix=afs-
auto-snap --keep=4 --interval=1 //

• crontab schedules will exist on all FSVMs for the home folder default share because
that share is hosted on all FSVMs. So you will see crontab schedules on each FSVM
of the file server cluster. All other shares that are of type General Purpose will be
hosted on only one FSVM. To see the schedules for those shares, you must find the
owner FSVM of the particular share and then list the crontab schedules on that
FSVM.

• Once we know the owner of the share we can look in messages for these entries to
assist in troubleshooting snapshot scheduling and retention.

o sudo cat messages | grep afs-auto-snap



• The Messages file will indicate when snapshots are created and deleted. You can
grep for afs-auto-snap entries to review. This will give us insight into whether the
scheduled snapshots are being created and that proper deletion is occurring once the
older snapshots are past their retention.

• zfs list –t snapshot - This command will list the current snapshots that exist for a
given file server instance.
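• Putting the pieces together for a general purpose share (share1 is an illustrative name, and the afs command follows the syntax shown on the slide above): first find the owner FSVM, then confirm the schedule and recent snapshot activity on that FSVM:

nutanix@fsvm:~$ afs -share share1 get_owner
nutanix@fsvm:~$ sudo crontab -l -u root
nutanix@fsvm:~$ sudo grep afs-auto-snap /home/log/messages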
Acropolis File Services Features & Requirements

Supported Configurations
• Active Directory Domain Function Level
• Windows Server Editions Supported
Windows Client Support
SMB Versions
• 2.0
• 2.1
System Limits

Acropolis File Services Features & Requirements


Supported Configurations

Domain Function Level:
• Windows Server 2008 R2
• Windows Server 2012
• Windows Server 2012 R2
• Windows Server 2016
Note: AFS 2.0.2 and 2.1 also support Windows Server 2008.

Supported Domain Controllers:
• Windows Server 2008
• Windows Server 2008 R2
• Windows Server 2012
• Windows Server 2012 R2
• Windows Server 2016
Note: AFS 2.0.2 and 2.1 support Windows 2008.

Windows Client Support (supported Windows clients):
• Windows 7
• Windows 8
• Windows 8.1
• Windows 10

Supported Windows Server versions:
• Windows Server 2008
• Windows Server 2008 R2
• Windows Server 2012
• Windows Server 2012 R2
• Windows Server 2016
• Acropolis File Services implements versions 2.0 and 2.1 of the SMB (Server Message
Block) file sharing protocol. The SMB protocol can also be referred to as CIFS
(Common Internet File System).

SMB 2.0
• The SMB 2 protocol has a number of performance improvements over the former
SMB 1 implementation, including the following:

o General improvements to allow for better utilization of the network.


o Request compounding, which allows multiple SMB 2 requests to be sent as a
single network request.
o Larger reads and writes to make better use of faster networks, even those with
high latency.
o Caching of folder and file properties, where clients keep local copies of folders and
files.
o Durable handles to allow an SMB 2 connection to transparently reconnect to the
server in the event of a temporary disconnection, such as over a wireless
connection.
o Improved message signing with improved configuration and interoperability (HMAC
SHA-256 replaces MD5 as hashing algorithm).
o Improved scalability for file sharing (number of users, shares, and open files per
server greatly increased).
o Support for symbolic links.

SMB 2.1
• SMB 2.1 brings important performance enhancements to the protocol in Windows
Server 2008 R2 and Windows 7. These enhancements include the following:

o Client oplock leasing model.


o Large MTU support.
o Improved energy efficiency for client computers.
o Support for previous versions of SMB.

• System Limits (Configurable Item – Maximum Value):

o Number of connections per FSVM – 250 for 12 GB, 500 for 16 GB, 1000 for 24 GB,
1500 for 32 GB, 2000 for 40 GB, 2500 for 60 GB, or 4000 for 96 GB of FSVM memory
o Number of FSVMs – 16, or equal to the number of CVMs (choose the lowest number)
o Max RAM per FSVM – 96 GB
o Max vCPUs per FSVM – 12
o Data size per home share – 200 TB per FSVM
o Data size per general purpose share – 40 TB
o Number of characters for the share name – 80 characters
o Number of characters for the file server name – 15 characters
o Number of characters for the share description – 80 characters
o Throttle bandwidth limit – 2048 Mbps
o Data protection bandwidth limit – 2048 Mbps
o Max recovery time objective for Async DR – 60 minutes
Microsoft Active Directory

Microsoft Active Directory


Overview of Microsoft Active Directory Dependencies

Active Directory
• Security Provider
• Required by AFS
• Join Domain – Must be Domain Admin
Domains
• FQDN – learn.nutanix.local
• NetBIOS – Learn
Domain Controllers
• LDAP Server
• Authentication Servers – KDC
DNS Dependencies to Verify and Troubleshoot
• SRV Records
• nslookup
• _ldap._tcp.learn.nutanix.local
Verify the Correct DNS Server in /etc/resolv.conf
11

Overview of Microsoft Active Directory Dependencies


• Microsoft Active Directory is a distributed Directory Services feature. Microsoft Active
Directory provides authentication services on your network for Users, Groups,
Computers, and Services. Acropolis File Services relies on Active Directory
authentication to provide secure access to the file shares. The file server must be
joined to the Active Directory Domain.

• The Active Directory Services run on Microsoft Windows Servers. On the Microsoft
Windows servers, there is a role named AD DS (Active Directory Domain Services). This role
can be installed using Server Manager in the Roles and Features. After the role is
installed, the server can be promoted to be a Domain Controller for an Active
Directory Domain.

• Once the Windows Server is promoted to be a Domain Controller it can provide


authentication for users. Active Directory uses either Kerberos or NTLM security
protocols for user authentication. The Domain Controllers run the KDC (Key
Distribution Center) service to manage the Kerberos tickets used for authentication.

• Microsoft Active Directory uses the concept of Domains in its Directory Services
implementation. A Microsoft Active Directory Domain is a collection of the following
objects:
o Users
o Groups
o Computers
o Shares

• The Domain is also a security boundary for these objects. Within the domain, there
is one set of administrators who have domain-wide permissions to administer any of
the objects within the respective domain. There is a Global Group named Domain
Admins. Any user account that is placed in the specified Domain Admins group has
full rights to all objects in the Domain. Rights include creating, deleting, or
modifying domain objects such as users, groups, and computer accounts. Domain
Admins can also join computers to the domain. You must be a Domain Admin to join
the AFS file server to the domain.

• Active Directory Domain Names use DNS names as the naming convention. The
domain name is established when the first Windows server is promoted to be the first
domain controller in a new domain. The domain name is set by the administrator.
Learn.nutanix.local is an example of a domain name. There will also be a place to set
the backwards-compatible Netbios Domain Name for older Windows systems. By
default, the Netbios name is the left-most label of the domain name (the part before the
first period), but you can change it to anything if desired. If the
domain name is learn.nutanix.com, then by default the Netbios domain name will be
learn.

• Domain admins also have the permission to join computers to the domain which
creates a computer account object in the domain. Once a computer is joined to the
Domain, the computer can use the Active Directory Domain and Domain Controllers
for user authentication.

• Domain Controllers are the Windows Servers hosting the Active Directory
Domain database and provide the authentication services to the computers and
users. The Windows workstations servers and AFS use DNS services to locate
Domain Controllers for joining and authenticating with the Domain Controllers.
• Active Directory Domains use DNS names so that the Domain can be hosted in a
DNS forward lookup zone. DNS forward lookup zones provide name-to-IP address
mapping lookups. Reverse lookup zones provide IP address to name mapping
lookups.
• The Domain Controllers provider several services to the Domain. These services
include LDAP for directory queries, Kerberos Security protocol for authentication, and
kpasswd for user password changes. In the DNS forward lookup zone, to publish
services in the domain, the domain controllers via the netlogon service will perform
Dynamic DNS updates to the forward lookup zone file for their “A” host record and
“SRV” service records.
• The SRV records tell what services the Domain Controller provides. These SRV
records are used by the Windows workstations servers and AFS to find the Domain
Controllers for authentication services. The SRV records are also used by the
computers and AFS when joining the domain.

• nslookup can be used to test and troubleshoot SRV record lookup. From any FSVM
you can run the nslookup command to verify that the SRV records exist in the DNS
forward lookup zone. Ping the “A” record to verify name resolution and connectivity.

• ssh to the FSVM using Putty and run the following commands to test DNS:

• Nslookup – to go into interactive mode of nslookup


• set type=srv – to tell nslookup you want to query and test Service Records
• _ldap._tcp.dnsdomainname
Sample output:

• nutanix@NTNX-16SM13150152-B-CVM:10.30.15.48:~$ nslookup
• > set type=srv
• > _ldap._tcp.learn.nutanix.local
• Server: 10.30.15.91
• Address: 10.30.15.91#53

• _ldap._tcp.learn.nutanix.local service = 0 100 389 dc01.learn.nutanix.local.

• Service = 0 100 389 – 0 is the priority, 100 is the weight, and 389 is the LDAP default
port. dc01.learn.nutanix.local is the host providing the service.

• If DNS replies with answers, then the DNS settings are correct and the FSVM can
find the Domain Controller services in the forward lookup zone.

• If DNS does not reply, verify that the FSVM is configured and using the correct DNS
server. The FSVM stores the DNS settings in the /etc/resolv.conf file.
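• A quick check of those settings from the FSVM (illustrative output – the nameserver value should match your DNS server):

nutanix@fsvm:~$ cat /etc/resolv.conf
nameserver 10.30.15.91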
How to Implement AFS

How to Implement AFS


How to Implement Acropolis File Services

File Server dashboard (screenshot), showing the + File Server, + Share, and + Network Config buttons.

To set up Acropolis File Services, go to the File Server dashboard page.

Click the + File Server button to launch the File Server Setup wizard.
13

How to Implement Acropolis File Services


• In Prism click the drop-down list to the right of the cluster name and choose File
Server.
o This will take you to the AFS Dashboard.
• On the AFS Dashboard over to the right is a button +File Server.
• Click the +File Server button to launch the File Server Setup wizard.
o The AFS Setup wizard will guide you through several pages to configure and set
up an instance of AFS.
• After AFS setup, the AFS Dashboard also provides file Service statistics, alerts and
events for troubleshooting.

Acropolis File Service Statistics


• To view information for each file server, click on the file server name in the File
Server view.
o The statistics for that file server are displayed.
• The AFS statistics include the following information:
o Usage
o Shares
o Alerts
o Events
• The + Share button is where shares can be created deleted or edited from the AFS
Dashboard.

• If the cluster is running AHV as the hypervisor, you will also find the +Network
Config button that you can use to set up the vLAN tags (AHV calls them a
Network).
How to Implement Acropolis File Services

New File Server Pre-Check


14

How to Implement Acropolis File Services


• Click on the + File Server button to launch the New file Server Pre-Check wizard.

• At the top of the dialogue box is the License Warning: The current license does not
allow the use of this feature.

• In Nutanix, even if the license has not been installed you can still set up the feature.
o You will just receive a warning.
o To be in compliance with your Nutanix software license agreement Acropolis File
Services requires a Pro License.

• The New File Server Pre-Check dialogue box has a link to the AFS requirements
Learn more.

• Download or Upload AFS Software


• This link launches the Upgrade Software dialogue box to upload the AFS 2.1.0.1
software.
• You must first go to the Support Portal and download the AFS 2.1.0.1
software for uploading onto the cluster for setup.
• Add Data Services IP - A highly-available CVM IP address for iSCSI traffic
between FSVMs and CVM storage.
• FSVM’s act as iSCSI initiators communicating to this IP address for CVM
storage.

• After the AFS binaries have been uploaded to the cluster and an external data
services IP address has been configured you can click Continue to set up the file
server. You must click on the EULA blue link before clicking Continue!
How to Implement Acropolis File Services

“Basics” Page
• File Server Name
• File Server Size (minimum 1 TiB)
• Number of File Server VMs (minimum 3)
• Number of vCPUs per File Server VM (default 4)
• Memory per File Server VM (default 12 GiB)

15

How to Implement Acropolis File Services


• On the “Basics” tab of the AFS setup wizard the following settings are configured:

• The file server name which has to be unique for the cluster.
• The File server name is the computername/hostname that will be used to access
the file server shares.
To connect to an actual share on the file server, a Universal Naming
Convention (UNC) path has to be specified.
o UNC Format:
• \\servername\sharename
• …where servername will be the File Server Name specified on the
Basics tab.
• The sharename is created at the time the share is created.
• The sharename has to be unique as well.
• Aliases can also be defined in DNS for the file server name.
• The file server size is how much space the file server can use for storage for the
files.
o The file server size is minimum 1 TiB (tebibyte) or greater.
• This setting can be increased later if more storage space is required.

• The number of file server VMs to set up initially to support the clients:
o A minimum of 3 is required to create a clustered instance of the file server.
• There should be no more than one FSVM per node per instance.
o Acropolis File Services scale-out architecture has the ability to add more FSVMs
for scalability as the load demands increase or if new nodes are added to the
existing cluster.

• Number of vCPUs per file server.


o The default is 4.

• Memory per file server defaults to 12 GiB (gibibytes).

• The Acropolis Files Services Guide on the Support Portal under Software
Documentation has overall guidelines for memory and vCPU based on the
number of users.
How to Implement Acropolis File Services

Basics Page

16

How to Implement Acropolis File Services


How to Implement Acropolis File Services

“Client Network” Tab - External


• vLAN
• Gateway
• Netmask
• IP Addresses – do not have to be contiguous
• DNS Server
• NTP Server
17

How to Implement AFS - 5


• The Client Network page is where all the IP settings are configured for the external
network.
o Each FSVM requires 1 external IP address.
• The external IP address and network are used by the FSVMs to provide client access
to the shares, FSVM access to Active Directory, and DNS for user authentication.

• The DNS and NTP settings are critical.


o The FSVMs use DNS to find the Domain Controllers to join the Active Directory
Domain.
• FSVMs also use DNS to find domain controllers in the Active Directory Domain
to provide user authentication for share access.
o Active Directory uses either Kerberos or NTLM while performing user
authentication.
• The NTP settings are used to configure and sync the time on the FSVM’s with
an NTP server.
• If the time on the FSVM’s is not correct then Kerberos authentication will fail.

• The Kerberos security protocol is time sensitive.


o The Domain Controllers performing the Kerberos Authentication, AFS and clients
time cannot be more than 5 minutes apart for the Kerberos authentication to be
successful.
• The Domain Controllers for the Active Directory domain run SNTP (Simple
Network Time Protocol).
o It is recommended that the AFS file server and the clients use the domain controllers as their NTP
servers.
• Caution: If you use a third-party NTP server, make sure that it is within 5
minutes of the Active Directory domain controllers clocks or Kerberos
authentication will fail.
• The domain controllers can point to the third-party time server.

• Note: If the DNS or NTP settings are misconfigured, the domain join task will fail
during initial setup.
How to Implement Acropolis File Services

Client Network Page

18

How to Implement Acropolis File Services


How to Implement Acropolis File Services

“Storage Network” Tab – Internal


• vLAN
• Gateway
• Netmask
• IP Addresses – do not have to be contiguous

“Join AD Domain” Page


• Active Directory Domain – must use the FQDN
• Username – SamAccountName

“Summary” Page
• Protection Domain (default NTNX-fileservername)
19

How to Implement Acropolis File Services


• The Storage Network tab provides the inputs for the internal network. vLAN,
gateway, netmask, and IP Address settings for the FSVM’s communications to the
CVMs.

• The Join AD Domain tab is where you input the domain name, username and
password of a user account with domain admin rights.
o Use the SamAccountName.

• To review again:
o What service does the file server use to find the Domain Controllers to join the
domain and perform user authentication for file share access?

• The Summary page summarizes the settings and also by default builds a Protection
Domain.
o The Protection Domain is named NTNX-fileservername.
• Protection Domains are a feature in AOS to setup replication of VMs to a remote
cluster in a remote site for disaster recovery.
How to Implement Acropolis File Services

Storage Network Page

20

How to Implement Acropolis File Services


How to Implement Acropolis File Services

Join AD Domain Page


21

How to Implement Acropolis File Services


How to Implement Acropolis File Services

Summary Page
22

How to Implement Acropolis File Services


Troubleshooting Scenario

Let’s look at potential problems in setting up AFS and use the Tasks view in Prism to troubleshoot.

Let’s modify the Network Configuration settings to point to an incorrect DNS server in the Update File Server dialogue box (Update link).

23

Troubleshooting Scenario
• In the setup of the file server, let’s make some changes.
o Enter an incorrect address for the DNS servers.
o Now try to perform a domain join and see what happens.

• On the File Server tab, click the file server to highlight it and then over to the right
click on Update Link. This will bring up the Update File Server dialogue box. Click
the radio button to the left of Network Configuration and then click Continue.

• The Update File Server Network dialogue box is displayed with the Client Network
page to change any external IP addresses, Netmask, Default Gateway or modify
existing DNS/NTP settings. On this page change the DNS Servers to incorrect IP
addresses.

• Click Next to proceed to the Storage Network page. Do not change any settings on
this page. Click Save to save your changes.

• Now we will try to join the domain with the correct domain name and credentials.
Troubleshooting Scenario (cont.)

The Tasks page shows that the Join Domain task failed – DNS issue.

24

Troubleshooting Scenario (cont.)


• In the Prism interface on the Tasks page you can review the pending File Server
creation task for progress. If there are any failures, the task will give you a reason
for the failure. In this example, this is the DNS issue. The FSVMs name server
settings are set to an incorrect or non-existent DNS Server.

• The DNS server settings point to a non-existent or errant DNS server. This fails the
Setup. The File server virtual machines hosting the shares need the correct DNS
settings to contact the Domain Controllers. The file server virtual machines use DNS
to query for the SRV records of the Active Directory domain controllers for
domain join operations and user authentication for share access.

• Tasks in Prism are very detailed and a good starting point to look for errors and
troubleshoot AFS deployment issues.
Troubleshooting Scenario (cont.)

The Tasks page shows that the Join Domain task failed with a Kerberos error (time skew issue):

Kerberos authentication (kinit) failed
with error: kinit: Clock skew too great
while getting initial credentials

25

Troubleshooting Scenario (cont.)


• In the Prism interface on the Tasks page you can review the pending File Server
creation task for progress. If there are any failures, the task will give you a reason
for the failure. In this example, this is the time skew issue for Kerberos. The FSVM’s
clocks are more than five minutes apart from the Domain Controller.

• The NTP server settings point to a non-existent or errant NTP server. This Fails the
Setup. The File server virtual machines hosting the shares need to be time synched
with the Domain Controllers. The Kerberos authentication protocol requires the
FSVM’s clocks to be within five minutes of the Active Directory domain controllers
for successful authentication.

• Tasks in Prism are very detailed and a good starting point to look for errors and
troubleshoot AFS deployment issues.
Troubleshooting

Review the minerva logs on CVMs and NVMs for troubleshooting Setup


• NVM = FSVM
One CVM will become the Minerva Leader
$ minerva get_leader – (to find the minerva leader)
• /home/nutanix/data/logs/minerva_cvm.log on any CVM will
specify the Minerva Leader
– Review minerva_cvm.log on the Minerva Leader for Setup issues
» Incorrect NTP settings
» Incorrect DNS settings
» Incorrect domain name
The minerva_cvm.log on the Leader details the Setup process
• Please see Appendix B for a full log review
26

Troubleshooting
• During the initial setup of the file server, several tasks are created to carry out the
deployment of an instance of Acropolis File Services on a Nutanix Cluster.
o FSVMs must be created on multiple nodes of the cluster.

• A minerva (still uses the old name for AFS) Leader CVM is elected among all CVMs
in the cluster.
o The Leader is responsible for scheduling and running these tasks to set up and
deploy the file server instance.
• The fileserver logfile named minerva_cvm.log on the minerva leader will contain the
information for each task being performed during setup.
o This logfile will contain information and errors to assist in troubleshooting AFS
setup issues.

Furthermore, the minerva logfiles on both the CVMs and NVMs should be
consulted for help with troubleshooting Acropolis File Services.

Run the following command on any CVM to find the minerva leader:
• SSH to one of the CVMs
o $minerva get_leader

Sample output from the minerva get_leader command:


• nutanix@NTNX-16SM13150152-A-CVM:10.30.15.47:~$ minerva get_leader
• 2017-05-24 17:15:28 INFO zookeeper_session.py:102 minerva is attempting to
connect to Zookeeper
• 2017-05-24 17:15:28 INFO commands.py:355 Minerva Leader ip is 10.30.15.48
• 2017-05-24 17:15:28 INFO minerva:197 Success!
• nutanix@NTNX-16SM13150152-A-CVM:10.30.15.47:~$

• The file server log file on the leader CVM is where to look for setup issues.
o Location: /home/nutanix/data/logs
o Filename: minerva_cvm.log

There are a lot of prechecks built in during setup of the file server. Reviewing the
minerva_cvm.log file on the minerva leader will provide details for all the setup tasks
and pre checks. One pre check is to verify the IP addresses internal and external using
ping to see if already in use.

The minerva_cvm.log file will also provide insight into the Active Directory domain join
task failures. Incorrect NTP settings will fail the authentication to join the domain.
Incorrect DNS settings to find and use the Active directory Services. Typing in the
wrong domain name username or password credentials. If any of these settings are
misconfigured the domain join task will fail and the log should indicate which type of
failure.

Previously we looked in Prism for Setup issues. In the below example, here are the
correct settings that we can change to incorrect values to cause setup issues. We will
then examine different domain join errors in the minerva_cvm.log on the minerva leader.

Really an extension of the minerva_cvm.log on the CVMs, there is also a logfile named
minerva_nvm.log on each individual NVM with specific tasks from the minerva leader.
We will cover the NVM logfile in more detail later.
Troubleshooting

Review the file server logfiles on all CVMs


• Location: /home/nutanix/data/logs/
o minerva_cvm.out – look for errors.
o minerva_cvm.log – on the Leader this will be the most helpful
The minerva_cvm.log on the Leader will help troubleshoot
Setup and File Server Tasks:
• Updates to the File Server configuration
• Join and Leave Domain operations
• Share creation
• Quota policies
29

Troubleshooting
• Examining the minerva_cvm.log file on the leader can be very helpful in overall
troubleshooting, but you should review the logfile on all CVM’s for any other errors.
There will be a lot of duplication from the leader logfile to the other CVM logfiles in
the cluster, but it does not hurt to check.

• After setup, the minerva_cvm.log file on the minerva leader will be the first place to
look for errors. Any of the following will create file server tasks:
o Updates to the File Server Configuration
o Join and Leave Domain Operations
o Share Creation
o Quota Policies
• The minerva_cvm.log on the minerva leader will provide verbose logging to help
troubleshoot file server tasks. The file server tasks will actually be implemented on
the FSVM virtual machines that make up the clustered file server. On the FSVMs,
there is a local logfile named minerva_nvm.log to examine for file server task errors
specific to that FSVM.

Command to get minerva leader:


• $minerva get_leader
Troubleshooting

Review log files on the NVM/FSVM


• Location: /home/nutanix/data/logs/
• minerva_nvm.log
– Updates to the File Server configuration
– Join and Unjoin Domain operations
– Share creation
– Quota policies
• minerva_ha.log
– ha.py – (grep -i ha.py)
– iSCSI virtual targets – (grep -i iscsi)
– vDisks INFO
28

Troubleshooting
• Now that we have just covered the important logfiles to assist in troubleshooting on
the CVMs, there is another set of logfiles on the FSVMs. The logfiles on the local
FSVMs will help troubleshoot issues specific to that FSVM. The logfiles are in the
same location as on the CVMs.

• When troubleshooting the local FSVMs there are two logfiles that will provide the
most help.
o The minerva_nvm.log and minerva_ha.log logfiles will assist in most
troubleshooting cases when it comes to file server tasks, high availability, iscsi
connectivity to volume groups, and vDisks.

minerva_nvm.log
• File server tasks are created on the minerva leader then implemented on the FSVM
virtual machines. For file server tasks being run on local FSVMs, review the
minerva_nvm.log to assist in troubleshooting. Look for errors to troubleshoot issues.
Any updates to the configuration including network settings changes, share
creation/deletion, domain join and unjoin operations, and Quota Policies are logged.

minerva_ha.log
• The minerva_ha.log logfile will help to troubleshoot the overall functions on the
FSVMs.
• The FSVMs are configured in a cluster for HA (High Availability). This provides fault
tolerance to the shares in case of an FSVM virtual machine failure. The file server
virtual machines use HA.py to redirect storage access from the failed FSVM hosting
the share’s vDisks to an existing healthy FSVM virtual machine. iSCSI is the protocol
being used to access the storage (vDisks) on the node of the failed FSVM.

• Ha.py events will be logged on all FSVMs. They also will indicate whether HA is
enabled on the FSVM.

• FSVMs use iSCSI block storage access to vDisks in ADSF. The minerva_ha.log will
log any of these events as well. The FSVM uses the Linux iSCSI software initiator for
ADSF iSCSI access. This log file will contain all the iscsiadm commands to
troubleshoot the iSCSI software initiator in the FSVM.

• vDisk information is also included, as well as mount point information for ZFS.


Troubleshooting

Domain Join Issues

Use the Fully Qualified Domain Name


• learn.nutanix.local
• Do not use the NetBIOS domain name
Can the FSVM locate the SRV Service Location Records in DNS?
• cat /etc/resolv.conf
• nslookup
• Check the DNS server for entries
Are the NTP server settings correct?
• The Kerberos security protocol is time sensitive
Must have Domain Admin credentials
• Required for Service Principal Name registration
AOS 5.1 allows Domain Join without Domain Admin credentials
• gflag
• Pre-stage the computer account in Active Directory
• Reset computer account password permissions
• Have to manually register SPNs using the setspn tool
29

Troubleshooting
• During initial deployment of Acropolis File Services, one of the requirements is to join
the File Server to the Active Directory Domain for authentication and access to the
File Server shares. The domain join process needs to be established before any
access will be provided to shares on the File Server. During Setup, if the domain join
process fails, the FSVMs will be up and running but no share access will exist.

• Here are some of the things to check when troubleshooting domain join issues
with Acropolis File Services:

• For the domain name make sure you are using the Fully Qualified Domain Name
and that it is the correct domain name for the Active Directory Domain. Do not use
the Netbios domain name.

• SSH into the FSVMs and verify DNS settings and verify Service Location record
lookup.
• cat /etc/resolv.conf to confirm DNS server settings.
• Use nslookup to check A records and SRV records.
• Check the DNS Server and verify correct “A” and “SRV” entries in the forward
lookup zone for the AD domain.
• Check the NTP settings on the File Server. If possible use the Domain Controllers for
your NTP servers.

• The account performing the join must be a member of the Domain Admins group. You can verify
this with the Active Directory Users and Computers MMC snap-in.

• AOS 5.1 can be configured to allow a non-Domain Admin to join the File Server to
the Domain. There is a gflag that can be set on the CVMs to allow for this in AOS
5.1. The File Server can be deployed without joining the Domain. Through Prism or
from NCLI, the Domain Join task can be done later and the user does not require
Domain Admin rights. There are a few prerequisites.

• gflag has to be set on CVMs to allow for non-Domain Admin permissions to join the
domain.

• The computer account for the File Server Name has to be manually created in the
Active Directory Domain prior to the domain join task.

• The User who will perform the Domain Join task needs permission on the
computer account in Active Directory to reset password.

• You have to manually register SPNs with the setspn command.
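
• As a hedged example, the DNS and SPN checks above can be run as follows. The domain learn.nutanix.local and file server name afs01 come from the examples in this module; the SRV record queried is the standard AD DC locator record, and the setspn SPN shown is illustrative only, not a definitive list of required SPNs:
o nutanix@FSVM$ cat /etc/resolv.conf
o nutanix@FSVM$ nslookup -type=SRV _ldap._tcp.dc._msdcs.learn.nutanix.local
o C:\> setspn -S cifs/afs01.learn.nutanix.local afs01
o C:\> setspn -L afs01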


Troubleshooting

30

Troubleshooting
• In the Join Domain dialogue box click Show Advanced Settings.

• Preferred Domain Controller allows you to configure which Domain Controller
will be preferred for authentication. If not selected, site awareness is used to
choose a Domain Controller for authentication. If configured, make sure to choose a
Domain Controller in close proximity to the file server virtual machines or you may
encounter authentication performance issues.

• Organizational Units are containers within the Active Directory domain to group
security principals for delegating administration tasks. They are also used to apply
Group Policy Settings. When the file server joins the Active Directory domain, a
computer account is created in the domain. By default the account is created in the
built-in Computers container. You can specify which organizational unit to use to
store the computer account instead.
• Hint: You can also use Active Directory Users and Computers to move the
computer account later.

• The Overwrite the existing AFS machine account (if present) check box will allow
the Join Domain even if the computer account exists in Active Directory. Instead of
checking this option, you can manually delete the computer account in Active
Directory.
Troubleshooting

Check FSVM DNS Settings


• cat /etc/resolv.conf
Verify Proper "A" host records in DNS
• AFS_Servername
• FSVMs
• Round Robin DNS Support
• DDNS issues
$afs get_dns_mappings
Output:
File server afs01 is joined to domain learn.nutanix.local
------ ADD THE FOLLOWING HOST <-> IP[s] ON THE DNS SERVER ------
afs01.learn.nutanix.local <-> 10.30.15.241 10.30.15.243 10.30.15.242
NTNX-afs01-1.learn.nutanix.local <-> 10.30.15.241
NTNX-afs01-3.learn.nutanix.local <-> 10.30.15.243
NTNX-afs01-2.learn.nutanix.local <-> 10.30.15.242
31

Troubleshooting
• Acropolis File Services relies on DNS for proper hostname resolution. Not only is
DNS required for Active Directory but it is very important that the AFS file servers
register IP Addresses in DNS for client hostname resolution.

• On any FSVM the following command can be run to get the correct Host to IP
address mappings that are required in DNS.
o afs get_dns_mappings

• The fully-qualified domain name will be the afs_servername.dns_domainname.

• The Hostname will point to each external IP address configured for each FSVM in the
file server cluster. In the example examine the following:

o afs01 is the file server name.


o Learn.nutanix.local is the domain name.
o afs01.learn.nutanix.local is the Fully Qualified domain name.

• Dynamic DNS (DDNS) updates can be unreliable; if the expected records are missing, add them manually on the DNS server (see the example below).
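
• A quick way to confirm the round-robin records is to resolve the file server name from a client. Using the example names and addresses above, the output should look roughly like this (illustrative output only):
o C:\> nslookup afs01.learn.nutanix.local
o Name: afs01.learn.nutanix.local
o Addresses: 10.30.15.241, 10.30.15.243, 10.30.15.242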


Troubleshooting

Troubleshooting Samba Issues


• Authentication Failures
o smbd
o winbindd is the Authentication Module
o Check these with the scli health-check Command on any FSVM
• Logon Failures
o scli smbcli global "log level" 10
o winbindd.log, log.wb-<DOMAIN>, <client-name/ip>.log and
smbd.log
• Authorization Issues
o Insufficient Share level permissions / NTFS permissions
o smbcacls to view NTFS permissions
32
• DFS Referral or DNS Issues

Troubleshooting
• Here is a list of possible failures (issues) that an administrator or end user may face
while accessing file services from SMB clients.
Authentication Failures:
• Authentication is the first and foremost thing that happens before accessing any
share on the file server. By default, file services are available to all domain users.
Trusted domain users also can access the services based on the trust configuration.

• Winbindd is the authentication module running on each FSVM that is
responsible for user authentication. It supports both Kerberos and NTLM
authentication methods.

• If the primary domain (the domain to which AFS is joined) user authentication fails,
there could be multiple reasons for that. We first might want to:
o Check the domain join status.
o Check the share access using machine account credentials works fine.
o Check the clock skew with the communicating domain controller (DC).
o Check smbd and winbindd status.
• You can simply run scli health-check on any of the FSVMs to validate the above
things. Here is the sample output of this command:
o nutanix@NTNX-10-30-15-247-A-FSVM:~$ scli health-check
o Cluster Health Status:
o Domain Status: PASSED
o No home shares found!
o Share Access: PASSED

o Node: 10.30.15.245 Health Status:


o Clock Skew (0 second(s)): PASSED
o smbd status: PASSED
o winbindd status: PASSED

o Node: 10.30.15.246 Health Status:


o Clock Skew (0 second(s)): PASSED
o smbd status: PASSED
o winbindd status: PASSED

o Node: 10.30.15.247 Health Status:


o Clock Skew (0 second(s)): PASSED
o smbd status: PASSED
o winbindd status: PASSED

Logon Failures:

• There could be multiple reasons for logon failures. Here are a few common causes.

• Invalid credentials: There is a good chance that the provided username or
password details are incorrect. Ask the user to confirm their credentials,
especially when other users are able to access the file server shares.
• Invalid realm or domain NetBIOS name: If the end user is trying to connect to a share
from a non-domain client, they need to specify the domain NetBIOS name (for
sAMAccountName format) or the realm name (for User Principal Name format).
• Example: sAMAccountName - DOMAIN_NB\user
UPN name - user@domain.com
• AD replication issues: When there are multiple DCs in the domain there is a good
chance that newly created user accounts might not have replicated to the DC (could
be site local) that our cluster is currently talking to. In this case, we might want to
check with the customer to see if these users are created very recently (within a day)
and their cross DC replication interval.
• On-site RODC: If the cluster is talking to an on-site Read Only Domain Controller (RODC) and
the machine account has not been added to the "Allowed list" of the Password Replication Policy,
the RODC will not be able to authenticate the users (over NTLM). The steps to add the
machine account to the list are already documented.

Things that we need to collect to debug authentication issues:
• Debug level 10 Samba logs: Enable debug level 10 logging with the sCLI command
scli smbcli global "log level" 10, then use the NCC log collector to fetch all the file
server logs. Make sure to set the log level back to zero once the required logs are
collected (see the example below). The log files of interest for authentication issues are:
o winbindd.log, log.wb-<DOMAIN>, <client-name/ip>.log, and smbd.log.
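
A minimal sketch of that workflow, using the commands referenced above (resetting the log level to 0 is assumed to restore the default; adjust if your environment documents a different value):
o nutanix@FSVM$ scli smbcli global "log level" 10
o (reproduce the authentication failure)
o nutanix@cvm$ ncc log_collector fileserver_logs
o nutanix@FSVM$ scli smbcli global "log level" 0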

Authorization issues:

• Access Denied errors: This is basically a post-authentication (or session setup)
issue where accessing share root is failing with access denied, or specific files or
folders are preventing the logged-in user accessing them.

• Insufficient share level permissions: If the share level permissions are not open to
everyone, check whether the specific user, or a group they belong to, has
permissions (at least read access).
• Share level permissions can be viewed in the share's Properties window on the client. There is
currently no tool to view them from the AFS cluster side; one may be added in a
future release.
• Insufficient NTACLs on share root: User or group must have access (at least read)
on share to successfully mount the share.
• NTACLs on the share root can be viewed on a Windows client from the mount
point properties.
• NTACLs for Home share root are stored only in insightsDB. Currently there is no
command tool (or CLI) to dump the ACLs for home share root. Otherwise, we can
use smbcacls to dump the ACLs for any other file system objects.

Syntax and examples:
• #sudo smbcacls //FSVM-external-IP/share "full path to share root or folder/file path" -Pr
o You can get the path from "sudo zfs list".

• Example: Getting NTACLs on General purpose share root.


o $ sudo smbcacls //10.5.147.139/dept1 "/zroot/shares/e0fd8f16-922a-44e8-9777-
e2866d9fe15a/:2f72de65-de29-4942-906d-144abc0e2218/dept1" -Pr
o REVISION:1
o CONTROL:SR|PD|SI|DI|DP
o OWNER:BUILTIN\Administrators
o GROUP:BUILTIN\Users
o ACL:Creator Owner:ALLOWED/OI|CI|IO/FULL
o ACL:BUILTIN\Administrators:ALLOWED/OI|CI/FULL
o ACL:BUILTIN\Users:ALLOWED/OI|CI/FULL

• Insufficient NTACLs on specific files or folders: If the user gets access
denied while performing file operations, it is most likely due to insufficient
permissions. Check with the administrator whether the user or the specified
group has permission to perform the specific operation.

• Permissions can be viewed from the Properties window of the specific file system objects
in a share, or on the backend we can use smbcacls to view the permissions.

• Example: Getting NTACLs for a TLD on home share.
o $ sudo smbcacls //10.5.147.139/home "/zroot/shares/e0fd8f16-922a-44e8-9777-
e2866d9fe15a/:638586b7-9999-493a-9ab2-905bc4e1181c/home/TLD4" -Pr
o REVISION:1
o CONTROL:SR|DI|DP
o OWNER:AUTOMATION_NB\administrator
o GROUP:AUTOMATION_NB\Domain Users
o ACL:AUTOMATION_NB\administrator:ALLOWED/I/FULL
o ACL:Creator Owner:ALLOWED/OI|CI|IO|I/FULL
o ACL:BUILTIN\Administrators:ALLOWED/OI|CI|I/FULL
o ACL:BUILTIN\Users:ALLOWED/OI|CI|I/READ

Things that we need to collect to debug authorization issues:
• Permissions on the object for which the user is getting access denied:
• Using icacls from a Windows client (see the example below), or
• Using smbcacls as discussed above.
• Samba debug level 10 logs.
• Client-side or server-side packet captures.
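
For example, from a Windows client the built-in icacls tool can dump the ACL of the failing object; the share name dept1 comes from the examples above, while the folder name is hypothetical:
o C:\> icacls \\afs01.learn.nutanix.local\dept1\reports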

DFS referral or DNS issues:


• DFS referrals are primarily used for client redirection. General purpose shares are
actually tied to one particular FSVM in the cluster. So share level referrals are used
to redirect the client to remote node if the client tried to connect to a share that is not
locally hosted. Or it could be the case that the client is unable to resolve the
hostnames with DNS servers that they are talking to.

• Lack of DFS support on client: Make sure the client is DFS-capable.
By default, all Windows clients (Windows 7 onwards) are DFS-capable, but this
capability can be disabled through a Windows registry setting. Make sure it
is not turned off.
• Registry path:
o HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Mup
• DNS resolver issues: If clients connect to the server using the hostname (like \\AFS-
server.domain.com\share1), then the referral node information also will be in host
FQDN format. So the client should be able to resolve the referred host FQDN from
the DNS server it connects to. If not, share access will fail. If the client doesn’t have
access to the DNS server, then it can access the file server share using the IP
address (like \\10.1.2.3\share1). The end user needs to make sure they can ping the
AFS hostname if they are trying to connect using hostname.

• Possible DNS Scavenging: Currently the DNS IPs are refreshed every 5 days. If
scavenging is enabled on the DNS server with an interval shorter than this, the DNS
server might have deleted the DNS mappings for AFS. In that case, the DNS IPs for
the cluster will need to be re-registered.

• To get the listing, we can run the command afs get_dns_mappings on any of the
FSVM nodes in cluster.

• TLD namespace operation issues:

• Due to client-side DFS limitations, certain TLD namespace operations will fail when
performed from SMB clients. They are:

• Rename: If the top level folder is hosted on a remote node, the client will not be
able to rename it.
• Recursive Delete: Currently all TLDs are exported as reparse points, so when the
client tries to delete a TLD it expects the folder to be empty. If not, the delete fails with
a directory not empty error. To delete a non-empty TLD, the user needs to delete the
data one level down first and then delete the actual TLD.
• ACL propagation from share root: This is due to the same DFS limitations. When
permissions are changed on the home share root while TLDs already exist, the
permissions made at the share root will not be propagated to all TLDs; they will only
be propagated to TLDs that are local to the node the client connected to.

• An MMC plugin is being developed to perform all these operations without any
issues. This is planned for a post-Asterix.1 release.

• Non-domain admin user restriction: Due to these limitations, by design non-domain
admin users do not have the ability to Rename TLDs, Delete TLDs, or Change
permissions on the home share root by default.

• If a non-domain administrator manages the home share and needs to perform the
above operations, use the following commands to grant access:

• #net sam addmem BUILTIN\\Administrators <member>
o Example: sudo net sam addmem BUILTIN\\Administrators DOMAIN_NB\\user1

• #sudo net sam listmem BUILTIN\\Administrators
o Verify that newly-added user or group is shown in the listing.

Miscellaneous issues:
• Reaching the connection limit: There is currently a limit on the maximum number of
SMB connections that can be supported. The limit changes based on the FSVM
memory configuration, starting at a minimum of 250 connections per node (12G)
and increasing to 500 (16G), 1000 (24G), 1500 (32G), 2000 (40G) and 4000 (>60G).
• A message is logged in smbd.log when the maximum limit is reached:
o Check for the message Maximum number of processes exceeded.
• Unable to write due to space limitations: If the total used space or the quota is
exceeded, further write operations are prevented.

• Folder visibility: If a user complains that they are unable to view some files or
folders that other users can see, it could be that ABE (Access Based Enumeration) is
enabled on the share and the user, or a group they belong to, does not have
READ access or has an explicit READ deny on those file system objects.

• In those cases, we need to check ACLs on those parent objects and see if there
are any denied permissions.
• You can run the scli share-info command to see the ABE status on all the shares.
• Here is an example that shows the ABE status.
o Share Name: dept1
o Share Type: DEFAULT
o ABE Enabled: DISABLED
o NVM Host: NTNX-mnrvatst-164525-1

• No SMB1 client support by default: Support for SMB1 clients is currently disabled.
Any clients using the SMB1 protocol will be denied access to the share. SMB1 can
be enabled using scli.

• The scli command to enable SMB1 is:
o #scli smbcli global “min protocol” NT1

• MAC and Linux client limitations: In the Asterix release, Mac and Linux clients are
not officially qualified, but users can still mount AFS shares from Mac or Linux clients.
These clients have some limitations when accessing Home shares.

• Mac Client limitations:
• From Mac clients, remote Top Level Directories (TLDs) cannot be created on
Home shares because the client does not understand folder DFS referrals during
the creation of these folders. However, Mac clients can enter and browse existing
TLDs.
• Sometimes Mac clients will not display all TLDs on a home share. This is
due to some TLDs sharing the same file ID; the Mac keeps only one TLD (among
those entries sharing the same file ID) and filters out the rest. This is being addressed
in the Asterix.1 release.

• Linux client limitations:
• Similar to Mac, Linux SMB2 clients have issues following folder-level DFS
referrals. Creating, deleting, or browsing remote Top Level Directories is not
feasible, whereas there is no issue accessing general purpose shares. Here is the
command to perform an SMB2 mount from Linux:

o #sudo mount -t cifs //<AFS-host FQDN>/<share-name> <local mount path> -o
domain=<domain netbios>,user=<AD user name>,pass=<password>,vers="2.1"

• Example: sudo mount -t cifs //osl-fileserver.corp.nutanix.com/minerva
/mnt/afs_minerva -o domain=CORP,user=<AD user
name>,pass=<password>,vers="2.1"
Troubleshooting

Useful Commands for Troubleshooting Acropolis File Services


CVM CLI:
• minerva get_leader
• minerva get_fileservers
• minerva nvmips
• minerva get_file_servers
• minerva force_fileserver_delete
• minerva force_fileserver_node_delete
CVM N CLI:
• fs list
FSVM CLI:
• afs get_dns_mappings
• afs get_fs_info
• afs get_leader
• afs --share=sharename get_owner_fsvm
• afs --share=sharename get_share_info
FSVM SAMBA CLI:
• scli health-check
• scli sync-clock
33

Troubleshooting
Logs of Interest

• CVM Logs
• Location: /home/nutanix/data/logs/
o minerva_cvm.log
• FSVM Logs
• Location: /home/nutanix/data/logs/
o minerva_ha.log
o minerva_nvm.log
• ncc log_collector fileserver_logs
34

Logs of Interest
Labs

Module 7
Acropolis File Services Troubleshooting

35

Labs
Thank You!

36

Thank You!
Module 8
Acropolis Block Services
Troubleshooting
Nutanix Troubleshooting 5.x

Module 8 Acropolis Block Services Troubleshooting


Course Agenda

1. Intro 7. AFS
2. Tools & Utilities 8. ABS
3. Services & Logs 9. DR
4. Foundation 10. AOS Upgrade
5. Hardware 11. Performance
6. Networking

Course Agenda
• This is the ABS module.
Objectives

After completing this module, you will be able to:


• Describe the difference between SAN and NAS Protocols
• Explain the Acropolis Block Services Feature
• Define SAN Technology Terms
• Explain the iSCSI Protocol and how it works
• Name the Supported Client Operating Systems
• Describe how ABS Load Balancing and Path Resiliency Work
• Explain How to Implement ABS on Windows
• Explain How to Implement ABS on Linux
• List Logs of Interest
• Discuss Key Points
• Perform Labs
3

Objectives
After completing this module, you will be able to:
• Describe the difference between SAN and NAS Protocols
• Explain the Acropolis Block Services Feature
• Define SAN Technology Terms
• Explain the iSCSI Protocol and how it works
• Name the Supported Client Operating Systems
• Describe how ABS Load Balancing and Path Resiliency Work
• Explain How to Implement ABS on Windows
• Explain How to Implement ABS on Linux
• List Logs of Interest
• Discuss Key Points
• Perform Labs
SAN and NAS



Describe the Difference Between SAN &
NAS Protocols

SAN – Storage Area Network


• FCP
• FCOE
• iSCSI
NAS – Network Attached Storage
• Protocols Used
o CIFS – Common Internet File System (SMB)
o NFS – Network File System (Not Supported)
5

Describe Difference Between SAN & NAS Protocols


• What is SAN or Storage Area Network?

• SAN provides block storage over a network to hosts. To provide block storage over
the network to hosts requires a transport or protocol to carry the data and a physical
network. When it comes to SAN access there are three protocols that can be used for
block access. We will have a quick look at the different SAN protocols.

• FCP or Fibre Channel Protocol was invented in the early 1990s as a protocol that runs
over high-speed networks to carry SCSI-3 protocol packets from initiators (hosts) to
targets (storage arrays). Fibre Channel Protocol is simply the vehicle that carries the
SCSI-3 protocol over the network to the storage. Nutanix clusters do not support
FCP.

• Fibre Channel Protocol uses FCP HBAs (host bus adapters) and Fibre Channel
switches to provide the physical network for the FCP protocol. The FCP protocol is
also lossless: buffer-to-buffer credits guarantee no packet loss.

• Fibre Channel over Ethernet (FCoE) encapsulates the FCP protocol into an
Ethernet frame, using Ethernet equipment as the network. FCoE allows FCP to
run over Ethernet networks. Quality of Service and pause frames have to be
implemented on the Ethernet network to provide a reliable lossless network. Cisco
Nexus 5k and 9k are two Ethernet switch examples that support FCOE. Nutanix
clusters do not support FCOE.

• iSCSI SAN protocol uses the TCP transport protocol of TCP/IP to carry the SCSI-3
packets over the Ethernet network. Hosts can use iSCSI hardware HBA’s or standard
network interface cards with an iSCSI software Initiator installed on the host.

• iSCSI uses TCP/IP as the transport. Unlike FCP, the underlying Ethernet network is
not lossless: if the receiving host gets more packets than it can handle it starts
dropping them, and the sender must detect the missing acknowledgements and
retransmit, which adds load to the network. This is the main difference between
iSCSI and FCP. Nutanix Clusters support the iSCSI protocol for block access.

• What is NAS or Network Attached Storage? NAS is a file sharing protocol that allows
access to files over the network. There are two NAS protocols: CIFS (Common
Internet File System), developed by Microsoft, and NFS (Network File System),
originally developed by Sun Microsystems (not supported on Nutanix Clusters).

• Both protocols, CIFS and NFS, use TCP/IP as the transport over the network. Today
Nutanix supports CIFS, also known as SMB (Server Message Block). Acropolis File
Services is the Nutanix feature that provides file share access for Windows clients.
Explain the Acropolis Block Services Feature
Acropolis Block Services
iSCSI SAN
• LUNs
External Data Services IP – Virtual Target
Volume Groups
• vDisks/Disks/LUNs
• LUN – Logical Unit Number
Use Cases Supported by ABS
• iSCSI for Microsoft Exchange Server
• Shared Storage for Windows Server Failover Clustering (WSFC)
• Bare Metal Hardware
6

Explain the Acropolis Block Services Feature


• The Acropolis Block Services feature was introduced in AOS 4.7. Acropolis Block
Services provides block storage to bare metal hosts or virtual machines running on
Nutanix clusters. Acropolis Block Services (ABS) uses a single external IP address
(cluster-wide) to act as a virtual target for block storage to all initiators. iSCSI client
support includes Windows and Linux.

• All initiators will use the External Data Services IP address for target discovery and
initial connectivity to the cluster for block services. ABS exposes a single IP address
to the cluster as the virtual target for iSCSI connectivity. This allows for seamless
node additions without disruption to client connectivity and no need for client
reconfigurations. ABS also provides automatic load balancing of
vDisks/disks/LUNS across CVMs in the cluster.

• The External Data Services IP acts as an iSCSI redirector and will dynamically
map the Initiator to one of the CVM’s external IP address for vDisk/disk/LUN (Logical
Unit Number) access on that node. The Initiator only needs to connect to the single
External Data Services IP address and redirection occurs behind the scenes with
ABS.
• ABS will also provide intelligent failover in the event the CVM on the node where
the vDisk/disk/LUN is currently being accessed goes down. ABS will redirect the
Initiator connection to surviving CVMs for vDisk/disk/LUN access without any
disruption. Redirection to surviving nodes is either immediate or involves a minimal
delay of up to 10 seconds.

• Acropolis Block Services uses the iSCSI SAN protocol to provide block access
over the network to Nutanix Clusters. iSCSI leverages and uses the TCP/IP
transport protocol over the network to carry the iSCSI traffic.

• Using ABS does not require the use of multipath software on the Initiator but is
compatible with existing clients using MPIO.

• Acropolis Block Services exposes ADSF storage using Volume Groups and
disks. You can think of a Volume Group as a grouping of vDisks/disks/LUNs
mapped to a particular host, or to multiple hosts when using Microsoft Windows
Server Failover Clustering; this mapping is called LUN mapping/masking.

• Use cases supported by Acropolis Block Services:


o ABS can be used by Microsoft Exchange Server for iSCSI storage access to
ADSF.
o Microsoft Windows Server Failover Clustering can use ABS for iSCSI-based
shared storage.
o ABS supports SCSI-3 persistent reservations for shared storage-based Windows
clusters, which are commonly used with Microsoft SQL Server and clustered file
servers.
o Shared storage for Oracle RAC environments.
o ABS can be used by virtual machines running on the Nutanix Cluster.
o ABS can also be used for any bare metal hardware running outside of the Nutanix
Cluster.
Explain the Acropolis Block Services Feature
Block Services Relationships

Explain the Acropolis Block Services Feature


• A Volume Group has to be created and configured with disks before a host can
access any disk with ABS. Each host will require a Volume Group to be created and
the appropriate disks added for storage requirements.

• The Volume Group also serves as an Access Control List. After the Volume Group
is created, the initiator has to be added to the Volume Group to gain access to any
disks. The initiator’s IQN (iSCSI Qualified Name, for example
iqn.1991-05.com.microsoft:server01) or IP address must be added to the Volume
Group to allow host access. This is called LUN masking and controls which hosts
see which disks on the Nutanix Cluster.
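
• The same Volume Group creation and client mapping can also be scripted from ACLI on a CVM. The following is a minimal sketch only: the Volume Group name server01-vg, the container name default, the disk size, and the example IQN are placeholders, and parameter names may vary slightly between AOS versions:
o <acropolis> vg.create server01-vg
o <acropolis> vg.disk_create server01-vg container=default create_size=100G
o <acropolis> vg.attach_external server01-vg iqn.1991-05.com.microsoft:server01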
Explain the Acropolis Block Services Feature

Explain the Acropolis Block Services Feature


• A Volume Group can contain multiple initiator hosts.
o When more than one Initiator is added to the Volume Group, the storage will be
shared.
• For clustered hosts sharing the same Volume Group, it is very important that the
hosts are using a cluster aware file system, such as:
o Microsoft Windows Failover Clustering Service
o Linux clustering, and
o Oracle RAC
• These are just a few examples of supported cluster aware file systems.
Define iSCSI Protocol and SAN Terminology
• SAN uses the Initiator and Target nomenclature.

• The Initiator is the host that is going to connect to the storage using a block
protocol. The only block protocol that is supported today on Nutanix Clusters is
iSCSI. The host can be a virtual machine or bare metal.

• Hardware iSCSI initiator HBAs (Host Bus Adapter) will have the iSCSI protocol built
into the Adapter. The hardware Initiator HBA will also have its own processor to
offload the iSCSI and TCP/IP cycles from the Host. Several Vendors manufacture
hardware iSCSI HBAs including Qlogic and Emulex. A hardware HBA at best can
offer benefits if the host processor is too busy. The hardware iSCSI HBA will
offload the iSCSI and TCP/IP processing from the host processor to the HBA.

• Software iSCSI Initiator uses a standard network interface card and adds the iSCSI
protocol intelligence thru software. All of the modern operating systems today have
iSCSI software Initiators built-in.
• The Target is the Nutanix Cluster. In ABS, the Target is actually a virtual IP
Address to the External Data Services IP. The External Data Services IP is used
for discovery and the initial connection point. ABS then performs iSCSI redirection
mapping to a particular CVM.

• In iSCSI SAN, before Initiators can gain access to disks in Volume Groups a point to
point connection needs to be established between the Initiator and the Target, or in
the case of ABS a virtual target. To discover the target, in the iSCSI software the
Initiator will be configured to discover ABS by the External Data Services IP. The
discovery happens over port 3260 then redirected to the external IP address of an
online CVM to host the vDisk on port 3205. These ports have to be open to access
Nutanix Clusters for ABS.

• Once the Initiator discovers the target by the IP address, the Initiator has to create a
session to the target and then a connection over TCP. In the case of ABS, the
session and connection will be to the single External Data Services IP.

• When troubleshooting initial target discovery and connectivity, there are several things
to verify (a connectivity sketch follows this list):
o Is the External Data Services IP address configured on the cluster?
o What is the IP address of the External Data Services IP and is that the address we
are using for discovery?
o Are the ports open between the initiator and target?
o Is the Target IP address reachable from the initiator?
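
• A quick connectivity sketch from a Linux initiator, assuming the example External Data Services IP used later in this module (10.30.15.240); nc is only one of several ways to test that port 3260 is reachable:
o client$ ping -c 3 10.30.15.240
o client$ nc -zv 10.30.15.240 3260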
Define the iSCSI Protocol and SAN Terminology

AOS Supports Two Methods for iSCSI Connectivity to Volume Groups
• iSCSI Initiators with MPIO
• iSCSI Initiators with Acropolis Block Services
o Introduced in AOS 4.7
o External Data Services IP – Single IP Address for Entire Cluster
o ABS – iSCSI redirector to CVM External IP Address
10

Define the iSCSI Protocol and SAN Terminology


• Prior to AOS 4.7, to provide block access to Nutanix Clusters, Initiators would
discover and connect to each CVM in the Cluster. For example, if the cluster
contained 4 nodes, there would be one CVM per node. On the host, each CVM will
have to be discovered by IP address. Once all the CVMs are discovered, then a
session and connection would have to be configured to each individual CVM IP
address.

• Multipath software had to be installed and configured on each host to provide the
load balancing and path resiliency to the Nutanix Cluster.

• Cluster changes such as node additions and removals required manual iSCSI
reconfigurations on the hosts.

• In AOS 4.7, the Acropolis Block Services feature was introduced to the Nutanix
Cluster. With ABS, there is a single External Data Services IP Address that
represents a virtual target for the entire cluster. Initiators only have to discover the
single External Data Services IP and ABS uses iSCSI redirection on the back end to
map Initiators to the CVMs. The iSCSI redirection is transparent to the host.
• ABS exposes a single virtual target IP address for clients. Node additions and
removals will not require a host reconfiguration. If new nodes are added to the
cluster, hosts still configure iSCSI to use the External Data Services IP address.
iSCSI redirection will see the new nodes and load balance new disks to the CVMs on
the new nodes.

• By default, any node in the cluster can provide Target services for block access.
Supported Client Operating Systems

Supported Clients

Note: Acropolis Block Services does not support exposing LUNs to ESXi clusters.

11

Supported Client Operating Systems


• These are some of the operating systems that are supported by Acropolis Block
Services.
How ABS Load Balancing & Path Resiliency Works

[Diagram: VM A with an iSCSI Initiator and Volume Group VG 1 (Disk 1, Disk 2, Disk 3) connects through the Data Services IP to a cluster of CVM 1, CVM 2, and CVM 3. Flow: 1. Logon Request to the Data Services IP; 2. Logon Redirection to CVM 1; 3. Log on to CVM 1, the Active (Preferred) CVM.]
12

How ABS Load Balancing & Path Resiliency Works


• Acropolis Block Services uses a single External Data Services IP address as the
discovery portal and initial connection point to the cluster for iSCSI access.
o The External Data Services IP address is owned by one CVM in the cluster.
• If the CVM that owns the data services address goes offline it will move to
another CVM in the cluster so that it will always be available.
• Acropolis Block Services virtualizes the target and iSCSI redirection for disk load
balancing and path resiliency.
o Starting with the AOS 4.7 release, the Nutanix cluster uses a heuristic to balance
client connectivity, via target ownership, evenly across CVMs in the cluster.
• All CVMs in the cluster are candidates to be a target even with newly-added
nodes.
• The Initiators gain access to the vDisk via the Volume Group.
o The iSCSI redirector (Data Services Address) will redirect the iSCSI session and
connection to the CVM hosting the disk.
• In the diagram:
o Step 1 The Initiator discovers the virtual target (Data Services Address).
o Step 2 After login, ABS will redirect the login to the CVM hosting the Volume
Group and disk.
• Prior to AOS 4.7, the Initiators had to discover and connect directly to each CVM for
disk load balancing and path resiliency.
o Cluster expansion would require client reconfigurations.
• In AOS 4.7 and later, the ABS feature was implemented that exposed a single IP
address for the cluster and then performs iSCSI redirection to the hosting CVM for
access to the disk.
o ABS also provides dynamic load balancing of disks across multiple CVMS for
scale out and path resiliency.
• With ABS, multipath software is no longer required on the hosts.
How ABS Load Balancing & Path Resiliency Works
Volume Group Virtual Targets

[Diagram: VM A's iSCSI Initiator connects through the Data Services IP to Volume Group VG 1, which is exported as multiple virtual targets: Disk 1 via VG1 VTA on CVM 1 (Active/Preferred), Disk 2 via VG1 VTB on CVM 2, and Disk 3 via VG1 VTC on CVM 3.]
13

ABS Load Balancing and Path Resiliency


• ABS implements the concept of Virtual Targets.
o A Virtual Target can be the whole Volume Group or a subset of it (any disk).
• The iSCSI Virtual Targets feature of ABS allows a Volume Group containing
multiple disks to export each disk to different virtual targets mapped to different
CVMs in the cluster.
• In the graphic diagram we have a Virtual Machine connected to ABS.
o A Volume Group is created for host access to the disks.
• The Volume Group has 3 disks defined.
o ABS uses iSCSI redirection to map each disk to a separate CVM using the virtual
target feature at the disk level.
• Disk1 gets redirected to CVM1 and a virtual target VG1\VTA is created on that
CVM.
• Disk2 gets redirected to CVM2 and a virtual target VG1\VTB is created on that
CVM.
• Disk3 gets redirected to CVM3 and a virtual target VG1\VTC is created on that
CVM.
• Creating Virtual Targets per disk allows for load balancing block services across all
the nodes in the cluster.
o As you add nodes to the cluster and create new disks in Volume Groups they will
spread out across the new nodes in the cluster or what we call “scale out.”
• By default a Volume Group is configured to have 32 Virtual Targets.
o How many Virtual Targets are used depends on how many disks are created in the
Volume Group.
• Each disk in the Volume Group will get mapped to a Virtual Target on a CVM.
• For example, if the Volume Group has 3 disks, then each disk will be mapped to its
own CVM virtual target.
o In this case, there will be 3 Virtual Targets, one per vDisk spread across three
CVMs.
• This will occur for up to a max of 32 Virtual Targets for the Volume Group.
o If the client has more than thirty two disks they will see a max of 32.
• The name of each virtual target is the Volume Group name plus the Virtual Target
number.
• Example: Volume Group Name absvg1 with two disks
o Two Virtual Targets Named:
• absvg1-7d64abbe-0e5c-49b6-97e2-449e6db08f54-tgt0
• absvg1-7d64abbe-0e5c-49b6-97e2-449e6db08f54-tgt1
How ABS Load Balancing & Path Resiliency Works
iSCSI Redirection on Failure

[Diagram: CVM 1, the preferred CVM hosting virtual target VG1 VTA, goes down. Flow: 1. Session Disconnect from CVM 1; 2. Logon Request from the Initiator to the Data Services IP; 3. Logon Redirection to CVM 3; 4. Log on to CVM 3, which now hosts VG1 VTA alongside VG1 VTC, while CVM 2 continues to host VG1 VTB.]
14

ABS Load Balancing and Path Resiliency – 3


• A CVM failure causes a disconnect of any active sessions from the Initiator to that
Virtual Target.
o After the disconnect, the Initiator will attempt a login to the Data Services IP and
iSCSI redirection to another surviving CVM in the cluster.
• In the graphic diagram:
o Step 1: CVM1 goes offline, which causes the iSCSI session to CVM1 to disconnect.
o Step 2: The Initiator sends a logon request to the Data Services IP.
o Step 3: iSCSI redirection creates a session to a new Virtual Target on CVM3.
• There is no disruption to the client.
• ABS uses the concept of Preferred CVM for the iSCSI target.
o This happens automatically as part of the load balancing process but can be
configured manually.
• We will discuss this later.
• If the preferred CVM defined for the Virtual Target goes offline, then path failover will
redirect to a healthy CVM.
o When the failed CVM comes back online, the Initiator may disconnect from the
Virtual Target and fail back to the original CVM.
How ABS Load Balancing & Path Resiliency Works

[Diagram: Several VMs sharing the same Volume Group in a cluster are all redirected to a common CVM.]

15

ABS Load Balancing and Path Resiliency – 4


• If the iSCSI target is shared, then the Shared Hosts will all be redirected to the same
CVM.
o The image shows 3 hosts VM A, VM B, and VM C sharing the same Volume
Group.
• All the hosts will be connected to the same CVM initially and on redirection in
case the current preferred CVM goes offline.
ABS on Windows – Create Volume Group

ABS on Windows
How to Implement ABS on Windows

1. Create a Volume Group

• Create Disks
• Map the Initiator to the Volume Group
o IQN iSCSI Qualified Name or
o IP Address
• iSCSI Software Initiator
2. Discover Target
3. Log in and Establish the iSCSI Session and Connection
4. Make the Bindings Persistent
5. Prepare the Disk
17

How to Implement ABS on Windows


• Windows supports both Hardware and Software initiators.
o Here are the high level steps to set up Acropolis Block Services on a Windows
Initiator:
• Step 1 Create the Volume Group and disks, then map the Initiator to the Volume
Group by IQN or IP address.
• Step 2 Discover the Target.
• Step 3 Log in to target portal and establish the iSCSI session and connection.
• Step 4 Make the Bindings Persistent.
• Step 5 Scan for disk and prepare with Disk Manager or Diskpart.
• Now let’s go over these steps in detail.
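
• The GUI steps covered next can also be scripted with the Windows iSCSI PowerShell cmdlets. This is a sketch only, not the procedure the slides use; the Data Services IP 10.30.15.240 is the example address used elsewhere in this module:
o PS C:\> Start-Service MSiSCSI
o PS C:\> Set-Service MSiSCSI -StartupType Automatic
o PS C:\> New-IscsiTargetPortal -TargetPortalAddress 10.30.15.240
o PS C:\> Get-IscsiTarget
o PS C:\> Get-IscsiTarget | Connect-IscsiTarget -IsPersistent $true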
ABS on Windows – Create Volume Group

Volume Group

Prism
Storage Page
Volume Group

18

ABS on Windows – Create Volume Group


• The Prism interface can be used to create Volume Groups on the Storage page.
o Click the +Volume Group link to launch the Create Volume Group dialogue box.
• Volume groups are required for Acropolis Block Services.
o Each host bare metal or virtual machine will have a Volume Group created with
disks.
• The Volume Group also controls which host will have access to the disks in
the Volume Group or LUN masking.
o The Initiator has to be mapped to the Volume Group for LUN access.
• On the host, the disks in the Volume Group show up as LUNs or Virtual Disks.
o The block protocol used to access the LUNS is iSCSI.
• iSCSI encapsulates the SCSI-3 protocol to carry it over the Ethernet network.
ABS on Windows – Create Volume Group

Create Volume Group Dialogue Box


• Name
• iSCSI Target Name Prefix (Defaults to the Volume Group Name)
• Description (optional)
• Storage
• VMs
• Access Settings
• Enable External Client Access
• CHAP Authentication
o Target Password
• +Add New Client
• Enable Flash Mode
19

ABS on Windows – Create Volume Group


• There are several fields to complete in the Create Volume Group dialogue box:
o Name: Name of the volume group (Best Practice name it after the Host)
o iSCSI Target Prefix: Optional.
• This will be a prefix added to the iSCSI virtual target.
• It uses the Initiator IQN as part of the Virtual Target by default.
o Description: Optional
o Storage: Disks to create in the Volume Group.
• Add new disk will ask for the Storage Container and size.
• All disks in ABS are thin-provisioned.
o VMs: Attach to existing VM on the Nutanix Cluster.
o Enable External client access: Check if you are whitelisting clients external to
the cluster.
o CHAP Authentication: To enable CHAP for login and target password, mutual
CHAP is also supported.
o +Add New Client: Add the hosts that will need access to the disks in the Volume
Group.
• Hosts can be added by their IP address or IQN in AOS 5.1.
o Enable Flash Mode: To have the disks pinned to the SSD hot tier.
ABS on Windows – Create Volume Group

Add New Disk - LUN Mapping Initiator to Volume Group

20

ABS on Windows – Create Volume Group


• The Add New Disk link will create the disks in the Volume Group.
o These disks will be presented to the initiators as LUNS using the iSCSI SAN
protocol.
• The Initiator needs to be mapped to the Volume Group before access to the disks
can occur.
o The +Add New Client button is for new initiators.
• The Initiator can be added by IQN or IP address.
o Notice the list of all existing initiators connected.
• Both the IQN and IP Address can be changed.
o These Changes would cause access and connection issues to existing initiators.
• Any changes to either the IQN or IP Addresses will require reconfiguration of
the Volume Group mappings.
ABS on Windows – Create Volume Group

For Each Disk You Can Pick Different Storage Containers
LUN Masking
Disks Are Thin-Provisioned
21

ABS on Windows – Create Volume Group


• The Add Disk dialogue box allows for Storage Container selection for the disk.
o Storage Containers have properties and features such as RF (Replication Factor),
Deduplication and Compression to name a few.
• Each container can have different settings and features enabled.
o The container in which the disk is stored decides which settings and features will
be applied to the data.
• Size is in GiB.
o The disk by default is thin-provisioned.
ABS on Windows – Create Volume Group

After the Host is Added by IP address or IQN, Make Sure the IP Address Box is Checked
Leave all other Initiators Unchecked

22

ABS on Windows – Create Volume Group


• In the Create Volume Group dialogue box you will see all existing initiators
connected to the Nutanix Cluster.
o Choose only the Initiator or initiators that you want to provide access to the disks in
the Volume Group.
• Leave all others unchecked.
• Each host will have its own Volume Group mapped with disks.
o Volume Groups control which hosts see which disks based on the Volume Group
mappings.
ABS on Windows – iSCSI Software Initiator

ABS on Windows – iSCSI Software Initiator


ABS on Windows – iSCSI Software Initiator

iSCSI Management Tool – Initiator iSCSI Qualified Name

24

ABS on Windows – iSCSI Software Initiator


• The Microsoft iSCSI Software Initiator is pre-installed with the operating system.
o In services you will see the Microsoft iSCSI Initiator Service.
• To configure the iSCSI software Initiator there is a management application.
o You can access the iSCSI administration tool by:
• Displaying the tile view of Administration tools or
• On the Tools menu in Server Manager.
• If the iSCSI service is not running the first time the iSCSI Management Application is
launched, the app will prompt to start the service.
o After the service successfully starts, the iSCSI properties dialogue box displays.
• The iSCSI Properties Dialogue box has several tabs.
o The last tab on the right named Configuration is where you can find the IQN of
the host.
• Each Initiator will get a unique IQN iSCSI Qualified Name.
- The IQN iSCSI Qualified Name can be modified but Nutanix recommends
that you leave the default as is.
• There is another tab named Discovery.
o This is the tab where you can specify the Target’s IP Address for discovery.
• Which IP address do we use for Discovery?
- The External Data Services IP Address.
• After the IP address is added and discovery is successful, there is another tab named
Targets.
o The Targets tab is where the iSCSI logon is performed, the iSCSI session is
created and the iSCSI connection is established.
• In the iSCSI configuration management tool there is a choice to Add the Target to
favorites.
o This option is checked by default; do not uncheck it.
• This will make the bindings persistent across reboots.
• Windows has a Disk Manager MMC snap-in or a command line tool diskpart to
prepare the disk with a volume, filesystem (NTFS) and mountpoint or drive letter.
ABS on Windows – Discover Target

• Enter the External Data Services IP.
• Make Sure the IP address is Correct – Verify with PING.
• Click Discover Portal.
25

ABS on Windows – Discover Target


• You can configure Discovery in the iSCSI Initiator Properties dialogue box on the
Discovery tab.
o Before an iSCSI session and connection can be established the Initiator has to
discover the target.
• There are two ways that iSCSI discovery can be performed.
• Manual Discovery uses the Target IP address.
• Acropolis Block Services uses the single External Data Services IP address
to perform initial discovery.
• iSNS iSCSI Name Server can be implemented for discovery.
o You probably will not see iSNS unless the iSCSI implementation is large, meaning
that you have hundreds or thousands of Initiators.
• iSNS is an application that can be run on Windows Unix or Linux machines.
• Manual Discovery Example:
o Click the Discover Portal to perform manual discovery of the target.
• The Discover Target Portal dialogue box displays.
o Type the Target IP address in the IP Address or DNS name field.
• Leave the Port field set to the default of 3260.
• When performing manual discovery, common issues include the IP address being
mistyped or incorrect.
o Also check to make sure the External Data Services IP address is configured on
the cluster.
• iSCSI uses TCP/IP as the transport so make sure all network configurations are
correct.
o Use ping to verify IP address connectivity from the Initiator to the Target.
o Make sure that the proper vLAN tags are applied.
• If the Volume Group is created after the Initiator discovery you will not see the Target
on the Targets tab of the iSCSI Initiator Properties dialogue box.
o Once the Volume Group is created with disks and the Initiator is mapped, the
target will then display.
• You may need to perform a refresh on the Targets tab.
ABS on Windows – Login to Target Portal
Connect to Target Dialogue Box

• Virtual Target
• Persistent Bindings Setting
• Leave Enable Multi-path Unchecked for ABS
Targets Tab – Connect

26

ABS on Windows – Login to Target Portal


• On the Targets tab is where the virtual target shows up and initially after discovery it
shows inactive.

• In the diagram notice the IQN of the virtual target:

o iqn.2010-06.com.nutanix:windows-90e7bb28-755c-4e02-833b-b786d634fe6d-tgt0

• The highlighted portion was the iSCSI Target Name Prefix setting from the Volume
Group.

• Highlight the IQN of the Target in the Discovered Targets windows. If the Target is
not showing up then check to see if a Volume Group with disks in it was created for
the Initiator and mapped properly. If not, do so before proceeding. Once the Volume
Group is created and has disks then click Refresh to rediscover the target.

• Highlight the Target and click Connect to log in and make the point-to-point iSCSI
connection to the virtual target.
• The Add this connection to the list of Favorite Targets checkbox should never be
unchecked. Checking this box makes the iSCSI bindings persistent. Persistent
bindings will automatically reconnect in the case of computer restarts.

• The Advanced button allows for selection of a specific IP address on the Initiator to
use for the iSCSI connection. This is useful when the initiator is configured with more
than one IP address, in the Advanced Settings dialogue box there is the ability to
manually select which IP address on the Initiator to use for the iSCSI connection.

• When the settings are complete click OK. The status now changes from Inactive to
Connected.

• The Properties button will show the session and connection information. Target
portal group information is also in the properties.

• At this point, now the Initiator is SAN-attached to the Nutanix Cluster and has access
to the disks in the Volume Group.
ABS on Windows – Prepare the Disk

[Screenshot: Disk Manager in the Computer Management MMC, showing the tree pane, the leaf pane, and the new Disk/LUN.]

27

Disk Manager in Computer Management MMC


• Disk Manager in the Computer Management console can be used to prepare the
disk.
• If the disk does not show up in Disk Manager right-click Disk Management in the
Tree and select Rescan Disks.
o If the disk does not show up, review the Volume Group Initiator mappings.
• Also review the iSCSI Connectivity.
• Right-click on Disk in the leaf pane and select Online.
• Right-click on Disk in the leaf pane a second time and select Initialize Disk.
o This creates a signature on the disk.
• Now a volume has to be created, formatted for NTFS, and mounted to a drive letter
or empty folder (mountpoint) before the disk can be used to store data.
o Right-click on the disk to create a Volume Format and mount to folder or drive
letter.
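
• As an alternative to Disk Manager, diskpart can prepare the new LUN from the command line. A minimal sketch only: the disk number, partition style, volume label, and drive letter below are assumptions; confirm the correct disk with list disk before selecting it.
o C:\> diskpart
o DISKPART> list disk
o DISKPART> select disk 1
o DISKPART> online disk
o DISKPART> attributes disk clear readonly
o DISKPART> convert gpt
o DISKPART> create partition primary
o DISKPART> format fs=ntfs label=NutanixLUN quick
o DISKPART> assign letter=E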
ABS on Linux

ABS on Linux
How to Implement ABS on Linux

Create a Volume Group

• Create Disks
• Map the Initiator to the Volume Group
o IQN iSCSI Qualified Name or
o IP Address
iSCSI Software Initiator
Discover Target
Login and Establish the iSCSI Session and Connection
Make the Bindings Persistent
Prepare the Disk
Mount to Filesystem
29

How to Implement ABS on Linux


• Linux supports both Hardware and Software initiators.
o Here are the high-level steps to set up Acropolis Block Services (ABS) on a Linux
Initiator:
• Step 1 Create the Volume Group and disks, map the Initiator to the Volume
Group by IQN or IP address.
• Step 2 Discover the Target.
• Step 3 Log in to the Target Portal and establish the iSCSI session and
connection.
• Step 4 Make the Bindings Persistent.
• Step 5 Scan for disks and prepare with fdisk.
• Now let’s go over these steps in more detail.
How to Implement ABS on Linux

Install the iSCSI Software Initiator Package
• #yum install iscsi-initiator-utils

Start the iscsid Service
• #service iscsid start or
  #systemctl start iscsid
Verify iscsid is Running
• #systemctl
How to Obtain the Initiator IQN
• # cat /etc/iscsi/initiatorname.iscsi
30

How to Implement ABS on Linux


• If the iSCSI Software Initiator package is not installed, it must be installed using
YUM (Yellowdog Updater, Modified).
• To install the package run the following command:
o #yum install iscsi-initiator-utils
• This installs the iscsiadm utility and the iSCSId daemon.
• The systemctl command shows the Linux services and their state.
• Sample output:
o iscsi-shutdown.service loaded active exited Logout off all iSCSI
sessions on shutdown
o iscsi.service loaded active exited Login and scanning of iSCSI
devices
o iscsid.service loaded active running Open-iSCSI
o kdump.service loaded active exited Crash recovery kernel
arming
o kmod-static-nodes.service loaded active exited Create list of required static
device nodes

How to obtain the IQN for the initiator.


• Sample output:
o # cat /etc/iscsi/initiatorname.iscsi
o InitiatorName=iqn.1994-05.com.redhat:e050c3a95ab
How to Implement ABS on Linux

Install the iSCSI Software Initiator Package
• #yum install iscsi-initiator-utils
Start the iscsid Service
• #service iscsid start or
  #systemctl start iscsid
Verify iscsid is Running
• #systemctl
How to Obtain the Initiator IQN
• # cat /etc/iscsi/initiatorname.iscsi

Discover the Target
• #iscsiadm -m discovery -t sendtargets -p 10.30.15.240
o -m = mode, -t = type, -p = portal
Log in and Establish the iSCSI Session and Connection
• #iscsiadm -m node -p 10.30.15.240 -l
o -p = portal, -l = login
To Verify the iSCSI Session:
• #iscsiadm -m session
31

How to Implement ABS on Linux


• How to obtain the IQN for the initiator.
• Sample output:
o # cat /etc/iscsi/initiatorname.iscsi
o InitiatorName=iqn.1994-05.com.redhat:e050c3a95ab
o Discover the Target using the iscsiadm command.
• Sample output:
o # iscsiadm -m discovery -t st -p 10.30.15.240
o 10.30.15.240:3260,1 iqn.2010-06.com.nutanix:centos7-1e1e80b9-f4ac-4186-8be0-
3d53b2d723b7-tgt0
o Log in to the Target.
• Sample output:
o # iscsiadm -m node -p 10.30.15.240 -l
o Logging in to [iface: default, target: iqn.2010-06.com.nutanix:centos7-1e1e80b9-
f4ac-4186-8be0-3d53b2d723b7-tgt0, portal: 10.30.15.240,3260] (multiple)
o Login to [iface: default, target: iqn.2010-06.com.nutanix:centos7-1e1e80b9-f4ac-
4186-8be0-3d53b2d723b7-tgt0, portal: 10.30.15.240,3260] successful.
How to Implement ABS on Linux

Prepare the Disk


To List the New Disk:
• #fdisk -l or
• #lsblk --scsi
Sample Output:
[root@Centos01 iscsi]# lsblk --scsi
NAME HCTL    TYPE VENDOR  MODEL        REV  TRAN
sda  1:0:0:0 disk NUTANIX VDISK        0
sdb  4:0:0:0 disk NUTANIX VDISK        0    iscsi
sr0  0:0:0:0 rom  QEMU    QEMU DVD-ROM 2.3. ata

Disk sdb shows iSCSI as the Transport


32

How to Implement ABS on Linux


• Prepare the disk.
• To list the new Disk:
• Sample output:
o #fdisk -l
o Disk /dev/sda: 21.5 GB, 21474836480 bytes, 41943040 sectors
o Units = sectors of 1 * 512 = 512 bytes
o Sector size (logical/physical): 512 bytes / 4096 bytes
o I/O size (minimum/optimal): 4096 bytes / 1048576 bytes
o Disk label type: dos
o Disk identifier: 0x000a1676

o Device Boot Start End Blocks Id System


o /dev/sda1 * 2048 1026047 512000 83 Linux
o /dev/sda2 1026048 41943039 20458496 8e Linux LVM

o Disk /dev/mapper/centos-root: 18.8 GB, 18756927488 bytes, 36634624 sectors


o Units = sectors of 1 * 512 = 512 bytes
o Sector size (logical/physical): 512 bytes / 4096 bytes
o I/O size (minimum/optimal): 4096 bytes / 1048576 bytes
o Disk /dev/mapper/centos-swap: 2147 MB, 2147483648 bytes, 4194304 sectors
o Units = sectors of 1 * 512 = 512 bytes
o Sector size (logical/physical): 512 bytes / 4096 bytes
o I/O size (minimum/optimal): 4096 bytes / 1048576 bytes

o Disk /dev/sdb: 21.5 GB, 21474836480 bytes, 41943040 sectors


o Units = sectors of 1 * 512 = 512 bytes
o Sector size (logical/physical): 512 bytes / 4096 bytes
o I/O size (minimum/optimal): 4096 bytes / 1048576 bytes

o /dev/sdb is the new disk.

• Sample output:
o # lsblk --scsi
o NAME HCTL TYPE VENDOR MODEL REV TRAN
o sda 1:0:0:0 disk NUTANIX VDISK 0
o sdb 4:0:0:0 disk NUTANIX VDISK 0 iscsi
o sr0 0:0:0:0 rom QEMU QEMU DVD-ROM 2.3. ata
How to Implement ABS on Linux

fdisk to create the partition


fdisk /dev/sdb
• Type n – to create Primary or Extended
• Type p – for Primary Partition
• Type 1 – for Partition Number
• First Sector – Take the default
• Last Sector – Take the default
o This will create 1 partition on the disk using all the Free Space
• Type w – Write to the Partition Table
fdisk -l – Will now show sdb1 – SCSI disk b, Partition 1
mkfs to create the ext4 filesystem on the sdb1 partition
• mkfs.ext4 /dev/sdb1
33

How to Implement ABS on Linux


• fdisk is the Linux utility to list the new SCSI disk and create a partition.

• fdisk –l to list all SCSI disks


• fdisk /dev/devicename

• mkfs is the Linux utility to make a filesystem on the new partitioned disk.
o Linux uses the ext4 filesystem.

• mkfs.ext4 /dev/devicename_partition_number

• For example if the disk is sdb and the partition is number 1 then the path to the
device is the following:
• /dev/sdb1
How to Implement ABS on Linux

Mount the disk to the Linux Filesystem


mount /dev/sdb1 /mnt/mountpoint
The mount Command is not Persistent
Make a Mount Entry in /etc/fstab to Make the Mountpoint Persistent
/dev/sdb1 /mnt/iscsi ext4 defaults,_netdev 0 0
mount -av to check the fstab file for errors
Type the mount command to verify the mount was successful

Test writing a file to the mountpoint:

cd /mnt/iscsi
touch file1
34

How to Implement ABS on Linux


• Now it is time to mount the disk into the existing Linux filesystem.

• Use the mount command to mount the disk into the filesystem.

• Create an empty folder to mount the disk to.


o For example, under the /mnt filesystem create a folder named iSCSI.

• Type the following command to mount the disk to the new folder:
o #mount /dev/sdb1 /mnt/iscsi

• To make the mount persistent across reboots, place an entry in the /etc/fstab file.
o #nano /etc/fstab
o /dev/sdb1 /mnt/iscsi ext4 defaults,_netdev 0 0
• _netdev flags that it is a network device

• Test the entry with the following command:


o #mount –av
o a = all, v = verbose
• Type the mount command to verify that the mount command ran successfully.
• cd to the mountpoint:
o #cd /mnt/iscsi

• Create a file on the mountpoint to test the disk:


o #touch file1
• File creation should be successful.
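• As an illustration only, the fstab entry can also reference the partition by filesystem UUID, which
is more robust if device names change between reboots. A minimal sketch, assuming /dev/sdb1 is
the ABS disk and /mnt/iscsi is the mountpoint (the UUID value shown is a placeholder):
o # blkid /dev/sdb1
o /dev/sdb1: UUID="1234abcd-0000-0000-0000-000000000000" TYPE="ext4"
o Add the following line to /etc/fstab:
o UUID=1234abcd-0000-0000-0000-000000000000 /mnt/iscsi ext4 defaults,_netdev 0 0
o # mount -av
o mount -av validates the new fstab entry and mounts it without a reboot.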
Troubleshooting

Troubleshooting
Troubleshooting - Storage

Verify Connectivity on the Target


• List all iSCSI clients
o <acropolis> iscsi_client.list
o Client Identifier Client UUID
o 10.30.15.93 5ba3f873-7c07-4348-971f-746ea45be9fd
o 10.30.15.95 3a2be7cc-e521-42c8-b8f6-ed5fa6677fce
• To Retrieve Information about an iSCSI Client
o <acropolis> iscsi_client.get client_uuid_list=5ba3f873-7c07-4348-971f-746ea45be9fd
o config {
o iscsi_client_id {
o client_uuid: "5ba3f873-7c07-4348-971f-746ea45be9fd"
o iscsi_initiator_network_id: "10.30.15.93"
o }
o iscsi_target_name: "iqn.2010-06.com.nutanix:windows-90e7bb28-755c-4e02-833b-b786d634fe6d"
o target_params {
o num_virtual_targets: 32
o }
o }
Check the Data Services IP
• # ncli cluster info | grep "External Data Services"
• # allssh "ifconfig | grep eth0:2"
36

Troubleshooting - Storage
• When troubleshooting Acropolis Block Services start with network connectivity.
o The Initiator relies on the network for access to the disk on the target.
• Any type of network issue can cause access problems, performance issues,
or disconnections to the disk.

• ACLI is the Acropolis Command Line which is accessible via ssh to any CVM in the
cluster.

• On the cluster, there are several ACLI commands to verify Initiator connectivity and
provide details about the iSCSI client.
o If the Initiator does not see the disks or experiences disconnects, verify that the
client is currently connected to the cluster. Run the following command to
retrieve a list of iSCSI clients connected to the cluster:

o $ acli iscsi_client.list – In the output, verify that the client of interest is listed and connected.

• If the command output does not list the client of interest, then check the iSCSI
software Initiator on the client and verify connectivity.
o Ask whether anything changed: IP addresses, VLAN tags, network reconfiguration,
and so forth.
• Check the Data Services IP address.
o The virtual Data Services IP address will live on one CVM in the Cluster and will
automatically fail over if the CVM hosting the Data Services IP address goes
offline.
o The IP address is assigned to the virtual network interface eth0:2. Use allssh and grep
for eth0:2 to locate the CVM hosting the Data Services IP address.
o Ping the Data Services IP address to test network connectivity.
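• A quick end-to-end check of the items above can be run from any CVM. This is a minimal sketch;
the client IP is the example address used in this module:
o # ncli cluster info | grep "External Data Services"
o # allssh "ifconfig | grep eth0:2"
o # acli iscsi_client.list
o # ping -c 3 10.30.15.93
o The first two commands confirm the Data Services IP address and which CVM hosts it, the
third confirms the client is known to the cluster, and the ping tests basic reachability to the
client.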
Troubleshooting - Client

Verify Connectivity on the Initiator


• Windows iSCSI Management Tool
• iscsiadm Utility (CentOS)
• Ping Target IP address
• Check Firewalls
• VLAN Tags
• Physical
o Network Adapters
o Cables
o Switches
37

Troubleshooting - Client
• Check and verify the network settings on the Client.
o Any network problems or misconfigurations will cause disk access problems.

• The iSCSI Software Initiator software installed on the Windows and Linux clients
can be helpful in diagnosing iSCSI network connectivity issues.
o Use the iSCSI utilities to confirm Discovery and Session connectivity.

• Try to ping the Target's IP address from the client to verify TCP/IP connectivity.
o Verify the IP addresses of both the Target and the Initiator.
• Are the right IP addresses being used?
• Are the network ports configured for the proper VLAN?
• Have any switch configuration changes been implemented?
- Review all IP settings such as IP address, Subnet Mask, Gateway, and so
forth.

• Check the physical layer.


o Is there a bad network cable or network port?

• Disable firewalls on the client and test ping connectivity.
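• From a Linux client, a simple check that TCP port 3260 on the Data Services IP is reachable rules
out most firewall and VLAN problems. A minimal sketch; the IP is the example Data Services
address used in this module:
o # ping -c 3 10.30.15.240
o # nc -v -w 2 10.30.15.240 -z 3260
o If the ping succeeds but the port test fails, suspect a firewall or switch ACL rather than basic
IP connectivity.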


Troubleshooting - Windows Client

Verify Connectivity on
the Windows Initiator

Click to Highlight

38

Troubleshooting - Windows Client


• To verify connectivity on the Windows client go to the iSCSI Management tool.
o The Targets tab is where we can verify session and connectivity.
• Highlight the Discovered Targets and then click Properties.
o The Properties dialogue box shows the Session Information and the status.
• Look for Connected under Status.

• Click Devices to show the actual disk drive.


o This will provide the volume path name provided by the Windows Operating
System.
• The devices dialogue box will also specify the Port ID, Bus ID, Target ID, and
LUN ID for each LUN.
Troubleshooting - Linux Client

Verify Connectivity on the Linux Initiator


iscsiadm -m discovery
• root@Centos01 ~]# iscsiadm -m discovery
• 10.30.15.240:3260 via sendtargets
iscsiadm -m node
• [root@Centos01 ~]# iscsiadm -m node
• 10.30.15.240:3260,1 iqn.2010-06.com.nutanix:centos7-1e1e80b9-
f4ac-4186-8be0-3d53b2d723b7-tgt0
iscsiadm -m session
• [root@Centos01 ~]# iscsiadm -m session
• tcp: [1] 10.30.15.240:3260,1 iqn.2010-06.com.nutanix:centos7-
1e1e80b9-f4ac-4186-8be0-3d53b2d723b7-tgt0 (non-flash)
/var/lib/iscsi/
39

Troubleshooting - Linux Client


• Use the iscsiadm utility to troubleshoot connectivity.
o There are several show commands to assist in examining the Initiator
connectivity to the Storage Target.

• Before an iSCSI session can be established the Target has to be discovered.


o In Linux, the easiest way is to use Send Targets for initial Discovery.
o The following command will verify that discovery was successful and the type
used in this case is Send Targets:

• iscsiadm -m discovery

o The command also returns the IP address used for Discovery.


• The IP address will be the External Data Services IP.

• The node and session options will return Target information and TCP connectivity
info.

• The iscsiadm -m node command shows the IP address, port, and IQN for the ABS
Virtual Target.
• The iscsiadm -m session command shows the actual session established over TCP.
• The /var/lib/iscsi filesystem has several directories and files that show iSCSI
connectivity.
o The ifaces subfolder shows the iSCSI interface being used (default).
o The nodes subfolder shows target IQN and IP information.
o The send_targets subfolder has all the detailed iSCSI settings.
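• If discovery or the session is missing, re-running discovery against the Data Services IP and
logging in again often restores connectivity. A minimal sketch; the IP is the example Data Services
address used in this module:
o # iscsiadm -m discovery -t sendtargets -p 10.30.15.240:3260
o # iscsiadm -m node -l
o # iscsiadm -m session
o The first command performs Send Targets discovery, the second logs in to the discovered
targets, and the third confirms that a session is established.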
Troubleshooting – LUN Mapping

Verify Volume Group Configuration


Obtain Initiator IQN
• Linux - cat /etc/iscsi/initiatorname.iscsi
• Windows – iscsicli nodename
List Volume Groups
<acropolis> vg.list
Volume Group Name Volume Group UUID
linuxvg 1e1e80b9-f4ac-4186-8be0-3d53b2d723b7
Windows 90e7bb28-755c-4e02-833b-b786d634fe6d
<acropolis> vg.get linuxvg
• The vg.get command will list the Initiator Mapping
o IQN or IP Address
40

Troubleshooting – LUN Mapping


• When troubleshooting ABS also look at the Volume Group configuration. The client
needs to be added to the Volume Group for access to the disks.

• Initiators can be mapped to the Volume Group by either their IQN or IP Address.
The mapped IQN or IP address acts as a whitelist or Access Control List (ACL) that
determines which clients have access to the Volume Group.

• If Initiators are being mapped using the IQN then first obtain the IQN of the Initiator.
On Windows hosts, use the iscsicli command line tool or the GUI tool on the
Configuration tab. For Linux hosts read the contents of the
/etc/iscsi/initiatorname.iscsi file.

• If the client is mapped using an IP address, then use ipconfig on Windows and ifconfig
(or ip addr) on Linux to obtain the IP addresses of the client.
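• If the client is missing from the whitelist, it can be added from aCLI. This is a minimal sketch,
assuming a Volume Group named linuxvg and that the vg.attach_external sub-command is
available in this AOS release (the IQN is a placeholder):
o <acropolis> vg.get linuxvg
o <acropolis> vg.attach_external linuxvg iqn.1994-05.com.redhat:e050c3a95ab
o Re-run vg.get afterward and confirm that the new entry appears under attachment_list.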
Troubleshooting – LUN Mapping
Get Volume Group VDISK Details
<acropolis> vg.get Windows
Windows {
annotation: ""
attachment_list {
client_uuid: "5ba3f873-7c07-4348-971f-746ea45be9fd"
external_initiator_network_id: "10.30.15.93"
target_params {
num_virtual_targets: 32
}
}
disk_list {
container_id: 8
container_uuid: "ef583ded-60c5-441d-ab60-a1a9271bf0b6"
flash_mode: False
index: 0
vmdisk_size: 10737418240
vmdisk_uuid: "99bff767-1970-4e37-b84f-afa585650f21"
}
flash_mode: False
iscsi_target_name: "windows-90e7bb28-755c-4e02-833b-b786d634fe6d"
logical_timestamp: 2
name: "Windows"
shared: True
uuid: "90e7bb28-755c-4e02-833b-b786d634fe6d"
version: "kSecond"
}

$ vdisk_config_printer -nfs_file_name 99bff767-1970-4e37-b84f-afa585650f21
vdisk_id: 124055
vdisk_name: "NFS:1:0:302"
vdisk_size: 10737418240
iscsi_target_name: "iqn.2010-06.com.nutanix:windows-90e7bb28-755c-4e02-833b-b786d634fe6d"
iscsi_lun: 0
container_id: 8
creation_time_usecs: 1496950570611638
vdisk_creator_loc: 6
vdisk_creator_loc: 279
vdisk_creator_loc: 95130973
nfs_file_name: "99bff767-1970-4e37-b84f-afa585650f21"
iscsi_multipath_protocol: kMpio
scsi_name_identifier: "naa.6506b8d08d6857207fbb11caa356ec4e"
vdisk_uuid: "5c0538cd-1857-4083-bae4-103fecc57520"
chain_id: "8a14a147-0fda-406b-8f83-60e015512ca0"
last_modification_time_usecs: 1496950570623304

41

Troubleshooting – LUN Mapping


• Either the IQN or the IP address of the client is what is added to the Access List in
the Volume Group for disk access. The Acropolis Command Line (aCLI) or Prism can
be used to create and verify configuration settings for Volume Groups.

• The following command will list all Volume Groups:


o $acli vg.list
o Volume Group name Volume Group UUID
o Centos7 1e1e80b9-f4ac-4186-8be0-3d53b2d723b7
o Windows 90e7bb28-755c-4e02-833b-b786d634fe6d

• From the list you can then get more details for a specific volume group with vg.get
vgname:
o $acli vg.get Windows
o Windows {
o annotation: ""
o attachment_list {
o client_uuid: "5ba3f873-7c07-4348-971f-746ea45be9fd"
o external_initiator_network_id: "10.30.15.93"
o target_params {
o num_virtual_targets: 32
o }
o }
o disk_list {
o container_id: 8
o container_uuid: "ef583ded-60c5-441d-ab60-a1a9271bf0b6"
o flash_mode: False
o index: 0
o vmdisk_size: 10737418240
o vmdisk_uuid: "99bff767-1970-4e37-b84f-afa585650f21"
o }
o flash_mode: False
o iscsi_target_name: "windows-90e7bb28-755c-4e02-833b-b786d634fe6d"
o logical_timestamp: 2
o name: "Windows"
o shared: True
o uuid: "90e7bb28-755c-4e02-833b-b786d634fe6d"
o version: "kSecond"
o }

• In the output, look for attachment_list. The attachment_list will show the
external_initiator_network_id as either the IQN or IP address of the client. Verify that
the external_initiator_network_id matches the client requiring access to the disks in
the Volume Group.

• From the acli vg.get vgname command we can get the vmdisk_uuid. Using the
vmdisk_uuid and the vdisk_config_printer command you can look up the iSCSI
target name, LUN ID, vDisk size, and other details of the vDisk.

o $ vdisk_config_printer -nfs_file_name 99bff767-1970-4e37-b84f-afa585650f21


o vdisk_id: 124055
o vdisk_name: "NFS:1:0:302"
o vdisk_size: 10737418240
o iscsi_target_name: "iqn.2010-06.com.nutanix:windows-90e7bb28-755c-4e02-
833b-b786d634fe6d"
o iscsi_lun: 0
o container_id: 8
o creation_time_usecs: 1496950570611638
o vdisk_creator_loc: 6
o vdisk_creator_loc: 279
o vdisk_creator_loc: 95130973
o nfs_file_name: "99bff767-1970-4e37-b84f-afa585650f21"
o iscsi_multipath_protocol: kMpio
o scsi_name_identifier: "naa.6506b8d08d6857207fbb11caa356ec4e"
o vdisk_uuid: "5c0538cd-1857-4083-bae4-103fecc57520"
o chain_id: "8a14a147-0fda-406b-8f83-60e015512ca0"
o last_modification_time_usecs: 1496950570623304
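• Because the full vg.get output is long, grep can narrow it to the whitelist entries. A minimal
sketch using the Windows Volume Group from the example above:
o $ acli vg.get Windows | grep -A3 attachment_list
o The -A3 option prints the lines that follow each match, which in this output include the
client_uuid and the external_initiator_network_id (the IQN or IP address of the whitelisted
client).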
Troubleshooting – pithos_cli

To View the Virtual Targets for all Initiators


• $ pithos_cli --lookup iscsi_client_params

32 Virtual Targets by default


• Can be modified in pithos_cli

Each Virtual Target is Assigned to a Different Node (ABS Load Balancing)

42

Troubleshooting – pithos_cli
• The main service on AOS to manage disks is pithos. There is a command line tool
on the CVM named pithos_cli to lookup the virtual targets for all Initiators or for a
specific Initiator. This can be helpful in troubleshooting ABS issues to pinpoint where
the vDisk resides, and on which CVM. From there the local Stargate logfiles can be
examined to assist in diagnosing the issues.

• To view virtual targets for all initiators type the following command:
o pithos_cli --lookup iscsi_client_params

o nutanix@NTNX-16SM13150152-A-CVM:10.30.15.47:~$ pithos_cli -lookup


iscsi_client_params
• ---------------------------------------------------
• iscsi_client_id {
• iscsi_initiator_network_id: "10.30.15.93"
• client_uuid: "[\243\370s|\007CH\227\037tn\244[\351\375"
• }
• iscsi_target_name: "iqn.2010-06.com.nutanix:windows-90e7bb28-755c-4e02-833b-
b786d634fe6d"
• target_params {
• num_virtual_targets: 32
• }
• Client UUID: 5ba3f873-7c07-4348-971f-746ea45be9fd
• last_modification_time_usecs: 1496950394322187
• --- Target Distribution ---
• CVM IP CVM Id Target count Target
• 10.30.15.48 5 10
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt2
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt5
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt8
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt11
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt14
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt17
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt20
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt23
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt26
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt29
• 10.30.15.47 4 11
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt1
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt4
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt7
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt10
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt13
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt16
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt19
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt22
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt25
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt28
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt31
• 10.30.15.49 6 11
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt0
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt3
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt6
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt9
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt12
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt15
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt18
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt21
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt24
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt27
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt30
• Total targets: 32
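• Because the full listing can be long, the per-target detail lines can be filtered out to leave just the
client parameters and the per-CVM distribution summary. A minimal sketch:
o $ pithos_cli --lookup iscsi_client_params | grep -v -- "- iqn"
o The grep -v -- "- iqn" drops the individual virtual target lines so that only the summary
remains.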
Troubleshooting – pithos_cli

To View the Virtual Targets for a Specific Initiator


$ pithos_cli --lookup iscsi_client_params --iscsi_initiator_network_identifier=10.30.15.95
• Use the --iscsi_initiator_network_identifier argument if the Initiator is mapped to the
Volume Group by IP address
$ pithos_cli --lookup iscsi_client_params --iscsi_initiator_name=IQN_of_initiator
• Use the --iscsi_initiator_name argument if the Initiator is mapped to the Volume Group by
IQN

To view Target distribution by CVM

CVM IP CVM Id Target count


10.30.15.48 5 10
10.30.15.47 4 11
10.30.15.49 6 43
Total targets: 64

43

Troubleshooting – pithos_cli
• To view virtual targets for a specific Initiator type the following command:

• $ pithos_cli --lookup iscsi_client_params --


iscsi_initiator_network_identifier=10.30.15.95
o Use the –iscsi_initiator_network_identifier Argument if the Initiator is mapped to the
Volume Group by IP address.
• or
• $pithos_cli --lookup iscsi_client_params --iscsi_initiator_name=IQN_of_initiator
o Use the --iscsi_initiator_name argument if the Initiator is mapped to the Volume
Group by IQN.

• nutanix@NTNX-16SM13150152-A-CVM:10.30.15.47:~$ pithos_cli -lookup


iscsi_client_params -iscsi_initiator_network_identifier=10.30.15.93
• ---------------------------------------------------
• iscsi_client_id {
o iscsi_initiator_network_id: "10.30.15.93"
o client_uuid: "[\243\370s|\007CH\227\037tn\244[\351\375"
• }
• iscsi_target_name: "iqn.2010-06.com.nutanix:windows-90e7bb28-755c-4e02-833b-
b786d634fe6d"
• target_params {
• num_virtual_targets: 32
• }
• Client UUID: 5ba3f873-7c07-4348-971f-746ea45be9fd
• last_modification_time_usecs: 1496950394322187
• --- Target Distribution ---
• CVM IP CVM Id Target count Target
• 10.30.15.48 5 10
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt2
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt5
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt8
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt11
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt14
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt17
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt20
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt23
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt26
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt29
• 10.30.15.47 4 11
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt1
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt4
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt7
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt10
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt13
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt16
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt19
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt22
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt25
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt28
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt31
• 10.30.15.49 6 11
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt0
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt3
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt6
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt9
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt12
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt15
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt18
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt21
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt24
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt27
• - iqn.2010-06.com.nutanix:windows-90e7bb28-755c-
4e02-833b-b786d634fe6d-tgt30
• Total targets: 32
Troubleshooting – Preferred SVM
Use pithos_cli to Modify Client Params
• Modify number of Virtual Targets
• Configure a preferred SVM
The Preferred SVM setting instructs the cluster to place all of the Initiator's LUNs on that Node
(for example, all-flash Nodes added to the Cluster for ABS)
$pithos_cli --lookup iscsi_client_params --iscsi_initiator_network_identifier=10.30.15.95 --edit
--editor=vi or nano
iscsi_client_id {
iscsi_initiator_network_id: "10.30.15.95"
client_uuid: ":+\347\314\345!B\310\270\366\355_\246g\177\316"
}
iscsi_target_name: "iqn.2010-06.com.nutanix:centos7-1e1e80b9-f4ac-4186-8be0-3d53b2d723b7"
target_params {
• num_virtual_targets: 6
preferred_svm_id: 6
}

44

Troubleshooting – Preferred SVM


• By default ABS will load balance the disks in the Volume Group across nodes in the
cluster. For example, if a Volume Group has two disks, ABS will map one of the disks
to one CVM in the cluster and the second disk to a different CVM.

• ABS by default will configure the client with 32 virtual targets distributed throughout
nodes in the cluster. We can see this with the previous pithos_cli –lookup
iscsi_client_params.

• The AOS 5.1 release now supports mixed clusters. Mixed clusters can have hybrid
and all flash nodes in the same cluster. A hybrid node has a mix of SSD and HDD
drives. All flash nodes are equipped with all SSD drives.

• Customers may want to use the all flash nodes for ABS. Configuring a preferred SVM
will instruct the client disks to be placed on that node, bypassing load balancing of
the disks to different nodes.

• The pithos_cli command is where you can set the preferred SVM for disk
placement for that client. Type the following command to edit the
iscsi_client_params with any text editor on the CVM.
• $ pithos_cli --lookup iscsi_client_params --
iscsi_initiator_network_identifier=10.30.15.95 --edit --editor=vi (or --editor=nano)

• At the bottom of the file before the last bracket (}), insert a new line and add the
following
o preferred_svm_id: id

• ID is the ID of the CVM returned from the pithos_cli --lookup iscsi_client_params output.


Troubleshooting – Preferred SVM Verified
$pithos_cli --lookup iscsi_client_params --iscsi_initiator_network_identifier=10.30.15.95
---------------------------------------------------
iscsi_client_id {
iscsi_initiator_network_id: "10.30.15.95"
client_uuid: ":+\347\314\345!B\310\270\366\355_\246g\177\316"
}
iscsi_target_name: "iqn.2010-06.com.nutanix:centos7-1e1e80b9-f4ac-4186-8be0-3d53b2d723b7"
target_params {
num_virtual_targets: 6
preferred_svm_id: 6
}
Client UUID: 3a2be7cc-e521-42c8-b8f6-ed5fa6677fce
last_modification_time_usecs: 1497487491799363
--- Target Distribution ---
CVM IP CVM Id Target count Target
10.30.15.49 6 6
- iqn.2010-06.com.nutanix:centos7-1e1e80b9-f4ac-4186-8be0-3d53b2d723b7-tgt0
- iqn.2010-06.com.nutanix:centos7-1e1e80b9-f4ac-4186-8be0-3d53b2d723b7-tgt1
- iqn.2010-06.com.nutanix:centos7-1e1e80b9-f4ac-4186-8be0-3d53b2d723b7-tgt2
- iqn.2010-06.com.nutanix:centos7-1e1e80b9-f4ac-4186-8be0-3d53b2d723b7-tgt3
- iqn.2010-06.com.nutanix:centos7-1e1e80b9-f4ac-4186-8be0-3d53b2d723b7-tgt4
- iqn.2010-06.com.nutanix:centos7-1e1e80b9-f4ac-4186-8be0-3d53b2d723b7-tgt5
Total targets: 6

45

Troubleshooting – Preferred SVM Verified


• After the iscsi_client_params are modified and the preferred SVM is added, we can see all
the virtual targets now pointing to CVM id 6. All disks in the Volume Group for this
initiator will be placed on the node of CVM id 6.

• Modifying the virtual targets from 32 to 6 now shows only 6 virtual targets for this
Initiator.
Logs of Interest – View iSCSI Adapter
Messages

Log Files to Review when Troubleshooting ABS


Information is Duplicated Across Stargate Logfiles
To View iSCSI Adapter Messages
$allssh "zgrep 'scsi.*cc' /home/nutanix/data/logs/stargate.INFO*"
Sample Output
I0619 13:50:52.103484 23730 iscsi_server.cc:2010] Processing 1 Pithos vdisk configs for
vzone
I0619 13:50:52.103548 23730 iscsi_server.cc:2022] Processing vdisk config update for
179371 timestamp 1497905452100683 config vdisk_id: 179371 vdisk_name: "NFS:4:0:260"
vdisk_size: 10000000000 iscsi_target_name: "iqn.2010-06.com.nutanix:linuxvg-a4df083e-
babe-47de-9857-5596e1a0effb" iscsi_lun: 1 container_id: 152253 creation_time_usecs:
1497905452088823 vdisk_creator_loc: 6 vdisk_creator_loc: 279 vdisk_creator_loc:
109079661 nfs_file_name: "bf86cde1-ad54-4b53-9d48-6edce4670cc1"
iscsi_multipath_protocol: kMpio chain_id:
"\210\332\305\275\231%I\311\256o\327CS\225m\037" vdisk_uuid:
"c\363\023v\010\211K\026\202tlo\312U\355\372" scsi_name_identifier:
46
"naa.6506b8d799ce64bc385db2c66e90af50"

Logs of Interest – View iSCSI Adapter Messages


• Stargate is the core service that allows the iSCSI access to the cluster for ABS.
Stargate allows NFS, CIFS, and iSCSI access to vDisks stored on the local disks for
the node.

• The External Data Services IP address allows for a single IP address for
discovering and connecting to the virtual targets. The Data Services IP address is
hosted on one CVM in the cluster at a time. The Data Services IP is hosted on
interface eth0:2.

• The allssh ifconfig | grep eth0:2 command will show which CVM is currently hosting
the IP address. If the CVM hosting the Data Services IP address goes offline, another
CVM will be elected to host the IP address.

• The actual iSCSI session and connection will be hosted by one of the CVMs after
login and iSCSI redirection to the CVM hosting the vDisk. By default, the vdisks for a
Volume Group will be load balanced across the CVMs in the cluster. After the host is
connected to the vDisk thru the CVM, if the CVM goes offline, then ABS will redirect
the host to another CVM for data access.
• All messages pertaining to iSCSI will be available in the Stargate logfiles. The main
logfile to review for troubleshooting issues will be the stargate.INFO log on each
CVM.

• To view all iSCSI adapter messages, use allssh and zgrep to search through each
CVM's Stargate logs. Below are a few samples:

• To view all iSCSI adapter messages:



• allssh "zgrep 'scsi.*cc' /home/nutanix/data/logs/stargate.INFO*"

Sample Output
• I0619 13:50:52.103484 23730 iscsi_server.cc:2010] Processing 1 Pithos vdisk
configs for vzone
• I0619 13:50:52.103548 23730 iscsi_server.cc:2022] Processing vdisk config update
for 179371 timestamp 1497905452100683 config vdisk_id: 179371 vdisk_name:
"NFS:4:0:260" vdisk_size: 10000000000 iscsi_target_name: "iqn.2010-
06.com.nutanix:linuxvg-a4df083e-babe-47de-9857-5596e1a0effb" iscsi_lun: 1
container_id: 152253 creation_time_usecs: 1497905452088823 vdisk_creator_loc: 6
vdisk_creator_loc: 279 vdisk_creator_loc: 109079661 nfs_file_name: "bf86cde1-
ad54-4b53-9d48-6edce4670cc1" iscsi_multipath_protocol: kMpio chain_id:
"\210\332\305\275\231%I\311\256o\327CS\225m\037" vdisk_uuid:
"c\363\023v\010\211K\026\202tlo\312U\355\372" scsi_name_identifier:
"naa.6506b8d799ce64bc385db2c66e90af50"
• I0619 13:50:52.103561 23730 iscsi_server.cc:2701] Adding state for vdisk 179371
from Pithos
• I0619 13:50:52.103637 23730 iscsi_logical_unit.cc:67] Creating state for
10000000000 byte VDisk disk as LUN 1, Target iqn.2010-06.com.nutanix:linuxvg-
a4df083e-babe-47de-9857-5596e1a0effb, VDisk 179371, NFS:4:0:260
• I0619 13:50:52.103642 23730 iscsi_target.cc:59] Added LUN 1 for VDisk
'NFS:4:0:260' to iqn.2010-06.com.nutanix:linuxvg-a4df083e-babe-47de-9857-
5596e1a0effb

• zgrep can search regular text files as well as compressed files.


Logs of Interest – Discovered LUNS Messages

Find each Discovered iSCSI Base LUN


$allssh "zgrep 'iscsi_logical_unit.*Creating state' /home/nutanix/data/
logs/stargate*log*INFO*“
Sample Output:
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-
CVM.nutanix.log.INFO.20170613-162520.23713:I0619 11:31:12.797281 23727 iscsi_logical_unit.cc:67]
Creating state for 10000000000 byte VDisk disk as LUN 0, Target iqn.2010-06.com.nutanix:linuxvg-
a4df083e-babe-47de-9857-5596e1a0effb-tgt0, VDisk 162769, NFS:4:0:259
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-
CVM.nutanix.log.INFO.20170613-162520.23713:I0619 13:50:52.103637 23730 iscsi_logical_unit.cc:67]
Creating state for 10000000000 byte VDisk disk as LUN 1, Target iqn.2010-06.com.nutanix:linuxvg-
a4df083e-babe-47de-9857-5596e1a0effb, VDisk 179371, NFS:4:0:260
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-
CVM.nutanix.log.INFO.20170613-162520.23713:I0619 13:59:25.283494 23730 iscsi_logical_unit.cc:67]
Creating state for 5000000000 byte VDisk disk as LUN 2, Target iqn.2010-06.com.nutanix:linuxvg-
a4df083e-babe-47de-9857-5596e1a0effb, VDisk 179588, NFS:4:0:261

47

Logs of Interest – Discovered LUNS Messages


• Examining Stargate logs across the cluster will provide information to assist in
troubleshooting ABS issues.

• Use zgrep and searching for Creating state will provide details on the LUN creation.
Looking at the log shows the vDisk creation, LUN ID (Unique for each LUN in the
Volume Group), size, and the IQN of the target.

• The following command will search the Stargate logfiles for iSCSI LUN creation
events:
o allssh "zgrep 'iscsi_logical_unit.*Creating state'
/home/nutanix/data/logs/stargate*log*INFO*“

• The target in the example in the slide shows the following:


o Target iqn.2010-06.com.nutanix:linuxvg-a4df083e-babe-47de-9857-5596e1a0effb

• The Target Name is comprised of the several pieces starting with:


o iqn. – iSCSI Qualified Name.
o 2010-06.com.nutanix: – the year and domain name reversed for Nutanix.
o linuxvg- – iSCSI Target Name Prefix (defaults to the Volume Group Name).
o a4df083e-babe-47de-9857-5596e1a0effb – a randomly generated GUID.
• There will also be a -tgtX suffix, where X is an integer starting at zero that is
determined at iSCSI redirection time.

• On the Initiator you can examine each LUN for the -tgtX value per device.
Logs of Interest – iSCSI Redirection Messages

Session Redirection Messages


$allssh "zgrep 'iscsi_login.*redirect’ /home/nutanix/data/logs/stargate*log*INFO*“
Sample O utput:
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-B-CVM.nutanix.log.INFO.20170605-181306.6610:I0616 04:03:07.361250
6628 iscsi_login_op.cc:690] Session from (iqn.1991-05.com.microsoft:server01, ISID 0x0) for target iqn.2010-
06.com.nutanix:windowsvg-f6e76459-f8d8-47cb-a1bb-81cbd2c8bb68-tgt0 redirected to 10.30.15.48:3205
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-B-CVM.nutanix.log.INFO.20170605-181306.6610:I0616 04:03:07.379371
6628 iscsi_login_op.cc:690] Session from (iqn.1991-05.com.microsoft:server01, ISID 0x0) for target iqn.2010-
06.com.nutanix:windowsvg-f6e76459-f8d8-47cb-a1bb-81cbd2c8bb68-tgt0 redirected to 10.30.15.48:3205
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-B-CVM.nutanix.log.INFO.20170605-181306.6610:I0616 04:03:07.397202
6629 iscsi_login_op.cc:690] Session from (iqn.1991-05.com.microsoft:server01, ISID 0x0) for target iqn.2010-
06.com.nutanix:windowsvg-f6e76459-f8d8-47cb-a1bb-81cbd2c8bb68-tgt0 redirected to 10.30.15.48:3205
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-B-CVM.nutanix.log.INFO.20170605-181306.6610:I0616 04:03:07.415432
6628 iscsi_login_op.cc:690] Session from (iqn.1991-05.com.microsoft:server01, ISID 0x0) for target iqn.2010-
06.com.nutanix:windowsvg-f6e76459-f8d8-47cb-a1bb-81cbd2c8bb68-tgt0 redirected to 10.30.15.48:3205
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-B-CVM.nutanix.log.INFO.20170605-181306.6610:I0616 13:44:06.444092
6627 iscsi_login_op.cc:690] Session from (iqn.1994-05.com.redhat:e050c3a95ab, ISID 0x0) for target iqn.2010-
06.com.nutanix:linuxvg-a4df083e-babe-47de-9857-5596e1a0effb-tgt0 redirected to 10.30.15.47:3205
On which CVM Does iSCSI Redirection Take Place?
• Hint: eth0:2 – Review the stargate.INFO logfile on that Node for iSCSI Redirection Messages
Initiator IQN
Target IQN
48 Default iSCSI Port 3260 but iSCSI Redirection uses Port 3205

Logs of Interest – iSCSI Redirection Messages


• In ABS, Nutanix clusters expose a single IP address for Target Portal Discovery.
The single IP address used is the External Data Services IP address. Initiators only
have to point to the Data Services IP. After the Initiator performs discovery, login
and session creation have to be performed.

• During the Login to the Data Services IP address is when iSCSI redirection will
occur. iSCSI redirection will map the LUN to a CVM in the cluster. The iSCSI
redirection uses the external IP address of the CVM and a non-standard iSCSI
port 3205 for the connection.

• The iSCSI redirection will also occur whenever the CVM hosting iSCSI
connections goes offline. When the CVM goes offline this will cause a logout
event on the Initiator. The Initiator will automatically attempt to log in using the Data
Services IP address. iSCSI redirection will perform a logout and a login to an
existing CVM in the cluster.

• All of the iSCSI redirection happens on the CVM hosting the data services IP address
eth0:2.

• In the example above you can see this iSCSI redirection occurring. Performing a:
o $allssh "zgrep 'iscsi_login.*redirect' /home/nutanix/data/logs/stargate*log*INFO*"

• …will also reveal which CVM is hosting the Data Services IP address. The
stargate*log*INFO* logfile on that CVM has the iSCSI redirection messages.
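• Putting the two checks together, first locate the CVM holding eth0:2 and then search only that
CVM's Stargate log for redirects. A minimal sketch, with the second command run on the CVM
that hosts the Data Services IP:
o $ allssh "ifconfig | grep eth0:2"
o $ zgrep 'iscsi_login.*redirect' /home/nutanix/data/logs/stargate*INFO* | tail -5
o The tail -5 limits the output to the most recent redirection events.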
Logs of Interest – Virtual Target Session
Messages
Find virtual target session
$allssh "zgrep 'iscsi_server.*virtual target’
/ home/ nutanix/ data/ logs/ stargate*log*INFO*"
Sample Output:
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0615
05:57:17.110442 23727 iscsi_server.cc:1238] Adding virtual target iqn.2010-06.com.nutanix:windows-
90e7bb28-755c-4e02-833b-b786d634fe6d-tgt0 to session 0x4000013700000001 from base target iqn.2010-
06.com.nutanix:windows-90e7bb28-755c-4e02-833b-b786d634fe6d with 32 virtual targets
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0615
05:57:17.125102 23730 iscsi_server.cc:1238] Adding virtual target iqn.2010-06.com.nutanix:windows-
90e7bb28-755c-4e02-833b-b786d634fe6d-tgt0 to session 0x4000013700000001 from base target iqn.2010-
06.com.nutanix:windows-90e7bb28-755c-4e02-833b-b786d634fe6d with 32 virtual targets
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0615
05:57:17.139722 23728 iscsi_server.cc:1238] Adding virtual target iqn.2010-06.com.nutanix:windows-
90e7bb28-755c-4e02-833b-b786d634fe6d-tgt0 to session 0x4000013700000001 from base target iqn.2010-
06.com.nutanix:windows-90e7bb28-755c-4e02-833b-b786d634fe6d with 32 virtual targets
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0615
05:57:17.154194 23730 iscsi_server.cc:1238] Adding virtual target iqn.2010-06.com.nutanix:windows-
90e7bb28-755c-4e02-833b-b786d634fe6d-tgt0 to session 0x4000013700000001 from base target iqn.2010-
06.com.nutanix:windows-90e7bb28-755c-4e02-833b-b786d634fe6d with 32 virtual targets
49

Logs of Interest – Virtual Target Session Messages


• Virtual Target Session messages detail the information after the iSCSI redirection.
The messages reveal the actual session being established from the Initiator to the
Virtual Target. The Virtual Target is the mapping from the Initiator to an actual
CVM after iSCSI redirection.

• Review these logfile messages for errors occurring during the session establishment.
Logs of Interest – Session Login & Logout Messages

Find session login and logout events


$allssh "zgrep 'iscsi_session’
/home/nutanix/data/logs/stargate*log*INFO*“
Sample O utput:
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0615 05:57:17.150655 23729
iscsi_session.cc:155] Updated target params for target iqn.2010-06.com.nutanix:windows-90e7bb28-755c-4e02-833b-b786d634fe6d-tgt0 in
session 0x0 to: num_virtual_targets: 32
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0615 05:57:17.154076 23730
iscsi_session.cc:155] Updated target params for target iqn.2010-06.com.nutanix:windows-90e7bb28-755c-4e02-833b-b786d634fe6d-tgt0 in
session 0x0 to: num_virtual_targets: 32
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0615 05:57:17.154084 23730
iscsi_session.cc:165] Updated initiator id for target iqn.2010-06.com.nutanix:windows-90e7bb28-755c-4e02-833b-b786d634fe6d-tgt0 in session
0x0 to: 10.30.15.93
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0615 05:57:17.154089 23730
iscsi_session.cc:135] Leading connection 0x1 (10.30.15.93:59106) on session 0x4000013700000001
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0615 05:57:17.154906 23751
iscsi_session.cc:1365] Preferred SVM for iqn.2010-06.com.nutanix:windows-90e7bb28-755c-4e02-833b-b786d634fe6d-tgt0 is 6

Preferred SVM
IP address of the Initiator Creating the Session
50

Logs of Interest – Session Login & Logout Messages


• The login and logout messages can be useful for troubleshooting connectivity and
redirection issues.

• Session information can also be viewed from the Initiator. On Linux use the iscsiadm
utility, in Windows the iSCSI management GUI, Powershell, or iscsicli.
Logs of Interest – LUN Event Messages
Logical Unit ( LUN) add,disable,enable, replace events
$allssh "zgrep 'scsi.*Processing vdisk config update’
/home/nutanix/data/logs/stargate*log*INFO*“
Sample Output:
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0619 13:50:52.103637 23730
iscsi_logical_unit.cc:67] Creating state for 10000000000 byte VDisk disk as LUN 1, Target iqn.2010-06.com.nutanix:linuxvg-a4df083e-babe-47de-
9857-5596e1a0effb, VDisk 179371, NFS:4:0:260
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0619 13:50:52.103642 23730
iscsi_target.cc:59] Added LUN 1 for VDisk 'NFS:4:0:260' to iqn.2010-06.com.nutanix:linuxvg-a4df083e-babe-47de-9857-5596e1a0effb
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0619 13:59:25.283494 23730
iscsi_logical_unit.cc:67] Creating state for 5000000000 byte VDisk disk as LUN 2, Target iqn.2010-06.com.nutanix:linuxvg-a4df083e-babe-47de-
9857-5596e1a0effb, VDisk 179588, NFS:4:0:261
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0619 13:59:25.283499 23730
iscsi_target.cc:59] Added LUN 2 for VDisk 'NFS:4:0:261' to iqn.2010-06.com.nutanix:linuxvg-a4df083e-babe-47de-9857-5596e1a0effb
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0619 15:19:30.812345 23727
iscsi_logical_unit.cc:898] Disabling LUN 2, Target iqn.2010-06.com.nutanix:linuxvg-a4df083e-babe-47de-9857-5596e1a0effb, VDisk 179588,
NFS:4:0:261
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0619 15:19:30.812364 23727
iscsi_target.cc:74] Removed LUN 2 from iqn.2010-06.com.nutanix:linuxvg-a4df083e-babe-47de-9857-5596e1a0effb
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0619 15:44:35.431586 23729
iscsi_logical_unit.cc:67] Creating state for 5000000000 byte VDisk disk as LUN 2, Target iqn.2010-06.com.nutanix:linuxvg-a4df083e-babe-47de-
9857-5596e1a0effb, VDisk 179677, NFS:4:0:262
/home/nutanix/data/logs/stargate.NTNX-16SM13150152-C-CVM.nutanix.log.INFO.20170613-162520.23713:I0619 15:44:35.431592 23729
iscsi_target.cc:59] Added LUN 2 for VDisk 'NFS:4:0:262' to iqn.2010-06.com.nutanix:linuxvg-a4df083e-babe-47de-9857-5596e1a0effb

51

Logs of Interest – LUN Event Messages


• LUN Event messages can be useful in troubleshooting LUN creation and deletion
operations.
Labs

Module 8
Acropolis Block Services Troubleshooting

52

Labs
Thank You!

53

Thank You!
Module 9
DR
Nutanix Troubleshooting 5.x

Module 9 DR
Course Agenda

1. Intro 7. AFS
2. Tools & Utilities 8. ABS
3. Services & Logs 9. DR
4. Foundation 10. AOS Upgrade
5. Hardware 11. Performance
6. Networking

Course Agenda
• This is the DR module.
Data Protection

Data Protection
Dat a Protect ion O verview

• It is an important business requirement to many that critical


data and applications be replicated to an alternate site for
backup/disaster recovery.
• You can implement a Data Protection strategy by configuring
Protection Domains and remote sites through the Prism web
console.
• Data Protection implementations and status across the
cluster can be monitored from Prism.

Data Protection Overview


• Replication is a fundamental component of any enterprise data protection solution,
ensuring that critical data and applications can be reliably and efficiently replicated to
a different site or a separate infrastructure. While enterprise IT architects have many
technology options, there are replication capabilities that are requisite for any
successful enterprise data protection initiative.
• Per-VM Backup. The ability to designate certain VMs for backup to a different site is
particularly useful in branch office environments. Typically, only a subset of VMs
running in a branch location require regular back up to a central site. Such per-VM
level of granularity, however, is not possible when replication is built on traditional
storage arrays. In these legacy environments replication is performed at a coarse
grain level, entire LUNs or volumes, making it difficult to manage replication across
multiple sites.
• Selective Bi-directional Replication. In addition to replicating selected VMs, a
flexible replication solution must also accommodate a variety of enterprise topologies.
It is no longer sufficient to simply replicate VMs from one active site to a designated
passive site, which can be "lit up" in the event of a disaster. Supporting different
topologies demands that data and VMs can be replicated bi-directionally.

• Synchronous Datastore Replication (Metro Availability). Datastores can be
spanned across two sites to provide seamless protection in the event of a site
disaster.
Protect ion Domains

• A protection domain is a defined set of virtual machines, volume


groups, or storage containers to be protected.
• There are two types of protection domains:
o Async DR: A defined group of entities that are backed up locally on
a cluster and optionally replicated to one or more remote sites.
o Metro Availability: Consists of a specified (active) storage container
in the local cluster linked to a (standby) container with the same
name on a remote site in which synchronous data replication occurs
when Metro Availability is enabled.

Protection Domains
Dat a Protect ion Concept s

• Remote Site: A separate cluster used as a target location to


replicate backed-up data. You can configure one or more
remote sites for a cluster.
• Replication: The process of asynchronously copying
snapshots from one cluster to one or more remote sites.
• Schedule: A property of a Protection Domain that specifies the
intervals to take snapshots and how long the snapshots should
be retained. A schedule optionally specifies which remote site
or sites to replicate to.

Data Protection Concepts


Dat a Protect ion Concept s (co nt ’d )

• Consistency Group: A subset of the entities in a protection


domain. All entities within a consistency group for that
protection domain are snapshotted in a crash-consistent
manner.
• Snapshot: A read-only copy of the state and data of a VM or
volume group at a point in time.
• Time Stream: A set of snapshots that are stored on the same
cluster as the source VM or volume group.

Data Protection Concepts (cont’d)


Met ro Availabilit y

• Nutanix offers a Metro Availability option that spans a


datastore across two sites.
• This is accomplished by pairing a container on the local
cluster with one on a remote site and then synchronously
replicating data between the local (active) and remote
(standby) containers.
• Metro Availability minimizes application downtime due to
planned or unplanned failure events.
• Metro Availability is only supported on ESXi.
9

Metro Availability
Async DR

• An Async Protection Domain is a group of VMs that are backed up


locally on a cluster and optionally replicated to one or more remote
sites.
• Asynchronous data replication is used to create snapshots
• A Protection Domain on a cluster will be in one of two modes:
o Active: Manages live VMs; Makes replications and expires snapshots.
o Inactive: Receives snapshots from a remote cluster.
• Async DR is the supported data protection method for the Nutanix
AHV hypervisor.

10

Async DR
Async DR (co nt ’d )

11

Async DR (cont’d)
• Clicking the Async DR tab displays information about Protection Domains configured
for asynchronous data replication in the cluster. This type of Protection Domain
consists of a defined group of virtual machines to be backed up (snapshots) locally
on a cluster and optionally replicated to one or more remote sites.
o The table at the top of the screen displays information about all the configured
Protection Domains, and the details column (lower left) displays additional
information when a Protection Domain is selected in the table. The following table
describes the fields in the Protection Domain table and detail column.
o When a Protection Domain is
selected, Summary: protection_domain_name appears below the table, and
action links relevant to that Protection Domain appear on the right of this line. The
actions vary depending on the state of the Protection Domain and can include one
or more of the following:
• Click the Take Snapshot link to create a snapshot (point-in-time backup) of this
Protection Domain.
• Click the Migrate link to migrate this protection domain.
• Click the Update link to update the settings for this protection domain.
• Click the Delete link to delete this protection domain configuration.
o Eight tabs appear that display information about the selected Protection Domain:
• Replications, VMs, Schedules, Local Snapshots, Remote
Snapshots, Metrics, Alerts, Events.

Async DR Configurat ion – Protect ion


Domain

12

Async DR Configuration – Protection Domain


• In order to create an Async DR Protection Domain, navigate to the Data Protection
pane in Prism, click the + Protection Domain button and select Async DR. The
Protection Domain (Async DR) creation screen displays.
• Enter a name for the Protection Domain and click on the Create button – this will
immediately create the Protection Domain even if you do not continue to the next
step. The Protection Domain name has the following limitations:
o The maximum length is 75 characters
o Allowed characters are uppercase and lowercase standard Latin letters (A-Z and
a-z), decimal digits (0-9), dots (.), hyphens (-), and underscores (_).
Async DR Configurat ion – Protected VMs

13

Async DR Configuration – Protected VMs


• Next, the Entities tab is displayed, which is divided into two sections –
Unprotected Entities and Protected Entities.
o The administrator should check the boxes next to the VMs that they want in a
consistency group from the list of unprotected VMs. The list can be filtered by
entering a VM or hostname in the Filter By: field. Protection Domains can have
multiple consistency groups. You can protect a maximum of 50 VMs within any
number of consistency groups (up to 50) in a single Protection Domain.

• Select a consistency group for the checked VMs. Click the Use VM Name button to
create a new consistency group for each checked VM with the same name as that
VM.
• Click the Use an existing CG button and select a consistency group from the
drop-down list to add the checked VMs to that consistency group, or click the Create
a new CG button and enter a name in the field to create a new consistency group
with that name for all the checked VMs.
• Click the Protect Selected Entities button to move the selected VMs into the
Protected VMs column.
• Click Next at the bottom of the screen when all desired consistency groups have
been created.
Async DR Configurat ion – Schedule

14

Async DR Configuration – Schedule


• The Schedule screen displays for the administrator to configure a backup schedule
for the Protection Domain. As there is no default backup schedule defined, all
schedules must be defined by the user. The following define the various fields
presented in the Schedule screen:

• Repeat every ## [minutes|hours|days]: Click the appropriate circle for minutes,


hours, or days and then enter the desired number in the box for the scheduled time
interval. The interval cannot be less than an hour, so the minutes value must be at
least 60.

• Occur [weekly|monthly]: Select which days the schedule should run.
o If you select weekly, check the boxes for the days of the week the schedule should
run.
o If you select monthly, enter one or more integers (in a comma separated list) to
indicate which days in the month to run the schedule. For example, to run the
schedule on the first, tenth, and twentieth days, enter "1,10,20”.
• Start on: Enter the start date and time in the indicated fields. The default value is the
current date and time. Enter a new date if you want to delay the schedule from
starting immediately.
o Note: A container-level protection domain requires a system metadata scan
(Curator process) to populate the file list. In some cases, this scan might take a
few hours. Any snapshots started before the scan completes do not contain any
data.
• End on: To specify an end date, check the box and then enter the end date and time
in the indicated fields. The schedule does not have an end date by default, and it
runs indefinitely unless you enter an end date here.
• Retention Policy: Enter how many intervals of snapshots you want to
retain.
For example, if you enter 1 in the retention policy check box and select one of
the following:
o For Repeat every minutes/hours/days option: Total of only 1 snapshot is
retained.
o For Repeat weekly/monthly option: Total of 1 for each period/occurrence is
retained. Suppose if you select Occur weekly with Monday, Tuesday, Friday, a
total of 3 snapshots are retained.

• Enter the number to save locally in the Local line keep the last xx snapshots field.
The default is 1.
o A separate line appears for each configured remote site. To replicate to a remote
site, check the remote site box and then enter the number of snapshots to save on
that site in the appropriate field. Only previously configured remote sites appear in
this list.
• The number of snapshots that are saved is equal to the value that you have entered
in the keep the last ## snapshots field + 1. For example, if you have entered the
value keep the last ## snapshots field as 20, a total of 21 snapshots are saved.
When the next (22nd) snapshot is taken, the oldest snapshot is deleted and replaced
by the new snapshot.

• When all of the field entries are complete, click Create Schedule.
Async Protect ion Domain Failover -
Planned
• A Protection Domain can be migrated to a remote site as part
of planned system maintenance.
• This action is performed using the Migrate option from the
Async DR tab of the Data Protection pane.

15

Async Protection Domain Failover – Planned


Migrating a Protection Domain does the following:
• Creates and replicates a snapshot of the Protection Domain.
• Powers off the VMs on the local site.
• Creates and replicates another snapshot of the Protection Domain.
• Unregisters all VMs and removes their associated files.
• Marks the local site Protection Domain as inactive.
• Restores all VM files from the last snapshot and registers them on the remote site.
• Marks the remote site protection domain as active.
The VMs on the remote site are not powered on automatically. This allows you to
resolve any potential network configuration issues, such as IP address conflicts,
before powering on the VMs.
Async Protection Domain Failover -
Unplanned

• When a site disaster, or unplanned failover event occurs, the


target Protection Domain will need to be activated.
• This action is performed using the Activate option from the
Async DR tab of the Data Protection pane.

16

Async Protection Domain Failover – Unplanned


• The Activate operation does the following:
o Restores all VM files from the last fully-replicated snapshot.
o Registers the VMs on the recovery site.
o Marks the failover site Protection Domain as active.
• The VMs are not powered on automatically. This allows you to resolve any potential
network configuration issues, such as IP address conflicts, before powering on the
VMs.
Async Protect ion Domain Failback

• The following steps can be performed to failback a Protection


Domain from the remote site to the primary site.
• From the Prism web console of the remote site, the Migrate
action under the Async DR tab of the Data Protection pane is
used.

17

Async Protection Domain Failback


Async Protect ion Domain Limit at ions

• Protection Domains can have no more than 50 VMs.


• To be in a Protection Domain, a VM must be entirely on Nutanix
(no external storage).
• It is not possible to make snapshots of entire filesystems or
containers.
• The shortest possible snapshot frequency is once per hour.
• Do not inactivate and then delete a Protection Domain that
contains VMs. Either delete the Protection Domain without
inactivating it, or remove the VMs from the Protection Domain
before deleting it.
18

Async Protection Domain Limitations


Troubleshooting

Troubleshooting
Cerebro Service Master

• In order to troubleshoot Data Protection related issues, the


Cerebro logs are a good place to start. Like other services
inside of a Nutanix box, the Cerebro service (port 2020) has a
master that is elected.
• In order to find the master, the following command can be
used: cerebro_cli get_master_location
• Alternatively, logging into the 2020 page will provide a handle
to the Cerebro master.

20
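• A minimal sketch of locating the Cerebro master from any CVM; the text-mode links browser
used here is normally present on the CVM:
o $ cerebro_cli get_master_location
o $ links http://0:2020
o The first command prints the CVM currently holding the Cerebro master role; the second
opens the local 2020 diagnostics page, which provides a handle to the master.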
Cerebro Diagnostics Page

• The Cerebro Service Diagnostics Page contains some


service related information, such as the Start Time and
Master since fields, which would be helpful in identifying any
sort of process crash or restart instances.
• A consistently changing Incarnation ID would indicate that
the service is in a crash loop.

Cerebro Diagnostics Page


Cerebro Diagnostics Page

• Drilling down into a specific Protection Domain will display the


associated Meta Ops.
• Every time a replication starts (either scheduled or manually),
a meta op will be generated to keep track of the activity.

Cerebro Diagnostics Page


snapshot_tree_printer

• Displays the snapshot chains. Allows the parent-child


relationships between vDisks to be verified.
• Each line of the snapshot_tree_printer output shows details
about one complete snapshot chain.
• There are multiple entities listed in each line and each entity is
enclosed in square brackets.

snapshot_tree_printer
snapshot_tree_printer

snapshot_tree_printer
Connect ion Errors

• Ensure that the primary and remote sites are reachable via
ping.
• Verify that ports 2020 and 2009 are open between the source
and destination. The following loop can be used to verify port
connectivity when run from the primary site:
• for i in `svmips`; do (echo "From the CVM $i:";
ssh $i 'for i in `source /etc/profile;ncli rs ls
|grep Remote|cut -d : -f2|sed 's/,//g'`; do echo
Checking stargate and cerebro port connection to
$i ...; nc -v -w 2 $i -z 2009; nc -v -w 2 $i -z
2020; done');done
7

Connection Errors
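• For a single remote CVM, the same port checks performed by the loop on the slide can be run
by hand. A minimal sketch with a placeholder remote CVM IP:
o $ nc -v -w 2 10.30.15.50 -z 2009
o $ nc -v -w 2 10.30.15.50 -z 2020
o Port 2009 is Stargate and port 2020 is Cerebro; both must be reachable between the sites for
replication to work.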
cerebro.INFO

• The cerebro.INFO log is useful in troubleshooting DR-


related issues.
• Tasks can be tracked from the log such as when details of a
remote site are gathered and when a snapshot of a VM is
taken.

cerebro.INFO
• I0719 09:14:24.648830 8133 fetch_remote_cluster_id_op.cc:672] Fetched
cluster_id=47785 cluster_incarnation_id=1499975040912401
cluster_uuid=00055438-277c-4c11-0000-00000000baa9 for remote=remote01 with
dedup support 1 with vstore support 1 cluster operation mode 0

• I0719 09:14:29.651686 8133 snapshot_protection_domain_meta_op.cc:4343] Start


snapshotting consistency group Server01 with order number: 2147483647

• I0719 09:14:29.719172 8133 snapshot_protection_domain_meta_op.cc:3654]


Snapshot for CG Server01 (order number: 2147483647 ) completed with error
kNoError
Labs

Labs
Labs

Module 9 DR
Lab Configuration Protection Domains

10

Labs
• Module 9 DR
• Lab Configuration Protection Domains
Thank You!

11

Thank You!
Module 10
AOS Upgrade Troubleshooting
Nutanix Troubleshooting 5.x

Module 10 AOS Upgrade Troubleshooting


Course Agenda

1. Intro 7. AFS
2. Tools & Utilities 8. ABS
3. Services & Logs 9. DR
4. Foundation 10. AOS Upgrade
5. Hardware 11. Performance
6. Networking

Course Agenda
• This is the AOS Upgrade module.
Objectives

After completing this module, you will be able to:


• Review AOS Upgrade Prerequisites
• Review all AOS Release Notes and Security Advisories
• Understand Acropolis Recommended Upgrade Paths
• Run NCC Before Performing AOS Upgrades
• Prepare to Upgrade AOS
• Perform an AOS Upgrade using One-Click in Prism
• Troubleshoot AOS Upgrade Issues
• Identify AOS Upgrade Logs of Interest
• Add Node to Cluster Considerations
3

Objectives
After completing this module, you will be able to:
• Review AOS Upgrade Prerequisites.
• Review all AOS Release Notes and Security Advisories.
• Understand Acropolis Recommended Upgrade Paths.
• Run NCC Before Performing AOS Upgrades.
• Prepare to Upgrade AOS.
• Perform an AOS Upgrade using One-Click in Prism.
• Troubleshoot AOS Upgrade Issues.
• Identify AOS Upgrade Logs of Interest.
• List Add Node to Cluster Considerations.
Before Upgrading AOS

Before Upgrading AOS


Review AOS Upgrade Prerequisites

Upgrade Prism Central First.

Controller VM where Upgrade is Initiated Restarts.

CVM Memory Increase for 5.1 Upgrade.

Review AOS Upgrade Prerequisites


• Prism Element provides a One Click to perform AOS upgrades seamlessly on a
Nutanix Cluster. Nutanix releases major and minor upgrades for the AOS software
regularly.

• In the Prism Element interface there is a One Click to Upgrade the AOS version on
the Nutanix Cluster. Acropolis performs a rolling upgrade with no downtime for the
virtual machines.

• Click the Gear icon in Prism Element, then click Upgrade Software to open the
Upgrade Software dialogue box. The Upgrade Software dialogue box has several
tabs to perform different one-clicks for various software components in the Nutanix
Cluster.

• The upgrades for software components include Acropolis, File Server, Hypervisor,
Firmware, Foundation, and Container.

• The Acropolis tab is where to perform the AOS rolling Upgrades. On the Acropolis
tab in the Upgrade Software dialogue box you can Enable Automatic Download.
• When Enable Automatic Download is checked, the cluster periodically checks
for new versions and automatically downloads the software package to the cluster
when one is available.

• The Cluster will connect to the Nutanix download server over the internet to check for
new versions. Internet connectivity and properly configured DNS settings are
required for the Enable Automatic Downloads to work successfully.

• For customers that cannot provide internet access to the Nutanix clusters (for
example, a dark site), the binaries and corresponding .json file can be downloaded
manually from the Downloads link on the Nutanix Support Portal to a local machine
first. The Upgrade Software dialogue box has a link to upload the binaries manually:
click the link, browse to the software and .json file, and choose Upload Now.

• Due to the increased number of services introduced in 5.x, more memory is needed
by the common core and Stargate processes. During the upgrade to release 5.1, a
feature was introduced to increase the memory allocated to CVMs on nodes to
address the higher memory requirements of Acropolis.

• When upgrading to AOS 5.1 there will be an increase to the CVM memory for all
nodes in the cluster with 64GB of memory or greater. For any nodes in the cluster
that have less than 64GB, no memory increase will occur on the CVM.

• For nodes identified as candidates for the CVM memory increase (nodes with 64 GB
of RAM or more and less than 32 GB allocated to the CVM), the CVM memory is
increased by 4 GB.
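To confirm how much memory a CVM actually has before and after the upgrade, standard Linux tools can be used from the CVM (a simple sketch; allssh runs the same check on every CVM in the cluster):

nutanix@cvm$ free -g
nutanix@cvm$ allssh free -g

Compare the total memory reported before the upgrade with the value after the 5.1 upgrade to verify that the expected 4 GB increase was applied on eligible nodes.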
Upgrade Prism Central First

Prism Central Upgrade Paths
• Prism Central 4.7 -> Prism Central 5.1 or Later
• Prism Central 4.6.x and Previous -> Prism Central 4.7 -> Prism Central 5.1 or Later

• Firewall Access to Nutanix Download Site
• Choose Upgrade Prism Central from the Gear Icon
• Prism Central 5.1 Supports Acropolis Upgrades
• Enable Automatic Download to Check for New Versions
• Manual Download

Upgrade Prism Central First


• If the Nutanix cluster is being managed by Prism Central, then Upgrade Prism
Central first to the version compatible with the AOS version the cluster will be
upgraded to.

• Prism Central provides a One-Click to upgrade to newer versions of Prism Central.


o Under the Gear icon click the link Upgrade Prism Central.
• The Upgrade Prism Central dialogue box looks similar to the AOS Upgrade
dialogue box.

• There is an Enable Automatic Download choice for Prism Central to automatically
check for new versions.
o You also have the option to upload the software binaries manually to the cluster.

• You can now upgrade AOS on some or all of the clusters registered to and managed
by Prism Central.
o The upgrade procedure, known as 1-click Centralized Upgrade, enables you to
upgrade each managed cluster to a specific version compatible with Prism
Central.
• Acropolis Upgrade support starts in Prism Central 5.1.
Review All AOS Release Notes & Security Advisories

Support Portal – Documentation – Security Advisories

Review All AOS Release Notes & Security Advisories


• Review all Release Notes and Security Advisories for the AOS version of the
upgrade.
o The Nutanix Support Portal is where to find the articles.
• On the Support Portal page under Documentation is the link to Security Advisories.
Review All AOS Release Notes & Security Advisories (cont.)

Support Portal – Software Documents Page – AOS Version – Release Notes

Review all AOS Release Notes & Security Advisories (cont.)


• Go to the Software Documents main page on the Nutanix Support portal.
• Filter by:
o Software type: AOS
o Release: X.X
o All
Understand Acropolis Recommended Upgrade Paths

Link Under Documentation


Acropolis Recommended Upgrade Paths


• Before upgrading AOS always check the Nutanix Support Portal for the
recommended upgrade paths.
• The Acropolis Upgrade Paths link to the page can be found under the
Documentation link at the top of the Support page.
Acropolis Release Cadence and Numbers

AOS Release Numbers - Example: W.X.Y.Z (5.1.0.3)
• W = Major (5)
• X = Minor (1) - starting @ 1
• Y = Maintenance (0)
• Z = Patch (3)

• Major and Minor Releases are typically made available every 3 to 6 months
• Maintenance Releases for the latest Major or Minor release are typically made available every 6 to 12 weeks
• Patch Releases are made available on an as-needed basis

Acropolis Release Cadence and Numbers


Run NCC Before Performing AOS Upgrades

Actions

• Always Upgrade NCC to the Latest Version Using One Click In Prism Element
• NCC Health Checks can be Run from the Health Page in Prism Element

Run NCC before Performing AOS Upgrades


• Before performing the AOS Upgrade an NCC health check needs to be run to
identify any issues. After running the NCC, if there are any issues reported then
resolve before performing the upgrade.

• Make sure that the latest version of NCC is installed on the cluster. The Upgrade
Software dialogue box has a one click for upgrading NCC software. You can
manually upload the new NCC installer file to the cluster or if Enable Automatic
Download is checked the cluster will check for new versions periodically.

• NCC can also be upgraded to the latest version manually. The NCC installer file will
be either a single installer file (ncc_installer_filename.sh) or an installer file in a
compressed tar file.

• The NCC single file or tar file needs to be copied to a directory on one of the
Controller VMs. The directory where the file is copied must exist on all CVMs in the
cluster.

• Using SCP or WinSCP, copy the file to the /home/nutanix directory on any CVM in
the cluster. Use the following command to check the MD5 value of the file after
copying it to the cluster:
nutanix@cvm$ md5sum ./ncc_installer_filename.sh

• If this does not match the MD5 value published on the Support Portal, then the file
is corrupted. Try downloading the file again from the Support Portal.

• For a single NCC installer file perform the following steps to upgrade the NCC
software for the entire cluster:

• On the CVM where the single NCC installer file was copied run the following
commands:

nutanix@cvm$ chmod u+x ./ncc_installer_filename.sh

• This command modifies permissions to add the x execute permission to the
installer file.

nutanix@cvm$ ./ncc_installer_filename.sh

• This command upgrades NCC on all of the nodes in the cluster.

• If the Installer file is packaged in a tar file then run the following commands:

nutanix@cvm$ tar xvmf ncc_installer_filename.tar.gz --recursive-unlink

• This command will extract the Installer package.

nutanix@cvm$ ./ncc/bin/install.sh [-f install_file.tar]

• This command copies the tar file to each CVM in the cluster and performs the
upgrade.

• If the installation completes successfully a Finished Installation message displays.

• To troubleshoot the upgrade there are log files depending on the version in the
following two locations:

• /home/nutanix/data/logs/
o OR
• /home/nutanix/data/serviceability/ncc
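After the NCC upgrade completes, it is worth confirming that every CVM reports the new version. A minimal sketch (the --version flag is an assumption here; the installed version string also appears at the top of any ncc health_checks run):

nutanix@cvm$ allssh "ncc --version"

If any CVM reports the old version, re-run the installer and review the NCC installation logs in the locations listed above.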
Run NCC Before Performing AOS Upgrades
Command Line Output Snippet:
nutanix@NTNX-16SM13150152-A-CVM:10.30.15.47:~$ ncc health_checks run_all
ncc_version: 3.0.2-a83acc39
cluster id: 47802
cluster name: Eric
node with service vm id 4
service vm external ip: 10.30.15.47
hypervisor address list: [u'10.30.15.44']
hypervisor version: 3.10.0-229.26.2.el6.nutanix.20160217.2.x86_64   <-- Hypervisor Version
ipmi address list: [u'10.30.15.41']
software version: euphrates-5.1.0.1-stable   <-- AOS (Acropolis) Version
software changeset ID: 419aa3a83df5548924198f85398deb20e8b615fe
node serial: ZM163S033945
rackable unit: NX-1065S
node with service vm id 6
service vm external ip: 10.30.15.49
hypervisor address list: [u'10.30.15.46']
hypervisor version: 3.10.0-229.26.2.el6.nutanix.20160217.2.x86_64
ipmi address list: [u'10.30.15.43']
software version: euphrates-5.1.0.1-stable
software changeset ID: 419aa3a83df5548924198f85398deb20e8b615fe
node serial: ZM163S033719
rackable unit: NX-1065S
Running : health_checks run_all
[================================== ] 98%

Run NCC Before Performing AOS Upgrades


• This is an example of running NCC from the command line of one of the CVMs in the
cluster.
o The following command runs the health checks:

ncc health_checks run_all

• You can check any NCC-related messages in /home/nutanix/data/logs/ncc-output-latest.log.
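To pull out only the checks that need attention from a long run, the output log can be filtered. A simple sketch (status keywords such as FAIL and WARN are the usual NCC result labels, but exact strings can vary by NCC version):

nutanix@cvm$ grep -E "FAIL|WARN|ERR" /home/nutanix/data/logs/ncc-output-latest.log

Resolve any failures or warnings reported here before starting the AOS upgrade.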
Preparing to Upgrade AOS

Things to Perform before AOS Upgrade


$cluster enable_auto_install
• Enabled by Default

Replication Operations are Not Allowed During Upgrade - Disable
• nutanix@cvm$ ncli protection-domain list
• nutanix@cvm$ ncli protection-domain list-replication-status name=pd_name
• nutanix@cvm$ ncli pd ls-schedules name=protection_domain_name
• nutanix@cvm$ ncli pd remove-from-schedules name=protection_domain_name id=schedule_id
• nutanix@cvm$ ncli alerts update-alert-config enable=false
• Disable Alerts for CVM Reboot

Review Upgrade History File
cat /home/nutanix/config/upgrade.history

Perform 1-Click Upgrade Software from the Gear Icon

Preparing to Upgrade AOS


• There are several things to prepare before performing the AOS one-click software
upgrade. The following command has to be run to enable auto install on one of the
CVMs:

$cluster enable_auto_install

• If the cluster has any Protection Domains set up for disaster recovery, replication is
not allowed during the AOS upgrade. Use the following commands to verify
Protection Domains on the cluster and any outstanding replication updates
processing:

nutanix@cvm$ ncli protection-domain list

• This command will list all the Protection Domains for the cluster. The Protection
Domain name will be used in the following command to check the replication status:

nutanix@cvm$ ncli protection-domain list-replication-status name="pd_name"

• Be sure to check the replication status for all Protection Domains. If any replication
operations are in progress, output similar to the following is displayed:
ID : 12983253
Protection Domain : pd02
Replication Operation : Receiving
Start Time : 09/20/2013 17:12:56 PDT
Remote Site : rs01
Snapshot Id : rs01:54836727
Aborted : false
Paused : false
Bytes Completed : 63.15 GB (67,811,689,418 bytes)
Complete Percent : 84.88122

• If no replication operations are in progress, no output is displayed and you can
continue upgrading.

• List the Protection Domain schedules, delete them for the duration of the upgrade,
and then re-create them after the upgrade is complete.

• The CVMs are upgraded in parallel. Then, serially, each CVM has to be rebooted and
its memory increased if upgrading to 5.1. During the reboot, alerts will be fired on the
cluster indicating a down CVM. You can disable the alerts during the upgrade and
then re-enable them after completion:

nutanix@cvm$ ncli alerts update-alert-config enable=false
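Pulled together, the preparation steps above amount to a short pre-upgrade checklist that can be run from any CVM (pd_name and schedule_id are placeholders taken from the earlier command output):

nutanix@cvm$ cluster enable_auto_install
nutanix@cvm$ ncli protection-domain list
nutanix@cvm$ ncli protection-domain list-replication-status name=pd_name
nutanix@cvm$ ncli pd ls-schedules name=pd_name
nutanix@cvm$ ncli pd remove-from-schedules name=pd_name id=schedule_id
nutanix@cvm$ ncli alerts update-alert-config enable=false

Record the schedules you remove so they can be re-created after the upgrade, and remember to re-enable alerts when the upgrade completes.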


Upgrading AOS

Upgrading AOS
Perform an AOS Upgrade Using One-Click in Prism Element

Upgrade Software Dialogue Box

Perform an AOS Upgrade Using One-Click in Prism Element


• Prism Element provides a One-Click Upgrade Software option to perform an
Acropolis Upgrade. Upgrading the Acropolis software copies and installs new
binaries on the Controller VM for each node of the cluster. The upgrade is
performed dynamically with no down time to the cluster or user virtual machines.

• Once the software binaries are uploaded to the cluster the Upgrade Software
dialogue box shows the Upgrade button. When you click Upgrade it gives you two
options:

o Pre-upgrade or
o Upgrade Now.

• The Pre-upgrade runs thru the pre-checks but does not actually perform the
upgrade. The pre-upgrade can reveal any issues in the cluster that will prevent the
upgrade from running successfully.

• The Upgrade Now option will perform the pre-checks also, if there are no issues
then the upgrade will proceed.

• If the upgrade pre-checks fail, you can review the following logfile:
/home/nutanix/data/logs/preupgrade.out

• …to assist in troubleshooting. This file is placed randomly on one of the CVMs in the
cluster. To search for which CVM has the log file to review run the following
command:

$allssh ls -ltr /home/nutanix/data/logs/pre*

• The command will return the CVM where the files were written. SSH to that CVM and
review the contents for errors that are causing the upgrade pre-checks to fail.
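A quick way to narrow down a pre-check failure once the logfile has been located (the grep patterns are only a starting point; the tool-tip text from the Prism Tasks page is usually a good search string as well):

nutanix@cvm$ allssh ls -ltr /home/nutanix/data/logs/pre*
nutanix@cvm$ grep -iE "error|fail" /home/nutanix/data/logs/preupgrade.out | tail -20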
Perform an AOS Upgrade Using One-Click in Prism Element

How to Control NOS/AOS Rolling Upgrade Sequence

Disable Auto-upgrade on Cluster

Restart genesis on all Nodes after Disabling Auto-upgrade

Trigger the 1-Click NOS/AOS Upgrade from the Prism GUI

Remove the Marker File from One Node


Perform an AOS Upgrade Using One-Click in Prism Element


• When the AOS upgrade is performed the nodes are upgraded in a random order. If
you would like to control the order in which the nodes get upgraded, then disable
Auto upgrade on the cluster.

$cluster disable_auto_install

• After the cluster disable_auto_install command is run, Genesis needs to be restarted
on all nodes of the cluster to be aware of the changed setting. If you do not restart
Genesis, the cluster will not know about the Auto-upgrade being disabled and will
start the upgrade.

• Once the Genesis service on each node restarts a hidden marker file is created in
the /home/nutanix directory named .node_disable_auto_upgrade.

• You can verify the file exists with the following command:

nutanix@CVM:~$ allssh ls -a .node_disable_auto_upgrade


• The file should now exist on all CVMs in the cluster. Once the file is there you can
now go back to the One-click and select Upgrade Now.

• Proceed to the CVM that you want to upgrade. Remove the hidden marker file on
the node and restart Genesis for the upgrade to proceed on that CVM. Once
completed go to the next CVM and repeat the process until all the nodes are
upgraded.

• The command to delete the file:

nutanix@CVM:~$ rm -rf .node_disable_auto_upgrade

• After you delete the file you must restart Genesis on that node to start the upgrade.
Type the following command to restart Genesis for that node only:

nutanix@CVM:~$ genesis restart
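The controlled-order upgrade described above condenses to the following sequence (all commands are taken from the steps on this and the previous page). On any CVM, before triggering the One-Click upgrade:

nutanix@cvm$ cluster disable_auto_install
nutanix@cvm$ allssh genesis restart
nutanix@cvm$ allssh ls -a .node_disable_auto_upgrade

Then trigger Upgrade Now in Prism, and on each CVM, in the order you want them upgraded:

nutanix@cvm$ rm -rf .node_disable_auto_upgrade
nutanix@cvm$ genesis restart

Repeat the last two commands on the next CVM only after the previous one has finished upgrading.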


Completing Upgrade to AOS 5.x

SNMP Considerations

Download Latest MIB from Prism

Copy to the SNMP Management Station

Re-create the Protection Domain Schedules

Enable Alerts

nutanix@cvm$ ncli alerts update-alert-config enable=true

Completing Upgrade to AOS 5.x


• If you are using SNMP to manage or collect information from the Nutanix clusters,
after the upgrade update the MIB on the SNMP management system.

• The MIB can be downloaded from the Prism web console under the Gear icon —>
SNMP link.

• You will also have to re-create all the Protection Domain asynchronous
schedules.

• Enable Alerts using ncli.


Troubleshooting AOS Upgrades

Troubleshooting AOS Upgrades


Troubleshoot AOS Upgrade Issues – Tasks View

Prism Web Console – Tasks Page
Preupgrade.out Logfile
Install.out Logfile

Troubleshoot AOS Upgrade Issues – Tasks View


• The AOS upgrade creates several tasks. In the Prism web console you can view on
the Tasks page.

• The example in the slide shows the steps performed to upgrade AOS. A task is
created for the following:

• Pre-upgrade steps succeeded or failed. If the task has failed you can hover the
mouse over the task to get a tool tip for the actual error. For more details review
the preupgrade.out logfile.

• Upgrading Acropolis task. Click the Details blue link to see the tasks for each
CVM. The CVM details will show the steps processed to upgrade the CVM.
Review the install.out logfile for more advanced troubleshooting.
Troubleshoot AOS Upgrade Issues

CVM Command to Monitor Upgrade Status for all Nodes in the Cluster


$upgrade_status

Sample Output:
nutanix@NTNX-16SM13150152-B-CVM:10.30.15.48:~/data/logs$ upgrade_status
2017-06-28 14:49:44 INFO zookeeper_session.py:76 Using host_port_list:
zk1:9876,zk2:9876,zk3:9876
2017-06-28 14:49:44 INFO upgrade_status:38 Target release version: el6-release-euphrates-
5.1.0.1-stable-419aa3a83df5548924198f85398deb20e8b615fe
2017-06-28 14:49:44 INFO upgrade_status:43 Cluster upgrade method is set to: automatic rolling
upgrade
2017-06-28 14:49:44 INFO upgrade_status:96 SVM 10.30.15.47 still needs to be upgraded.
Installed release version: el6-release-danube-4.7.5.2-stable-
7c83bbaf29e9b603f3a0825bee65f568d79603b9
2017-06-28 14:49:44 INFO upgrade_status:96 SVM 10.30.15.48 still needs to be upgraded.
Installed release version: el6-release-danube-4.7.5.2-stable-
7c83bbaf29e9b603f3a0825bee65f568d79603b9, node is currently upgrading
2017-06-28 14:49:44 INFO upgrade_status:96 SVM 10.30.15.49 still needs to be upgraded.
Installed release version: el6-release-danube-4.7.5.2-stable-
7c83bbaf29e9b603f3a0825bee65f568d79603b9

Troubleshoot AOS Upgrade Issues
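The upgrade_status command shown above can simply be re-run periodically to follow the rolling upgrade; a minimal sketch of a polling loop from any CVM:

nutanix@cvm$ while true; do upgrade_status; sleep 60; done

Press Ctrl+C to stop the loop once all CVMs report the target release version.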


Troubleshoot AOS Upgrade Issues – Prechecks
Upgrade Pre-checks are written to the preupgrade.out Logfile
$allssh ls -ltr /home/nutanix/data/logs/pre  (to locate the logfile)
Sample Output snippets:
nutanix@NTNX-16SM13150152-C-CVM:10.30.15.49:~/data/logs$ cat preupgrade.out | grep preupgrade_checks
2017-06-28 14:42:56 INFO preupgrade_checks.py:136 No upgradable file server exists
Checking the Version of NCC
2017-06-28 14:43:06 INFO preupgrade_checks.py:1162 ncc tar path:
/home/nutanix/software_uncompressed/nos/5.1.0.1/install/pkg/nutanix-ncc-el6-release-ncc-3.0.2-latest.tar.gz
2017-06-28 14:43:06 INFO preupgrade_checks_ncc_helper.py:141 NCC Version - release-ncc-3.0.1.1-stable-
c4b2eae6e74da3dd004a0c123c3e7df037fd0408
2017-06-28 14:43:08 INFO preupgrade_checks_ncc_helper.py:141 NCC Version - release-ncc-3.0.2-stable-
a83acc398263ed6b10fd2b1da355bedf574ac01b
NCC Bundle with NOS Image is Newer
2017-06-28 14:43:08 INFO preupgrade_checks_ncc_helper.py:225 NCC bundled with NOS image is newer 3.0.1.1 to
3.0.2
2017-06-28 14:43:12 INFO preupgrade_checks_ncc_helper.py:291 Running NCC upgrade command:
'/home/nutanix/ncc_preupgrade/ncc/bin/ncc'
2017-06-28 14:43:25 INFO preupgrade_checks.py:785 Skipping OVA compatibility test for non-PC clusters
2017-06-28 14:43:25 INFO preupgrade_checks.py:670 Skipping PC to PE compatibility check for non-PC clusters
Upgrade Version is Compatible
2017-06-28 14:43:28 INFO preupgrade_checks.py:934 Version 5.1.0.1 is compatible with current version 4.7.5.2
Troubleshoot AOS Upgrade Issues – CVM Hangs
If the upgrade fails to complete on a Controller VM:

Log on to the hypervisor host with SSH or the IPMI remote console.

Log on to the Controller VM from the hypervisor at 192.168.5.254.

Confirm that package installation has completed and that genesis is not running.
nutanix@cvm$ ps afx | grep genesis
nutanix@cvm$ ps afx | grep rc.nutanix
nutanix@cvm$ ps afx | grep rc.local

If any processes are listed, wait 5 minutes and try again.

If none of these processes are listed, it is safe to proceed.

Restart the Controller VM


nutanix@cvm$ sudo reboot

Check the upgrade status again.



Troubleshoot AOS Upgrade Issues – CVM Hangs
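Condensed, the recovery steps on the slide look like this when run from the affected node's hypervisor (192.168.5.254 is the hypervisor's internal link to its local CVM):

hypervisor# ssh nutanix@192.168.5.254
nutanix@cvm$ ps afx | grep genesis
nutanix@cvm$ ps afx | grep rc.nutanix
nutanix@cvm$ ps afx | grep rc.local

If none of these return running processes (other than the grep itself):

nutanix@cvm$ sudo reboot
nutanix@cvm$ upgrade_status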


AOS Upgrade Logs of Interest

AOS Upgrade Logs of Interest


Identify AOS Upgrade Logs of Interest

Genesis leader – find with: allssh ntpq -pn

CVM Logfiles:
All genesis details pre-, during, and post-upgrades
nutanix@cvm$ /home/nutanix/data/logs/genesis.out

Details about svm_upgrade, which is the parallel part of the upgrade that runs on each CVM
simultaneously, where the new AOS bits are applied to the alt-boot partition.
nutanix@cvm$ /home/nutanix/data/logs/install.out

Details about the serial part of the upgrade that is run on one node at a time. Reboot and
memory increase happen here.
nutanix@cvm$ /home/nutanix/data/logs/finish.out

For the memory increase specifically, genesis.out and finish.out are required



Identify AOS Upgrade Logs of Interest


• The AOS Upgrade log files will be helpful in troubleshooting AOS upgrade issues.

• Before the AOS upgrade begins, several upgrade pre-checks run. Among other
things, the pre-checks make sure the upgrade version is compatible with the
current version running on the cluster.

• If the pre-checks fail, then use the preupgrade.out logfile to investigate issues with
the pre-checks and resolve. If the pre-checks pass, then the upgrade will begin.

• The AOS tar file uploaded to the cluster is copied to each CVM. In parallel the
upgrade runs on all the CVMs in the cluster. The two logfiles that can be used to
troubleshoot the upgrade are genesis.out and install.out. install.out will have the
information needed to troubleshoot the actual upgrade on the CVM.

• genesis.out on the master will contain details about the task and upgrade being
scheduled on the cluster. Look in genesis.out if the upgrades are not starting on the
CVMs.
• All CVMs are upgraded in parallel. Once the CVMs are upgraded they need to be
rebooted and memory increased if the upgrade is from any previous AOS release
to any 5.1 releases.

• The reboot and memory increase is logged to the finish.out logfile on the CVM. If any
issues occur during this phase review the finish.out logfile to assist in solving the
issue.
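When an upgrade appears stalled, a quick way to see which phase each CVM is in is to check the three logfiles above on every node and follow the one that matters (a simple sketch using allssh and tail):

nutanix@cvm$ allssh "ls -l /home/nutanix/data/logs/install.out /home/nutanix/data/logs/finish.out"
nutanix@cvm$ tail -f /home/nutanix/data/logs/genesis.out

Run the tail on the Genesis leader when upgrades are not being scheduled, and tail install.out or finish.out on the affected CVM when a single node is stuck.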
Identify AOS Upgrade Logs of Interest

The Upgrade Process Writes these Additional Log Files on the Controller VMs:
/home/nutanix/data/logs/boot.out
Detailed Information for Firstboot

/home/nutanix/data/logs/boot.err
Detailed Information for Firstboot

/home/nutanix/data/logs/svm_upgrade.tar.out

Identify AOS Upgrade Logs of Interest


• There are several other logfiles that may help in troubleshooting AOS upgrade issues
on the CVM.

• The boot.out and boot.err logfiles have detailed information about the firstboot
process.

• The svm_upgrade.tar.out logfile has information about the installer file, but you will
probably not use this logfile very often.
Add Node to Cluster Considerations

Add Node to Cluster Considerations


Add Node to Cluster Considerations

CPU Family Processors and Other Settings

Older Generation Hardware Requirements

Node Addition Configurations

Add Node to Cluster Considerations


Add Node - CPU Family Processors and Other Settings

CVM VLAN Configured
• Factory-Prepared Nodes Discovery and Setup without vLAN Configuration in AHV 4.5 or Later

CPU Features are Automatically Configured in AHV Clusters
• Add Node with a Different Processor Architecture Supported
• AOS 4.5 or Later Releases

Guest Virtual Machines are Not Migrated during an AOS Upgrade

CVM Reboots

Add Node - CPU Family Processors and Other Settings


• Nutanix Clusters can be expanded dynamically. Customers can add Blocks and
Nodes into an existing cluster with no required downtime.

• Purchase new equipment nodes and blocks. Cable the new equipment into the racks.
Use Prism Element and launch the Expand Cluster wizard and dialogue box from
the Gear icon. Expand Cluster wizard and dialogue box can also be launched from
the Hardware page of Prism Element.

• The Nutanix nodes are prepared and imaged at the factory. The default Hypervisor is
AHV on the factory-prepared nodes.

• If the existing cluster has a vLAN tag configured for the CVMs' public interface eth0,
factory-prepared nodes should be able to be discovered and set up. It is best to work
with the network team to configure the new nodes' CVMs with the same vLAN tag.

• The command to configure a vLAN tag for the eth0 ethernet interface is the following:

• SSH to the CVM

nutanix@cvm$ change_cvm_vlan vlan_id


• The command to configure a vLAN tag on the Hypervisor public interface is the
following:

• SSH to the AHV Hypervisor and log in as root.

root@ahv# ovs-vsctl set port br0 tag=host_vlan_tag

• CPU features are automatically set in AHV clusters. The cluster has to be running
AOS version 4.5 or later for the CPU automatic configuration.

• When adding nodes from a different processor class to the existing cluster, ensure
that there are no running VMs on the node.

• CVMs will be rebooted and upgraded serially. The User VMs do not have to be
migrated during AOS upgrades.
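To confirm the tag that is actually applied on the AHV host after running the ovs-vsctl set command, the same port record can be read back (a sketch; ovs-vsctl output formatting may differ slightly between versions):

root@ahv# ovs-vsctl list port br0 | grep tag
root@ahv# ovs-vsctl show

The tag value shown for br0 should match the VLAN configured on the existing cluster's hosts.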
Add Node - Older Generation Hardware Requirements

(Diagram: UVMs running on G5 Broadwell nodes require a power cycle before they can be migrated to an added G4 Haswell node)
Migrating VMs to the Older Hardware Requires a Power Cycle

Add Node - Older Generation Hardware Requirements


• For example, if you are adding a G4 Haswell CPU-based node to a G5 Broadwell
CPU-based cluster, power cycle guest VMs hosted on the G5 nodes before migrating
them to the added G4 node.
Add Node - Node Addition Configurations

• Same Hypervisor and AOS Version? Yes -> Add the Node to the Cluster
• AOS Version Lower - Same Hypervisor Version? Yes -> Upgrade AOS from the Command Line, or Re-image the Node
• AOS Version Same - Hypervisor Different? Yes -> Option to Re-image the Node; No/Else -> Re-image the Node

Add Node - Node Addition Configurations


Factory-prepared nodes have the latest AOS and AHV versions when they are shipped
out to customers.

Node Addition Configurations:


Configuration: Same hypervisor and AOS version
Description: The node is added to the cluster without re-imaging it.

Configuration: AOS version is different
Description: The node is re-imaged before it is added.
Note: If the AOS version on the node is different (lower) but the hypervisor version is the
same, you have the option to upgrade just AOS from the command line. To do this, log
into a Controller VM in the cluster and run the following command:

nutanix@cvm$ /home/nutanix/cluster/bin/cluster -u new_node_cvm_ip_address upgrade_node

After the upgrade is complete, you can add the node to the cluster without re-imaging it.
Alternately, if the AOS version on the node is higher than the cluster, you must either
upgrade the cluster to that version (see Upgrading AOS) or re-image the node.

Configuration: AOS version is same but hypervisor different
Description: You are provided with the option to re-image the node before adding it.
(Re-imaging is appropriate in many such cases, but in some cases it may not be
necessary, such as for a minor version difference.)

Configuration: Data-At-Rest Encryption
Description: If Data-At-Rest Encryption is enabled for the cluster (see Data-at-Rest
Encryption), you must configure Data-At-Rest Encryption for the new nodes. (The new
nodes must have self-encrypting disks.) Note: Re-imaging is not an option when adding
nodes to a cluster where Data-At-Rest Encryption is enabled. Therefore, such nodes
must already have the correct hypervisor and AOS version.

Note: You can add multiple nodes to an existing cluster at the same time.
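Before expanding a cluster it is useful to confirm the AOS version the cluster is currently running, so the new node can be matched, upgraded with the upgrade_node command shown above, or re-imaged accordingly. A minimal sketch (the grep simply filters the version line out of the cluster summary; exact field names can vary by release):

nutanix@cvm$ ncli cluster info | grep -i version
nutanix@cvm$ upgrade_status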
Labs

Module 10
AOS Upgrade

Thank You!


Thank You!
Module 11
Performance
Nutanix Troubleshooting 5.x

Module 11 Performance
Course Agenda

1. Intro 7. AFS
2. Tools & NCC 8. ABS
3. Services & Logs 9. DR
4. Foundation 10. Upgrade
5. Hardware 11. Performance
6. Networking

Course Agenda
• This is the Performance module.
Objectives

After completing this module, you will be able to:
• Define performance-related terminology and basic statistics
• Describe how to properly frame a performance issue
• Describe the layered approach to performance troubleshooting
• Use Prism UI to analyze performance statistics

Objectives
After completing this module, you will be able to:
• Define performance related terminology and basic statistics
• Describe how to properly frame a performance issue
• Describe the layered approach to performance troubleshooting
• Use Prism UI to analyze performance statistics

This module is designed to provide an overview of Performance Troubleshooting


• The intention of this module is to provide a basic, but valuable, starting point in how
to approach performance issues.
• The most important takeaways from this module will be developing an understanding
of basic performance concepts as well as how to properly frame performance issues.
o Proper framing and vetting of performance concerns are keys to successfully
handling performance cases.
Performance

Performance
What do we talk about when we talk about Performance?

Performance cases are based upon customer expectations and requirements
• Not the usual break/fix kind of issues
• It's when expectations or requirements are not being met
§ Our VDI users are complaining that their desktops are slow
§ SQL reports are taking twice as long as they did before
§ We are seeing periods of high latency

What do we talk about when we talk about Performance? (apologies to Raymond


Carver)

Performance cases are based upon customer expectations and requirements


• Not the usual break/fix kind of issues.
• Specifically it’s often when expectations or requirements are not being met.
o Our VDI users are complaining that their desktops are slow.
o SQL reports are taking twice as long as they did before.
o We are seeing periods of high latency.

In the examples above we can infer expectations/requirements


• The VDI users expect snappy desktops.
• SQL reports should be twice as fast as they are now.
• The admin doesn’t expect latency to be as high as what they are seeing.
Performance Troubleshooting

How do you choose a line when checking out at the supermarket?
§ Look for the shortest queue
§ How many items does each person have in the queue?
§ If I only have a few things, is there an express lane?
§ Did the person in front just pull out a check book? OH NO!
§ How efficient/experienced is the cashier?
Performance analysis is something we do all the time

Performance Analysis Is Second Nature

How do you choose a line when checking out at the super market?

Pause and take input from the class, then proceed with the animated bullets.

Typically you’ll hear things related to decisions aimed at ensuring the fastest path out of
the store.
If you encounter a comedian student that feels they are being clever by adding “the
hottest cashier,”
point out that his choice speaks to his definition of Quality-Of-Service where success is
defined by
enjoyment more than expedience.
Performance Concepts
and Basic Statistics

Performance Terminology

Basic Terminology used in the Realm of Performance
• Latency - Time delay between an input and output
§ A 4K read is issued and 10 milliseconds elapse before the data is returned -- the latency for that operation is 10ms
• Throughput - Measure of bandwidth consumption over some interval
§ Usually expressed in terms of KB/sec, MB/sec, as an example
• IOPS - Average number of I/O operations completed per second
§ Reads/sec, Writes/sec, Ops/sec

Performance Terminology

Basic terminology used in the realm of performance


• Latency - time delay between an input and output
o A 4K read is issued and 10 milliseconds elapse before the data is returned -- the
latency for that operation is 10ms
• Throughput - Measure of bandwidth consumption over some interval
o Usually expressed in terms of KB/sec, MB/sec, as an example
• IOPS - average number of I/O operations completed per second
o Reads/sec, Writes/sec, Ops/sec

Most customers generally use the term latency to express dissatisfaction with
performance. We only get cases when latency is deemed to be too high.
Most customers generally talk about throughput when it’s deemed to be too low.
Performance Terminology

Basic Terminology used in the Realm of Performance
• Utilization - Measure of busyness over a time interval.
§ When we say a disk is 50% utilized on average over a 10-second sample, we mean that it was busy for 5 seconds of that 10-second period.
– Thought exercise: It could've been 100% utilized for the first 5 seconds and then idle for 5 seconds.

Performance Terminology

Basic terminology used in the realm of performance


• Utilization- Measure of busyness over a time interval
o When we say a disk is 50% utilized on average over a 10-second sample, we
mean that it was busy for 5 seconds of that 10-second period
• Thought exercise: It could’ve been 100% utilized for the first 5 seconds and then
idle for 5 seconds.

The “thought exercise” here points the way towards an appreciation of what averages
convey. For this example one might assume that this disk could not have been
saturated over the sample interval if only considering the average utilization over the
interval.
Performance Terminology

Basic Terminology used in the Realm of Performance
• OpSize - Average size per operation.
§ Can be approximated from Throughput / IOPS.
§ Prism UI provides UVM OpSize histograms for Reads/Writes.
§ In general there is more cost servicing larger operations.

Performance Terminology

Basic terminology used in the realm of performance


• OpSize - average size per operation
o Can be approximated from Throughput / IOPS
o Prism UI provides UVM OpSize histograms for Reads/Writes
o In general there is more cost servicing larger operations

Typically, very large opsize (> 64KB) might indicate a more sequential workload,
whereas smaller ops are typically random in nature. This is completely application-
dependent.

A sudden shift in workload opsize characteristics, particularly the introduction of larger
opsizes, might speak to why there are noticeable variations in average system latency.
Basic Statistics

The majority of data we consider when doing Performance Analysis are averages calculated over sample intervals.
• The Entity and Metric charts in Prism are derived from samples taken every 30 seconds at its most granular level.
• It's important to understand what an average or statistical mean is meant to convey.

Basic Statistics

The majority of data we consider when doing performance analysis are averages
calculated over sample intervals.
• The Entity and Metric charts in Prism are derived from samples taken every 30
seconds at its most granular level.
• It’s important to understand what an average or statistical mean is meant to convey.

The average is meant to convey a sense of the general tendency of a specific variable
over a sample.
Basic Statistics

Let's Explore some Basic Terminology used in Statistics
• Population - Set of all values, such as all UVM read operations.
• Sample - A small subset of the population, reads over 20 seconds.
• Variable - A characteristic that can vary over a sample.
• Average or Mean - The general tendency of a variable over a sample.
• Variance - A measure of the distribution of the individual variables from the mean.

Basic Statistics

Let’s explore some basic terminology used in statistics


• Population - set of all values, such as all UVM read operations
• Sample - a small subset of the population
• Variable - a characteristic that can vary over a sample (example: Read Latency)
• Average or Mean - the general tendency of a variable over a sample
• Variance - a measure of the distribution of the individual variables from the mean

Talking points:
• When we consider the performance charts in Prism, we are seeing means calculated
over a sample. For “live” data, each sample interval covers 30 seconds. So when we
look at a chart that tells us something about read latency, we are relying upon the
calculated mean to get a sense of performance.
Variance conveys to us how meaningful the mean really is. Prism charts do not convey
variance, but it’s important to understand that an average or mean of a sample can be
skewed greatly by the presence of outliers.
Fun With Numbers

Consider the following set of read latency values collected over an interval (all in ms):
{ 0.71, 0.63, 0.88, 0.67, 0.66, 0.79, 0.82, 1.08, 1.1, 0.78, 0.79, 0.83 }

• Calculate the mean
(0.71 + 0.63 + 0.88 + 0.67 + 0.66 + 0.79 + 0.82 + 1.08 + 1.1 + 0.78 + 0.79 + 0.83) / 12 == 0.811666
§ Let's call it 0.81ms
• How expressive is 0.81ms in describing the general tendency of this sample?

Fun With Numbers

Consider the following set of read latency values collected over an interval (all in
ms):
{ 0.71, 0.63, 0.88, 0.67, 0.66, 0.79, 0.82, 1.08, 1.1, 0.78, 0.79, 0.83 }
• Calculate the mean
(0.71 + 0.63 + 0.88 + 0.67 + 0.66 + 0.79 + 0.82 + 1.08 + 1.1 + 0.78 + 0.79 + 0.83) / 12
== 0.811666
o Let's call it 0.81ms
• How expressive is 0.81ms in describing the general tendency of this sample?
Fun With Numbers

Despite there being some variance within the sample, the calculated mean is a fair approximation of read latency over the sample.

Fun With Numbers

Consider the following set of read latency values collected over an interval (all in
ms):
{ 0.71, 0.63, 0.88, 0.67, 0.66, 0.79, 0.82, 1.08, 1.1, 0.78, 0.79, 0.83 }
• Calculate the mean
(0.71 + 0.63 + 0.88 + 0.67 + 0.66 + 0.79 + 0.82 + 1.08 + 1.1 + 0.78 + 0.79 + 0.83) / 12
== 0.811666
o Let's call it 0.81ms
• How expressive is 0.81ms in describing the general tendency of this sample?
Fun With Numbers

Consider the following set of Read Latency values collected over an interval (all in ms):
{ 0.71, 0.63, 0.88, 0.67, 0.66, 0.79, 0.82, 1.08, 94.5, 0.78, 0.79, 0.83 }

• Calculate the mean.
(0.71 + 0.63 + 0.88 + 0.67 + 0.66 + 0.79 + 0.82 + 1.08 + 94.5 + 0.78 + 0.79 + 0.83) / 12 == 8.595000
§ Let's call it 8.60ms.
(We're using the same data as before with one exception: we've added a significant outlier to the set.)
• How expressive is 8.60ms in describing the general tendency of this sample?

Fun With Numbers

Consider the following set of read latency values collected over an interval (all in
ms):
{ 0.71, 0.63, 0.88, 0.67, 0.66, 0.79, 0.82, 1.08, 94.5, 0.78, 0.79, 0.83 }
• Calculate the mean
(0.71 + 0.63 + 0.88 + 0.67 + 0.66 + 0.79 + 0.82 + 1.08 + 94.5 + 0.78 + 0.79 + 0.83) / 12
== 8.595000
o Let’s call it 8.60ms
• How expressive is 8.60ms in describing the general tendency of this sample?
Fun With Numbers

The presence of an outlier skews our mean to a point where it’s not a
reliable characterization of the general tendency of read latency in
this sample. This sample has a very high degree of variance due to
the one outlier.

Fun With Numbers

Consider the following set of read latency values collected over an interval (all in
ms):
{ 0.71, 0.63, 0.88, 0.67, 0.66, 0.79, 0.82, 1.08, 94.5, 0.78, 0.79, 0.83 }
• Calculate the mean
(0.71 + 0.63 + 0.88 + 0.67 + 0.66 + 0.79 + 0.82 + 1.08 + 94.5 + 0.78 + 0.79 + 0.83) / 12
== 8.595000
o Let’s call it 8.60ms
• How expressive is 8.60ms in describing the general tendency of this sample?
Understand What The Data Conveys

The basic takeaway from the previous few slides is that an average may or may not reveal the general tendency of a variable.
• Suppose your customer is concerned because their UVM application is logging some Write operations exceeding 100ms every hour.
§ Can you rule out Nutanix storage performance after reviewing Write Latency averages collected every 30 seconds, when those averages show values that range between 1ms and 10ms over time?

Understand What The Data Conveys

The basic takeaway from the previous few slides is that an average may or may
not reveal general tendency of a variable
• Suppose your customer is concerned because their UVM application is logging some
write operations exceeding 100ms every hour.
o Can you rule out Nutanix storage performance after reviewing write latency
averages collected every 30 seconds, when those averages show values that
range between 1ms and 10ms over time?

The answer here is no, quite obviously. We will cover this a bit more when we talk
about case framing techniques.
Histograms - Finding the Needle in the Haystack

While statistical mean is oftentimes very useful in analyzing the performance of a system, intermittent and spikey behavior is sometimes washed out.
• A histogram is a means of accounting for ranges of outcomes over time.
• We rely on histograms when we attempt to validate subtle and short-lived changes in behavior that get lost in averaging.

Histograms - Finding the Needle in the Haystack

While a Statistical Mean is oftentimes very useful in analyzing the performance of a
system, intermittent and spikey behavior is sometimes washed out.
• A histogram is a means of accounting for ranges of outcomes over time.
• We rely on histograms when we attempt to validate subtle and short-lived changes
in behavior that get lost in averaging.

Mention that we do have some histogram information in Prism (IO Metrics section) and
also on the 2009/latency page.
Read Latency Histogram - Slide 14

This is a histogram representation of the data from Slide 14, where the mean would convey an average read latency of 8.6ms. The histogram allows us to see the presence of the outlier that is skewing the mean.

Read Latency Histogram - Slide 14

Consider the following set of read latency values collected over an interval (all in
ms):
{ 0.71, 0.63, 0.88, 0.67, 0.66, 0.79, 0.82, 1.08, 94.5, 0.78, 0.79, 0.83 }
• Calculate the mean
(0.71 + 0.63 + 0.88 + 0.67 + 0.66 + 0.79 + 0.82 + 1.08 + 94.5 + 0.78 + 0.79 + 0.83) / 12
== 8.595000
o Let’s call it 8.60ms
• How expressive is 8.60ms in describing the general tendency of this sample?

The histogram is better suited for issues where we are trying to validate the
presence of very short-lived spikes in behavior. In this case, the one read operation
that was ~94.5ms would not be evident if all we had was a graph showing average read
latency of 8.6ms for the sample.
Little's Law

A thorough understanding of Little's Law isn't a requirement, but it does provide a valuable lesson regarding the relationship between Throughput, Latency, and Concurrency.
• Little's Law states that the Population (N) of a system is equal to the Response Time of that system (R) multiplied by the Throughput (X) of that system in a steady state.
§ We tend to characterize R as Latency or Response Time and X as the Number of Ops Completed. We can also think of N as a measure of Concurrency or the number of outstanding ops in the system.

Little’s Law

A thorough understanding of Little's Law isn't a requirement, but it does provide a
valuable lesson regarding the relationship between throughput, latency, and
concurrency.
• Little’s Law states that the Population (N) of a system is equal to the Response
Time of that system (R) multiplied by the Throughput (X) of that system in a steady
state
o We tend to characterize R as Latency and X as the Number of Ops Completed.
We can also think of N as a measure of Concurrency or the number of
outstanding ops in the system.
Little's Law

N = X * R  (Concurrency of a system (N) is equal to Throughput (X) multiplied by the Latency (R))
X = N / R  (Throughput (X) of a system is equal to Concurrency (N) divided by Latency (R))

• How can you increase Throughput (X) of a system?
§ Increase Concurrency (N) or reduce Latency (R).
• We increase concurrency by increasing parallelism.
• We decrease latency by addressing bottlenecks in the system.

Little’s Law

N=X*R
X=N/R

• How can you increase Throughput (X) of a system?


o Increase Concurrency (N) or reduce Latency (R).

Increasing Concurrency
• Open up more checkout lanes at the grocery store.
• Increasing the Number of threads used for processing operations.
• Increasing the Size of each operation, which increases the overall amount of work
done with each op.
Most of the time, it’s a workload modification needed. IOW, increasing concurrency is
usually done within the I/O generator.

Decreasing Latency
• Locate the bottleneck and determine changes needed to reduce/eliminate.
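A quick worked example of the relationship (the numbers are chosen only for illustration): if a vDisk is completing X = 2,000 ops/sec at an average latency of R = 0.001 sec (1ms), then on average N = 2,000 * 0.001 = 2 ops are outstanding at any instant. If the application can keep N = 8 ops in flight at the same 1ms latency, throughput can rise toward X = N / R = 8 / 0.001 = 8,000 ops/sec; if instead latency doubles to 2ms with N still at 2, throughput falls to 2 / 0.002 = 1,000 ops/sec.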
Correlation

Another aspect of Performance Analysis that can be overlooked is using correlation between stats
• Observing a single statistic, such as average latency, without considering other metrics, is not analysis, it's only an observation.
• Analysis involves careful consideration and correlation of multiple statistics.

Correlation

Another aspect of performance analysis that can be overlooked is using correlation
between stats.
• Observing a single statistic, such as average latency, without considering other
metrics, is not analysis, it’s only an observation.
• Analysis involves careful consideration and correlation of multiple statistics.

• Customers will sometimes have concerns about a single metric, typically Latency.
They might see some increase in average Latency and express concern.
o When something like this occurs, it is our job to start investigating all of the
available workload characteristics in order to understand that change in Latency.
o In many cases, an explanation of “why” is what’s needed to satisfy the
customer.
Performance Case Framing

Framing the Issue

Proper framing of a Performance issue is a crucial step towards successful troubleshooting.
• The goal is to state the customer's issue in a way where its mapping to our architecture is clear.
• Proper framing guides our analysis and our approach to troubleshooting the issue.

Framing the Issue

Proper framing of a performance issue is a crucial step towards successful
troubleshooting.
• The goal is to state the customer’s issue in a way where its mapping to our
architecture is clear.
• Proper framing guides our analysis and our approach to troubleshooting the issue.

Essentially our goal is to translate the customer’s problem statement into a clear
understanding of how that aligns to interaction of the Nutanix platform.
Framing the Issue

During your initial engagement with the customer it's imperative that you ensure that the system is generally healthy before diving too deeply into the performance analysis.
• Did you run NCC?
• Broken HW or misconfigurations?
• Have Best Practices been implemented?

Framing the Issue

During your initial engagement with the customer it’s imperative that you ensure
that the system is generally healthy before diving too deeply in the performance
analysis.
• Did you run NCC?
• Broken HW or misconfigurations?
• Have Best Practices been implemented?

Running NCC and getting the full output of all health tests should be done with any
performance engagement.

NCC Caveat: Use common sense. Did NCC identify something that might impact
performance?

Best Practices are a requirement for ensuring optimal performance for many
workloads, such as Microsoft SQL.
Framing the Issue

Successful Handling of Performance Issues Involves:
• Developing a detailed understanding of the customer's issue.
§ Pro Tip: Restate the problem back to the customer and ensure that they agree with your explanation.
• Defining mutually agreed-upon Success Criteria.
§ We need to know when we can declare the issue resolved.

Framing the Issue

Successful handling of performance issues involves:


• Developing a detailed understanding of the customer’s issue.
o Pro Tip: Restate the problem back to the customer and ensure that they agree.
• Defining mutually agreed-upon success criteria.
o We need to know when we can declare the issue resolved.

Customer service professionals will frequently restate their understanding of the key
points of an issue
as well as agreed-upon success criteria and ensure that the customer agrees with how
they are describing it.

One often overlooked detail that is crucial is the Success Criteria. Without having this
established up front, we cannot be sure when the case is resolved.
Framing the Issue

You should ensure that you can answer some key questions:
• What is being measured, how is it being measured, and where is it being measured from?
• When is the issue experienced? Specific timestamps? Timezone?
• When did this start occurring? Has anything changed?
• How is it occurring? Continuously or intermittently? Reproducible?
• Which systems are impacted? What is the business impact?
• What is the expectation (Success Criteria)?

Framing the Issue

You should ensure that you can answer some key questions:
• What is being measured, how is it being measured, and where is it being measured
from?
• When is the issue experienced?
• When did this start occurring? Has anything changed?
• How is it occurring? Continuously or intermittently? Is it reproducible?
• Which systems are impacted? What is the business impact?
• What is the expectation (Success Criteria)?

While we probably won’t ask a customer “Why do think latency should be lower?” or
“Why do you expect higher throughput?”, the “Why” is often implied in the answers to
the other questions. If the problem is something new, as in “It used to work fine” …
that’s your “Why”.

On the other hand, if this is something that has never performed as expected, we may
have to evaluate things differently.
• Was the solution sized appropriately for this workload?
• Are they using best practices for their workload?
• Is anything broken?
Framing the Issue

Proper framing of an issue guides your approach to troubleshooting the issue.
• Is this a read performance issue? Write performance issue? Both?
• Are we focused on total runtime for a task or is this a per-op latency issue?
• Are we only seeing an issue with a specific VM, specific node, specific timeframe?

Framing the Issue

Proper framing of an issue guides your approach to troubleshooting the issue.


• Is this a read performance issue? Write performance issue? Both?
• Are we focused on total runtime for a task or is this a per-op latency issue?
• Are we only seeing an issue with a specific VM, specific node, specific timeframe?

This will not only get us thinking about likely bottlenecks but also focus our attention on
what we should be investigating.
Framing the Issue

Understanding the success criteria is critical.
• Some workloads are more Throughput-sensitive.
§ A backup job or a database report generation task are considered successful when they complete within an acceptable timeframe, or SLA.
• Some workloads are more Latency-sensitive.
§ In general, applications that issue small random reads or writes are far more dependent on how quickly each op completes, like an OLTP application.

Framing the Issue

Understanding the success criteria is critical


• Some workloads are more Throughput-sensitive.
o A backup job or a database report generation task are considered successful
when they complete within an acceptable timeframe, or SLA.
• Some workloads are more Latency-sensitive.
o In general, applications that issue small random reads or writes are far more
dependent on how quickly each op completes, like an OLTP application.
Framing the Issue

Understanding the Success Criteria is critical.
• In general we are less concerned with per-op Latency when dealing with a Throughput-sensitive workload.
§ Typically these workloads are characterized by large continuous sequential operations.
• In general we are less focused on overall Throughput when dealing with Latency-sensitive workloads.
§ Typically these workloads are characterized by small random operations with some think time between arrivals.

Framing the Issue

Understanding the success criteria is critical.


• In general we are less concerned with per-op Latency when dealing with a
Throughput-sensitive workload.
o Typically these workloads are characterized by large continuous sequential
operations.
• In general we are less focused on overall Throughput when dealing with Latency-
sensitive workloads.
o Typically these workloads are characterized by small random operations with
some think time between arrivals.
Framing the Issue

The takeaway here is "it depends."
• You probably shouldn't be concerned with seeing ~20ms average read latency if the workload is Throughput-sensitive and completing well within its SLA.
• You probably shouldn't be concerned at seeing very low Throughput for a Latency-sensitive workload as long as ops are seeing acceptable average latencies.

Framing the Issue

The takeaway here is “it depends.”


• You probably shouldn’t be concerned with seeing ~20ms average read latency if the
workload is Throughput-sensitive and completing well within its SLA.
• You probably shouldn’t be concerned at seeing very low throughput for a Latency-
sensitive workload as long as ops are seeing acceptable average latencies.

That’s not to say that we might not attempt to do something to lower latency for a
Throughput-sensitive workload.
• However, we need to evaluate what would be required to do so.
o It might simply not make sense if this would require more/different hardware or
changes in how the system is configured or utilized.
• As an example, full backups will almost certainly read data that might’ve been down-
migrated to the Cold Tier.
o Would it be worth it to go with an all-flash system just to mitigate HDD read costs
in such a scenario, particularly if all other workloads are doing quite well?
A Layered Approach
to Performance Analysis

A Layered Approach to Performance Analysis.


A Layered Approach

In complex systems, we can measure performance characteristics at various points. What's measured at one layer will account for accumulative costs of doing business with the layers below.
• For Latency, how long an I/O takes to complete at each layer accounts for the RTT at that point in the system.
• As we consider the corresponding measurements from all the layers below, we can start to pinpoint a latency bottleneck by answering: Where are we spending the most time?

A Layered Approach

In complex systems, we can measure performance characteristics at various points.
What's measured at one layer will account for accumulative costs of doing business
with the layers below.
• For Latency, how long an I/O takes to complete at each layer accounts for the RTT at
that point in the system.
• As we consider the corresponding measurements from all the layers below, we can
start to pinpoint the bottleneck by answering:
o Where are we spending the most time?

Another way to say this is that we take a top-down approach when doing analysis of the
system.
• This also helps to convey the importance and relevance of asking the framing
question related to “what are you measuring and where are you measuring it from?”
Knowing where something is measured helps us start to visualize all the potential
places in the system where we might be seeing costs.
A Layered Approach - Thinking “Top Down”

UVM
• Application
• Guest OS
• Driver Stack
• Emulated I/O controller
• Virtual Disks

Hypervisor
• Scheduler/Manager
• NFS/SMB/iSCSI client
• Network Stack
• Driver Stack
• Physical HW and controllers

CVM
• Protocol Adapter (NFS, iSCSI, SMB)
• Admission Control
• vDisk Controller
• Extent Store
• Pass-through Disks (SSD/HDD)

These are some of the software components within our CVM that are collectively
referred to as Stargate.

34

A Layered Approach

Here we provide a very basic and high-level, top-down, breakdown within 3 key
components:
• UVM
• Hypervisor
• CVM

This is not meant to be a complete list of all possible “layers,” but it does show an
approach to thinking about storage I/O as it travels from UVM to our CVM and back.

We could also breakdown our CVM as:


• Stargate application (all of the various software components)
• Linux OS
• EXT4 file system
• Driver stack
• Pass-through disks

There are many Stargate components other than those listed here.
A Layered Approach - Stargate Layers

At a very high level, we can characterize round-trip time spent through the
basic Stargate layers when servicing an I/O.

Protocol Adapter
Admission Control
vDisk Controller
Extent Store

• Latency measured at the protocol adapter level accounts for the round-trip
time spent within the layers below and should be very close to what the
hypervisor sees for NFS/iSCSI/SMB disk latency.
• The deltas between layers account for each layer’s cost.
• Determining where we spend most of that time reveals the bottleneck.

35

A Layered Approach

At a very high-level, we can characterize round-trip time spent through the basic
Stargate layers.

The example seems to convey that most of the time was spent within the vDisk
controller layer.
• From here, we would then investigate that layer further, considering all of the
associated costs within that layer to determine why we are spending time there.
o Answering why within the vDisk layer will lead us towards possible resolutions.
A Layered Approach - 2009/latency

The 2009/latency page

• The 2009/latency page is an excellent example of how we apply this layered
approach.
• Let’s consider average read latency on a specific node.

36

A Layered Approach

The 2009/latency page


• The 2009/latency page is an excellent example of how we apply this layered
approach
• Let’s consider average read latency on a specific node …

In order to enable the 2009/latency stats, run the following on any CVM:

for i in `svmips`; do curl http://$i:2009/h/gflags?latency_tracker_sampling_frequency=1;
curl http://$i:2009/h/gflags?activity_tracer_default_sampling_frequency=1; done

Run the same commands with a value of ‘0’ to set the sampling frequencies back to
their defaults.

• You can then view the page with:
links http:0:2009/latency

• …or you can disable the firewall from a CVM:
allssh sudo service iptables stop
and then point a browser at http://<CVM IP>:2009/latency

These stats continue to accumulate over time, so if you wish to reset the values use:
http://<CVM IP>:2009/latency/reset

Run:
allssh sudo service iptables start
…when you are done.
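
To avoid leaving sampling enabled, the disable step described above (“run the same
with ‘0’”) looks like this as an explicit sketch, run from any CVM:

# Set the sampling frequencies back to their default of 0 on every CVM once
# the investigation is finished.
for i in `svmips`; do
  curl http://$i:2009/h/gflags?latency_tracker_sampling_frequency=0
  curl http://$i:2009/h/gflags?activity_tracer_default_sampling_frequency=0
done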
A Layered Approach - 2009/latency

• Let’s examine average read latency over a single collect_perf iteration.
• Starting at the top-most layer, the protocol adapter, we see an average of
23.154ms per read operation.
• Since this is the top-most layer, it makes sense that 100% of the average
latency is accounted for here.
• Now, let’s examine the layers below …

37

A Layered Approach - 2009-latency

Mention that these are small portions of the overall 2009-latency page.
We are only focused on some of the fields here to illustrate the basic approach to
bottleneck detection.

Q: How many reads did we see over this interval?


A: 47097
A Layered Approach - 2009/latency

• Most of the latency seen at the protocol adapter level is also measured at the
next layer down, Admission Control.
• We are accounting for 99% of the total at this layer, so we know that the layer
above accounted for only ~1% of the total cost.

38
38

A Layered Approach - 2009-latency

Q: How much latency is accounted for at the protocol adapter level?


A: 23154us - 23133us ~= 21us (approximately 21us)
A Layered Approach - 2009/latency

• The next layer down, vDisk Controller, accounts for 14.576ms of the overall
cost … and 62% of the total average latency.

39

A Layered Approach - 2009-latency


A Layered Approach - 2009/latency

• Our last layer, Extent Store, accounts for 1.031ms of the overall cost …
and 4% of the total average latency.

40

A Layered Approach - 2009-latency.

It should be very clear that we are spending most of the time within the vDisk Controller
level.
A Layered Approach

Where are we spending the most time? We can subtract percentages between layers:
• Protocol Adapter: 100% - 99% ~= 1% of the time
• Admission Control: 99% - 62% ~= 37% of the time
• vDisk Controller: 62% - 4% ~= 58% of the time
• Extent Store: 4% of the time

At this stage we know that our most significant bottleneck is time spent in the
vDisk Controller. However, this is just an observation and not a diagnosis.

41

A Layered Approach

Using latency values:


Protocol Adapter: 23154 - 23133 ~= 21us
Admission Control: 23133 - 14576 ~= 8557us
vDisk Controller: 14576 - 1031 ~= 13545us
Extent Store: 1031us
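
A minimal shell sketch of the same subtraction, using this example’s per-layer
values (in microseconds) taken from the 2009/latency output above:

# Per-layer average latency (us) as read from the 2009/latency page.
proto=23154; admctl=23133; vdisk=14576; estore=1031

# Each layer's own cost is its latency minus the latency of the layer below it.
echo "Protocol Adapter : $((proto - admctl)) us"
echo "Admission Control: $((admctl - vdisk)) us"
echo "vDisk Controller : $((vdisk - estore)) us"
echo "Extent Store     : $estore us"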
A Layered Approach

Our next course of action would be investigating the costs within the vDisk
Controller.
• We would then apply the layered approach within the vDisk Controller layer and
determine where we are spending most of our time within this layer.
• Again, the observations made regarding where we are spending time are not a
diagnosis.
§ Detailed knowledge of the system is required to arrive at theories and
proposed resolutions.

42

A Layered Approach

It’s important to realize that observing where we spend the most time is only a small
part of being able to assist in resolving the issue.
• However, we do expect all SREs to be able to make such observations when framing
the case for internal collaboration.
A Layered Approach

Thought exercise: The customer is measuring a persistent ~30ms average write
latency within an application on a Linux UVM, while running a specific workload.
• You were running collect_perf when this was occurring, and over the same
timeframe you see these stats for its vDisk.

• Does their problem seem to be related to Nutanix storage I/O?
• What things might you consider in your investigation?

43

A Layered Approach

Thought exercise: The customer is measuring a persistent ~30ms average read


latency from iostat on a Linux UVM, while running a specific workload.

Tell the class that there is only 1 disk assigned to this UVM.

While this is a very simple example, they should be able to conclude that the Stargate
Layer is a much lower latency average than what is measured within the UVM.
• Therefore, this does not seem like a Nutanix storage I/O issue.

It seems that the issue is somewhere above our CVM, so the investigation would
probably focus on:
- The hypervisor stats,
- The UVM configuration,
- …and so forth.
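
If you also want corroborating numbers from inside the guest, a simple sketch is
to run iostat (from the sysstat package) on the Linux UVM while the workload is
active; exact column names vary by sysstat version, but the await columns show
per-device latency as the guest sees it:

# Extended device statistics, sampled every 5 seconds for 12 samples (~1 minute).
# Compare r_await/w_await for the UVM's disk against the Stargate-layer latency
# over the same window to judge which side of the stack the time is being spent on.
iostat -x 5 12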
A Layered Approach

The bottom line here is that a top-down layered approach is a useful method of
bottleneck identification.
• When one layer accounts for the majority of the overall cost, we’ve located the
most significant bottleneck.
• We can start at very high-level layers initially and then “zoom in” as needed.
§ Each layer can be broken down into its corresponding layers as needed.

44

A Layered Approach

The bottom line here is that a top-down layered approach is a useful method of
bottleneck identification.
• When one layer accounts for the majority of the overall cost, we’ve located the most
significant bottleneck.
• We can start at very high-level layers initially and then “zoom in” as needed.
o Each layer can be broken down into its corresponding layers as needed.

What we mean here is that layers can start off very generalized and then we can focus
in further as needed.
Key Points About
Nutanix Performance

Key Points About Nutanix Performance.


Nutanix Performance

Understanding some basic aspects about the Nutanix platform will help you develop
a sense of intuition when it comes to performance issues.
• What does the I/O path look like for reads and writes?
• What are the key components and services that play a role in Nutanix
performance?

Over the next few slides we will explore these topics.

46

Nutanix Performance

Understanding some basic aspects about the Nutanix platform will help you
develop a sense of intuition when it comes to performance issues.
• What does the I/O path look like for reads and writes?
• What are the key components and services that play a role in Nutanix performance?

Over the next few slides we will explore these topics.

Promote nutanixbible.com.
• There’s a lot of very good information there when it comes to basic architecture and
I/O path.

What we mean by intuition is the combination of knowledge and experience that will
allow you to quickly develop theories as to the likely reasons for a given performance
concern.
• It’s this intuition that helps to guide our exploration of the available data.
Nutanix Performance

From a Nutanix I/O path perspective, we will primarily consider the following
services:
• Stargate - Handles storage I/O within the CVM
• Cassandra - Distributed cluster-wide metadata store
• Medusa - Stargate’s interface to Cassandra metadata

47

Nutanix Performance

From a Nutanix I/O path perspective, we will primarily consider the following
services
• Stargate - Handles storage I/O within the CVM.
• Cassandra - Distributed cluster-wide metadata store.
• Medusa - Stargate’s interface to Cassandra metadata.

Obviously there are a lot more services that run on the CVM, but we will only focus on
the most basic interactions that occur when servicing UVM I/O requests.
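
For orientation on a live cluster, you can confirm these services are running and
reach Stargate’s diagnostic page from any CVM; a minimal sketch, assuming standard
CVM CLI access and the same 2009 port used earlier in this module:

# Show the state of the cluster services (Stargate, Cassandra, and the rest).
cluster status

# Browse the local Stargate diagnostics page (the same port that serves /latency).
links http:0:2009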
Nutanix Performance

Let’s start adding the basic components that play a role in the storage I/O path.

• The Protocol Adapter is the top layer of storage I/O from the perspective of
the CVM. This is where we receive inbound I/O requests from the Hypervisor.
• Next we have Admission Control, which performs rate limiting on the inbound
workload to ensure that large bursts do not overwhelm the underlying
sub-systems. Both of these components are completely implemented in the CVM’s
Memory allocation.
• The next layer is the vDisk Controller. A separate instance is created for each
vDisk in Memory, with some features leveraging the SSD tier …
• The Unified Cache spans Memory and SSD and provides the read cache for user
data and metadata. The metadata cached here is accessed from Cassandra by the
Medusa interface.
• The OpLog is implemented as two separate components:
  - OpLog Index (Memory) is the metadata component of the OpLog.
  - OpLog Store (SSD) is the data store for inbound random writes.
• Cassandra data is contained within the SSD tier across all the CVMs in the
cluster.
• Finally, we have the Extent Store, which spans the SSD and HDD tiers and
provides the long-term persistent storage for user data.

48

Nutanix Performance

Protocol Adapter - This provides the SMB (Hyper-V), NFS (ESXi, AHV), or iSCSI
(Volume Groups) interface to our CVM.
Admission Control - Performs inbound throttling of IOPS with per vDisk, node overall,
and even underlying disk awareness. The overall goal is to smooth out aggressive
bursts of operations. It also serves to prioritize UVM I/O over backend/system I/O on a
node.
Unified Cache - Manages content via LRU and dual tiers: Single Touch and Multi
Touch. First entry into Single Touch, subsequent accesses promotes into Multi Touch.
Multi Touch data is first migrated down to SSD before being evicted. Single Touch
items that do not get “touched” are evicted from Memory.
Oplog Index - Memory resident metadata needed to index and access data in the
OpLog
Oplog Store - The SSD store for inbound random writes. These are eventually
coalesced and flushed down to the extent store.
Medusa/Cassandra - Medusa provides the interface for fetching file system metadata
from the Cassandra DB. Cassandra data is distributed across all nodes in the cluster.
Extent Store - Manages the underlying data repository spanning both SSD and HDD
(hybrid systems).
Nutanix Performance - Extent Store Read

1 - The Protocol Adapter receives a read operation and converts it into a
Stargate read operation.
2 - Admission Control applies applicable rate limits if needed.
3 - The OpLog Index is consulted, but since the read is cold the data is not in
the OpLog Store.
4 - We don’t find our metadata in the Unified Cache either.
5 - A Medusa call is used to retrieve the metadata from the Cassandra DB.
6 - The retrieved metadata is stored in the Unified Cache to expedite future
accesses.
7 - After consulting the metadata, the read data is retrieved from the Extent
Store.
8 - The retrieved data is written into the Unified Cache.
9 - The data is then sent up to the Protocol Adapter, and the appropriate
protocol response is sent back up the I/O stack to the hypervisor.

49

Nutanix Performance

Let’s first examine a completely “cold” read operation

1 - Basically it converts the protocol-specific read to what’s needed to access the data
via Stargate: vDisk, offset, size.
2 - This is where the number of outstanding operations plays a role in possible throttling
(queueing).
3 - Consult the OpLog Index to see if the data is there.
4 - Given that this is a “cold” read, we know that we won’t find our metadata here.
5 - A Medusa lookup is issued to get the necessary metadata needed to access the
data.
6 - Since this is the first time this metadata is read, it gets added into the ”First Touch”
pool of the Unified Cache (Memory).
7 - With the metadata information the read data can be retrieved from the extent store.
8 - The read data is also added into the “First Touch” pool in the Unified Cache.
9 - The data is sent up to the Protocol Adapter.
10 - The Protocol Adapter sends the appropriate response.
Nutanix Performance - Unified Cache Hit Read

1 - The Protocol Adapter receives a read operation and converts it into a
Stargate read operation.
2 - Admission Control applies applicable rate limits if needed.
3 - The OpLog Index is consulted but the data is not in the OpLog Store.
4 - We have a 100% cache hit for the metadata and subsequent user data in the
Unified Cache. This could result in a promotion from Single Touch to Multi Touch.
5 - The data is then sent up to the Protocol Adapter, and the appropriate
protocol response is sent back up the I/O stack to the hypervisor.

50

Nutanix Performance - Unified Cache Hit Read.

Unified Cache Hit Read

1 - Basically it converts the protocol-specific read to what’s needed to access the data
via Stargate: vDisk, offset, size.
2 - This is where the number of outstanding operations plays a role in possible throttling
(queueing).
3 - Consult the OpLog Index to see if the data is there.
4 - We find all the metadata and user data in the Unified Cache -- This is how data gets
“hot”. Multiple accesses promotes from single-touch to multi-touch.
5 - The data is sent up to the Protocol Adapter.
6 - The Protocol Adapter sends the appropriate response.

Explain that a partial cache-hit would result in a combination of behaviors.


Nutanix Performance - Reading From OpLog Store

1 - The Protocol Adapter receives a read operation and converts it into a
Stargate read operation.
2 - Admission Control applies applicable rate limits if needed.
3 - We have a hit for the data in the OpLog Index.
4 - A copy of the data is read from the OpLog Store and sent up to the Protocol
Adapter.
5 - The appropriate protocol response is sent back up the I/O stack to the
hypervisor.

51

Nutanix Performance - Reading From OpLog Store.

Reading from OpLog Store.

Considered rare, based on how most workloads behave. Would require some kind of
write-than-read access pattern.
Another possibility is that the UVM is struggling to keep data cached, so it has to read
something from storage that was just written.

1 - Basically it converts the protocol-specific read to what’s needed to access the data
via Stargate: vDisk, offset, size.
2 - This is where the number of outstanding operations plays a role in possible throttling
(queueing).
3 - This time we find a hit for the data in the OpLog Index.
4 - We read the data from the OpLog Store and send it to the Protocol Adapter.
5 - The Protocol Adapter sends the appropriate response.
Nutanix Performance - Two Write Paths

Inbound writes can take two different paths:

• Sequential and > 1.5MB outstanding data: skip the OpLog and write directly to
the Extent Store.
§ Latency is higher due to inline metadata and disk subsystem costs.
§ Should be fine, given that highly concurrent sequential writes are more
throughput sensitive than latency sensitive.
• Random writes are written into the OpLog.
§ Low latency, since metadata updates and disk writes are done asynchronously.

52

Nutanix Performance - Two Write Paths.

Inbound writes can take two different paths:


• Sequential and >= 1.5MB outstanding data, skip OpLog and write directly to the
Extent Store.
o Latency is higher due to inline metadata and disk subsystem costs.
o Should be fine, given that highly concurrent sequential writes are more throughput
sensitive than latency sensitive.
• Random writes are written into the OpLog.
o Low latency, since metadata updates and disk writes are done asynchronously.

This design is quite sensible given that nearly all highly concurrent sequential writer
workloads, like backups or other bulk data load operations, are throughput-sensitive and
not latency-sensitive. As long as this work completes within a reasonable amount of
time, performance is acceptable. Whereas, random, low concurrent, write workloads are
typically far more latency-sensitive. The OpLog allows us to complete these random
writes very quickly and then optimize/coalesce them when we flush them to the Extent
Store.

In rare cases where a customer has a workload that isn’t completing fast enough AND
the inbound writes are being handled as sequential AND the application cannot be
optimized any further, there is some tuning which can be done to allow more of this
work into the OpLog. This should only be done when working with senior-level support
resources and/or engineering engagements.

Nutanix Performance - Write to OpLog

1 - The Protocol Adapter receives a write operation and converts it into a
Stargate write operation.
2 - Admission Control applies throttling where necessary.
3 - The OpLog write is queued.
4 - The OpLog write is sent to the local OpLog Store and a remote peer write is
sent to comply with RF replication.
5 - When the write completes to the OpLog Store and we receive an ACK from the
remote peer write, the completion is passed to the Protocol Adapter.
6 - The Protocol Adapter sends an ACK up to the I/O stack of the hypervisor.

53

Nutanix Performance

Write to OpLog.

1 - Converts the protocol write to what’s needed for Stargate: vDisk, offset, payload.
2 - This is where the number of outstanding operations plays a role in possible throttling
(queueing).
3 - Queue the write operation.
4 - A thread processes the queued writes and sends them to the participating oplog
stores.
• Ideally one copy will be saved locally, but if the local oplog is full, it will commit two
remote replicas.
5 - When we receive the acknowledgement that the local and remote writes have
completed, we send the completion to the Protocol Adapter.
6 - The Protocol Adapter sends the write acknowledgement.
For RF3, the remote writes are chained and not parallel. The node where the write
originates sends a remote RF write request that requires 2 replica commitments. The
node that receives this request sends out the next replica write in sequence. This is
done to avoid saturating the local network with parallel outbound writes.
Nutanix Performance - Bypassing OpLog

1 - The Protocol Adapter receives a write operation and converts it into a
Stargate write operation.
2 - Admission Control applies throttling where necessary.
3 - The I/O is categorized as sequential, so the OpLog is bypassed.
4 - The metadata cache is consulted to prepare for data placement. Any misses
here require Medusa lookups.
5 - Any metadata lookups are stored in the Unified Cache.
6 - The data is written to the Extent Store and remote replica writes are issued
to comply with the RF factor.
7 - The extent metadata is persisted in the Unified Cache.
8 - When both local and remote replica writes complete, the completion is sent to
the Protocol Adapter.
9 - The Protocol Adapter sends the ACK up to the I/O stack of the hypervisor.

54

Nutanix Performance

Bypassing OpLog - when writes are deemed sequential.

1 - Converts the protocol write to what’s needed for Stargate: vDisk, offset, payload.
2 - This is where the number of outstanding operations plays a role in possible throttling
(queueing).
3 - Write is categorized as sequential so OpLog is bypassed.
4 - The Unified Cache is consulted for the necessary metadata for data placement and
misses results in Medusa ops.
5 - Any cache-miss data is added to the Unified Cache single touch pool.
6 - Data is persisted to the Extent store and replica write(s) is issued.
7 - Updated metadata is persisted to the Unified Cache.
8 - Once all local and remote operations complete, we let the protocol adapter know.
9 - The Protocol Adapter formulates the correct protocol ACK for the write.
Nutanix Performance - ILM

Information Lifecycle Management (ILM) manages the movement of data between the
SSD and HDD tiers.
• Each node’s SSD/HDD resources are pooled across the entire cluster.
• When usage of a node’s SSD(s) exceeds 75%, Curator ILM tasks will handle
downward migration from the SSDs to HDDs.
• If Stargate accesses cold data within an extent often enough within a short
period of time, this will lead to an upward migration back to the SSD tier.
• Cold Tier reads tend to see high latency due to HDD access times.

55

Nutanix Performance

Information Lifecycle Management (ILM) manages the movement of data between


the SSD and HDD tiers.
• Each nodes SSD/HDD resources are pooled across the entire cluster.
• When usage of the SSD tier exceeds 75%, Curator ILM tasks will handle the
downward migration from the SSDs to the HDDs.
• If Stargate accesses cold data multiple times within a short period of time,
this will lead to an upward migration back to the SSD tier.

The 75% tier threshold is the default and can be modified.


There is no ILM tier movement on all-flash clusters, just disk balancing.
Upward migration is controlled by several criteria taking into account read patterns and
frequency. The egroup is what is up-migrated. Reads and writes can drive this upward-
migration.
Curator is also responsible for disk balancing, which should help to spread data across
all nodes. During disk balancing, the preference is to keep colder data within the same
tier on other nodes.
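
If you need to see the ILM and disk-balancing work Curator has generated, Curator
exposes a diagnostics page much like Stargate’s; a minimal sketch, assuming the
standard Curator port of 2010 on the CVM:

# Browse the Curator page from a CVM; the master's page lists recent scans and
# the background tasks (ILM down-migration, disk balancing) they generated.
links http:0:2010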
Nutanix Performance

At a very high level, here are some key aspects to Nutanix performance that you
should be aware of:
• Unified Cache provides efficient reads for user data and metadata.
• OpLog is used to absorb random writes at low latency.
• vDisk writes that are both sequential and have more than 1.5MB outstanding data
are written directly to the Extent Store.
• Medusa Lookups add latency when there is a metadata miss.
• Cold Tier reads/writes see higher latency due mostly to HDD costs.

56

Nutanix Performance

At a very high level, here are some key aspects to Nutanix performance that you
should be aware of:
• Unified Cache provides efficient reads for user data and metadata.
• OpLog is used to absorb random writes at low latency.
• Sequential vDisk writes that are both sequential and have more than 1.5MB
outstanding data are written directly to Extent Store.
• Medusa Lookups add latency when there is a metadata miss.
• Cold Tier reads/writes see higher latency due mostly to HDD costs.

Obviously this isn’t all you need to know about, but these do cover many common
performance issues.

There is one special case where sequential writes don’t bypass OpLog and that’s when
they are overwriting data already in OpLog. This is done to ensure we won’t have stale
data in OpLog.
Using Prism Performance Charts

Using Prism Performance Charts


Using Prism Performance Charts

The Prism UI consumes data from the Arithmos service and provides some very
useful charts that aid performance analysis.
• Arithmos - Service which runs per CVM that collects performance metrics for
various entities.
• Entities - Categorical object types, such as Cluster, Host, VM, Virtual Disk,
and others.
• Metrics - Entity-specific performance statistics provided by Arithmos.

58

Using Prism Performance Charts

The Prism UI consumes data from the Arithmos service and provides some very
useful charts that aid performance analysis.
• Arithmos - Service which runs per CVM that collects performance metrics for
various entities.
• Entities - Categorical object types, such as Cluster, Host, VM, Virtual Disk, and
others.
• Metrics – Entity-specific performance statistics provided by Arithmos.

In some cases, a single Prism chart might be a weighted average across multiple
entities in the cluster. We’ll see an example of this on the next slide.
Using Prism Performance Charts

Charts on the Prism Home page and their Arithmos sources:
• Entity: Cluster / Metric: Storage Controller IOPS
• Entity: Storage Container / Metric: Storage Controller Bandwidth
  (summed across all Cluster Storage Containers)
• Entity: Cluster / Metric: Storage Controller Latency
• Entity: Cluster / Metrics: Hypervisor CPU Usage (%), Hypervisor Memory Usage (%)

59

Using Prism Performance Charts

The default home page has many useful charts that are derived from Arithmos that are
meant to convey cluster-level statistics.
Using Prism Performance Charts

• From the Analysis page, you can create new Metric or Entity Charts.
• Let’s take a look at a New Entity Chart.
• Scrolling through, we can see all of the various Entity types available.

60

Using Prism Performance Charts

You can explore the various Entity types available via Arithmos when you create a
custom Entity Chart.

There are numerous Entities across the cluster. We are going to focus on the VM type
next.
Using Prism Performance Charts

• Let’s see what Metrics are available for a UVM.
• Scrolling through, we can see all of the various Metrics available.

61

Using Prism Performance Charts

While Entity Charts allow you to choose a specific Entity and then see an applicable list
of Metrics, a custom Metric Chart does the opposite.
The custom Metric Chart allows you to start with a Metric and then it will provide a list
of applicable Entities.
The resulting charts are the same.
Using Prism Performance Charts

In addition to the Home page and the Analysis page, the VM page provides very
nice VM-specific performance charts.
• The VM view provides more detail when customers are focused on specific
applications.
• The VM landing page provides some top talker stats which might be a good place
to start when investigating unexpected changes in cluster performance stats.

62

Using Prism Performance Charts

In addition to the Home page and the Analysis page, the VM page provides very
nice VM-specific performance charts.
• The VM view provides more detail when customers are focused on specific
applications.
• The VM landing page provides some top talker stats which might be a good place to
start when investigating unexpected changes in cluster performance stats.

Let’s walk through an example where the customer is concerned about a very
noticeable increase in cluster-wide latency.
Using Prism Performance Charts

• Your customer notices some very unexpected latency spikes happening on the
system. Normally they see less than 2ms average latency at this time of day.
• From the Home screen we go to the VM page.
• When you get to the VM landing page, you notice that WindowsVM-2 is seeing
rather high latency despite it only doing ~147 IOPS.
• We click on WindowsVM-2 and this brings us to the VM table view.

63

Using Prism Performance Charts

We start on the Home page in the Prism UI and notice the increase in latency there.
We go from the Home page to the VM page.
On the VM page, we see the “Top talker” in latency is the UVM WindowsVM-2.
PAUSE AND ASK:
Q: Which VM is doing the most work from an IOPS perspective
A: WindowsVM
Q: Do we see this VM in the top latencies?
A: No

We click on WindowsVM-2 row in the Latency box to quickly get to the VM Table view.
Using Prism Performance Charts

• When we land on the VM table view, our VM is already selected.
• We see the workload is all write IOPS.
• What’s the approximate OpSize for each write?

64

Using Prism Performance Charts

Pause after “What’s the approximate OpSize for each write” and take input from the
class.

Answer: (40.92 MB/sec) / (161 writes/sec) ~= 0.25416149 MB per operation

0.25416149 MB * 1000 ~= 254.161490 KB

Most likely seeing 256KB ops from the UVM.

Q: Does the UVM ‘WindowsVM” appear impacted from latency perspective?


A: No, it’s seeing < 2ms, which matches the customer’s expectation for cluster average.
Using Prism Performance Charts

• Scroll down the page for a view of the basic performance statistics for the VM.
• Let’s click on Virtual Disks to get a per vDisk view of performance.

65

Using Prism Performance Charts

When a VM is selected in the table view, we can scroll down the page to get some nice
performance charts specific to that VM.
Let’s take a look at things from the specific UVM “Disk” perspective .
Using Prism Performance Charts

• We can see that scsi.1 is where this write workload is active.
• This appears to be a 100% sequential write workload.
• Let’s click on Additional Stats.

66

Using Prism Performance Charts

The Virtual Disk entries are from the perspective of the UVM, but if we consult the UVM
configuration we could map that to the underlying vDisk.
From the “Additional Stats” section we can see the relative sequential vs. random
aspect of the inbound workload.
We also see the current approximation for the write working set size, but this is merely
based on what has been observed so far. IOW, working set size calculations are based
on what’s been seen so far.
Using Prism Performance Charts

Let’s click on I/O Metrics.

67

Using Prism Performance Charts

Let’s take a look at what we get in the I/O Metrics section.


Using Prism Performance Charts

• The I/O Metrics tab provides not only latency characteristics but also gives us
some very useful histograms.
• We know that we want to focus on writes, so let’s see what the histograms
provide.

68

Using Prism Performance Charts

Clicking on I/O Metrics provides a view of read vs. write latency over time.
Clicking on the latency curve changes the view below. The view below provides some
useful histograms to consider.
Using Prism Performance Charts

• The average write opsize is falling into the 64K to 512K range.
• What was your approximation of per-write opsize previously?

69

Using Prism Performance Charts.

Write OpSize Histogram.


- We can see that all the writes are falling into the 64-512KB per write bucket.
- We previously calculated ~254KB per write, and concluded 256KB ops from the UVM
(slide 64).
Using Prism Performance Charts

The latency is quite variable, but most samples are distributed around the ~50ms
range.

70

Using Prism Performance Charts.

Write Latency Histogram.


- We can see that more than half of all writes fall in the < 50ms range: 0.09 + 0.05 +
1.67 + 7.15 + 9.29 + 44.47 ~= 62.7%
- However, there are some significant outliers here as well.
- If we were monitoring write behavior from the guest at short durations, we’d see
spikes above 100ms.
Using Prism Performance Charts

• We are also given a Random vs. Sequential pie chart, showing us again that we
are seeing a sequential write workload.
• Based upon all that we’ve covered up to this point, what are your conclusions?

71

Using Prism Performance Charts.

Write Random vs. Sequential.


- We know that it’s a sequential write.
- What kind of theory are you beginning to develop?
- Expected answers:
- Bypassing OpLog, straight to Extent Store.
- High latency is expected.
- This one workload is probably skewing the overall cluster average latency.
- The fact that it was unexpected at this time of day means that it’s a new
behavior.
Using Prism Performance Charts

• A short while later, the sequential write workload on WindowsVM-2 completes …
• How do you explain this to the customer?

72

Using Prism Performance Charts.

Once the sequential write workload on WindowsVM-2 completes, the cluster-wide
average latency drops back to the customer’s expected range.
- The explanation for the customer: a single Throughput-sensitive sequential
write workload was bypassing the OpLog and writing directly to the Extent Store,
which is expected to show higher per-op latency, and that one workload was
skewing the overall cluster average. Meanwhile, the other workloads (such as
WindowsVM) continued to see their expected < 2ms latency.
Using Prism Performance Charts

The bottom line here is that the Prism UI does provide very useful data that
helps with Performance Analysis.
• Remember that the VM I/O Metrics tab provides a histogram view of opsize and
latency for reads and writes.
§ Very useful when evaluating average latency. Helps answer the question of
“How meaningful is my mean?”
• You should be able to build a very good high-level understanding of
performance characteristics from the Prism UI.

73

Using Prism Performance Charts.

The bottom line here is that the Prism UI does provide very useful data that aids
in Performance Analysis.
• Remember that the VM I/O Metrics tab provides a histogram view of opsize and
latency for reads and writes.
o Very useful when evaluating average latency. Helps answer the question of “How
meaningful is my mean?”
• You should be able to build a very good high-level understanding of performance
characteristics from Prism UI.

You might have to remind them about histograms and how they can convey samples
with a great deal of variance.
Stress how spending time in the Prism charts for a problem that has already
happened can aid in very comprehensive framing. Take screenshots of the graphs to
share with your colleagues.
Data Collection

What To Collect and How

74

Data Collection - What To Collect and How.


Data Collection

When you are opening a customer case for a performance issue, please include:
• Any screenshots captured during the problem.
• Any Guest OS logs, or application logs, that convey the issue.
§ These typically have timestamps when things are logged.
§ Make sure you provide the relative time zone.
• Pretty much anything else you think will help your peers in Nutanix Support
understand the issue.

75

Data Collection

When you are opening a customer case for a performance issue, please include:
• Any screenshots captured during the problem.
• Any Guest OS logs, or application logs, that convey the issue.
o Typically have timestamps when things are logged.
o Make sure you provide the relative time zone.
• Pretty much anything else you think will help your peers in Nutanix Support
understand the issue.

There’s no such thing as too much data! Just make sure you provide detailed notes
about each data item’s relevance:
- ”This is a screenshot of iostat data collected on the customer’s Linux VM when the
problem was happening at around 7:43AM PDT.”
- “This is the log from the customer’s application. It’s logging I/O failures due to
timeouts. The timestamps are in GMT-5.”
Data Collection

When an issue is currently happening, predictable, or reproducible:
• You will want to get a collect_perf to share with Nutanix Support.
• Additionally, it’s a good idea to provide NCC health check output and a full
NCC log bundle.
• If possible, collect performance statistics from the UVM(s) that are
experiencing the issue.
• Let’s review some strategies when it comes to data collection.

76

Data Collection

When an issue is currently happening, predictable, or reproducible:


• You will want to get a collect_perf to share with Nutanix Support.
• Additionally, it’s a good idea to provide NCC health check output and a full NCC log
bundle.
• If possible, collect performance statistics from the UVM(s) that are experiencing the
issue.
• Let’s review some strategies when it comes to data collection.
Data Collection

Running collect_perf
• The most basic collection is to simply run the following command on any CVM in
the cluster:
collect_perf start
- It can take a few minutes for it to start actively collecting data.
• Ensure that the problem occurs a few times during the collection.
§ Keep notes about what was experienced and when during the collection.
• Stop the collection:
collect_perf stop

77

Data Collection

Running collect_perf
• The most basic collection is to simply run the following command on any CVM in the
cluster:
collect_perf start
• It can take a few minutes for it to start actively collecting data.
• Ensure that the problem occurs a few times during the collection.
o Keep notes about what was experienced and when during the collection.
• Stop the collection.
collect_perf stop

It can take several minutes for the command to stop as well, depending upon the
number of nodes.

Point the class at the following KB, KB 1993:


https://portal.nutanix.com/#page/kbs/details?targetId=kA0600000008hQVCAY

This KB goes into great detail about many options to collect_perf and, perhaps
most importantly, cautions about its use.
Data Collection

Running collect_perf
• When the command completes, you’ll find a *.tgz file in
/home/nutanix/data/performance on the CVM where you ran the command.
• Attach this collection to the case and make sure you provide any relevant
observations that occurred during the collection.

78

Data Collection

Running collect_perf
• When the command completes, you’ll find a *.tgz file in
/home/nutanix/data/performance on the CVM where you ran the command.
• Attach this collection to the case and make sure you provide any relevant
observations that occurred during the collection.

If you collect any data from the UVM(s) during the collect_perf, include that as well. Just
be sure to specify timestamps to go along with corroborating evidence.
Remember this should accompany your properly framed case, so make sure you are
providing all the details:
- UVM(s) seeing the issue
- When it was seen
- What was seen
- …and so forth.
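
A small sketch of that workflow on the CVM (the final listing simply confirms the
bundle exists before you attach it; the actual filename will differ per
collection):

collect_perf start
# ... reproduce the issue a few times, keeping timestamped notes ...
collect_perf stop

# Confirm the resulting *.tgz bundle before attaching it to the case.
ls -lh /home/nutanix/data/performance/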
Data Collection

Collecting system logs with NCC.
• System logs collected from NCC contain useful performance stats.
• Collect a log bundle after collect_perf completes:
nohup ncc log_collector --last_no_of_hours=2 run_all &
- The use of nohup ensures that the command will continue to run if the
connection to the CVM is lost.
• Only collect the time range needed, as these log bundles can be quite large.
- In addition to --last_no_of_hours=n you can also use --last_no_of_days=n
• When the command completes, the bundle will be in
/home/nutanix/data/log_collector.
§ Attach this to the case.

79

Data Collection.

Collecting system logs with NCC.


• System logs collected from NCC contain useful performance stats.
• Collect a log bundle after collect_perf completes:
nohup ncc log_collector --last_no_of_hours=2 run_all &
• Only collect the time range needed as these log bundles can be quite large.
• When the command completes, the bundle will be in
/home/nutanix/data/log_collector.
o Attach this to the case.

Always use the most recent version of NCC.


Also, as mentioned before, make sure you run all the NCC health checks and include
that output log file.
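
A minimal sketch of that NCC collection from a CVM, assuming a current NCC
version is installed:

# Run the full set of health checks and keep the output for the case.
ncc health_checks run_all

# Collect the last two hours of logs; nohup keeps the collector running even if
# the SSH session to the CVM drops.
nohup ncc log_collector --last_no_of_hours=2 run_all &

# The bundle is written here when the collector finishes.
ls -lh /home/nutanix/data/log_collector/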
Data Collection

Now that you’ve shared all the details with Support and provided some meaningful
data collections, we’re much closer to resolution and a happy customer.
• However, performance cases are often iterative.
§ Initial analysis may lead to suggestions to try.
§ Once those suggestions have been implemented, reassess performance.
§ If you haven’t yet achieved your goal, get new data and repeat the process.

80

Data Collection

Now that you’ve shared all the details with Support and provided some
meaningful data collections, we’re much close to resolution and a happy
customer.
• However, performance cases are often iterative.
o Initial analysis may lead to suggestions to try.
o Once those suggestions have been implemented, reassess performance.
o If we haven’t yet achieved our goal, get new data and repeat the process.

There’s always a bottleneck. We might successfully eliminate one, only to see another
one surface. The customer may still be seeing the same outcome, but we will need new
data to see what we are dealing with following any changes.
Summary

81

Summary
Performance Troubleshooting - Summary

Performance Concepts and Basic Statistics.
• Understand what an average value conveys and use histograms when possible to
evaluate how expressive they are.
§ “How meaningful is my mean?”
• Little’s Law (N=X*R) provides a key to understanding the relationship between
latency and throughput.
• Correlation is key when building a theory based on statistical data.

82

Performance Troubleshooting – Summary.

Performance Concepts and Basic Statistics.


• Understand what an average value conveys and use histograms when possible to
evaluate how expressive they are.
o “How meaningful is my mean?”
• Little’s Law (N=X*R) provides a key to understanding the relationship between
latency and throughput.
• Correlation is key when building a theory based on statistical data.

We also defined some terms like: latency, IOPS, throughput, utilization, and others.
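
A quick worked example of Little’s Law with hypothetical numbers: if a vDisk
completes X = 2,000 ops/sec at an average response time R = 0.002 sec, then
N = X * R = 2,000 * 0.002 = 4 operations outstanding on average; with N held
constant, throughput can only increase if per-op latency decreases.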
Performance Troubleshooting - Summary

Performance Case Framing.
• The goal is to state the customer’s issue in a way where its mapping to our
architecture is clear.
• What is being measured? How is it being measured? Where is it being measured
from? What is the expectation, and why?
§ Don’t forget to define expectations, such as Success Criteria.
• Proper Framing helps to narrow our focus when we start our analysis.

83

Performance Troubleshooting – Summary.

Performance Case Framing.


• The goal is to state the customer’s issue in a way where its mapping to our
architecture is clear.
• What is being measured? How is it being measured? Where is it being measured
from? What is the expectation, and why?
o Don’t forget to define expectations, such as Success Criteria.
• Proper Framing helps to narrow our focus when we start our analysis.
Performance Troubleshooting - Summary

A Layered Approach to Performance Analysis.
• Take a “top down” approach to investigating a performance issue.
• When one layer accounts for the majority of the overall cost, we’ve located the
most significant bottleneck.
• Layers can start off very high-level and be broken down into their
corresponding layers as needed.
• The goal is bottleneck isolation.

84

Performance Troubleshooting – Summary.

A Layered Approach to Performance Analysis.


• Take a “top down” approach to investigating a performance issue.
• When one layer accounts for the majority of the overall cost, we’ve located the
most significant bottleneck.

The layers can start off very generalized. Each high-level “layer” can also be broken
down into its relevant layers.
The goal is bottleneck isolation.
Performance Troubleshooting - Summary

Key Points About Nutanix Performance.
• Unified Cache provides efficient reads for user data and metadata.
• OpLog is used to absorb random writes at low latency.
• vDisk writes that are both sequential and have more than 1.5MB outstanding data
are written directly to the Extent Store.
• Medusa Lookups add latency when there is a metadata miss.
• Cold Tier reads/writes see higher latency due mostly to HDD costs.

85

Performance Troubleshooting – Summary.

Key Points About Nutanix Performance.


• Unified Cache provides efficient reads for user data and metadata.
• OpLog is used to absorb random writes at low latency.
• Sequential vDisk writes that are both sequential and have more than 1.5MB
outstanding data are written directly to Extent Store.
• Medusa Lookups add latency when there is a metadata miss.
• Cold Tier reads/writes see higher latency due mostly to HDD costs.

Cache-miss for metadata adds latency.


Sequential writes see higher latency, but by design they should be throughput sensitive
and not latency sensitive.
Performance Troubleshooting - Summary

Using Prism Performance Charts.
• Prism UI gets its performance data from Arithmos.
• The VM page provides some “top talkers” data.
• The I/O Metrics tab provides histograms for latency and opsize.
• The Virtual Disks and I/O Metrics sections provide Sequential vs. Random
insight.
• The graphs in Prism provide a very powerful tool for performance analysis.

86

Performance Troubleshooting – Summary.

Using Prism Performance Charts.


• Prism UI gets its performance data from Arithmos.
• The VM page provides some “top talkers” data.
• The I/O Metrics tab provides histograms for latency and opsize.
• The Virtual Disks and I/O Metrics sections provide Sequential vs. Random insight.
• The graphs in Prism provide a very powerful tool for performance analysis.
Performance Troubleshooting - Summary

Data Collection - What to Collect and How.
• When seeking help from Nutanix Support, providing all the relevant data along
with your properly framed case is a key for success.
• Provide all the details with any data provided.
§ Which node(s), UVM(s), vDisk(s), and so forth.
§ Make sure to specify time zones.

87

Performance Troubleshooting – Summary.

Data Collection - What to Collect and How.


• When seeking help from Nutanix Support providing all the relevant data along with
your properly framed case is a key for success.
• Provide all the details with any data provided.
o Which node(s), UVM(s), vDisk(s), and so forth.
o Make sure to specify time zones.

Thank You!

88

Thank You!
Appendix A

Module 4 Foundation Troubleshooting - Foundation log


[nutanix@nutanix-installer 20170509-144208-5]$ pwd

/home/nutanix/foundation/log/20170509-144208-5

[nutanix@nutanix-installer 20170509-144208-5]$ cat node_10.30.15.47.log | more

20170509 14:42:08 INFO Validating parameters. This may take few minutes

20170509 14:42:08 DEBUG Setting state of <ImagingStepValidation(<NodeConfig(10.30.15.47)


@d1d0>) @d790> from PENDING to RUNNING

20170509 14:42:08 INFO Running <ImagingStepValidation(<NodeConfig(10.30.15.47) @d1d0>) @d790>

20170509 14:42:08 INFO Validating parameters. This may take few minutes

20170509 14:42:42 DEBUG Setting state of <ImagingStepValidation(<NodeConfig(10.30.15.47)


@d1d0>) @d790> from RUNNING to FINISHED

20170509 14:42:42 INFO Completed <ImagingStepValidation(<NodeConfig(10.30.15.47) @d1d0>)


@d790>

20170509 14:42:42 DEBUG Setting state of <GetNosVersion(<NodeConfig(10.30.15.47) @d1d0>)


@d950> from PENDING to RUNNING

20170509 14:42:42 INFO Running <GetNosVersion(<NodeConfig(10.30.15.47) @d1d0>) @d950>

20170509 14:42:42 INFO Node IP: CVM(10.30.15.47) HOST(10.30.15.44) IPMI(10.30.15.41)

20170509 14:44:07 INFO NOS Version is: 4.7.5.2

20170509 14:44:07 DEBUG Setting state of <GetNosVersion(<NodeConfig(10.30.15.47) @d1d0>)


@d950> from RUNNING to FINISHED

20170509 14:44:07 INFO Completed <GetNosVersion(<NodeConfig(10.30.15.47) @d1d0>) @d950>

20170509 14:44:07 DEBUG Setting state of <PrepareHypervisorIso(<NodeConfig(10.30.15.47)


@d1d0>) @da50> from PENDING to RUNNING

20170509 14:44:07 INFO Running <PrepareHypervisorIso(<NodeConfig(10.30.15.47) @d1d0>) @da50>

20170509 14:45:17 INFO AHV iso to be used:


/home/nutanix/foundation/cache/kvm_cd_image/kvm.20160601.50.iso

20170509 14:45:17 DEBUG Setting state of <PrepareHypervisorIso(<NodeConfig(10.30.15.47) @d1d0>)


@da50> from RUNNING to FINISHED
20170509 14:45:17 INFO Completed <PrepareHypervisorIso(<NodeConfig(10.30.15.47) @d1d0>)
@da50>

20170509 14:45:17 DEBUG Setting state of <ImagingStepTypeDetection(<NodeConfig(10.30.15.47)


@d1d0>) @da90> from PENDING to RUNNING

20170509 14:45:17 INFO Running <ImagingStepTypeDetection(<NodeConfig(10.30.15.47) @d1d0>)


@da90>

Get FRU Device Description via ipmitool

20170509 14:45:17 INFO Attempting to detect device type on 10.30.15.41

20170509 14:45:17 DEBUG factory mode is False

20170509 14:45:17 INFO Checking if this is Quanta

20170509 14:45:17 DEBUG Command '['ipmitool', '-U', u'ADMIN', '-P', u'ADMIN', '-H', '10.30.15.41',
'fru']' returned stdout:

FRU Device Description : Builtin FRU Device (ID 0)

Chassis Type : Other

Chassis Part Number : CSE-827HQ-R1K62MBP

Chassis Serial : ZM163S033945

Board Mfg Date : Sun Dec 31 16:00:00 1995

Board Mfg : Supermicro

Board Product : NONE

Board Serial : ZM163S033945

Board Part Number : X9DRT-HF+J-NI22

Product Manufacturer : Nutanix

Product Name : NONE

Product Part Number : NX-1065S

Product Version : NONE

Product Serial : 16SM13150152

Product Asset Tag : 47802

Check to see what type of system

20170509 14:45:17 INFO Checking if this is a Lenovo system.

20170509 14:46:14 INFO Checking if this is software-only node


20170509 14:46:15 INFO Checking if this is Dell

20170509 14:46:16 DEBUG Command '['/opt/dell/srvadmin/sbin/racadm', '-r', '10.30.15.41', '-u',


u'ADMIN', '-p', u'ADMIN', 'getc

onfig', '-g', 'idRacInfo']' returned stdout: Security Alert: Certificate is invalid - self signed certificate
Continuing execution. Use -S option for racadm to stop execution on certificate-related errors.
stderr:/opt/dell/srvadmin/sbin/racadm: line 13: printf: 0xError: invalid hex number ERROR: Unable to
connect to RAC at specified IP address.

20170509 14:46:16 INFO Checking if this is SMC

20170509 14:46:17 INFO Detected class smc_wa for node with IPMI IP 10.30.15.41

Preparing NOS package and making node specific Phoenix image

20170509 14:46:17 DEBUG Setting state of <ImagingStepTypeDetection(<NodeConfig(10.30.15.47)


@d1d0>) @da90> from RUNNING to FINISHED

20170509 14:46:17 INFO Completed <ImagingStepTypeDetection(<NodeConfig(10.30.15.47) @d1d0>)


@da90>

20170509 14:46:17 DEBUG Setting state of <ImagingStepPrepareVendor(<NodeConfig(10.30.15.47)


@d1d0>) @db90> from PENDING to RUNNING

20170509 14:46:17 INFO Running <ImagingStepPrepareVendor(<NodeConfig(10.30.15.47) @d1d0>)


@db90>

20170509 14:46:17 DEBUG Setting state of <ImagingStepPrepareVendor(<NodeConfig(10.30.15.47)


@d1d0>) @db90> from RUNNING to FINISHED

20170509 14:46:17 INFO Completed <ImagingStepPrepareVendor(<NodeConfig(10.30.15.47) @d1d0>)


@db90>

20170509 14:46:17 DEBUG Setting state of <ImagingStepInitIPMI(<NodeConfig(10.30.15.47) @d1d0>)


@dc10> from PENDING to RUNNING

20170509 14:46:17 INFO Running <ImagingStepInitIPMI(<NodeConfig(10.30.15.47) @d1d0>) @dc10>

20170509 14:46:17 INFO Preparing NOS package and making node specific Phoenix image

Mount Phoenix ISO and Boot into Phoenix

20170509 14:46:49 INFO NOS version is 4.7.5.2

20170509 14:46:49 INFO Powering off node

20170509 14:46:54 INFO Booting into Phoenix

20170509 14:46:54 INFO Starting SMCIPMITool

20170509 14:46:55 INFO Detecting power status

20170509 14:46:57 INFO Power status is on


20170509 14:46:57 INFO Powering off node

20170509 14:46:58 INFO Power status is off

20170509 14:46:58 INFO Disconnecting virtual media

20170509 14:46:59 INFO Attaching virtual media: /home/nutanix/foundation/tmp/sessions/20170509-


144208-5/phoenix_node_isos/foun

dation.node_10.30.15.47.iso

20170509 14:47:04 DEBUG [1/3] Checking virtual media:


/home/nutanix/foundation/tmp/sessions/20170509-144208-5/phoenix_node_iso

s/foundation.node_10.30.15.47.iso

20170509 14:47:04 DEBUG vmwa status: vmwa status Device 1: None Device 2: ISO File
[/home/nutanix/foundation/tmp/sessions/20170509-144208-
5/phoenix_node_isos/foundation.node_10.30.15.47.iso]

20170509 14:47:56 DEBUG Virtual media is mounted successfully:


/home/nutanix/foundation/tmp/sessions/20170509-144208-5/phoenix

_node_isos/foundation.node_10.30.15.47.iso

20170509 14:47:56 INFO Power status is on

20170509 14:47:56 INFO BMC should be booting into phoenix

20170509 14:47:59 INFO [3/900s] Waiting for Phoenix

20170509 14:48:12 INFO [16/900s] Waiting for Phoenix

20170509 14:48:25 INFO [29/900s] Waiting for Phoenix

20170509 14:48:38 INFO [42/900s] Waiting for Phoenix

20170509 14:48:51 INFO [55/900s] Waiting for Phoenix

20170509 14:49:05 INFO [68/900s] Waiting for Phoenix

20170509 14:49:18 INFO [81/900s] Waiting for Phoenix

20170509 14:49:31 INFO [94/900s] Waiting for Phoenix

20170509 14:49:44 INFO [107/900s] Waiting for Phoenix

20170509 14:49:57 INFO [120/900s] Waiting for Phoenix

20170509 14:50:07 INFO [130/900s] Waiting for Phoenix

20170509 14:50:17 INFO [140/900s] Waiting for Phoenix

20170509 14:50:27 INFO Rebooted into Phoenix successfully


20170509 14:50:27 INFO Exiting SMCIPMITool

20170509 14:50:27 DEBUG Setting state of <ImagingStepInitIPMI(<NodeConfig(10.30.15.47) @d1d0>)


@dc10> from RUNNING to FINISHED

20170509 14:50:27 INFO Completed <ImagingStepInitIPMI(<NodeConfig(10.30.15.47) @d1d0>) @dc10>

20170509 14:50:27 DEBUG Setting state of <ImagingStepPreInstall(<NodeConfig(10.30.15.47) @d1d0>)


@dcd0> from PENDING to RUNNING

20170509 14:50:27 INFO Running <ImagingStepPreInstall(<NodeConfig(10.30.15.47) @d1d0>) @dcd0>

Rebooting into staging environment

20170509 14:50:27 INFO Rebooting into staging environment

20170509 14:50:28 INFO Node with ip 10.30.15.47 is in phoenix. Generating hardware_config.json

20170509 14:50:29 INFO Boot device has no WWN

20170509 14:50:29 INFO Boot device serial: 20151117AAAA00000767

20170509 14:50:29 INFO Scp-ied platform_reference.json to phoenix at ip 10.30.15.47

20170509 14:50:30 INFO NIC with PCI address 04:00:0 will be used for NIC teaming if default teaming
fails

20170509 14:50:30 INFO This node is not a gold node

20170509 14:50:31 ERROR Command '/usr/bin/ipmicfg-linux.x86_64 -tp info' returned error code 13
stdout: stderr: Not TwinPro

20170509 14:50:31 ERROR Command '/usr/bin/ipmicfg-linux.x86_64 -tp nodeid' returned error code 13
stdout: stderr:Not TwinPro

20170509 14:50:31 DEBUG Setting state of <ImagingStepPreInstall(<NodeConfig(10.30.15.47) @d1d0>)


@dcd0> from RUNNING to FINISHED

20170509 14:50:31 INFO Completed <ImagingStepPreInstall(<NodeConfig(10.30.15.47) @d1d0>)


@dcd0>

20170509 14:50:31 DEBUG Setting state of <ImagingStepPhoenix(<NodeConfig(10.30.15.47) @d1d0>)


@dd90> from PENDING to RUNNING

20170509 14:50:31 DEBUG Setting state of <InstallHypervisorKVM(<NodeConfig(10.30.15.47) @d1d0>)


@de50> from PENDING to RUNNING

20170509 14:50:31 INFO Running <ImagingStepPhoenix(<NodeConfig(10.30.15.47) @d1d0>) @dd90>

20170509 14:50:31 INFO Making node specific Phoenix json. This may take few minutes

20170509 14:50:31 INFO Running <InstallHypervisorKVM(<NodeConfig(10.30.15.47) @d1d0>) @de50>

20170509 14:50:42 INFO Start downloading resources, this may take several minutes
20170509 14:50:42 INFO Waiting for Phoenix to finish downloading resources

20170509 14:50:43 INFO Model detected: NX-1065S

20170509 14:50:43 INFO Using node_serial from FRU

20170509 14:50:43 INFO Using block_id from FRU

20170509 14:50:43 INFO Using cluster_id from FRU

20170509 14:50:43 INFO Using node_position from AZ_CONFIG

20170509 14:50:43 INFO node_serial = ZM163S033945, block_id = 16SM13150152, cluster_id = 47802, model = USE_LAYOUT, model_string = NX-1065S, node_position = A

20170509 14:50:43 INFO Running updated Phoenix

Downloading Acropolis Tarball

20170509 14:50:43 INFO Downloading Acropolis tarball

20170509 14:50:43 INFO Downloading Acropolis tarball: Downloading Acropolis tarball

20170509 14:50:43 INFO Downloading file 'nutanix_installer_package.tar' with size: 3702784000 bytes.

Downloading Hypervisor ISO

20170509 14:50:52 INFO Downloading hypervisor iso

20170509 14:50:52 INFO Downloading hypervisor iso: Downloading hypervisor iso

20170509 14:50:52 INFO Downloading file 'hypervisor_iso.10.30.15.47.kvm.20160601.50.iso' with size:


499826688 bytes.

20170509 14:50:54 INFO Running CVM Installer: Running CVM Installer

20170509 14:50:54 INFO Running CVM Installer

20170509 14:50:54 INFO Generating unique SSH identity for this Hypervisor-CVM pair.

20170509 14:50:54 INFO Generating SSL certificate for this Hypervisor-CVM pair.

20170509 14:50:55 INFO Extracting the SVM installer into memory. This will take some time...

20170509 14:50:56 INFO Injecting SSH keys into SVM installer.

20170509 14:50:56 INFO Using hcl from /phoenix/hcl.json with last_edit 1488347720

20170509 14:50:56 INFO Imaging the SVM

20170509 14:50:56 INFO Formatting all data disks ['sdd', 'sde', 'sdc']

20170509 14:50:56 INFO Executing /mnt/svm_installer/install/bin/svm_rescue with arg_list ['-i', '/mnt/svm_installer/install', '--factory_deploy', '--node_name=16SM13150152-A', '--node_serial=ZM163S033945', '--node_model=USE_LAYOUT', '--cluster_id=47802', '--node_uuid=ca38fdbb-3e9a-43ae-b6a9-22905d61cf0d']
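The exec_cmd entry above is the actual svm_rescue invocation Phoenix runs on the node to image the CVM; it is not something you would normally run by hand. Shown only to make the argument list easier to read, here is the same call expressed as a subprocess launch, with the node-specific values (serial, block ID, cluster ID, UUID) copied verbatim from this log and sourced from the FRU/layout detection logged earlier:

import subprocess

# Node-specific values below are taken from this log; they differ on every node.
cmd = [
    "/mnt/svm_installer/install/bin/svm_rescue",
    "-i", "/mnt/svm_installer/install",
    "--factory_deploy",
    "--node_name=16SM13150152-A",          # block_id + node_position
    "--node_serial=ZM163S033945",          # from FRU
    "--node_model=USE_LAYOUT",
    "--cluster_id=47802",                  # from FRU
    "--node_uuid=ca38fdbb-3e9a-43ae-b6a9-22905d61cf0d",
]
rc = subprocess.call(cmd)
print("svm_rescue exited with %d" % rc)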

20170509 14:50:57 INFO kvm iso unpacked in /tmp/tmpbuwRtm

20170509 14:50:57 INFO Kickstart file created

20170509 14:50:57 INFO Bootloader config created

20170509 14:50:57 INFO KVM iso packed

20170509 14:50:57 INFO Created tap0 tap device

20170509 14:50:57 INFO Cuttlefish running on Phoenix

20170509 14:50:57 INFO Installation Device = /dev/sdb

20170509 14:50:57 INFO Installer VM is now running the installation

20170509 14:50:57 INFO Installer VM running with PID = 1567

Install Hypervisor and Configure RAID

20170509 14:51:20 INFO Installing AHV: Installing AHV

20170509 14:51:27 INFO [30/1230] Hypervisor installation in progress

20170509 14:51:57 INFO [60/1230] Hypervisor installation in progress

20170509 14:52:27 INFO [90/1230] Hypervisor installation in progress

20170509 14:52:28 INFO Scanning all disks to assemble RAID. This may take few minutes

20170509 14:52:29 INFO Creating layout file for NX-1065S in position A

20170509 14:52:29 INFO Running firmware detection: Running firmware detection

20170509 14:52:29 INFO Detected Supermicro platform. Performing firmware detection..

20170509 14:52:40 INFO Scanning all disks to assemble RAID. This may take few minutes

20170509 14:52:40 INFO Using SUM tool version 1.5

20170509 14:52:41 INFO Scanning all disks to assemble RAID. This may take few minutes

20170509 14:52:42 ERROR Config file /mnt/nutanix_boot/etc/nutanix/firmware_config.json does not


exist

20170509 14:52:42 INFO Creating firmware_config.json..

20170509 14:52:42 INFO Populating firmware information for device bmc..

20170509 14:52:42 INFO Scanning all disks to assemble RAID. This may take few minutes
20170509 14:52:43 INFO Scanning all disks to assemble RAID. This may take few minutes

20170509 14:52:43 INFO Using SUM tool version 1.5

20170509 14:52:57 INFO [120/1230] Hypervisor installation in progress

20170509 14:53:01 INFO Scanning all disks to assemble RAID. This may take few minutes

20170509 14:53:02 INFO Populating firmware information for device bios..

20170509 14:53:02 INFO Scanning all disks to assemble RAID. This may take few minutes

20170509 14:53:03 INFO Scanning all disks to assemble RAID. This may take few minutes

20170509 14:53:03 INFO Populating firmware information for device satadom..

20170509 14:53:04 INFO Scanning all disks to assemble RAID. This may take few minutes

20170509 14:53:04 INFO Imaging of SVM has completed successfully!

20170509 14:53:04 DEBUG 2017-05-09 21:51:20 INFO svm_rescue:602 Will image ['/dev/sdc'] from /mnt/svm_installer/install/images/svm.tar.xz.

2017-05-09 21:51:20 INFO svm_rescue:97 exec_cmd: mdadm --stop --scan

2017-05-09 21:51:20 INFO svm_rescue:439 Disks detected from Phoenix: ['/dev/sdd', '/dev/sde',
'/dev/sdc']

2017-05-09 21:51:20 INFO svm_rescue:461 Repartitioning disk /dev/sdd

2017-05-09 21:51:20 INFO svm_rescue:97 exec_cmd: /usr/bin/python2.6


/mnt/cdrom/bin/repartition_disks -d /dev/sdd

2017-05-09 21:51:21 INFO svm_rescue:461 Repartitioning disk /dev/sde

2017-05-09 21:51:21 INFO svm_rescue:97 exec_cmd: /usr/bin/python2.6


/mnt/cdrom/bin/repartition_disks -d /dev/sde

2017-05-09 21:51:23 INFO svm_rescue:461 Repartitioning disk /dev/sdc

2017-05-09 21:51:23 INFO svm_rescue:97 exec_cmd: parted -s /dev/sdc unit s print

2017-05-09 21:51:23 INFO svm_rescue:149 No partition table present on disk /dev/sdc

2017-05-09 21:51:23 INFO svm_rescue:413 Need to repartition and format blank boot drive /dev/sdc

2017-05-09 21:51:23 INFO svm_rescue:97 exec_cmd: /usr/bin/python2.6


/mnt/cdrom/bin/repartition_disks -b /dev/sdc

2017-05-09 21:51:23 INFO svm_rescue:399 Formatting disks ['/dev/sdc4', '/dev/sdd1', '/dev/sde1']

2017-05-09 21:51:23 INFO svm_rescue:97 exec_cmd: /usr/bin/python2.6 /mnt/cdrom/bin/clean_disks -


p /dev/sdc4,/dev/sdd1,/dev/sde1
2017-05-09 21:51:35 INFO svm_rescue:399 Formatting disks ['/dev/sdc1', '/dev/sdc2', '/dev/sdc3']

2017-05-09 21:51:35 INFO svm_rescue:97 exec_cmd: /usr/bin/python2.6 /mnt/cdrom/bin/clean_disks -


p /dev/sdc1,/dev/sdc2,/dev/sdc3

2017-05-09 21:51:36 INFO svm_rescue:97 exec_cmd: mount /dev/sdc1 /mnt/disk

2017-05-09 21:51:36 INFO svm_rescue:97 exec_cmd: cd /mnt/disk; tar -xJpvf


/mnt/svm_installer/install/images/svm.tar.xz

2017-05-09 21:52:00 INFO svm_rescue:97 exec_cmd: bash -c 'cd /mnt/disk/usr/local/nutanix; tar -xvzf
/mnt/svm_installer/install

/pkg/nutanix-bootstrap-el6-release-danube-4.7.5.2-stable-
7c83bbaf29e9b603f3a0825bee65f568d79603b9.tar.gz'

2017-05-09 21:52:01 INFO svm_rescue:97 exec_cmd: bash -c 'cd /mnt/disk/usr/local/nutanix; tar -xvzf
/mnt/svm_installer/install

/pkg/nutanix-core-el6-release-danube-4.7.5.2-stable-
7c83bbaf29e9b603f3a0825bee65f568d79603b9.tar.gz'

2017-05-09 21:52:08 INFO svm_rescue:97 exec_cmd: bash -c 'cd /mnt/disk/usr/local/nutanix; tar -xvzf
/mnt/svm_installer/install

/pkg/nutanix-diagnostics-el6-release-danube-4.7.5.2-stable-
7c83bbaf29e9b603f3a0825bee65f568d79603b9.tar.gz'

2017-05-09 21:52:12 INFO svm_rescue:97 exec_cmd: bash -c 'cd /mnt/disk/usr/local/nutanix; tar -xvzf
/mnt/svm_installer/install

/pkg/nutanix-infrastructure-el6-release-danube-4.7.5.2-stable-
7c83bbaf29e9b603f3a0825bee65f568d79603b9.tar.gz'

2017-05-09 21:52:12 INFO svm_rescue:97 exec_cmd: bash -c 'cd /mnt/disk/usr/local/nutanix; tar -xvzf
/mnt/svm_installer/install

/pkg/nutanix-serviceability-el6-release-danube-4.7.5.2-stable-
7c83bbaf29e9b603f3a0825bee65f568d79603b9.tar.gz'

2017-05-09 21:52:12 INFO svm_rescue:97 exec_cmd: bash -c 'cd /mnt/disk/usr/local/nutanix; tar -xvzf
/mnt/svm_installer/install

/pkg/nutanix-ncc-el6-release-ncc-3.0.1.1-latest.tar.gz'

2017-05-09 21:52:13 INFO svm_rescue:97 exec_cmd: bash -c 'cd /mnt/disk/usr/local/nutanix; tar -xvzf
/mnt/svm_installer/install

/pkg/nutanix-minervacvm-el6-release-danube-4.7.5.2-stable-
7c83bbaf29e9b603f3a0825bee65f568d79603b9.tar.gz'
2017-05-09 21:52:13 INFO svm_rescue:97 exec_cmd: bash -c 'cd /mnt/disk/usr/local/nutanix; tar -xvzf
/mnt/svm_installer/install

/pkg/nutanix-perftools-el6-release-danube-4.7.5.2-stable-
7c83bbaf29e9b603f3a0825bee65f568d79603b9.tar.gz'

2017-05-09 21:52:13 INFO svm_rescue:97 exec_cmd: bash -c 'cd /mnt/disk/usr/local/nutanix; tar -xvzf
/mnt/svm_installer/install

/pkg/nutanix-syscheck-el6-release-danube-4.7.5.2-stable-
7c83bbaf29e9b603f3a0825bee65f568d79603b9.tar.gz'

2017-05-09 21:52:13 INFO svm_rescue:97 exec_cmd: bash -c 'cd /mnt/disk/srv/; tar -xvzf
/mnt/svm_installer/install/pkg/nutanix-

salt-el6-release-danube-4.7.5.2-stable-7c83bbaf29e9b603f3a0825bee65f568d79603b9.tar.gz'

2017-05-09 21:52:13 INFO svm_rescue:644 Making post deployment modifications on /dev/sdc1

2017-05-09 21:52:13 INFO svm_rescue:97 exec_cmd: blkid -c /dev/null /dev/sdc1

2017-05-09 21:52:13 INFO svm_rescue:367 Root filesystem (on /dev/sdc1) UUID is a394d0df-aa23-4d8d-8f1c-4c3f6130dd6a instead of 198775c3-801f-44fb-9801-9bf243f623bc.

2017-05-09 21:52:13 INFO svm_rescue:175 Creating Nutanix boot marker file from grub.conf ...

2017-05-09 21:52:13 INFO svm_rescue:200 Wrote marker file, contents:

KERNEL=/boot/vmlinuz-3.10.0-229.46.1.el6.nutanix.20170119.cvm.x86_64

CMDLINE='ro root=UUID=a394d0df-aa23-4d8d-8f1c-4c3f6130dd6a rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD quiet SYSFONT=latarcyrheb-sun16 rhgb crashkernel=no KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM audit=1 nousb fips=1 nomodeset panic=30 console=ttyS0,115200n8 console=tty0'

INITRD=/boot/initramfs-3.10.0-229.46.1.el6.nutanix.20170119.cvm.x86_64.img

2017-05-09 21:52:13 INFO svm_rescue:231 Adding /dev/sdc3 to /etc/fstab to be mounted at /home...

2017-05-09 21:52:13 INFO svm_rescue:97 exec_cmd: blkid -c /dev/null /dev/sdc3

2017-05-09 21:52:13 INFO svm_rescue:97 exec_cmd: mount /dev/sdc3 /mnt/data

2017-05-09 21:52:13 INFO svm_rescue:689 Fixing permissions on /home.

2017-05-09 21:52:13 INFO svm_rescue:689 Fixing permissions on /home./nutanix

2017-05-09 21:52:13 INFO svm_rescue:714 Copying installer to /mnt/data/nutanix/data/installer/el6-release-danube-4.7.5.2-stable-7c83bbaf29e9b603f3a0825bee65f568d79603b9

2017-05-09 21:52:13 INFO svm_rescue:97 exec_cmd: tar -C '/mnt/svm_installer/install' -cf - . | tar -C '/mnt/data/nutanix/data/installer/el6-release-danube-4.7.5.2-stable-7c83bbaf29e9b603f3a0825bee65f568d79603b9' -xvf -

2017-05-09 21:52:16 INFO svm_rescue:97 exec_cmd: tar -xvzf /mnt/svm_installer/install/pkg/nutanix-foundation-3.7-20170217-df269745.tar.gz -C /mnt/data/nutanix

2017-05-09 21:52:24 INFO svm_rescue:742 Setting up SSH configuration

2017-05-09 21:52:24 INFO svm_rescue:748 Writing out rc.nutanix

2017-05-09 21:52:24 INFO svm_rescue:97 exec_cmd: sync; sync; sync

2017-05-09 21:52:40 INFO svm_rescue:97 exec_cmd: umount /mnt/disk

2017-05-09 21:52:41 INFO svm_rescue:97 exec_cmd: umount /mnt/data

20170509 14:53:04 INFO Imaging thread 'svm' has completed successfully

Hypervisor installation

20170509 14:53:27 INFO [150/1230] Hypervisor installation in progress

20170509 14:53:57 INFO [180/1230] Hypervisor installation in progress

20170509 14:54:27 INFO [210/1230] Hypervisor installation in progress

20170509 14:54:57 INFO [240/1230] Hypervisor installation in progress

20170509 14:55:28 INFO [270/1230] Hypervisor installation in progress

20170509 14:55:58 INFO [300/1230] Hypervisor installation in progress

20170509 14:56:28 INFO [330/1230] Hypervisor installation in progress

20170509 14:56:58 INFO [360/1230] Hypervisor installation in progress

20170509 14:57:28 INFO [390/1230] Hypervisor installation in progress

20170509 14:57:58 INFO [420/1230] Hypervisor installation in progress

20170509 14:58:28 INFO [450/1230] Hypervisor installation in progress

20170509 14:58:34 INFO Rebooting AHV: Rebooting AHV

20170509 14:58:34 DEBUG Setting state of <InstallHypervisorKVM(<NodeConfig(10.30.15.47) @d1d0>)


@de50> from RUNNING to FINISHED

20170509 14:58:34 INFO Completed <InstallHypervisorKVM(<NodeConfig(10.30.15.47) @d1d0>)


@de50>
20170509 14:58:58 INFO [480/1230] Hypervisor installation in progress

20170509 14:58:58 INFO Installer VM finished in 480.645823956s.

20170509 14:58:58 INFO Hypervisor installation is done

20170509 14:58:58 INFO Killing cuttlefish

20170509 14:58:58 INFO Destroying tap0 interface

20170509 14:58:58 INFO Re-reading partition table for /dev/sdb

20170509 14:59:01 INFO Customizing KVM instance

20170509 14:59:01 INFO Setting hostname to 'node-1'

20170509 14:59:01 INFO Copying SVM template files

20170509 14:59:01 INFO Configuring SVM resources

20170509 14:59:01 INFO Setting CVM memory to 16GB

20170509 14:59:01 INFO Setting vCPUs to 8

20170509 14:59:01 INFO Setting cpuset to 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19

20170509 14:59:01 INFO Fetching the list of HBAs from layout

20170509 14:59:01 INFO Adding device [u'03:00.0', u'1000', u'0072'] to passthrough devices

20170509 14:59:01 INFO Regenerating initramfs

20170509 14:59:30 INFO Setting up authorized_keys

20170509 14:59:30 INFO Copying firstboot scripts into /mnt/stage/root/firstboot

20170509 14:59:30 INFO Copying SSH keys

20170509 14:59:30 INFO Copying network configuration crashcart scripts into /mnt/stage/root/nutanix-
network-crashcart

20170509 14:59:30 INFO Installing firstboot marker file

20170509 14:59:30 INFO Cleaning up

20170509 14:59:32 INFO Imaging thread 'hypervisor' has completed successfully

20170509 14:59:32 ERROR Could not copy installer logs to '/mnt/stage/var/log'! Continuing..

20170509 14:59:32 INFO Imaging process completed successfully!

20170509 14:59:32 INFO Installation of Acropolis base software successful: Installation successful.

20170509 14:59:32 INFO Rebooting node. This may take several minutes: Rebooting node. This may
take several minutes
20170509 14:59:32 INFO Rebooting node. This may take several minutes

20170509 15:01:36 INFO Running cmd [u'/bin/ping -c 1 10.30.15.90']

20170509 15:01:36 INFO Running firstboot scripts: Running firstboot scripts

20170509 15:01:36 INFO Expanding boot partition. This may take some time.

20170509 15:01:36 INFO Running cmd ['/usr/bin/nohup /sbin/resize2fs /dev/sda1 &>/dev/null &']

20170509 15:01:36 INFO Running cmd ['touch /root/firstboot/phases/expand_boot_partition']

20170509 15:01:36 INFO Running cmd ['virsh net-destroy default']

20170509 15:01:36 INFO Running cmd ['virsh net-undefine default']

20170509 15:01:36 INFO Running cmd ['virsh net-destroy VM-Network']

20170509 15:01:36 INFO Running cmd ['virsh net-undefine VM-Network']

20170509 15:01:36 INFO Running cmd ['virsh net-define /root/net-VM-Network.xml']

20170509 15:01:36 INFO Running cmd ['virsh net-start VM-Network']

20170509 15:01:36 INFO Running cmd ['virsh net-autostart VM-Network']

20170509 15:01:36 INFO Running cmd ['virsh net-destroy NTNX-Local-Network']

20170509 15:01:36 INFO Running cmd ['virsh net-undefine NTNX-Local-Network']

20170509 15:01:36 INFO Running cmd ['virsh net-define /root/net-NTNX-Local-Network.xml']

20170509 15:01:36 INFO Running cmd ['virsh net-start NTNX-Local-Network']

20170509 15:01:37 INFO Running cmd ['virsh net-autostart NTNX-Local-Network']

20170509 15:01:37 INFO Running cmd ['touch /root/firstboot/phases/create_libvirt_networks']
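Each firstboot step above follows the same pattern: run a series of commands, then touch a marker under /root/firstboot/phases/ to record that the phase completed (presumably so a re-run of firstboot can skip phases that already succeeded). A minimal sketch of that pattern, assuming a hypothetical run_cmd() helper that raises on a non-zero exit status (the real firstboot script's helpers may differ):

import os
import subprocess

PHASE_DIR = "/root/firstboot/phases"

def run_cmd(cmd):
    """Run a shell command and raise if it exits non-zero."""
    print("Running cmd [%r]" % cmd)
    subprocess.check_call(cmd, shell=True)

def run_phase(name, commands):
    """Run a firstboot phase; skip it if its marker file already exists."""
    marker = os.path.join(PHASE_DIR, name)
    if os.path.exists(marker):
        return                          # assumption: marker means the phase already ran
    for cmd in commands:
        run_cmd(cmd)
    run_cmd("touch %s" % marker)        # record completion, as seen in the log

# Example mirroring the log: define and start the VM-Network libvirt network.
run_phase("create_libvirt_networks", [
    "virsh net-define /root/net-VM-Network.xml",
    "virsh net-start VM-Network",
    "virsh net-autostart VM-Network",
])

When a node fails partway through firstboot, checking which marker files exist under /root/firstboot/phases/ tells you which phases completed before the failure.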

Creating New CVM

20170509 15:01:37 INFO Creating a new CVM: Creating a new CVM

20170509 15:01:37 INFO Running cmd ['echo "$(hostname)-CVM"']

20170509 15:01:37 INFO Running cmd [u'sed -i "s/<name>NTNX-CVM/<name>NTNX-node-1-CVM/"


/root/NTNX-CVM.xml']

20170509 15:01:37 INFO Running cmd ['sed -i "/<uuid>/d" /root/NTNX-CVM.xml']

20170509 15:01:37 INFO Running cmd ['chmod 644 /var/lib/libvirt/NTNX-CVM/svmboot.iso']

20170509 15:01:37 INFO Running cmd ['chown qemu:qemu /var/lib/libvirt/NTNX-CVM/svmboot.iso']

20170509 15:01:37 INFO Running cmd ['virsh define /root/NTNX-CVM.xml']

20170509 15:01:37 INFO Running cmd [u'virsh start "NTNX-node-1-CVM"']


20170509 15:01:39 INFO Running cmd [u'virsh autostart "NTNX-node-1-CVM"']

20170509 15:01:39 INFO Run ssh cmd on SVM

20170509 15:01:39 INFO Running cmd ['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfP

asswordPrompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls


/tmp/svm_boot_succeeded"']

20170509 15:01:42 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 1 times with reason [

FIPS mode initialized ssh: connect to host 192.168.5.2 port 22: No route to host]. Will retry in 5 seconds

20170509 15:01:50 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 2 times with reason [

FIPS mode initialized ssh: connect to host 192.168.5.2 port 22: No route to host]. Will retry in 5 seconds

20170509 15:01:59 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 3 times with reason [

FIPS mode initialized ssh: connect to host 192.168.5.2 port 22: No route to host]. Will retry in 5 seconds

20170509 15:02:04 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 4 times with reason [

FIPS mode initialized ssh: connect to host 192.168.5.2 port 22: Connection refused]. Will retry in 5
seconds

20170509 15:02:09 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 5 times with reason [

FIPS mode initialized ssh: connect to host 192.168.5.2 port 22: Connection refused]. Will retry in 5
seconds
20170509 15:02:14 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o
StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 6 times with reason [

FIPS mode initialized ssh: connect to host 192.168.5.2 port 22: Connection refused]. Will retry in 5
seconds

20170509 15:02:19 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 7 times with reason [

FIPS mode initialized ssh: connect to host 192.168.5.2 port 22: Connection refused]. Will retry in 5
seconds

20170509 15:02:24 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 8 times with reason [

FIPS mode initialized ssh: connect to host 192.168.5.2 port 22: Connection refused]. Will retry in 5
seconds

20170509 15:02:29 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 9 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.
Nutanix Controller VM

ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry in 5 seconds

20170509 15:02:35 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 10 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry
in 5 seconds

20170509 15:02:40 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword
Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]
failed 11 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.
Nutanix Controller VM

ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry in 5 seconds

20170509 15:02:46 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 12 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry
in 5 seconds

20170509 15:02:51 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 13 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry
in 5 seconds

20170509 15:02:56 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 14 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.
Nutanix Controller VM

ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry in 5 seconds

20170509 15:03:02 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 15 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry
in 5 seconds
20170509 15:03:07 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o
StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 16 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry
in 5 seconds

20170509 15:03:12 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 17 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry
in 5 seconds

20170509 15:03:18 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 18 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry
in 5 seconds

20170509 15:03:23 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 19 times with reason [

FIPS mode initialized

Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry
in 5 seconds

20170509 15:03:28 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 20 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.
Nutanix Controller VM ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry
in 5 seconds

20170509 15:03:34 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 21 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.
Nutanix Controller VM

ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry in 5 seconds

20170509 15:03:39 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 22 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry
in 5 seconds

20170509 15:03:44 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 23 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM

ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry in 5 seconds

20170509 15:03:50 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 24 times with reason [

FIPS mode initialized

Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM

ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry in 5 seconds

20170509 15:03:55 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword
Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]
failed 25 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.
Nutanix Controller VM

ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry in 5 seconds

20170509 15:04:00 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 26 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.
Nutanix Controller VM

ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry in 5 seconds

20170509 15:04:06 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 27 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.
Nutanix Controller VM

ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry in 5 seconds

20170509 15:04:11 INFO Cmd [['/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix -o


StrictHostKeyChecking=no -o NumberOfPassword

Prompts=0 -o UserKnownHostsFile=/dev/null nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"']]


failed 28 times with reason [

FIPS mode initialized Warning: Permanently added '192.168.5.2' (RSA) to the list of known hosts.

Nutanix Controller VM ls: cannot access /tmp/svm_boot_succeeded: No such file or directory]. Will retry
in 5 seconds

20170509 15:04:16 INFO CVM booted up successfully: CVM booted up successfully

20170509 15:04:16 INFO Last reboot complete: Firstboot successful

20170509 15:04:16 INFO Running cmd ['touch /root/.firstboot_success']

20170509 15:04:16 INFO Running cmd ['service ntpd start']

20170509 15:04:17 INFO Running cmd ['/sbin/chkconfig ntpd on']

20170509 15:04:19 INFO Timezone America/Tijuana set successfully
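The long run of "failed N times ... Will retry in 5 seconds" messages above is the firstboot script polling the CVM over the host-internal 192.168.5.2 interface until the marker file /tmp/svm_boot_succeeded appears. The early "No route to host" and "Connection refused" errors simply mean the CVM is still booting; the loop only fails if it exhausts its retry budget (the exact limit lives inside firstboot; 60 attempts below is just an assumption for the sketch). A stripped-down version of that loop, using the same key path and marker file shown in the log:

import subprocess
import time

SSH_CMD = (
    '/usr/bin/ssh -i /root/firstboot/ssh_keys/nutanix '
    '-o StrictHostKeyChecking=no -o NumberOfPasswordPrompts=0 '
    '-o UserKnownHostsFile=/dev/null '
    'nutanix@192.168.5.2 "ls /tmp/svm_boot_succeeded"'
)

def wait_for_cvm(max_attempts=60, retry_secs=5):
    """Retry the marker-file check until the CVM reports a successful boot."""
    for attempt in range(1, max_attempts + 1):
        rc = subprocess.call(SSH_CMD, shell=True)
        if rc == 0:
            print("CVM booted up successfully")
            return True
        print("failed %d times ... Will retry in %d seconds" % (attempt, retry_secs))
        time.sleep(retry_secs)
    return False

A few dozen retries here is normal on first boot; it becomes a problem only when the loop never succeeds, which usually points at the CVM failing to start or the internal 192.168.5.x network being broken.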


20170509 15:04:19 DEBUG Setting state of <ImagingStepPhoenix(<NodeConfig(10.30.15.47) @d1d0>)
@dd90> from RUNNING to FINISHED

20170509 15:04:19 INFO Completed <ImagingStepPhoenix(<NodeConfig(10.30.15.47) @d1d0>) @dd90>

20170509 15:04:49 DEBUG Unable to ssh using private key

20170509 15:06:48 INFO Uploading iso whitelist

20170509 15:06:50 INFO Successfully uploaded iso whitelist to cvm at 10.30.15.47


Appendix B

Module 7 AFS Troubleshooting - Minerva log


nutanix@NTNX-16SM13150152-A-CVM:10.30.15.47:~/data/logs$ cat minerva_cvm.log

2017-05-26 13:40:11 rolled over log file

2017-05-26 13:40:11 INFO minerva_cvm:37 MinervaCvm not monkey patched

2017-05-26 13:40:11 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-26 13:40:11 INFO server.py:128 Starting the leadership thread

2017-05-26 13:40:11 INFO server.py:90 Starting the serve_http thread

2017-05-26 13:40:11 INFO server.py:378 Leader is 10.30.15.47

2017-05-26 13:40:11 INFO server.py:391 Creating upgrade zookeeper nodes if not already created

2017-05-26 13:40:11 INFO server.py:382 This node is the leader

2017-05-26 13:40:11 INFO server.py:138 Starting task service

2017-05-26 13:40:11 INFO server.py:177 classmap {'minerva.cvm.upgrade_task':


['FileServerUpgradeTask'], 'minerva.cvm.disaster_recovery': ['FileServerDRTask'],
'minerva.cvm.share_task': ['ShareCvmTask'], 'minerva.cvm.domain_task': ['DomainTasks'],
'minerva.cvm.file_server': ['FileServer'], 'minerva.cvm.cvm_config_change_notify':
['CvmConfigChangeNotify']}

2017-05-26 13:40:11 INFO server.py:178 methodmap {'FileServer': ['FileServerAdd', 'FileServerDelete',


'FileServerUpdate', 'FileServerActivate', 'FileServerNodeAdd', 'FileServerNodeDelete',
'FileServerRebalance', 'FileServerVmUpdate', 'FileServerVmDeploy'], 'DomainTasks':
['JoinDomainTask', 'LeaveDomainTask'], 'ShareCvmTask': ['ShareAdd', 'ShareUpdate', 'ShareDelete'],
'FileServerUpgradeTask': ['FileServerUpgrade', 'FileServerVmUpgrade', 'FileServerUpgradeAll'],
'CvmConfigChangeNotify': ['CvmConfigChangeNotifyMasterTask', 'CvmConfigChangeNotifySubTask'],
'FileServerDRTask': ['FileServerProtect', 'FileServerPdDeactivate']}

2017-05-26 13:40:11 INFO server.py:179 Registering method name and class name

2017-05-26 13:40:11 INFO server.py:183 Starting Dispatcher with component name: minerva_cvm

2017-05-26 13:40:11 INFO server.py:215 Watch for CVM configuration changes..

2017-05-26 13:40:11 INFO dispatcher.py:273 starting dispatcher

2017-05-26 13:40:11 INFO server.py:219 Initializing the CVM IP address list cache at zk node
/appliance/logical/samba/cvm_ipv4_address_list_cache

2017-05-26 13:40:11 INFO dispatcher.py:78 Recreating the sequence id for component:minerva_cvm

2017-05-26 13:40:11 INFO dispatcher.py:82 Fetching the task list for the component

2017-05-26 13:40:11 INFO server.py:206 Successfully cached the CVM IP address list present in zeus.
2017-05-26 13:40:11 INFO dispatcher.py:88 Fetched the task list

2017-05-26 13:40:11 INFO dispatcher.py:90 No list found

2017-05-26 13:40:11 INFO dispatcher.py:283 Got last known sequence id

2017-05-26 13:40:11 INFO dispatcher.py:284 Running the dispatcher thread

2017-05-26 13:40:11 INFO dispatcher.py:295 sequence id: 0

2017-05-30 10:07:43 INFO file_server.py:158 Creating File server.

2017-05-30 10:07:43 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:07:43 INFO minerva_utils.py:3487 add_spec {

uuid: "9e2de0ba-3d4c-411f-8f85-621a2f5f6542"

name: "afs01"

nvm_list {

uuid: "ac3c7060-db9e-4297-b6f4-564b1083ef92"

name: "NTNX-afs01-1"

num_vcpus: 4

memory_mb: 12288

local_cvm_ipv4_address: ""

nvm_list {

uuid: "ddbf4012-800d-48e1-a92c-c6ca2c612563"

name: "NTNX-afs01-2"

num_vcpus: 4

memory_mb: 12288

local_cvm_ipv4_address: ""

nvm_list {

uuid: "66386092-223b-48d5-9340-9f25b6338229"

name: "NTNX-afs01-3"

num_vcpus: 4
memory_mb: 12288

local_cvm_ipv4_address: ""

internal_network {

ipv4_address_list: "10.30.14.244"

ipv4_address_list: "10.30.14.245"

ipv4_address_list: "10.30.14.246"

ipv4_address_list: "10.30.14.247"

netmask_ipv4_address: "255.255.240.0"

gateway_ipv4_address: "10.30.0.1"

virtual_network_uuid: "243cb3ae-8ba5-4fb5-ba8e-5fed88ca108b"

external_network_list {

ipv4_address_list: "10.30.15.241"

ipv4_address_list: "10.30.15.242"

ipv4_address_list: "10.30.15.243"

netmask_ipv4_address: "255.255.240.0"

gateway_ipv4_address: "10.30.0.1"

virtual_network_uuid: "243cb3ae-8ba5-4fb5-ba8e-5fed88ca108b"

dns_ipv4_address: "10.30.15.91"

ntp_ipv4_address: "10.30.15.91"

container_uuid: "cdc48f84-afcf-4b8d-b53c-7026d62512a0"

join_domain {

realm_name: "learn.nutanix.local"

username: "administrator"

password: "********"

set_spn_dns_only: false

}
size_bytes: 1099511627776

is_new_container: true

status: kNotStarted

time_stamp: 1496164063

task_header {

file_server_uuid: "9e2de0ba-3d4c-411f-8f85-621a2f5f6542"

file_server_name: "afs01"

2017-05-30 10:07:43 INFO minerva_utils.py:857 target_release_version el6-release-euphrates-5.1.0.1-stable-419aa3a83df5548924198f85398deb20e8b615fe, nos_version 5.1.0.1

2017-05-30 10:07:43 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:07:43 INFO file_server_utils.py:208 software_nos_version 5\.0\.0\.[0-9]$

2017-05-30 10:07:43 INFO file_server_utils.py:208 software_nos_version 5\.0\.1\.[0-9]$

2017-05-30 10:07:43 INFO file_server_utils.py:208 software_nos_version 5\.0\.2\.[0-9]$

2017-05-30 10:07:43 INFO file_server_utils.py:208 software_nos_version 5\.0\.3\.[0-9]$

2017-05-30 10:07:43 INFO minerva_utils.py:828 Found software for compatible_nos_version 5.1.0.1

2017-05-30 10:07:43 INFO minerva_utils.py:922 target software: name: "2.1.0.1"

version: "2.1.0.1"

size: 1360936448

md5_sum: "42b2b3336a5dd02e3ac58ce0b4b048e3"

service_vm_id: 6

filepath: "ndfs:///NutanixManagementShare/afs/2.1.0.1"

compatible_version_list: "2.1.0"

compatible_version_list: "2.0.2"

compatible_version_list: "2.0.1"

compatible_version_list: "2.0.0.3"
compatible_version_list: "2.0.0.2"

compatible_version_list: "2.0.0"

operation_type: kUpload

operation_status: kCompleted

operator_type: kUser

compatible_nos_version_list: "5.0"

compatible_nos_version_list: "5.0.0.*"

compatible_nos_version_list: "5.0.1"

compatible_nos_version_list: "5.0.1.*"

compatible_nos_version_list: "5.0.2"

compatible_nos_version_list: "5.0.2.*"

compatible_nos_version_list: "5.0.3"

compatible_nos_version_list: "5.0.3.*"

compatible_nos_version_list: "5.1"

compatible_nos_version_list: "5.1.0.1"

compatible_nos_version_list: "5.1.0.*"

download_time: 1495836373927000

release_date: 1494010736

full_release_version: "el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

2017-05-30 10:07:43 INFO minerva_utils.py:941 directory path /afs/2.1.0.1

2017-05-30 10:07:43 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:07:43 INFO genesis_utils.py:2484 Node 10.30.15.47 is not light compute node

2017-05-30 10:07:43 INFO genesis_utils.py:2484 Node 10.30.15.48 is not light compute node

2017-05-30 10:07:43 INFO genesis_utils.py:2484 Node 10.30.15.49 is not light compute node

2017-05-30 10:07:43 INFO file_server_misc.py:704 CVM without light compute:


10.30.15.49,10.30.15.48,10.30.15.47

2017-05-30 10:07:43 INFO validation.py:167 Validating File server: afs01


2017-05-30 10:07:43 INFO uvm_network_utils.py:231 network_config_list {

uuid: "$<\263\256\213\245O\265\272\216_\355\210\312\020\213"

logical_timestamp: 1

type: kBridged

identifier: 0

name: "vlan0"

2017-05-30 10:07:43 ERROR uvm_network_utils.py:234 Network is acropolis managed.

2017-05-30 10:07:43 INFO validation.py:72 ipv4_address_list: "10.30.14.244"

ipv4_address_list: "10.30.14.245"

ipv4_address_list: "10.30.14.246"

ipv4_address_list: "10.30.14.247"

netmask_ipv4_address: "255.255.240.0"

gateway_ipv4_address: "10.30.0.1"

virtual_network_uuid: "243cb3ae-8ba5-4fb5-ba8e-5fed88ca108b"

2017-05-30 10:07:43 INFO validation.py:100 [u'10.30.14.244', u'10.30.14.245', u'10.30.14.246',


u'10.30.14.247', u'10.30.0.1']

2017-05-30 10:07:43 INFO uvm_network_utils.py:231 network_config_list {

uuid: "$<\263\256\213\245O\265\272\216_\355\210\312\020\213"

logical_timestamp: 1

type: kBridged

identifier: 0

name: "vlan0"

2017-05-30 10:07:43 ERROR uvm_network_utils.py:234 Network is acropolis managed.

2017-05-30 10:07:43 INFO validation.py:72 ipv4_address_list: "10.30.15.241"


ipv4_address_list: "10.30.15.242"

ipv4_address_list: "10.30.15.243"

netmask_ipv4_address: "255.255.240.0"

gateway_ipv4_address: "10.30.0.1"

virtual_network_uuid: "243cb3ae-8ba5-4fb5-ba8e-5fed88ca108b"

2017-05-30 10:07:43 INFO validation.py:100 [u'10.30.15.241', u'10.30.15.242', u'10.30.15.243',


u'10.30.0.1']

2017-05-30 10:07:43 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:07:43 INFO genesis_utils.py:2484 Node 10.30.15.47 is not light compute node

2017-05-30 10:07:43 INFO genesis_utils.py:2484 Node 10.30.15.48 is not light compute node

2017-05-30 10:07:43 INFO genesis_utils.py:2484 Node 10.30.15.49 is not light compute node

2017-05-30 10:07:43 INFO file_server_misc.py:704 CVM without light compute:


10.30.15.49,10.30.15.48,10.30.15.47

2017-05-30 10:07:43 INFO validation.py:187 Total number of usable cvms: 3, nvms requested: 3

2017-05-30 10:07:43 INFO uvm_utils.py:630 Checking if VM exists: ac3c7060-db9e-4297-b6f4-


564b1083ef92

2017-05-30 10:07:43 INFO uvm_utils.py:636 is vm exists 30

2017-05-30 10:07:43 INFO uvm_utils.py:638 VM ac3c7060-db9e-4297-b6f4-564b1083ef92 doesn't exists

2017-05-30 10:07:43 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:07:43 INFO uvm_utils.py:328 vm create arg uuid:


"\254<p`\333\236B\227\266\364VK\020\203\357\222"

parent_task_uuid: "\037\3734t\277\004B\252\235\r?\212\267\224\324s"

name: "NTNX-afs01-1"

num_vcpus: 4

memory_size_mb: 12288

nic_list {

network_uuid: "$<\263\256\213\245O\265\272\216_\355\210\312\020\213"

}
ha_priority: 100

hwclock_timezone: "UTC"

2017-05-30 10:07:43 INFO uvm_utils.py:330 vm create ret task_uuid:


"\274\254\212\204DIO\236\247\005ch\t\233\224r"

2017-05-30 10:07:43 INFO minerva_task_util.py:1825 polling the tasks bcac8a84-4449-4f9e-a705-


6368099b9472

2017-05-30 10:07:43 INFO minerva_task_util.py:1247 tasks: bcac8a84-4449-4f9e-a705-6368099b9472

2017-05-30 10:07:43 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:07:44 INFO minerva_task_util.py:1859 polling sub tasks bcac8a84-4449-4f9e-a705-


6368099b9472 completed

2017-05-30 10:07:44 INFO file_server.py:1109

▒<p`۞B▒▒▒VK▒▒

2017-05-30 10:07:44 INFO file_server.py:1118 New vm uuid: ac3c7060-db9e-4297-b6f4-564b1083ef92
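All three FSVM creations in this log follow the same pattern: minerva submits a VM-create request with a pre-chosen UUID, gets back a task UUID ("vm create ret task_uuid"), and polls that task until it completes before moving on. A generic sketch of that submit-and-poll flow, assuming hypothetical submit_vm_create() and get_task_status() helpers and task state names (the real calls go through the Acropolis/Ergon task framework):

import time

def wait_for_task(get_task_status, task_uuid, poll_secs=2, timeout_secs=600):
    """Poll a task UUID until it reaches a terminal state or times out."""
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        status = get_task_status(task_uuid)       # e.g. running / succeeded / failed
        if status in ("succeeded", "failed"):
            return status
        time.sleep(poll_secs)
    raise RuntimeError("task %s did not complete in %ss" % (task_uuid, timeout_secs))

def create_fsvm(submit_vm_create, get_task_status, spec):
    """Submit a VM-create request and block until its task completes."""
    task_uuid = submit_vm_create(spec)            # logged above as "vm create ret task_uuid"
    status = wait_for_task(get_task_status, task_uuid)
    if status != "succeeded":
        raise RuntimeError("FSVM create failed: task %s -> %s" % (task_uuid, status))
    return spec["uuid"]                           # the VM UUID chosen up front, as in the log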

2017-05-30 10:07:44 INFO uvm_utils.py:630 Checking if VM exists: ddbf4012-800d-48e1-a92c-


c6ca2c612563

2017-05-30 10:07:44 INFO uvm_utils.py:636 is vm exists 30

2017-05-30 10:07:44 INFO uvm_utils.py:638 VM ddbf4012-800d-48e1-a92c-c6ca2c612563 doesn't exists

2017-05-30 10:07:44 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:07:44 INFO uvm_utils.py:328 vm create arg uuid:


"\335\277@\022\200\rH\341\251,\306\312,a%c"

parent_task_uuid: "\037\3734t\277\004B\252\235\r?\212\267\224\324s"

name: "NTNX-afs01-2"

num_vcpus: 4

memory_size_mb: 12288

nic_list {

network_uuid: "$<\263\256\213\245O\265\272\216_\355\210\312\020\213"

ha_priority: 100
hwclock_timezone: "UTC"

2017-05-30 10:07:44 INFO uvm_utils.py:330 vm create ret task_uuid:


"\223\313\362\354\010]K\202\225\246\364\211\033\271\255\267"

2017-05-30 10:07:44 INFO minerva_task_util.py:1825 polling the tasks 93cbf2ec-085d-4b82-95a6-


f4891bb9adb7

2017-05-30 10:07:44 INFO minerva_task_util.py:1247 tasks: 93cbf2ec-085d-4b82-95a6-f4891bb9adb7

2017-05-30 10:07:44 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:07:44 INFO minerva_task_util.py:1859 polling sub tasks 93cbf2ec-085d-4b82-95a6-


f4891bb9adb7 completed

2017-05-30 10:07:44 INFO file_server.py:1109

H▒,▒▒,a%c

2017-05-30 10:07:44 INFO file_server.py:1118 New vm uuid: ddbf4012-800d-48e1-a92c-c6ca2c612563

2017-05-30 10:07:44 INFO uvm_utils.py:630 Checking if VM exists: 66386092-223b-48d5-9340-


9f25b6338229

2017-05-30 10:07:44 INFO uvm_utils.py:636 is vm exists 30

2017-05-30 10:07:44 INFO uvm_utils.py:638 VM 66386092-223b-48d5-9340-9f25b6338229 doesn't


exists

2017-05-30 10:07:44 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:07:44 INFO uvm_utils.py:328 vm create arg uuid:


"f8`\222\";H\325\223@\237%\2663\202)"

parent_task_uuid: "\037\3734t\277\004B\252\235\r?\212\267\224\324s"

name: "NTNX-afs01-3"

num_vcpus: 4

memory_size_mb: 12288

nic_list {

network_uuid: "$<\263\256\213\245O\265\272\216_\355\210\312\020\213"

ha_priority: 100
hwclock_timezone: "UTC"

2017-05-30 10:07:44 INFO uvm_utils.py:330 vm create ret task_uuid:


"\t\322\261\003\005\240MI\252;\252\265V\227\365t"

2017-05-30 10:07:44 INFO minerva_task_util.py:1825 polling the tasks 09d2b103-05a0-4d49-aa3b-


aab55697f574

2017-05-30 10:07:44 INFO minerva_task_util.py:1247 tasks: 09d2b103-05a0-4d49-aa3b-aab55697f574

2017-05-30 10:07:44 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:07:44 INFO minerva_task_util.py:1859 polling sub tasks 09d2b103-05a0-4d49-aa3b-


aab55697f574 completed

2017-05-30 10:07:44 INFO file_server.py:1109

f8`▒";HՓ@▒%▒3▒)

2017-05-30 10:07:44 INFO file_server.py:1118 New vm uuid: 66386092-223b-48d5-9340-9f25b6338229

2017-05-30 10:07:44 INFO uvm_network_utils.py:231 network_config_list {

uuid: "$<\263\256\213\245O\265\272\216_\355\210\312\020\213"

logical_timestamp: 4

type: kBridged

identifier: 0

name: "vlan0"

2017-05-30 10:07:44 ERROR uvm_network_utils.py:234 Network is acropolis managed.

2017-05-30 10:07:44 INFO uvm_network_utils.py:231 network_config_list {

uuid: "$<\263\256\213\245O\265\272\216_\355\210\312\020\213"

logical_timestamp: 4

type: kBridged

identifier: 0

name: "vlan0"

}
2017-05-30 10:07:44 ERROR uvm_network_utils.py:234 Network is acropolis managed.

2017-05-30 10:07:44 INFO uvm_network_utils.py:231 network_config_list {

uuid: "$<\263\256\213\245O\265\272\216_\355\210\312\020\213"

logical_timestamp: 4

type: kBridged

identifier: 0

name: "vlan0"

2017-05-30 10:07:44 ERROR uvm_network_utils.py:234 Network is acropolis managed.

2017-05-30 10:07:44 INFO cvm_insights_store.py:398 No file server deployed.

2017-05-30 10:07:47 INFO minerva_utils.py:482 {u'ping,-c,1,-w,15,10.30.15.243': (256, 'PING


10.30.15.243 (10.30.15.243) 56(84) bytes of data.\nFrom 10.30.15.47 icmp_seq=1 Destination Host
Unreachable\nFrom 10.30.15.47 icmp_seq=2 Destination Host Unreachable\nFrom 10.30.15.47
icmp_seq=3 Destination Host Unreachable\nFrom 10.30.15.47 icmp_seq=4 Destination Host
Unreachable\n\n--- 10.30.15.243 ping statistics ---\n4 packets transmitted, 0 received, +4 errors, 100%
packet loss, time 3006ms\npipe 4\n', ''), u'ping,-c,1,-w,15,10.30.15.242': (256, 'PING 10.30.15.242
(10.30.15.242) 56(84) bytes of data.\nFrom 10.30.15.47 icmp_seq=1 Destination Host
Unreachable\nFrom 10.30.15.47 icmp_seq=2 Destination Host Unreachable\nFrom 10.30.15.47
icmp_seq=3 Destination Host Unreachable\nFrom 10.30.15.47 icmp_seq=4 Destination Host
Unreachable\n\n--- 10.30.15.242 ping statistics ---\n4 packets transmitted, 0 received, +4 errors, 100%
packet loss, time 3005ms\npipe 4\n', ''), u'ping,-c,1,-w,15,10.30.15.241': (256, 'PING 10.30.15.241
(10.30.15.241) 56(84) bytes of data.\nFrom 10.30.15.47 icmp_seq=1 Destination Host
Unreachable\nFrom 10.30.15.47 icmp_seq=2 Destination Host Unreachable\nFrom 10.30.15.47
icmp_seq=3 Destination Host Unreachable\nFrom 10.30.15.47 icmp_seq=4 Destination Host
Unreachable\n\n--- 10.30.15.241 ping statistics ---\n4 packets transmitted, 0 received, +4 errors, 100%
packet loss, time 3006ms\npipe 4\n', ''), u'ping,-c,1,-w,15,10.30.15.240': (0, 'PING 10.30.15.240
(10.30.15.240) 56(84) bytes of data.\n64 bytes from 10.30.15.240: icmp_seq=1 ttl=64 time=0.739
ms\n\n--- 10.30.15.240 ping statistics ---\n1 packets transmitted, 1 received, 0% packet loss, time
0ms\nrtt min/avg/max/mdev = 0.739/0.739/0.739/0.000 ms\n', ''), u'ping,-c,1,-w,15,10.30.14.246': (256,
'PING 10.30.14.246 (10.30.14.246) 56(84) bytes of data.\nFrom 10.30.15.47 icmp_seq=1 Destination
Host Unreachable\nFrom 10.30.15.47 icmp_seq=2 Destination Host Unreachable\nFrom 10.30.15.47
icmp_seq=3 Destination Host Unreachable\nFrom 10.30.15.47 icmp_seq=4 Destination Host
Unreachable\n\n--- 10.30.14.246 ping statistics ---\n4 packets transmitted, 0 received, +4 errors, 100%
packet loss, time 3005ms\npipe 4\n', ''), u'ping,-c,1,-w,15,10.30.14.247': (256, 'PING 10.30.14.247
(10.30.14.247) 56(84) bytes of data.\nFrom 10.30.15.47 icmp_seq=1 Destination Host
Unreachable\nFrom 10.30.15.47 icmp_seq=2 Destination Host Unreachable\nFrom 10.30.15.47
icmp_seq=3 Destination Host Unreachable\nFrom 10.30.15.47 icmp_seq=4 Destination Host
Unreachable\n\n--- 10.30.14.247 ping statistics ---\n4 packets transmitted, 0 received, +4 errors, 100%
packet loss, time 3006ms\npipe 4\n', ''), u'ping,-c,1,-w,15,10.30.14.244': (256, 'PING 10.30.14.244
(10.30.14.244) 56(84) bytes of data.\nFrom 10.30.15.47 icmp_seq=1 Destination Host
Unreachable\nFrom 10.30.15.47 icmp_seq=2 Destination Host Unreachable\nFrom 10.30.15.47
icmp_seq=3 Destination Host Unreachable\nFrom 10.30.15.47 icmp_seq=4 Destination Host
Unreachable\n\n--- 10.30.14.244 ping statistics ---\n4 packets transmitted, 0 received, +4 errors, 100%
packet loss, time 3005ms\npipe 4\n', ''), u'ping,-c,1,-w,15,10.30.14.245': (256, 'PING 10.30.14.245
(10.30.14.245) 56(84) bytes of data.\nFrom 10.30.15.47 icmp_seq=1 Destination Host
Unreachable\nFrom 10.30.15.47 icmp_seq=2 Destination Host Unreachable\nFrom 10.30.15.47
icmp_seq=3 Destination Host Unreachable\nFrom 10.30.15.47 icmp_seq=4 Destination Host
Unreachable\n\n--- 10.30.14.245 ping statistics ---\n4 packets transmitted, 0 received, +4 errors, 100%
packet loss, time 3006ms\npipe 4\n', '')}

2017-05-30 10:07:47 INFO minerva_utils.py:492 Pinging 10.30.15.240 PING 10.30.15.240 (10.30.15.240)


56(84) bytes of data.

64 bytes from 10.30.15.240: icmp_seq=1 ttl=64 time=0.739 ms

--- 10.30.15.240 ping statistics ---

1 packets transmitted, 1 received, 0% packet loss, time 0ms

rtt min/avg/max/mdev = 0.739/0.739/0.739/0.000 ms

2017-05-30 10:07:47 INFO minerva_utils.py:492 Pinging 10.30.14.244 PING 10.30.14.244 (10.30.14.244)


56(84) bytes of data.

From 10.30.15.47 icmp_seq=1 Destination Host Unreachable

From 10.30.15.47 icmp_seq=2 Destination Host Unreachable

From 10.30.15.47 icmp_seq=3 Destination Host Unreachable

From 10.30.15.47 icmp_seq=4 Destination Host Unreachable

--- 10.30.14.244 ping statistics ---

4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3005ms

pipe 4

2017-05-30 10:07:47 INFO minerva_utils.py:492 Pinging 10.30.14.245 PING 10.30.14.245 (10.30.14.245)


56(84) bytes of data.
From 10.30.15.47 icmp_seq=1 Destination Host Unreachable

From 10.30.15.47 icmp_seq=2 Destination Host Unreachable

From 10.30.15.47 icmp_seq=3 Destination Host Unreachable

From 10.30.15.47 icmp_seq=4 Destination Host Unreachable

--- 10.30.14.245 ping statistics ---

4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3006ms

pipe 4

2017-05-30 10:07:47 INFO minerva_utils.py:492 Pinging 10.30.15.241 PING 10.30.15.241 (10.30.15.241)


56(84) bytes of data.

From 10.30.15.47 icmp_seq=1 Destination Host Unreachable

From 10.30.15.47 icmp_seq=2 Destination Host Unreachable

From 10.30.15.47 icmp_seq=3 Destination Host Unreachable

From 10.30.15.47 icmp_seq=4 Destination Host Unreachable

--- 10.30.15.241 ping statistics ---

4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3006ms

pipe 4

2017-05-30 10:07:47 INFO minerva_utils.py:492 Pinging 10.30.14.246 PING 10.30.14.246 (10.30.14.246)


56(84) bytes of data.

From 10.30.15.47 icmp_seq=1 Destination Host Unreachable

From 10.30.15.47 icmp_seq=2 Destination Host Unreachable

From 10.30.15.47 icmp_seq=3 Destination Host Unreachable

From 10.30.15.47 icmp_seq=4 Destination Host Unreachable

--- 10.30.14.246 ping statistics ---

4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3005ms


pipe 4

2017-05-30 10:07:47 INFO minerva_utils.py:492 Pinging 10.30.15.242 PING 10.30.15.242 (10.30.15.242)


56(84) bytes of data.

From 10.30.15.47 icmp_seq=1 Destination Host Unreachable

From 10.30.15.47 icmp_seq=2 Destination Host Unreachable

From 10.30.15.47 icmp_seq=3 Destination Host Unreachable

From 10.30.15.47 icmp_seq=4 Destination Host Unreachable

--- 10.30.15.242 ping statistics ---

4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3005ms

pipe 4

2017-05-30 10:07:47 INFO minerva_utils.py:492 Pinging 10.30.14.247 PING 10.30.14.247 (10.30.14.247)


56(84) bytes of data.

From 10.30.15.47 icmp_seq=1 Destination Host Unreachable

From 10.30.15.47 icmp_seq=2 Destination Host Unreachable

From 10.30.15.47 icmp_seq=3 Destination Host Unreachable

From 10.30.15.47 icmp_seq=4 Destination Host Unreachable

--- 10.30.14.247 ping statistics ---

4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3006ms

pipe 4

2017-05-30 10:07:47 INFO minerva_utils.py:492 Pinging 10.30.15.243 PING 10.30.15.243 (10.30.15.243)


56(84) bytes of data.

From 10.30.15.47 icmp_seq=1 Destination Host Unreachable

From 10.30.15.47 icmp_seq=2 Destination Host Unreachable

From 10.30.15.47 icmp_seq=3 Destination Host Unreachable

From 10.30.15.47 icmp_seq=4 Destination Host Unreachable


--- 10.30.15.243 ping statistics ---

4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3006ms

pipe 4

2017-05-30 10:07:47 INFO minerva_utils.py:500 Pinged ip list [u'10.30.15.240'], Non pinged ip list
[u'10.30.14.244', u'10.30.14.245', u'10.30.15.241', u'10.30.14.246', u'10.30.15.242', u'10.30.14.247',
u'10.30.15.243']
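Before deploying the FSVMs, minerva pings each address involved in the deployment to see which already respond on the network; an address from the pool you assigned to the file server that unexpectedly answers usually indicates an IP conflict. In this log the requested internal and external IPs are all unreachable (expected, since nothing should own them yet) and only 10.30.15.240 responds. When troubleshooting an AFS deployment you can repeat the same check from any CVM; a small sketch using the same ping arguments seen in the log (-c 1 -w 15), with the IP list copied from this deployment spec:

import subprocess

# IP addresses taken from the file server spec logged above (internal + external pools).
requested_ips = [
    "10.30.14.244", "10.30.14.245", "10.30.14.246", "10.30.14.247",
    "10.30.15.241", "10.30.15.242", "10.30.15.243",
]

def ping(ip):
    """Return True if the address answers a single ping."""
    return subprocess.call(["ping", "-c", "1", "-w", "15", ip]) == 0

in_use = [ip for ip in requested_ips if ping(ip)]
free = [ip for ip in requested_ips if ip not in in_use]
print("Responding (possible conflicts): %s" % in_use)
print("Not responding (expected before deployment): %s" % free)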

2017-05-30 10:07:47 INFO uvm_utils.py:1395 Creating VM Anti-affinity group, name : NTNX-afs01

2017-05-30 10:07:47 INFO uvm_utils.py:1419 Creating VM anti-affinity group with vm group name
NTNX-afs01.

2017-05-30 10:07:47 INFO uvm_utils.py:1433 VM Group Create Task id: 7307d02f-d116-43be-bcb4-


d03728a40653

2017-05-30 10:07:47 INFO minerva_task_util.py:1825 polling the tasks 7307d02f-d116-43be-bcb4-


d03728a40653

2017-05-30 10:07:47 INFO minerva_task_util.py:1247 tasks: 7307d02f-d116-43be-bcb4-d03728a40653

2017-05-30 10:07:47 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:07:47 INFO minerva_task_util.py:1859 polling sub tasks 7307d02f-d116-43be-bcb4-


d03728a40653 completed

2017-05-30 10:07:47 INFO file_server.py:3068 Anti affinity rule of vm group name: NTNX-afs01

2017-05-30 10:07:47 INFO file_server.py:3076 Updating Vm group id: f03203e8-b67e-4314-9977-


a2d195ce8144

2017-05-30 10:07:47 INFO uvm_utils.py:1466 Number of VMs to add to anti-affinity-group : 3

2017-05-30 10:07:47 INFO uvm_utils.py:1468 Adding VM(s) to Anti-affinity group vm_group_uuid :


f03203e8-b67e-4314-9977-a2d195ce8144

2017-05-30 10:07:47 INFO uvm_utils.py:1490 Updating VMs with uuids ac3c7060-db9e-4297-b6f4-


564b1083ef92,ddbf4012-800d-48e1-a92c-c6ca2c612563,66386092-223b-48d5-9340-9f25b6338229 to
anti-affinity group

2017-05-30 10:07:47 INFO uvm_utils.py:1507 Adding VM with uuid ddbf4012-800d-48e1-a92c-


c6ca2c612563 to anti-affinity group

2017-05-30 10:07:47 INFO uvm_utils.py:1507 Adding VM with uuid 66386092-223b-48d5-9340-


9f25b6338229 to anti-affinity group
2017-05-30 10:07:47 INFO uvm_utils.py:1507 Adding VM with uuid ac3c7060-db9e-4297-b6f4-
564b1083ef92 to anti-affinity group

2017-05-30 10:07:47 INFO minerva_task_util.py:1825 polling the tasks fc2e9d43-f581-4cb7-baa9-


c7bc1d8f3d12

2017-05-30 10:07:47 INFO minerva_task_util.py:1825 polling the tasks beedf714-b9b4-463a-828d-


b69cedaedf6c

2017-05-30 10:07:47 INFO minerva_task_util.py:1825 polling the tasks bab7b4e2-022e-465f-a9a8-


bf57f5a6d619

2017-05-30 10:07:47 INFO minerva_task_util.py:1247 tasks: fc2e9d43-f581-4cb7-baa9-c7bc1d8f3d12

2017-05-30 10:07:47 INFO minerva_task_util.py:1247 tasks: beedf714-b9b4-463a-828d-b69cedaedf6c

2017-05-30 10:07:47 INFO minerva_task_util.py:1247 tasks: bab7b4e2-022e-465f-a9a8-bf57f5a6d619

2017-05-30 10:07:47 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:07:47 INFO minerva_task_util.py:1859 polling sub tasks fc2e9d43-f581-4cb7-baa9-


c7bc1d8f3d12 completed

2017-05-30 10:07:47 INFO minerva_task_util.py:1859 polling sub tasks beedf714-b9b4-463a-828d-


b69cedaedf6c completed

2017-05-30 10:07:47 INFO minerva_task_util.py:1859 polling sub tasks bab7b4e2-022e-465f-a9a8-


bf57f5a6d619 completed

2017-05-30 10:07:48 INFO uvm_utils.py:938 Checking if VM is on: ac3c7060-db9e-4297-b6f4-


564b1083ef92

2017-05-30 10:07:48 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 2

config {

name: "NTNX-afs01-1"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

}
hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true

2017-05-30 10:07:48 INFO file_server.py:5046 Powering on fsvm : ac3c7060-db9e-4297-b6f4-


564b1083ef92

2017-05-30 10:07:48 INFO uvm_utils.py:938 Checking if VM is on: ddbf4012-800d-48e1-a92c-


c6ca2c612563

2017-05-30 10:07:48 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 2

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"
allow_live_migrate: true

2017-05-30 10:07:48 INFO file_server.py:5046 Powering on fsvm : ddbf4012-800d-48e1-a92c-


c6ca2c612563

2017-05-30 10:07:48 INFO uvm_utils.py:938 Checking if VM is on: 66386092-223b-48d5-9340-


9f25b6338229

2017-05-30 10:07:48 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"

logical_timestamp: 2

config {

name: "NTNX-afs01-3"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

2017-05-30 10:07:48 INFO file_server.py:5046 Powering on fsvm : 66386092-223b-48d5-9340-9f25b6338229

2017-05-30 10:07:48 INFO minerva_task_util.py:1825 polling the tasks 7896d589-80f6-433b-86c8-54e4a0ac5ec9

2017-05-30 10:07:48 INFO minerva_task_util.py:1825 polling the tasks 999580e9-a57a-43f5-8e89-1c1fc6a86bb6

2017-05-30 10:07:48 INFO minerva_task_util.py:1825 polling the tasks 9147a2a5-ebfa-4f77-938c-e2a8ebe4d0d0

2017-05-30 10:07:48 INFO minerva_task_util.py:1247 tasks: 7896d589-80f6-433b-86c8-54e4a0ac5ec9

2017-05-30 10:07:48 INFO minerva_task_util.py:1247 tasks: 999580e9-a57a-43f5-8e89-1c1fc6a86bb6

2017-05-30 10:07:48 INFO minerva_task_util.py:1247 tasks: 9147a2a5-ebfa-4f77-938c-e2a8ebe4d0d0

2017-05-30 10:07:48 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:07:56 INFO minerva_task_util.py:1859 polling sub tasks 7896d589-80f6-433b-86c8-54e4a0ac5ec9 completed

2017-05-30 10:07:56 INFO minerva_task_util.py:1859 polling sub tasks 999580e9-a57a-43f5-8e89-1c1fc6a86bb6 completed

2017-05-30 10:07:56 INFO minerva_task_util.py:1859 polling sub tasks 9147a2a5-ebfa-4f77-938c-e2a8ebe4d0d0 completed

2017-05-30 10:07:56 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to Zookeeper

2017-05-30 10:07:56 INFO uvm_utils.py:1243 Using svmip : 10.30.15.47

2017-05-30 10:07:56 INFO uvm_utils.py:1248 Transferring file /home/nutanix/tmp/afs-software.info -> /Nutanix_afs01_ctr/afs-software.info

2017-05-30 10:07:56 INFO uvm_utils.py:1274 Transferred 1018 bytes

2017-05-30 10:07:56 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to Zookeeper

2017-05-30 10:07:56 INFO file_server.py:2174 installer_image_path: /NutanixManagementShare/afs/2.1.0.1 install_disk_nfs_path: /Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-419aa3a83df5548924198f85398deb20e8b615fe

2017-05-30 10:07:56 INFO uvm_utils.py:1367 Running cmd ['/usr/local/nutanix/bin/qemu-img', 'convert', '-p', '-f', 'qcow2', '-O', 'raw', u'nfs://127.0.0.1/NutanixManagementShare/afs/2.1.0.1', u'nfs://127.0.0.1/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-419aa3a83df5548924198f85398deb20e8b615fe']

2017-05-30 10:07:56 INFO file_server.py:2235 Inital current pct 25

2017-05-30 10:07:56 INFO file_server.py:2238 Copying progress : (0.00/100%)

2017-05-30 10:08:01 INFO file_server.py:2238 Copying progress : (1.00/100%)

2017-05-30 10:08:06 INFO file_server.py:2238 Copying progress : (6.01/100%)

2017-05-30 10:08:11 INFO file_server.py:2238 Copying progress : (12.02/100%)

2017-05-30 10:08:16 INFO file_server.py:2238 Copying progress : (18.03/100%)

2017-05-30 10:08:21 INFO file_server.py:2238 Copying progress : (24.04/100%)

2017-05-30 10:08:26 INFO file_server.py:2238 Copying progress : (30.05/100%)

2017-05-30 10:08:31 INFO file_server.py:2238 Copying progress : (36.06/100%)

2017-05-30 10:08:36 INFO file_server.py:2238 Copying progress : (42.07/100%)

2017-05-30 10:08:41 INFO file_server.py:2238 Copying progress : (48.08/100%)

2017-05-30 10:08:47 INFO file_server.py:2238 Copying progress : (54.09/100%)

2017-05-30 10:08:52 INFO file_server.py:2238 Copying progress : (60.10/100%)

2017-05-30 10:08:57 INFO file_server.py:2238 Copying progress : (66.11/100%)

2017-05-30 10:09:02 INFO file_server.py:2238 Copying progress : (73.29/100%)

2017-05-30 10:09:07 INFO file_server.py:2238 Copying progress : (82.50/100%)

2017-05-30 10:09:12 INFO file_server.py:2238 Copying progress : (89.51/100%)
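The qemu-img convert of the 12 GiB AFS installer image is often the longest single step of a deployment, and because qemu-img writes its own carriage-return progress output the raw log file interleaves those fragments with the timestamped "Copying progress" entries above. A quick throughput estimate from two timestamped samples tells you whether the copy is healthy or crawling; the values below are read from this log, the arithmetic is only a rough sketch, and the 12 GiB size is taken from the size_bytes reported later for the installer disk.

    # Rough throughput estimate from two "Copying progress" samples in this log.
    from datetime import datetime

    t0, p0 = datetime(2017, 5, 30, 10, 8, 1), 1.00    # first sample, percent done
    t1, p1 = datetime(2017, 5, 30, 10, 9, 12), 89.51  # later sample, percent done
    image_bytes = 12884901888                         # installer disk size from the log

    elapsed = (t1 - t0).total_seconds()               # 71 seconds
    copied = image_bytes * (p1 - p0) / 100.0          # ~10.6 GiB copied in that window
    rate_mib_s = copied / elapsed / (1024 * 1024)     # roughly 150 MiB/s

    print("elapsed %.0fs, ~%.0f MiB/s" % (elapsed, rate_mib_s))

If this number is far below what the cluster normally sustains, it is usually worth looking at the image copy path before suspecting the FSVM configuration itself.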

2017-05-30 10:09:16 INFO file_server.py:1080 Install Disk Nfs path /Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-419aa3a83df5548924198f85398deb20e8b615fe

2017-05-30 10:09:16 INFO file_server.py:505 subtasks found :None

2017-05-30 10:09:16 INFO file_server.py:513 Creating FileServerVmDeploy Task

2017-05-30 10:09:16 INFO minerva_task_util.py:885 Created task with task_uuid: de734e01-a8d7-4fdc-9962-04e136082633

2017-05-30 10:09:16 INFO minerva_task_util.py:885 Created task with task_uuid: 4c3ee85b-1da9-4834-a3ed-fbb3af1ad1a2

2017-05-30 10:09:16 INFO minerva_task_util.py:145 FileServerVmDeploy [Seq Id = 2] : Fetching the current state.

2017-05-30 10:09:16 INFO minerva_task_util.py:1579 Task current state: 200

2017-05-30 10:09:16 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to Zookeeper

2017-05-30 10:09:16 INFO minerva_utils.py:3347 SVM subnet = 10.30.0.0/20

2017-05-30 10:09:16 INFO file_server.py:264 Deploying FileServer Vm NTNX-afs01-1

2017-05-30 10:09:16 INFO file_server.py:270 Preparing ISO for FileServerVm NTNX-afs01-1

2017-05-30 10:09:16 INFO minerva_task_util.py:885 Created task with task_uuid: 9236c6c9-5080-456d-b88f-37b49e28d6c7

2017-05-30 10:09:16 INFO file_server.py:543 Polling all file server Deploy Vm Task

2017-05-30 10:09:16 INFO minerva_task_util.py:1825 polling the tasks de734e01-a8d7-4fdc-9962-04e136082633

2017-05-30 10:09:16 INFO minerva_task_util.py:1825 polling the tasks 4c3ee85b-1da9-4834-a3ed-fbb3af1ad1a2

2017-05-30 10:09:16 INFO minerva_task_util.py:1825 polling the tasks 9236c6c9-5080-456d-b88f-37b49e28d6c7

2017-05-30 10:09:16 INFO minerva_task_util.py:1247 tasks: de734e01-a8d7-4fdc-9962-04e136082633

2017-05-30 10:09:16 INFO minerva_task_util.py:1247 tasks: 4c3ee85b-1da9-4834-a3ed-fbb3af1ad1a2

2017-05-30 10:09:16 INFO minerva_task_util.py:1247 tasks: 9236c6c9-5080-456d-b88f-37b49e28d6c7

2017-05-30 10:09:16 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:16 INFO minerva_utils.py:1494 Obtained public key for 7624b228-de3b-4357-aad2-11b13c256c1f

2017-05-30 10:09:16 INFO minerva_utils.py:1494 Obtained public key for b6abd8d8-d6c0-4a6f-8c5e-2ceada4841c7

2017-05-30 10:09:16 INFO minerva_utils.py:1494 Obtained public key for 6f95ea94-5636-4ad7-83a8-cd0c5415d5f3

2017-05-30 10:09:16 INFO file_server.py:277 Creating iso on fsvm : 10.30.14.245

2017-05-30 10:09:16 INFO file_server_misc.py:358 #cloud-config

resolv_conf:

nameservers: ['10.30.15.91']

runcmd:

- sudo /usr/local/nutanix/cluster/bin/create_home_disk --home_disk=/dev/sdb --data_disk=/dev/sdc --node_name="10.30.14.245-A" --ipconfig="10.30.14.245/255.255.240.0/10.30.0.1" /saved.tar.gz /

- sudo /usr/local/nutanix/cluster/bin/configure_network_configs --internal_ipconfig="10.30.14.245/255.255.240.0/10.30.0.1" --external_ipconfig="10.30.15.241/255.255.240.0/10.30.0.1" --external_interface_name="eth0:2"

- sudo /usr/local/nutanix/cluster/bin/configure_network_routes --internal_ipconfig="10.30.14.245/255.255.240.0/10.30.0.1" --internal_config_iproute_list="10.30.0.0/20"

- sudo /usr/local/nutanix/cluster/bin/configure_services --enable_services="ntpdate" --disable_services="" --dns_servers="10.30.15.91" --ntp_servers="10.30.15.91"

- sudo mount /home

- sudo /etc/rc.d/rc.local

users:

- name: nutanix

lock_passwd: false

ssh-authorized-keys:

- ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCgvTZfgYRWtpb3+cE/qJwW1K7oGVbQKEgaQrdujnE07bKHecslQ
gCD2VnFJaEzeRZHsX5GC9LDOVrvDKDV6DltgeKMrv1k4yO4xH9nttSZMMPfjgGddHy6pW7Dc/ibU6wl4G/9
VHtjm8+vVbBo3wAEguU/lAR5lrbVkyZ0OT+HxYiVAagCPljWGYFrO7U7/AMjSWC1zqKFgC1q2ye7wFejawiB
86nxuHT6uMbiTxrbzMFL8X3VBZKe5PRrBMiDAjvRmm69ZD2vEUnl2B+YyGDOyNOwvdDdzfjsFCXn5oRRU6
GNybmDXeu9XCy7zna2GwcQwMcn2HHhS71paxPuaY6N nutanix@NTNX-16SM13150152-C-CVM

- ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQDeSSYYtSwStBLEo6MYx0HLn6eatXKnJNdkBQJoOOiKD3b4dzsLLy
T0jaSjLcPHhE1m1KBoWtGyhRT8xzm76YRksU+6fvB3h/mFHnAj0hme7n1TYPr224z34yUZLuwYlp/wQhArxZ
YODo/1wZrGA1crfIvymYMEw52JBlFiJu6QMC6MfF9RHfxFeu1b9vj8aqrZlfhWqPkkAIGErAEpYsbPlH0t7PM
oBTRSkYmM73UCs0xIAGzIn+MK0hCcYYK6oGRLPtJGe7S6beyZtxp/xTHoHVJR6SC1ub5nGnR723O7/AwbC
qf5dWqjoWoXCxah7Jc3FOPyvk5ROLfRrOfD14at nutanix@NTNX-16SM13150152-B-CVM

- ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCiIkYVSgNUKAGslHBDu1QswH66JA/X5tPThr4k96SQy9LrKnNUSc
jaUOIg5H0iglpEqEnOxC8CtEX5XUybT82GU2l9PUxa3uwd2cMWKqcYeg1w3WIs+x9vO1G6QOTPVGLVM07
7uyVlVs2CVO7uuDC3KTEQaszfi7NHMIQKig/9w3KCMLD44c/zj4FqIOuKezyhMjCIjITOASj6/yRLlF8wS1EY08
OmrXUyFugv1ORVndOmTaQOMsYPd2XGm/jLIPCfqVtIGt7Trs8XT+kh2d8uSvtOopKnJ4Ej+s2SL0c1xN6Wo6
A/LigTYCEaHVpeFyMZddmrdCojimTArWO/f3VD nutanix@NTNX-16SM13150152-A-CVM
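Each FSVM receives its configuration through a cloud-init seed ISO built from the user-data shown above (file_server_misc.py logs the full document). If an FSVM comes up with wrong IPs or never runs its runcmd steps, a simple sanity check is to confirm that the rendered user-data is still valid YAML and that none of its commands were truncated. The sketch below is an illustration only, not part of the Nutanix tooling; it assumes the user-data has been saved to a local file named user-data and that PyYAML is available.

    # Sanity-check a saved cloud-init user-data document (hypothetical file name).
    # Requires PyYAML; prints the runcmd entries so truncated commands stand out.
    import yaml

    with open("user-data") as fh:
        doc = yaml.safe_load(fh)

    print("nameservers:", doc.get("resolv_conf", {}).get("nameservers"))
    for cmd in doc.get("runcmd", []):
        print("runcmd:", cmd)
    for user in doc.get("users", []):
        keys = user.get("ssh-authorized-keys", [])
        print("user %s has %d ssh key(s)" % (user.get("name"), len(keys)))

The same check applies to the user-data generated for 10.30.14.246 and 10.30.14.247 further down; only the node_name and the internal and external IPs differ between the three documents.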

2017-05-30 10:09:16 INFO file_server_misc.py:488 Starting iso creation for fsvm 10.30.14.245.

2017-05-30 10:09:16 INFO file_server_misc.py:510 instance-id: b663904a-455a-11e7-952c-525400dcb488

network-interfaces: |

auto eth0

iface eth0 inet static

address 10.30.14.245

network 10.30.0.0
netmask 255.255.240.0

broadcast 10.30.15.255

gateway 10.30.0.1

2017-05-30 10:09:16 INFO minerva_task_util.py:145 FileServerVmDeploy [Seq Id = 3] : Fetching the


current state.

2017-05-30 10:09:16 INFO minerva_task_util.py:1579 Task current state: 200

2017-05-30 10:09:16 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:09:16 INFO minerva_task_util.py:145 FileServerVmDeploy [Seq Id = 4] : Fetching the


current state.

2017-05-30 10:09:16 INFO file_server_misc.py:521 Iso made:


/home/nutanix/tmp/10.30.14.245/seednvm10.30.14.245.iso

2017-05-30 10:09:16 INFO minerva_utils.py:3347 SVM subnet = 10.30.0.0/20

2017-05-30 10:09:16 INFO file_server_misc.py:524 Deleting file:


/home/nutanix/tmp/10.30.14.245/meta-data

2017-05-30 10:09:16 INFO file_server.py:264 Deploying FileServer Vm NTNX-afs01-2

2017-05-30 10:09:16 INFO minerva_task_util.py:1579 Task current state: 200

2017-05-30 10:09:16 INFO file_server_misc.py:527 Deleting file:


/home/nutanix/tmp/10.30.14.245/user-data

2017-05-30 10:09:16 INFO file_server.py:270 Preparing ISO for FileServerVm NTNX-afs01-2

2017-05-30 10:09:16 INFO uvm_utils.py:1243 Using svmip : 10.30.15.47

2017-05-30 10:09:16 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:09:16 INFO uvm_utils.py:1248 Transferring file


/home/nutanix/tmp/10.30.14.245/seednvm10.30.14.245.iso -> /Nutanix_afs01_ctr/NTNX-afs01-1.iso

2017-05-30 10:09:16 INFO minerva_utils.py:3347 SVM subnet = 10.30.0.0/20

2017-05-30 10:09:16 INFO file_server.py:264 Deploying FileServer Vm NTNX-afs01-3

2017-05-30 10:09:16 INFO file_server.py:270 Preparing ISO for FileServerVm NTNX-afs01-3

2017-05-30 10:09:16 INFO minerva_utils.py:1494 Obtained public key for 7624b228-de3b-4357-aad2-


11b13c256c1f
2017-05-30 10:09:16 INFO minerva_utils.py:1494 Obtained public key for b6abd8d8-d6c0-4a6f-8c5e-
2ceada4841c7

2017-05-30 10:09:16 INFO minerva_utils.py:1494 Obtained public key for 6f95ea94-5636-4ad7-83a8-


cd0c5415d5f3

2017-05-30 10:09:16 INFO file_server.py:277 Creating iso on fsvm : 10.30.14.246

2017-05-30 10:09:16 INFO file_server_misc.py:358 #cloud-config

resolv_conf:

nameservers: ['10.30.15.91']

runcmd:

- sudo /usr/local/nutanix/cluster/bin/create_home_disk --home_disk=/dev/sdb --data_disk=/dev/sdc --node_name="10.30.14.246-A" --ipconfig="10.30.14.246/255.255.240.0/10.30.0.1" /saved.tar.gz /

- sudo /usr/local/nutanix/cluster/bin/configure_network_configs --internal_ipconfig="10.30.14.246/255.255.240.0/10.30.0.1" --external_ipconfig="10.30.15.242/255.255.240.0/10.30.0.1" --external_interface_name="eth0:2"

- sudo /usr/local/nutanix/cluster/bin/configure_network_routes --internal_ipconfig="10.30.14.246/255.255.240.0/10.30.0.1" --internal_config_iproute_list="10.30.0.0/20"

- sudo /usr/local/nutanix/cluster/bin/configure_services --enable_services="ntpdate" --disable_services="" --dns_servers="10.30.15.91" --ntp_servers="10.30.15.91"

- sudo mount /home

- sudo /etc/rc.d/rc.local

users:

- name: nutanix

lock_passwd: false

ssh-authorized-keys:

- ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCgvTZfgYRWtpb3+cE/qJwW1K7oGVbQKEgaQrdujnE07bKHecslQ
gCD2VnFJaEzeRZHsX5GC9LDOVrvDKDV6DltgeKMrv1k4yO4xH9nttSZMMPfjgGddHy6pW7Dc/ibU6wl4G/9
VHtjm8+vVbBo3wAEguU/lAR5lrbVkyZ0OT+HxYiVAagCPljWGYFrO7U7/AMjSWC1zqKFgC1q2ye7wFejawiB
86nxuHT6uMbiTxrbzMFL8X3VBZKe5PRrBMiDAjvRmm69ZD2vEUnl2B+YyGDOyNOwvdDdzfjsFCXn5oRRU6
GNybmDXeu9XCy7zna2GwcQwMcn2HHhS71paxPuaY6N nutanix@NTNX-16SM13150152-C-CVM

- ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQDeSSYYtSwStBLEo6MYx0HLn6eatXKnJNdkBQJoOOiKD3b4dzsLLy
T0jaSjLcPHhE1m1KBoWtGyhRT8xzm76YRksU+6fvB3h/mFHnAj0hme7n1TYPr224z34yUZLuwYlp/wQhArxZ
YODo/1wZrGA1crfIvymYMEw52JBlFiJu6QMC6MfF9RHfxFeu1b9vj8aqrZlfhWqPkkAIGErAEpYsbPlH0t7PM
oBTRSkYmM73UCs0xIAGzIn+MK0hCcYYK6oGRLPtJGe7S6beyZtxp/xTHoHVJR6SC1ub5nGnR723O7/AwbC
qf5dWqjoWoXCxah7Jc3FOPyvk5ROLfRrOfD14at nutanix@NTNX-16SM13150152-B-CVM

- ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCiIkYVSgNUKAGslHBDu1QswH66JA/X5tPThr4k96SQy9LrKnNUSc
jaUOIg5H0iglpEqEnOxC8CtEX5XUybT82GU2l9PUxa3uwd2cMWKqcYeg1w3WIs+x9vO1G6QOTPVGLVM07
7uyVlVs2CVO7uuDC3KTEQaszfi7NHMIQKig/9w3KCMLD44c/zj4FqIOuKezyhMjCIjITOASj6/yRLlF8wS1EY08
OmrXUyFugv1ORVndOmTaQOMsYPd2XGm/jLIPCfqVtIGt7Trs8XT+kh2d8uSvtOopKnJ4Ej+s2SL0c1xN6Wo6
A/LigTYCEaHVpeFyMZddmrdCojimTArWO/f3VD nutanix@NTNX-16SM13150152-A-CVM

2017-05-30 10:09:16 INFO file_server_misc.py:488 Starting iso creation for fsvm 10.30.14.246.

2017-05-30 10:09:16 INFO file_server_misc.py:510 instance-id: b670c990-455a-11e7-b9df-


525400dcb488

network-interfaces: |

auto eth0

iface eth0 inet static

address 10.30.14.246

network 10.30.0.0

netmask 255.255.240.0

broadcast 10.30.15.255

gateway 10.30.0.1

2017-05-30 10:09:16 INFO file_server_misc.py:521 Iso made:


/home/nutanix/tmp/10.30.14.246/seednvm10.30.14.246.iso

2017-05-30 10:09:16 INFO minerva_utils.py:1494 Obtained public key for 7624b228-de3b-4357-aad2-


11b13c256c1f

2017-05-30 10:09:16 INFO file_server_misc.py:524 Deleting file:


/home/nutanix/tmp/10.30.14.246/meta-data

2017-05-30 10:09:16 INFO minerva_utils.py:1494 Obtained public key for b6abd8d8-d6c0-4a6f-8c5e-


2ceada4841c7

2017-05-30 10:09:16 INFO file_server_misc.py:527 Deleting file:


/home/nutanix/tmp/10.30.14.246/user-data

2017-05-30 10:09:16 INFO uvm_utils.py:1243 Using svmip : 10.30.15.47


2017-05-30 10:09:16 INFO minerva_utils.py:1494 Obtained public key for 6f95ea94-5636-4ad7-83a8-
cd0c5415d5f3

2017-05-30 10:09:16 INFO uvm_utils.py:1248 Transferring file


/home/nutanix/tmp/10.30.14.246/seednvm10.30.14.246.iso -> /Nutanix_afs01_ctr/NTNX-afs01-2.iso

2017-05-30 10:09:16 INFO file_server.py:277 Creating iso on fsvm : 10.30.14.247

2017-05-30 10:09:16 INFO file_server_misc.py:358 #cloud-config

resolv_conf:

nameservers: ['10.30.15.91']

runcmd:

- sudo /usr/local/nutanix/cluster/bin/create_home_disk --home_disk=/dev/sdb --data_disk=/dev/sdc --node_name="10.30.14.247-A" --ipconfig="10.30.14.247/255.255.240.0/10.30.0.1" /saved.tar.gz /

- sudo /usr/local/nutanix/cluster/bin/configure_network_configs --internal_ipconfig="10.30.14.247/255.255.240.0/10.30.0.1" --external_ipconfig="10.30.15.243/255.255.240.0/10.30.0.1" --external_interface_name="eth0:2"

- sudo /usr/local/nutanix/cluster/bin/configure_network_routes --internal_ipconfig="10.30.14.247/255.255.240.0/10.30.0.1" --internal_config_iproute_list="10.30.0.0/20"

- sudo /usr/local/nutanix/cluster/bin/configure_services --enable_services="ntpdate" --disable_services="" --dns_servers="10.30.15.91" --ntp_servers="10.30.15.91"

- sudo mount /home

- sudo /etc/rc.d/rc.local

users:

- name: nutanix

lock_passwd: false

ssh-authorized-keys:

- ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCgvTZfgYRWtpb3+cE/qJwW1K7oGVbQKEgaQrdujnE07bKHecslQ
gCD2VnFJaEzeRZHsX5GC9LDOVrvDKDV6DltgeKMrv1k4yO4xH9nttSZMMPfjgGddHy6pW7Dc/ibU6wl4G/9
VHtjm8+vVbBo3wAEguU/lAR5lrbVkyZ0OT+HxYiVAagCPljWGYFrO7U7/AMjSWC1zqKFgC1q2ye7wFejawiB
86nxuHT6uMbiTxrbzMFL8X3VBZKe5PRrBMiDAjvRmm69ZD2vEUnl2B+YyGDOyNOwvdDdzfjsFCXn5oRRU6
GNybmDXeu9XCy7zna2GwcQwMcn2HHhS71paxPuaY6N nutanix@NTNX-16SM13150152-C-CVM

- ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQDeSSYYtSwStBLEo6MYx0HLn6eatXKnJNdkBQJoOOiKD3b4dzsLLy
T0jaSjLcPHhE1m1KBoWtGyhRT8xzm76YRksU+6fvB3h/mFHnAj0hme7n1TYPr224z34yUZLuwYlp/wQhArxZ
YODo/1wZrGA1crfIvymYMEw52JBlFiJu6QMC6MfF9RHfxFeu1b9vj8aqrZlfhWqPkkAIGErAEpYsbPlH0t7PM
oBTRSkYmM73UCs0xIAGzIn+MK0hCcYYK6oGRLPtJGe7S6beyZtxp/xTHoHVJR6SC1ub5nGnR723O7/AwbC
qf5dWqjoWoXCxah7Jc3FOPyvk5ROLfRrOfD14at nutanix@NTNX-16SM13150152-B-CVM

- ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCiIkYVSgNUKAGslHBDu1QswH66JA/X5tPThr4k96SQy9LrKnNUSc
jaUOIg5H0iglpEqEnOxC8CtEX5XUybT82GU2l9PUxa3uwd2cMWKqcYeg1w3WIs+x9vO1G6QOTPVGLVM07
7uyVlVs2CVO7uuDC3KTEQaszfi7NHMIQKig/9w3KCMLD44c/zj4FqIOuKezyhMjCIjITOASj6/yRLlF8wS1EY08
OmrXUyFugv1ORVndOmTaQOMsYPd2XGm/jLIPCfqVtIGt7Trs8XT+kh2d8uSvtOopKnJ4Ej+s2SL0c1xN6Wo6
A/LigTYCEaHVpeFyMZddmrdCojimTArWO/f3VD nutanix@NTNX-16SM13150152-A-CVM

2017-05-30 10:09:16 INFO uvm_utils.py:1274 Transferred 376832 bytes

2017-05-30 10:09:16 INFO file_server_misc.py:488 Starting iso creation for fsvm 10.30.14.247.

2017-05-30 10:09:16 INFO file_server.py:300 Cloud init dir /home/nutanix/tmp/10.30.14.245/

2017-05-30 10:09:16 INFO file_server_misc.py:510 instance-id: b676a270-455a-11e7-82ed-


525400dcb488

network-interfaces: |

auto eth0

iface eth0 inet static

address 10.30.14.247

network 10.30.0.0

netmask 255.255.240.0

broadcast 10.30.15.255

gateway 10.30.0.1

2017-05-30 10:09:16 INFO file_server.py:305 Powering off FileServer Vm's NTNX-afs01-1

2017-05-30 10:09:16 INFO uvm_utils.py:938 Checking if VM is on: ac3c7060-db9e-4297-b6f4-


564b1083ef92

2017-05-30 10:09:16 INFO file_server_misc.py:521 Iso made:


/home/nutanix/tmp/10.30.14.247/seednvm10.30.14.247.iso

2017-05-30 10:09:16 INFO file_server_misc.py:524 Deleting file:


/home/nutanix/tmp/10.30.14.247/meta-data

2017-05-30 10:09:16 INFO file_server_misc.py:527 Deleting file:


/home/nutanix/tmp/10.30.14.247/user-data
2017-05-30 10:09:16 INFO uvm_utils.py:1243 Using svmip : 10.30.15.47

2017-05-30 10:09:16 INFO uvm_utils.py:1248 Transferring file


/home/nutanix/tmp/10.30.14.247/seednvm10.30.14.247.iso -> /Nutanix_afs01_ctr/NTNX-afs01-3.iso

2017-05-30 10:09:16 INFO uvm_utils.py:1274 Transferred 376832 bytes

2017-05-30 10:09:16 INFO file_server.py:300 Cloud init dir /home/nutanix/tmp/10.30.14.246/

2017-05-30 10:09:16 INFO file_server.py:305 Powering off FileServer Vm's NTNX-afs01-2

2017-05-30 10:09:16 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 3

config {

name: "NTNX-afs01-1"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

host_uuid: "\266\253\330\330\326\300Jo\214^,\352\332HA\307"

state: kOn

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true

2017-05-30 10:09:16 INFO uvm_utils.py:938 Checking if VM is on: ddbf4012-800d-48e1-a92c-


c6ca2c612563
2017-05-30 10:09:16 INFO uvm_utils.py:1274 Transferred 376832 bytes

2017-05-30 10:09:16 INFO file_server.py:300 Cloud init dir /home/nutanix/tmp/10.30.14.247/

2017-05-30 10:09:16 INFO file_server.py:305 Powering off FileServer Vm's NTNX-afs01-3

2017-05-30 10:09:16 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 3

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

host_uuid: "o\225\352\224V6J\327\203\250\315\014T\025\325\363"

state: kOn

hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

allow_live_migrate: true

2017-05-30 10:09:16 INFO uvm_utils.py:119 Task poll wait for Shutdown

2017-05-30 10:09:16 INFO uvm_utils.py:938 Checking if VM is on: 66386092-223b-48d5-9340-


9f25b6338229

2017-05-30 10:09:16 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"
logical_timestamp: 3

config {

name: "NTNX-afs01-3"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

host_uuid: "\266\253\330\330\326\300Jo\214^,\352\332HA\307"

state: kOn

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

2017-05-30 10:09:16 INFO uvm_utils.py:119 Task poll wait for Shutdown

2017-05-30 10:09:16 INFO uvm_utils.py:119 Task poll wait for Shutdown

2017-05-30 10:09:17 INFO file_server.py:2238 Copying progress : (95.52/100%)

2017-05-30 10:09:17 INFO uvm_utils.py:130 completed_tasks {

logical_timestamp: 6

uuid: "!\t\002\237$\277ON\215|\232\276l\264\207<"

sequence_id: 8

request {

method_name: "VmChangePowerState"
arg {

embedded:
"\022\020L>\350[\035\251H4\243\355\373\263\257\032\321\242\032\020\335\277@\022\200\rH\3
41\251,\306\312,a%c \002"

response {

error_code: 0

ret {

embedded: ""

create_time_usecs: 1496164156955905

start_time_usecs: 1496164156985891

complete_time_usecs: 1496164157876469

last_updated_time_usecs: 1496164157876469

entity_list {

entity_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "L>\350[\035\251H4\243\355\373\263\257\032\321\242"

component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\"S\'\\xcb\\x8d)vN\\xd5\\x9d%\\x8c\
\x0f07\\xfa\\xfe\"\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULTS\'\np4\nNt
p5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"

disable_auto_progress_update: true

weight: 1000

2017-05-30 10:09:17 INFO uvm_utils.py:132 logical_timestamp: 6

uuid: "!\t\002\237$\277ON\215|\232\276l\264\207<"

sequence_id: 8

request {

method_name: "VmChangePowerState"

arg {

embedded:
"\022\020L>\350[\035\251H4\243\355\373\263\257\032\321\242\032\020\335\277@\022\200\rH\3
41\251,\306\312,a%c \002"

response {

error_code: 0

ret {

embedded: ""

create_time_usecs: 1496164156955905

start_time_usecs: 1496164156985891

complete_time_usecs: 1496164157876469

last_updated_time_usecs: 1496164157876469
entity_list {

entity_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "L>\350[\035\251H4\243\355\373\263\257\032\321\242"

component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\"S\'\\xcb\\x8d)vN\\xd5\\x9d%\\x8c\
\x0f07\\xfa\\xfe\"\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULTS\'\np4\nNt
p5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"

disable_auto_progress_update: true

weight: 1000

2017-05-30 10:09:17 INFO uvm_utils.py:407 powered off vm ddbf4012-800d-48e1-a92c-c6ca2c612563

2017-05-30 10:09:17 INFO file_server.py:314 Attaching ISO to FileServerVm NTNX-afs01-2

2017-05-30 10:09:17 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 4

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1
memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

allow_live_migrate: true

2017-05-30 10:09:17 INFO uvm_utils.py:938 Checking if VM is on: ddbf4012-800d-48e1-a92c-


c6ca2c612563

2017-05-30 10:09:17 INFO uvm_utils.py:130 completed_tasks {

logical_timestamp: 6

uuid: "B#\227x`\313Mf\236:\006\014\217~\022\353"

sequence_id: 7

request {

method_name: "VmChangePowerState"

arg {

embedded:
"\022\020\336sN\001\250\327O\334\231b\004\3416\010&3\032\020\254<p`\333\236B\227\266\364
VK\020\203\357\222 \002"

response {

error_code: 0

ret {
embedded: ""

create_time_usecs: 1496164156927117

start_time_usecs: 1496164156953122

complete_time_usecs: 1496164157898542

last_updated_time_usecs: 1496164157898542

entity_list {

entity_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "\336sN\001\250\327O\334\231b\004\3416\010&3"

component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\'0\\x98og\\x9dMK\\xcc\\xb8O\\xf8\
\x03ZO\\xb2\\xbf\'\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULTS\'\np4\nNt
p5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"

disable_auto_progress_update: true

weight: 1000

2017-05-30 10:09:17 INFO uvm_utils.py:132 logical_timestamp: 6


uuid: "B#\227x`\313Mf\236:\006\014\217~\022\353"

sequence_id: 7

request {

method_name: "VmChangePowerState"

arg {

embedded:
"\022\020\336sN\001\250\327O\334\231b\004\3416\010&3\032\020\254<p`\333\236B\227\266\364
VK\020\203\357\222 \002"

response {

error_code: 0

ret {

embedded: ""

create_time_usecs: 1496164156927117

start_time_usecs: 1496164156953122

complete_time_usecs: 1496164157898542

last_updated_time_usecs: 1496164157898542

entity_list {

entity_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "\336sN\001\250\327O\334\231b\004\3416\010&3"
component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\'0\\x98og\\x9dMK\\xcc\\xb8O\\xf8\
\x03ZO\\xb2\\xbf\'\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULTS\'\np4\nNt
p5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"

disable_auto_progress_update: true

weight: 1000

2017-05-30 10:09:17 INFO uvm_utils.py:407 powered off vm ac3c7060-db9e-4297-b6f4-564b1083ef92

2017-05-30 10:09:17 INFO file_server.py:314 Attaching ISO to FileServerVm NTNX-afs01-1

2017-05-30 10:09:17 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 4

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff
hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

allow_live_migrate: true

2017-05-30 10:09:17 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 4

config {

name: "NTNX-afs01-1"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true

2017-05-30 10:09:17 INFO uvm_utils.py:938 Checking if VM is on: ac3c7060-db9e-4297-b6f4-


564b1083ef92

2017-05-30 10:09:17 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 4
config {

name: "NTNX-afs01-1"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true

2017-05-30 10:09:17 INFO file_server.py:331 Waiting for ISO to be attached NTNX-afs01-2

2017-05-30 10:09:17 INFO minerva_task_util.py:1825 polling the tasks 5cefe155-f8bb-4b94-99bb-


7ef39f672beb

2017-05-30 10:09:17 INFO minerva_task_util.py:1247 tasks: 5cefe155-f8bb-4b94-99bb-7ef39f672beb

2017-05-30 10:09:17 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:17 INFO file_server.py:331 Waiting for ISO to be attached NTNX-afs01-1

2017-05-30 10:09:17 INFO minerva_task_util.py:1825 polling the tasks 9a1208fb-4124-4793-af29-


7e060af2a579

2017-05-30 10:09:17 INFO minerva_task_util.py:1247 tasks: 9a1208fb-4124-4793-af29-7e060af2a579

2017-05-30 10:09:17 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:18 INFO uvm_utils.py:130 completed_tasks {

logical_timestamp: 5
uuid: ":\0228+d^@\341\203e\225\262T\261\364\226"

sequence_id: 9

request {

method_name: "VmChangePowerState"

arg {

embedded:
"\022\020\2226\306\311P\200Em\270\2177\264\236(\326\307\032\020f8`\222\";H\325\223@\237%
\2663\202) \002"

response {

error_code: 0

ret {

embedded: ""

create_time_usecs: 1496164156971550

start_time_usecs: 1496164157003359

complete_time_usecs: 1496164158248746

last_updated_time_usecs: 1496164158248746

entity_list {

entity_id: "f8`\222\";H\325\223@\237%\2663\202)"

entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "\2226\306\311P\200Em\270\2177\264\236(\326\307"
component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\'wO\"FQ9J\\xf3\\x8f\\x15\\x96\\xb7
\\xd2\\x1c\\x95D\'\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULTS\'\np4\nNt
p5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"

disable_auto_progress_update: true

weight: 1000

2017-05-30 10:09:18 INFO uvm_utils.py:132 logical_timestamp: 5

uuid: ":\0228+d^@\341\203e\225\262T\261\364\226"

sequence_id: 9

request {

method_name: "VmChangePowerState"

arg {

embedded:
"\022\020\2226\306\311P\200Em\270\2177\264\236(\326\307\032\020f8`\222\";H\325\223@\237%
\2663\202) \002"

response {

error_code: 0

ret {

embedded: ""

}
create_time_usecs: 1496164156971550

start_time_usecs: 1496164157003359

complete_time_usecs: 1496164158248746

last_updated_time_usecs: 1496164158248746

entity_list {

entity_id: "f8`\222\";H\325\223@\237%\2663\202)"

entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "\2226\306\311P\200Em\270\2177\264\236(\326\307"

component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\'wO\"FQ9J\\xf3\\x8f\\x15\\x96\\xb7
\\xd2\\x1c\\x95D\'\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULTS\'\np4\nNt
p5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"

disable_auto_progress_update: true

weight: 1000

2017-05-30 10:09:18 INFO uvm_utils.py:407 powered off vm 66386092-223b-48d5-9340-9f25b6338229

2017-05-30 10:09:18 INFO file_server.py:314 Attaching ISO to FileServerVm NTNX-afs01-3

2017-05-30 10:09:18 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"

logical_timestamp: 4
config {

name: "NTNX-afs01-3"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

2017-05-30 10:09:18 INFO uvm_utils.py:938 Checking if VM is on: 66386092-223b-48d5-9340-


9f25b6338229

2017-05-30 10:09:18 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"

logical_timestamp: 4

config {

name: "NTNX-afs01-3"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100
agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

2017-05-30 10:09:18 INFO minerva_task_util.py:1859 polling sub tasks 5cefe155-f8bb-4b94-99bb-


7ef39f672beb completed

2017-05-30 10:09:18 INFO file_server.py:341 Attaching afs disk to FileServer Vm NTNX-afs01-2

2017-05-30 10:09:18 INFO file_server.py:331 Waiting for ISO to be attached NTNX-afs01-3

2017-05-30 10:09:18 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 5

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

disk_list {

disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\354m\267\271dRH\200\234m\343/\343\001\035\225"

disk_label: "ide.0"

}
source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-2.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

allow_live_migrate: true

2017-05-30 10:09:18 INFO uvm_utils.py:742 disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\354m\267\271dRH\200\234m\343/\343\001\035\225"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-2.iso"

}
container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

2017-05-30 10:09:18 INFO file_server.py:353 Attaching afs disk to FileServerVm NTNX-afs01-2

2017-05-30 10:09:18 INFO uvm_utils.py:938 Checking if VM is on: ddbf4012-800d-48e1-a92c-


c6ca2c612563

2017-05-30 10:09:18 INFO minerva_task_util.py:1825 polling the tasks 650ee044-4637-45f1-8bec-


5b155ce83042

2017-05-30 10:09:18 INFO minerva_task_util.py:1247 tasks: 650ee044-4637-45f1-8bec-5b155ce83042

2017-05-30 10:09:18 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 5

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

allow_live_migrate: true
}

2017-05-30 10:09:18 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:18 INFO minerva_task_util.py:1859 polling sub tasks 9a1208fb-4124-4793-af29-


7e060af2a579 completed

2017-05-30 10:09:18 INFO file_server.py:341 Attaching afs disk to FileServer Vm NTNX-afs01-1

2017-05-30 10:09:18 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 5

config {

name: "NTNX-afs01-1"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

disk_list {

disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\346\315\034\272{BB\206\206\022,G\345\312\026D"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-1.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

}
hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true

2017-05-30 10:09:18 INFO uvm_utils.py:742 disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\346\315\034\272{BB\206\206\022,G\345\312\026D"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-1.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

2017-05-30 10:09:18 INFO file_server.py:353 Attaching afs disk to FileServerVm NTNX-afs01-1

2017-05-30 10:09:18 INFO uvm_utils.py:938 Checking if VM is on: ac3c7060-db9e-4297-b6f4-


564b1083ef92
2017-05-30 10:09:18 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 5

config {

name: "NTNX-afs01-1"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true

2017-05-30 10:09:18 INFO file_server.py:363 Waiting for afs disk attachment to FileServer Vm NTNX-
afs01-2

2017-05-30 10:09:18 INFO minerva_task_util.py:1825 polling the tasks f4d77a21-6800-4789-9679-


e70048f79355

2017-05-30 10:09:18 INFO minerva_task_util.py:1247 tasks: f4d77a21-6800-4789-9679-e70048f79355

2017-05-30 10:09:18 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:18 INFO file_server.py:363 Waiting for afs disk attachment to FileServer Vm NTNX-
afs01-1

2017-05-30 10:09:18 INFO minerva_task_util.py:1825 polling the tasks b2eac50c-618f-40dc-819c-


131c3ecb178f
2017-05-30 10:09:18 INFO minerva_task_util.py:1247 tasks: b2eac50c-618f-40dc-819c-131c3ecb178f

2017-05-30 10:09:18 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:18 INFO minerva_task_util.py:1859 polling sub tasks 650ee044-4637-45f1-8bec-


5b155ce83042 completed

2017-05-30 10:09:18 INFO file_server.py:341 Attaching afs disk to FileServer Vm NTNX-afs01-3

2017-05-30 10:09:18 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"

logical_timestamp: 5

config {

name: "NTNX-afs01-3"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

disk_list {

disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\002\253D3\326\237B\026\210ET.\265\253!\347"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-3.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

hwclock_timezone: "UTC"
ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

2017-05-30 10:09:18 INFO uvm_utils.py:742 disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\002\253D3\326\237B\026\210ET.\265\253!\347"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-3.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

2017-05-30 10:09:18 INFO file_server.py:353 Attaching afs disk to FileServerVm NTNX-afs01-3

2017-05-30 10:09:18 INFO uvm_utils.py:938 Checking if VM is on: 66386092-223b-48d5-9340-


9f25b6338229

2017-05-30 10:09:18 INFO uvm_utils.py:945 vm_get_ret vm_info_list {


vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"

logical_timestamp: 5

config {

name: "NTNX-afs01-3"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

2017-05-30 10:09:18 INFO file_server.py:363 Waiting for afs disk attachment to FileServer Vm NTNX-
afs01-3

2017-05-30 10:09:18 INFO minerva_task_util.py:1825 polling the tasks 9989bb69-5da8-440d-9714-


2c058b5d1bcb

2017-05-30 10:09:18 INFO minerva_task_util.py:1247 tasks: 9989bb69-5da8-440d-9714-2c058b5d1bcb

2017-05-30 10:09:18 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:22 INFO file_server.py:2238 Copying progress :

2017-05-30 10:09:23 INFO minerva_task_util.py:1859 polling sub tasks b2eac50c-618f-40dc-819c-


131c3ecb178f completed

2017-05-30 10:09:23 INFO file_server.py:373 Afs Disk attached, attaching home disk to FileServer Vm
NTNX-afs01-1
2017-05-30 10:09:23 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 6

config {

name: "NTNX-afs01-1"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

disk_list {

disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\346\315\034\272{BB\206\206\022,G\345\312\026D"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-1.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

disk_list {

disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "\r~\300\320\230\325FY\240l\214\340K\200\274T"

disk_label: "scsi.0"
}

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 12884901888

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true

2017-05-30 10:09:23 INFO uvm_utils.py:742 disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\346\315\034\272{BB\206\206\022,G\345\312\026D"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-1.iso"
}

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

2017-05-30 10:09:23 INFO uvm_utils.py:742 disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "\r~\300\320\230\325FY\240l\214\340K\200\274T"

disk_label: "scsi.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 12884901888

2017-05-30 10:09:23 INFO file_server.py:386 Attaching home disk to FileServerVm NTNX-afs01-1

2017-05-30 10:09:23 INFO uvm_utils.py:938 Checking if VM is on: ac3c7060-db9e-4297-b6f4-


564b1083ef92

2017-05-30 10:09:23 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 6

config {

name: "NTNX-afs01-1"

num_vcpus: 4
num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true

2017-05-30 10:09:23 INFO file_server.py:397 Waiting for home disk attachment to FileServer Vm NTNX-
afs01-1

2017-05-30 10:09:23 INFO minerva_task_util.py:1825 polling the tasks c0e4f5b5-e5a6-4356-84e2-


0b3c197082f4

2017-05-30 10:09:23 INFO minerva_task_util.py:1247 tasks: c0e4f5b5-e5a6-4356-84e2-0b3c197082f4

2017-05-30 10:09:23 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:24 INFO minerva_task_util.py:1859 polling sub tasks c0e4f5b5-e5a6-4356-84e2-


0b3c197082f4 completed

2017-05-30 10:09:24 INFO file_server.py:407 Home Disk attached, attaching cassandra disk to FileServer
Vm NTNX-afs01-1

2017-05-30 10:09:24 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 7

config {

name: "NTNX-afs01-1"

num_vcpus: 4
num_cores_per_vcpu: 1

memory_size_mb: 12288

disk_list {

disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\346\315\034\272{BB\206\206\022,G\345\312\026D"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-1.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

disk_list {

disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "\r~\300\320\230\325FY\240l\214\340K\200\274T"

disk_label: "scsi.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

container_id: 12406
scsi_passthrough_enabled: true

size_bytes: 12884901888

disk_list {

disk_addr {

adapter_type: kSCSI

device_index: 1

vmdisk_uuid: "\235\274x7\3643L\362\266\010[\311\031!\371\332"

disk_label: "scsi.1"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 48318382080

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true

2017-05-30 10:09:24 INFO uvm_utils.py:742 disk_addr {

adapter_type: kIDE

device_index: 0
vmdisk_uuid: "\346\315\034\272{BB\206\206\022,G\345\312\026D"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-1.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

2017-05-30 10:09:24 INFO uvm_utils.py:742 disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "\r~\300\320\230\325FY\240l\214\340K\200\274T"

disk_label: "scsi.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 12884901888

2017-05-30 10:09:24 INFO uvm_utils.py:742 disk_addr {

adapter_type: kSCSI

device_index: 1

vmdisk_uuid: "\235\274x7\3643L\362\266\010[\311\031!\371\332"
disk_label: "scsi.1"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 48318382080

2017-05-30 10:09:24 INFO file_server.py:419 Attaching cassandra disk to FileServerVm NTNX-afs01-1

2017-05-30 10:09:24 INFO uvm_utils.py:938 Checking if VM is on: ac3c7060-db9e-4297-b6f4-


564b1083ef92

2017-05-30 10:09:24 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 7

config {

name: "NTNX-afs01-1"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true

}
2017-05-30 10:09:24 INFO file_server.py:430 Waiting for cassandra disk attachment to FileServer Vm NTNX-afs01-1

2017-05-30 10:09:24 INFO minerva_task_util.py:1825 polling the tasks 3c1aaff0-9725-4cdc-8ea3-d287f5d70337

2017-05-30 10:09:24 INFO minerva_task_util.py:1247 tasks: 3c1aaff0-9725-4cdc-8ea3-d287f5d70337

2017-05-30 10:09:24 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:24 INFO minerva_task_util.py:1859 polling sub tasks 3c1aaff0-9725-4cdc-8ea3-d287f5d70337 completed

2017-05-30 10:09:24 INFO file_server.py:439 Powering on FileServer Vm NTNX-afs01-1

2017-05-30 10:09:24 INFO uvm_utils.py:921 Checking if VM is off: ac3c7060-db9e-4297-b6f4-564b1083ef92

2017-05-30 10:09:24 INFO uvm_utils.py:119 Task poll wait for Power on
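For every FSVM, file_server.py walks the same sequence: attach the seed ISO, attach the AFS (boot) disk, attach the home disk, attach the Cassandra disk, and finally power the VM on, polling a sub-task between each step. When a deployment stalls, the quickest way to see which FSVM stopped and at which step is to pull only those milestone lines out of the log. The sketch below is illustrative only and again assumes the default CVM log location.

    # Illustrative only: surface the per-FSVM deployment milestones that
    # file_server.py logs, so a stalled step is easy to spot.
    LOG = "/home/nutanix/data/logs/minerva_cvm.log"  # assumed default CVM log path

    keywords = ("Attaching ISO", "Attaching afs disk", "Attaching home disk",
                "Attaching cassandra disk", "Powering on FileServer Vm")

    with open(LOG) as fh:
        for line in fh:
            if "file_server.py" in line and any(k in line for k in keywords):
                print(line.rstrip())

In a healthy run each FSVM shows the full sequence ending in "Powering on FileServer Vm"; a VM whose trail stops earlier marks where to start digging.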

2017-05-30 10:09:29 INFO minerva_task_util.py:1859 polling sub tasks f4d77a21-6800-4789-9679-


e70048f79355 completed

2017-05-30 10:09:29 INFO file_server.py:373 Afs Disk attached, attaching home disk to FileServer Vm
NTNX-afs01-2

2017-05-30 10:09:29 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 6

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

disk_list {

disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\354m\267\271dRH\200\234m\343/\343\001\035\225"

disk_label: "ide.0"

}
source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-2.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

disk_list {

disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "\007\010\032\030\273\'J\254\233h\313\333) \202\025"

disk_label: "scsi.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 12884901888

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm
}

state: kOff

hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

allow_live_migrate: true

2017-05-30 10:09:29 INFO uvm_utils.py:742 disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\354m\267\271dRH\200\234m\343/\343\001\035\225"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-2.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

2017-05-30 10:09:29 INFO uvm_utils.py:742 disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "\007\010\032\030\273\'J\254\233h\313\333) \202\025"

disk_label: "scsi.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"
}

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 12884901888

2017-05-30 10:09:29 INFO file_server.py:386 Attaching home disk to FileServerVm NTNX-afs01-2

2017-05-30 10:09:29 INFO uvm_utils.py:938 Checking if VM is on: ddbf4012-800d-48e1-a92c-


c6ca2c612563

2017-05-30 10:09:29 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 6

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

allow_live_migrate: true

}
2017-05-30 10:09:29 INFO file_server.py:397 Waiting for home disk attachment to FileServer Vm NTNX-
afs01-2

2017-05-30 10:09:29 INFO minerva_task_util.py:1825 polling the tasks 417ae1cc-40a7-4389-a821-


86d1e3b90ad2

2017-05-30 10:09:29 INFO minerva_task_util.py:1247 tasks: 417ae1cc-40a7-4389-a821-86d1e3b90ad2

2017-05-30 10:09:29 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:30 INFO uvm_utils.py:130 completed_tasks {

logical_timestamp: 6

uuid: "u\367 S\344=C\357\240\262+l\336g6\336"

sequence_id: 18

request {

method_name: "VmChangePowerState"

arg {

embedded:
"\022\020\336sN\001\250\327O\334\231b\004\3416\010&3\032\020\254<p`\333\236B\227\266\364
VK\020\203\357\222 \001"

response {

error_code: 0

ret {

embedded: ""

create_time_usecs: 1496164164443324

start_time_usecs: 1496164164468599

complete_time_usecs: 1496164170607489

last_updated_time_usecs: 1496164170607489

entity_list {

entity_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"
entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "\336sN\001\250\327O\334\231b\004\3416\010&3"

component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\'3\\x92\\xaa\\x95\\x16?L\\xa7\\x96
<4\\x87\\xfc\\xa8kJ\'\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULTS\'\np4\n
Ntp5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"

disable_auto_progress_update: true

weight: 1000

2017-05-30 10:09:30 INFO uvm_utils.py:132 logical_timestamp: 6

uuid: "u\367 S\344=C\357\240\262+l\336g6\336"

sequence_id: 18

request {

method_name: "VmChangePowerState"

arg {

embedded:
"\022\020\336sN\001\250\327O\334\231b\004\3416\010&3\032\020\254<p`\333\236B\227\266\364
VK\020\203\357\222 \001"

}
}

response {

error_code: 0

ret {

embedded: ""

create_time_usecs: 1496164164443324

start_time_usecs: 1496164164468599

complete_time_usecs: 1496164170607489

last_updated_time_usecs: 1496164170607489

entity_list {

entity_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "\336sN\001\250\327O\334\231b\004\3416\010&3"

component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\'3\\x92\\xaa\\x95\\x16?L\\xa7\\x96
<4\\x87\\xfc\\xa8kJ\'\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULTS\'\np4\n
Ntp5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"

disable_auto_progress_update: true
weight: 1000

2017-05-30 10:09:30 INFO minerva_task_util.py:1859 polling sub tasks 417ae1cc-40a7-4389-a821-


86d1e3b90ad2 completed

2017-05-30 10:09:30 INFO file_server.py:407 Home Disk attached, attaching cassandra disk to FileServer
Vm NTNX-afs01-2

2017-05-30 10:09:30 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 7

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

disk_list {

disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\354m\267\271dRH\200\234m\343/\343\001\035\225"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-2.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

}
disk_list {

disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "\007\010\032\030\273\'J\254\233h\313\333) \202\025"

disk_label: "scsi.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 12884901888

disk_list {

disk_addr {

adapter_type: kSCSI

device_index: 1

vmdisk_uuid: "\312\275\374\235.\347E\022\206J\344K,k\363\344"

disk_label: "scsi.1"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 48318382080

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false
}

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

allow_live_migrate: true

2017-05-30 10:09:30 INFO uvm_utils.py:742 disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\354m\267\271dRH\200\234m\343/\343\001\035\225"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-2.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

2017-05-30 10:09:30 INFO uvm_utils.py:742 disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "\007\010\032\030\273\'J\254\233h\313\333) \202\025"

disk_label: "scsi.0"

}
source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 12884901888

2017-05-30 10:09:30 INFO uvm_utils.py:742 disk_addr {

adapter_type: kSCSI

device_index: 1

vmdisk_uuid: "\312\275\374\235.\347E\022\206J\344K,k\363\344"

disk_label: "scsi.1"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 48318382080

2017-05-30 10:09:30 INFO file_server.py:419 Attaching cassandra disk to FileServerVm NTNX-afs01-2

2017-05-30 10:09:30 INFO uvm_utils.py:938 Checking if VM is on: ddbf4012-800d-48e1-a92c-


c6ca2c612563

2017-05-30 10:09:30 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 7

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288
hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

allow_live_migrate: true

2017-05-30 10:09:30 INFO file_server.py:430 Waiting for cassandra disk attachment to FileServer Vm
NTNX-afs01-2

2017-05-30 10:09:30 INFO minerva_task_util.py:1825 polling the tasks 3cbd1127-73d6-4d99-bd7d-


ad304dd51e86

2017-05-30 10:09:30 INFO minerva_task_util.py:1247 tasks: 3cbd1127-73d6-4d99-bd7d-ad304dd51e86

2017-05-30 10:09:30 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:31 INFO minerva_task_util.py:1859 polling sub tasks 3cbd1127-73d6-4d99-bd7d-


ad304dd51e86 completed

2017-05-30 10:09:31 INFO file_server.py:439 Powering on FileServer Vm NTNX-afs01-2

2017-05-30 10:09:31 INFO uvm_utils.py:921 Checking if VM is off: ddbf4012-800d-48e1-a92c-


c6ca2c612563

2017-05-30 10:09:31 INFO uvm_utils.py:119 Task poll wait for Power on

2017-05-30 10:09:33 INFO uvm_utils.py:130 completed_tasks {

logical_timestamp: 6

uuid: "c9i\202\331A@\353\215\253\r\007C\221\241?"

sequence_id: 21

request {

method_name: "VmChangePowerState"
arg {

embedded:
"\022\020L>\350[\035\251H4\243\355\373\263\257\032\321\242\032\020\335\277@\022\200\rH\3
41\251,\306\312,a%c \001"

response {

error_code: 0

ret {

embedded: ""

create_time_usecs: 1496164171056577

start_time_usecs: 1496164171089171

complete_time_usecs: 1496164173615734

last_updated_time_usecs: 1496164173615734

entity_list {

entity_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "L>\350[\035\251H4\243\355\373\263\257\032\321\242"

component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\'\\x93\\xfc\\x1a7\\xc4UB/\\x91\\x8
4\\x7f\\xe9\\xb3Y\\xa0\\xcd\'\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULT
S\'\np4\nNtp5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"

disable_auto_progress_update: true

weight: 1000

2017-05-30 10:09:33 INFO uvm_utils.py:132 logical_timestamp: 6

uuid: "c9i\202\331A@\353\215\253\r\007C\221\241?"

sequence_id: 21

request {

method_name: "VmChangePowerState"

arg {

embedded:
"\022\020L>\350[\035\251H4\243\355\373\263\257\032\321\242\032\020\335\277@\022\200\rH\3
41\251,\306\312,a%c \001"

response {

error_code: 0

ret {

embedded: ""

create_time_usecs: 1496164171056577

start_time_usecs: 1496164171089171

complete_time_usecs: 1496164173615734

last_updated_time_usecs: 1496164173615734
entity_list {

entity_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "L>\350[\035\251H4\243\355\373\263\257\032\321\242"

component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\'\\x93\\xfc\\x1a7\\xc4UB/\\x91\\x8
4\\x7f\\xe9\\xb3Y\\xa0\\xcd\'\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULT
S\'\np4\nNtp5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"

disable_auto_progress_update: true

weight: 1000

2017-05-30 10:09:33 INFO minerva_task_util.py:1859 polling sub tasks 9989bb69-5da8-440d-9714-2c058b5d1bcb completed

2017-05-30 10:09:33 INFO file_server.py:373 Afs Disk attached, attaching home disk to FileServer Vm NTNX-afs01-3

2017-05-30 10:09:33 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"

logical_timestamp: 6

config {

name: "NTNX-afs01-3"
num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

disk_list {

disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\002\253D3\326\237B\026\210ET.\265\253!\347"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-3.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

disk_list {

disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "2ZH\024\023\364B\310\240\003\307~i\267\r\023"

disk_label: "scsi.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

}
container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 12884901888

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

2017-05-30 10:09:33 INFO uvm_utils.py:742 disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\002\253D3\326\237B\026\210ET.\265\253!\347"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-3.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true
2017-05-30 10:09:33 INFO uvm_utils.py:742 disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "2ZH\024\023\364B\310\240\003\307~i\267\r\023"

disk_label: "scsi.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 12884901888

2017-05-30 10:09:33 INFO file_server.py:386 Attaching home disk to FileServerVm NTNX-afs01-3

2017-05-30 10:09:33 INFO uvm_utils.py:938 Checking if VM is on: 66386092-223b-48d5-9340-9f25b6338229

2017-05-30 10:09:33 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"

logical_timestamp: 6

config {

name: "NTNX-afs01-3"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false
}

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

2017-05-30 10:09:33 INFO file_server.py:397 Waiting for home disk attachment to FileServer Vm NTNX-afs01-3

2017-05-30 10:09:33 INFO minerva_task_util.py:1825 polling the tasks 4a555020-4de1-44fd-b1e5-1cb6202197dd

2017-05-30 10:09:33 INFO minerva_task_util.py:1247 tasks: 4a555020-4de1-44fd-b1e5-1cb6202197dd

2017-05-30 10:09:33 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:09:33 INFO minerva_task_util.py:1859 polling sub tasks 4a555020-4de1-44fd-b1e5-1cb6202197dd completed

2017-05-30 10:09:33 INFO file_server.py:407 Home Disk attached, attaching cassandra disk to FileServer Vm NTNX-afs01-3

2017-05-30 10:09:33 INFO uvm_utils.py:737 vm_info_list {

vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"

logical_timestamp: 7

config {

name: "NTNX-afs01-3"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

disk_list {

disk_addr {

adapter_type: kIDE
device_index: 0

vmdisk_uuid: "\002\253D3\326\237B\026\210ET.\265\253!\347"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-3.iso"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

disk_list {

disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "2ZH\024\023\364B\310\240\003\307~i\267\r\023"

disk_label: "scsi.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 12884901888

disk_list {

disk_addr {
adapter_type: kSCSI

device_index: 1

vmdisk_uuid: "\213\205\023\2733ZJf\254\237\r\336},\3403"

disk_label: "scsi.1"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 48318382080

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

2017-05-30 10:09:33 INFO uvm_utils.py:742 disk_addr {

adapter_type: kIDE

device_index: 0

vmdisk_uuid: "\002\253D3\326\237B\026\210ET.\265\253!\347"

disk_label: "ide.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/NTNX-afs01-3.iso"
}

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 376832

is_cdrom: true

2017-05-30 10:09:33 INFO uvm_utils.py:742 disk_addr {

adapter_type: kSCSI

device_index: 0

vmdisk_uuid: "2ZH\024\023\364B\310\240\003\307~i\267\r\023"

disk_label: "scsi.0"

source_vmdisk_addr {

nfs_path: "/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-
419aa3a83df5548924198f85398deb20e8b615fe"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 12884901888

2017-05-30 10:09:33 INFO uvm_utils.py:742 disk_addr {

adapter_type: kSCSI

device_index: 1

vmdisk_uuid: "\213\205\023\2733ZJf\254\237\r\336},\3403"

disk_label: "scsi.1"

container_id: 12406

scsi_passthrough_enabled: true

size_bytes: 48318382080
2017-05-30 10:09:33 INFO file_server.py:419 Attaching cassandra disk to FileServerVm NTNX-afs01-3

2017-05-30 10:09:33 INFO uvm_utils.py:938 Checking if VM is on: 66386092-223b-48d5-9340-9f25b6338229

2017-05-30 10:09:33 INFO uvm_utils.py:945 vm_get_ret vm_info_list {

vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"

logical_timestamp: 7

config {

name: "NTNX-afs01-3"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

state: kOff

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

2017-05-30 10:09:33 INFO file_server.py:430 Waiting for cassandra disk attachment to FileServer Vm NTNX-afs01-3

2017-05-30 10:09:34 INFO minerva_task_util.py:1825 polling the tasks 60af9be3-b37a-4d6b-9641-12132d104c49

2017-05-30 10:09:34 INFO minerva_task_util.py:1247 tasks: 60af9be3-b37a-4d6b-9641-12132d104c49

2017-05-30 10:09:34 INFO minerva_task_util.py:1283 Fetching pending tasks.


2017-05-30 10:09:34 INFO minerva_task_util.py:1859 polling sub tasks 60af9be3-b37a-4d6b-9641-12132d104c49 completed

2017-05-30 10:09:34 INFO file_server.py:439 Powering on FileServer Vm NTNX-afs01-3

2017-05-30 10:09:34 INFO uvm_utils.py:921 Checking if VM is off: 66386092-223b-48d5-9340-9f25b6338229

2017-05-30 10:09:34 INFO uvm_utils.py:119 Task poll wait for Power on

2017-05-30 10:09:35 INFO uvm_utils.py:130 completed_tasks {

logical_timestamp: 6

uuid: "=\367MBd\304L\311\246x\342`17\243."

sequence_id: 24

request {

method_name: "VmChangePowerState"

arg {

embedded:
"\022\020\2226\306\311P\200Em\270\2177\264\236(\326\307\032\020f8`\222\";H\325\223@\237%
\2663\202) \001"

response {

error_code: 0

ret {

embedded: ""

create_time_usecs: 1496164174221192

start_time_usecs: 1496164174243824

complete_time_usecs: 1496164175618069

last_updated_time_usecs: 1496164175618069

entity_list {

entity_id: "f8`\222\";H\325\223@\237%\2663\202)"
entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "\2226\306\311P\200Em\270\2177\264\236(\326\307"

component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\'\\x8ba\\xf6\\xc0\\xa5\\xfbJ\\xdf\\x
bd
E\\x8b\\xda\\xd7\\x1a\\xd1\'\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULTS
\'\np4\nNtp5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"

disable_auto_progress_update: true

weight: 1000

2017-05-30 10:09:35 INFO uvm_utils.py:132 logical_timestamp: 6

uuid: "=\367MBd\304L\311\246x\342`17\243."

sequence_id: 24

request {

method_name: "VmChangePowerState"

arg {

embedded:
"\022\020\2226\306\311P\200Em\270\2177\264\236(\326\307\032\020f8`\222\";H\325\223@\237%
\2663\202) \001"

}
}

response {

error_code: 0

ret {

embedded: ""

create_time_usecs: 1496164174221192

start_time_usecs: 1496164174243824

complete_time_usecs: 1496164175618069

last_updated_time_usecs: 1496164175618069

entity_list {

entity_id: "f8`\222\";H\325\223@\237%\2663\202)"

entity_type: kVM

operation_type: "VmChangePowerState"

message: ""

percentage_complete: 100

status: kSucceeded

parent_task_uuid: "\2226\306\311P\200Em\270\2177\264\236(\326\307"

component: "Uhura"

canceled: false

internal_opaque:
"(dp0\nS\'internal_vm_change_power_state_op_handle\'\np1\nS\'\\x8ba\\xf6\\xc0\\xa5\\xfbJ\\xdf\\x
bd
E\\x8b\\xda\\xd7\\x1a\\xd1\'\np2\nsS\'internal_current_state_info\'\np3\n(S\'CONSOLIDATE_RESULTS
\'\np4\nNtp5\ns."

deleted: false

internal_task: false

cluster_uuid: "\000\005PsR\302I\247\000\000\000\000\000\000\272\272"
disable_auto_progress_update: true

weight: 1000

2017-05-30 10:09:35 INFO minerva_task_util.py:1859 polling sub tasks de734e01-a8d7-4fdc-9962-04e136082633 completed

2017-05-30 10:09:35 INFO minerva_task_util.py:1859 polling sub tasks 4c3ee85b-1da9-4834-a3ed-fbb3af1ad1a2 completed

2017-05-30 10:09:35 INFO minerva_task_util.py:1859 polling sub tasks 9236c6c9-5080-456d-b88f-37b49e28d6c7 completed

2017-05-30 10:09:35 INFO uvm_utils.py:1576 Get the VMs information: [ac3c7060-db9e-4297-b6f4-564b1083ef92,ddbf4012-800d-48e1-a92c-c6ca2c612563,66386092-223b-48d5-9340-9f25b6338229]

2017-05-30 10:09:35 INFO uvm_utils.py:1585 vm_get_ret vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 9

config {

name: "NTNX-afs01-1"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

host_uuid: "\266\253\330\330\326\300Jo\214^,\352\332HA\307"

state: kOn

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true
}

vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 9

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

host_uuid: "v$\262(\336;CW\252\322\021\261<%l\037"

state: kOn

hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

allow_live_migrate: true

vm_info_list {

vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"

logical_timestamp: 9

config {

name: "NTNX-afs01-3"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288
hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

host_uuid: "o\225\352\224V6J\327\203\250\315\014T\025\325\363"

state: kOn

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

2017-05-30 10:09:38 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:09:51 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:10:11 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:10:31 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:10:51 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:11:11 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:11:31 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:11:51 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:12:11 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:12:31 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:12:51 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:13:11 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:13:31 INFO file_server.py:2699 Waiting for FSVM to be up and running.

2017-05-30 10:13:51 INFO file_server.py:2699 Waiting for FSVM to be up and running.
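Note: the repeated "Waiting for FSVM to be up and running." entries are normal while the three FSVMs boot and run cloud-init. Once they respond, the workflow performs the pre-cloud-init checks that follow in the log: SSH reachability (it runs the command true over SSH), a gateway ping, an NTP offset check, and a ping of the Stargate VIP. The snippet below is only a minimal Python 3 sketch of an equivalent manual reachability probe; the FSVM and gateway addresses are the ones recorded in this deployment and would be replaced with the values from the case being worked.

import socket
import subprocess

# Addresses taken from the log above; substitute the customer's values.
FSVM_IPS = ["10.30.14.245", "10.30.14.246", "10.30.14.247"]
GATEWAY = "10.30.0.1"

def ssh_port_open(ip, timeout=5):
    # The deployment workflow runs "true" over SSH; here we only test that TCP/22 answers.
    try:
        with socket.create_connection((ip, 22), timeout=timeout):
            return True
    except OSError:
        return False

def pingable(ip):
    # Same style of probe as the "ping -c 1 -w 15 <ip>" commands shown in the log.
    return subprocess.call(["ping", "-c", "1", "-w", "15", ip],
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

for ip in FSVM_IPS:
    print(ip, "ssh reachable:", ssh_port_open(ip), "gateway pingable:", pingable(GATEWAY))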

2017-05-30 10:14:01 INFO file_server.py:2560 pre cloud init check for NTNX-afs01-1

2017-05-30 10:14:01 INFO file_server.py:2564 Checking if able to ssh for FileServerVm:NTNX-afs01-1 ip 10.30.14.245

2017-05-30 10:14:01 INFO minerva_utils.py:350 Running command true

2017-05-30 10:14:01 INFO file_server.py:2560 pre cloud init check for NTNX-afs01-2

2017-05-30 10:14:01 INFO file_server.py:2560 pre cloud init check for NTNX-afs01-3

2017-05-30 10:14:01 INFO file_server.py:2564 Checking if able to ssh for FileServerVm:NTNX-afs01-2 ip 10.30.14.246

2017-05-30 10:14:01 INFO minerva_utils.py:350 Running command true

2017-05-30 10:14:01 INFO file_server.py:2564 Checking if able to ssh for FileServerVm:NTNX-afs01-3 ip 10.30.14.247

2017-05-30 10:14:01 INFO minerva_utils.py:350 Running command true

2017-05-30 10:14:02 INFO minerva_utils.py:357 command true output

2017-05-30 10:14:02 INFO file_server.py:2572 Pinging gateway [u'10.30.0.1', u'10.30.0.1'] from FileServerVm NTNX-afs01-2 Ip 10.30.14.246

2017-05-30 10:14:02 INFO minerva_utils.py:357 command true output

2017-05-30 10:14:02 INFO minerva_utils.py:357 command true output

2017-05-30 10:14:02 INFO file_server.py:2572 Pinging gateway [u'10.30.0.1', u'10.30.0.1'] from FileServerVm NTNX-afs01-1 Ip 10.30.14.245

2017-05-30 10:14:02 INFO file_server.py:2572 Pinging gateway [u'10.30.0.1', u'10.30.0.1'] from FileServerVm NTNX-afs01-3 Ip 10.30.14.247

2017-05-30 10:14:02 INFO minerva_utils.py:482 {u'ping -c 1 -w 15 10.30.0.1': (0, 'PING 10.30.0.1


(10.30.0.1) 56(84) bytes of data.\n64 bytes from 10.30.0.1: icmp_seq=1 ttl=64 time=0.804 ms\n\n---
10.30.0.1 ping statistics ---\n1 packets transmitted, 1 received, 0% packet loss, time 0ms\nrtt
min/avg/max/mdev = 0.804/0.804/0.804/0.000 ms\n', '')}

2017-05-30 10:14:02 INFO minerva_utils.py:492 Pinging 10.30.0.1 PING 10.30.0.1 (10.30.0.1) 56(84)
bytes of data.

64 bytes from 10.30.0.1: icmp_seq=1 ttl=64 time=0.804 ms

--- 10.30.0.1 ping statistics ---

1 packets transmitted, 1 received, 0% packet loss, time 0ms

rtt min/avg/max/mdev = 0.804/0.804/0.804/0.000 ms


2017-05-30 10:14:02 INFO minerva_utils.py:492 Pinging 10.30.0.1 PING 10.30.0.1 (10.30.0.1) 56(84)
bytes of data.

64 bytes from 10.30.0.1: icmp_seq=1 ttl=64 time=0.804 ms

--- 10.30.0.1 ping statistics ---

1 packets transmitted, 1 received, 0% packet loss, time 0ms

rtt min/avg/max/mdev = 0.804/0.804/0.804/0.000 ms

2017-05-30 10:14:02 INFO minerva_utils.py:500 Pinged ip list [u'10.30.0.1', u'10.30.0.1'], Non pinged ip
list []

2017-05-30 10:14:02 INFO file_server.py:2579 NvmInfoDict {u'NTNX-afs01-2': []} with FileServerVm NTNX-afs01-2

2017-05-30 10:14:02 INFO file_server.py:2583 Checking if ntp timed out for FileServerVm:NTNX-afs01-2 ip 10.30.14.246

2017-05-30 10:14:02 INFO file_server.py:2587 Checking if ntp sync 10.30.15.91 for FileServerVm:NTNX-afs01-2 ip 10.30.14.246

2017-05-30 10:14:02 INFO minerva_utils.py:482 {u'ping -c 1 -w 15 10.30.0.1': (0, 'PING 10.30.0.1


(10.30.0.1) 56(84) bytes of data.\n64 bytes from 10.30.0.1: icmp_seq=1 ttl=64 time=0.826 ms\n\n---
10.30.0.1 ping statistics ---\n1 packets transmitted, 1 received, 0% packet loss, time 0ms\nrtt
min/avg/max/mdev = 0.826/0.826/0.826/0.000 ms\n', '')}

2017-05-30 10:14:02 INFO minerva_utils.py:492 Pinging 10.30.0.1 PING 10.30.0.1 (10.30.0.1) 56(84)
bytes of data.

64 bytes from 10.30.0.1: icmp_seq=1 ttl=64 time=0.826 ms

--- 10.30.0.1 ping statistics ---

1 packets transmitted, 1 received, 0% packet loss, time 0ms

rtt min/avg/max/mdev = 0.826/0.826/0.826/0.000 ms

2017-05-30 10:14:02 INFO minerva_utils.py:492 Pinging 10.30.0.1 PING 10.30.0.1 (10.30.0.1) 56(84)
bytes of data.

64 bytes from 10.30.0.1: icmp_seq=1 ttl=64 time=0.826 ms


--- 10.30.0.1 ping statistics ---

1 packets transmitted, 1 received, 0% packet loss, time 0ms

rtt min/avg/max/mdev = 0.826/0.826/0.826/0.000 ms

2017-05-30 10:14:02 INFO minerva_utils.py:500 Pinged ip list [u'10.30.0.1', u'10.30.0.1'], Non pinged ip
list []

2017-05-30 10:14:02 INFO file_server.py:2579 NvmInfoDict {u'NTNX-afs01-3': [], u'NTNX-afs01-2': []} with FileServerVm NTNX-afs01-3

2017-05-30 10:14:02 INFO file_server.py:2583 Checking if ntp timed out for FileServerVm:NTNX-afs01-3 ip 10.30.14.247

2017-05-30 10:14:02 INFO file_server.py:2587 Checking if ntp sync 10.30.15.91 for FileServerVm:NTNX-afs01-3 ip 10.30.14.247

2017-05-30 10:14:02 INFO minerva_utils.py:482 {u'ping -c 1 -w 15 10.30.0.1': (0, 'PING 10.30.0.1


(10.30.0.1) 56(84) bytes of data.\n64 bytes from 10.30.0.1: icmp_seq=1 ttl=64 time=0.388 ms\n\n---
10.30.0.1 ping statistics ---\n1 packets transmitted, 1 received, 0% packet loss, time 0ms\nrtt
min/avg/max/mdev = 0.388/0.388/0.388/0.000 ms\n', '')}

2017-05-30 10:14:02 INFO minerva_utils.py:492 Pinging 10.30.0.1 PING 10.30.0.1 (10.30.0.1) 56(84)
bytes of data.

64 bytes from 10.30.0.1: icmp_seq=1 ttl=64 time=0.388 ms

--- 10.30.0.1 ping statistics ---

1 packets transmitted, 1 received, 0% packet loss, time 0ms

rtt min/avg/max/mdev = 0.388/0.388/0.388/0.000 ms

2017-05-30 10:14:02 INFO minerva_utils.py:492 Pinging 10.30.0.1 PING 10.30.0.1 (10.30.0.1) 56(84)
bytes of data.

64 bytes from 10.30.0.1: icmp_seq=1 ttl=64 time=0.388 ms

--- 10.30.0.1 ping statistics ---

1 packets transmitted, 1 received, 0% packet loss, time 0ms

rtt min/avg/max/mdev = 0.388/0.388/0.388/0.000 ms


2017-05-30 10:14:02 INFO minerva_utils.py:500 Pinged ip list [u'10.30.0.1', u'10.30.0.1'], Non pinged ip
list []

2017-05-30 10:14:02 INFO file_server.py:2579 NvmInfoDict {u'NTNX-afs01-1': [], u'NTNX-afs01-3': [], u'NTNX-afs01-2': []} with FileServerVm NTNX-afs01-1

2017-05-30 10:14:02 INFO file_server.py:2583 Checking if ntp timed out for FileServerVm:NTNX-afs01-1 ip 10.30.14.245

2017-05-30 10:14:02 INFO file_server.py:2587 Checking if ntp sync 10.30.15.91 for FileServerVm:NTNX-afs01-1 ip 10.30.14.245

2017-05-30 10:14:03 INFO file_server.py:2592 ntp offset 25238.626737 for FileServerVm:NTNX-afs01-2 ip 10.30.14.246

2017-05-30 10:14:03 INFO file_server.py:2597 ntp offset drifted 25238.626737 for FileServerVm:NTNX-afs01-2 ip 10.30.14.246

2017-05-30 10:14:03 INFO file_server.py:2600 ntp offset not drifted 25238.626737 for FileServerVm:NTNX-afs01-2 ip 10.30.14.246

2017-05-30 10:14:03 INFO file_server.py:2614 Checking if stargate vip pingable 10.30.15.240 for FileServerVm:NTNX-afs01-2 ip 10.30.14.246

2017-05-30 10:14:03 INFO file_server.py:2592 ntp offset 25238.222605 for FileServerVm:NTNX-afs01-3 ip 10.30.14.247

2017-05-30 10:14:03 INFO file_server.py:2597 ntp offset drifted 25238.222605 for FileServerVm:NTNX-afs01-3 ip 10.30.14.247

2017-05-30 10:14:03 INFO file_server.py:2600 ntp offset not drifted 25238.222605 for FileServerVm:NTNX-afs01-3 ip 10.30.14.247

2017-05-30 10:14:03 INFO file_server.py:2614 Checking if stargate vip pingable 10.30.15.240 for FileServerVm:NTNX-afs01-3 ip 10.30.14.247

2017-05-30 10:14:03 INFO file_server.py:2592 ntp offset 25238.286112 for FileServerVm:NTNX-afs01-1 ip 10.30.14.245

2017-05-30 10:14:03 INFO file_server.py:2597 ntp offset drifted 25238.286112 for FileServerVm:NTNX-afs01-1 ip 10.30.14.245

2017-05-30 10:14:03 INFO file_server.py:2600 ntp offset not drifted 25238.286112 for FileServerVm:NTNX-afs01-1 ip 10.30.14.245

2017-05-30 10:14:03 INFO file_server.py:2614 Checking if stargate vip pingable 10.30.15.240 for FileServerVm:NTNX-afs01-1 ip 10.30.14.245

2017-05-30 10:14:03 INFO minerva_utils.py:482 {u'ping -c 1 -w 15 10.30.15.240': (0, 'PING 10.30.15.240


(10.30.15.240) 56(84) bytes of data.\n64 bytes from 10.30.15.240: icmp_seq=1 ttl=64 time=0.866
ms\n\n--- 10.30.15.240 ping statistics ---\n1 packets transmitted, 1 received, 0% packet loss, time
0ms\nrtt min/avg/max/mdev = 0.866/0.866/0.866/0.000 ms\n', '')}

2017-05-30 10:14:03 INFO minerva_utils.py:492 Pinging 10.30.15.240 PING 10.30.15.240 (10.30.15.240)


56(84) bytes of data.

64 bytes from 10.30.15.240: icmp_seq=1 ttl=64 time=0.866 ms

--- 10.30.15.240 ping statistics ---

1 packets transmitted, 1 received, 0% packet loss, time 0ms

rtt min/avg/max/mdev = 0.866/0.866/0.866/0.000 ms

2017-05-30 10:14:03 INFO minerva_utils.py:500 Pinged ip list [u'10.30.15.240'], Non pinged ip list []

2017-05-30 10:14:03 INFO file_server.py:2621 [] [u'NTNX-afs01-2', u'NTNX-afs01-3', u'NTNX-afs01-1'] []


for FileServerVm NTNX-afs01-2 ip 10.30.14.246

2017-05-30 10:14:03 INFO minerva_utils.py:482 {u'ping -c 1 -w 15 10.30.15.240': (0, 'PING 10.30.15.240


(10.30.15.240) 56(84) bytes of data.\n64 bytes from 10.30.15.240: icmp_seq=1 ttl=64 time=0.946
ms\n\n--- 10.30.15.240 ping statistics ---\n1 packets transmitted, 1 received, 0% packet loss, time
1ms\nrtt min/avg/max/mdev = 0.946/0.946/0.946/0.000 ms\n', '')}

2017-05-30 10:14:03 INFO minerva_utils.py:492 Pinging 10.30.15.240 PING 10.30.15.240 (10.30.15.240)


56(84) bytes of data.

64 bytes from 10.30.15.240: icmp_seq=1 ttl=64 time=0.946 ms

--- 10.30.15.240 ping statistics ---

1 packets transmitted, 1 received, 0% packet loss, time 1ms

rtt min/avg/max/mdev = 0.946/0.946/0.946/0.000 ms

2017-05-30 10:14:03 INFO minerva_utils.py:500 Pinged ip list [u'10.30.15.240'], Non pinged ip list []

2017-05-30 10:14:03 INFO file_server.py:2621 [] [u'NTNX-afs01-2', u'NTNX-afs01-3', u'NTNX-afs01-1'] []


for FileServerVm NTNX-afs01-3 ip 10.30.14.247

2017-05-30 10:14:03 INFO minerva_utils.py:482 {u'ping -c 1 -w 15 10.30.15.240': (0, 'PING 10.30.15.240


(10.30.15.240) 56(84) bytes of data.\n64 bytes from 10.30.15.240: icmp_seq=1 ttl=64 time=0.331
ms\n\n--- 10.30.15.240 ping statistics ---\n1 packets transmitted, 1 received, 0% packet loss, time
0ms\nrtt min/avg/max/mdev = 0.331/0.331/0.331/0.000 ms\n', '')}
2017-05-30 10:14:03 INFO minerva_utils.py:492 Pinging 10.30.15.240 PING 10.30.15.240 (10.30.15.240)
56(84) bytes of data.

64 bytes from 10.30.15.240: icmp_seq=1 ttl=64 time=0.331 ms

--- 10.30.15.240 ping statistics ---

1 packets transmitted, 1 received, 0% packet loss, time 0ms

rtt min/avg/max/mdev = 0.331/0.331/0.331/0.000 ms

2017-05-30 10:14:03 INFO minerva_utils.py:500 Pinged ip list [u'10.30.15.240'], Non pinged ip list []

2017-05-30 10:14:03 INFO file_server.py:2621 [] [u'NTNX-afs01-2', u'NTNX-afs01-3', u'NTNX-afs01-1'] []


for FileServerVm NTNX-afs01-1 ip 10.30.14.245

2017-05-30 10:14:03 INFO file_server.py:2655 [] [u'NTNX-afs01-2', u'NTNX-afs01-3', u'NTNX-afs01-1'] [] for FileServer

2017-05-30 10:14:03 INFO file_server.py:2656 Nvm info dict {u'NTNX-afs01-1': [], u'NTNX-afs01-3': [], u'NTNX-afs01-2': []}

2017-05-30 10:14:03 INFO file_server.py:2681 FileServerVm NTNX-afs01-2,NTNX-afs01-3,NTNX-afs01-1 is out of sync with AD server 10.30.15.91 by more than 300
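Note: the three "ntp offset" values reported above (about 25,238 seconds, roughly seven hours) are compared against a 300-second limit, which matches the usual Kerberos clock-skew tolerance of five minutes; that is why the summary line flags every FSVM as out of sync with the AD server 10.30.15.91. The snippet below only illustrates the comparison being logged, using the offsets from this log; it is not the actual file_server.py implementation.

MAX_SKEW_SECS = 300          # threshold named in the "by more than 300" message above
offsets = {                  # seconds of drift reported per FSVM in this log
    "NTNX-afs01-1": 25238.286112,
    "NTNX-afs01-2": 25238.626737,
    "NTNX-afs01-3": 25238.222605,
}
out_of_sync = [name for name, off in offsets.items() if abs(off) > MAX_SKEW_SECS]
print("out of sync with AD server:", out_of_sync)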

2017-05-30 10:14:03 INFO file_server.py:2720 Creating cluster with ips 10.30.14.245,10.30.14.246,10.30.14.247, virtual ip 10.30.14.244, RF 2

2017-05-30 10:14:03 INFO uvm_utils.py:1144 Fault tolerance: 1, number of ZK nodes: 3

2017-05-30 10:14:03 INFO uvm_utils.py:1146 DNS ip addresses: 10.30.15.91, NTP ip addresses 10.30.15.91

2017-05-30 10:14:03 INFO uvm_utils.py:1150 Configuring Zeus mapping ({u'10.30.14.247': 3, u'10.30.14.246': 2, u'10.30.14.245': 1}) on node 10.30.14.245

2017-05-30 10:14:11 INFO uvm_utils.py:1150 Configuring Zeus mapping ({u'10.30.14.247': 3, u'10.30.14.246': 2, u'10.30.14.245': 1}) on node 10.30.14.246

2017-05-30 10:14:11 INFO uvm_utils.py:1150 Configuring Zeus mapping ({u'10.30.14.247': 3, u'10.30.14.246': 2, u'10.30.14.245': 1}) on node 10.30.14.247

2017-05-30 10:14:15 INFO file_server.py:3733 Bringing up File server services for afs01

2017-05-30 10:14:15 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:18 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:19 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244


2017-05-30 10:14:21 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:22 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:24 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:25 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:27 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:28 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:30 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:31 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:33 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:34 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:36 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:37 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:39 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:40 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:42 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:43 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:45 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:46 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:48 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:49 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:51 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:52 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244


2017-05-30 10:14:54 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:55 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:14:57 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:14:58 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:15:00 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:15:01 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:15:03 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:15:04 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:15:06 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:15:07 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:15:09 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:15:10 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:15:12 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:15:13 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:15:15 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:15:16 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:15:18 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:15:19 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244

2017-05-30 10:15:21 WARNING minerva_utils.py:1040 Cannot get cluster status for 10.30.14.244,
retrying

2017-05-30 10:15:22 INFO minerva_utils.py:1028 waiting for services to come up 10.30.14.244
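Note: the alternating "waiting for services to come up" / "Cannot get cluster status ... retrying" pairs show minerva polling the new FSVM cluster (virtual IP 10.30.14.244) until its services respond; in this log the loop runs for a little over a minute before the workflow continues at 10:15:56. A generic sketch of this poll-with-timeout pattern is shown below; cluster_status_ok is a hypothetical helper used only for illustration, not a Nutanix API.

import time

def wait_for(check, timeout_secs=600, interval_secs=3):
    # Poll check() until it returns True or the timeout expires.
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        if check():
            return True
        time.sleep(interval_secs)   # "Cannot get cluster status ..., retrying"
    return False

# Example (hypothetical): wait_for(lambda: cluster_status_ok("10.30.14.244"))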

2017-05-30 10:15:56 INFO file_server.py:3972 Sending add File server message to FSVM IP: 10.30.14.244

2017-05-30 10:15:56 INFO minerva_utils.py:3487 uuid: "9e2de0ba-3d4c-411f-8f85-621a2f5f6542"


name: "afs01"

nvm_list {

uuid: "ac3c7060-db9e-4297-b6f4-564b1083ef92"

name: "NTNX-afs01-1"

num_vcpus: 4

memory_mb: 12288

local_cvm_ipv4_address: ""

external_ip_list {

ip_address: "10.30.15.241"

virtual_network_uuid: "243cb3ae-8ba5-4fb5-ba8e-5fed88ca108b"

internal_ip_address: "10.30.14.245"

nvm_list {

uuid: "ddbf4012-800d-48e1-a92c-c6ca2c612563"

name: "NTNX-afs01-2"

num_vcpus: 4

memory_mb: 12288

local_cvm_ipv4_address: ""

external_ip_list {

ip_address: "10.30.15.242"

virtual_network_uuid: "243cb3ae-8ba5-4fb5-ba8e-5fed88ca108b"

internal_ip_address: "10.30.14.246"

nvm_list {

uuid: "66386092-223b-48d5-9340-9f25b6338229"

name: "NTNX-afs01-3"

num_vcpus: 4
memory_mb: 12288

local_cvm_ipv4_address: ""

external_ip_list {

ip_address: "10.30.15.243"

virtual_network_uuid: "243cb3ae-8ba5-4fb5-ba8e-5fed88ca108b"

internal_ip_address: "10.30.14.247"

internal_network {

ipv4_address_list: "10.30.14.245"

ipv4_address_list: "10.30.14.246"

ipv4_address_list: "10.30.14.247"

virtual_ipv4_address: "10.30.14.244"

netmask_ipv4_address: "255.255.240.0"

gateway_ipv4_address: "10.30.0.1"

virtual_network_uuid: "243cb3ae-8ba5-4fb5-ba8e-5fed88ca108b"

external_network_list {

ipv4_address_list: "10.30.15.241"

ipv4_address_list: "10.30.15.242"

ipv4_address_list: "10.30.15.243"

netmask_ipv4_address: "255.255.240.0"

gateway_ipv4_address: "10.30.0.1"

virtual_network_uuid: "243cb3ae-8ba5-4fb5-ba8e-5fed88ca108b"

dns_ipv4_address: "10.30.15.91"

ntp_ipv4_address: "10.30.15.91"

container_uuid: "cdc48f84-afcf-4b8d-b53c-7026d62512a0"

join_domain {
realm_name: "learn.nutanix.local"

username: "administrator"

password: "********"

set_spn_dns_only: false

cvm_ipv4_address_list: "10.30.15.47"

cvm_ipv4_address_list: "10.30.15.48"

cvm_ipv4_address_list: "10.30.15.49"

stargate_vip: "10.30.15.240"

size_bytes: 1099511627776

number_of_schedulable_cvms: 3

is_new_container: true

2017-05-30 10:15:56 INFO file_server.py:3978 Add File server (CVM-to-FSVM) IP: 10.30.14.244

2017-05-30 10:15:56 INFO file_server_misc.py:117 Generating random password..

2017-05-30 10:15:56 INFO file_server_misc.py:141 Creating user: b59b7390-8d81-4823-ace8-b396abc27dcc

2017-05-30 10:15:56 INFO file_server_misc.py:149 {'profile': {'username': 'b59b7390-8d81-4823-ace8-b396abc27dcc', 'lastName': u'9e2de0ba-3d4c-411f-8f85-621a2f5f6542', 'password': 'Nutanix/4uczB5CnS6eib2XJL', 'emailId': u'9e2de0ba-3d4c-411f-8f85-621a2f5f6542@nutanix.com', 'firstName': u'9e2de0ba-3d4c-411f-8f85-621a2f5f6542'}, 'enabled': True}

/usr/local/nutanix/minerva/lib/py/requests-2.12.0-
py2.6.egg/requests/packages/urllib3/util/ssl_.py:334: SNIMissingWarning: An HTTPS request has been
made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may
cause the server to present an incorrect TLS certificate, which can cause validation failures. You can
upgrade to a newer version of Python to solve this. For more information, see
https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

/usr/local/nutanix/minerva/lib/py/requests-2.12.0-
py2.6.egg/requests/packages/urllib3/util/ssl_.py:132: InsecurePlatformWarning: A true SSLContext
object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain
SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more
information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

/usr/local/nutanix/minerva/lib/py/requests-2.12.0-
py2.6.egg/requests/packages/urllib3/connectionpool.py:843: InsecureRequestWarning: Unverified
HTTPS request is being made. Adding certificate verification is strongly advised. See:
https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

2017-05-30 10:15:59 INFO file_server_misc.py:171 Adding Roles to the User: b59b7390-8d81-4823-ace8-b396abc27dcc

/usr/local/nutanix/minerva/lib/py/requests-2.12.0-
py2.6.egg/requests/packages/urllib3/connectionpool.py:843: InsecureRequestWarning: Unverified
HTTPS request is being made. Adding certificate verification is strongly advised. See:
https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

2017-05-30 10:16:00 INFO file_server_misc.py:194 Changing fsvm admin password

/usr/local/nutanix/minerva/lib/py/requests-2.12.0-
py2.6.egg/requests/packages/urllib3/connectionpool.py:843: InsecureRequestWarning: Unverified
HTTPS request is being made. Adding certificate verification is strongly advised. See:
https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

2017-05-30 10:16:02 INFO file_server_misc.py:217 Adding credentials in InsightsDB

2017-05-30 10:16:02 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to Zookeeper

2017-05-30 10:16:02 INFO file_server_misc.py:243 user added b59b7390-8d81-4823-ace8-b396abc27dcc

2017-05-30 10:16:03 INFO minerva_task_util.py:885 Created task with task_uuid: 548fe4bc-832b-4ae7-87b2-05d1cb3c39ef

2017-05-30 10:16:03 INFO minerva_task_util.py:1825 polling the tasks 548fe4bc-832b-4ae7-87b2-05d1cb3c39ef

2017-05-30 10:16:03 INFO minerva_task_util.py:1247 tasks: 548fe4bc-832b-4ae7-87b2-05d1cb3c39ef

2017-05-30 10:16:03 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:16:03 INFO disaster_recovery.py:754 Protect file server 9e2de0ba-3d4c-411f-8f85-621a2f5f6542

2017-05-30 10:16:03 INFO disaster_recovery.py:773 fs_spec {

uuid: "9e2de0ba-3d4c-411f-8f85-621a2f5f6542"

name: "afs01"

nvm_list {

uuid: "ac3c7060-db9e-4297-b6f4-564b1083ef92"

name: "NTNX-afs01-1"

local_cvm_ipv4_address: ""
internal_ip_address: "10.30.14.245"

nvm_list {

uuid: "ddbf4012-800d-48e1-a92c-c6ca2c612563"

name: "NTNX-afs01-2"

local_cvm_ipv4_address: ""

internal_ip_address: "10.30.14.246"

nvm_list {

uuid: "66386092-223b-48d5-9340-9f25b6338229"

name: "NTNX-afs01-3"

local_cvm_ipv4_address: ""

internal_ip_address: "10.30.14.247"

internal_network {

virtual_ipv4_address: "10.30.14.244"

netmask_ipv4_address: "255.255.240.0"

gateway_ipv4_address: "10.30.0.1"

virtual_network_uuid: "243cb3ae-8ba5-4fb5-ba8e-5fed88ca108b"

external_network_list {

netmask_ipv4_address: "255.255.240.0"

gateway_ipv4_address: "10.30.0.1"

virtual_network_uuid: "243cb3ae-8ba5-4fb5-ba8e-5fed88ca108b"

dns_ipv4_address: "10.30.15.91"

ntp_ipv4_address: "10.30.15.91"

container_uuid: "cdc48f84-afcf-4b8d-b53c-7026d62512a0"

cvm_ipv4_address_list: "10.30.15.47"
cvm_ipv4_address_list: "10.30.15.48"

cvm_ipv4_address_list: "10.30.15.49"

stargate_vip: "10.30.15.240"

afs_version: "2.1.0-419aa3a83df5548924198f85398deb20e8b615fe\n"

nvm_uuid_vm_uuid_list {

nvm_uuid: "3a33d6ff-98f8-454f-b16e-9433d9ba4980"

vm_uuid: "ac3c7060-db9e-4297-b6f4-564b1083ef92"

nvm_uuid_vm_uuid_list {

nvm_uuid: "aa391c08-39fd-4f67-bcd1-7338d2b4fecb"

vm_uuid: "ddbf4012-800d-48e1-a92c-c6ca2c612563"

nvm_uuid_vm_uuid_list {

nvm_uuid: "b94bbb55-b07f-4f3b-8fff-36e84e694a57"

vm_uuid: "66386092-223b-48d5-9340-9f25b6338229"

2017-05-30 10:16:03 WARNING cerebro_interface_client.py:111 Cerebro RPC returned error 7.

2017-05-30 10:16:03 INFO disaster_recovery.py:162 PD NTNX-afs01 does not exists

2017-05-30 10:16:03 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to Zookeeper

2017-05-30 10:16:03 INFO disaster_recovery.py:308 CG NTNX-afs01-NVMS successfully created on PD NTNX-afs01

2017-05-30 10:16:03 INFO disaster_recovery.py:562 ['\xac<p`\xdb\x9eB\x97\xb6\xf4VK\x10\x83\xef\x92', '\xdd\xbf@\x12\x80\rH\xe1\xa9,\xc6\xca,a%c', 'f8`\x92";H\xd5\x93@\x9f%\xb63\x82)']

2017-05-30 10:16:03 INFO disaster_recovery.py:570 vm_info_list {

vm_uuid: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

logical_timestamp: 9
config {

name: "NTNX-afs01-1"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

host_uuid: "\266\253\330\330\326\300Jo\214^,\352\332HA\307"

state: kOn

hypervisor_specific_id: "\254<p`\333\236B\227\266\364VK\020\203\357\222"

allow_live_migrate: true

vm_info_list {

vm_uuid: "\335\277@\022\200\rH\341\251,\306\312,a%c"

logical_timestamp: 9

config {

name: "NTNX-afs01-2"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

}
hypervisor {

hypervisor_type: kKvm

host_uuid: "v$\262(\336;CW\252\322\021\261<%l\037"

state: kOn

hypervisor_specific_id: "\335\277@\022\200\rH\341\251,\306\312,a%c"

allow_live_migrate: true

vm_info_list {

vm_uuid: "f8`\222\";H\325\223@\237%\2663\202)"

logical_timestamp: 9

config {

name: "NTNX-afs01-3"

num_vcpus: 4

num_cores_per_vcpu: 1

memory_size_mb: 12288

hwclock_timezone: "UTC"

ha_priority: 100

agent_vm: false

hypervisor {

hypervisor_type: kKvm

host_uuid: "o\225\352\224V6J\327\203\250\315\014T\025\325\363"

state: kOn

hypervisor_specific_id: "f8`\222\";H\325\223@\237%\2663\202)"

allow_live_migrate: true

}
2017-05-30 10:16:03 INFO disaster_recovery.py:614 Updated the consistency group
protection_domain_name: "NTNX-afs01"

consistency_group_name: "NTNX-afs01-NVMS"

add_vm {

vm_id: "ac3c7060-db9e-4297-b6f4-564b1083ef92"

vm_name: "NTNX-afs01-1"

power_on: false

add_vm {

vm_id: "ddbf4012-800d-48e1-a92c-c6ca2c612563"

vm_name: "NTNX-afs01-2"

power_on: false

add_vm {

vm_id: "66386092-223b-48d5-9340-9f25b6338229"

vm_name: "NTNX-afs01-3"

power_on: false

2017-05-30 10:16:03 INFO disaster_recovery.py:664 Adding nfs files [u'/Nutanix_afs01_ctr/el6-release-euphrates-5.1.0.1-stable-419aa3a83df5548924198f85398deb20e8b615fe', u'/Nutanix_afs01_ctr/afs-software.info'] to PD: NTNX-afs01

2017-05-30 10:16:03 INFO minerva_task_util.py:1859 polling sub tasks 548fe4bc-832b-4ae7-87b2-05d1cb3c39ef completed

2017-05-30 10:16:03 INFO file_server.py:1185 Created PD for Fileserver 9e2de0ba-3d4c-411f-8f85-621a2f5f6542

2017-05-30 10:16:03 INFO minerva_utils.py:1494 Obtained public key for 7624b228-de3b-4357-aad2-11b13c256c1f

2017-05-30 10:16:03 INFO minerva_utils.py:1494 Obtained public key for b6abd8d8-d6c0-4a6f-8c5e-2ceada4841c7

2017-05-30 10:16:03 INFO minerva_utils.py:1494 Obtained public key for 6f95ea94-5636-4ad7-83a8-cd0c5415d5f3
2017-05-30 10:16:03 INFO file_server.py:4230 Adding public keys via REST API call.

2017-05-30 10:16:03 INFO file_server.py:4250 Adding public keys for fsvm 10.30.14.244

/usr/local/nutanix/minerva/lib/py/requests-2.12.0-
py2.6.egg/requests/packages/urllib3/util/ssl_.py:132: InsecurePlatformWarning: A true SSLContext
object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain
SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more
information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

/usr/local/nutanix/minerva/lib/py/requests-2.12.0-
py2.6.egg/requests/packages/urllib3/connectionpool.py:843: InsecureRequestWarning: Unverified
HTTPS request is being made. Adding certificate verification is strongly advised. See:
https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

2017-05-30 10:16:03 INFO file_server.py:4262 file-server: 9e2de0ba-3d4c-411f-8f85-621a2f5f6542

Rest user: b59b7390-8d81-4823-ace8-b396abc27dcc

virtual-ip: 10.30.14.244

response: {"name":"7624b228-de3b-4357-aad2-11b13c256c1f","key":"ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCgvTZfgYRWtpb3+cE/qJwW1K7oGVbQKEgaQrdujnE07bKHecslQ
gCD2VnFJaEzeRZHsX5GC9LDOVrvDKDV6DltgeKMrv1k4yO4xH9nttSZMMPfjgGddHy6pW7Dc/ibU6wl4G/9
VHtjm8+vVbBo3wAEguU/lAR5lrbVkyZ0OT+HxYiVAagCPljWGYFrO7U7/AMjSWC1zqKFgC1q2ye7wFejawiB
86nxuHT6uMbiTxrbzMFL8X3VBZKe5PRrBMiDAjvRmm69ZD2vEUnl2B+YyGDOyNOwvdDdzfjsFCXn5oRRU6
GNybmDXeu9XCy7zna2GwcQwMcn2HHhS71paxPuaY6N nutanix@NTNX-16SM13150152-C-CVM"}

url: https://10.30.14.244:9440/PrismGateway/services/rest/v1/cluster/public_keys

2017-05-30 10:16:03 INFO file_server.py:4265 add-public-keys (CVM-to-FSVM) finished. IP: 10.30.14.244

2017-05-30 10:16:03 INFO file_server.py:4250 Adding public keys for fsvm 10.30.14.244

/usr/local/nutanix/minerva/lib/py/requests-2.12.0-
py2.6.egg/requests/packages/urllib3/connectionpool.py:843: InsecureRequestWarning: Unverified
HTTPS request is being made. Adding certificate verification is strongly advised. See:
https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

2017-05-30 10:16:03 INFO file_server.py:4262 file-server: 9e2de0ba-3d4c-411f-8f85-621a2f5f6542

Rest user: b59b7390-8d81-4823-ace8-b396abc27dcc

virtual-ip: 10.30.14.244

response: {"name":"b6abd8d8-d6c0-4a6f-8c5e-2ceada4841c7","key":"ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQDeSSYYtSwStBLEo6MYx0HLn6eatXKnJNdkBQJoOOiKD3b4dzsLLy
T0jaSjLcPHhE1m1KBoWtGyhRT8xzm76YRksU+6fvB3h/mFHnAj0hme7n1TYPr224z34yUZLuwYlp/wQhArxZ
YODo/1wZrGA1crfIvymYMEw52JBlFiJu6QMC6MfF9RHfxFeu1b9vj8aqrZlfhWqPkkAIGErAEpYsbPlH0t7PM
oBTRSkYmM73UCs0xIAGzIn+MK0hCcYYK6oGRLPtJGe7S6beyZtxp/xTHoHVJR6SC1ub5nGnR723O7/AwbC
qf5dWqjoWoXCxah7Jc3FOPyvk5ROLfRrOfD14at nutanix@NTNX-16SM13150152-B-CVM"}

url: https://10.30.14.244:9440/PrismGateway/services/rest/v1/cluster/public_keys

2017-05-30 10:16:03 INFO file_server.py:4265 add-public-keys (CVM-to-FSVM) finished. IP: 10.30.14.244

2017-05-30 10:16:03 INFO file_server.py:4250 Adding public keys for fsvm 10.30.14.244

/usr/local/nutanix/minerva/lib/py/requests-2.12.0-
py2.6.egg/requests/packages/urllib3/connectionpool.py:843: InsecureRequestWarning: Unverified
HTTPS request is being made. Adding certificate verification is strongly advised. See:
https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

2017-05-30 10:16:03 INFO file_server.py:4262 file-server: 9e2de0ba-3d4c-411f-8f85-621a2f5f6542

Rest user: b59b7390-8d81-4823-ace8-b396abc27dcc

virtual-ip: 10.30.14.244

response: {"name":"6f95ea94-5636-4ad7-83a8-cd0c5415d5f3","key":"ssh-rsa
AAAAB3NzaC1yc2EAAAADAQABAAABAQCiIkYVSgNUKAGslHBDu1QswH66JA/X5tPThr4k96SQy9LrKnNUSc
jaUOIg5H0iglpEqEnOxC8CtEX5XUybT82GU2l9PUxa3uwd2cMWKqcYeg1w3WIs+x9vO1G6QOTPVGLVM07
7uyVlVs2CVO7uuDC3KTEQaszfi7NHMIQKig/9w3KCMLD44c/zj4FqIOuKezyhMjCIjITOASj6/yRLlF8wS1EY08
OmrXUyFugv1ORVndOmTaQOMsYPd2XGm/jLIPCfqVtIGt7Trs8XT+kh2d8uSvtOopKnJ4Ej+s2SL0c1xN6Wo6
A/LigTYCEaHVpeFyMZddmrdCojimTArWO/f3VD nutanix@NTNX-16SM13150152-A-CVM"}

url: https://10.30.14.244:9440/PrismGateway/services/rest/v1/cluster/public_keys

2017-05-30 10:16:03 INFO file_server.py:4265 add-public-keys (CVM-to-FSVM) finished. IP: 10.30.14.244
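Note: the three "Adding public keys" exchanges above show each CVM's SSH public key being pushed to the FSVM Prism gateway at https://10.30.14.244:9440/PrismGateway/services/rest/v1/cluster/public_keys, using the temporary REST user created earlier and with certificate verification disabled (hence the InsecureRequestWarning lines). The sketch below mirrors that call for illustration only; the use of POST and the name/key payload shape are inferred from the echoed responses in the log, and the credentials and key string are placeholders.

import requests

url = "https://10.30.14.244:9440/PrismGateway/services/rest/v1/cluster/public_keys"
body = {
    "name": "7624b228-de3b-4357-aad2-11b13c256c1f",   # key name from the log
    "key": "ssh-rsa AAAA... nutanix@CVM",             # placeholder public key
}
# verify=False matches the unverified HTTPS calls recorded above.
resp = requests.post(url, json=body, auth=("<rest-user>", "<password>"), verify=False)
print(resp.status_code, resp.text)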

2017-05-30 10:16:03 INFO nvm_rpc_client.py:122 Sending rpc with arg file_server_info_needed: true

svm_uuids_needed: true

all_volume_group_set_needed: true

nvm_uuid_to_vm_uuid_needed: true

2017-05-30 10:16:03 INFO file_server.py:4799 creating a JoinDomain task

2017-05-30 10:16:03 INFO minerva_task_util.py:885 Created task with task_uuid: 5993a6f8-5b99-41f3-8c0c-a8197b368938

2017-05-30 10:16:03 INFO minerva_task_util.py:1825 polling the tasks 5993a6f8-5b99-41f3-8c0c-a8197b368938

2017-05-30 10:16:03 INFO minerva_task_util.py:1247 tasks: 5993a6f8-5b99-41f3-8c0c-a8197b368938

2017-05-30 10:16:03 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:16:03 INFO minerva_task_util.py:145 JoinDomainTask [Seq Id = 6] : Fetching the current state.

2017-05-30 10:16:03 INFO minerva_task_util.py:1579 Task current state: 200

2017-05-30 10:16:03 INFO minerva_task_util.py:123 Building and validating JoinDomainTask arguments

2017-05-30 10:16:03 INFO domain_task.py:92 Fetched Wal

2017-05-30 10:16:03 INFO minerva_utils.py:3476 status: kJoinDomainNvmRestCall

task_header {

file_server_uuid: "9e2de0ba-3d4c-411f-8f85-621a2f5f6542"

virtual_ip: "10.30.14.244"

file_server_name: "afs01"

2017-05-30 10:16:03 INFO disaster_recovery.py:1220 PD: NTNX-afs01 schedule status: False, suspend status: False

2017-05-30 10:16:03 INFO disaster_recovery.py:1224 PD: NTNX-afs01 schedule suspend intent updated to False

2017-05-30 10:16:03 INFO minerva_task_util.py:1505 Updated wal with status 200

2017-05-30 10:16:03 INFO minerva_utils.py:3487 file_server_uuid: "9e2de0ba-3d4c-411f-8f85-621a2f5f6542"

realm_name: "learn.nutanix.local"

organizational_unit: ""

username: "administrator"

password: "********"

set_spn_dns_only: false

validate_credential_only: false

nvm_only: true

preferred_domain_controller: ""
2017-05-30 10:16:03 INFO file_server.py:4139 Joining domain learn.nutanix.local for file-server 9e2de0ba-3d4c-411f-8f85-621a2f5f6542

2017-05-30 10:16:03 INFO file_server.py:4157 {"nvmOnly": true, "windowsAdPassword": "********", "organizationalUnit": "", "spnDnsOnly": false, "windowsAdUsername": "administrator", "validateAdCredential": false, "overwriteUserAccount": false, "preferredDomainController": "", "windowsAdDomainName": "learn.nutanix.local"}

/usr/local/nutanix/minerva/lib/py/requests-2.12.0-
py2.6.egg/requests/packages/urllib3/util/ssl_.py:132: InsecurePlatformWarning: A true SSLContext
object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain
SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more
information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

/usr/local/nutanix/minerva/lib/py/requests-2.12.0-
py2.6.egg/requests/packages/urllib3/connectionpool.py:843: InsecureRequestWarning: Unverified
HTTPS request is being made. Adding certificate verification is strongly advised. See:
https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings

2017-05-30 10:16:05 INFO file_server.py:4171 file-server: 9e2de0ba-3d4c-411f-8f85-621a2f5f6542

Rest user: b59b7390-8d81-4823-ace8-b396abc27dcc

virtual-ip: 10.30.14.244

response: {"taskUuid":"0ab11425-cb3e-4141-ac34-33bb041767b3"}

url: https://10.30.14.244:9440/PrismGateway/services/rest/v1/vfilers/9e2de0ba-3d4c-411f-8f85-621a2f5f6542/joinDomain

2017-05-30 10:16:05 INFO file_server.py:4174 join-domain (CVM-to-FSVM) finished. IP: 10.30.14.244
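Note: the join-domain step is another CVM-to-FSVM REST call; the request body and target URL are both recorded verbatim in the log above, and the FSVM returns a task UUID that minerva then polls. The sketch below simply restates that call in Python for illustration; the POST method and credentials are assumptions, the body fields are copied from the log, and the real AD password is masked.

import requests

fs_uuid = "9e2de0ba-3d4c-411f-8f85-621a2f5f6542"
url = "https://10.30.14.244:9440/PrismGateway/services/rest/v1/vfilers/%s/joinDomain" % fs_uuid
body = {
    "windowsAdDomainName": "learn.nutanix.local",
    "windowsAdUsername": "administrator",
    "windowsAdPassword": "********",          # masked, as in the log
    "organizationalUnit": "",
    "spnDnsOnly": False,
    "nvmOnly": True,
    "validateAdCredential": False,
    "overwriteUserAccount": False,
    "preferredDomainController": "",
}
resp = requests.post(url, json=body, auth=("<rest-user>", "<password>"), verify=False)
print(resp.json())   # e.g. {"taskUuid": "..."} as returned in the log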

2017-05-30 10:16:05 INFO minerva_task_util.py:1505 Updated wal with status 300

2017-05-30 10:16:05 INFO minerva_task_util.py:1247 tasks: 0ab11425-cb3e-4141-ac34-33bb041767b3

2017-05-30 10:16:05 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:16:08 INFO notify.py:103 notification=FileServerAudit file_server_uuid=9e2de0ba-3d4c-411f-8f85-621a2f5f6542 file_server_name=afs01 message=File server afs01 joined domain learn.nutanix.local

2017-05-30 10:16:08 INFO notify.py:84 Alert manager returned: result: true

2017-05-30 10:16:08 INFO file_server.py:4830 Maybe creating a home share.

2017-05-30 10:16:08 INFO file_server.py:4837 Creating home share for fileserver 9e2de0ba-3d4c-411f-8f85-621a2f5f6542

2017-05-30 10:16:08 INFO file_server.py:4839 Args: ergon client:<ergon.client.client.ErgonClient object at 0x77ac?▒▒▒▒s, task_timeout:1496168163

2017-05-30 10:16:08 INFO share_task.py:1176 validating add share arguments share_spec {

name: "home"

file_server_uuid: "9e2de0ba-3d4c-411f-8f85-621a2f5f6542"

is_win_prev_version_enabled: false

container_uuid: "cdc48f84-afcf-4b8d-b53c-7026d62512a0"

description: "Home share created as part of fileserver creation"

share_type: kHomes

quota_policy_list {

principal_type: kUser

principal_name: "_default"

quota_size_bytes: 0

enforcement_type: kSoft

is_abe_enabled: false

2017-05-30 10:16:08 INFO share_task.py:1217 Share size requested is 0 bytes

2017-05-30 10:16:08 INFO nvm_rpc_client.py:122 Sending rpc with arg file_server_info_needed: true

svm_uuids_needed: true

all_volume_group_set_needed: true

nvm_uuid_to_vm_uuid_needed: true

2017-05-30 10:16:08 INFO nvm_rpc_client.py:122 Sending rpc with arg share_name: "home"

2017-05-30 10:16:08 INFO file_server.py:4903 creating a ShareAdd task

2017-05-30 10:16:08 INFO minerva_task_util.py:885 Created task with task_uuid: 808b6894-5da2-4d8c-ace9-ec416e05fd7b

2017-05-30 10:16:08 INFO file_server.py:4913 Polling for completion of home share creation for fs 9e2de0ba-3d4c-411f-8f85-621a2f5f6542

2017-05-30 10:16:08 INFO minerva_task_util.py:1825 polling the tasks 808b6894-5da2-4d8c-ace9-ec416e05fd7b

2017-05-30 10:16:08 INFO minerva_task_util.py:1247 tasks: 808b6894-5da2-4d8c-ace9-ec416e05fd7b

2017-05-30 10:16:08 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:16:08 INFO share_task.py:87 Building and validating ShareAddTask arguments

2017-05-30 10:16:08 INFO share_task.py:94 ShareAddTask: 7 : Fetching the current state.

2017-05-30 10:16:08 INFO share_task.py:96 ShareAddTask: 7 : wal status: kFetchFileServerInfo

task_header {

file_server_uuid: "9e2de0ba-3d4c-411f-8f85-621a2f5f6542"

share_name_list: "home"

file_server_name: "afs01"

2017-05-30 10:16:08 INFO minerva_task_util.py:1579 Task current state: 200

2017-05-30 10:16:08 INFO share_task.py:173 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-8f85-621a2f5f6542 Shareuuid: (ShareName: home) : virtual ip 10.30.14.244

2017-05-30 10:16:08 INFO share_task.py:179 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-8f85-621a2f5f6542 Shareuuid: (ShareName: home) : Getting File server info

2017-05-30 10:16:08 INFO nvm_rpc_client.py:122 Sending rpc with arg svm_uuids_needed: true

container_uuid_for_vg_create: "cdc48f84-afcf-4b8d-b53c-7026d62512a0"

vg_info_by_container_to_create: true

nvm_uuid_to_vm_uuid_needed: true

share_type: kHomes

2017-05-30 10:16:08 INFO share_task.py:184 svm_uuid_vm_uuid_list = <google.protobuf.internal.cpp._message.RepeatedCompositeContainer object at 0x7f706c212588>

2017-05-30 10:16:08 INFO share_task.py:185 vgs_spec =

2017-05-30 10:16:08 INFO share_task.py:192 svm_uuids = [u'3a33d6ff-98f8-454f-b16e-9433d9ba4980', u'aa391c08-39fd-4f67-bcd1-7338d2b4fecb', u'b94bbb55-b07f-4f3b-8fff-36e84e694a57']

2017-05-30 10:16:08 INFO share_task.py:196 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-8f85-621a2f5f6542 Shareuuid: (ShareName: home) : Updating wal with svm uuids and virtual ip

2017-05-30 10:16:08 INFO disaster_recovery.py:1220 PD: NTNX-afs01 schedule status: False, suspend status: False

2017-05-30 10:16:08 INFO disaster_recovery.py:1224 PD: NTNX-afs01 schedule suspend intent updated to False

2017-05-30 10:16:08 INFO minerva_task_util.py:1505 ShareAddTask: 7 : Updated wal with status 300

2017-05-30 10:16:08 INFO share_task.py:240 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-8f85-621a2f5f6542 Shareuuid: (ShareName: home) : Retrieving volume group set

2017-05-30 10:16:08 INFO share_task.py:1370 iqn.2010-06.com.nutanix:3a33d6ff-98f8-454f-b16e-9433d9ba4980 iqn.2010-06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb iqn.2010-06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:08 INFO share_task.py:247 nvm_iqn_list = [u'iqn.2010-06.com.nutanix:3a33d6ff-98f8-454f-b16e-9433d9ba4980', u'iqn.2010-06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb', u'iqn.2010-06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57']

2017-05-30 10:16:08 INFO share_task.py:1424 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-8f85-621a2f5f6542 Shareuuid: (ShareName: home) : No. of volume groups to create = 3 for volume group set 6437b346-2fac-4a97-b928-9894e3a386fb

2017-05-30 10:16:08 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to Zookeeper

2017-05-30 10:16:08 INFO storage_manager.py:132 *** ctr repl factor 2

2017-05-30 10:16:08 INFO share_task.py:1440 repl_factor = 2 for vgs 6437b346-2fac-4a97-b928-9894e3a386fb

2017-05-30 10:16:08 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:08 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:08 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to Zookeeper

2017-05-30 10:16:08 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:08 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:08 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-4a97-b928-9894e3a386fb-548d055f-5b54-481d-b22d-253172493e85 with 4 vdisks, with vdisk size of 10995116277760 bytes

2017-05-30 10:16:08 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-afs01-6437b346-2fac-4a97-b928-9894e3a386fb-548d055f-5b54-481d-b22d-253172493e85,uuid=548d055f-5b54-481d-b22d-253172493e85

2017-05-30 10:16:08 INFO storage_manager.py:411 Retrieving the configured volume group 548d055f-5b54-481d-b22d-253172493e85

2017-05-30 10:16:08 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:08 INFO storage_manager.py:411 Retrieving the configured volume group 548d055f-5b54-481d-b22d-253172493e85

2017-05-30 10:16:09 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:09 INFO storage_manager.py:411 Retrieving the configured volume group 548d055f-5b54-481d-b22d-253172493e85

2017-05-30 10:16:09 INFO storage_manager.py:316 Added Disk list [<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c2151d0>, <acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215250>]

2017-05-30 10:16:09 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:09 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:09 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:09 INFO storage_manager.py:623 Max possible target for vg_uuid 548d055f-5b54-481d-b22d-253172493e85, number 6

2017-05-30 10:16:09 INFO storage_manager.py:633 Attaching initiator iqn.2010-06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:09 INFO storage_manager.py:633 Attaching initiator iqn.2010-06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:09 INFO storage_manager.py:641 Retrieving the configured volume group 548d055f-5b54-481d-b22d-253172493e85

2017-05-30 10:16:09 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:09 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:09 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to Zookeeper

2017-05-30 10:16:09 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:09 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:09 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-4a97-b928-9894e3a386fb-b822259b-a259-47df-a6f1-093f02c6ac8a with 4 vdisks, with vdisk size of 10995116277760 bytes

2017-05-30 10:16:09 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-afs01-6437b346-2fac-4a97-b928-9894e3a386fb-b822259b-a259-47df-a6f1-093f02c6ac8a,uuid=b822259b-a259-47df-a6f1-093f02c6ac8a

2017-05-30 10:16:09 INFO storage_manager.py:411 Retrieving the configured volume group b822259b-a259-47df-a6f1-093f02c6ac8a

2017-05-30 10:16:09 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:09 INFO storage_manager.py:411 Retrieving the configured volume group b822259b-
a259-47df-a6f1-093f02c6ac8a

2017-05-30 10:16:09 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:09 INFO storage_manager.py:411 Retrieving the configured volume group b822259b-
a259-47df-a6f1-093f02c6ac8a

2017-05-30 10:16:09 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c2153d0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215450>]

2017-05-30 10:16:09 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:09 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:09 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:09 INFO storage_manager.py:623 Max possible target for vg_uuid b822259b-a259-
47df-a6f1-093f02c6ac8a, number 6

2017-05-30 10:16:09 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:09 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:09 INFO storage_manager.py:641 Retrieving the configured volume group b822259b-
a259-47df-a6f1-093f02c6ac8a

2017-05-30 10:16:09 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:09 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:09 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:09 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:09 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:09 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-a42e4bcf-3a13-4178-8e21-19829e58700f with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:10 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-a42e4bcf-3a13-4178-8e21-
19829e58700f,uuid=a42e4bcf-3a13-4178-8e21-19829e58700f
2017-05-30 10:16:10 INFO storage_manager.py:411 Retrieving the configured volume group a42e4bcf-
3a13-4178-8e21-19829e58700f

2017-05-30 10:16:10 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:10 INFO storage_manager.py:411 Retrieving the configured volume group a42e4bcf-
3a13-4178-8e21-19829e58700f

2017-05-30 10:16:10 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:10 INFO storage_manager.py:411 Retrieving the configured volume group a42e4bcf-
3a13-4178-8e21-19829e58700f

2017-05-30 10:16:10 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c2154d0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215550>]

2017-05-30 10:16:10 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:10 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:10 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:10 INFO storage_manager.py:623 Max possible target for vg_uuid a42e4bcf-3a13-
4178-8e21-19829e58700f, number 6

2017-05-30 10:16:10 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:10 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:10 INFO storage_manager.py:641 Retrieving the configured volume group a42e4bcf-
3a13-4178-8e21-19829e58700f

2017-05-30 10:16:10 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:10 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:10 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:10 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:10 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:10 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-52be82b1-88ad-4a49-b00b-af4e1d4236e3 with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:10 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-52be82b1-88ad-4a49-b00b-
af4e1d4236e3,uuid=52be82b1-88ad-4a49-b00b-af4e1d4236e3
2017-05-30 10:16:10 INFO storage_manager.py:411 Retrieving the configured volume group 52be82b1-
88ad-4a49-b00b-af4e1d4236e3

2017-05-30 10:16:10 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:10 INFO storage_manager.py:411 Retrieving the configured volume group 52be82b1-
88ad-4a49-b00b-af4e1d4236e3

2017-05-30 10:16:10 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:10 INFO storage_manager.py:411 Retrieving the configured volume group 52be82b1-
88ad-4a49-b00b-af4e1d4236e3

2017-05-30 10:16:10 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c2155d0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215650>]

2017-05-30 10:16:10 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:10 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:10 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:10 INFO storage_manager.py:623 Max possible target for vg_uuid 52be82b1-88ad-
4a49-b00b-af4e1d4236e3, number 6

2017-05-30 10:16:10 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:10 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:10 INFO storage_manager.py:641 Retrieving the configured volume group 52be82b1-
88ad-4a49-b00b-af4e1d4236e3

2017-05-30 10:16:10 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:10 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:10 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:10 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:10 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:10 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-11bbfd7d-8882-4237-9b44-b444bfdb80a0 with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:11 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-11bbfd7d-8882-4237-9b44-
b444bfdb80a0,uuid=11bbfd7d-8882-4237-9b44-b444bfdb80a0
2017-05-30 10:16:11 INFO storage_manager.py:411 Retrieving the configured volume group 11bbfd7d-
8882-4237-9b44-b444bfdb80a0

2017-05-30 10:16:11 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:11 INFO storage_manager.py:411 Retrieving the configured volume group 11bbfd7d-
8882-4237-9b44-b444bfdb80a0

2017-05-30 10:16:11 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:11 INFO storage_manager.py:411 Retrieving the configured volume group 11bbfd7d-
8882-4237-9b44-b444bfdb80a0

2017-05-30 10:16:11 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c2156d0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215750>]

2017-05-30 10:16:11 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:11 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:11 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:11 INFO storage_manager.py:623 Max possible target for vg_uuid 11bbfd7d-8882-
4237-9b44-b444bfdb80a0, number 6

2017-05-30 10:16:11 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:11 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:11 INFO storage_manager.py:641 Retrieving the configured volume group 11bbfd7d-
8882-4237-9b44-b444bfdb80a0

2017-05-30 10:16:11 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:11 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:11 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:11 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:11 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:11 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-dd70c208-12aa-4310-ac9d-673bdac07259 with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:11 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-dd70c208-12aa-4310-ac9d-
673bdac07259,uuid=dd70c208-12aa-4310-ac9d-673bdac07259
2017-05-30 10:16:11 INFO storage_manager.py:411 Retrieving the configured volume group dd70c208-
12aa-4310-ac9d-673bdac07259

2017-05-30 10:16:11 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:11 INFO storage_manager.py:411 Retrieving the configured volume group dd70c208-
12aa-4310-ac9d-673bdac07259

2017-05-30 10:16:11 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:11 INFO storage_manager.py:411 Retrieving the configured volume group dd70c208-
12aa-4310-ac9d-673bdac07259

2017-05-30 10:16:11 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c2157d0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215850>]

2017-05-30 10:16:11 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:11 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:11 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:11 INFO storage_manager.py:623 Max possible target for vg_uuid dd70c208-12aa-
4310-ac9d-673bdac07259, number 6

2017-05-30 10:16:11 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:11 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:11 INFO storage_manager.py:641 Retrieving the configured volume group dd70c208-
12aa-4310-ac9d-673bdac07259

2017-05-30 10:16:11 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:11 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:11 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:11 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:11 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:11 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-c6681eb9-8a75-4cf9-a2e4-b378e2ced1b6 with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:12 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-c6681eb9-8a75-4cf9-a2e4-
b378e2ced1b6,uuid=c6681eb9-8a75-4cf9-a2e4-b378e2ced1b6
2017-05-30 10:16:12 INFO storage_manager.py:411 Retrieving the configured volume group c6681eb9-
8a75-4cf9-a2e4-b378e2ced1b6

2017-05-30 10:16:12 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:12 INFO storage_manager.py:411 Retrieving the configured volume group c6681eb9-
8a75-4cf9-a2e4-b378e2ced1b6

2017-05-30 10:16:12 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:12 INFO storage_manager.py:411 Retrieving the configured volume group c6681eb9-
8a75-4cf9-a2e4-b378e2ced1b6

2017-05-30 10:16:12 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c2158d0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215950>]

2017-05-30 10:16:12 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:12 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:12 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:12 INFO storage_manager.py:623 Max possible target for vg_uuid c6681eb9-8a75-
4cf9-a2e4-b378e2ced1b6, number 6

2017-05-30 10:16:12 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:12 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:12 INFO storage_manager.py:641 Retrieving the configured volume group c6681eb9-
8a75-4cf9-a2e4-b378e2ced1b6

2017-05-30 10:16:12 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:12 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:12 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:12 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:12 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:12 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-732ed7e0-f44f-400b-9872-7fda49571778 with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:12 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-732ed7e0-f44f-400b-9872-
7fda49571778,uuid=732ed7e0-f44f-400b-9872-7fda49571778
2017-05-30 10:16:12 INFO storage_manager.py:411 Retrieving the configured volume group 732ed7e0-
f44f-400b-9872-7fda49571778

2017-05-30 10:16:12 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:12 INFO storage_manager.py:411 Retrieving the configured volume group 732ed7e0-
f44f-400b-9872-7fda49571778

2017-05-30 10:16:13 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:13 INFO storage_manager.py:411 Retrieving the configured volume group 732ed7e0-
f44f-400b-9872-7fda49571778

2017-05-30 10:16:13 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c2159d0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215a50>]

2017-05-30 10:16:13 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:13 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:13 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:13 INFO storage_manager.py:623 Max possible target for vg_uuid 732ed7e0-f44f-
400b-9872-7fda49571778, number 6

2017-05-30 10:16:13 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:13 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:13 INFO storage_manager.py:641 Retrieving the configured volume group 732ed7e0-
f44f-400b-9872-7fda49571778

2017-05-30 10:16:13 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:13 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:13 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:13 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:13 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:13 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-953300f2-155e-4328-a8f6-ea1e522a7ccd with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:13 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-953300f2-155e-4328-a8f6-
ea1e522a7ccd,uuid=953300f2-155e-4328-a8f6-ea1e522a7ccd
2017-05-30 10:16:13 INFO storage_manager.py:411 Retrieving the configured volume group 953300f2-
155e-4328-a8f6-ea1e522a7ccd

2017-05-30 10:16:13 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:13 INFO storage_manager.py:411 Retrieving the configured volume group 953300f2-
155e-4328-a8f6-ea1e522a7ccd

2017-05-30 10:16:13 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:13 INFO storage_manager.py:411 Retrieving the configured volume group 953300f2-
155e-4328-a8f6-ea1e522a7ccd

2017-05-30 10:16:13 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215ad0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215b50>]

2017-05-30 10:16:13 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:13 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:13 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:13 INFO storage_manager.py:623 Max possible target for vg_uuid 953300f2-155e-
4328-a8f6-ea1e522a7ccd, number 6

2017-05-30 10:16:13 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:13 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:13 INFO storage_manager.py:641 Retrieving the configured volume group 953300f2-
155e-4328-a8f6-ea1e522a7ccd

2017-05-30 10:16:13 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:13 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:13 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:13 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:13 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:13 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-4c070990-4eae-497f-897e-4a63828a4138 with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:13 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-4c070990-4eae-497f-897e-
4a63828a4138,uuid=4c070990-4eae-497f-897e-4a63828a4138
2017-05-30 10:16:13 INFO storage_manager.py:411 Retrieving the configured volume group 4c070990-
4eae-497f-897e-4a63828a4138

2017-05-30 10:16:13 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:13 INFO storage_manager.py:411 Retrieving the configured volume group 4c070990-
4eae-497f-897e-4a63828a4138

2017-05-30 10:16:14 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:14 INFO storage_manager.py:411 Retrieving the configured volume group 4c070990-
4eae-497f-897e-4a63828a4138

2017-05-30 10:16:14 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215bd0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215c50>]

2017-05-30 10:16:14 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:14 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:14 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:14 INFO storage_manager.py:623 Max possible target for vg_uuid 4c070990-4eae-
497f-897e-4a63828a4138, number 6

2017-05-30 10:16:14 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:14 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:14 INFO storage_manager.py:641 Retrieving the configured volume group 4c070990-
4eae-497f-897e-4a63828a4138

2017-05-30 10:16:14 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:14 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:14 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:14 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:14 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:14 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-a038f292-c981-4fe5-8d20-b44382f23feb with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:14 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-a038f292-c981-4fe5-8d20-
b44382f23feb,uuid=a038f292-c981-4fe5-8d20-b44382f23feb
2017-05-30 10:16:14 INFO storage_manager.py:411 Retrieving the configured volume group a038f292-
c981-4fe5-8d20-b44382f23feb

2017-05-30 10:16:14 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:14 INFO storage_manager.py:411 Retrieving the configured volume group a038f292-
c981-4fe5-8d20-b44382f23feb

2017-05-30 10:16:14 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:14 INFO storage_manager.py:411 Retrieving the configured volume group a038f292-
c981-4fe5-8d20-b44382f23feb

2017-05-30 10:16:14 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215cd0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215d50>]

2017-05-30 10:16:14 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:14 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:14 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:14 INFO storage_manager.py:623 Max possible target for vg_uuid a038f292-c981-
4fe5-8d20-b44382f23feb, number 6

2017-05-30 10:16:14 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:14 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:14 INFO storage_manager.py:641 Retrieving the configured volume group a038f292-
c981-4fe5-8d20-b44382f23feb

2017-05-30 10:16:14 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:14 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:14 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:14 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:14 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:14 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-b466e86d-0837-4422-8ad8-94263440842a with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:14 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-b466e86d-0837-4422-8ad8-
94263440842a,uuid=b466e86d-0837-4422-8ad8-94263440842a
2017-05-30 10:16:14 INFO storage_manager.py:411 Retrieving the configured volume group b466e86d-
0837-4422-8ad8-94263440842a

2017-05-30 10:16:14 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:14 INFO storage_manager.py:411 Retrieving the configured volume group b466e86d-
0837-4422-8ad8-94263440842a

2017-05-30 10:16:15 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:15 INFO storage_manager.py:411 Retrieving the configured volume group b466e86d-
0837-4422-8ad8-94263440842a

2017-05-30 10:16:15 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215dd0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215e50>]

2017-05-30 10:16:15 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:15 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:15 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:15 INFO storage_manager.py:623 Max possible target for vg_uuid b466e86d-0837-
4422-8ad8-94263440842a, number 6

2017-05-30 10:16:15 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:15 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:15 INFO storage_manager.py:641 Retrieving the configured volume group b466e86d-
0837-4422-8ad8-94263440842a

2017-05-30 10:16:15 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:15 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:15 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:15 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:15 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:15 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-8f2e1f4e-6c6c-4f8a-8d3a-f04d51de545c with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:15 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-8f2e1f4e-6c6c-4f8a-8d3a-
f04d51de545c,uuid=8f2e1f4e-6c6c-4f8a-8d3a-f04d51de545c
2017-05-30 10:16:15 INFO storage_manager.py:411 Retrieving the configured volume group 8f2e1f4e-
6c6c-4f8a-8d3a-f04d51de545c

2017-05-30 10:16:15 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:15 INFO storage_manager.py:411 Retrieving the configured volume group 8f2e1f4e-
6c6c-4f8a-8d3a-f04d51de545c

2017-05-30 10:16:15 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:15 INFO storage_manager.py:411 Retrieving the configured volume group 8f2e1f4e-
6c6c-4f8a-8d3a-f04d51de545c

2017-05-30 10:16:15 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215ed0>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c215f50>]

2017-05-30 10:16:15 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:15 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:15 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:15 INFO storage_manager.py:623 Max possible target for vg_uuid 8f2e1f4e-6c6c-
4f8a-8d3a-f04d51de545c, number 6

2017-05-30 10:16:15 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:15 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:15 INFO storage_manager.py:641 Retrieving the configured volume group 8f2e1f4e-
6c6c-4f8a-8d3a-f04d51de545c

2017-05-30 10:16:15 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:15 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:15 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:15 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:15 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:15 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-c5f1bc91-4990-4cd7-ad4b-c2e7e4393c93 with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:16 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-c5f1bc91-4990-4cd7-ad4b-
c2e7e4393c93,uuid=c5f1bc91-4990-4cd7-ad4b-c2e7e4393c93
2017-05-30 10:16:16 INFO storage_manager.py:411 Retrieving the configured volume group c5f1bc91-
4990-4cd7-ad4b-c2e7e4393c93

2017-05-30 10:16:16 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:16 INFO storage_manager.py:411 Retrieving the configured volume group c5f1bc91-
4990-4cd7-ad4b-c2e7e4393c93

2017-05-30 10:16:16 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:16 INFO storage_manager.py:411 Retrieving the configured volume group c5f1bc91-
4990-4cd7-ad4b-c2e7e4393c93

2017-05-30 10:16:16 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c216050>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c2160d0>]

2017-05-30 10:16:16 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:16 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:16 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:16 INFO storage_manager.py:623 Max possible target for vg_uuid c5f1bc91-4990-
4cd7-ad4b-c2e7e4393c93, number 6

2017-05-30 10:16:16 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:16 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:16 INFO storage_manager.py:641 Retrieving the configured volume group c5f1bc91-
4990-4cd7-ad4b-c2e7e4393c93

2017-05-30 10:16:16 INFO storage_manager.py:529 Get a client to the local acropolis server

2017-05-30 10:16:16 INFO storage_manager.py:540 Checking if the volume groups are already present

2017-05-30 10:16:16 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:16 INFO storage_manager.py:103 ctr uuid: cdc48f84-afcf-4b8d-b53c-7026d62512a0

2017-05-30 10:16:16 INFO storage_manager.py:107 ctr id: 12406

2017-05-30 10:16:16 INFO storage_manager.py:571 Creating volume group NTNX-afs01-6437b346-2fac-


4a97-b928-9894e3a386fb-e3dee4d6-b939-4c14-af72-c19d2be0882d with 4 vdisks, with vdisk size of
10995116277760 bytes

2017-05-30 10:16:16 INFO storage_manager.py:575 Successfully created volume group:name=NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-e3dee4d6-b939-4c14-af72-
c19d2be0882d,uuid=e3dee4d6-b939-4c14-af72-c19d2be0882d
2017-05-30 10:16:16 INFO storage_manager.py:411 Retrieving the configured volume group e3dee4d6-
b939-4c14-af72-c19d2be0882d

2017-05-30 10:16:16 INFO storage_manager.py:282 creating vdisks and attaching to volume group

2017-05-30 10:16:16 INFO storage_manager.py:411 Retrieving the configured volume group e3dee4d6-
b939-4c14-af72-c19d2be0882d

2017-05-30 10:16:16 INFO storage_manager.py:307 Volume Disk Created and attached to VG.

2017-05-30 10:16:16 INFO storage_manager.py:411 Retrieving the configured volume group e3dee4d6-
b939-4c14-af72-c19d2be0882d

2017-05-30 10:16:16 INFO storage_manager.py:316 Added Disk list


[<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c216150>,
<acropolis.acropolis_types_pb2.VolumeDiskConfig object at 0x7f706c2161d0>]

2017-05-30 10:16:16 INFO storage_manager.py:184 Added VDisks for SSD to the volume group

2017-05-30 10:16:16 INFO storage_manager.py:187 Not pinning Vdisks to SSD

2017-05-30 10:16:16 INFO storage_manager.py:595 Not creating any Vdisk to pin to SSD

2017-05-30 10:16:16 INFO storage_manager.py:623 Max possible target for vg_uuid e3dee4d6-b939-
4c14-af72-c19d2be0882d, number 6

2017-05-30 10:16:16 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:aa391c08-39fd-4f67-bcd1-7338d2b4fecb

2017-05-30 10:16:16 INFO storage_manager.py:633 Attaching initiator iqn.2010-


06.com.nutanix:b94bbb55-b07f-4f3b-8fff-36e84e694a57

2017-05-30 10:16:16 INFO storage_manager.py:641 Retrieving the configured volume group e3dee4d6-
b939-4c14-af72-c19d2be0882d

2017-05-30 10:16:16 INFO disaster_recovery.py:308 CG NTNX-afs01-6437b346-2fac-4a97-b928-


9894e3a386fb-1 successfully created on PD NTNX-afs01

2017-05-30 10:16:16 INFO disaster_recovery.py:365 VGs [('T\x8d\x05_[TH\x1d\xb2-%1rI>\x85', u'NTNX-


afs01-6437b346-2fac-4a97-b928-9894e3a386fb-548d055f-5b54-481d-b22d-253172493e85'),
('\xb8"%\x9b\xa2YG\xdf\xa6\xf1\t?\x02\xc6\xac\x8a', u'NTNX-afs01-6437b346-2fac-4a97-b928-
9894e3a386fb-b822259b-a259-47df-a6f1-093f02c6ac8a'), ('\xa4.K\xcf:\x13Ax\x8e!\x19\x82\x9eXp\x0f',
u'NTNX-afs01-6437b346-2fac-4a97-b928-9894e3a386fb-a42e4bcf-3a13-4178-8e21-19829e58700f'),
('R\xbe\x82\xb1\x88\xadJI\xb0\x0b\xafN\x1dB6\xe3', u'NTNX-afs01-6437b346-2fac-4a97-b928-
9894e3a386fb-52be82b1-88ad-4a49-b00b-af4e1d4236e3'),
('\x11\xbb\xfd}\x88\x82B7\x9bD\xb4D\xbf\xdb\x80\xa0', u'NTNX-afs01-6437b346-2fac-4a97-b928-
9894e3a386fb-11bbfd7d-8882-4237-9b44-b444bfdb80a0'),
('\xddp\xc2\x08\x12\xaaC\x10\xac\x9dg;\xda\xc0rY', u'NTNX-afs01-6437b346-2fac-4a97-b928-
9894e3a386fb-dd70c208-12aa-4310-ac9d-673bdac07259'),
('\xc6h\x1e\xb9\x8auL\xf9\xa2\xe4\xb3x\xe2\xce\xd1\xb6', u'NTNX-afs01-6437b346-2fac-4a97-b928-
9894e3a386fb-c6681eb9-8a75-4cf9-a2e4-b378e2ced1b6'),
('s.\xd7\xe0\xf4O@\x0b\x98r\x7f\xdaIW\x17x', u'NTNX-afs01-6437b346-2fac-4a97-b928-
9894e3a386fb-732ed7e0-f44f-400b-9872-7fda49571778'),
('\x953\x00\xf2\x15^C(\xa8\xf6\xea\x1eR*|\xcd', u'NTNX-afs01-6437b346-2fac-4a97-b928-
9894e3a386fb-953300f2-155e-4328-a8f6-ea1e522a7ccd'), ('L\x07\t\x90N\xaeI\x7f\x89~Jc\x82\x8aA8',
u'NTNX-afs01-6437b346-2fac-4a97-b928-9894e3a386fb-4c070990-4eae-497f-897e-4a63828a4138'),
('\xa08\xf2\x92\xc9\x81O\xe5\x8d \xb4C\x82\xf2?\xeb', u'NTNX-afs01-6437b346-2fac-4a97-b928-
9894e3a386fb-a038f292-c981-4fe5-8d20-b44382f23feb'),
('\xb4f\xe8m\x087D"\x8a\xd8\x94&4@\x84*', u'NTNX-afs01-6437b346-2fac-4a97-b928-9894e3a386fb-
b466e86d-0837-4422-8ad8-94263440842a'), ('\x8f.\x1fNllO\x8a\x8d:\xf0MQ\xdeT\\', u'NTNX-afs01-
6437b346-2fac-4a97-b928-9894e3a386fb-8f2e1f4e-6c6c-4f8a-8d3a-f04d51de545c'),
('\xc5\xf1\xbc\x91I\x90L\xd7\xadK\xc2\xe7\xe49<\x93', u'NTNX-afs01-6437b346-2fac-4a97-b928-
9894e3a386fb-c5f1bc91-4990-4cd7-ad4b-c2e7e4393c93'),
('\xe3\xde\xe4\xd6\xb99L\x14\xafr\xc1\x9d+\xe0\x88-', u'NTNX-afs01-6437b346-2fac-4a97-b928-
9894e3a386fb-e3dee4d6-b939-4c14-af72-c19d2be0882d')] added to the PD NTNX-afs01 [CG NTNX-
afs01-6437b346-2fac-4a97-b928-9894e3a386fb-1]

2017-05-30 10:16:16 INFO share_task.py:1474 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-


8f85-621a2f5f6542 Shareuuid: (ShareName: home) : volume groups configured for volume group spec
6437b346-2fac-4a97-b928-9894e3a386fb

2017-05-30 10:16:16 INFO cvm_insights_store.py:453 Volume group list from DB:None

2017-05-30 10:16:16 INFO share_task.py:1478 existing_vg_list None

2017-05-30 10:16:16 INFO share_task.py:1482 new_vg_list ['548d055f-5b54-481d-b22d-253172493e85',


'b822259b-a259-47df-a6f1-093f02c6ac8a', 'a42e4bcf-3a13-4178-8e21-19829e58700f', '52be82b1-88ad-
4a49-b00b-af4e1d4236e3', '11bbfd7d-8882-4237-9b44-b444bfdb80a0', 'dd70c208-12aa-4310-ac9d-
673bdac07259', 'c6681eb9-8a75-4cf9-a2e4-b378e2ced1b6', '732ed7e0-f44f-400b-9872-7fda49571778',
'953300f2-155e-4328-a8f6-ea1e522a7ccd', '4c070990-4eae-497f-897e-4a63828a4138', 'a038f292-c981-
4fe5-8d20-b44382f23feb', 'b466e86d-0837-4422-8ad8-94263440842a', '8f2e1f4e-6c6c-4f8a-8d3a-
f04d51de545c', 'c5f1bc91-4990-4cd7-ad4b-c2e7e4393c93', 'e3dee4d6-b939-4c14-af72-c19d2be0882d']

2017-05-30 10:16:17 INFO minerva_task_util.py:1505 ShareAddTask: 7 : Updated wal with status 400

2017-05-30 10:16:17 INFO share_task.py:279 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-


8f85-621a2f5f6542 Shareuuid: (ShareName: home) : Sending add volume group set rpc to File server.

2017-05-30 10:16:17 INFO share_task.py:281 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-


8f85-621a2f5f6542 Shareuuid: (ShareName: home) : add volume group set rpc return task_uuid:
"\"\350ov\214\203@:\227\t\205\020\200\236&9"

2017-05-30 10:16:17 INFO minerva_task_util.py:1505 ShareAddTask: 7 : Updated wal with status 500

2017-05-30 10:16:17 INFO share_task.py:293 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-


8f85-621a2f5f6542 Shareuuid: (ShareName: home) : Polling for Nvm Volume group sets add task status

2017-05-30 10:16:17 INFO minerva_task_util.py:1247 tasks: 22e86f76-8c83-403a-9709-8510809e2639


2017-05-30 10:16:17 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:16:24 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

2017-05-30 10:16:27 INFO minerva_task_util.py:1505 ShareAddTask: 7 : Updated wal with status 600

2017-05-30 10:16:27 INFO share_task.py:319 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-


8f85-621a2f5f6542 Shareuuid: (ShareName: home) : Sending add share rpc to File server.

2017-05-30 10:16:27 INFO share_task.py:326 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-


8f85-621a2f5f6542 Shareuuid: (ShareName: home) : add share rpc return task_uuid:
"D\021\001\317\026\334A\226\202\324\\;n\023{R"

2017-05-30 10:16:27 INFO minerva_task_util.py:1505 ShareAddTask: 7 : Updated wal with status 700

2017-05-30 10:16:27 INFO share_task.py:339 ShareAddTask: 7 FileServerUuid: 9e2de0ba-3d4c-411f-


8f85-621a2f5f6542 Shareuuid: (ShareName: home) : Polling for Nvm Task Status

2017-05-30 10:16:27 INFO minerva_task_util.py:1247 tasks: 441101cf-16dc-4196-82d4-5c3b6e137b52

2017-05-30 10:16:27 INFO minerva_task_util.py:1283 Fetching pending tasks.

2017-05-30 10:16:29 INFO notify.py:103 notification=CreateFileServerShareAudit


file_server_share_uuid=80211986-0887-4804-8ad0-a5b051829e51 file_server_uuid=9e2de0ba-3d4c-
411f-8f85-621a2f5f6542 file_server_name=afs01 file_server_share_name=home message=Share
{file_server_share_name} is created on File server {file_server_name}

2017-05-30 10:16:29 INFO notify.py:84 Alert manager returned: result: true

2017-05-30 10:16:29 INFO minerva_task_util.py:1505 ShareAddTask: 7 : Updated wal with status


10000

2017-05-30 10:16:29 INFO minerva_task_util.py:1859 polling sub tasks 808b6894-5da2-4d8c-ace9-


ec416e05fd7b completed

2017-05-30 10:16:29 INFO domain_task.py:164 Successfully created home share.

2017-05-30 10:16:29 INFO minerva_task_util.py:1505 Updated wal with status 10000

2017-05-30 10:16:29 INFO minerva_task_util.py:1859 polling sub tasks 5993a6f8-5b99-41f3-8c0c-


a8197b368938 completed

2017-05-30 10:16:29 INFO file_server.py:194 File server afs01: successfully created.

2017-05-30 10:17:23 INFO zookeeper_session.py:102 minerva_cvm is attempting to connect to


Zookeeper

nutanix@NTNX-16SM13150152-A-CVM:10.30.15.47:~/data/logs$
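
The WAL status updates in the entries above (400, 500, 600, 700, and finally 10000) give a quick way to follow a single share-creation task from start to finish. A minimal sketch of filtering for them, assuming the minerva_cvm.log file name under ~/data/logs and the task id 7 seen in this capture (both are assumptions taken from this example and may differ on your cluster):

# Assumed log file name and task id; adjust both to match your own capture.
nutanix@cvm$ grep "ShareAddTask: 7" ~/data/logs/minerva_cvm.log | grep "Updated wal with status"
nutanix@cvm$ grep -i "successfully created" ~/data/logs/minerva_cvm.log

The -i flag on the second command catches both the "Successfully created home share" and the "File server afs01: successfully created" entries that close out the capture above.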
