First Published On: 05-08-2017
Last Updated On: 11-29-2018
Table of Contents
1. Introduction
1.1. Executive Summary
1.2. Introduction to Intelligent Rebuilds
2. vSAN 6.6 Intelligent Rebuilds
2.1. Purpose of Intelligent Rebuilds
2.2. Intelligent Rebuilds using Enhanced Rebalancing
2.3. Intelligent Rebuilds using Smart, Efficient Repairs
2.4. Intelligent Rebuilds using Partial Repairs
2.5. Intelligent Rebuilds using Resumable Resync
2.6. Intelligent Rebuilds with maintenance mode decommissioning
3. Usage scenarios
3.1. Activities that benefit from improved Intelligent Rebuilds
4. Conclusion
4.1. Conclusion
1. Introduction
Executive Summary and Introduction
Executive Summary
A common requirement of enterprise storage systems is the ability to maintain resiliency and expected
levels of performance in the face of hardware faults and other unforeseen conditions. VMware vSAN
is the industry-leading distributed storage system built right into VMware vSphere, designed to
offer the highest levels of resiliency and performance, with maximum agility should hardware
faults occur or the demands of the environment change. vSAN's awareness of data placement
is tightly integrated with routine activities such as host decommissioning and VMware High
Availability (HA). vSAN can also place data intelligently based on site topology designs such as
stretched clusters and user-defined fault domains. These conditions, and many more, are accounted
for in vSAN's ability to intelligently manage the performance, efficiency, and availability of data stored on
a cluster powered by vSAN. These mechanisms fall under the category of "intelligent rebuilds" and are
discussed in more detail in this document.
VMware vSAN is a distributed storage system that uses physical storage devices on each
ESXi host in a cluster to contribute to the vSAN storage system. vSAN removes the concept of, and
limitations around, defining LUNs, and presents storage as a single, distributed datastore visible to all
hosts in the vSphere cluster. VMware vSAN is an object-based storage system integrated into VMware
vSphere. Virtual machines that live on vSAN storage are composed of a number of storage objects.
VMDKs, the VM home namespace, VM swap areas, snapshot delta disks, and snapshot memory maps are
all examples of storage objects in vSAN.
In vSAN, RAID protection levels are defined and controlled in VMware vCenter, using storage policy-based
management (SPBM). vSAN protects data at an object level, giving you the ability to apply
various levels of protection (RAID-1, RAID-5, RAID-6) and performance on a per-VM, or
even per-VMDK basis. vSAN uses the concept of a RAID tree to ensure protection of objects. A
"component" is a "leaf" of an object's RAID tree, and is how redundancy is provided to a given
object. When using a RAID-1 protection scheme, a "replica" represents a complete copy of all of the
components that make up an object. Figure 1 illustrates the relationship between objects,
components, and replicas.
Components may be split into smaller components depending on environment conditions, policy
settings, and the size of the objects. These disaggregated pieces of an object can be stored separately
from the other components that make up the object. vSAN automatically manages the distribution of
components across the hosts that constitute a vSAN cluster, and will actively rebuild or resynchronize
components when VM objects are not adhering to their defined protection policies, are severely
imbalanced, or when some operational change occurs in the environment.
vSAN 6.6 has a number of improvements designed to offer more intelligent rebuilds, returning the
cluster to normal operations and policy compliance quickly and automatically.
vSAN 6.6 includes a number of changes related to component rebuilds.
Each enhancement will be described in more detail throughout this document, along with
recommendations to ensure the best use of these technologies in your vSAN environment.
Rebalancing efforts fall into two general categories: proactive rebalancing and reactive rebalancing.
• Proactive rebalancing occurs when disks are less than 80% full. The opportunity to run a
proactive rebalance will only occur when any single disk has consumed 30% more capacity than
the disk with the lowest used capacity. vSAN automatically checks for these conditions, and
will provide a health check alert that will allow the user to invoke a rebalance using the
"Proactive Rebalance Disks" button in the Health section of the vSAN UI.
• Reactive rebalancing occurs when any disk is more than 80% full. This is an automatic process
taken by vSAN to ensure the best distribution of components. An imbalance will automatically
be detected, and vSAN will automatically invoke efforts to achieve better balance.
• Better decision making in rebalancing efforts. When determining the best strategy for moving
components, vSAN 6.6 factors in additional information about the components on disks
exceeding 80% capacity. vSAN evaluates how much data must be moved out of the identified
disk in order to meet the desired balancing objective, and factors this into the component
selection process. A more sophisticated selection process of components has been introduced
in vSAN 6.6 that prioritizes components so that the components selected for moving will make
a more meaningful difference in the rebalancing effort. Previous editions of vSAN relied on a
rebalancing master to publish placement decisions for all components moving out of the
source disk and host. This information could grow stale by the time the hosts act on the
information. In vSAN 6.6, data placement decisions are now based on a more recent cluster
state, courtesy of more up-to-date information from other hosts. This improved decision
making process for component placement can reduce the number of components being
moved, and the number of unsuccessful component moves, thereby reducing the overall
amount of CPU and network resources used to maintain proper balance.
• Splitting large components during redistribution. vSAN 6.6 can now take an individual large
component, and break it into smaller pieces for more optimal distribution. The breaking of
components into smaller chunks will only occur during reactive balancing, when a disk has
consumed more than 80% of its capacity. During the rebalancing process, if vSAN finds a new
disk with sufficient capacity for the component without the need for breaking into multiple
components, it will place the component there in its entirety, and not split the component. As
seen in Figure 2, if vSAN can't find a new disk without breaking up the component, it will break
the component in two, and retry the rebalancing effort. If a disk still cannot be found after the
initial split, vSAN will continue to divide the component further, and retry the previous step. This
logic will be followed for up to 8 splits. The splitting of a component will only occur on the component being
redistributed. Previous editions of vSAN did not have this ability, and rebalancing efforts could
either lead to less than ideal placement, or simply the inability to rebalance due to large
component sizes coupled with minimal free space.
• Improved visibility of rebalancing. Two improvements have been made to provide better
visibility when rebalancing. The rebalance status has been updated to provide more frequent
updates with more accurate reports of progress. The improvements make it much easier to
report the number of bytes being moved out for rebalancing, and new resync graphs have been
introduced in the performance service so that resync traffic can be monitored in greater detail.
As shown in Figure 3, this can be found at the disk group level in the host-related vSAN
metrics.
The rebalance status can report one of the following states:
• Proactive rebalance is needed. This state is triggered after vSAN determines proactive
rebalancing is needed following a 30 minute monitoring period.
• Proactive rebalance is in progress. This state is shown when a proactive rebalance is in
progress.
• Proactive rebalance failed. This is when the object manager couldn’t find any components
that could be moved due to policy restrictions, lack of resources, etc.
• Reactive rebalance task is in progress. This state simply notifies the user that a reactive
rebalance is in progress.
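The component-splitting behavior described above can be modeled as a simple retry loop. The sketch below is an illustrative simplification (one piece per disk, a greedy fit check), not vSAN's actual placement algorithm; the function and parameter names are hypothetical.

```python
def place_with_splits(component_gb, free_gb_per_disk, max_splits=8):
    """Illustrative model of reactive-rebalance splitting: try to place the
    component whole; if no placement exists, halve the largest piece and
    retry, for up to 8 splits. Assumes one piece per disk for simplicity."""
    pieces = [float(component_gb)]
    splits = 0
    while True:
        # Greedy check: match the largest pieces to the largest free disks.
        ordered_pieces = sorted(pieces, reverse=True)
        ordered_disks = sorted(free_gb_per_disk, reverse=True)
        if len(ordered_pieces) <= len(ordered_disks) and all(
            p <= d for p, d in zip(ordered_pieces, ordered_disks)
        ):
            return ordered_pieces          # placement found; stop splitting
        if splits >= max_splits:
            return None                    # give up after the maximum splits
        # No placement found: halve the largest piece and try again.
        largest = ordered_pieces[0]
        pieces = ordered_pieces[1:] + [largest / 2, largest / 2]
        splits += 1
```

For example, a 100 GB component that fits on no single disk, but where two disks each have 60 GB free, would be placed as two 50 GB pieces after a single split.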
Note that rebalancing of components aims to achieve a relatively even distribution of components
across the storage devices contributing to storage in a vSAN data store. This is designed both to
optimize performance, by distributing across the most devices, and to minimize the size of any
given fault domain. It will not rebalance data in a way that would compromise a given
protection policy. For instance, vSAN will never place two leaf components of a RAID-1
mirrored object on the same drive for the sake of balance.
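The proactive and reactive triggers described earlier can be summarized as a small decision function. This sketch models the 30% variance as percentage points of used capacity, which is an assumption; the thresholds themselves (80% and 30%) come from the text above.

```python
def rebalance_state(used_pct_per_disk):
    """Illustrative model of vSAN's rebalance triggers: reactive when any
    disk is more than 80% full; a proactive-rebalance health alert when the
    most-used disk has consumed 30 (percentage points, an assumption) more
    capacity than the least-used disk."""
    if max(used_pct_per_disk) > 80:
        return "reactive"    # vSAN rebalances automatically
    if max(used_pct_per_disk) - min(used_pct_per_disk) > 30:
        return "proactive"   # user may invoke "Proactive Rebalance Disks"
    return "balanced"
```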
The throttle represents the total bandwidth used by an individual disk group on a host, not the
entire host. The vSAN UI will show the resync bandwidth for the disk group with the highest
usage on each host. Once resync throttling is applied, the limit applies to all disk groups that
attempt to exceed it. Throttling can be restricted down to 1 MB per second.
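A minimal sketch of this per-disk-group behavior, with hypothetical names; only disk groups that would exceed the configured limit are clamped, and the limit has a floor of 1 MB/s.

```python
def throttle_resync(per_disk_group_mbps, limit_mbps):
    """Illustrative model of resync throttling: the limit is applied per
    disk group, only to those exceeding it, with a 1 MB/s floor."""
    limit = max(1, limit_mbps)                     # cannot go below 1 MB/s
    return {dg: min(bw, limit) for dg, bw in per_disk_group_mbps.items()}
```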
In vSAN 6.6, key enhancements have been made to restore compliance of protection
policies to objects, while minimizing the resources used for a rebuild.
• Additional repair method. Prior to version 6.6, vSAN offered a strict component rebuild process
invoked after components were marked as absent for 60 minutes or longer. Once this rebuild
process started, it would continue until completion, even if the affected host came back
online shortly after the 60 minute timeout window. This process would also resync the slightly
outdated components on the recently restored host while rebuilding the entire component
elsewhere, but could only use the rebuilt component upon completion. vSAN 6.6 introduces a
new repair method that also allows vSAN to take advantage of the resynchronized component
on the host, should vSAN deem this the more efficient method to use.
• New logic to determine best method to use. When a host or device comes back online after a
60 minute window, vSAN will look at the amount of data remaining for a component rebuild to
complete, versus how long it would take to repair or resync the outdated component, and
choose the method that will complete with the least amount of effort, cancelling the other
rebuild operation.
These improvements significantly reduce the amount of data that needs to be rebuilt, and the time
that it takes for objects to regain compliance with their protection policies. Reducing the need for full
component rebuilds can also free up space that is not needed immediately for the rebuild process.
The added repair method, and the new logic for determining the best method of repair, are especially
beneficial for hosts that were in their maintenance window just slightly longer than the 1 hour timeout
period, as well as for hosts that hold a large amount of consumed capacity for objects.
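The decision between the two repair methods can be sketched as a simple comparison of remaining effort. The names below are hypothetical, and the byte counts stand in for whatever internal cost metric vSAN actually uses.

```python
def choose_repair_method(rebuild_bytes_remaining, stale_bytes_to_resync):
    """Illustrative model of the smart-repair decision: when a host returns
    after the 60-minute window, compare the data still left for the full
    rebuild against the data needed to bring the restored, slightly stale
    component up to date, and keep whichever operation finishes with the
    least effort; the other operation is cancelled."""
    if stale_bytes_to_resync < rebuild_bytes_remaining:
        return "resync restored component"   # cancel the in-flight rebuild
    return "complete full rebuild"           # cancel the catch-up resync
```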
In vSAN 6.6, this rebuild and repair process has been enhanced through the concept of "partial
repairs." Previous editions of vSAN could only successfully execute a repair effort if there
were enough resources to repair all of the degraded or absent components in their entirety. The
repair process in vSAN 6.6 takes a more opportunistic approach to healing by repairing as many
degraded or absent components as possible, even if there are insufficient resources to ensure full
compliance. The effective result is that an object might remain non-compliant after a partial repair,
but will still gain increased availability from those components that are able to be repaired.
Any remaining components that are not repaired to meet their full level of compliance according to
the SPBM policy will be repaired as soon as enough capacity resources become available.
An example of a partial repair process can be demonstrated in a scenario of a 6 node vSAN cluster, with a
VM configured for an FTT set to 2. As shown in Figure 6, in the event that two host failures occur, the
VM would still be available, but the effective FTT would be 0, as there would only be one replica
remaining. With the new partial repair process, vSAN will initiate and complete a repair of the objects
if enough resources are available to increase the effective FTT level as a result of the repair. While the
objects might remain non-compliant from the desired SPBM policy of FTT=2, the partial repair
process will have increased the effective availability to FTT=1. vSAN will eventually complete the
repair to make the object fully compliant when resources become available. Increased resource
availability could come from adding hosts, adding capacity, deleting unused VMs, or reducing
protection levels of other VMs.
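The outcome of the partial repair in the FTT=2 example above can be expressed as a small model. This is a hypothetical simplification: vSAN's actual repair operates on components, not whole replicas.

```python
def partial_repair(desired_ftt, effective_ftt, replicas_placeable_now):
    """Illustrative model of a partial repair: repair as many missing
    replicas as current capacity allows, raising the effective FTT even
    when the desired (policy) FTT cannot yet be reached. Returns the new
    effective FTT and whether the object is now policy compliant."""
    repaired = min(replicas_placeable_now, desired_ftt - effective_ftt)
    new_effective_ftt = effective_ftt + repaired
    return new_effective_ftt, new_effective_ftt == desired_ftt
```

In the scenario above, resources exist for only one additional replica, so an FTT=2 object at effective FTT=0 is raised to effective FTT=1 and remains non-compliant until more resources become available.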
Partial repairs work in both standard and stretched cluster environments. An example of a partial
repair process in a stretched cluster environment can be demonstrated in a scenario of an 8 node
stretched cluster, with a VM configured for PFTT=1 (remote protection level), and SFTT=1 (Local
protection level). As shown in Figure 7, in the event that a site failure occurs, and the remaining site
has a host failure, the VM would still be available in the remaining site. The effective PFTT would be
0, and SFTT would be 0, as it would be the last replica remaining. The new partial repair process will
initiate and complete a “best effort” repair in the surviving site. While the object might remain non-
compliant across sites (PFTT), the partial repair process will have increased the effective local
protection level to a locally compliant level of SFTT=1. vSAN will eventually complete the repair to
make the object fully compliant across sites when resources become available.
In both examples, vSAN will give priority to the partial repair of data components over witness
components.
vSAN 6.6 has improved the resync action so that it is more resilient and efficient. In prior editions of
vSAN, an interrupted resync would need to start the resync process from the beginning. An example
of a resync interruption is an absent host coming back online, running the resync process,
followed by a brief network interruption. Changing an SPBM policy on an object while the host
containing the owner object is offline is another scenario in which resyncs would need to start from
the beginning. As shown in Figure 8, vSAN 6.6 is now able to transparently resume a resync operation
where it left off following an interruption, avoiding the need to reprocess already resynchronized data.
Resumable resyncs are achieved by accepting writes, and tracking the changes, on the component that
remains available. vSAN identifies the last write that the component had when the host containing the
other component went offline, while keeping the previous effort of resynchronized data. The updated
writes that are committed to the still-available component are tracked separately from the original
resync tracking. Once the temporarily unavailable component comes back, the tracked changes on
the active components are merged into the components that were temporarily offline, so that the
resynchronization process can resume. As data is resynchronized, vSAN will incrementally update its
understanding of what data has been committed so that, in the event of another interruption, it
does not have to resync data already synchronized.
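The bookkeeping described above can be sketched with sets of block offsets standing in for vSAN's internal change tracking, which is an illustrative assumption, not its actual on-disk format.

```python
def remaining_resync(synced, dirtied_during_outage, all_blocks):
    """Illustrative model of resumable resync: blocks synchronized before
    the interruption stay valid unless a tracked write dirtied them while
    the component was offline; only dirtied and never-synced blocks are
    transferred when the resync resumes."""
    still_valid = set(synced) - set(dirtied_during_outage)
    return set(all_blocks) - still_valid   # what must still be transferred
```

For example, if blocks 0-2 of a 5-block component were synchronized before the interruption, and block 2 was overwritten during the outage, only blocks 2, 3, and 4 are transferred on resume.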
"Ensure data accessibility" simply allows the VMs to remain accessible, yet potentially in a degraded
state of redundancy. This is typical for most quick maintenance mode operations, and no data will be
moved. "No data evacuation" is most often related to a full cluster shutdown, and just as the name
implies, no data will be moved. "Evacuate all data" will rebuild the components onto other hosts to
maintain the desired protection policies assigned to the VMs. This last option is commonly used for
activities such as hardware maintenance, decommissioning, or possibly on-disk format changes
introduced in vSAN. These modes apply not only to a host entering maintenance mode (EMM), but
also to disk and disk group decommissioning activities.
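The three options above map onto maintenance intents roughly as follows; the reason strings in this sketch are hypothetical labels, only the option names come from the vSAN UI.

```python
def pick_emm_option(reason):
    """Illustrative mapping of maintenance intent to the three vSAN
    maintenance mode options; the reason strings are hypothetical."""
    mapping = {
        "quick patch or reboot": "Ensure data accessibility",  # no data moved
        "full cluster shutdown": "No data evacuation",         # no data moved
        "hardware decommission": "Evacuate all data",          # full rebuild
    }
    return mapping[reason]
```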
In vSAN 6.6, evacuating data to other hosts has been optimized to reduce the amount of overhead and
data migration during an EMM operation. This translates to a quicker time to complete the EMM
operation.
The object manager will no longer attempt to fix compliance at an object level across the cluster
during a full evacuation, but rather, only strive to move all components from the node entering
maintenance mode onto other nodes in the cluster. vSAN will preserve an object's current effective FTT
level during this operation. If an object had been assigned FTT=2 in its policy, but had an effective
availability of FTT=1, it will preserve this FTT=1 status for the EMM effort. This reduces the amount of
time required, minimizes data movement across the cluster, and also increases the chance for a
successful maintenance mode operation. Previous editions would require all affected objects to be
fully compliant, including the repair of other unrelated components before completing the EMM
process.
RECOMMENDATION: Choose the correct maintenance mode operation for your intention. Typical
vSphere patches are often applied quickly, and the most significant downtime might just be the reboot
process of the host. If your VMs can run with less resiliency for a brief amount of time, then the
“Ensure Accessibility” maintenance mode option is an extremely efficient way to go. Other
maintenance activities on a server expected to take a longer time period might be more suitable for
the “full evacuate” option.
3. Usage scenarios
Scenarios that demonstrate where particular Intelligent Rebuild functionality comes into play.
RECOMMENDATION: Use a single host maintenance mode for rolling cluster updates. In traditional
three-tier architectures, persistent storage was housed separately, and if a cluster had sufficient
compute resources to tolerate more than one host offline at any given time, this could speed up the
remediation process for host updates. In vSAN, a single host maintenance mode rolling update is a
better strategy, as it will reduce the amount of component resyncing needed to ensure proper compliance.
4. Conclusion
4.1 Conclusion
All of the individual enhancements to the task of intelligent rebuilds deliver specific performance and
efficiency improvements to vSAN. While described individually in this document, these improvements
deliver dramatic gains in efficiency when vSAN uses them collectively against the demands of real
workloads in a production environment.
Pete Koehler is a Sr. Technical Marketing Manager, working in the Storage and Availability Business
Unit at VMware, Inc. He specializes in enterprise architectures, data center analytics, software-defined
storage, and hyperconverged Infrastructures. Pete provides more insight to challenges of the data
center at vmpete.com, and can also be found on Twitter at @vmpete.