
Expert Reference Series of White Papers

Ten vSphere HA and DRS Misconfiguration Issues

1-800-COURSES

www.globalknowledge.com



John Hales, Global Knowledge VMware instructor, A+, Network+, CTT+, MCSE, MCDBA, MOUS, MCT, VCP, VCAP, VCI, EMCSA

Introduction
VMware has a popular and powerful virtualization suite in the vSphere and vCenter family of products. This white paper focuses on ten of the biggest mistakes people make when configuring the High Availability (HA) and Distributed Resource Scheduler (DRS) features. We won't rehash what HA and DRS are and how they work; I'll instead refer you to the webinar and white paper I wrote for Global Knowledge, titled "Top Five New Features in vSphere 5," and focus on common configuration mistakes and how to avoid them. We'll begin with five common HA issues, then look at four common DRS issues, and conclude with an issue that affects both HA and DRS. A top-ten list is always subjective, but I'd suggest that these issues are among the mistakes most commonly seen in vSphere deployments. Within each section, the issues are ranked from biggest impact to smallest.

HA Issues
HA is included in almost every edition of vSphere, including one of the small-business bundles (Essentials Plus), because an ESXi host failure affects many virtual machines (VMs), making its impact far greater than the loss of a single server in the traditional physical world. Thus, it is very important to get HA designed and configured correctly.

1. Purchasing Differently Configured Servers

One of the common mistakes people make is buying differently sized servers (more CPU and/or memory in some servers than others) and placing them in the same cluster. This is often done with the idea that some VMs require far more resources than others, and that one big, powerful server is more expensive than several smaller ones. The problem with this thinking is that HA is pessimistic and assumes that the largest server will fail. Solution: Either buy servers that are configured the same (or at least similarly) or create separate clusters, with each cluster containing identically configured servers. Some people also implement affinity rules to keep the big VMs on designated servers, but this impacts DRS; we'll cover that issue later.
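To see the cost of mixed host sizes, consider a rough sketch (hypothetical memory figures; HA's host-failures policy plans for the loss of the largest host):

```python
# Hypothetical cluster memory (GB) per host. HA's "host failures the
# cluster tolerates" policy reserves capacity for the LARGEST host failing.
mixed = [256, 128, 128, 128]    # one big host, three small ones
uniform = [160, 160, 160, 160]  # same 640 GB total, sized alike

def usable_after_ha(hosts, failures=1):
    """Capacity left for running VMs after the worst-case reservation."""
    reserve = sum(sorted(hosts, reverse=True)[:failures])
    return sum(hosts) - reserve

print(usable_after_ha(mixed))    # 384 GB usable (256 GB held back)
print(usable_after_ha(uniform))  # 480 GB usable (160 GB held back)
```

The same total capacity yields 96 GB less usable memory simply because the cluster is lopsided.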

2. Insufficient Hosts to Run All VMs After Accounting for HA Overhead

When budgets are tight, many administrators size their environments with just enough resources to run all the needed VMs, but forget the overhead that HA imposes to guarantee that sufficient resources exist to restart the VMs from a failed host (or multiple hosts, if you are pessimistic). See the next two configuration issues for guidance on planning for host failures (and thus the overhead for HA).


VMware's best practice is to always leave Admission Control enabled so that HA automatically sets aside resources to restart VMs after a host failure. We'd strongly recommend this as well. Solution: Plan for the overhead of HA and purchase sufficient hardware to cover the resources required by the VMs in the environment plus the overhead for HA.
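As a back-of-the-envelope sketch (hypothetical workload numbers), the host count must cover the VM demand plus the hosts you expect to lose:

```python
import math

def hosts_needed(total_vm_demand_ghz, host_capacity_ghz, failures_tolerated=1):
    # Hosts required just to run the VMs...
    base = math.ceil(total_vm_demand_ghz / host_capacity_ghz)
    # ...plus spare capacity so HA can restart VMs after host failures.
    return base + failures_tolerated

# e.g., 250 GHz of aggregate VM demand on 40 GHz hosts, tolerating one failure:
print(hosts_needed(250, 40))  # ceil(6.25) + 1 = 8 hosts
```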

3. Using the "Host Failures the Cluster Tolerates" Policy


Recall that there are three admission control policies, namely:

• Host failures the cluster tolerates: The original (and, at first, only) option for HA, this policy assumes the loss of a specified number of hosts (one to four in versions 3 and 4, up to 31 in vSphere 5).
• Percentage of cluster resources reserved as failover spare capacity: Introduced in vSphere 4, this option sets aside a specified percentage of both CPU and memory resources from the cluster total for failover use; vSphere 5 improved this option by allowing different percentages to be specified for CPU and memory.
• Specify failover hosts: This policy designates a standby host that runs all the time but is never used for running VMs unless a host in the cluster fails. It was introduced in vSphere 4 and upgraded in version 5 to allow multiple failover hosts to be specified.

As described previously, HA is pessimistic and always assumes the largest host will fail, reserving more resources than usually needed if the hosts are sized differently (though, per issue one, we don't recommend that). The "host failures" policy also uses a concept called slots to reserve the right amount of spare capacity, but it takes a one-size-fits-all approach: the VM with the largest CPU reservation and the VM with the largest memory reservation define the slot size for all VMs, thus wasting resources unless all VMs are sized the same. Solution: Use the VMware-recommended policy of Percentage of cluster resources reserved as failover spare capacity instead, which takes a percentage of the entire cluster's resources and uses each VM's actual reservations instead of the largest reservation.
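A rough sketch of the slot math shows how a single large reservation inflates the slot for every VM (simplified; the real calculation also considers memory overhead and per-host slot counts):

```python
# Each VM is (cpu_reservation_mhz, mem_reservation_mb); hypothetical values.
vms = [(500, 1024), (500, 1024), (4000, 8192)]  # one big VM among small ones

# Slot size is taken from the largest CPU and largest memory reservation.
slot_cpu = max(cpu for cpu, _ in vms)  # 4000 MHz
slot_mem = max(mem for _, mem in vms)  # 8192 MB

# HA counts every VM as one full slot, so each small VM is charged
# eight times the CPU and memory it actually reserved.
print(f"slot = {slot_cpu} MHz / {slot_mem} MB for all {len(vms)} VMs")
```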

4. Forgetting to Update the Percentage Admission Control Policy as the Cluster Grows

If the Percentage of cluster resources reserved as failover spare capacity policy is used (as suggested), it is important to reserve the correct amount of CPU and memory based on the needs of the VMs and the size of the cluster. For example, in a two-node cluster, the loss of one node removes half of the cluster's resources (assuming the hosts are sized the same), so the percentage may be set to 50. However, if additional nodes are added to the cluster later, that value is probably too high and should be reduced to account for the additional node(s) and the number of simultaneous failures expected (for example, with four nodes, the loss of one node suggests a percentage of 25, while if two failures are expected, 50 percent should be used).


Solution: Recalculate the appropriate value whenever hosts are added to or removed from the cluster.
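The arithmetic is simple enough to script; a minimal sketch, assuming equally sized hosts:

```python
def failover_percentage(num_hosts, failures_to_tolerate=1):
    """Percent of cluster resources to reserve, for equally sized hosts."""
    return round(100 * failures_to_tolerate / num_hosts)

print(failover_percentage(2))     # 50 -- two nodes, one failure
print(failover_percentage(4))     # 25 -- four nodes, one failure
print(failover_percentage(4, 2))  # 50 -- four nodes, two failures
```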

5. Configuring VM Restart Priorities Inefficiently

One of the settings available in an HA cluster is the default restart priority of VMs after a host failure. This defaults to Medium but can be set to Low, High, or Disabled (the last if most VMs should not be restarted after a host failure). The default is fine if all VMs are the same priority, or if there are three tiers of VMs (low, medium, and high) and most are in the middle. However, if your tiers are really normal, high, and very high, the default does not work as well. Solution: Consider setting the cluster default for restart priority to Low, leaving two higher levels available. For example, infrastructure VMs such as domain controllers and DNS servers might be the highest priority (set those VMs to High), followed by critical services such as database and e-mail servers (set those to Medium), with the rest of the VMs at the default (Low). Any VMs that don't need to be restarted can be set to Disabled to save resources after a host failure.
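One way to keep the tiers straight is to record the plan as data; a sketch with hypothetical VM names (the values map to the HA restart-priority settings described above):

```python
# Cluster default restart priority: Low. Only the exceptions are listed.
restart_priority = {
    "dc01": "High",            # infrastructure first: domain controller
    "dns01": "High",           # ...and DNS
    "sql01": "Medium",         # critical services next: database
    "mail01": "Medium",        # ...and e-mail
    "scratch-vm": "Disabled",  # not worth restarting after a failure
}
```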

DRS Issues
The DRS feature is a little more advanced than HA but, for all but the smallest environments, no less important: keeping performance balanced across many VMs in a dynamic environment is an ever-present concern, and one that could easily consume the full time of one or more administrators without it. Leverage DRS, and the environment will run as smoothly as possible.

6. Not Preparing for New Hardware

One of the biggest changes many administrators don't plan for is new hardware (new CPUs with more advanced capabilities, or switching CPU vendors between Intel and AMD). The problem here is that you end up with islands of vMotion compatibility, where VMs, once started, can only be moved to some of the other servers in the cluster. This severely limits what DRS can do to load balance the cluster. Solutions: This issue has several solutions:
• Build separate clusters for AMD and Intel (if you have both CPU architectures; better yet, stick with a single CPU vendor) to solve the CPU vendor issue.
• Always enable Enhanced vMotion Compatibility (EVC) on every cluster so that, as new nodes are added, they will be dumbed down to the level of your existing hosts.
• As old hosts are removed from a cluster, remember to upgrade the EVC level to expose the capabilities of the new hosts added to the cluster. The setting must always be set to the lowest CPU type in the cluster.
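Conceptually, the EVC baseline is just the lowest CPU generation present in the cluster; a toy sketch (made-up generation numbers, not real EVC mode names):

```python
# Hypothetical CPU "generation" per host; EVC must match the oldest one.
hosts = {"esx01": 3, "esx02": 3, "esx03": 5}   # esx03 is a newer box
evc_baseline = min(hosts.values())             # 3: esx03 is "dumbed down"

# After retiring the two old hosts, recompute and raise the baseline:
hosts = {"esx03": 5, "esx04": 5}
evc_baseline = min(hosts.values())             # now 5: new features exposed
```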

7. The "I'm Smarter than DRS" Mentality

This is a common mentality among administrators who are new to DRS: they don't trust it, or they want to know where every VM is at all times. I once had a student who said that his security department mandated documentation of which host each VM was located on; that requirement makes little sense in a virtual environment, which is designed to be dynamic. In other cases, administrators think they are smarter than DRS and can balance the load better themselves. Solution: Let DRS run in Fully Automated mode. You are not smarter than DRS. There are simply too many VMs to watch, and you can't watch them all the time, but DRS checks the load balance of the cluster every five minutes and will automatically rebalance as conditions change.

8. Setting the Migration Threshold Too Aggressively

One of the mistakes new administrators often make with DRS is setting the Migration Threshold too aggressively. This value is set on a five-point scale from Conservative to Aggressive. Conservative only implements Priority 1 (five-star) recommendations, namely: the host is going into maintenance mode, reservations on the host exceed the host's capacity, or affinity rules are violated. Priorities 2 through 5 (four- to one-star recommendations, respectively) take performance into account, issuing higher-priority recommendations when the cluster is more out of balance and lower-priority recommendations when the difference between nodes is smaller. Many administrators assume they want to be as aggressive as possible so the cluster stays as balanced as possible, but remember that there is a trade-off between being perfectly balanced and the cost of achieving that balance; in other words, the cost of vMotion. Doing too many vMotion migrations may actually cost more than the benefit of being perfectly balanced. Solution: Set the threshold to the midpoint (Priority 3) unless the load is fairly static. Analyze cluster performance and recommendations, and adjust as necessary.
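The threshold acts as a simple filter on DRS's star ratings; roughly (a sketch of the idea, not VMware's actual implementation):

```python
def should_apply(recommendation_priority, threshold):
    """Apply a DRS recommendation if it clears the configured threshold.

    Priority 1 (five stars) = mandatory ... priority 5 (one star) = marginal.
    Threshold 1 = Conservative ... threshold 5 = Aggressive.
    """
    return recommendation_priority <= threshold

print(should_apply(1, 3))  # True: mandatory moves always run
print(should_apply(4, 3))  # False: at the midpoint, marginal moves are skipped
```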

9. Non-optimal Sizing of Clusters

A cluster (HA or DRS) can have up to 32 nodes in it, but just because you can doesn't mean you should. Very small clusters give DRS few options for load balancing and often incur higher overhead from HA, reducing the available capacity to run VMs. On the other hand, very large clusters may be fine from an HA perspective. If you are running vSphere 5, there is one master node and the rest are slaves, but any slave can be promoted to master if the master fails, so large cluster sizes are okay. If you are running version 4 or below, however, you may wish to use a smaller cluster size: there is a maximum of five primary nodes, with the other nodes being secondary nodes, and secondary nodes are usually not automatically promoted to primary if a primary fails. This is important because, if all primary nodes are down, HA will not automatically restart anything. DRS clusters are another matter, however. The problem is that the larger the cluster and the more VMs in it, the more possible scenarios vCenter has to analyze, dramatically increasing the load on that server.


Solution: Many experts recommend putting between 16 and 24 hosts in a cluster as a good balance between the reduced overhead for HA and the increased load on vCenter for DRS. If you will be using linked clones, such as with View, the maximum cluster size is eight nodes.
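The HA side of the trade-off is easy to quantify; a sketch for equally sized hosts tolerating one failure:

```python
# Fraction of cluster capacity HA holds back, by cluster size.
for n in (2, 4, 8, 16, 24, 32):
    print(f"{n:2d} hosts: {1 / n:5.1%} reserved for HA")
# 2 hosts: 50.0% ... 16 hosts: 6.2% ... 32 hosts: 3.1%
```

The overhead curve flattens past roughly 16 hosts, which is one reason the 16-to-24 range is a popular compromise.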

HA and DRS
Finally, there's an issue that affects both HA and DRS. Optimizing VMs through the proper use of reservations, limits, and shares is a more time-consuming and challenging task than many of those previously listed, but it will pay dividends day in and day out.

10. Overuse of Reservations, Limits, and Affinities

One of the powerful features in vSphere is the ability to guarantee a certain level of resources (via reservations) or to cap consumption (via limits) for VMs. While this can be done, it reduces the options that HA and DRS have when restarting and load balancing VMs. Affinity rules, while convenient and sometimes necessary for HA, performance, or licensing reasons, add even more constraints on HA and DRS. Solution: Use shares instead of reservations and limits whenever possible, and minimize the use of affinity rules (VM-to-VM as well as VM-to-Host) to give HA and DRS the most possible options. If limits and reservations are needed, implement them at the resource pool level whenever possible instead of on individual VMs.
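As a sketch of the recommended layering (hypothetical pool names and values), shares express priority while the few reservations or limits that are truly needed live at the resource pool level rather than on individual VMs:

```python
# Hypothetical resource-pool design. Shares only matter under contention,
# so they leave HA and DRS free to place and restart VMs.
resource_pools = {
    "production": {"shares": "High", "mem_reservation_gb": 64},
    "test-dev":   {"shares": "Normal", "mem_limit_gb": 32},
    "lab":        {"shares": "Low"},  # no reservation or limit at all
}
```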

Learn More
Learn more about how you can improve productivity, enhance efficiency, and sharpen your competitive edge through training.
• VMware vSphere: Fast Track [V5.0]
• VMware vSphere: Install, Configure, Manage [V5.0]
• VMware vSphere: What's New [V5.0]
Visit www.globalknowledge.com or call 1-800-COURSES (1-800-268-7737) to speak with a Global Knowledge training advisor.

Related VMware Certifications
• VMware Certified Professional 5 (VCP5)

About the Author


John Hales, VCP, VCAP, VCI, is a VMware instructor at Global Knowledge, teaching all of the vSphere and View classes that Global Knowledge offers. John is also the author of many books, from in-depth technical books from Sybex (including a book on vSphere 5, Professional vSphere 5: Implementation and Management), to exam-preparation books, to many quick-reference guides from BarCharts, in addition to custom courseware for individual customers. John holds various certifications, including the VMware VCP (3, 4, and 5), VCAP, and VCI; the Microsoft MCSE, MCDBA, MOUS, and MCT; the EMC EMCSA (Storage Administrator for EMC Clariion SANs); and the CompTIA A+, Network+, and CTT+.
Copyright © 2011 Global Knowledge Training LLC. All rights reserved.
