Page 1 of 19
1 Introduction
2 Background: Storage Management Principles in a Virtual Environment
3 High Level Design Principles for Storage Resiliency in a Virtualized Environment
  3.1 Building Architectures
  3.2 Campus Architectures
  3.3 Globe Architectures
  3.4 Hybrid Designs
  3.5 Networking Considerations
  3.6 SAN Considerations
  3.7 Conclusions and Summary Table
4 Data Loss and Data Integrity in Replicated Storage Scenarios
  4.1 Asynchronous vs Synchronous Replication
  4.2 Data Consistency: Not Only a Technical Issue
5 High Availability and D/R at the "Virtualization layer" Vs "Virtual Machine layer"
6 Storage Resiliency: Actual Implementations in a Virtualized Environment
  6.1 Building Implementations
  6.2 Campus Implementations
    6.2.1 Campus Implementations for VMware Virtual Infrastructures
      6.2.1.1 Campus Implementations for VMware Virtual Infrastructures (Host-based Replication)
      6.2.1.2 Campus Implementations for VMware Virtual Infrastructures (Storage-based Replication)
1. Introduction
This document is a summary of the various storage architectures commonly seen in virtualization deployments. Virtualization has morphed very quickly from being a server technology (i.e. one that allows sysadmins to carve out more than one OS per physical server) into being a datacenter technology that requires a proper storage setup in order to fully exploit its potential. The fact of the matter is that, at the moment, there is more potential to exploit (and more confusion) around storage architectures for virtualized environments than there is for server environments. I am not simply talking about storage virtualization; I am rather talking about storage architectures designed to support server virtualization (which might or might not include storage virtualization).
The document outlines high level considerations and will immediately set the stage for the terminology being used (sure, there is a standard terminology but, as a colleague of mine used to say, the good thing about standards is that there are many of them). After that we will get into how the various technologies map onto the high level design principles we have outlined. This is the tough part: we will try to mix and match various storage technologies with various server virtualization technologies, and with various tools and utilities. The combination of all this will determine what you can and cannot do. This document is oriented towards open and distributed infrastructures, and specifically towards x86 virtualization environments. It is also more oriented towards resiliency of the server and storage infrastructure than towards other important aspects (such as performance, for example). I will not cover backup strategies in this release. It is not my intent to push a specific technology. The idea is to provide a framework of architectures (with advantages, disadvantages and, more generally, characteristics for each of them) along with technology examples. You can use this document (and the examples) as a "benchmark" for your solution / vendor of choice. Last but not least, as this is more of a "notes from the field" document, please understand that I will be adding/removing/adjusting stuff which, in the long run, may result in a slightly less structured layout. I honestly don't have time to re-read everything so that the flow is consistent; I will pay attention, though, to keeping the content consistent. Feedback (massimo@it20.info) on the content is always more than welcome.
There are three high level storage architectures that we will discuss here. They are the Building, Campus and Globe scenarios.
3.1 Building Architecture
We could actually refer to this as the most common storage architecture for virtualization. We could even call it Rack, since it's commonly deployed in a single rack (inside the "building"). Simply put, it's a single shared storage architecture that is available to a cluster of virtualized servers. As shared storage is becoming a must-have for virtual deployments, rather than a nice-to-have, this has become the blueprint architecture for the vast majority of virtualization deployments: cluster nodes are basically stateless (i.e. with just the bare metal hypervisor and limited local configuration options) and all the assets of the organization are stored on the SAN (that is, in the storage server connected to the SAN) in the form of virtual machines and virtual hard disks: this includes the Guest OSes, the application code, the VM configuration information and, more importantly, the data. This setup has a fundamental characteristic: while the servers are loosely coupled and so very redundant by nature, the storage server is not. The storage server might have redundant controllers, but those controllers are tightly coupled. Most of the time they are not only tightly coupled from a logical perspective (i.e. one single storage server with 2 integrated controllers), which makes some maintenance operations and potential issues more risky than having two independent storage servers, but often they are also tightly coupled from a physical perspective, meaning the multiple controllers come packaged into a single physical box. If this is the case, you can't separate the two controllers and install them into different racks or different buildings in your campus, so you can't survive a problem that goes beyond the standard single component failure (a controller, for example).
Because of the above, many people still tend to see the single (yet fully redundant) storage server in the Building scenario as a single point of failure when it comes to resiliency. That's why this solution is commonly perceived as redundant from a server perspective but not from a storage perspective. Again, storage redundancy here does not refer to a single component failure (for which you would be covered, as all components are redundant) but to critical issues that might occur within the box (i.e. concurrent firmware upgrade issues or limitations, multiple disk failures, loss of power in the building, etc.).
3.2 Campus Architecture
I have discussed the limitations of what I refer to as the Building solution. This solution (which I am going to refer to as Campus) is intended to overcome those limitations - obviously nothing is free, so expect it to be more expensive in terms of both money and the complexity you need to deal with. Note that, in the picture, I have named it Campus in bold but I have kept the Building option simply because you might not be so concerned about a problem at the building level while still being concerned about the single storage deployment. In fact, nothing would stop you from deploying all this into a single building (or even a single rack). I have named it Campus for the sole reason that, most commonly, customers want the two infrastructure silos (one silo = one storage server and half the x86 servers) to be in different locations (within a certain distance). This solution is commonly associated with the notion of stretched clusters: servers are managed as a single entity (i.e. a cluster) but they are physically distributed across the Campus. While you could take this approach in the first solution we discussed (Building) as well, the fact that you only have a single storage server makes distributing servers across your sites a bit useless (the storage server remains a single point of failure). The most important characteristic of this solution is that disk access occurs simultaneously on both storage servers, which are a mirror of each other. By the way, many would refer to this solution as Business Continuity (or Continuous Availability), although you know that these terms have different meanings depending on who you talk to. Throughout this document I prefer to describe the business objective and the actual implementation (details and constraints). The name you attach to any model is not important to me as long as it is consistent with your own company naming convention.
The Campus solution is typically associated with the following business targets:
- Failover at all levels (servers and storage) is typically automatic: this is in fact a stretched cluster with redundant storage. The RTO is a function of how long the applications take to restart on the surviving elements.
- The storage is replicated synchronously (the replica might occur at the storage level or at the server level). The so called RPO is 0 in this solution.
While implementations might vary based on the products used, in order to achieve the desired outcome there are two technical requirements that need to be addressed as must-haves:
- All servers (across sites) need to be on the same Layer 2 LAN/VLAN. This ensures fully transparent switch-over of the applications from one site to another within the Campus.
- All servers in the Campus need to access both storage arrays (exceptions may apply but this is a good rule of thumb).
The uniform storage access is the most stringent of the requirements, as it will drive the maximum distance of the solution. Before getting into the products that implement this, there are a couple of additional rules of thumb to consider. First of all, it goes without saying that the farther the distance, the higher the latency. And you can't really go beyond a certain latency level, otherwise your hosts will start "screaming" at the storage. You can certainly alleviate the problem by implementing technologies that lower the latency, and in this case the rule of thumb is obvious: the more you spend, the better it gets (BAU - Business As Usual). This is where typically the distance
limitation kicks in: implementing this architecture within a few hundred meters is not usually a big deal, but trying to stretch it to a few kilometers might become a challenge (or might become VERY expensive - depending on how you want to look at the matter). Having said this, of all the 3 scenarios (Building, Campus, Globe) the Campus one is certainly the most difficult to implement, given the high end-user expectations (in terms of RTO and RPO) and the integration difficulties between the various technology layers (primarily between the storage and the virtualization software running on the servers). It is important to underline that this document is very technology oriented and not so much process oriented. This means that, while from a process and compliance perspective there might be a big difference between storage devices located in nearby buildings and storage devices located within a metropolitan area network, from a technology perspective there isn't much difference as long as you can guarantee transparency from a SAN and Network perspective (see the SAN Considerations and Networking Considerations sections). This is to warn you that you might find other documents that distinguish between a Campus implementation and a Metropolitan implementation. While there are indeed differences between the two implementations (especially in terms of the disasters they might be able to address), there is not a big technology difference, given the note above.
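The distance/latency trade-off above can be made concrete with simple physics: light in fiber travels at roughly 200,000 km/s, i.e. about 5 microseconds of one-way propagation delay per kilometre, and a synchronous mirror pays that delay (at least) twice per write. A minimal sketch, under those stated assumptions (the constant and round-trip count are illustrative, not vendor figures, and real links add switch and protocol overhead on top):

```python
# Illustrative sketch: propagation delay a synchronous mirror adds to each
# write as a function of distance. Assumes ~5 us per km one-way in fiber.

US_PER_KM_ONE_WAY = 5.0  # approximate speed of light in fiber

def sync_write_penalty_us(distance_km: float, round_trips: int = 2) -> float:
    """Extra latency (microseconds) a synchronous mirror adds to each write.

    A mirrored write typically needs at least one round trip to the remote
    array (data out, acknowledgement back); some protocols need two.
    """
    return distance_km * US_PER_KM_ONE_WAY * 2 * round_trips

if __name__ == "__main__":
    for km in (0.3, 2, 10, 100):
        print(f"{km:>6} km -> +{sync_write_penalty_us(km):,.0f} us per write")
```

A few hundred meters adds mere microseconds; at 100 km the propagation delay alone reaches milliseconds per write, which is why the hosts start "screaming" well before that unless you spend on latency-lowering technology.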
3.3 Globe Architecture
This is the third and last high-level scenario. I think it would be fair to describe this as a Disaster/Recovery solution, and many - if not all - would agree with that definition. As you might gather from the picture, it's similar to the Campus solution in that it involves the usage of 2 storage servers, but it is radically different in what you would expect it to deliver. In a nutshell, this is not to be intended as a fully automated solution that automatically reacts to a sudden crash of a site (or of a storage server). Ironically, as you will note hereafter, the Globe solution is somewhat easier (easier doesn't always mean cheaper) to implement than the Campus solution. In fact you will see that there are many fewer requirements and prerequisites than in the Campus setup. Similarly to the other solutions, while this one is specifically oriented and suited to long distance D/R scenarios, nothing stops you from implementing these recovery algorithms at the Campus or even at the Building level. In fact, in some situations where the Campus scenario (as described above) is either impossible or very difficult to achieve from a technical (and cost) perspective, some customers have defaulted to technologies better suited for long distance D/R requirements and implemented them at the Campus (or even at the Building) level.
What I refer to here as the Globe solution is typically associated with the following business targets:
- The site is the failover unit. The RTO is a function of the time it takes to acknowledge the disaster, command a failover process and, eventually, the time it takes to implement the failover process (manually or semi-automated).
- Storage arrays are (typically) kept in sync by means of either asynchronous built-in mechanisms or third party software utilities (replica of snapshots via IP). The so called RPO might be milliseconds, seconds, minutes or hours depending on the replication mechanism.
Because of the above, the technical requirements are less stringent than for the Campus solution:
- While a stretched Layer 2 LAN might help in the recovery phase (i.e. no re-IP of virtual machines), recovery procedures should take that step into account if a stretched Layer 2 LAN is not in place.
- Only the hosts in a site have visibility of the storage in the same site.
- While a single stretched cluster might be implemented, more commonly independent clusters should be implemented for better on-going management in a Globe scenario.
It is important to understand, and this is a big source of confusion in the industry, that a D/R solution is not meant to ensure automatic recovery of a service/workload (or multiple services/workloads) should a non-redundant component of the production infrastructure fail. That's what a Building or a Campus solution is meant to do. D/R is in fact more than just an IT term: it's a set of processes (along with technologies) that allows an organization to recover, in a fully controlled way with limited automation, from a catastrophic event such as a terrorist attack or a natural disaster. D/R is not the kind of thing where... you wake up one day and you find out that all your VMs are running in the Minneapolis data center (failed over automatically from the production site in New York) just because you lost power to a control unit.
D/R is the kind of thing that potentially forces you into meetings for hours, if not days, to decide whether or not you should fail over to the remote site. I have seen customers implement D/R plans that accepted remaining off-line for a few days without even considering a failover, as it was determined that being (partially) out of business for a few days would have had fewer implications than a failover (with all its logical and physical consequences) would have had. This is obviously an extreme case, but it should give you an idea of what D/R really is. Smaller organizations may have a faster reaction, though. As a matter of fact, it is these operational characteristics (i.e. manual restart, independent clusters) that define a solution as a Globe type of solution. While we have stated that Globe scenarios are usually associated with Asynchronous and/or Snapshot storage replication, this is not what defines a Globe solution. The replication mechanism pertains more to the integrity of the data than to the scenario overall. There are indeed technologies that are able to support Synchronous replication at distances greater than 100Km, but this doesn't automatically mean that you are in a Campus type of scenario. Always remember that the replication mechanism (alone) is not what characterizes a Globe solution. While a Campus solution is pretty fixed in terms of requirements, a Globe solution is typically more flexible in how you implement it. In fact the RPO in a Globe scenario might vary between 0 and 24 hours, depending on your needs (and the amount of money you can spend).
3.4 Hybrid Designs
There are many situations - specifically in enterprise organizations - where multiple redundancy levels are needed for the infrastructure. This is where hybrid scenarios can be used. A typical example of this is an organization that would benefit from adopting a Campus-like solution with the highest levels of SLAs (RPO = 0, RTO close to 0) for its main datacenter - which might encompass multiple buildings in a single metropolitan area - as well as a solution at the Globe level for the most disruptive and catastrophic
events. The picture below describes this example. Again, the Globe scenario is usually associated with Asynchronous replication, but there might be situations where you may want/need both the Campus and the remote D/R site to be strictly aligned with RPO = 0.
3.5 Networking Considerations
This document focuses on storage architectures, but it is important to spend a few words on the networking implications as well. I think this concept should be pretty clear at this point, but let's set the stage again: while networking connectivity might not be a problem for the Building and Campus scenarios, it might be an issue in Globe scenarios. As we pointed out, for Building and Campus scenarios one of the requirements is that all hosts share a common Layer 2 LAN. This obviously doesn't mean that each host can only have one NIC: it means that a given network connection on a single server is Layer-2-connected to the same connection on all remaining servers in the cluster. Example: the NIC(s) dedicated to virtual environments need to be on the same Layer 2 network across all physical servers in the cluster. This is what allows you to transparently fail over any workload to any server in the (stretched) cluster. If this requirement is not satisfied, you are automatically in the Globe type of scenario because, as we said, Layer 2 connectivity is a must-have for Building and Campus scenarios. Now this is when it all starts to get a bit tricky. Normally it would be difficult to span Layer 2 networks across the globe with a common IP addressing schema. Having said this, it is entirely possible using the latest technologies available (note: possible doesn't necessarily mean easy). There are some networking technologies, and particularly networking configurations, that allow what's referred to as transparent bridging across otherwise routed (Layer 3) networks. So, for example, the routers that connect different networks could be made to work in a way that transparently bridges LAN1 and LAN2 (see picture). If this happens, even assuming LAN1 is in New York City and LAN2 is in Minneapolis, you can take a host (or a VM) with a fixed IP address and move it from LAN1 to LAN2 and vice versa regardless. This Cisco Technote might be a starting point if you are interested.
If you are in this situation then the Globe scenario becomes easier to implement, as the server administrator doesn't need to bother with a different IP schema for the VMs being failed over to the remote site. There are obviously other challenges, but re-IPing is certainly one of the toughest (if not the toughest) things to deal with in a global failover. If you are NOT in this situation then the Globe scenario might become a bit more problematic. You, as the server administrator, also have to take into consideration the right mechanisms and procedures to migrate the current IP schema associated with the VMs in the production site to the new IP schema that is available in the remote D/R location. This is because, in this case, Router A and Router B don't bridge but actually route - Layer 3 traffic flows from LAN1 (which is on a given IP network) to LAN2 (which is on a different IP network). This isn't a trivial task at all since, beyond the challenges and risks associated with re-IPing all VMs at failover time, it might have other implications at the application level: in fact there might be some applications that are "sensitive" to IP address changes. Because of this, some customers I have been working with decided to take a third approach. Instead of creating this stretched network configuration (which was impractical for them) and instead of having to re-IP all VMs (a big nightmare)... they decided to reconfigure the whole network infrastructure at disaster-time. During normal operations LAN1 and LAN2 would be on different network schemas but, if a disaster occurs, the networking equipment at the D/R location gets reconfigured with a backup of the networking configuration of the production site. In that case VMs can be restarted at the remote location without the server admin having to change anything in the VM configurations.
Similarly, the networking administrators didn't have to bother with the complexity of building a stretched LAN across the two sites using the bridging technology we referred to. The net of this is that there is no free lunch. There is complexity associated with this global networking configuration and it needs to be solved somehow: whether it is the server administrator that solves it (reconfiguring the IP of the workloads) or the networking administrator (either using the stretched Layer 2 configuration or through a networking reconfiguration at disaster-time), you have to plan for it.
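To make the re-IP option concrete, the mapping between production and D/R addressing can be captured in a simple translation table that failover runbooks (or scripts) apply to each VM. This is a hypothetical sketch: the subnets and the helper function are made-up illustrations, not part of any specific product:

```python
# Hypothetical sketch of re-IPing VMs at failover time: translate each VM's
# production address into the equivalent address on the D/R subnet.
# The subnet values below are made-up examples.
import ipaddress

PROD_NET = ipaddress.ip_network("10.1.0.0/24")   # LAN1 (production site)
DR_NET = ipaddress.ip_network("10.2.0.0/24")     # LAN2 (D/R site)

def reip(prod_ip: str) -> str:
    """Map a production IP to the same host offset on the D/R subnet."""
    offset = int(ipaddress.ip_address(prod_ip)) - int(PROD_NET.network_address)
    return str(ipaddress.ip_address(int(DR_NET.network_address) + offset))

if __name__ == "__main__":
    for vm_ip in ("10.1.0.21", "10.1.0.22"):
        print(vm_ip, "->", reip(vm_ip))   # e.g. 10.1.0.21 -> 10.2.0.21
```

The real pain, as noted above, is not the address arithmetic but everything around it: DNS, firewall rules, and applications that are "sensitive" to IP address changes.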
3.6 SAN Considerations
Similarly to the network considerations above, SAN design is not the ultimate goal of this document, yet it is important to set the stage properly. Someone might have gathered at this point that the major difference between a Campus and a Globe solution is the architecture that describes the way storage is accessed. Simply put, if my SAN characteristics are good enough that any server can access any storage, then I am in a Campus-like scenario. Alternatively, if storage access attributes (latency and bandwidth) vary depending on the location, then I am more likely in a Globe type of solution. The interesting thing is that, while distance plays a big role, it's not the only discriminating factor: in fact, depending on the budget I have, I could create a 10Km connection between two sites that would appear "transparent" to the server and storage infrastructure (i.e. logically one single physical site), or I could create a 2Km connection between two sites that would look like two distinct and different regions from a bandwidth and latency perspective. You might wonder where the trick is. Well, the trick is the money (as usual): the cost of implementing the former is orders of magnitude higher than the cost of implementing the latter. So don't be surprised if you ever bump into Campus-like scenarios where the two sites are 10Km apart and/or Globe scenarios where the two sites are just 2 or 3Km apart. Remember: distance is only one aspect of the limitation... with the right amount of money you can buy (almost) anything in this world. While the SAN purists may kill me for the following oversimplification, let's assume there are fundamentally 3 different methods to "extend" a SAN.
The first method is to simply use longer FC cables, which might allow you to stretch a fabric across two buildings. We are talking about a distance of 300 / 500 m depending on the speed you want to use (1Gb/s vs 2Gb/s vs 4Gb/s). In this case it's the end-user that is responsible for properly deploying and cabling the infrastructure. The second method is using so-called Dark Fibers / DWDM. In this case you wouldn't be deploying cables; you would actually rent an already deployed cable (or a fraction of it). While this solution might provide the transparency described in the paragraph above, it does have an associated cost which might be prohibitive for many customers. More commonly this is a technology used for Campus type scenarios with sites dispersed at the metropolitan level. The third method available is using Multi Protocol Routers that are able to encapsulate the FC traffic into IP frames. This way you can stretch your fabric across much longer distances (leveraging IP WANs) at a much lower cost compared to Dark Fibers. Obviously the usage of FCIP (Fibre Channel over IP) techniques has implications in terms of bandwidth and latency (specifically latency) that make it unsuitable for Campus-like scenarios (even at short distances). The good thing is that the limit is the globe itself in terms of how dispersed the sites can be.
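The three extension methods above can be summarized as a simple decision rule driven by distance (with budget as the other axis). The thresholds in this sketch are the rough figures quoted in the text and should be treated as illustrative, not as vendor-certified limits:

```python
# Illustrative decision sketch for SAN extension options, using the rough
# distance figures discussed above (not vendor-certified limits).

def san_extension_options(distance_km: float) -> list[str]:
    """Return the SAN 'stretch' methods that plausibly fit a given distance."""
    options = []
    if distance_km <= 0.5:                 # ~300-500 m with plain FC cabling
        options.append("long FC cables (stretched fabric)")
    if distance_km <= 100:                 # metropolitan reach, high cost
        options.append("dark fiber / DWDM")
    # FCIP works at any distance but adds latency, so it suits Globe scenarios.
    options.append("FCIP multi-protocol routers (any distance, higher latency)")
    return options

if __name__ == "__main__":
    for km in (0.3, 10, 500):
        print(f"{km} km: {san_extension_options(km)}")
```

Note how FCIP is always "available" but, as the text says, its latency profile makes it unsuitable for Campus-like scenarios even when the raw distance would allow the other options.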
3.7 Conclusions and Summary Table

Building:
- Typical Cost: Cheap
- Implementation Effort: Easy
- Typical Distance Between Storage Servers: Not Applicable (Single Storage)
- Typical RPO on Storage Failure: Not Applicable (Single Storage)
- Typical RTO on Storage Failure: Not Applicable (Single Storage)
- Typical Storage Configuration: All Servers See the Single Storage
- Typical Network Configuration: Layer 2
- Typical Servers Configuration: Single Cluster
- Can Survive a Single Storage Controller Crash: Yes
- Can Survive Dual Storage Controllers Crash: No
- Automatic Recovery on Storage Failure: No
- Automatic Recovery on Building Failure: No

Campus:
- Typical Cost: Expensive
- Implementation Effort: Very Difficult
- Typical Distance Between Storage Servers: It Varies (from ...)
- Typical RPO on Storage Failure: It Varies (from ...)
- ...

Globe:
- ...
point where your transaction doesn't exist anymore (as it got lost during the disaster) but your IT counterparts assume your transaction has completed and have it in their records. While it doesn't exactly work this way, a good example might be a money transfer from one bank account to another. Imagine a distributed transaction occurs between two banks (A and B) where $500 is moved from account A1 (A1 = A1 - $500) to account B1 (B1 = B1 + $500). All went well. Should a disaster occur a few minutes later at Bank A, most likely you will have your transaction record lost (A1 = A1 - $500) while Bank B still has its record in the system. While, as I said, it doesn't really work like this, it should give you an example of how Data Loss issues might turn into Data Integrity issues as well.
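The bank example can be turned into a tiny simulation: Bank A replicates asynchronously and loses its most recent committed write in the disaster, while Bank B keeps its side of the transfer. All names and balances here are illustrative, and real banks obviously use reconciliation processes rather than this simplistic model:

```python
# Illustrative simulation of the money-transfer example above: asynchronous
# replication at Bank A loses the last transaction, while Bank B's record
# survives, leaving the two institutions inconsistent.

bank_a_primary = {"A1": 1000}
bank_a_replica = {"A1": 1000}   # async copy, lags behind the primary
bank_b = {"B1": 2000}

# Distributed transaction: move $500 from A1 to B1.
bank_a_primary["A1"] -= 500
bank_b["B1"] += 500
# ...disaster strikes Bank A before the async replica catches up...

# Recovery at Bank A falls back to the (stale) replica.
recovered_a = dict(bank_a_replica)

print("Bank A after recovery:", recovered_a["A1"])   # the debit is lost
print("Bank B:", bank_b["B1"])                       # the credit survives
print("Money 'created' by the data loss:",
      recovered_a["A1"] + bank_b["B1"] - (1000 + 2000))
```

The data loss at Bank A (an RPO problem) has become a data integrity problem between the two banks: the books no longer balance, even though each system is internally consistent.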
4.1 Asynchronous vs Synchronous Replication
Synchronous replication is the preferred method for storage synchronization when data consistency / accuracy is mandatory. The idea is that the data needs to be physically written on both storage devices before control (and the I/O acknowledgement) can be returned to the application. Asynchronous replication has different meanings depending on who you ask. For some people, Asynchronous replication means that the server writes the data to its active storage device and, once the transaction is on disk, gives control back to the application. Almost instantaneously (but with an obvious slight delay) the active storage sends the update to the secondary storage. The delay is usually on the order of milliseconds / seconds (depending on the latency and bandwidth between the sites). The delay is so low that many vendors call this technique Near-Synchronous replication. There is another option that is sometimes used, which is what I refer to here as Snapshots. The idea is to have a software utility (or the storage firmware) take a snapshot of the minidisk and send it over (typically via TCP/IP) to the remote site. Usually the RPO in this case is between 1 hour and 24 hours (depending on the D/R policies). This technique is also referred to as Asynchronous replication by the same vendors that use the Near-Synchronous replication term. Make sure you check with your storage vendors and agree on the terms being used in the discussion. The following table summarizes the characteristics of these approaches.
Synchronous:
- Also Referred To As (by some HW vendors): Synchronous
- Typical Scenario: Campus
- Typical Distance: 500m / 10Km (up to 100+Km but with HIGH costs)
- Requires a Stretched SAN: Yes (Dark Fiber / DWDM)
- RPO Characteristics: RPO = 0

Asynchronous:
- Also Referred To As (by some HW vendors): Near-Synchronous
- Typical Scenario: Globe
- Typical Distance: Unlimited
- Requires a Stretched SAN: Yes (Dark Fiber / DWDM / FCIP)
- RPO Characteristics: RPO > 0 (ms / seconds / minutes)

Snapshots:
- Also Referred To As (by some HW vendors): Asynchronous
- Typical Scenario: Globe
- Typical Distance: Unlimited
- Requires a Stretched SAN: No (replica is typically via TCP/IP)
- RPO Characteristics: RPO > 0 (typically 1 to 24 hours)
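The difference between the synchronous and asynchronous write paths can be sketched in a few lines. Storage arrays obviously do this in firmware, so this is purely a conceptual model: the dictionaries and the queue stand in for the two arrays and the replication link:

```python
# Conceptual sketch of the two replication write paths described above.
# 'primary' and 'secondary' stand in for the two storage arrays.
from queue import Queue

primary, secondary = {}, {}
replication_queue: Queue = Queue()   # models the async replication link

def write_synchronous(key, value):
    """Control returns only after BOTH arrays have the data (RPO = 0)."""
    primary[key] = value
    secondary[key] = value           # the remote ack is part of the I/O path
    return "ack"

def write_asynchronous(key, value):
    """Control returns after the local write; the replica is shipped later."""
    primary[key] = value
    replication_queue.put((key, value))   # drained "ASAP" in the background
    return "ack"                          # secondary may briefly lag (RPO > 0)

def drain_replica():
    """Background replication: ship pending updates to the secondary."""
    while not replication_queue.empty():
        k, v = replication_queue.get()
        secondary[k] = v
```

The synchronous path pays the remote round trip on every acknowledgement (hence the distance limits discussed earlier); the asynchronous path acknowledges immediately, and whatever is still sitting in the queue when disaster strikes is exactly the RPO you lose.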
4.2 Data Consistency: Not Only a Technical Issue
It's important to call this out. While this document is very focused on the technology aspects of the High Availability and D/R concepts related to the storage subsystem in a virtualized environment, the actual implementation has to include important additional aspects such as the organization's internal processes. Technology is just one little piece of what allows a company to restart its business operations should a complete disaster hit its headquarters. As you might gather, this is more important in D/R (i.e. Globe - long distance) type scenarios, where the organization is faced with multiple issues. The operations that need to be put in place to restart the business are one aspect, but there are others, such as the quality and the alignment of the data, so that the data is not just consistent from a technical perspective but also from a business-view perspective. Technology can only go so far. There is no single technology (or mix of them) that could automagically solve all D/R associated problems without proper planning and operational procedures.
5. High Availability and D/R at the "Virtualization layer" Vs "Virtual Machine layer"
This document has a somewhat limited scope. It will not, in fact, discuss storage HA and D/R techniques that can transparently be implemented within guest Operating Systems running on top of the hypervisor (as if they were physical systems). Throughout the document, and specifically when talking about the actual product implementations, I will not take into account solutions that could either be implemented on physical servers running Windows / Linux or within virtual machines running the same base software. Basically I am not taking into account solutions that can be implemented within a guest simply because a partition can mimic the operation of a physical server (which, in retrospect, was the initial value of virtualization). Typical examples are some Lotus Domino or MS Exchange clustering options that can do away with shared storage scenarios and have replication mechanisms that keep different mail servers aligned by means of native application replicas. Similarly, other topologies I am not covering in this document are all those solutions that - independently from the applications - are able to interact at the Guest OS level and provide remote data copies as well as optional automatic restarts. These fall into the same category I have described above, which is: software that has been designed to work on physical servers running Windows / Linux and that happens to run similarly on Guest Operating Systems running on top of a given hypervisor. A good example of this kind of solution is Double-Take for Windows, a host level storage and server high availability solution that is able to keep one or more Windows servers in sync through IP data replication and has built-in high availability features to restart services and applications should a system fail. As you can see from the picture, if you install this software stack into a couple of Windows Guests running on your hypervisor of choice, nothing really changes.
The biggest difference is that you no longer leverage a physical box (the "Hardware Layer" blue brick in the picture) but rather a dedicated partition running the same software. While it's perfectly feasible to use such a scenario in a virtualized environment, I truly believe that this sort of infrastructure service should be provided at the basic infrastructure level rather than being implemented within each of the guests. I made a similar point months ago when talking about the nature and the architecture of server high availability solutions in a Building type of environment. You can read that article here. The same concept can be applied to the storage high availability issues for both Campus and Globe scenarios. While moving these infrastructure services to the virtualization layer might sound more complicated than mimicking in the Guest what you could do on physical servers, it does have a strategic value associated with it. Think, for example, of a scenario where you have 100 OS guests supporting 100 heterogeneous applications running on 10 physical hosts (a conservative 1:10 consolidation ratio and a realistic 1:1 application/OS ratio are assumed). Should you implement a traditional Double-Take solution at the guest OS level, most likely you would end up creating 100 clusters to protect 100 applications across 200 partitions (2 VMs per cluster). If, on the other hand, you move the high availability and replication algorithms to the virtual infrastructure level, you keep the 100 partitions unaltered and you only enable the HA and disk syncing functionalities on a single 10-host cluster (i.e. the physical boxes that comprise the virtual infrastructure). Perhaps it is for this very reason that Double-Take came out with newer versions of their solution to address specifically this scenario.
The products are called Double-Take for Hyper-V and Double-Take for VMware Infrastructure: they essentially do what's described in the picture above with the only "slight" difference that they treat the hypervisor as the host OS, essentially targeting a step below in the entire stack.
For convenience, the picture shows the architecture of the VMware Infrastructure version. Note that, in this case, the Double-Take product is deployed as an external component (due to the fact that VMware doesn't encourage ISVs to develop code against the local ConsoleOS, and in the case of ESXi it would be even more challenging). For Hyper-V the layout might be different, in the sense that Double-Take would be installed on the local servers (in the parent partition). As you can guess from the picture, Double-Take works with the minidisk files (vmdk files in ESX language) without caring about what's actually installed and what actually runs inside those images. I have used Double-Take as a mere example. More information about the tools and specific technologies used to accomplish the goal will follow in the implementation sections. I want to stress again that it might make perfect sense - in some circumstances - to implement a partition-based HA / replication strategy. Having said this, I strongly recommend readers evaluate a longer term plan (with all the associated benefits) where the infrastructure functionalities are delivered at the virtualization layer and not within the Guest OS. This is the scenario that this document covers anyway.
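Going back to the 100-guest example a couple of paragraphs above, the arithmetic can be sketched quickly. This is my own back-of-the-envelope illustration; the ratios are the ones assumed in the text:

```python
# Entities to deploy and manage with guest-level clustering (e.g. a
# Double-Take pair inside every VM) versus HA at the virtualization layer.
# Assumed ratios from the text: 1:10 consolidation, 1:1 application/OS.
hosts = 10
apps = hosts * 10                      # 100 applications, one OS each

guest_level_clusters = apps            # one 2-node cluster per application
guest_level_partitions = apps * 2      # active + standby VM for each app

infra_level_clusters = 1               # a single cluster of the 10 hosts
infra_level_partitions = apps          # the 100 partitions stay unaltered

print(guest_level_clusters, guest_level_partitions)   # 100 200
print(infra_level_clusters, infra_level_partitions)   # 1 100
```

The point is not the exact numbers but the scaling: the guest-level approach multiplies the managed entities with every protected application, while the infrastructure-level approach keeps one HA configuration regardless of how many guests run on top.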
As a reminder, please keep in mind that while most of the implementation details I will discuss hereafter have been successfully implemented in the field and are generally fully supported by all vendors, there may be other implementations that are "technically possible" but are not automatically blessed and certified by the various vendors. Some vendors might also have public and general (negative) support policies with ad-hoc deviations, by which they might selectively support a given implementation on a customer by customer basis. This document is intended to provide an overview of how various technologies could be tied together to achieve a certain goal. Always refer to your vendors and/or systems integrator for confirmation of what is and is not officially supported.
6 . 1 Building Implementations
This is going to be a very short sub-chapter. Since the Building Architecture doesn't include any dual storage configuration, there is no HA or D/R storage implementation to describe either. Yet, as I pointed out, this is how the vast majority of the VMware and Microsoft customers out there have implemented their server farms. This configuration is the bread-and-butter of virtual environments, which usually require shared storage implementations to exploit their full potential (such as being able to move a workload from one server to another on-the-fly). Still, if you are to implement a brand new virtualized Building type of infrastructure, you might want to implement it in an open way that allows you to expand it into a Campus or Globe type of deployment, should the need arise.
6 . 2 Campus Implementations
This is when things start to get complicated. We have already discussed the philosophies behind this scenario in the High Level Design Principles section above so we will get straight into the matter here.
6 . 2 . 1 Campus Implementations for VMware Virtual Infrastructures
6 . 2 . 1 . 1 Campus Implementations for VMware Virtual Infrastructures (Host-based Replication)
There is no way, as of the first vSphere release, to implement host-based synchronization across two independent storage devices. In other words, vSphere doesn't yet allow native mirroring of VMFS volumes across different storage devices. Note that there are customers implementing software mirroring within the virtual machine OS, but this is beyond the scope of this document as it's not a hypervisor-delivered feature. There have been speculations in the past that VMware would provide a functionality for ESX administrators to do mirroring across two different VMFS volumes. Since the two VMFS volumes could potentially come from two different disk arrays, this would automatically provide a host-based replication technology out-of-the-box. While I am sure there are still plenty of discussions inside VMware about this (and about the priority to assign to such a feature), today we don't have it available. To make things worse, the fundamental architecture of VMFS (and specifically the fundamental trend/suggestion not to load anything inside the ESX host - ESXi being the extreme example, where loading third party code is basically not even possible) makes it impossible for technology partners to develop a volume manager that extends the default VMware volume manager. This is the main reason for which a Veritas Volume Manager for ESX doesn't exist - as of today at least. It is important to understand that, in this scenario, the two storage devices would not need any sort of replication support. It is in fact the host, which has visibility to both disk arrays at the same time, that instantiates the mirror between the LUNs associated to that host. The disk array might be the dumbest, most limited and cheapest JBOD you can think of.
6 . 2 . 1 . 2 Campus Implementations for VMware Virtual Infrastructures (Storage-based Replication)
So is it possible to configure a Campus type of scenario (as we described it) with VMware vSphere?
Yes it is possible, you just need to off-load the job that the host is supposed to do all the way into the storage layer. This is not as simple as it sounds though.
Fibre Channel (FC)
Most readers at this point would be led to think that you can achieve this with the replication features most mid-range and high-end storage devices make available: things like IBM PPRC, EMC SRDF or Hitachi TrueCopy. Not so easy. Remember the fundamental characteristics of the Campus scenario as we have defined it: 1) all servers can access all storage at the same time and 2) the failover (for servers and storage) is automatic. The technologies mentioned above are able to keep an active LUN on one storage device in sync with a passive LUN on the other storage device, but it is in the automation process that it all falls apart. In fact, should the building with the active storage in the Campus collapse, there are manual recovery steps that need to occur before the virtual machines and their guest operating systems can be restarted:
1. The passive LUNs on the surviving storage need to be reactivated.
2. The ESX hosts need to be restarted or, alternatively, their SAN configuration needs to be refreshed (the LUNs now come from a different storage device with different WWNs - even though the data is the same).
3. The VMs need to be re-registered.
4. Etc.
Another critical aspect to consider is that once a failover has occurred, the failback process might be painful, as it will require a scheduled shutdown of the entire infrastructure to revert the replication and bring the workloads back to the primary data center. You may think of it as a scheduled DR event, in a way. It is obvious that being able to create an active / active storage relationship (i.e. half the LUNs replicate "eastbound" and the other half replicate "westbound") will not solve the fundamental architectural issue we have described here. So, no, this is not what we would usually define as a Campus-like scenario (or a Business Continuity scenario if you
will). The problem is that all these replication solutions assume two independent VMware storage domains, such as a domain A replicating onto a domain B and/or vice versa (depending on whether the configuration is A/A or A/P). You cannot have multiple storage domains simply because the ESX cluster cannot transparently migrate a Fibre Channel connection from one domain to the other automatically (assuming a storage domain could fail over automatically onto its replica - which is not usually the case). Just to make sure we are on the same page, I define a VMware storage domain as a fully redundant storage device with its own standalone identity. The vast majority of the storage arrays available on the market fall under this category: IBM DS3000/4000/5000/6000/8000, IBM XIV, EMC Symmetrix, HP EVA, HP XP12000 etc. So how do I make VMware look at a distributed, stretched and mirrored disk array across different locations as if it was a single storage domain that ESX can transparently deal with? I would ideally need to take one of the fully redundant products mentioned above and be able to "split" it into two different locations. That's the problem: most of the products available in the market are physically monolithic - fully redundant, sure, but they come as a single physical device that you cannot stretch across buildings. There are obviously exceptions to this, and the exceptions are the only products that would allow you to create a VMware server farm stretched in a Campus scenario. Notice that you can stretch your VMware cluster independently; the problem is how to keep a mirrored, stretched copy of the data that is transparently available to the cluster. More by chance than by design, most of the storage virtualization products available in the market do have this characteristic of being modular in nature.
A few examples. While not, strictly speaking, a storage virtualization solution, the NetApp storage servers happen to be engineered with two controllers (aka heads) that can be physically separated. This allows you to achieve the desired result (i.e. stretching). Notice NetApp also supports the usage of standard storage servers as a back-end disk repository (instead of their own disk shelves). This essentially turns the NetApp heads into front-end gateways between the actual storage servers and the VMware hosts; in a way this is indeed storage virtualization, similar to what the IBM SVC does. The IBM SAN Volume Controller (SVC) is in fact the next example; it's a storage virtualization product that comes packaged, as a minimum, as two System x servers (aka SVC appliances). Together they form the SVC cluster (think of the two servers as if they were the redundant controllers in a DS3000/4000/5000 class of devices, or two NetApp heads if you will). The big news is that you can take the servers and stretch them up to a few kilometers apart. The Datacore SanSymphony product works in a similar way. The difference is that the two servers are standard Windows servers on which you install the Datacore product. I am not sure about the distances (which you always need to check with your vendor) but the philosophy is pretty similar: put a server in building A, a server in building B and have VMware vSphere access the storage device uniformly, as if Datacore server A and server B were the two controllers of a traditional storage device. There is a big common question mark when it comes to transparently automating the recovery process in a scenario where you let the storage infrastructure stretch across sites (i.e. what we defined as storage-based replication). This big question mark is the so-called split brain issue. While all the solutions mentioned above can survive single and even multiple component failures, the failure of a complete site is a different thing.
When this happens, the storage components at the surviving site face a big dilemma: the surviving site typically has no way to know whether the problem is due to an actual remote site failure, or just to a temporary/permanent fault in the communication links between the two sites. If it's the latter and the storage servers were to assume that both should be promoted as "surviving site nodes", a split brain would occur, with both sites working separately on their own copy of the data. The problem is how you would then "merge" those data when the communication is re-established (which would be essentially impossible). Each stretched storage solution has its unique way to handle split brain conditions: some solutions can be implemented so that the decision is automatic (which is what a Campus-like scenario would require, as per our description); others can be implemented in a way where this condition stops the operations and manual tasks by an operator may be required. I'll start digging a bit into the NetApp solution. The best way to look at this is that the Primary Node and the Secondary Node are essentially the two controllers of a single storage device (i.e. what I referred to as the storage domain). Because of the modular design, the two controllers can be taken apart (up to 100Km) and configured so that a portion of the disks in the site on the left mirrors onto the disks in the site on the right and vice versa. The NetApp device is an A/P device by nature, where a single controller owns a set of disks at any point in time and allows the other controller to take ownership of them in case of failure. In this scenario a cross A/P configuration has been implemented so that both controllers do useful work during normal operations. The idea is to have, on top of them, a stretched cluster of VMware ESX servers, with the cluster not even noticing that it's dispersed geographically.
In this case, should one of the two sites fail, VMware HA would kick in and restart the partitions on the surviving ESX servers in the cluster. The surviving ESX servers wouldn't even notice the storage issue as they would continue to work on the local copy of the data (after the NetApp heads have transparently completed the failover). The picture on the left outlines the physical components required to set this up.
The picture on the right, on the other hand, reminds the reader of the logical design where, again, there is a single stretched storage device that encompasses both sites in the Campus scenario. For more general information about this NetApp MetroCluster solution there is a very good document here. Specifically for VMware environments there is an even more interesting document here. The latter does a great job at describing the what-if failure scenarios. As you will notice, you can lose multiple components at the storage infrastructure level and still be resilient (and transparent) as far as the stretched VMware cluster is concerned. You are transparently covered in all combinations of failure scenarios except the Building/Site failure; this scenario requires manual intervention due to the potential split brain problem described above. Due to the A/P nature of the NetApp solution the Head in the surviving
site would still be active and servicing hosts in the same site. In practice, half of your infrastructure is still up and running. For the remaining part you have to issue the NetApp failover command to manually switch ownership (once you have verified that you are not in a split brain situation). Then you have to rescan your ESX hosts to re-detect the LUNs and manually restart the virtual machines. For some customers NetApp has developed custom solutions through the use of a "witness" at a third site that can verify whether it's a split brain condition or an actual Building/Site failure and initiate a storage failover automatically. The idea is that the failover would occur within the ESX storage timeouts, making the operation transparent (hence it wouldn't require any manual rescan or manual restart of the guests). Contact NetApp as this is not a standard product offering but rather a services offering. For the IBM SAN Volume Controller (picture on the left) it wouldn't be much different. Architecturally it would be pretty similar, with the difference that, while the NetApp solution has the option of being a single-vendor bundle (i.e. heads + disk expansions or shelves), the SVC solution is more modular by design. Since the primary value proposition of the SVC is to virtualize heterogeneous storage devices, you can (and have to) use any sort of intelligent or semi-intelligent disk array as a back-end data repository. The fact that you can then split the SVC servers comes as a bonus (a very useful bonus when talking about Campus solutions). As of today the SVC public support policy allows you to split the nodes up to 100 meters. Provided the proper bandwidth and latency characteristics are satisfied, IBM allows (on a customer by customer basis) stretching them up to 10Km. The SVC is no exception when it comes to the split brain syndrome.
As opposed to NetApp (which requires an ad-hoc services solution to complement the products) the SVC platform has a built-in mechanism to determine split brain situations. This is done via a quorum disk located in a third site (the "witness") as reported in this IBM Technote.
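In simplified form, the arbitration logic that a third-site witness enables (be it the SVC quorum disk or NetApp's custom witness service) can be sketched as follows. This is my own illustrative reduction, not vendor code:

```python
# Each surviving storage node probes two things: its peer site and the
# third-site witness. Only when the peer is gone AND the witness is still
# reachable can the node safely conclude the remote site really failed.
def arbitrate(peer_reachable: bool, witness_reachable: bool) -> str:
    if peer_reachable:
        return "normal operations"   # links are fine, keep mirroring
    if witness_reachable:
        return "take over"           # remote site is down: promote local copy
    return "freeze I/O"             # we may be the isolated site: stop and
                                    # wait for an operator to avoid split brain
```

Note how a pure inter-site link failure isolates one node from both its peer and the witness, so that node freezes while the other node (which still sees the witness) takes over; this is what prevents both sites from going active on their own copy of the data.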
The Datacore SanSymphony solution would be similar to the SVC from an architecture perspective. At the end of the day they have a similar value proposition and the overall design wouldn't be drastically different. Obviously the implementation details will vary (cablings, maximum distances, etc). I have done some research on how Datacore addresses split brain issues but I haven't found enough at this time to report in this document. I have listed, as examples, three products that address the storage-based replication requirement in a Campus scenario. There may be other products in the industry that address the same scenario.
Serial Attached SCSI (SAS)
The SAS protocol (and the storage products based on it) has the same characteristics we have described for the FC protocol. In addition to that, the SAS protocol (as a mechanism to present storage to a server or a cluster thereof) is targeted at the low-end of the spectrum. As such it's perfectly suited to Building scenarios (i.e. no storage high availability) but not so well suited for more challenging duties such as Campus storage replication. Last but not least there are intrinsic distance limitations built into the SAS protocol that make it impossible to stretch things out as you could do with FC. This is on top of the robustness and reliability characteristics of the SAS protocol, which are relatively low compared to the more mature Fibre Channel protocol. At the time of this writing there are no, as far as I know, SAS storage-based replication solutions that could be leveraged in a VMware environment.
Network Attached Storage (NAS) and iSCSI
Generally speaking, creating a storage-based replication solution which leverages the NAS and iSCSI protocols as a means to connect to servers and clusters is relatively easier compared to Fibre Channel. There are at least a couple of reasons for which this is true:
1. Most IP-based (iSCSI / NAS) storage solutions are built as a software stack running on standard x86 servers.
These solutions are modular by design and it's easy to physically stretch the x86 servers across different buildings. This is in contrast to most FC-based storage solutions, which are usually built as a firmware-based, monolithic (yet fully redundant) proprietary entity that is impossible to stretch (exceptions apply, as we have seen).
2. With TCP/IP it is relatively easy to create failover mechanisms for the storage address (i.e. the storage IP address) compared to what you could do with FC addresses (i.e. WWNs). If your ESX host is connected to an iSCSI target / NFS volume, the storage server may fail over the LUN and the target IP address to another member server of the storage subsystem without the ESX server even noticing. This is not possible with FC, as you can't simply fail over the WWN onto another server (although some niche exceptions may apply).
Because of this, one may expect to find more vendors in the market offering stretched storage-based replication solutions for devices exposing storage via iSCSI / NAS protocols and fewer vendors offering the same capabilities for the FC protocol (as it's more difficult). For the record, the IBM SVC exposes the FC protocol as well as the iSCSI protocol for host connectivity; the NetApp heads can expose additional protocols, so your VMware servers have the option to use FC, iSCSI and NFS (the other available protocols wouldn't be supported by VMware anyway). For completeness, Datacore would only support FC and iSCSI connectivity against the ESX hosts/clusters. Creating a storage-based replication solution that only exposes iSCSI / NAS protocols is comparatively so easy that you can even get very naive about it if you want. You can for example use free open source technologies that turn commodity x86 servers into iSCSI / NAS storage-based solutions with built-in replicas.
As an example, I grabbed the picture on the right from a thread on the VMware community forum. Someone was trying to achieve exactly this using two open source technologies: OpenFiler (to expose the storage to servers via iSCSI / NAS protocols) and DRBD (to keep the storage on the two servers in sync). The idea is that both ESX servers would connect to the 172.16.0.3 IP interface, which in turn exposes storage via iSCSI / NAS protocols for the VMware cluster to use. Since the IP is virtual (i.e. it can fail over from one node to the other - SAN01 or SAN02) and since both storage devices are kept in sync by means of the DRBD algorithms, you can implement a basically free storage-based replication solution for your Campus (provided you do NOT attach everything to a single Ethernet switch as per the picture!). Again, this architecture, like the others we have seen, is very prone to split brain types of scenarios. As an example, digging into the DRBD documentation online, I found out that they have a number of choices that can be configured for when split brain events occur. You can find a brief formal description here. For your convenience I will list hereafter the various options offered:
1. Discarding modifications made on the younger primary.
2. Discarding modifications made on the older primary.
3. Discarding modifications on the primary with fewer changes.
4. Graceful recovery from split brain if one host has had no intermediate changes.
Options #1, #2 and #3 would be very worrying, as you may guess. Option #4 would realistically never apply. In such a scenario, a better option would be to have DRBD stop the I/O operations so that a system administrator can determine the root cause (i.e. split brain or actual site failure). This will cause downtime and a manual activity (not really desirable in a Campus scenario per our description) but, at least, it ensures data consistency.
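For illustration, the policies above map to the after-sb-* handlers in the net section of drbd.conf. This is a hedged sketch - exact keywords and syntax should be checked against your DRBD release - where disconnect is the value that implements the stop-I/O-and-call-an-operator behavior:

```
resource r0 {
  net {
    # zero primaries after split brain: auto-resolve only in the trivial
    # case where one node has had no intermediate changes (option #4)
    after-sb-0pri discard-zero-changes;
    # one primary after split brain: do not auto-resolve, keep the nodes
    # disconnected and let an operator decide which data set survives
    after-sb-1pri disconnect;
    # two primaries after split brain: definitely require manual intervention
    after-sb-2pri disconnect;
  }
}
```

The more aggressive discard-younger-primary, discard-older-primary and discard-least-changes values correspond to options #1 through #3 and automatically throw away one side's writes, which is exactly why they are worrying in this context.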
On the other hand the manual recovery operations (as in the case of the NetApp MetroCluster) are easy to execute compared to more complex DR solutions. On top of this, remember we can't pretend that FC and IP-based solutions are equal in terms of performance, robustness and reliability. I am clearly not referring to the open source solution above, which cannot even be considered beyond a lab exercise (I would think twice before putting something like this into a production shop - no matter the size). I am referring to mid-range and high-end products from Tier 1 vendors: no matter what, the iSCSI and NAS protocols are not seen today as a replacement for Fibre Channel SANs and will not be seen as such any time soon, especially by big enterprises that have invested so much in their FC infrastructure. This is not just a "legacy" discussion however: there are some enterprise characteristics in a FC SAN that cannot be matched by IP-based protocols. Ensure you search through the VMware Hardware Compatibility List (HCL) to find out whether your iSCSI / NAS (and FC) solutions of choice are officially supported.
6 . 2 . 2 Campus Implementations for Microsoft Virtual Infrastructures
6 . 2 . 2 . 1 Campus Implementations for Microsoft Virtual Infrastructures (Host-based Replication)
The picture shows a potential Campus scenario using a third party volume manager. Should one of the two buildings/sites (with 1 storage device + 2 servers) fail, the remaining part of the infrastructure would carry on the work. Consider that all servers have, in this particular case, access to all mirrored LUNs at the same time, thus providing a higher level of flexibility by leveraging a cluster file system (be it Microsoft CSV or Sanbolic MelioFS, depending on the solution you may want to implement). Consider that this configuration is protocol agnostic but only works with block-level access (not file-level access). This is obviously due to the fact that a volume manager assumes access to a raw LUN and not a network mount point to an existing file system on the network. As a result, SAS, FC and iSCSI storage devices would all be capable of implementing something like this. This rules out the
usage of NAS devices. Incidentally, the volume manager architectural discussion aside, Microsoft has so far decided not to support NAS (be it NFS or CIFS) based storage to host virtual machines in Hyper-V environments. While in theory this architecture should be open enough to let you use your clustering software of choice (i.e. Veritas Cluster Server for example), Microsoft seems to be building most of the virtualization oriented functionalities (i.e. LiveMigration and QuickMigration for example) around their built-in clustering technology, making it a de facto must-have in virtualization scenarios.
There is an alternative technology that may allow you to achieve similar results in a Hyper-V context and that's Double-Take GeoCluster. It is a technology that allows you to stretch a Microsoft Failover Cluster (MSCS) leveraging two different storage servers replicated via software mechanisms. It's very similar to the other popular "stretched Failover Cluster" configuration that leverages Storage-based replication (this is discussed in the next section) and the only difference is that, instead of leveraging proprietary built-in storage algorithms, it leverages the Hyper-V hosts to send data (at the byte level) from the active server (and active storage) to the passive server (and passive storage).
As you can see from the picture (example), the Double-Take technology is responsible for keeping the Q:\ and E:\ drives in sync. In fact these are not shared disks (as is often the case in cluster scenarios) but distinct sets of disks coming from distinct storage servers (they may also be local hard drives installed physically in the nodes). In this example there are two virtual machines configured, one on the Q:\ drive (which is owned by Node 1 and replicated to the storage connected to Node 2) and the other on the E:\ drive (which is owned by Node 2 and replicated to the storage connected to Node 1). In case of a failover the VM gets moved to the other host and the active disk follows the VM. Having said this, at the time of this writing, this solution only works without a clustered file system. As is the case for the Failover Cluster integration with Storage-based Replication, this solution does not work with CSV (or any other third party cluster file system). I will update the document accordingly in the future, should these limitations be removed.
Being able to implement a Campus scenario using a host-based storage replication mechanism is very interesting. Not only does it allow for storage device independence (although the level of official support when using heterogeneous storage devices might vary) but, at a minimum, it allows the usage of cheaper storage in the back-end. Sure, one has to take into account the third party volume manager but, as a rule of thumb, this software usually costs less than what you would have to spend for a storage-based replication solution in a Campus scenario. Many SMB customers, where budget is not unlimited, might be attracted by a cheaper solution like this. Another point of attention for this solution is that you need to account for host resources to do the replication dirty work, effectively stealing resources that could have been delivered to the Guest OSes (leading to a potentially slightly lower consolidation ratio).
6 . 2 . 2 . 2 Campus Implementations for Microsoft Virtual Infrastructures (Storage-based Replication)
Storage-based replication solutions are supposed to be software agnostic and transparent to the virtualization solution being used. In fact these solutions just create a physically distributed, single logical view of your centralized storage. Because of this, what we have discussed in "Campus Implementations for VMware Virtual Infrastructures (Storage-based Replication)" is valid for Microsoft Virtual Infrastructures as well. The only exception / difference that you need to take into account is that MS Hyper-V doesn't support the NAS (NFS) protocol to host virtual machines. All other protocols, mechanisms and layouts are similar. As background, it's interesting to notice that some storage vendors have historically implemented a certain level of integration between the Microsoft native server high availability functionality (Microsoft Failover Cluster) and their storage-based replication between two independent storage arrays.
This is similar to the Double-Take GeoCluster integration we have mentioned in the section above. The idea here is that storage devices implementing an Active/Passive LUN replication can be controlled by Microsoft Failover Cluster events: imagine a stretched Microsoft cluster comprised of two servers with two storage devices in replica (by means of their hardware functionalities). Server A would work against storage A supporting a business critical clustered application. A failover of the application would not only move all the processes to the other host but also invert the storage replication settings. Simply put, this integration is basically a resource DLL that gets registered into the Microsoft cluster engine and that fails over the LUN (i.e. it changes the direction of the replica) along with all the other resources within the Cluster Resource Group. The picture below should explain what happens when a system administrator forces a
failover at the Microsoft Failover Clustering level (note the red arrows and the active/passive LUNs):
This level of integration works in a non-virtualized scenario (where the clustered application is typically installed locally on each node and the LUN contains the persistent data the application needs to use). Official certifications aside, it may also work in pre-Cluster Shared Volumes types of Hyper-V deployments where the LUN becomes the unit of failover along with the virtual machine(s) instantiated on that LUN (typically one VM, but more VMs per LUN was an option too). Nowadays the usage of a cluster file system (be it Microsoft's own CSV or the Sanbolic MelioFS) breaks this architecture. In fact, generally with a cluster file system, all Hyper-V hosts have concurrent access to the LUN while maintaining the flexibility to spread the load of virtual machines across all hosts in the cluster. In this (advanced and desirable) scenario the unit of failover becomes the VM, as the file system remains concurrently shared across all hosts (this is true for simple Building scenarios where storage-based replication is not being considered). At the time of this writing this solution (GeoCluster) is incompatible with Hyper-V R2 implementations that use a Cluster Shared Volume scenario (or any other cluster file system).
6 . 3 Globe Implementations
As with the Campus (aka Continuous Availability) scenario, the Globe scenario may be implemented using different technologies. As a reminder, Globe implementations are typically associated with long distances, with semi-automated recovery procedures and with an RPO that is not 0 (the actual RPO depends on the technologies being used).
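As a back-of-the-envelope illustration of that non-zero RPO (my own sketch, not tied to any specific product): with interval-based asynchronous replication, the worst-case data loss is roughly the replication interval plus the time needed to ship the last delta.

```python
def worst_case_rpo_minutes(interval_min: float, transfer_min: float) -> float:
    # A change made right after a replication cycle starts is only protected
    # once the NEXT cycle has completed: interval + transfer time.
    return interval_min + transfer_min

# e.g. deltas shipped every 15 minutes, 5 minutes to transfer one delta:
print(worst_case_rpo_minutes(15.0, 5.0))  # 20.0
```

Compare this with the Campus scenario, where synchronous mirroring keeps the RPO at 0 at the cost of distance and latency constraints.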
different levels of automation: some products might implement a red button for each machine whereas others might implement a site-wide red button. It is also important to note that we are not talking about a cluster stretched across multiple physical locations, as opposed to the Campus scenario we have discussed. We are talking about two loosely coupled datacenters, each with its own "VMware identity". In a way you can think of this scenario as two independent Building implementations (i.e. two independent VMware clusters) tied together by a third party software that does #1 and #2 discussed above. For those of you who think that a picture is worth 1,000 words, this is an example of how the Veeam LAN-replication architecture looks:
Double-Take for VMware Infrastructure and Vizioncore vReplicator have a similar architectural implementation. Note that I have always talked about two separate data centers and logically separated clusters, whereas in this picture the "recovery site" only shows up as a single ESX host with local drives. On one hand the picture is an oversimplification of a D/R architecture that could be far more complex (for example with a SAN at the recovery site). On the other hand it shows how flexible and adaptable these D/R techniques are to many circumstances (even the smallest). It is important to remember that these products (and most of the other third party products with similar functionality) work with VMware-based software snapshots. Since they are storage agnostic, they leverage the software implementation of snapshots that VMware has built into ESX. For this reason particular attention needs to be paid to the VMware infrastructure configuration. This ties back to the introduction at the very beginning of this document (Background: Storage Management Principles in a Virtual Environment). The usage of minidisks not only allows for more flexibility, generally speaking, but is also the only configuration that supports VMware software snapshots. This means that only VMFS partitions with VMDK files and RDMs in virtual compatibility mode are supported. RDMs in physical compatibility mode are usually not supported (as they cannot be used as a source for snapshots). This is one of the many reasons why the usage of standard VMFS volumes and minidisks is so popular. As VMware introduced new Storage APIs for Data Protection with vSphere, some of the requirements and characteristics above may change. As far as data consistency is concerned, these utilities are typically designed to interface with Virtual Center, which in turn calls the VMware Tools components to flush the OS cache before taking a snapshot.
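The replication cycle just described (quiesce via VMware Tools, take a software snapshot, ship the changed blocks, release the snapshot) can be sketched structurally as follows. The step functions are illustrative stand-ins for what a real tool would drive through the vCenter APIs, not an actual VMware or third-party interface:

```python
# Structural sketch of one host-based replication cycle. The step functions
# are illustrative stand-ins, not a real VMware or third-party API.

def quiesce(vm, log):       log.append(("quiesce", vm))   # VMware Tools flushes the OS cache only
def snapshot(vm, log):      log.append(("snapshot", vm))  # needs VMDK or RDM in virtual mode
def ship_delta(vm, log):    log.append(("ship", vm))      # copy changed blocks to the DR site
def drop_snapshot(vm, log): log.append(("drop", vm))

def replicate(vm, log):
    # In-guest applications (SQL Server, Oracle, Exchange...) are NOT flushed,
    # so they land at the DR site in a crash-consistent state.
    quiesce(vm, log)
    snapshot(vm, log)
    try:
        ship_delta(vm, log)
    finally:
        drop_snapshot(vm, log)  # always release, or the snapshot redo log grows unbounded

log = []
for vm in ["sql01", "exch01"]:
    replicate(vm, log)
print(log)
```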
This ensures that the snapshot being sent to the DR site is consistent from an OS perspective, but the same cannot be said for the applications running inside the virtual machines. In fact these applications (for example Microsoft SQL Server, Oracle Database, Microsoft Exchange etc.) may be using their own memory caching mechanisms, which are not integrated with the flush described. At the time of this writing the VMware Tools do not integrate with software other than the operating system, so, at the recovery site, you will find your third party applications in a "crash consistent" state. Operating System-delivered services would be in a "quiesced" state though. Some vendors may provide a replication product that integrates better into the application stack, which may result in a better consistency state at the application level as well (they typically achieve this through Microsoft VSS integration or other application-aware techniques). Additionally, since this solution is an "abstraction" on top of an existing standalone VMware cluster (i.e. it doesn't integrate with any specific storage architecture), it works with whatever storage architecture the source (production) data center and destination (DR) data center have been built on. This includes Fibre Channel, SAS, iSCSI and NAS types of technologies. Also, because these products are storage agnostic, you can use a mix of these technologies to implement the production and DR data centers.
6 . 3 . 1 . 2 Globe Implementations for VMware Virtual Infrastructures (Storage-based Replication)
This is VMware's bread and butter and where most VMware customers would "default" when talking about resilient storage designs. As you can see from the complexity of this document this is certainly not the only option available, but it is certainly the most recurring one in storage high availability discussions where the VMware platform is being considered.
The idea is similar to that described in the Host-based Replication section, with the difference that the replication is done by the native storage replication features rather than by external third party software. This technique is not to be confused with the technologies we have explored in the Campus storage-replication scenario. The Globe scenario, by design, requires two independent and different sites (i.e. not a single stretched cluster); for this reason it is fine to have two independent storage domains (aka storage servers) replicating the data. See the discussion in Campus Implementations for VMware Virtual Infrastructures (Storage-based Replication). Most, if not all, midrange and enterprise storage arrays on the market have this functionality available (typically, but not always, at an add-on cost). The idea is that the disks in the production datacenter are replicated through proprietary storage replica algorithms onto a companion array at the DR site (typically the same model, and almost always from the same vendor, unless a certain level of storage virtualization is in place). Should the production site experience an outage, the storage at the DR site (in Read-Only mode during normal operations) is brought online in Read-Write mode and the local/independent cluster at the DR site is configured to read the now available LUNs and restart the workloads (i.e. the virtual machines). At first this looks like an easy operation. However, if you think it through, you will appreciate that exporting & exposing storage-replicated VMFS volumes to a different cluster of hosts is not a trivial task. Similarly, restarting all VMs in the DR data center is not trivial either. In order to address all these challenges and make the administrator's life easier during a failover event, VMware has created a product called Site Recovery Manager (SRM).
It is not the intent of this document to get into the details of what Site Recovery Manager does. It is a well known solution in the VMware community, and this document is more intended to shed some light on the "unexplored" rather than reiterating the obvious and the well-known. However, if you are new to this and want some insight into how it works, you may appreciate a blog post of mine that describes the manual steps to achieve all this, as well as where VMware Site Recovery Manager ties in for such a scenario.
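For the record, the manual sequence that SRM packages up can be sketched as an ordered runbook. The step descriptions below are my own summaries of that procedure, and the tiny runner is illustrative only (each step would map to array CLI and vCenter actions in practice):

```python
# The manual storage-based failover that SRM automates, as an ordered runbook.
# Step names are descriptive summaries, not commands of any specific product.

RUNBOOK = [
    "break the mirror / promote the DR LUNs to Read-Write",
    "rescan the HBAs on the DR-site ESX hosts",
    "resignature (or force-mount) the replicated VMFS volumes",
    "register the .vmx files found on the mounted volumes",
    "power on the VMs in the order the recovery plan dictates",
]

def fail_over(runbook, execute):
    """Run every step in order; a real runbook would stop and alert on failure."""
    done = []
    for step in runbook:
        execute(step)
        done.append(step)
    return done

completed = fail_over(RUNBOOK, execute=print)
```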
This is done using the LSI storage as the building block, but it will provide you with a solid background on what happens no matter what storage (and proprietary replication) you use, be it LSI, EMC, HP or anything else. I am only attaching a couple of pictures from that document for those who are too lazy to check. They show the potential logical architecture and the corresponding physical deployment:
Note that, for the sake of simplicity, during the test we opted to host, on the same virtualized infrastructures, all of the components that are supposed to manage and protect them (i.e. the vCenter and SRM instances for each site). While we did this because it was more practical for our tests, some customers may decide to implement these services on separate, dedicated hosts. This is a never ending debate in the community: should I host the management services on the same infrastructure they are managing? There is no definitive answer to that question.
A few additional things are worth paying attention to. While the general idea is that there is a primary site and a DR site, note that SRM (and most storage replication algorithms) supports bidirectional replication, so that both data centers can host production workloads and can be configured to fail over to each other. This is why some customers are using these tools as a replacement for a proper Campus solution in some circumstances. However, as we noted, this solution misses the key characteristics of a Campus solution (e.g. a single stretched cluster and failover automation). Another important aspect to keep in mind is that the current SRM release (4.0) does not support the failback process out of the box (one may argue that many customers do not have processes in place to support that either). While some storage vendors (such as EMC) have developed scripts that make the failback process a bit easier, the fact is that it is painful and not built into the (current release of the) product. This ties back to the positioning discussion we have already had. If you want to use this solution for real DR purposes, it may make sense to treat the failback process as a project of its own. I have been working with a customer, as an example, that didn't even consider a failback process: in case of a catastrophic event that would trigger a DR response, the DR data center would become the production datacenter indefinitely. At the time of this
writing, the official failback procedure is to reconfigure SRM and create DR plans that are the exact mirror of the original DR plans. In that case you can push the red button and trigger a so-called "scheduled DR event". This will of course account for another downtime of the entire infrastructure (but at least this one can be scheduled). In terms of protocols, all storage vendors have products with the replication functionality we have described. Depending on the specific storage model some of the protocols may or may not be supported. In general, storage products with the described replication features support Fibre Channel and iSCSI; some may also support NFS. None of them supports SAS. Also, some vendors may allow you to mix connection protocols across data centers (for example Fibre Channel connections in the primary data center and iSCSI connections at the DR site).
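As an aside, the failback workaround described above (rebuilding the recovery plans in the opposite direction) can be sketched in a few lines. The dictionary layout below is invented purely for illustration; it is not SRM's actual data model:

```python
# SRM 4.0 has no built-in failback: the documented workaround is to rebuild
# the recovery plans in the opposite direction. The dictionary layout here
# is invented for illustration; it is not SRM's real data model.

def mirror_plan(plan):
    return {
        "name": plan["name"] + "-failback",
        "protected_site": plan["recovery_site"],  # the site roles swap...
        "recovery_site": plan["protected_site"],
        "replication": "reversed",                # ...and so must the array replica
    }

dr_plan = {"name": "prod-to-dr", "protected_site": "SiteA", "recovery_site": "SiteB"}
print(mirror_plan(dr_plan))
```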
I haven't personally tried this myself. The documentation I have read so far doesn't mention whether this can be used in conjunction with Failover Clustering at the production location (in scenarios where you may have, for example, 4 Hyper-V clustered servers in production and a couple of Hyper-V servers as target replication servers in a remote location, possibly clusterized as well). The documentation I have seen refers to integration with Hyper-V Manager, not Virtual Machine Manager / Failover Cluster Manager, hence the doubt. Additionally, it remains to be seen whether, assuming it works with cluster setups, it also works with CSVs or third party cluster file systems. I will update this section as soon as I find more info on this. Another implementation example of Host-based Replication for Globe scenarios where Hyper-V is the hypervisor of choice is Veritas Storage Foundation. Specifically for Globe scenarios, it is worth mentioning that this suite of products includes a feature called Veritas Volume Replicator, which allows you to replicate a set of data (or a set of VHD files in this case) across an IP network (a LAN, or more likely a WAN given the distance). The idea is similar to all of the Host-based Replication technologies we have seen so far. Veritas (now Symantec) typically offers this solution in conjunction with Veritas Cluster Server, which includes a globally dispersed cluster option. While the Veritas Volume Manager we discussed in the Campus scenario for Hyper-V integrates well into the Microsoft virtualization framework (i.e. Hyper-V, Virtual Machine Manager and Failover Clustering), the Veritas Volume Replicator product could possibly require other pieces of the Veritas offering (i.e. Veritas Cluster Server with Global Cluster support) that are incompatible with key Microsoft components such as Failover Clustering.
Since Microsoft is hard-coding most of the virtualization features (such as Live Migration) into the Failover Clustering functionality, it may be impossible to get rid of this fundamental piece of the Microsoft architecture and replace it with something else. Still digging into it. As often happens with Host-based Replication solutions, the storage protocol is not a variable in these implementations, so you can use FC, iSCSI, NFS or SAS (or a mix of them) in the source and destination data centers.
6 . 3 . 3 . 2 Globe Implementations for Microsoft Virtual Infrastructures (Storage-based Replication)
This is where Microsoft is falling short. The Microsoft ecosystem for long distance DR scenarios is largely based, as of today, on the Host-based Replication scenarios that we discussed in the previous section. The only viable Storage-based Replication solution available for Windows is the Failover Clustering integration with selected vendors that implement Storage-based Replication technologies. We briefly discussed this solution in the "Campus" scenario because its characteristics (RPO=0 via synchronous Storage-based Replication as well as automatic failover) make it look more like a technical implementation of a Campus / Business Continuity architecture. Having said this,
many still see this solution as a viable "Globe" implementation. I personally don't consider this a viable DR implementation for a number of reasons, the most important of which is that you want a certain level of "human control and judgment" before you commit to a DR failover. This solution is instead fully automated, since it leverages the standard Failover Clustering capabilities built into the Windows stack (and hence available to the Hyper-V role as well). For the record, this solution can be tweaked not to fail over automatically, should the user wish so. However, as we have already mentioned in the other section, at the time of this writing there is no possible integration between this "stretched Failover Clustering" configuration and Hyper-V R2 when used in conjunction with CSV or other third party clustered file system technologies. An approach that is more appealing (to me at least) for solving the DR issue in a Hyper-V context is what Citrix is releasing with their new Essentials package. At the time of this writing the new Citrix Essentials package for Hyper-V R2 seems to include a VMware SRM-like feature called "Site Recovery with StorageLink". This is what the web page reports at the moment: "Site Recovery uses StorageLink technology to unlock underlying storage array replication services, giving you the toolset you need to enable remote fail-over strategies for their Hyper-V environments. You can use StorageLink Site Recovery to set up native storage remote mirroring services, then test the recoverability of virtual infrastructure at remote sites through the staging capabilities included with Site Recovery. Instant clones of recovery VMs can be created and placed in isolated networks where you can test recovery without impacting ongoing data replication or access to critical Hyper-V infrastructure.
Additionally, you can use workflow orchestration tools in Citrix Essentials and Windows Server 2008 Clustering to build out a completely automated DR solution for your Hyper-V infrastructure." And this is the picture that comes with the description:
While the implementation of this product may end up being vastly different from VMware Site Recovery Manager, the philosophy behind it is pretty similar, and it fits more nicely into the segmentation I have made in this document. Reading what's available to date, I am assuming that the "automation" Citrix refers to is similar to the red button provided by VMware Site Recovery Manager. It is in fact important to distinguish between "automation" used to mean an automatic and unattended action triggered by a given event, and "automation" used to mean a bundle of complex tasks coded and tied together into an atomic action whose execution depends on a human trigger. The latter is what a real "Globe" (or DR) implementation should provide. Similarly, I am assuming that the reference to the Microsoft Failover Clustering technologies is meant to describe the integration between these "DR rules" implemented at the Citrix Essentials package level and the local Failover Clustering technologies that can be used within the two "logically" separated clusters and data centers. It remains to be seen whether the product in question is a framework that can be programmed to create an SRM-like experience or whether it provides an SRM-like experience out of the box. Generally speaking, we are at the very early stages of building effective DR solutions based on Hyper-V. This is true both from a product perspective and from a best practices perspective. This will inevitably change as time goes by and more and more customers look into this for their own implementations. The storage protocols that can be used (FC, iSCSI, NFS) may vary based on specific storage models.
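The two meanings of "automation" distinguished above can be sketched in a few lines. The function below is purely illustrative, not a model of any of the products discussed:

```python
# Illustrative sketch: a Globe/DR "red button" bundles many complex steps into
# one atomic action, but refuses to run them without an explicit human trigger.

def red_button(steps, confirmed_by=None):
    if confirmed_by is None:
        # Unattended, event-triggered execution is Campus-style automation,
        # not what a DR implementation should do.
        return "refused: a Globe/DR failover needs human judgment"
    for step in steps:  # the bundled, coded tasks run as one atomic action
        step()
    return "DR plan executed (authorized by %s)" % confirmed_by

actions = []
steps = [lambda: actions.append("promote-luns"), lambda: actions.append("power-on-vms")]
print(red_button(steps))                          # event fired, nobody confirmed
print(red_button(steps, confirmed_by="ops-duty"))
```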
6 . 4 Conclusions
As I said at the beginning, the whole point of this document is to create a framework for you to think about your storage high availability plans. While the first part is more philosophical (subjective), the second part is more practical (objective). I invite readers to send me feedback on both, but especially on the second part of the document. It is almost impossible to get everything right (as there are too many products and actual implementations to track). There are certain things that are not accurate because they have not yet been fully released. This is specifically a problem with third party solutions that might work with VMware VI3 / Microsoft Hyper-V but may not work with VMware vSphere / Microsoft Hyper-V R2 (or vice versa). I have tested many of the implementations listed myself (either in the lab or for customers). However, for other implementations, I have only read the documentation and I don't have practical hands-on experience. That's where your help, as an experienced end-user or a vendor (but please avoid flooding me with marketing/sales information about your products), is appreciated to set the record straight in this document. If you have any feedback, please send me a private e-mail at massimo@it20.info.