November 2015
Table of contents
Section 1: Overview
Executive summary
Audience
Disaster Recovery vs. High Availability
Defining types of Disaster Recovery
Defining what is critical
Section 4: Conclusion
Section 5: Appendices
Appendix A
citrix.com
References
Appendix B
High Level Regional Diagrams
Appendix C
Identifying Services and Applications for DR/HA
Section 1: Overview
Executive summary
There is much conversation around executing disaster recovery for a data center, and utilizing high
availability wherever possible. However, what are the requirements around disaster recovery, and how
does it differ from high availability? How do they work together to ensure your systems and applications
are up and available, no matter what?
This white paper looks at understanding disaster recovery and high availability. As with most things in life, there are trade-offs: the more resilient to failure you want to be, the more it is going to cost. How do these trade-offs affect you? There is the old-fashioned approach of writing everything of importance to tape, storing the tape off-site, and waiting for a disaster to occur. Tape is a very low-cost option, but it could take days or weeks to rebuild your environment. At the other end of the spectrum, you can use today's technology to make everything active/active, essentially running two complete data centers in two different locations. The two-data-center option is extremely resilient, but also extremely costly. Put simply, you are betting that you are going to have a disaster that affects at least one of your sites.
What exactly needs to be up and running as quickly as possible after a failure of your data center? Where
does high availability come into play to help? This document looks at some of these questions, and asks
a few more, to help you understand and make good decisions in building a disaster recovery plan.
This project is not looking at sizing, scaling, or performance, but at design considerations for disaster
recovery. In the Solutions Lab, a team of engineers including lab hardware specialists, network
specialists, storage specialists, architects, and Citrix experts were challenged to build a disaster recovery
solution for a fictitious company defined by Solutions Lab Management. This document shows how the
company was defined, how the team architected and then implemented a solution, and the issues they
uncovered, whether flaws in their plan or things they did not anticipate. The resulting plan was compared
to how companies such as Citrix handle disaster recovery and was found to be very similar. The team had
an advantage in that they were able to build the company data center to fit their design, rather than fit a
design to an existing data center. Hopefully what they learned and uncovered will assist you as you think
about building your own disaster recovery plan.
Note that a major component of any disaster recovery solution is the storage and storage vendor used.
The concerns are around the amount of data to be moved between the sites and the acceptable delta
between data synchronizations. For this paper, we worked with EMC, utilizing their storage solution to
achieve our defined goals.
Audience
This paper was written for IT experts, consultants, and architects tasked with designing a disaster
recovery plan.
• Duplicate hardware
• Everything that occurs on the primary site also occurs on the secondary site
• Load balanced
In Active/Passive (A/P), depending on how quickly you need to be back up and running, it may be as simple as backing up to tape and, in a disaster, restoring from tape to available hardware. This is the lower-cost solution, but not very resilient or quick to recover. Active/Active (A/A) has duplicate hardware and software running and supporting users. In a multi-site scenario, each site must have enough additional hardware to support the user failover. A/A is much quicker to recover from a disaster, but much more expensive in Capital Expenditure (CAPEX) on hardware. Essentially, each site has a complete duplicate set of underutilized hardware waiting for a disaster. With Active/Warm (A/W), the plan is to define what is critical to the company and must be recovered as quickly as possible, and to have enough bandwidth at the other site(s) to support that requirement. Once the most critical environment is defined, the rest of the company can be dealt with. This does require some extra hardware in each region, but the resources and costs can be better managed.
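The trade-off among these models reduces to cost versus recovery time objective (RTO). The sketch below is purely illustrative; the recovery times and relative cost units are hypothetical placeholders, not figures from this project:

```python
# Illustrative sketch: pick the cheapest DR model that still meets an RTO.
# Model names follow the A/P, A/W, A/A types described above; the hours and
# cost figures are made-up placeholders, not Citrix guidance.

DR_MODELS = {
    # model: (typical recovery time in hours, relative cost units)
    "active/passive (tape)": (72.0, 1),
    "active/warm": (4.0, 5),
    "active/active": (0.1, 10),
}

def cheapest_model(rto_hours):
    """Return the lowest-cost model whose recovery time fits the RTO."""
    candidates = [
        (cost, name)
        for name, (recovery, cost) in DR_MODELS.items()
        if recovery <= rto_hours
    ]
    if not candidates:
        return None  # no model recovers fast enough
    return min(candidates)[1]

print(cheapest_model(100))  # tape is enough if days of downtime are acceptable
print(cheapest_model(8))    # a warm site is needed for same-day recovery
print(cheapest_model(0.5))  # only active/active meets a near-zero RTO
```

Run with your own RTO and cost figures, the same comparison shows whether tape, a warm site, or a full active/active build is the cheapest option that still meets your recovery target.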
Requires continuous availability, though short breaks in service are not catastrophic
As stated earlier, we created a fictitious company for this disaster recovery plan scenario. This company
has a single Mission Critical application and a single Business Critical application, and associated users.
The company president defined the acceptable response times and requirements, including a desire to
have a warm failover for mission- and business-critical users, and a passive failover for the rest of the
company. The following sections highlight the development and implementation of the plan.
For a closer look at this diagram by region, see Appendix B at the end of the paper.
Service Descriptions
This table defines our MC, BC and PR services and applications and our considerations in handling them
in our setup.
Service Type: Mission Critical
Service: Microsoft SQL Sample Database Northwind
Description: The SQL Sample Northwind Database is used along with a web server. This represents the Call Center mission critical application database.

Service Type: Business Critical
Service: Microsoft Exchange / Outlook

Service: File Share (DFS)
Configuration: DFS Replication is configured between primary sites, and file-based backup is performed to the DR location every 8 hours.
Requirements: In case of disaster, a limited set of users must have access to the DR file share location.

Service: Microsoft Office
Configuration: Microsoft Office is published on XenApp.
Requirements: Published Microsoft Office must be unavailable to users when the file share is not available.
              Mission Critical   Business Critical   Business Operational / PR
Engineering         30                  60                    560
HR                  10                  10                     20
Management          45                  75                    580

              Mission Critical   Business Critical   Business Operational / PR
Call Center         20                  60                    520
Engineering         10                  50                    570
HR                  25                  10
Management          40                  90
• Four physical XenApp hosts in a single delivery group, as a 3+1 HA model supporting the business operational users.
• Four physical hosts running XenServer configured as a pool, in a 3+1 HA model supporting the mission- and business-critical users. This pool supported the following configuration:
The Region 2 failover pool in Region 1 is four XenServer hosts in a 3+1 model supporting the
following configuration:
• Three physical servers running XenServer, hosting infrastructure VMs, including the SQL call center cluster.
• Four physical XenApp hosts in a single delivery group, as a 3+1 HA model supporting the business operational users.
• Four physical hosts running XenServer configured as a pool, in a 3+1 HA model supporting the mission- and business-critical users. This pool supported the following configuration:
The Region 1 failover pool in Region 2 is four XenServer hosts in a 3+1 model supporting the
following configuration:
The Region 1 disaster recovery site was set up with four XenServer hosts in a 3+1 HA model supporting the following configuration:
o Infrastructure VMs
The Region 2 disaster recovery site was set up with four XenServer hosts in a 3+1 HA model supporting:
o Infrastructure VMs
Note: The networks for Region 1 and Region 2 in this site are set up with the same IP ranges as in the original regional sites.
Software
The following is a list of software components deployed in the environment:
Component
Version
Endpoint Client
Web Portal
License Server
Office
Database Server
Hypervisor
Network Appliance
WAN Optimization
Storage Network
Storage DR
Note: All software is updated to run the latest hotfixes and patches.
Hardware
Servers
The hardware used in this configuration was blade servers with two-socket Intel Xeon E5-2670 processors @ 2.60 GHz, 192 GB of RAM, and two internal hard drives.
Network
VMs were utilized as site edge devices that helped route traffic between regions. The perimeter network
(also known as a DMZ) had a firewall between itself and the internet and another firewall between the
perimeter network and production network.
NetScaler Global Site Load Balancing (GSLB) was used to determine which region the user is sent to. If available, users are sent to their primary region. When the primary region is not available, users are sent to their secondary region. A pair of NetScaler VPX appliances per region was utilized for authentication, access, and VPN communications. Additionally, a pair of NetScaler Gateway VPX appliances was utilized per region to allow connectivity into the XenApp/XenDesktop environment. CloudBridge VPX appliances were utilized for traffic acceleration and optimization between regions. NetScaler CloudBridge Connector was configured for IPSec tunneling.
The following diagram is a detailed architectural design of our network implementation.
Storage
Storage was configured using EMC XtremIO All-Flash Storage and Isilon Clustered NAS systems.
Storage Network for EMC XtremIO was configured with Brocade Fibre Channel SAN switches. The
following diagram gives a high-level view for Region 1. As stated previously, failover to a DR site requires manual intervention, so the concern in syncing data comes down to a math problem: how much data do you need to sync between sites, and what size pipe is between the sites? That determines how long the sync will take. Can you sync in the time allowed? If not, what can you do to correct the problem: reduce the amount of data, or increase the pipe speed?
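That math problem can be written down directly. A minimal sketch, assuming decimal units and an effective link efficiency you would measure in your own environment:

```python
# A quick sketch of the "math problem" described above: how long a given
# data delta takes to replicate over a given WAN link. Figures are examples.

def sync_hours(delta_gb, link_mbps, efficiency=0.7):
    """Hours to move delta_gb over a link_mbps pipe at the given efficiency.

    efficiency is an assumed factor for protocol overhead and link contention.
    """
    bits = delta_gb * 8 * 1000**3                      # decimal GB to bits
    seconds = bits / (link_mbps * 1000**2 * efficiency)
    return seconds / 3600

# 500 GB of changed data over a 100 Mbps link:
hours = sync_hours(500, 100)
print(f"{hours:.1f} hours")  # prints "15.9 hours" -- if this exceeds the sync
                             # window, shrink the delta or buy a bigger pipe
```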
One thing to look at is the LUNs, or storage repositories. Our design created multiple volumes for mission
critical data and business critical data, and scheduled syncs accordingly. It is crucial that you work with
the storage vendor to get the proper configuration.
Use Cases
The following use cases define the possible scenarios that must be considered and, for our case study,
the users that must be supported. At minimum, the mission-critical and business-critical users must be supported.
Use Case 1
If the Region 1 site fails, mission- and business-critical users will be able to connect and log on to
the Region 2 site with the same data resources as were available in the Region 1 site.
With the Region 1 site back online, NetScaler GSLB will direct users to the correct site, as Region
1 site users log off from the Region 2 site and then log back into the Region 1 site.
A maximum of 120 users will have warm HA failover capability from Region 1 to Region 2.
Use Case 2
If the Region 2 site fails, mission- and business-critical users will be able to connect and log on to
the Region 1 site with the same data resources as were available in the Region 2 site.
With the Region 2 site back online, NetScaler GSLB will direct users to the correct site, as Region
2 site users log off from the Region 1 site and then log back into the Region 2 site.
A maximum of 130 users will have warm HA failover capability from Region 2 to Region 1.
Use Case 3
The sites are configured as Active/Passive, with the goal of failing over only the mission-critical users from the Region 1/Region 2 sites to the DR site.
This site will be based on backup data from Region 1 and Region 2 and will go live within 5 days.
When users log in to the DR site, they should see any changes or modifications from their dedicated environment reflected in the DR site environment. There is potential for data loss between the last site-to-site copy and the failover. Once failed over to the DR site, when Region 1/Region 2 come back online, and after allowing appropriate time for replication between sites, logins should connect to Region 1/Region 2 and the changes should be reflected there.
The cold DR site will contain a subset of the regional sites, including networking, infrastructure, and dedicated VDIs.
o This approach allows us to both easily recover from disaster with backups and later rebuild regional sites from the DR site data.
o Mission Critical users will have primary access to the cold DR site, followed by Business Critical, and then the rest of the company, depending on timelines and disaster impact.
Section 3: Deployment
This document is not a step-by-step manual for building this configuration, but a guide to help you understand what needs to be done. Wherever possible, Citrix documentation was followed for deployment and configuration. The following configuration sections highlight any deviations or areas of importance to help with a successful deployment.
Implementing the software breaks down into two major areas: first, putting the correct software into each region; second, configuring NetScaler for GSLB.
The process followed for deployment was:
1. Deploy XenServer pools.
2. Create required AD groups and DHCP scopes.
3. Prepare SQL Environment (SQL AlwaysOn). PVS 7.6 adds support for AlwaysOn.
4. Deploy XenDesktop environment.
5. Deploy Storefront servers and connect to XenDesktop.
6. Deploy PVS environment and create required vDisks.
7. Configure NetScaler GSLB, create site and service.
8. Configure NetScaler Gateway in Active/Passive mode and update Storefront configuration.
9. Deploy Microsoft Exchange Environment.
The NetScaler configurations are straightforward; nothing special was done in configuring StoreFront. This was a typical XenDesktop and NetScaler Gateway configuration. Two StoreFront servers were configured to be load balanced by NetScaler.
NetScaler GSLB is where the focus is:
• Location settings in NetScaler define the primary regions of the clients' local DNS servers and of the GSLB sites and services.
• Users, regardless of region, use the same Fully Qualified Domain Name (FQDN) (e.g., desktop.domain.com); NetScaler running ADNS will answer authoritatively with the IP of the primary site.
• Once the user is redirected to the proper site, the user authenticates at the AG and is then redirected to the local StoreFront to get access to resources.
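The GSLB behavior above can be modeled in a few lines. This is a toy illustration of the decision logic only, not NetScaler configuration; the region names, client locations, and VIP addresses are made up:

```python
# Toy model of the GSLB behavior described above: every user resolves the same
# FQDN, and the authoritative DNS answers with the user's primary-region VIP
# while that region is up, otherwise the secondary region's VIP.

SITES = {
    "region1": {"vip": "203.0.113.10", "up": True},
    "region2": {"vip": "198.51.100.10", "up": True},
}
# Which region is primary for a client, keyed by the client's LDNS location.
PRIMARY = {"east": "region1", "west": "region2"}

def resolve(fqdn, client_location):
    """Answer for fqdn based on client location and site health."""
    primary = PRIMARY[client_location]
    secondary = "region2" if primary == "region1" else "region1"
    site = primary if SITES[primary]["up"] else secondary
    return SITES[site]["vip"]

print(resolve("desktop.domain.com", "east"))  # region1 VIP while it is up
SITES["region1"]["up"] = False                # simulate a Region 1 outage
print(resolve("desktop.domain.com", "east"))  # now hands out the region2 VIP
```

In the real deployment the decision is made by the NetScaler ADNS responder using its location database and site monitors; the point is that every client asks for the same FQDN, and it is the DNS answer, not the URL, that changes.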
Configuration Considerations
The following defines some of the specific configurations applied to the environment:
XenApp/XenDesktop
FMA services configured with SSL on the Controllers, and XML Service ports changed from HTTP to HTTPS to secure traffic communication
5 Machine Catalogs
Physical XA HSD
XA HSD MC
XA HSD BC
XA HSD MC Failover
XA HSD BC Failover
4 Machine Catalogs
PR
BC
PR Failover
BC Failover
4 Delivery Groups
4 Machine Catalogs
MC
BC
MC Failover
BC Failover
4 Delivery Groups
Static VMs
XenApp / HSD
o StoreFront VMs
o License Server VM
2 HA license servers
4 - X410 Nodes
Provisioning Services
o Utilizing a remote storage location for vDisks: on each PVS VM, remote storage is attached as a second drive via a file server using SMB/CIFS.
o Separate vDisk store locations on the file server, via SMB/CIFS, for Mission Critical and Business Critical vDisks.
o Multihomed
2 LB VPX in HA mode
LDAP Authentication
AG VIP
VPN
2 XenDesktop Brokers
2 StoreFront VMs
2 Provisioning Services
2 AD DC VMs
2 Mailbox
2 Client Access
Perimeter Network
1 Firewall / Router VM
2 CloudBridge VPX VMs in HA model (Active/Passive) for site-to-site user access WAN optimization
R2 HA Fail-Over Pool
5 XA HSD VMs
2 XenDesktop Brokers
2 StoreFront VMs
2 Provisioning Services
2 AD DC VMs
2 Mailbox
2 Client Access
Perimeter Network
1 Firewall / Router VM
2 CloudBridge VPX VMs in HA model (Active/Passive) for site-to-site user access WAN optimization
R1 HA Fail-Over Pool
5 XA HSD VMs
2 AD DC VMs
2 AD DC VMs
2 Delivery Controllers
2 StoreFront VMs
2 Mailbox
2 Client Access
2 AD DC VMs
2 Delivery Controllers
2 StoreFront VMs
2 Mailbox
2 Client Access
Perimeter Network
1 Firewall / Router VM
Note: The infrastructure VMs for regions 1 and 2 were duplicated in region 3 for networking purposes. By
setting the networks correctly in region 3, once regions 1 and 2 were brought up, no network changes
were required in their infrastructure or VHD files.
Failover Process
The dedicated VMs present the biggest challenge in a failure. To address this, VMs are created in both
regions for the failover dedicated VMs from the other region. However, no storage is attached to these
VMs. In the event of a failure, these VMs will be assigned the proper VHD file from the backup storage
location. It should also be noted that for fail-back after the failed region is back online, the dedicated VM
VHD files will be deleted in the failed region and copied back from the failover region and attached to the
proper VM. This ensures the latest version of the dedicated VMs will be restarted after the fail-back.
Note: In dealing with dedicated VMs, we realized that we had to carefully name the VHD files and
associated files to ensure connecting the correct VHD file to the correct VM in failover and fail-back.
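One way to make that naming discipline mechanical is to derive the VHD-to-VM mapping from a convention rather than keep it in someone's head. The scheme below is our illustration, not the naming actually used in the lab:

```python
# Illustrative naming convention for dedicated-VM VHD files: encode the region
# and VM name in the file name so the mapping needed at failover and fail-back
# can be derived rather than remembered. (Hypothetical scheme, not the lab's.)

def vhd_name(region, vm_name):
    """Build a predictable VHD file name for a dedicated VM."""
    return f"{region}-{vm_name}-dedicated.vhd"

def vm_for_vhd(filename):
    """Recover (region, vm_name) from a VHD file name built by vhd_name()."""
    region, vm_name, suffix = filename.split("-", 2)
    assert suffix == "dedicated.vhd", f"unexpected VHD name: {filename}"
    return region, vm_name

name = vhd_name("r1", "vdi042")
print(name)              # r1-vdi042-dedicated.vhd
print(vm_for_vhd(name))  # ('r1', 'vdi042')
```

With a convention like this, the attach step during failover can be scripted: list the backup storage location, derive the target VM for each file, and fail loudly on any file that does not parse.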
If there is a failure in either Region 1 or Region 2 (what's called a warm failover), a few steps need to be taken. The actions differ depending on the failure. If it is a network access issue, or the Internet is down, the dedicated VMs in the failed region are placed in Maintenance Mode in Citrix Studio and shut down. The latest storage backup of the dedicated VMs in the new region must be made available, and the storage for each VM needs to be attached individually to the pre-created VMs already present. Group policy applied to the dedicated VMs' OU imports the registry value listing the delivery controller host names, allowing VDA registration with the local delivery controllers. The pooled VDI and XA HSD VMs on the local delivery site are also taken out of Maintenance Mode and brought online.
For Region 2, the SQL database for the call center application is brought online as well. Depending on
the type of failure, you may need to power down the failed region firewall to force failover to the other
region.
Once those steps are completed, you boot Mission Critical User VMs and Business Critical User VMs.
Mission- and Business-Critical data is kept in sync between the sites. You can then communicate the
availability to your users. The end users use the same URL as always, with GSLB redirecting as required.
For fail-back after recovery of the failed region has completed, the steps are to sync all storage back to
the failed site, perform the necessary steps for the dedicated VMs, bring the applications back online, and
bring up the users.
In a full loss of both Region 1 and Region 2, the DR site, or Region 3, needs to be brought online. The physical servers are powered up, making the XenServer pools accessible. The latest database and Exchange information is imported, and the infrastructure for user VDI VMs is restored and brought online. A new URL is required to log in. Once the site has been brought online, any new information, like the new URL for access, needs to be given to your users.
• Import Domain Controllers from backup and restore Active Directory functionality
• NetScaler
• XenServer
• XenDesktop Environment
o Import SQL VMs and restore XenDesktop, PVS and Call Center application databases
o Import StoreFront, XenDesktop and PVS VMs and test connectivity to databases
• Exchange Environment
• File Services
• External DNS
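The checklist above has an implicit dependency order: Active Directory before everything that authenticates against it, SQL before the XenDesktop/PVS tier, and so on. A sketch of deriving a safe bring-up order from declared dependencies (service names are from the checklist; the dependency edges are our reading of it, and graphlib requires Python 3.9+):

```python
# Derive a safe DR bring-up order from declared dependencies. The edges below
# are our interpretation of the restore checklist, not an official runbook.
from graphlib import TopologicalSorter

DEPENDS_ON = {
    "Domain Controllers": [],
    "NetScaler": ["Domain Controllers"],
    "XenServer": [],
    "SQL VMs": ["Domain Controllers", "XenServer"],
    "StoreFront/XenDesktop/PVS": ["SQL VMs"],
    "Exchange": ["Domain Controllers"],
    "File Services": ["Domain Controllers"],
    "External DNS": ["NetScaler"],
}

# static_order() yields nodes so every service appears after its dependencies.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)  # any valid order starts with AD/XenServer and ends downstream
```

Encoding the order this way means that adding a service to the DR plan only requires declaring what it depends on; the bring-up sequence updates itself.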
Section 4: Conclusion
As stated in the beginning, the goal of this project was to challenge a group of engineers to create a disaster recovery plan for a fictitious company. This meant understanding what was mission critical, business critical, and normal day-to-day work, and what applications and data needed to be ready in case of a disaster. It also meant understanding user needs for issues like dedicated VMs. This paper highlights and defines some of the issues around creating a disaster recovery environment. It is not a how-to, step-by-step manual, but a guide to help you understand the issues and concerns in doing disaster recovery, and things to consider when defining your disaster plan. It shows you how the Citrix Solutions Lab team of engineers defined, designed, and implemented a DR plan for a fictitious company. This may not be the optimal solution for your company, but it is one you can use as a baseline of considerations and operational steps when you create your own disaster recovery plan.
During the process of deploying and testing, there were some realizations and changes made. One of the first was around failing back after a failover: how to handle the data. Do you sync back, or delete and copy back? Our decision was to delete and copy back, ensuring the original site is clean and up to date.
Another realization was around the configuration of GSLB and the failed site. Since preparing the failover site for access requires manual intervention, there is potential for GSLB to redirect users to the failover site before it is ready. Users could hit a StoreFront before any personal desktops or applications are available for them, though they would have access to any common applications or desktops.
We used two different SQL approaches: AlwaysOn for our infrastructure environment and clustering for our database application. This was done by design in the lab to show the issues and considerations around both.
In supporting high availability between the two main regions and keeping a third region for total failover, the one thing our company president was less than thrilled with was the CAPEX cost of hardware not being fully utilized. This is a cost of doing business.
However, with the recent introduction of Citrix Workspace Cloud, an alternative may have come up that we are reworking our fictitious company toward. Rather than having additional hardware in Regions 1 and 2, what if there were a cloud site running at a minimum, waiting for a region to fail, that could spin up what is needed to support the failure? Essentially, what is needed in the cloud is a NetScaler VPX for connectivity, an AD server, a SQL AlwaysOn server, and an Exchange server. This keeps the mission critical and business critical environments in sync. You can then determine what else may be required to support each region. The one caveat is that currently no cloud supports desktop operating systems; VDI users get server operating systems running in a desktop mode. This is not a major issue for pooled VDI users, but it does become something to be solved for dedicated VDI users.
Will the cloud work for you? Should you use additional hardware in your regions? What are your recovery
times? How much of your environment is actually mission critical? These are questions we hope you are
now considering as you build a disaster recovery plan for your company.
Section 5: Appendices
Appendix A
References
EMC Storage
http://www.emc.com/en-us/storage/storage.htm?nav=1
Brocade Storage Network
http://www.brocade.com/en/products-services/storage-networking/fibre-channel.html
XenApp
http://www.citrix.com/products/xenapp/overview.html
XenDesktop
http://www.citrix.com/products/xendesktop/overview.html
NetScaler
http://www.citrix.com/products/netscaler-application-delivery-controller/overview.html
CloudBridge
http://www.citrix.com/products/cloudbridge/overview.html
Citrix CloudBridge Data Sheet:
https://www.citrix.com/content/dam/citrix/en_us/documents/products-solutions/cloudbridge-data-sheet.pdf
Appendix B
High Level Regional Diagrams
Appendix C
Identifying Services and Applications for DR/HA
This section identifies all the applications, services and data items for planning within our setup.
Call Center
Type: Database and App
Description: Main application for call center activity required for company mission critical function
Level: Mission Critical
Primary Location: Region 2 (West Coast), Region 1, R3/DR in case of failover or disaster
Access Methods:
Web Servers
Notes:
Database servers and the database must be made accessible in R1 and R3/DR in case of failover or disaster.
http://businessimpactinc.com/install-northwind-database/
https://msdn.microsoft.com/en-us/library/vstudio/tw738475%28v=vs.100%29.aspx
Exchange
Type: Service
Description: Email service, required for internal and external communication
Level: Business Critical
Primary Location: Region 1 & 2, R3/DR in case of disaster
Access Methods: Web Outlook
Data: Exchange Mailbox
Data Location: Exchange Servers
Notes:

Microsoft Office
Type: Application
Description: Productivity applications for regular office work
Level:
Systems:
Notes:
Outlook needs to be available in all regions in case of failover for business critical users.
XenDesktop
Type: Service
Description: Virtual Desktop Brokering and management system, required for virtual desktop access and
assignment
Level: Mission Critical
Notes:
Must be available in all regions for mission- and business-critical users to be able to access desktops.
For R3/DR, the XenDesktop database and the SQL servers supporting it must be brought up before the XD Delivery Controllers.
The licensing server must be available for XenDesktop functionality to allow user connections.
StoreFront
Type: Service
Description: Web Portal into the XenDesktop environment, required for user session access
Level: Mission Critical
Primary Location: Region 1 & 2, R3/DR in case of disaster
Access Methods: Web Browser, Citrix Receiver
Data: SF configuration
Data Location: SF servers
Systems: Storefront Server VMs
Notes:
Must be available in all regions for mission- and business-critical users to be able to access
desktops.
Provisioning Services
Type: Service
Description: Virtual Desktop VM streaming and deployment system, required for the virtual desktop VMs
launch
Level: Mission Critical
Primary Location: Region 1 & 2, R3/DR in case of disaster
Access Methods: PXE and DHCP for the Virtual Desktop VMs
Data:
vDisks
Data Location:
Systems:
Notes:
Licensing server must be available for PVS functionality to allow virtual desktop launch
User Profiles
Type: Data
Description: User data required for all users work on virtual desktops
Level: Mission Critical
Primary Location: Region 1 & 2, R3/DR in case of disaster
Access Methods: SMB
Data: User personal data, including redirected My Documents
Data Location: UPM File Servers
Systems: File Server VMs
About Citrix
Citrix (NASDAQ:CTXS) is leading the transition to software-defining the workplace, uniting virtualization, mobility management, networking
and SaaS solutions to enable new ways for businesses and people to work better. Citrix solutions power business mobility through secure,
mobile workspaces that provide people with instant access to apps, desktops, data and communications on any device, over any network
and cloud. With annual revenue in 2014 of $3.14 billion, Citrix solutions are in use at more than 330,000 organizations and by over 100
million users globally. Learn more at www.citrix.com