
CA Directory r12 SP1

Data Replication & Recovery Best Practice


This document provides specific advice on how to configure CA Directory r12 SP1 for replication between peer data DSAs to provide high availability and 24x7 service. CA Directory provides three data replication methods: multiwrite, DISP, and multiwrite-DISP. This document recommends multiwrite-DISP as the best-practice replication and automatic recovery method for CA Directory r12 SP1 and later deployments.

CA Directory r12 SP1 is resilient, and a DSA can recover gracefully even if the OS fails and data is missing or corrupt. To do this, the DSA relies on a transaction log file. The document details three transactional modes for a data DSA:

With transaction log and flushing: guarantees that a data DSA can recover even after a machine or OS failure. Suggested for environments expected to support a write/modify rate of at most a few hundred updates per second (performance metrics will vary depending on disk speed) and that may not be closely monitored by system administrators for failure.

With transaction log and no flushing: guarantees that a data DSA can recover when the DSA itself fails (e.g. core dumps) but does not guarantee recovery after a machine or OS failure. May be considered for environments expected to support a write/modify rate of thousands of updates per second (performance metrics will vary depending on disk speed) and that are closely monitored by system administrators for failure.

Without transaction log: no guarantee of recovery after a DSA, machine or OS failure, but provides optimal write/modify performance. Suggested for environments expected to support a very high volume of concurrent write/modify operations, many thousands of updates per second (performance metrics will vary depending on disk speed), and that are closely monitored by system administrators for failure.

The document also details how to implement a data DSA disaster recovery plan if there is a need to:

Manually recover a single data DSA from its online peer data DSA(s). This is needed when a) running a data DSA without a transaction log and the data DSA, machine or OS failed, or b) running a data DSA with a transaction log (no flushing) and the machine or OS failed.

Manually recover a single data DSA that has been unavailable for such a long time (e.g. weeks) that performing disaster recovery is a quicker way to get the data DSA online than relying on multiwrite-DISP recovery.

Manually recover a single data DSA because the in-memory multiwrite queue on one of its peer DSA(s) has overflowed (only applicable when multiwrite replication is being used).

© 2009 CA, Inc. All rights reserved. Confidential and Proprietary Information. Version: 1.0, August 2009.


Table of Contents
Executive Summary
Product Version
Data DSA Replication
Data DSA Operational Modes
Data DSA Disaster Recovery
Online Data DSA Backup
How to Implement a Data DSA Disaster Recovery Plan
    When using multiwrite-DISP or DISP replication
    When using multiwrite replication


Executive Summary
This document provides specific advice on how to configure CA Directory for data replication and data recovery for optimal performance, reliability, resilience and scalability. Using two or more CA Directory servers allows for 24x7 service and data availability. For example, when a CA Directory server is brought down for system maintenance (e.g. applying OS patches or hardware upgrades), or a server fails, there must always be at least one other server online. The key to providing 100% service and data availability for your LDAP client applications is to ensure that there is always at least one CA Directory server available and that that server can accommodate any peak load the LDAP client applications might push at it.

At a single Data Center, the advice is to use at least two CA Directory servers, with all CA Directory servers replicating synchronously using a multiwrite-DISP configuration. Preferred-master and multi-master setups are both valid configurations, dependent on customer preference and use case. In general, however, preferred master is the best-practice choice over multi-master. Use the write-precedence setting in the data DSA settings file to define the preferred master server for a set of peer data DSAs. For example, you might define set write-precedence = Host1-UserStore; in each UserStore DSA settings file on Host1, Host2 and Host3, if the data DSA is named UserStore and there are three hosts named Host1, Host2 and Host3 hosting the UserStore data DSA.

If two or more Data Centers are being used, the advice is to use two or more CA Directory servers per Data Center, with all CA Directory servers within a Data Center replicating synchronously using a (preferred-master or multi-master) multiwrite-DISP configuration, and CA Directory servers between Data Centers replicating asynchronously using a multi-write-group configuration.

It is CA Directory best-practice advice to use a null-prefix router DSA on each CA Directory server; all client LDAP requests should enter the CA Directory X.500 backbone via a router DSA. A router DSA will route the LDAP requests to the appropriate data DSA using a shortest-path routing algorithm.

This document describes in detail how to configure replication between CA Directory servers and how to ensure they are always kept in sync with their peer data DSAs.

Product Version
CA Directory r12 SP1 Service Release 2 (Build 2266) and later.


Data DSA Replication


The CA Directory r12 SP1 Administration Guide provides detailed information on setting up replication (Chapter 5). See https://support.ca.com/irj/portal/anonymous/userlogin?TARGET=%20%20%20%20https%3A%2F%2Fsupport.ca.com%2Firj%2Fportal%2FDocumentationResults%3FproductID%3D1639%26amp%3BreleaseID%3DALL%26amp%3BlanguageID%3DENU%26amp%3BactionID%3D2 (requires support.ca.com login).

In summary, if using ASCII configuration, define the following DSA settings:

a) Add the following dsa-flags settings in the data DSA knowledge files:
set dsa dsaname =
{
    ...
    dsa-flags = multi-write, no-service-while-recovering
    ...
};

b) If using more than one Data Center, add a specific string for the multi-write-group setting to logically group all DSAs and hosts within a specific Data Center (suggest using the Data Center name or location for the string):
set dsa dsaname =
{
    ...
    multi-write-group = <data center name>
    dsa-flags = ...
    ...
};

c) In the data DSA settings file, define:


set multi-write-disp-recovery = true;
set wait-for-multiwrite = false;
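Putting the ASCII configuration pieces together, a minimal sketch of the relevant fragments for the UserStore example used later in this document might look as follows. The DSA name, the multi-write-group string and the write-precedence line are illustrative (drawn from the examples in this document), and mandatory knowledge-file attributes such as the prefix and addresses are omitted:

# Knowledge file fragment for the UserStore data DSA (repeated per host)
set dsa Host1-UserStore =
{
    ...
    multi-write-group = DataCenter1          # only needed with multiple Data Centers
    dsa-flags = multi-write, no-service-while-recovering
    ...
};

# UserStore data DSA settings file fragment (same on each host)
set multi-write-disp-recovery = true;
set wait-for-multiwrite = false;
set write-precedence = Host1-UserStore;      # optional: preferred master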

If using DXmanager XML configuration, define the following data DSA settings: a) Add the following replication settings for each data DSA:

b) Define hosts within a single Site if using one Data Center or define multiple Sites to logically group all DSAs and hosts within each specific Data Center (suggest using the Data Center name or location for the name of a Site):


Data DSA Operational Modes


The following data files are used by data DSAs:

<dsaname>.db - contains all DSA data.

<dsaname>.oc - contains a list of object classes referenced by the DSA.

<dsaname>.at - contains a list of attributes referenced by the DSA.

<dsaname>.dp - contains a list of peer DSA names and the date/time the DSA last successfully communicated with each peer. This is used by the DISP and multiwrite-DISP replication methods.

<dsaname>.dx - similar to the .dp file but is only written to by dxdisp. Having a separate file works around the problem of the DSA updating the .dp file at the same time dxdisp is updating it. The dxdisp command line utility sets the time a defined peer data DSA was last communicated with to "now" on a CA Directory server. The dxdisp utility is used as part of a data DSA disaster recovery plan (see later).

<dsaname>.tx - an optional transaction log file, enabled by default, containing any transactions that have not yet been committed to the memory-mapped file on disk.
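For illustration, the dxdisp utility mentioned above is run on a CA Directory server with the name of the peer data DSA whose last-communication time should be reset to "now" (the DSA name below is taken from the recovery example later in this document):

# Record "now" as the last successful communication time with the peer DSA
# Host1-UserStore in the local .dx file
dxdisp Host1-UserStore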

The data files location should be defined when the data DSA is created, using the dxgrid-db-location setting in the DSA's server (.dxi) file if using ASCII configuration, or defined explicitly in DXmanager if using the XML configuration. The default is $DXHOME/data on UNIX/Linux and %DXHOME%\data on Windows.

There is a trade-off between DSA operational modes: write performance versus guaranteed automatic data recovery. Which mode is chosen depends on a number of factors and is ultimately a deployment-specific choice.

Using a transaction log helps automatic data recovery when a data DSA, machine or OS stops unexpectedly. However, write/modify performance will not be optimal: ballpark around 100 updates per second when flushing and a few thousand updates per second when not flushing (performance metrics will vary depending on disk speed). Performance of the transaction log file can be increased by hosting the data DSA files on a high-performing disk system such as a SAN or by using RAID (e.g. RAID 1+0 or RAID 5); the network performance between the server and the disk system would then be the likely performance barrier.

Flushing the transaction log file guarantees that a data DSA will be able to automatically restart and recover data even on machine or OS failure (without system administrator manual intervention). Not flushing the transaction log file guarantees that a data DSA will always be able to restart after a DSA failure, but not after a machine or OS failure. It is impossible for the data DSA to differentiate between a DSA, machine or OS failure when it is starting up. Thus, when not flushing the transaction log file, a system administrator needs to know whether to simply restart the data DSA (because they know there was a DSA failure) or to implement their data DSA disaster recovery plan (because they know there was a machine or OS failure).

Not using a transaction log provides optimal write/modify performance, ballpark thousands of updates per second. However, if there is a DSA, machine or OS failure, a system administrator will need to implement their data DSA disaster recovery plan.


By default, the transaction log file and flushing are enabled when a data DSA is created. To disable the transaction log file, define:

set disable-transaction-log = true;

in the data DSA settings file. To disable transaction log file flushing, define:

set disable-transaction-log-flush = true;

in the data DSA settings file.

Which data DSA operational mode you choose depends on a number of factors and is customer deployment specific. Some questions that would determine which mode to use:

Which type of application is accessing CA Directory? Do those applications have a requirement to support a high volume of update/modify operations? Hundreds of update/modify operations per second? Thousands of update/modify operations per second?

Are the update/modify operations mission critical? For example, session/token data may be perceived as transient and not mission critical, as it is only relevant for the time a user is logged in; in this case performance overrides any need to guarantee that the data will be available on DSA or machine failure.

How closely will the system administrators monitor the CA Directory deployment? When a data DSA, machine or OS fails, will the system administrators take action promptly?

What is the historical stability or expected availability of the hardware and/or OS that the CA Directory deployment will be running on? For example, deploying to a closely controlled UNIX environment that has had 99.999% or better uptime for some years may give more confidence to deploy with the transaction log and no flushing, or with no transaction log at all. Deploying to a virtual environment without strict access or management controls may mean you are more conservative and choose to deploy with a transaction log file and full flushing.

Note that using a CA Directory transaction log file with flushing will likely provide equivalent, if not better, write performance than that provided by disk I/O bound database solutions provided by LDAP vendors.
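To summarise, the three operational modes map to the following settings-file fragments (a sketch; only the non-default modes need an explicit setting):

# Mode 1 - transaction log with flushing (the default): no setting required

# Mode 2 - transaction log without flushing
set disable-transaction-log-flush = true;

# Mode 3 - no transaction log
set disable-transaction-log = true;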


Data DSA Disaster Recovery


Data DSA disaster recovery needs to be implemented when a data DSA cannot be restarted. What prevents a data DSA from starting? When a router or data DSA starts up, it writes a file to the DXHOME/PID folder on the server with the same filename as the DSA name. This file is removed when the DSA stops cleanly. If the DSA finds the file on start-up, it will refuse to start (unless the transaction log with flushing, the default, is being used).

Scenarios in which a system administrator would action their data DSA disaster recovery plan:

No transaction log file is being used, and either a data DSA has failed (e.g. core dumped or terminated unexpectedly), the machine has failed (e.g. unexpected power outage) or the OS has failed (e.g. blue-screened).

A transaction log file without flushing is being used and the machine failed. If the data DSA failed and not the machine, the system administrator would need to determine this and remove the PID file so that the DSA can restart normally.

A data DSA has been offline for a substantial period of time (e.g. weeks).

Multiwrite replication is being used and a peer DSA's in-memory queue overflows.
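As a sketch of the manual PID file check described above (UNIX/Linux shown; the DSA name is taken from the example scenario later in this document, and the exact folder case and file naming may vary by platform). Only do this when you know the DSA itself failed rather than the machine or OS:

# List any PID files left behind by DSAs that did not stop cleanly
ls $DXHOME/pid

# Remove the stale file for the failed DSA so it can restart normally
rm $DXHOME/pid/Host1-UserStore*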

Online Data DSA Backup


CA Directory provides the capability to schedule periodic online backups. This is of use when there is a total data DSA failure across peers and there is a need to go back to a last good data file to recover all peer DSAs. To do this, define:

dump dxgrid-db period <start> <period>;

in the DSA settings file. <start> is the number of seconds since Sunday 00:00:00 (GMT) and <period> is the number of seconds between online backups. Note that the start time is defined from Sunday 00:00:00 GMT and not your local time.

This will produce a snapshot of the .db, .at and .oc files at the specified time (with extensions .zdb, .zat and .zoc) in the same folder. Each time the online backup runs, it overwrites any previous online backup files. Thus, if you want to archive online backups, you need to copy the backup files to a different archive location, for example by defining a cron job on UNIX/Linux systems or a scheduled task on Windows systems. Ensure you time the start of the cron job or scheduled task to after the online backup will have completed (an online backup will usually take minutes at most, depending on the size of the data file).

It is recommended to stagger the scheduling of online backups across peer data DSAs. For example, if you have two CA Directory servers, schedule the data DSA on server 1 to online backup at midnight and then every two hours, and the peer data DSA on server 2 to backup at 1am and then every two hours. On a data DSA on server 1, you might define:


dump dxgrid-db period 0 7200;

On a peer data DSA on server 2, you might define:

dump dxgrid-db period 3600 7200;

Note that when using horizontally partitioned data DSAs on separate servers, or data in a DSA that has a logical relationship to data hosted in a separate distributed DSA, you should schedule the online backups for those data DSAs at the same time.

An online backup can also be taken at any time by connecting to the data DSA via the DXconsole interface on the CA Directory server (i.e. telnet localhost <console-port>, where the console port and optional console connection details can be found in the data DSA knowledge file if using ASCII configuration, or via DXmanager if using XML configuration). Run the following command within the DXconsole interface:

dump dxgrid-db;
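As an illustrative sketch only, a UNIX/Linux script and cron entry that archive each online backup could look like the following; the archive location, script path and DSA name are assumptions to be adapted to your deployment:

#!/bin/sh
# archive_backup.sh (illustrative) - copy the latest online backup files
# (.zdb/.zoc/.zat) for the UserStore data DSA to a dated archive folder
STAMP=`date +%Y%m%d%H%M`
ARCHIVE=/backup/dxserver/$STAMP
mkdir -p "$ARCHIVE"
cp "$DXHOME"/data/Host1-UserStore.zdb "$ARCHIVE"
cp "$DXHOME"/data/Host1-UserStore.zoc "$ARCHIVE"
cp "$DXHOME"/data/Host1-UserStore.zat "$ARCHIVE"

# Example crontab entry: run 15 minutes after each two-hourly online dump
# (cron uses local time, whereas the dump schedule is GMT, so adjust as needed)
15 0-23/2 * * * /opt/CA/scripts/archive_backup.sh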

How to Implement a Data DSA Disaster Recovery Plan


There are a number of methods that could be implemented; we will discuss the quickest, best-practice method using online backup files. Remember that a disaster recovery plan is only valid if it is tested regularly. Don't wait until you need to implement it in your production environment to verify that your alerting mechanism for a failed DSA or machine (e.g. using SNMP) is working as expected and that you can run through the process defined below successfully. Test your data DSA disaster recovery plan in your pre-production environment regularly.

Let's consider an example of three servers named Host1, Host2 and Host3, each hosting a data DSA named UserStore. The data DSA on Host1 fails and you need to implement your data DSA disaster recovery plan to get that data DSA back online.

When using multiwrite-DISP or DISP replication


The same disaster recovery process and logic defined for multiwrite-DISP replication can be applied to a deployment using DISP replication. The following sequence defines the steps, from time t0 to time t4, that you may take when deciding to recover from the peer data DSA on Host3 using multiwrite-DISP or DISP replication:


CA Directory r12 data DSA disaster recovery plan when using multiwrite-DISP or DISP replication (peer DSAs: Host1-UserStore, Host2-UserStore, Host3-UserStore):

t0: DSA Host1-UserStore fails and requires DSA disaster recovery.

t1: On Host1, Host2 and Host3, run: dxdisp Host1-UserStore

t2: Perform an online backup of Host3-UserStore, creating:
    Host3-UserStore.zdb
    Host3-UserStore.zoc
    Host3-UserStore.zat
    Remove the Host1-UserStore data files, including the .dp and .tx files. Rename:
    Host3-UserStore.zdb -> Host1-UserStore.db
    Host3-UserStore.zoc -> Host1-UserStore.oc
    Host3-UserStore.zat -> Host1-UserStore.at

t3: Copy the renamed files to Host1-UserStore.

t4: Remove the Host1-UserStore PID file if it exists, then start the Host1-UserStore DSA.


Notes: There is a need to run dxdisp for the recovering DSA on all machines (including the recovering machine) at step t1 BEFORE doing the online backup. This prevents the peer DSAs on those machines from forwarding updates to the recovering machine, once the DSA is started, from before the time the backup was performed. The effect of running dxdisp Host1-UserStore on Host2, Host3, etc. is to update the date/time of the last communication that Host2-UserStore and Host3-UserStore had with Host1-UserStore to "now" in the .dp/.dx file.
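As an illustrative command-line sketch of the same sequence (host names, paths and the use of scp are assumptions based on the example scenario; confirm the console port and the dxserver start usage against your own environment and the Administration Guide):

# t1 - on Host1, Host2 and Host3: mark Host1-UserStore as contacted "now"
dxdisp Host1-UserStore

# t2 - on Host3: take an online backup of the healthy peer, either waiting for
#      the scheduled dump or triggering one via DXconsole (dump dxgrid-db;)

# t2 - on Host1: remove the old data files of the failed DSA (including .dp and .tx)
rm $DXHOME/data/Host1-UserStore.db $DXHOME/data/Host1-UserStore.oc \
   $DXHOME/data/Host1-UserStore.at $DXHOME/data/Host1-UserStore.dp \
   $DXHOME/data/Host1-UserStore.tx

# t2/t3 - on Host3: rename the backup files and copy them to Host1
# (adjust the destination path to Host1's data location)
cp $DXHOME/data/Host3-UserStore.zdb Host1-UserStore.db
cp $DXHOME/data/Host3-UserStore.zoc Host1-UserStore.oc
cp $DXHOME/data/Host3-UserStore.zat Host1-UserStore.at
scp Host1-UserStore.db Host1-UserStore.oc Host1-UserStore.at host1:$DXHOME/data/

# t4 - on Host1: clear any stale PID file and start the recovered DSA
rm $DXHOME/pid/Host1-UserStore*
dxserver start Host1-UserStore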

When using multiwrite replication


Vanilla multiwrite replication relies on in-memory queues to propagate updates to peer DSAs when a DSA is recovering. If an in-memory queue overflows for a peer, there is a need to implement disaster recovery for that peer DSA, in addition to the other use cases defined previously (e.g. a machine or OS crash when not using a transaction log file). The process is simply to copy a current data file backup from an online peer to the recovering peer machine and restart the peer DSA.

It is also recommended to clear the in-memory queues that any peer DSAs hold for the recovering DSA before taking the backup; this reduces the start-up time of the recovering DSA if a large number of pending updates were queued by peer DSAs from before the backup was created. To do this, connect to the online peer DSA(s) via the DXconsole interface on the CA Directory server (i.e. telnet localhost <console-port>, where the console port and optional console connection details can be found in the data DSA knowledge file if using ASCII configuration, or via DXmanager if using XML configuration). Run the following command within the DXconsole interface:
clear multi-write-queue <dsa-name>;

where dsa-name is the name of the DSA being recovered.



CA Directory r12 data DSA disaster recovery plan when using multiwrite replication (peer DSAs: Host1-UserStore, Host2-UserStore, Host3-UserStore):

t0: DSA Host1-UserStore fails and requires DSA disaster recovery.

t1: On Host2 and Host3, connect to DXconsole (telnet localhost <console port>) and run: clear multi-write-queue Host1-UserStore;

t2: Perform an online dump of Host3-UserStore, creating:
    Host3-UserStore.zdb
    Host3-UserStore.zoc
    Host3-UserStore.zat
    Remove the Host1-UserStore data files, including the .dp and .tx files. Rename:
    Host3-UserStore.zdb -> Host1-UserStore.db
    Host3-UserStore.zoc -> Host1-UserStore.oc
    Host3-UserStore.zat -> Host1-UserStore.at

t3: Copy the renamed files to Host1-UserStore.

t4: Remove the Host1-UserStore PID file if it exists, then start the Host1-UserStore DSA.

Notes: It is advised to clear the multiwrite queue for the recovering DSA on all machines (except the recovering machine) at step t1 BEFORE doing the online backup. This ensures that any updates applied well before the online backup time in step t2 are not replicated when the recovering DSA is started in step t4.
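For illustration, the only step that differs from the multiwrite-DISP sequence sketched earlier is t1, where the pending queue for the failed peer is cleared through DXconsole on each surviving peer rather than running dxdisp (commands as defined above; the console port comes from the knowledge file or DXmanager):

# t1 - on Host2 and Host3: clear the queued updates held for the failed peer
telnet localhost <console-port>
clear multi-write-queue Host1-UserStore;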

