
Version 61

Created on: Oct 25, 2010 9:29 AM by Lang Hardison - Last Modified: Feb 23, 2011 9:28 AM by Lang Hardison

Troubleshooting CUCM Database Replication in Linux Appliance Model


Contents
- Replication Architecture
  - CUCM 5.x Replication Architecture
  - CUCM 6.x, 7.x, 8.x Architecture
  - User Facing Features
- Checking Current Replication Status
  - The Real Time Monitoring Tool - Database Summary Page
  - Cisco Unified Reporting - Database Status Report
  - CLI Commands to Check Replication State
  - What does the Status (0,1,2,3,4) Mean?
  - Verifying the Status Number Based on Descriptions
  - Cisco Unified Reporting - Replication Server List (cdr list serv)
- Next Steps
  - Server/Cluster Connectivity
  - Configuration Files (/etc/hosts, .rhosts, sqlhosts)
  - DNS (Optional)
- Using Replication Commands
  - Meaning of each command
  - Examples
    - Reestablish logical CDR connections to all servers for replication
    - Reestablish logical CDR connection to a single node
    - Logical connections are established but tables are out of sync
    - Corrupt syscdr causing replication failure
- Replication Steps
  - List of steps
  - Replication Flow Chart
- Replication Timeout Design Estimation


CUCM Database Replication is an area in which Cisco customers and partners have asked for more in-depth training, so that they can properly assess a replication problem and potentially resolve it without involving TAC. This document discusses the basics needed to effectively troubleshoot and resolve replication issues.

Replication Architecture
CUCM 5.x Replication Architecture
Communications Manager 5.x has a replication topology similar to CallManager 4.x: both follow a hub and spoke topology. The publisher establishes a connection to every server in the cluster, while each subscriber establishes a connection only to its local database and to the publisher. As illustrated in the figure below, only the publisher's database is writable; each subscriber contains a read-only database. During normal operation the subscribers do not use their read-only copy of the database; they use the publisher's database for all read and write operations. If the publisher goes down or becomes inaccessible, the subscribers use their local copy of the database. Since the subscriber's database is read-only and the publisher's database is inaccessible, no changes are permitted to the database during the failover period. Changes in architecture were implemented in later versions to address this limitation.


CUCM 6.x, 7.x, 8.x Architecture


Replication in Communications Manager 6.x, 7.x, and 8.x is no longer a hub and spoke topology but a fully meshed topology, as seen in the figure below. The publisher and each subscriber connect logically to every server in the cluster, and each server can update all servers (including the publisher) for user facing features such as Call Forward All. The full list of user facing features is in the next section. This change in topology overcomes the previous limitation in replication architecture: changes can now be made to local subscriber databases for user facing features even while the publisher is inaccessible. Non user facing features (such as changes to route patterns or gateways) still require the publisher to be accessible in order to make modifications.

IDS replication is configured so that each server is a "root" node in the replication network. Each server maintains its own queue of changes made on the local server to send to the other servers in the replication network. A root node will not pass a replication change on to another root node, so the only way for a change made on a particular server to reach the other servers is for that server to replicate it itself. In other words, a change made on "A" will be sent to "B" by "A", but "B" will not send that same change on to "C". Server "A" must send it to "C" and all other nodes.

User Facing Features


Below is the list of user facing features that can be updated by a subscriber, and therefore updated while the publisher is down:

- Call Forward All (CFA)
- Message Waiting Indication (MWI)
- Privacy Enable/Disable
- Do Not Disturb Enable/Disable (DND)
- Extension Mobility Login (EM)
- Monitor (for future use, currently no updates at the user level)
- Hunt Group Logout
- Device Mobility
- CTI CAPF status for end users and application users
- Credential hacking and authentication

Checking Current Replication Status


The first step in fixing replication properly is to identify the current state of replication in the cluster. Let's begin by documenting the places where you can check the replication state.

The Real Time Monitoring Tool - Database Summary Page


In RTMT, choose CallManager -> Service -> Database Summary.

Cisco Unified Reporting - Database Status Report


Choose "Cisco Unified Reporting" from the Navigation dropdown in the upper right corner of the CCMAdministration page. Then choose "Database Status Report", and generate a new report.

CLI Commands to Check Replication State


admin:show perf query class "Number of Replicates Created and State of Replication"
==>query class :
 - Perf class (Number of Replicates Created and State of Replication) has instances and values:
ReplicateCount -> Number of Replicates Created = 348
ReplicateCount -> Replicate_State = 2
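If you collect this output from several nodes, a small parser can save time comparing them. The sketch below (plain Python; the helper name is made up for illustration) extracts the two counters from the raw CLI text:

```python
import re

def parse_replication_perf(cli_output):
    """Pull the replicate count and Replicate_State out of the
    'show perf query class' output shown above."""
    count = re.search(r"Number of Replicates Created\s*=\s*(\d+)", cli_output)
    state = re.search(r"Replicate_State\s*=\s*(\d+)", cli_output)
    return (
        int(count.group(1)) if count else None,
        int(state.group(1)) if state else None,
    )

sample = """ReplicateCount -> Number of Replicates Created = 348
ReplicateCount -> Replicate_State = 2"""
print(parse_replication_perf(sample))  # (348, 2)
```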

What does the Status (0,1,2,3,4) Mean?


After checking the current state of replication using one of the previous methods, we can use the table below to understand what each state means.

Value 0 - Initialization State: This state indicates that replication is in the process of trying to set up. Being in this state for a period longer than an hour could indicate a failure in setup.

Value 1 - Number of Replicates not correct: This state is rarely seen in 6.x and 7.x, but in 5.x it can indicate that setup is still in process. Being in this state for a period longer than an hour could indicate a failure in setup.

Value 2 - Replication is good: Logical connections have been established and tables match the other servers in the cluster.

Value 3 - Tables are suspect: Logical connections have been established but we are unsure whether tables match. In 6.x and 7.x all servers could show state 3 if one server is down in the cluster. This can happen because the other servers are unsure whether there is an update to a user facing feature that has not been passed from that sub to the other devices in the cluster.

Value 4 - Setup Failed / Dropped: The server no longer has an active logical connection to receive database tables. No replication is occurring in this state.
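The table above is just a lookup from the state number to its meaning; for scripting against several nodes, it can be encoded directly. The names below are illustrative, not a Cisco API:

```python
# The state table above as a lookup.
REPLICATION_STATES = {
    0: "Initialization State",
    1: "Number of Replicates not correct",
    2: "Replication is good",
    3: "Tables are suspect",
    4: "Setup Failed / Dropped",
}

def describe_state(value):
    """Translate a Replicate_State number into its meaning."""
    return REPLICATION_STATES.get(value, "Unknown state")

def needs_attention(value):
    """Only state 2 is healthy; anything else deserves a closer look."""
    return value != 2
```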

The logical connections discussed above are the connections seen in the topology diagrams at the beginning of this document. The way we look at these logical connections is through cdr list serv (Cisco Database Replicator list of server connections).

Verifying the Status Number Based on Descriptions


It is important to verify the state of replication reported by any of these three methods. For example, in some instances a server could show that it is in state 3 while the 'cdr list serv' output shows that the logical connections are in "Dropped" state, which is effectively the same as the server being in state 4. The best place to see these logical connections is the Cisco Unified Reporting Database Status Report. The report displays a 'replication server list' showing the 'cdr list serv' output; cdr list serv is the command that would be used under root access by Cisco TAC to check the current list of replication connections. Note: the Cisco Database Replicator (CDR) list of servers is in no way related to Call Detail Records (also known as CDR).

Cisco Unified Reporting - Replication Server List (cdr list serv)

Next Steps
Now that the state of replication has been identified, if the servers are in a state other than 2 it is necessary to gather additional information before taking further action. Check the other replication requirements below before attempting to solve the replication problem. Failure to complete this assessment prior to attempting any solution could result in hours of wasted time and energy.

Server/Cluster Connectivity
Confirm connectivity between nodes. In 5.x it is necessary to check connectivity between each subscriber node and the publisher node. In 6.x and later, because of the fully meshed topology, it is necessary to check connectivity between every node in the cluster. This is important to keep in mind if an upgrade has taken place from 5.x or earlier, as additional routes may need to be added and additional ports opened to allow communication between subs in the cluster. The documentation on checking connectivity is linked below; the TCP and UDP Port Usage documents describe which ports need to be opened on the network. CUCM 8.5: http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/port/8_5_1/portlist851.html CUCM 8.0: http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/port/8_0_2/portlist802.html
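To spot-check reachability between nodes, a simple TCP probe from an admin workstation can confirm that replication-related ports answer. This is only a sketch: the port numbers below are placeholders, and the authoritative list for your version comes from the TCP and UDP Port Usage documents linked above.

```python
import socket

# Placeholder port list; take the real intracluster port list for your
# CUCM version from the Port Usage document linked above.
PORTS_TO_CHECK = [1500, 1501, 8500]

def probe_node(host, ports=PORTS_TO_CHECK, timeout=3.0):
    """Return {port: reachable?} for one cluster node."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

A port that answers from one subnet but not another points at a firewall or a missing route between subs.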

Configuration Files
Check all the hosts files that will be used when setting up replication. These files play a role in what each server will do and which servers it will trust. The files in question are listed below.

/etc/hosts - This file is used to locally resolve hostnames to IP addresses. It should include the hostname and IP address of all nodes in the cluster, including CUPS nodes.

/home/informix/.rhosts - A list of hostnames which are trusted to make database connections.

$INFORMIXDIR/etc/sqlhosts - Full list of CCM servers for replication (populated from the process node table). Servers here should have the correct hostname and node id. This is used to determine to which servers replicates are pushed.

/etc/hosts
Below is the /etc/hosts file as displayed in Unified Reporting. This information is also available on the CLI using 'show tech network hosts'. Cluster Manager populates this file, and it is used for local name resolution.
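A quick way to cross-check a hosts file is to parse it and confirm every cluster node appears. The sketch below uses hypothetical node names; feed it the actual file contents and your cluster's server list.

```python
def parse_hosts(text):
    """Build hostname -> IP from /etc/hosts-style text, skipping comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            mapping[name.lower()] = ip
    return mapping

def missing_nodes(hosts_text, expected_nodes):
    """List cluster nodes (including CUPS nodes) absent from the file."""
    known = parse_hosts(hosts_text)
    return [n for n in expected_nodes if n.lower() not in known]

sample = """# cluster hosts (hypothetical names)
10.10.10.1  cucm-pub
10.10.10.2  cucm-suba"""
print(missing_nodes(sample, ["cucm-pub", "cucm-suba", "cucm-subb"]))
# ['cucm-subb']
```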


.rhosts file

sqlhosts

DNS (Optional)
If DNS is configured on a particular server, both forward and reverse DNS must resolve correctly. Informix uses DNS very frequently, and any failure or improper configuration in DNS can cause issues for replication.

Verifying DNS: The best command to verify DNS is utils diagnose test. This command can be run on each server to verify forward and reverse DNS under the validate network portion of the output (it will report failed DNS on error). In versions that do not yet have this command, use the command utils network host <ip/hostname> to check forward and reverse name resolution.
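The forward-then-reverse check can also be scripted from any machine with the same resolver configuration as the cluster. This is a rough equivalent of what the validate-network check looks for, not the actual Cisco diagnostic:

```python
import socket

def dns_consistent(hostname):
    """Check forward then reverse resolution for one node and compare names.

    Returns (ok, detail). A lookup failure or name mismatch is the kind
    of condition that shows up as failed DNS in the diagnostics.
    """
    try:
        ip = socket.gethostbyname(hostname)
    except socket.gaierror as exc:
        return False, "forward lookup failed: %s" % exc
    try:
        reverse_name, _, _ = socket.gethostbyaddr(ip)
    except OSError as exc:
        return False, "reverse lookup of %s failed: %s" % (ip, exc)
    # compare short hostnames, tolerating differing domain suffixes
    ok = reverse_name.split(".")[0].lower() == hostname.split(".")[0].lower()
    return ok, "%s -> %s -> %s" % (hostname, ip, reverse_name)
```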

Using Replication Commands


Meaning of each command
After verifying that we have good connectivity and that all the underlying hosts files are correct and matching across the cluster, it might be necessary to use CLI replication commands to fix the replication problem. There are several commands which can be used, so it is important to use the correct command under the correct circumstances. The following list gives each command and its function. Certain commands are not available in every version of CUCM.

utils dbreplication stop - Normally run on a subscriber. Stops currently running replication, restarts A Cisco DB Replicator, and deletes the marker file used to signal replication to begin. This can be run on each node of the cluster. In 7.1.2 and later, utils dbreplication stop all can be run on the Publisher node to stop replication on all nodes.

utils dbreplication repair / repairtable / repairreplicate - Introduced in 7.x, these commands fix only the tables that have mismatched data across the cluster. This mismatched data is found by issuing utils dbreplication status. These commands should only be used if logical connections have been established between the nodes.

utils dbreplication reset - Always run from the publisher node; used to reset replication connections and do a broadcast of all the tables. This can be executed for one node by utils dbreplication reset nodename, or on all nodes by utils dbreplication reset all.

utils dbreplication dropadmindb - Run on a publisher or subscriber, this command is used to drop the syscdr database. This clears out configuration information from the syscdr database, which forces the replicator to reread the configuration files. Later examples talk about identifying a corrupt syscdr database.

utils dbreplication clusterreset - Always run from the publisher node; used to reset replication connections and do a broadcast of all tables. Error checking is ignored. Following this command, utils dbreplication reset all should be run in order to get correct status information. Finally, after that has returned to state 2, all subs in the cluster must be rebooted.

utils dbreplication runtimestate - Available in 7.x and later, this command shows the state of replication as well as the state of replication repairs and resets. This can be used as a helpful tool to see which tables are replicating in the process.

utils dbreplication forcedatasyncsub - This command forces a subscriber to have its data restored from data on the publisher. Use this command only after the utils dbreplication repair command has been run several times and the utils dbreplication status output still shows non-dynamic tables out of sync.

Examples
Reestablish logical CDR connections to all servers for replication
Getting on, I would first check RTMT or Unified Reporting in order to identify the current state of replication. I choose to ask for the Database Status report, as the customer is on a version that has it available. In the report I find the following:

- The cluster is a 5 node cluster.
- The publisher is in Replication State = 3.
- All subscribers in the cluster are in Replication State = 4.
- In the Replication Server List, only the publisher shows local as connected. This verifies that, based on our descriptions, the servers are indeed in the states listed above.

We now do some other checks to prepare to fix replication. We verify in the report that all of the hosts files look correct, and we have already verified per the link (LINKHERE) that all connectivity is good and that DNS is either not configured or is working correctly. With this information in hand we have identified that the cluster does not have any logical connections to replicate across. Thus the recommendation to the customer would be to follow the most basic process, which fixes about 50 percent of replication cases:

1. utils dbreplication stop on all subscribers. This command can be run on all subscribers at the same time but needs to complete on all subscribers prior to being run on the publisher. The amount of time this command takes to return is based on your cluster's repltimeout. This can be seen through the command show tech repltimeout and by default is 300 seconds (5 minutes). At the end of this document I provide a calculation for determining what you should set your repltimeout to (via utils dbreplication setrepltimeout) and how this value affects replication. (If utils dbreplication stop all is available (7.x and later), then steps 1 and 2 can be accomplished with this one command.)
2. utils dbreplication stop on the publisher. Again, this can be done through utils dbreplication stop all if available. This also waits the repltimeout as noted above.
3. utils dbreplication reset all - This command will take an hour or longer to complete, depending on your cluster. You can monitor the status through utils dbreplication runtimestate or through the procedure following the examples portion of this document.

Reestablish logical CDR connection to a single node


Getting on, I would first check RTMT or Unified Reporting in order to identify the current state of replication. I choose to ask for the Database Status report, as the customer is on a version that has it available. In the report I find the following:

- The cluster is a 3 node cluster.
- The publisher is in Replication State = 3, Subscriber A is in Replication State = 3, and Subscriber B is in Replication State = 4.
- In the Replication Server List, the publisher shows local as connected and Subscriber A as connected. Subscriber B is not listed. (Note: the procedure would be the same if Subscriber B showed "Dropped".) This verifies that, based on our descriptions, the servers are indeed in the states listed above.

We now do some other checks to prepare to fix replication. We verify in the report that all of the hosts files look correct, and we have already verified per the link (LINKHERE) that all connectivity is good and that DNS is either not configured or is working correctly.

1. utils dbreplication stop on Subscriber B. This command can be run on all subscribers at the same time but needs to complete on all subscribers prior to being run on the publisher. The amount of time this command takes to return is based on your cluster's repltimeout. This can be seen through the command show tech repltimeout and by default is 300 seconds (5 minutes).
2. utils dbreplication reset nodename from the publisher. In our example I would type utils dbreplication reset subscriberB on the publisher and then wait for it to recreate the logical connection to subscriberB and broadcast the tables to that subscriber. To follow where in the procedure it is, use utils dbreplication runtimestate or the procedure following the examples section.
3. If after this we were still unable to fix the issue, we may fall back to the procedure on the previous page (utils dbreplication stop all from the publisher (or on subs, then pub), and then utils dbreplication reset all).

Logical connections are established but tables are out of sync


Getting on, I would first check RTMT or Unified Reporting in order to identify the current state of replication. I choose to ask for the Database Status report, as the customer is on a version that has it available. In the report I find the following:

- All nodes in the cluster are in Replication State = 3.
- In the Replication Server List, all servers show as connected (each server shows local for itself but connected from the other servers' perspective). This verifies that, based on our descriptions, the servers are indeed in the states listed above.

We now do some other checks to prepare to fix replication. We verify in the report that all of the hosts files look correct, and we have already verified per the link (LINKHERE) that all connectivity is good and that DNS is either not configured or is working correctly.

1. utils dbreplication status from the publisher. This command confirms that we indeed have mismatched tables. Based on our earlier explanations, this means we do not need to do any reset, as resets fix connections while repairs fix mismatched tables. Once we have used the status output to confirm that tables differ, we can proceed to fix them.
2. utils dbreplication repair from the publisher. Which repair command to use depends on which tables, and how many, are out of sync. If you have a megacluster or a very large deployment, a repair can take a long time (a day in some instances, depending on how it failed), so the command you pick matters in those circumstances. On smaller deployments (less than 5,000 phones, for example), utils dbreplication repair all is normally fine. If only one node has a bad table, use utils dbreplication repair nodename. If only one table is bad, use utils dbreplication repairtable (node/all) to fix it on the problem server or on the whole cluster.
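The decision logic in step 2 can be sketched as a small helper. The thresholds and command strings mirror the guidance above, but the function name and input shape are made up for illustration; always sanity-check the suggestion against your cluster size.

```python
def suggest_repair(bad_tables_by_node, phone_count):
    """Suggest a repair command from 'utils dbreplication status' findings.

    bad_tables_by_node maps node name -> list of mismatched tables.
    """
    nodes = sorted(n for n, tables in bad_tables_by_node.items() if tables)
    tables = {t for ts in bad_tables_by_node.values() for t in ts}
    if not nodes:
        return None  # nothing out of sync
    if len(tables) == 1:
        # one bad table: repair just that table, on one node or cluster-wide
        scope = nodes[0] if len(nodes) == 1 else "all"
        return "utils dbreplication repairtable %s %s" % (tables.pop(), scope)
    if len(nodes) == 1:
        return "utils dbreplication repair %s" % nodes[0]
    if phone_count < 5000:
        return "utils dbreplication repair all"
    # very large deployment: repair node by node to bound the run time
    return ["utils dbreplication repair %s" % n for n in nodes]
```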


Corrupt syscdr causing replication failure


In the report I find the following:

- The cluster is a 3 node cluster.
- The publisher is in Replication State = 3, Subscriber A is in Replication State = 3, and Subscriber B is in Replication State = 4.
- In the Replication Server List, the publisher shows local as connected and Subscriber A as connected. Subscriber B is not listed. (Note: the procedure would be the same if Subscriber B showed "Dropped".) This verifies that, based on our descriptions, the servers are indeed in the states listed above.

We now do some other checks to prepare to fix replication. We verify in the report that all of the hosts files look correct, and we have already verified per the link (LINKHERE) that all connectivity is good and that DNS is either not configured or is working correctly. Here are the things we tried: utils dbreplication stop on Subscriber B, then utils dbreplication reset SubscriberB from the publisher. When that still failed, we tried utils dbreplication stop all, followed by utils dbreplication reset all. After waiting an hour, our statuses have once again reverted to Replication State = 3 for the publisher and Subscriber A, with Subscriber B in state 4. Our Replication Server List again shows that we do not have a connection to Subscriber B, even after using the reset commands to try to set that connection back up. So the question becomes: what do I do now?

At this point I would first take a step back and make sure all the services are running correctly on Subscriber B. From the CLI of Subscriber B, confirm that the following services are started using the command utils service list: A Cisco DB, A Cisco DB Replicator, Cisco Database Layer Monitor. If there are any issues with these services, we must first work on getting them to start (try manually via utils service start servicename). If everything is OK there, then we need to confirm whether syscdr (the underlying component for replication) could be corrupt.

In order to do this, run utils dbreplication status from the problem node, Subscriber B. In the output at the top we will be looking for two distinct states:

1. "Enterprise Replication not active 62" - Normal state; this means that replication has not yet been defined on the node.
2. Dashes only (---------------------------------------------) at the top of the output. This could indicate a corrupt syscdr. This output can only be checked from the utils dbreplication status command, as the Cisco Unified Reporting Tool Database Status Report displays just dashes for both of the above states (CSCti84361 has been opened to address that issue).

If from utils dbreplication status you see that you are currently in state 2 above (just a dashed header without any servers listed), here is what you need to do to fix that issue. If the following does not resolve the issue, you will probably need a TAC case.

1. utils dbreplication dropadmindb - Issued on Subscriber B to restart the syscdr on that node.
2. utils dbreplication stop - Issued on Subscriber B.
3. utils dbreplication reset subscriberB - To once again try to reestablish the connection after fixing the corrupted syscdr.

Replication Steps
These steps are done automatically (by replication scripts) when the system is installed. When we do a utils dbreplication reset all they get done again.

List of steps
1. Define Pub - Set it up to start replicating.
2. Define template on pub and realize it (tells the pub which tables to replicate).
3. Define each Sub.
4. Realize template on each Sub (tells the sub which tables they will get/send data for).
5. Sync data using "cdr check" or "cdr sync" (older systems use sync).

Replication Flow Chart

It is possible to determine where the replication setup is in the process using commands, log files, and the Database Status report. The command utils dbreplication runtimestate is most helpful in identifying the current runtime state of replication setup. In versions where it is not available, or as a supplement, here is how to follow replication using logs.

Publisher syscdr/define
Confirm that the publisher has brought its own syscdr back up (in Cisco Unified Reporting -> Database Status -> Replication Server List, confirm that you see the publisher's local connection up). This should occur within a few minutes of the reset. You can also confirm this in the informix log on that box: "file view activelog cm/log/informix/ccm.log" from the CLI.

Publisher Define


Subscriber define
You can also check the output of file list activelog cm/trace/dbl date detail. This should show corresponding defines for each subscriber in the cluster. The define is shown in white below.

If we have a define for every server following a reset, then things are most likely looking good. Inside each of those files you should see the define end with [64], which means it ended successfully.

After all subscribers have been defined, we then wait the repltimeout (check it with show tech repltimeout); replication will then do a broadcast that actually pushes the replicates across. The broadcast is shown in yellow. The cdr_broadcast log contains which tables are being replicated and the result. Below is the list, then an excerpt from the cdr_broadcast log (broadcast shown in the yellow box).

CDR Broadcast Log Excerpt


With this you should be able to follow and fix replication cases. Below is additional information on how to estimate the repltimeout you should configure on the cluster, as mentioned earlier in the document.

Replication Timeout Design Estimation


With replication, one of the questions we see from larger customers is: how do I know what my repltimeout should be? As discussed earlier in the document, you can check your repltimeout through the command show tech repltimeout. You can also change the value from the default (300 seconds, or 5 minutes) using the command utils dbreplication setrepltimeout timeout. The reason for changing your replication timeout is to prevent replication from having to occur in two batches. By "two batches" we mean that we run out of time (from the timeout) to get all of the defines done in the cluster prior to doing our broadcast. For example, we may define the first 5 servers, create the logical connections to those nodes, and do a broadcast to only those 5 servers. We must then wait for that broadcast to complete before we can define the rest of the servers and do a second broadcast to those nodes. This makes replication take longer than necessary to set up. Here is the estimate we use to calculate what your repltimeout should be:

Servers 1-5 = 1 minute per server
Servers 6-10 = 2 minutes per server
Servers >10 = 3 minutes per server

Example: 12 servers in the cluster: servers 1-5 * 1 min = 5 min, + servers 6-10 * 2 min = 10 min, + servers 11-12 * 3 min = 6 min; repltimeout should be set to 21 minutes.

This calculation is for all nodes, not just those running the ccm service (as they still get replicated to). Another important note: this is an estimate and may need to be adjusted based on other factors in your environment. After setting the value, you can do a utils dbreplication stop all followed by a utils dbreplication reset all to check whether it still required two batches (use the file list activelog cm/trace/dbl date detail command; in that list you should see all servers defined prior to the cdr_broadcast).
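The per-server estimate above reduces to a few lines. The function name is illustrative; it returns minutes, so convert to seconds before feeding the value to utils dbreplication setrepltimeout if your version takes the value in seconds.

```python
def estimate_repltimeout_minutes(node_count):
    """Apply the rule above: servers 1-5 cost 1 minute each,
    servers 6-10 cost 2 minutes each, and servers past 10 cost 3 each."""
    minutes = min(node_count, 5)
    minutes += max(min(node_count, 10) - 5, 0) * 2
    minutes += max(node_count - 10, 0) * 3
    return minutes

print(estimate_repltimeout_minutes(12))  # 21, matching the example above
```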
If you end up having a second batch (after the broadcast it does another define and broadcast for the missing servers), increase the repltimeout and try again.

This document was made possible by the help of a lot of individuals. The replication timeout section and a lot of the learning in this presentation come from the database development team, who always take time to help teach (thanks again, Nancy). Some of the screenshots in the presentation are from Sriram, and I had reviewers from TAC teams across the US (Sunny, Sriram, Shane, Wes, Ryan, Jason, Kenneth, Omar).
