Node fencing is a general concept used by computer clusters to forcefully remove a malfunctioning node from the cluster. This preventive technique is a necessary measure to ensure that no I/O from the malfunctioning node can be done, thus preventing data corruption and guaranteeing cluster integrity. There are many methods to fence a node, including, but not limited to, STONITH (Shoot The Other Node In The Head), SCSI-3 persistent reservation and SAN fabric fencing. Oracle Clusterware uses a STONITH-comparable¹ fencing method in which a local process initiates the removal of one or more nodes from the cluster. In addition to the traditional fencing method, from 11g Release 2 onwards Oracle Clusterware can use a remote node termination mechanism.
1. Oracle Clusterware
Starting with version 10g Release 1, Oracle introduced its own portable cluster software, Cluster Ready Services. This product was renamed in version 10g Release 2 to Oracle Clusterware; from 11g Release 2 onwards it is part of the Oracle Grid Infrastructure software. It provides basic clustering functionality (node membership, resource management, monitoring, etc.) to Oracle software components like Automatic Storage Management (ASM), Real Application Clusters (RAC), single instance databases, as well as any kind of failover application. One of the most important jobs of Oracle Clusterware is to ensure that only healthy nodes are part of the cluster; others should be forcefully removed. Oracle Clusterware will fence a node (cut off access to shared resources) in case of:
- Not being able to ping cluster peers via the network heartbeat
- Not being able to ping the cluster voting files
- OS scheduler problems, OS locked up in a driver or hardware (hung processes), CPU starvation, etc., which generally make the above mentioned ping operations impossible in a timely manner
- Crash of some important Oracle Clusterware processes
2. Cluster Interconnect
Cluster interconnect is a very important dedicated private network used for communication between all cluster nodes. High availability (redundancy) in this area is one of the most important tasks during system architecture design, covering all components involved, like NICs, switches, etc. High availability at OS/server level can be achieved with:
- OS dependent methods, like bonding on Linux, IPMP on Sun Solaris or Teaming on Windows
- the OS independent method Redundant Interconnect², available from 11.2.0.2 onwards (not available on Windows)
¹ It is not a typical STONITH technology with a remote-initiated node reset/power off.
² Uses two features internally: multicasting and link-local addresses.

info@trivadis.com . www.trivadis.com . Info-Tel. 0800 87 482 347 . Date 05.01.2011 . Seite 2 / 9
Cluster interconnect is used as a communication medium to ping³ the nodes with small heartbeat messages, making sure all of them are up and running. Network pings are performed by one of the core cluster processes, ocssd.bin (Cluster Synchronization Services).
[Figure: ocssd.bin processes on each node exchanging heartbeat pings over the cluster interconnect]
The network heartbeats are associated with a timeout called misscount, set from 11g Release 1 onwards to 30 sec.⁴
oracle@rac1:~/ [+ASM1] crsctl get css misscount
30
Failure of the cluster interconnect can be simulated by deactivating (e.g. ifdown ethx) all involved network interfaces.
oracle@rac2:~/ [+ASM1] oifcfg getif
bond0  192.168.122.0  global  public
bond1  10.10.0.0      global  cluster_interconnect
To prevent a split brain in case of a failed cluster interconnect and to guarantee the integrity of the cluster, Oracle Clusterware will fence at least one node. After losing connectivity and network heartbeats for 50% of the misscount interval (i.e. 15 sec.), you will observe the first error messages in the cluster alert log file indicating problems in this area. Losing heartbeats for 100% of the misscount interval will trigger node fencing. From the cluster alert log file:
[cssd(2864)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.920 seconds
[cssd(2864)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.900 seconds
[cssd(2864)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity
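The relationship between the misscount interval and the countdown shown in these messages can be sketched as follows. This is a minimal illustration, not Oracle code; the 30-second value is the default misscount from above:

```python
# Illustration of the misscount countdown (not Oracle code).
MISSCOUNT = 30.0  # default network heartbeat timeout in seconds (see above)

def heartbeat_status(seconds_missing):
    """Return (percent of the timeout interval elapsed,
    seconds remaining until the node is removed)."""
    percent = 100.0 * seconds_missing / MISSCOUNT
    remaining = MISSCOUNT - seconds_missing
    return percent, remaining

# After 15 s without heartbeats: 50% of the interval, removal in 15 s
print(heartbeat_status(15.0))   # (50.0, 15.0)
# After 27 s: 90% of the interval, removal in 3 s
print(heartbeat_status(27.0))   # (90.0, 3.0)
```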
³ Transmit heartbeat messages. Also used for other purposes, like RAC cache fusion, cluster control, etc.
⁴ From 11g Release 1 the same value on all operating systems. In 10g on Linux the timeout was set to 60 sec.
But how does Oracle Clusterware decide which node to kill in this particular failure scenario? How can Oracle avoid a race condition in which all cluster nodes fence each other? To make sure this cannot easily happen, an additional heartbeat mechanism must be used to allow the cluster stack to coordinate the reconfiguration with a broken cluster interconnect. Oracle uses additional heartbeat pings to the cluster voting files (see later), which keep a status of the cluster nodes. As long as every node can access the shared storage (i.e. the voting files), the cluster can coordinate the reconfiguration in a planned manner. Nodes which should leave the cluster will be asked to commit suicide. Technically, a kill block is written to the node's slot in every voting file. After reading a kill block, the ocssd.bin process on that particular node will commit suicide by resetting the node.
[CSSD][1108789568]clssnmvDiskEvict: Kill block write, file /dev/sdh flags 0x00010004, kill block unique 1292838821, stamp 1517864/1517864
[CSSD][1094216000]clssnmvDiskEvict: Kill block write, file /dev/sdj flags 0x00010004, kill block unique 1292838821, stamp 1517864/1517864
[CSSD][1104943424]clssnmvDiskEvict: Kill block write, file /dev/sdi flags 0x00010004, kill block unique 1292838821, stamp 1517864/1517864
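The kill-block protocol can be sketched conceptually as follows. The data layout (one dict per voting file, keyed by node ID) and the field names are invented for illustration; the real on-disk format is Oracle-internal:

```python
# Conceptual sketch of the kill-block protocol (layout invented for
# illustration; not the real on-disk format).
def write_kill_block(voting_files, node_id, stamp):
    """The surviving sub-cluster writes a kill block into the victim
    node's slot in every voting file."""
    for vf in voting_files:
        vf[node_id] = {"kill": True, "stamp": stamp}

def must_commit_suicide(voting_files, node_id):
    """Each node's ocssd.bin polls its own slot; finding a kill block
    means the node has to reset itself."""
    return any(vf.get(node_id, {}).get("kill", False) for vf in voting_files)

vfs = [{}, {}, {}]                       # three mirrored voting files
write_kill_block(vfs, node_id=1, stamp=1517864)
print(must_commit_suicide(vfs, 1))       # True  -> node 1 resets itself
print(must_commit_suicide(vfs, 2))       # False -> node 2 keeps running
```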
We could go further and ask: what if the voting files are also not accessible at that particular moment (i.e. I/O calls are not possible)? Well, this is not a problem; the cluster can handle this too (for details see the chapter about voting files). Generally, Oracle Clusterware uses two rules to choose which nodes should leave the cluster to assure cluster integrity:
- In configurations with two nodes, the node with the lowest ID will survive (the first node that joined the cluster); the other one will be asked to leave the cluster
- With more cluster nodes, the Clusterware will try to keep the largest sub-cluster running
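These two rules can be sketched as follows. Note that the tie-breaking among equally large sub-clusters by lowest node ID is an assumption made for this sketch:

```python
# Sketch of the survival rules (tie-breaking by lowest node ID is assumed).
def surviving_subcluster(subclusters):
    """subclusters: lists of node IDs that can still talk to each other
    after an interconnect failure. The largest sub-cluster survives;
    among equal sizes, the one containing the lowest node ID wins."""
    return max(subclusters, key=lambda sc: (len(sc), -min(sc)))

# Two-node cluster: the node with the lowest ID (first joiner) survives
print(surviving_subcluster([[1], [2]]))        # [1]
# Larger cluster: the largest sub-cluster keeps running
print(surviving_subcluster([[2, 3, 4], [1]]))  # [2, 3, 4]
```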
From 11.2.0.2 onwards, Oracle Clusterware first attempts a rebootless node fencing: instead of resetting the server, it tries to stop the CRS stack on the node gracefully. Afterwards the Oracle High Availability Services daemon (OHASD)⁵ will try to start the Cluster Ready Services (CRS) stack again; once the cluster interconnect is back online, all relevant cluster resources on that node will start automatically. Only if stopping the resources or processes generating I/O is not possible (hanging in kernel mode, I/O path, etc.) will the node be killed.
3. Voting Files
Voting files are another very important communication mechanism (in addition to the interconnect heartbeats) used by the ocssd.bin process. They keep a status of the cluster nodes and count votes during cluster reconfigurations. To avoid a single point of failure, it is recommended to mirror the files, using an odd number of them. Every node in the cluster must always be able to access the majority of the files (n/2+1, where n is the odd number of voting files), or it will be evicted from the cluster. With an odd number and the majority rule, there is a guarantee that every node remaining in the cluster can communicate with every other one through at least one common voting file. Heartbeat pings to the voting files are performed by the ocssd.bin process.
[Figure: the ocssd.bin process sending heartbeat pings to Voting File 1, Voting File 2 and Voting File 3]
Generally:
- a node cannot join the cluster if it cannot access the majority of the files
- a node must leave the cluster if it cannot communicate with the majority of the files
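The majority rule (n/2+1 with integer division) can be sketched as:

```python
# Sketch of the voting-file majority rule.
def required_majority(n_voting_files):
    """Minimum number of voting files a node must be able to access:
    n/2 + 1 with integer division."""
    return n_voting_files // 2 + 1

def node_may_stay(accessible, total):
    """A node stays in the cluster only while it accesses a majority."""
    return accessible >= required_majority(total)

print(required_majority(3))   # 2 -> with 3 files, 2 must stay accessible
print(node_may_stay(2, 3))    # True  -> the node may stay
print(node_may_stay(1, 3))    # False -> the node must leave the cluster
```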
⁵ New component in 11g Release 2.
From 11.2.0.1 onwards, Oracle supports placing voting files (as well as the OCR and the ASM spfile) directly in an ASM disk group⁶. The physical location of the voting files within the ASM disks is fixed, i.e. the cluster stack does not rely on a running ASM instance to access the files. The location of the files is visible in the ASM disk header (dumping a file out of ASM with dd is quite easy):
oracle@rac1:~/ [+ASM1] kfed read /dev/sdf | grep -E 'vfstart|vfend'
kfdhdb.vfstart:          96 ; 0x0ec: 0x00000060   <<- begin AU offset of the voting file
kfdhdb.vfend:           128 ; 0x0f0: 0x00000080   <<- end AU offset of the voting file
Heartbeat pings to the voting files are associated with the following timeouts:
- long I/O timeout (disktimeout = 200 sec.): used in all scenarios with a working cluster interconnect
- short I/O timeout (misscount 30 sec. − reboot latency 3 sec. = 27 sec.): used only during cluster reconfigurations (node join/leave events) and node evictions, to avoid a split brain with a broken cluster interconnect
[CSSD][1117743424] misscount 30 reboot latency 3
[CSSD][1117743424] long I/O timeout 200 short I/O timeout 27
[CSSD][1117743424] diagnostic wait 0 active version 11.2.0.2.0

oracle@rac1:~/ [+ASM1] crsctl get css disktimeout
200
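Which of the two timeouts applies can be sketched as follows, using the values from the configuration above (a simplification of the actual CSS logic):

```python
# Sketch of the voting-file I/O timeout selection (values from the
# configuration above; a simplification of the real CSS logic).
MISSCOUNT = 30        # seconds
REBOOT_LATENCY = 3    # seconds
DISKTIMEOUT = 200     # seconds

def voting_io_timeout(in_reconfiguration):
    """The short timeout applies only while the cluster is reconfiguring
    (node join/leave, eviction with a broken interconnect)."""
    if in_reconfiguration:
        return MISSCOUNT - REBOOT_LATENCY   # 27 s short I/O timeout
    return DISKTIMEOUT                      # 200 s long I/O timeout

print(voting_io_timeout(False))  # 200 -> normal operation
print(voting_io_timeout(True))   # 27  -> cluster reconfiguration
```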
The following example demonstrates the Oracle Clusterware behavior after losing connectivity to the voting files. In this configuration, a separate ASM disk group named GRID with normal redundancy has been used for the voting files, the OCR and the ASM spfile⁷.
oracle@rac1:~/ [+ASM1] crsctl query css votedisk
##  STATE    File Universal Id                  File Name   Disk group
--  -----    -----------------                  ---------   ----------
 1. ONLINE   3b10214c6d8a4f8bbfe493304ebf5684   (/dev/sdf)  [GRID]
 2. ONLINE   6bc0d20dc6344f6dbf0a08604647eeab   (/dev/sdg)  [GRID]
 3. ONLINE   4de13969fd4b4fecbfb007f4da233ad3   (/dev/sde)  [GRID]
Located 3 voting disk(s).
In the first step we cut off the I/O path to one voting file placed on the /dev/sde device:
[root@rac1 ~]# echo 1 > /sys/block/sde/device/delete
After the loss of device connectivity at the operating system level, the status of the voting file changed first to PENDOFFL (pending offline), then to OFFLINE; finally the disk was dropped from the GRID ASM disk group. From the ASM point of view this is still OK: two devices (failgroups) are left.
⁶ In releases prior to 11g Release 2, block/raw devices or a cluster file system were used as the location for voting files.
⁷ Trivadis Best Practice: separating these files from other data introduces more flexibility.
oracle@rac1:~/ [+ASM1] crsctl query css votedisk
##  STATE    File Universal Id                  File Name   Disk group
--  -----    -----------------                  ---------   ----------
 1. ONLINE   3b10214c6d8a4f8bbfe493304ebf5684   (/dev/sdf)  [GRID]
 2. ONLINE   6bc0d20dc6344f6dbf0a08604647eeab   (/dev/sdg)  [GRID]
Located 2 voting disk(s).
What happens to the node if we lose an additional device, i.e. lose the majority of the voting files? To demonstrate this, we will cut off the I/O path to the second voting file, placed on the /dev/sdf device:
[root@rac1 ~]# echo 1 > /sys/block/sdf/device/delete
Losing the next device (second failgroup) in the GRID disk group leads to losing access to the majority of the voting files, as well as to a forced dismount of the disk group on that node. The ocssd.bin process on this particular node will wait for the disktimeout value (200 sec. long I/O timeout, since heartbeat pings via the cluster interconnect still work) and then kill the node.
oracle@rac1:~/ [+ASM1] crsctl query css votedisk
##  STATE     File Universal Id                  File Name   Disk group
--  -----     -----------------                  ---------   ----------
 1. PENDOFFL  3b10214c6d8a4f8bbfe493304ebf5684   (/dev/sdf)  [GRID]
 2. ONLINE    6bc0d20dc6344f6dbf0a08604647eeab   (/dev/sdg)  [GRID]
Located 2 voting disk(s).
And the last question: what happens when the cluster loses the private network and access to the majority of the voting files at nearly the same time? The answer: it depends on the timing. In some cases all nodes will be evicted by a fast reboot; in others the Clusterware will use rebootless node fencing on one or all of the nodes. In any case, the integrity of your data is guaranteed!
4. Monitoring Processes
In addition to the ocssd.bin process, which is responsible, among other things, for the network and disk heartbeats, Oracle Clusterware 11g Release 2 uses two new monitoring processes, cssdagent and cssdmonitor⁸, which run with the highest real-time scheduler priority and are also able to fence a server.
oracle@rac2:~/ [+ASM2] ps -ef | grep cssd | grep -v grep
root      3583     1  0 11:45 ?  00:00:17 /u00/app/grid/11.2.0.2/bin/cssdmonitor
root      3595     1  0 11:45 ?  00:00:16 /u00/app/grid/11.2.0.2/bin/cssdagent
oracle    3607     1  1 11:45 ?  00:04:22 /u00/app/grid/11.2.0.2/bin/ocssd.bin
oracle@rac2:~/ [+ASM2] chrt -p 3595
pid 3595's current scheduling policy: SCHED_RR
pid 3595's current scheduling priority: 99
The cssdagent and cssdmonitor can reset a server in case of:
- a problem with the ocssd.bin process
- an OS scheduler problem, CPU starvation
- the OS being locked up in a driver or hardware (e.g. an I/O call)
Both of them are also associated with an undocumented timeout. In case the execution of the processes stops for more than 28 sec., the node will be evicted.
[root@rac1 ~]# kill -STOP 3595; sleep 27; kill -CONT 3814

[USRTHRD][1091856704] clsnproc_needreboot: Impending reboot at 90% of limit 28160; disk timeout 27980, network timeout 28160, last heartbeat from CSSD at epoch seconds 1292839498.814, 27521 milliseconds ago based on invariant clock 4294767479; now polling at 100 ms
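The watchdog behaviour implied by this log message can be sketched as follows. The 28160 ms limit and the 90% warning threshold are taken from the excerpt; the polling logic itself is an assumption for the sketch:

```python
# Sketch of the cssdagent/cssdmonitor watchdog logic implied by the log
# above (28160 ms limit and 90% threshold from the excerpt; the polling
# logic is an assumption, not Oracle code).
LIMIT_MS = 28160  # undocumented "network timeout" seen in the log message

def watchdog_action(ms_since_last_css_heartbeat):
    """Decide what the monitoring process would do after a given gap in
    heartbeats from the CSSD process."""
    if ms_since_last_css_heartbeat >= LIMIT_MS:
        return "reboot"
    if ms_since_last_css_heartbeat >= 0.9 * LIMIT_MS:
        return "impending reboot warning"
    return "ok"

print(watchdog_action(27521))   # warning, as in the log (90% = 25344 ms)
print(watchdog_action(28160))   # reboot
print(watchdog_action(1000))    # ok
```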
Node evictions initiated by either cssdagent or cssdmonitor will in most cases leave a small log file, useful for debugging, in the /etc/oracle/lastgasp directory. Changing the diagwait⁹ value, as was customary in versions prior to 11g Release 2 to collect more information for diagnosing node evictions, is no longer necessary.
6. Summary
Oracle Clusterware uses sophisticated and proven techniques to evict an unhealthy node from the cluster in order to guarantee its integrity. Nevertheless, node eviction is an exceptional case, mostly triggered by improper system design. In a well-designed system architecture every component should be redundant, thus reducing the possibility of node evictions. Good luck with the use of Trivadis know-how.

Robert Bialek
Trivadis GmbH
Lehrer-Wirth-Str. 4
D-81829 Munich
Internet: www.trivadis.com