
Oracle Clusterware Node Fencing: A Killing Glance Beyond The Surface

Robert Bialek Principal Consultant December 21, 2010

Node fencing is a general technique used by computer clusters to forcefully remove a malfunctioning node from the cluster. It is a necessary preventive measure: it ensures that the malfunctioning node can no longer perform any I/O, thus preventing data corruption and guaranteeing cluster integrity. There are many methods to fence a node including, but not limited to, STONITH (Shoot The Other Node In The Head), SCSI-3 persistent reservation and SAN fabric fencing. Oracle Clusterware uses a STONITH-comparable [1] fencing method in which a local process initiates the removal of one or more nodes from the cluster. From 11g Release 2 onwards, Oracle Clusterware can, in addition to this traditional fencing method, use a remote node termination mechanism.

1. Oracle Clusterware Overview

Starting with version 10g Release 1, Oracle introduced its own portable cluster software, Cluster Ready Services. The product was renamed to Oracle Clusterware in version 10g Release 2; from 11g Release 2 onwards it is part of the Oracle Grid Infrastructure software. It provides basic clustering functionality (node membership, resource management, monitoring, etc.) to Oracle software components like Automatic Storage Management (ASM), Real Application Clusters (RAC) and single instance databases, as well as to any kind of failover application. One of the most important jobs of Oracle Clusterware is to ensure that only healthy nodes are part of the cluster; all others should be forcefully removed. Oracle Clusterware will fence a node (cut off its access to shared resources) in case of:
- not being able to ping cluster peers via the network heartbeat
- not being able to ping the cluster voting files
- OS scheduler problems, an OS locked up in a driver or hardware (hung processes), CPU starvation, etc., which generally make the above-mentioned ping operations impossible in a timely manner
- a crash of some important Oracle Clusterware processes

2. Cluster Interconnect

The cluster interconnect is a very important dedicated private network used for communication between all cluster nodes. Making it highly available (redundant), including all components involved such as NICs and switches, is one of the most important tasks during system architecture design. High availability at the OS/server level can be achieved with:
- OS-dependent methods, like bonding on Linux, IPMP on Sun Solaris or teaming on Windows
- an OS-independent method, Redundant Interconnect [2], available from 11.2.0.2 onwards (not available on Windows)

[1] It is not a typical STONITH technology with a remotely initiated node reset/power-off.
[2] Uses two features internally: multicasting and link-local addresses.

info@trivadis.com . www.trivadis.com . Info-Tel. 0800 87 482 347 . Date 05.01.2011

The cluster interconnect is used as a communication medium to ping [3] the nodes with small heartbeat messages, making sure all of them are up and running. Network pings are performed by one of the core cluster processes, ocssd.bin (Cluster Synchronization Services).

[Figure: the ocssd.bin processes of the cluster nodes exchange heartbeat pings over the cluster interconnect]

The network heartbeats are associated with a timeout called misscount, which has been set to 30 sec. since 11g Release 1 [4].
oracle@rac1:~/ [+ASM1] crsctl get css misscount
30

A failure of the cluster interconnect can be simulated by deactivating (e.g. ifdown ethx) all involved network interfaces.
oracle@rac2:~/ [+ASM1] oifcfg getif
bond0 192.168.122.0 global public
bond1 10.10.0.0 global cluster_interconnect

To prevent a split brain in case of a failed cluster interconnect and to guarantee the integrity of the cluster, Oracle Clusterware will fence at least one node. Once connectivity is lost and network heartbeats have been missing for 50% of the misscount interval (i.e. 15 sec.), the first error messages indicating the problem appear in the cluster alert log file. Missing heartbeats for 100% of the misscount interval triggers node fencing. From the cluster alert log file:
[cssd(2864)]CRS-1612:Network communication with node rac1 (1) missing for 50% of timeout interval. Removal of this node from cluster in 14.920 seconds
[cssd(2864)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.900 seconds
[cssd(2864)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity
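The relationship between misscount and the messages in the alert log can be illustrated with a small sketch. It is not how ocssd.bin is implemented; it only models the two warning thresholds visible in the log excerpt (50% and 90%) and the eviction at 100%. The function names and the returned labels are hypothetical.

```python
def heartbeat_status(missing_seconds: float, misscount: float = 30.0) -> str:
    """Classify how long network heartbeats from a peer have been missing.

    Only the thresholds shown in the alert log excerpt are modeled;
    the labels returned here are illustrative, not Oracle terminology.
    """
    fraction = missing_seconds / misscount
    if fraction >= 1.0:
        return "evict"    # misscount fully elapsed: node fencing is triggered
    if fraction >= 0.9:
        return "warn-90"  # corresponds to the CRS-1610 message
    if fraction >= 0.5:
        return "warn-50"  # corresponds to the CRS-1612 message
    return "ok"


def seconds_until_removal(missing_seconds: float, misscount: float = 30.0) -> float:
    """Remaining time before the peer is removed from the cluster."""
    return max(0.0, misscount - missing_seconds)


if __name__ == "__main__":
    for missing in (5.0, 15.08, 27.1, 30.0):
        print(missing, heartbeat_status(missing), seconds_until_removal(missing))
```

With the default misscount of 30 sec., a peer that has been silent for 15.08 sec. is in the 50% warning stage with about 14.92 sec. left, which matches the first log line above.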

[3] Transmit heartbeat messages. The interconnect is also used for other purposes, like RAC cache fusion, cluster control, etc.
[4] From 11g Release 1 the value is the same for all operating systems. In 10g on Linux the timeout was set to 60 sec.


More debugging information is written to the ocssd.bin process log file:


[CSSD][1119164736](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, rac2, is smaller than cohort of 1 nodes led by node 1, rac1, based on map type 2
[CSSD][1119164736]###################################
[CSSD][1119164736]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
[CSSD][1119164736]###################################

But how does Oracle Clusterware decide which node to kill in this particular failure scenario? How can Oracle avoid a race condition in which all cluster nodes are fenced? To make sure this cannot easily happen, an additional heartbeat mechanism is needed that lets the cluster stack coordinate the reconfiguration while the cluster interconnect is broken. Oracle uses additional heartbeat pings to the cluster voting files (see later), which keep a status of the cluster nodes. As long as every node can access the shared storage (i.e. the voting files), the cluster can coordinate the reconfiguration in a planned manner. Nodes which should leave the cluster are asked to commit suicide. Technically, a kill block is written to the node's slot in every voting file. After reading a kill block, the ocssd.bin process on that node commits suicide by resetting the node.
[CSSD][1108789568]clssnmvDiskEvict: Kill block write, file /dev/sdh flags 0x00010004, kill block unique 1292838821, stamp 1517864/1517864
[CSSD][1094216000]clssnmvDiskEvict: Kill block write, file /dev/sdj flags 0x00010004, kill block unique 1292838821, stamp 1517864/1517864
[CSSD][1104943424]clssnmvDiskEvict: Kill block write, file /dev/sdi flags 0x00010004, kill block unique 1292838821, stamp 1517864/1517864

We could go further and ask: what if the voting files are also not accessible at that particular moment (i.e. I/O calls are not possible)? This is not a problem; the cluster can handle it too (for details see the chapter about voting files). Generally, Oracle Clusterware uses two rules to choose which nodes should leave the cluster to ensure cluster integrity:
- in configurations with two nodes, the node with the lowest ID (the first node that joined the cluster) survives; the other one is asked to leave the cluster
- with more cluster nodes, the Clusterware tries to keep the largest sub-cluster running
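The two selection rules can be sketched as a single comparison. This is an illustration only, not Oracle's algorithm: node IDs are modeled as integers, each sub-cluster (cohort) as a set of node IDs, and the tie-break between equally sized cohorts of more than one node is an assumption that simply extends the two-node rule. The function names are hypothetical.

```python
from typing import Iterable, List, Set


def surviving_cohort(cohorts: Iterable[Set[int]]) -> Set[int]:
    """Pick the sub-cluster that stays up.

    Rule 1: the largest cohort wins.
    Rule 2 (tie-break, assumed for more than two nodes): the cohort that
    contains the lowest node ID wins.
    """
    candidates: List[Set[int]] = [set(c) for c in cohorts if c]
    if not candidates:
        raise ValueError("at least one non-empty cohort is required")
    # Larger cohorts sort first; among equal sizes the lowest node ID wins.
    return min(candidates, key=lambda c: (-len(c), min(c)))


def nodes_to_evict(cohorts: Iterable[Set[int]]) -> Set[int]:
    """All nodes that are not part of the surviving cohort."""
    materialized = [set(c) for c in cohorts if c]
    survivor = surviving_cohort(materialized)
    return set().union(*materialized) - survivor


if __name__ == "__main__":
    # Two-node cluster, interconnect broken: node 1 survives, node 2 leaves.
    print(nodes_to_evict([{1}, {2}]))
    # Three-node cluster split into {1} and {2, 3}: node 1 leaves.
    print(nodes_to_evict([{1}, {2, 3}]))
```

The first case corresponds to the ocssd.bin log excerpt shown earlier, in which the cohort led by node 2 (rac2) aborts in favor of the equally sized cohort led by node 1 (rac1).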

2.1 Rebootless Node Fencing


In versions before 11.2.0.2, Oracle Clusterware tried to prevent a split brain with a fast reboot (better: reset) of the server(s), without waiting for ongoing I/O operations or for synchronization of the file systems. This mechanism was changed in version 11.2.0.2 (the first 11g Release 2 patch set). After deciding which node to evict, the Clusterware:
- attempts to shut down all Oracle resources/processes on the server (especially processes generating I/O)
- stops itself on the node
- afterwards, the Oracle High Availability Service Daemon (OHASD) [5] tries to start the Cluster Ready Services (CRS) stack again; once the cluster interconnect is back online, all relevant cluster resources on that node start automatically
- kills the node if stopping the resources or the processes generating I/O is not possible (hanging in kernel mode, I/O path, etc.)

This behavior change is particularly useful for non-cluster aware applications.


[cssd(3713)]CRS-1610:Network communication with node rac1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.190 seconds
[cssd(3713)]CRS-1652:Starting clean up of CRSD resources.
[cssd(3713)]CRS-1654:Clean up of CRSD resources finished successfully.
[cssd(3713)]CRS-1655:CSSD on node rac2 detected a problem and started to shutdown.
...
[cssd(5912)]CRS-1713:CSSD daemon is started in clustered mode
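The control flow of rebootless fencing, as described in this section, can be summarized in a few lines. The sketch below is purely illustrative: the three callables are hypothetical stand-ins for actions the cluster stack performs internally, and nothing here corresponds to a real Oracle API.

```python
from typing import Callable


def fence_local_node(
    stop_resources: Callable[[], bool],
    stop_cluster_stack: Callable[[], None],
    reset_node: Callable[[], None],
) -> str:
    """Model of the 11.2.0.2 eviction flow on the node that has to leave.

    stop_resources     -- hypothetical: tries to stop all Oracle resources and
                          I/O-generating processes, returns True on success
    stop_cluster_stack -- hypothetical: stops the Clusterware stack on the node
                          (OHASD would later try to restart the CRS stack)
    reset_node         -- hypothetical: fast reset of the server, used as the
                          fallback when a clean stop is not possible
    """
    if stop_resources():
        stop_cluster_stack()
        return "rebootless"
    reset_node()
    return "reset"


if __name__ == "__main__":
    actions = []
    outcome = fence_local_node(
        stop_resources=lambda: True,
        stop_cluster_stack=lambda: actions.append("stack stopped"),
        reset_node=lambda: actions.append("node reset"),
    )
    print(outcome, actions)
```

The point of the design is that the server reset becomes the fallback rather than the first reaction, so applications that are not managed by the cluster keep running whenever the Oracle resources can be stopped cleanly.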

3. Voting Files

In addition to the interconnect heartbeats, voting files are a second very important communication mechanism used by the ocssd.bin process. They keep a status of the cluster nodes and count votes during cluster reconfigurations. To avoid a single point of failure, it is recommended to mirror the voting file and to use an odd number of copies. Every node in the cluster must always be able to access the majority of the files (floor(n/2) + 1, where n is the number of voting files, e.g. 2 out of 3), or it will be evicted from the cluster. With an odd number of files and the majority rule, it is guaranteed that any two nodes in the cluster can communicate with each other through at least one common voting file. Heartbeat pings to the voting files are performed by the ocssd.bin process.
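The majority rule is simple integer arithmetic, sketched below. The function names are hypothetical; the last function shows why two nodes that each see a majority must share at least one voting file.

```python
def required_voting_files(total: int) -> int:
    """Minimum number of voting files a node must be able to access."""
    if total < 1:
        raise ValueError("a cluster needs at least one voting file")
    return total // 2 + 1


def may_stay_in_cluster(accessible: int, total: int) -> bool:
    """True if the node still sees the majority of the voting files."""
    return accessible >= required_voting_files(total)


def guaranteed_shared_files(total: int) -> int:
    """Lower bound on the voting files two majority-holding nodes have in common.

    Two sets of size m taken from n files overlap in at least 2*m - n files.
    """
    return 2 * required_voting_files(total) - total


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, required_voting_files(n), guaranteed_shared_files(n))
    print(may_stay_in_cluster(accessible=2, total=3))
    print(may_stay_in_cluster(accessible=1, total=3))
```

With three voting files a node needs two of them; a node that can reach only one file no longer holds the majority and has to leave the cluster.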

[Figure: the ocssd.bin processes of the cluster nodes send heartbeat pings to Voting File 1, Voting File 2 and Voting File 3]

Generally:
- a node cannot join the cluster if it cannot access the majority of the files
- a node must leave the cluster if it cannot communicate with the majority of the files

[5] New component in 11g Release 2.

From 11.2.0.1 onwards, Oracle supports voting files (as well as the OCR and the ASM spfile) directly in an ASM disk group [6]. The physical location of the voting files on the ASM disks is fixed, i.e. the cluster stack does not rely on a running ASM instance to access the files. The location of the file is visible in the ASM disk header (which makes dumping the file out of ASM with dd quite easy):
oracle@rac1:~/ [+ASM1] kfed read /dev/sdf | grep -E 'vfstart|vfend'
kfdhdb.vfstart: 96 ; 0x0ec: 0x00000060   <<- begin AU offset of the voting file
kfdhdb.vfend: 128 ; 0x0f0: 0x00000080   <<- end AU offset of the voting file
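As a rough illustration of how the two header fields translate into a dd call, the following sketch builds the command line from vfstart and vfend. Several things here are assumptions and not taken from the paper: the allocation unit size of 1 MB, the interpretation of vfend as an exclusive end offset, the output path, and the function name. The sketch only prints the command; it does not run it.

```python
import shlex


def voting_file_dd_command(
    device: str,
    vfstart: int,
    vfend: int,
    au_size_mb: int = 1,                      # assumption: 1 MB allocation units
    out_path: str = "/tmp/votingfile.dmp",    # hypothetical output file
) -> str:
    """Build a dd command that copies the voting file area out of an ASM disk.

    vfstart and vfend are the AU offsets reported by kfed; vfend is treated
    as exclusive here, which is an assumption.
    """
    if vfend <= vfstart:
        raise ValueError("vfend must be greater than vfstart")
    count = vfend - vfstart
    args = [
        "dd",
        f"if={device}",
        f"of={out_path}",
        f"bs={au_size_mb}M",   # block size equals one allocation unit
        f"skip={vfstart}",     # skip the AUs in front of the voting file
        f"count={count}",      # number of AUs to copy
    ]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    print(voting_file_dd_command("/dev/sdf", vfstart=96, vfend=128))
```

For the header values shown above this yields a copy of 32 allocation units starting at AU 96. The actual AU size of the disk group should be checked before relying on such a command.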

Heartbeat pings to the voting files are associated with the following timeouts:
- long I/O timeout (disktimeout = 200 sec.): used in all scenarios with a working cluster interconnect
- short I/O timeout (misscount 30 sec. - reboot latency 3 sec. = 27 sec.): used only during cluster reconfigurations (node join/leave events) and node evictions, to avoid a split brain with a broken cluster interconnect
[CSSD][1117743424] misscount 30 reboot latency 3
[CSSD][1117743424] long I/O timeout 200 short I/O timeout 27
[CSSD][1117743424] diagnostic wait 0 active version 11.2.0.2.0

oracle@rac1:~/ [+ASM1] crsctl get css disktimeout
200
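The arithmetic behind the two voting file timeouts can be written down as a small sketch. The class and method names are hypothetical; the default values are the ones shown in the log excerpt above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CssTimeouts:
    """CSS timeout settings in seconds (defaults as shown in the log excerpt)."""

    misscount: int = 30
    reboot_latency: int = 3
    disktimeout: int = 200

    @property
    def long_io_timeout(self) -> int:
        # Applies while the cluster interconnect is working.
        return self.disktimeout

    @property
    def short_io_timeout(self) -> int:
        # Applies during reconfigurations and evictions.
        return self.misscount - self.reboot_latency

    def voting_file_timeout(self, reconfiguring: bool) -> int:
        """Timeout that applies to voting file I/O in the given situation."""
        return self.short_io_timeout if reconfiguring else self.long_io_timeout


if __name__ == "__main__":
    timeouts = CssTimeouts()
    print(timeouts.voting_file_timeout(reconfiguring=False))
    print(timeouts.voting_file_timeout(reconfiguring=True))
```

The much shorter timeout during a reconfiguration reflects that the voting files are then the only remaining coordination channel, so a node that cannot reach them must be detected within the misscount window.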

The following example demonstrates the Oracle Clusterware behavior after losing connectivity to the voting files. In this configuration, a separate ASM disk group named GRID with normal redundancy is used for the voting files, the OCR and the ASM spfile [7].
oracle@rac1:~/ [+ASM1] crsctl query css votedisk
##  STATE    File Universal Id                File Name  Disk group
--  -----    -----------------                ---------  ----------
 1. ONLINE   3b10214c6d8a4f8bbfe493304ebf5684 (/dev/sdf) [GRID]
 2. ONLINE   6bc0d20dc6344f6dbf0a08604647eeab (/dev/sdg) [GRID]
 3. ONLINE   4de13969fd4b4fecbfb007f4da233ad3 (/dev/sde) [GRID]
Located 3 voting disk(s).

In the first step we cut off the I/O path to one voting file placed on the /dev/sde device:
[root@rac1 ~]# echo 1 > /sys/block/sde/device/delete

After the loss of device connectivity at the operating system level, the status of the voting file changes to PENDOFFL (pending offline) and then to OFFLINE, and the file is finally dropped from the GRID ASM disk group. From the ASM point of view this is still fine, because two devices (failgroups) remain.

[6] In releases prior to 11g Release 2, block/raw devices or a cluster file system were used as the location for voting files.
[7] Trivadis best practice: separating these files from other files provides more flexibility.

oracle@rac1:~/ [+ASM1] crsctl query css votedisk
##  STATE    File Universal Id                File Name  Disk group
--  -----    -----------------                ---------  ----------
 1. ONLINE   3b10214c6d8a4f8bbfe493304ebf5684 (/dev/sdf) [GRID]
 2. ONLINE   6bc0d20dc6344f6dbf0a08604647eeab (/dev/sdg) [GRID]
Located 2 voting disk(s).

What happens to the node if we lose an additional device, i.e. lose the majority of the voting files? To demonstrate this, we cut off the I/O path to the second voting file, placed on the /dev/sdf device:
[root@rac1 ~]# echo 1 > /sys/block/sdf/device/delete

Losing the next device (second failgroup) in the GRID disk group leads to loss of access to the majority of the voting files as well as a forced dismount of the disk group on that node. The ocssd.bin process on this node waits for the disk timeout to expire (the 200 sec. long I/O timeout applies, because heartbeat pings via the cluster interconnect still work) and then kills the node.
oracle@rac1:~/ [+ASM1] crsctl query css votedisk
##  STATE    File Universal Id                File Name  Disk group
--  -----    -----------------                ---------  ----------
 1. PENDOFFL 3b10214c6d8a4f8bbfe493304ebf5684 (/dev/sdf) [GRID]
 2. ONLINE   6bc0d20dc6344f6dbf0a08604647eeab (/dev/sdg) [GRID]
Located 2 voting disk(s).

From the cluster log file:


[cssd(3016)]CRS-1613:No I/O has completed after 90% of the maximum interval. Voting file /dev/sdf will be considered not functional in 19750 milliseconds
[cssd(3016)]CRS-1606:The number of voting files available, 1, is less than the minimum number of voting files required, 2, resulting in CSSD termination to ensure data integrity

And the last question: what happens when the cluster loses the private network and access to the majority of the cluster voting files at nearly the same time? The answer: it depends on the timing. In some cases all nodes will be evicted by a fast reboot; in others the Clusterware will use rebootless node fencing on one or all of the nodes. In any case, the integrity of your data is guaranteed!


4. Monitoring Processes

In addition to the ocssd.bin process, which is responsible, among other things, for the network and disk heartbeats, Oracle Clusterware 11g Release 2 uses two new monitoring processes, cssdagent and cssdmonitor [8]. Both run with the highest real-time scheduler priority and are also able to fence a server.
oracle@rac2:~/ [+ASM2] ps -ef | grep cssd | grep -v grep
root     3583     1  0 11:45 ?  00:00:17 /u00/app/grid/11.2.0.2/bin/cssdmonitor
root     3595     1  0 11:45 ?  00:00:16 /u00/app/grid/11.2.0.2/bin/cssdagent
oracle   3607     1  1 11:45 ?  00:04:22 /u00/app/grid/11.2.0.2/bin/ocssd.bin

oracle@rac2:~/ [+ASM2] chrt -p 3595
pid 3595's current scheduling policy: SCHED_RR
pid 3595's current scheduling priority: 99

The cssdagent and cssdmonitor processes can reset a server in case of:
- a problem with the ocssd.bin process
- an OS scheduler problem or CPU starvation
- an OS locked up in a driver or hardware (e.g. in an I/O call)

Both processes are also associated with an undocumented timeout: if their execution stops for more than 28 sec., the node will be evicted.
[root@rac1 ~]# kill -STOP 3595; sleep 27; kill -CONT 3814
[USRTHRD][1091856704] clsnproc_needreboot: Impending reboot at 90% of limit 28160; disk timeout 27980, network timeout 28160, last heartbeat from CSSD at epoch seconds 1292839498.814, 27521 milliseconds ago based on invariant clock 4294767479; now polling at 100 ms
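The numbers in this log line can be checked with a short sketch. It only reproduces the comparison suggested by the message text (a warning at 90% of the limit, a reboot once the limit is reached); the function name, the labels and the assumption that the reboot happens exactly at the limit are illustrative and not taken from Oracle documentation.

```python
def reboot_state(
    ms_since_heartbeat: int,
    limit_ms: int = 28160,       # "limit" value from the log line
    warn_fraction: float = 0.9,  # "Impending reboot at 90% of limit"
) -> str:
    """Classify the time since the last CSSD heartbeat seen by the monitor."""
    if ms_since_heartbeat >= limit_ms:
        return "reboot"      # assumption: the node is reset once the limit is hit
    if ms_since_heartbeat >= warn_fraction * limit_ms:
        return "impending"   # matches the "Impending reboot" message
    return "ok"


if __name__ == "__main__":
    # Value from the log excerpt: last heartbeat 27521 ms ago.
    print(reboot_state(27521))
    print(reboot_state(1000))
    print(reboot_state(28160))
```

With a limit of 28160 ms the warning threshold lies at 25344 ms, so a process stopped for 27 sec. is past the warning but still about 0.6 sec. short of the reset.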

Node evictions initiated by either cssdagent or cssdmonitor will in most cases leave a small log file, useful for debugging, in the /etc/oracle/lastgasp directory. Changing the diagwait [9] value, as was common in versions prior to 11g Release 2 in order to collect more information for diagnosing node evictions, is no longer necessary.

5. IPMI Support in 11g Release 2


In addition to the traditional Clusterware fencing methods described above, Oracle introduced a remote node termination technique with version 11g Release 2. It can fence a node without the cooperation of the cluster stack on that node: the evicting CSS process resets the failed node directly through the node's IPMI device over the LAN. To use this solution, the server must be equipped with an IPMI device running firmware compatible with IPMI version 1.5 and supporting IPMI over a LAN. This method can be configured during the Grid Infrastructure installation or afterwards with crsctl.
[8] In versions prior to 11g Release 2, Oracle Clusterware used the oprocd process for this purpose, and on Linux additionally the hangcheck-timer kernel module.
[9] In versions 10.1.0.5, 10.2.0.3 (or higher) and 11.1.0.6 (or higher) it was recommended to change the default value for diagwait to 13, which results in 10 sec. = diagwait (13) - reboot latency (3).

6. Summary

Oracle Clusterware uses sophisticated and proven techniques to evict an unhealthy node from the cluster in order to guarantee cluster integrity. Nevertheless, node eviction is an exceptional case and is mostly triggered by improper system design. In a well-designed system architecture every component should be redundant, which reduces the likelihood of node evictions.

Good luck with the use of Trivadis know-how.

Robert Bialek
Trivadis GmbH
Lehrer-Wirth-Str. 4
D-81829 Munich
Internet: www.trivadis.com
Tel: +49-89 99 27 59 30
Fax: +49-89 99 27 59 59
Mail: robert.bialek@trivadis.com

