Oracle RAC instances are composed of the following background processes:
- ACMS (11g): Atomic Controlfile to Memory Service (ACMS)
- GTX0-j (11g): Global Transaction Process
- LMON: Global Enqueue Service Monitor
- LMD: Global Enqueue Service Daemon
- LMS: Global Cache Service Process
- LCK0: Instance Enqueue Process
- DIAG: Diagnosability Daemon
- RMSn: Oracle RAC Management Processes (RMSn)
- RSMN: Remote Slave Monitor
- DBRM: Database Resource Manager (from 11g R2)
- PING: Response Time Agent (from 11g R2)

ORACLE REAL APPLICATION CLUSTERS NEW FEATURES

Oracle 9i RAC
- OPS (Oracle Parallel Server) was renamed RAC
- CFS (Cluster File System) was supported
- OCFS (Oracle Cluster File System) for Linux and Windows
- watchdog timer replaced by hangcheck timer

Oracle 10g R1 RAC
- Cluster Manager replaced by CRS
- ASM introduced
- concept of Services expanded
- ocrcheck introduced
- ocrdump introduced
- AWR was instance specific

Oracle 10g R2 RAC
- CRS was renamed Clusterware
- asmcmd introduced
- CLUVFY introduced
- OCR and voting disks can be mirrored
- can use FAN/FCF with TAF for OCI and ODP.NET

Oracle 11g R1 RAC
1. Oracle 11g RAC parallel upgrades - Oracle 11g has rolling upgrade features whereby a RAC database can be upgraded without any downtime.
2. Hot patching - zero-downtime patch application.
3. Oracle RAC load balancing advisor - starting from 10g R2 we have the RAC load balancing advisor utility. The 11g RAC load balancing advisor is only available with clients that use .NET, ODBC, or the Oracle Call Interface (OCI).
4. ADDM for RAC - Oracle has incorporated RAC into the Automatic Database Diagnostic Monitor, for cross-node advisories. The script addmrpt.sql reports on a single instance only, not on all instances of the RAC; this is known as instance ADDM. But using the new package DBMS_ADDM, we can generate a report for all instances of the RAC;
this is known as database ADDM.
5. Optimized RAC cache fusion protocols - moves on from the general cache fusion protocols in 10g to deal with specific scenarios where the protocols could be further optimized.
6. Oracle 11g RAC grid provisioning - the Oracle Grid Control provisioning pack allows us to "blow out" a RAC node without the time-consuming install, using a pre-installed "footprint".

Oracle 11g R2 RAC
1. We can store everything on ASM; the OCR and voting files can also be stored on ASM.
2. ASMCA (ASM Configuration Assistant) introduced.
3. Single Client Access Name (SCAN) - eliminates the need to change the tns entry when nodes are added to or removed from the cluster. RAC instances register with the SCAN listeners as remote listeners. SCAN is a fully qualified name. Oracle recommends assigning three addresses to SCAN, which creates three SCAN listeners.
4. AWR is consolidated for the database.
5. 11g Release 2 Real Application Clusters (RAC) has server pooling technologies, so it is easier to provision and manage database grids. This update is geared toward dynamically adjusting servers as corporations manage the ebb and flow between data requirements for data warehousing and applications.
6. By default, LOAD_BALANCE is ON.
7. GSD (Global Service Daemon) and gsdctl introduced.
8. GPnP profile.
9. Oracle RAC One Node is a new option that makes it easier to consolidate databases that are not mission critical but need redundancy.
10. raconeinit - to convert a database to RAC One Node.
11. raconefix - to fix a RAC One Node database in case of failure.
12. racone2rac - to convert RAC One Node back to RAC.
13. Oracle Restart - the feature of Oracle Grid Infrastructure's High Availability Services (HAS) that manages the associated listeners, ASM instances and Oracle instances.
14. Oracle Omotion - Oracle 11g Release 2 RAC introduces a new feature called Oracle Omotion, an online migration utility. This utility relocates the instance from one node to another whenever an instance failure happens.
15. The Omotion utility
uses Database Area Network (DAN) technology to move Oracle instances; DAN helps achieve seamless database relocation without losing transactions.
16. Cluster Time Synchronization Service (CTSS) is a new feature in Oracle 11g R2 RAC which is used to synchronize time across the nodes of the cluster. CTSS can serve as a replacement for the NTP protocol.
17. Grid Naming Service (GNS) is a new service introduced in Oracle RAC 11g R2. With GNS, Oracle Clusterware (CRS) can manage Dynamic Host Configuration Protocol (DHCP) and DNS services for dynamic node registration and configuration.
18. Oracle Local Registry (OLR) - from Oracle 11g R2, the Oracle Local Registry (OLR) is a new part of Oracle Clusterware. The OLR is the node's local repository, similar to the OCR (but local), and is managed by OHASD. It contains data for the local node only and is not shared among the other nodes.
19. Multicasting is introduced in 11g R2 for private interconnect traffic.
20. I/O fencing prevents updates by failed instances by detecting failure and preventing split brain in the cluster. When a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or diskgroups. This methodology is called I/O fencing, sometimes called disk fencing or failure fencing.
21. Re-bootless node fencing (restart) - instead of fast re-booting the node,
a graceful shutdown of the stack is attempted.
22. Virtual Oracle 11g RAC cluster - Oracle 11g RAC supports virtualization.

SPLIT BRAIN CONDITION AND IO FENCING MECHANISM IN ORACLE CLUSTERWARE

Oracle Clusterware provides the mechanisms to monitor the cluster operation and detect some potential issues with the cluster. One particular scenario that needs to be prevented is called the split brain condition. A split brain condition occurs when a single cluster node has a failure that results in the reconfiguration of the cluster into multiple partitions, with each partition forming its own sub-cluster without the knowledge of the existence of the others. This would lead to collision and corruption of shared data, as each sub-cluster assumes ownership of the shared data [1]. For a cluster database like an Oracle RAC database, data corruption is a serious issue that has to be prevented at all times. Oracle Clusterware's solution to the split brain condition is to provide IO fencing: if a cluster node fails, Oracle Clusterware ensures that the failed node is fenced off from all IO operations on the shared storage. One IO fencing method is called STOMITH, which stands for Shoot The Other Machine In The Head. In this method, once a potential split brain condition is detected, Oracle Clusterware automatically picks a cluster node as a victim to reboot in order to avoid data corruption. This process is called node eviction. DBAs and system administrators need to understand how this IO fencing mechanism works and learn how to troubleshoot clusterware problems. When they experience a cluster node reboot event, DBAs and system administrators need to be able to analyze the events and identify
the root cause of the clusterware failure.

Oracle Clusterware uses two Cluster Synchronization Service (CSS) heartbeats - 1. the network heartbeat (NHB) and 2. the disk heartbeat (DHB) - and two CSS misscount values associated with these heartbeats to detect potential split brain conditions. The network heartbeat crosses the private interconnect to establish and confirm valid node membership in the cluster. The disk heartbeat is between the cluster node and the voting disk on the shared storage. Both heartbeats have their own maximal values in seconds, called the CSS misscount, within which the heartbeats must complete; otherwise a node eviction will be triggered.

The CSS misscount for the network heartbeat has the following default values, depending on the version of Oracle Clusterware and the operating system:

OS        10g R1   10g R2 & 11g
Linux     60       30
Unix      30       30
VMS       30       30
Windows   30       30

The CSS misscount for the disk heartbeat also varies with the version of Oracle Clusterware. For Oracle 10.2.1 and up, the default value is 200 seconds.

NODE EVICTION DIAGNOSIS CASE STUDY

When a node eviction occurs, Oracle Clusterware usually records error messages in various log files. These log files provide the evidence and the starting points for DBAs and system administrators to do troubleshooting. The following case study illustrates a troubleshooting process based on a node eviction which occurred in an 11-node 10g RAC production database. The symptom was that node 7 of that cluster was automatically rebooted around 11:15am. The troubleshooting started with examining the syslog file /var/log/messages, which contained the following error messages:

Jul 23 11:15:23 racdb7 logger: Oracle clsomon failed with fatal status 12.
Jul 23 11:15:23 racdb7 logger: Oracle CSSD failure 134.
Jul 23 11:15:23 racdb7 logger: Oracle CRS failure. Rebooting for cluster integrity.

Then we examined the OCSSD
log file at $CRS_HOME/log/<hostname>/cssd/ocssd.log and found the following error messages, which showed that the node 7 network heartbeat did not complete within the 60-second CSS misscount and triggered a node eviction event:

[    CSSD]2008-07-23 11:14:49.150 [119961800] >WARNING: clssnmPollingThread: node racdb7 (7) at 50% heartbeat fatal, eviction in 29.720 seconds
...
clssnmPollingThread: node racdb7 (7) at 90% heartbeat fatal, eviction in 0.550 seconds
[    CSSD]2008-07-23 11:15:19.079 [1220598112] >TRACE: clssnmDoSyncUpdate: Terminating node 7, racdb7, misstime(60200) state(3)

CRS REBOOTS TROUBLESHOOTING PROCEDURE

Besides node evictions caused by the failure of the network heartbeat or the disk heartbeat, other events may also cause a CRS node reboot. Oracle Clusterware provides several processes to monitor the operation of the clusterware. When certain conditions occur, to protect data integrity, these monitoring processes may automatically kill the clusterware or even reboot the node, and leave some critical error messages in their log files. The following lists the roles of these clusterware processes in a server reboot and where their logs are located. Three clusterware processes - OCSSD, OPROCD and OCLSOMON - can initiate a CRS reboot when they run into certain errors:
1. OCSSD (CSS daemon) monitors inter-node health, such as the interconnect and the membership of the cluster nodes. Its log file is located at $CRS_HOME/log/<host>/cssd/ocssd.log
2. OPROCD (Oracle Process Monitor Daemon), introduced in 10.2.0.4, detects hardware and driver freezes that would result in node eviction, then kills the node to prevent any IO from accessing the shared disk. Its log file is /etc/oracle/oprocd/<hostname>.oprocd.log
3. OCLSOMON monitors the CSS daemon for hangs or scheduling issues. It may reboot the node if it sees a potential hang. Its log file is $CRS_HOME/log/<host>/cssd/oclsomon/oclsmon.log
And one of the most important log files is the syslog file. On Linux,
the syslog file is /var/log/messages. The CRS reboot troubleshooting procedure starts with reviewing the various log files to identify which of the three processes above contributed to the node reboot, and then isolates the root cause of that process's failure. Figure 6 illustrates the CRS reboot troubleshooting flowchart.
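The heartbeat countdown that ocssd.log records in the case study can be illustrated with a small simulation. This is a sketch of the misscount arithmetic only, not Oracle code: the function name heartbeat_status is invented for illustration, and the 60-second misscount is the 10g R1 Linux network-heartbeat default from the table above.

```shell
#!/bin/sh
# Illustrative simulation of the CSS misscount countdown (NOT Oracle code).
# MISSCOUNT is the 10g R1 Linux network-heartbeat default from the table above.
MISSCOUNT=60

# heartbeat_status SECONDS_MISSED
# Prints a status line in the spirit of the ocssd.log messages quoted above:
# warnings begin once half the misscount window is gone, eviction at 100%.
heartbeat_status() {
    missed=$1
    pct=$(( missed * 100 / MISSCOUNT ))
    remaining=$(( MISSCOUNT - missed ))
    if [ "$missed" -ge "$MISSCOUNT" ]; then
        echo "evict: misstime(${missed}000) exceeded misscount"
    elif [ "$pct" -ge 50 ]; then
        echo "WARNING: at ${pct}% heartbeat fatal, eviction in ${remaining} seconds"
    else
        echo "ok: ${remaining} seconds until eviction"
    fi
}

heartbeat_status 10   # ok: 50 seconds until eviction
heartbeat_status 30   # WARNING: at 50% heartbeat fatal, eviction in 30 seconds
heartbeat_status 60   # evict: misstime(60000) exceeded misscount
```

The 50% and 90% warnings seen in the ocssd.log excerpt correspond to 30 and 54 missed seconds, respectively, against the 60-second window.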
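The first step of the procedure above, checking the syslog for clusterware failure markers, can be sketched as a small shell helper. scan_syslog is a hypothetical name for illustration, not an Oracle tool; the grep patterns come from the messages quoted in the case study.

```shell
#!/bin/sh
# Sketch of the first CRS-reboot troubleshooting step: scan the syslog
# for the clusterware failure markers quoted in the case study.
# scan_syslog is an invented helper name, not an Oracle tool.

# Print only lines that implicate a clusterware monitoring process; the
# marker found decides whether to read ocssd.log, oprocd.log or
# oclsomon.log next.
scan_syslog() {
    grep -E 'Oracle (CSSD|CRS) failure|Oracle clsomon failed' "$1"
}

# Demo against the case-study messages (in real use, pass /var/log/messages).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Jul 23 11:15:22 racdb7 kernel: unrelated message
Jul 23 11:15:23 racdb7 logger: Oracle clsomon failed with fatal status 12.
Jul 23 11:15:23 racdb7 logger: Oracle CSSD failure 134.
Jul 23 11:15:23 racdb7 logger: Oracle CRS failure. Rebooting for cluster integrity.
EOF
scan_syslog "$sample"
rm -f "$sample"
```

On the sample above this prints the three "Oracle ... failure/failed" lines and skips the unrelated kernel message, which is exactly the triage the flowchart begins with.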