The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
Copyright 2011, Oracle and/or its affiliates. All rights reserved.
Sandesh Rao, Bob Caldwell RAC Assurance Team Oracle Product Development
Agenda
- Architectural Overview
- Grid Infrastructure Processes
- Installation Troubleshooting
- RAC Performance
- Dynamic Resource Mastering (DRM)
- Q&A
Architectural Overview
The Grid Home contains the software for both products (Clusterware and ASM). CRS can also run standalone for ASM and/or Oracle Restart. CRS can run by itself or in combination with other vendor clusterware. The Grid Home and the RDBMS home must be installed in different locations.
The installer locks the Grid Home path by setting root permissions.
Interconnect communications
- Should be a separate physical network; VLANs are supported with restrictions.
Public Network
Either static DNS configuration:
- One public IP and one VIP per node in DNS
- One SCAN name set up in DNS
Or Grid Naming Service (GNS) set up:
- One public IP per node (recommended)
- One GNS VIP per cluster
- DHCP allocation of hostnames
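Before install it is worth confirming that the SCAN name resolves as expected. A minimal sketch that counts the answer addresses in `nslookup`-style output; the helper name and the sample output format are assumptions, not an Oracle tool:

```shell
# Count "Address:" lines in the answer section of an nslookup reply.
# Hypothetical helper; feed it only the answer section, since the
# DNS server's own address line would otherwise be counted too.
count_scan_ips() {
    grep -c '^Address: '
}

# Example: a SCAN resolving to three addresses.
sample='Name: myscan.example.com
Address: 10.0.0.11
Address: 10.0.0.12
Address: 10.0.0.13'
printf '%s\n' "$sample" | count_scan_ips   # prints 3
```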
Agents: oraagent, orarootagent, application agent, script agent, cssdagent. OHASD is a single process started from init on UNIX (ohasd). The diagram below shows all core resources.
[Diagram: OHASD resource startup levels 0, 1, 2b, 4a, 4b]
orarootagent - Agent responsible for managing all root-owned ohasd resources.
oraagent - Agent responsible for managing all Oracle-owned ohasd resources.
cssdmonitor - Monitors CSSD and node health (along with the cssdagent).
orarootagent - Agent responsible for managing all root-owned crsd resources.
oraagent - Agent responsible for managing all non-root-owned crsd resources. One is spawned for every user that has CRS resources to manage.
Agent Name      Owner
oraagent        crs user
oraagent        crs user
oraagent        crs user
cssdagent       root
cssdmonitor     root
orarootagent    root
orarootagent    root
oraagent        crs user
orarootagent    root
oraagent        crs user
orarootagent    root
Troubleshooting Scenarios
Cluster Startup Problem Triage (11.2+)

[Flowchart: startup sequence triage]
First check that the stack has started:
  ps -ef | grep init.ohasd
  ps -ef | grep ohasd.bin
If a process is not running and the cause is not obvious, check the init integration; collect diagnostics with TFA Collector and engage the sysadmin team or Oracle Support as appropriate.
Then check each daemon in the startup sequence:
  ps -ef | grep cssdagent
  ps -ef | grep ocssd.bin
  ps -ef | grep orarootagent
  ps -ef | grep ctssd.bin
  ps -ef | grep crsd.bin
  ps -ef | grep cssdmonitor
  ps -ef | grep oraagent
  ps -ef | grep ora.asm
  ps -ef | grep gpnpd.bin
  ps -ef | grep mdnsd.bin
  ps -ef | grep evmd.bin
  etc.
If a daemon is not running and the cause is not obvious, collect diagnostics with TFA Collector.
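The `ps` checks above can be wrapped in a small loop. A sketch that reports each expected daemon as RUNNING or MISSING given a `ps -ef` snapshot; the function name is an assumption, and the process names follow the sequence above:

```shell
#!/bin/sh
# Report whether each core GI process appears in a process listing.
# check_stack is a hypothetical helper; pass it the output of `ps -ef`.
check_stack() {
    snapshot=$1
    for proc in init.ohasd ohasd.bin cssdagent ocssd.bin orarootagent \
                ctssd.bin crsd.bin cssdmonitor oraagent gpnpd.bin \
                mdnsd.bin evmd.bin; do
        if printf '%s\n' "$snapshot" | grep -q "$proc"; then
            echo "RUNNING  $proc"
        else
            echo "MISSING  $proc"
        fi
    done
}

check_stack "$(ps -ef)"
```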
Troubleshooting Scenarios
Cluster Startup Problem Triage - Multicast Domain Name Service daemon (mDNSd)
- Used by Grid Plug and Play to locate profiles in the cluster, and by GNS to perform name resolution.
- The mDNS process is a background process on Linux, UNIX, and Windows.
- Uses multicast for cache updates on service advertisement arrival/departure.
- Advertises/serves on all found node interfaces.
- Log: GI_HOME/log/<node>/mdnsd/mdnsd.log
Troubleshooting Scenarios
Cluster Startup Problem Triage
Grid Plug and Play daemon (gpnpd)
- Provides access to the Grid Plug and Play profile.
- Coordinates updates to the profile from clients among the nodes of the cluster, ensuring all nodes have the most recent profile.
- Registers with mDNS to advertise profile availability.
- Log: GI_HOME/log/<node>/gpnpd/gpnpd.log
Troubleshooting Scenarios
Cluster Startup Problem Triage
<?xml version="1.0" encoding="UTF-8"?>
<gpnp:GPnP-Profile Version="1.0"
  xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile"
  xmlns:gpnp="http://www.grid-pnp.org/2005/11/gpnp-profile"
  xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile gpnp-profile.xsd"
  ProfileSequence="6" ClusterUId="b1eec1fcdd355f2bbf7910ce9cc4a228"
  ClusterName="staij-cluster" PALocation="">
  <gpnp:Network-Profile>
    <gpnp:HostNetwork id="gen" HostName="*">
      <gpnp:Network id="net1" IP="140.87.152.0" Adapter="eth0" Use="public"/>
      <gpnp:Network id="net2" IP="140.87.148.0" Adapter="eth1" Use="cluster_interconnect"/>
    </gpnp:HostNetwork>
  </gpnp:Network-Profile>
  <orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/>
  <orcl:ASM-Profile id="asm" DiscoveryString=""
    SPFile="+SYSTEM/staijcluster/asmparameterfile/registry.253.693925293"/>
  <ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
    <ds:SignedInfo>
      <ds:CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/>
      <ds:SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/>
      <ds:Reference URI="">
        <ds:Transforms>
          <ds:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#">
            <InclusiveNamespaces xmlns="http://www.w3.org/2001/10/xml-exc-c14n#" PrefixList="gpnp orcl xsi"/>
          </ds:Transform>
        </ds:Transforms>
        <ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/>
        <ds:DigestValue>x1H9LWjyNyMn6BsOykHhMvxnP8U=</ds:DigestValue>
      </ds:Reference>
    </ds:SignedInfo>
    <ds:SignatureValue>N+20jG4=</ds:SignatureValue>
  </ds:Signature>
</gpnp:GPnP-Profile>
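The network section of a profile dump (for example, the output of `gpnptool get`) can be pulled apart with plain text tools. A sketch, assuming the profile has been pretty-printed with one element per line; the helper name is made up:

```shell
# Extract "adapter use" pairs from <gpnp:Network .../> elements on stdin.
# Assumes Adapter= appears before Use= within each element, as above.
networks_from_profile() {
    grep -o '<gpnp:Network [^/]*/>' |
        sed 's/.*Adapter="\([^"]*\)".*Use="\([^"]*\)".*/\1 \2/'
}

networks_from_profile <<'EOF'
<gpnp:Network id="net1" IP="140.87.152.0" Adapter="eth0" Use="public"/>
<gpnp:Network id="net2" IP="140.87.148.0" Adapter="eth1" Use="cluster_interconnect"/>
EOF
```

This prints one "adapter use" pair per line (eth0 public, then eth1 cluster_interconnect).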
Troubleshooting Scenarios
Cluster Startup Problem Triage
cssd agent and monitor
- Same functionality in both agent and monitor.
- Functionality of several pre-11.2 daemons consolidated into both:
  - OPROCD - system hang detection
  - OMON - Oracle clusterware monitor
  - VMON - vendor clusterware monitor
- Run in real time with locked-down memory, like CSSD.
- Provides enhanced stability and diagnosability.
- Logs:
  GI_HOME/log/<node>/agent/oracssdagent_root/oracssdagent_root.log
  GI_HOME/log/<node>/agent/oracssdmonitor_root/oracssdmonitor_root.log
Troubleshooting Scenarios
Cluster Startup Problem Triage
cssd agent and monitor
oprocd
The basic objective of both OPROCD and OMON was to ensure that the perceptions other nodes had of this node were correct. If CSSD failed, other nodes assumed that the node would shortly be gone, and OPROCD would ensure that it was gone. The goal of the change is to do this more accurately and avoid false terminations.
Troubleshooting Scenarios
Node Eviction Triage
Cluster Time Synchronization Services daemon (CTSSD)
Provides time management in a cluster for Oracle.
Troubleshooting Scenarios
Node Eviction Triage
Cluster Ready Services Daemon
The CRSD daemon is primarily responsible for maintaining the availability of application resources, such as database instances. CRSD starts and stops these resources, relocates them to another node in the event of failure, and maintains the resource profiles in the OCR (Oracle Cluster Registry). In addition, CRSD oversees the caching of the OCR for faster access, and backs up the OCR.
Log file: GI_HOME/log/<node>/crsd/crsd.log
Troubleshooting Scenarios
Node Eviction Triage
CRSD oraagent
CRSD's oraagent manages:
- all database, instance, service, and diskgroup resources
- node listeners
- SCAN listeners
- ONS
If the Grid Infrastructure owner differs from the RDBMS home owner, there are two oraagents, each running as one of the installation owners. The database and service resources are managed by the oraagent of the RDBMS home owner, and the other resources by that of the Grid Infrastructure home owner.
Log file: GI_HOME/log/<node>/agent/crsd/oraagent_<user>/oraagent_<user>.log
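Since the per-user oraagent log path follows a fixed pattern, it can be derived for any node and user. A small sketch; the function name is invented, and the layout follows the path shown above:

```shell
# Build the CRSD oraagent log path for a given GI home, node, and user.
# Hypothetical helper reflecting the layout described above.
crsd_oraagent_log() {
    gi_home=$1 node=$2 user=$3
    printf '%s/log/%s/agent/crsd/oraagent_%s/oraagent_%s.log\n' \
        "$gi_home" "$node" "$user" "$user"
}

crsd_oraagent_log /u01/app/11.2.0/grid node1 oracle
# prints /u01/app/11.2.0/grid/log/node1/agent/crsd/oraagent_oracle/oraagent_oracle.log
```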
Troubleshooting Scenarios
Node Eviction Triage
CRSD orarootagent
CRSD's orarootagent manages:
- GNS and its VIP
- node VIPs
- SCAN VIPs
- network resources
Log file: GI_HOME/log/<node>/agent/crsd/orarootagent_root/orarootagent_root.log
Troubleshooting Scenarios
Node Eviction Triage - Agent return codes
The check entry point must return one of the following return codes:
- ONLINE
- UNPLANNED_OFFLINE - target=online; may be recovered or failed over
- PLANNED_OFFLINE
- UNKNOWN - cannot determine; if previously online, treated as partial, then monitored
- PARTIAL - some of a resource's services are available (e.g., instance up but not open)
- FAILED - requires a clean action
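To make the states concrete, here is a hypothetical mapping from a database-instance status (as a check action might observe it) to the return code an agent would report. The status names on the left are invented for illustration, not Oracle's:

```shell
# Map an observed instance status to an agent check return code.
# The case labels are illustrative only.
state_for_status() {
    case $1 in
        OPEN)            echo ONLINE ;;
        MOUNTED|STARTED) echo PARTIAL ;;            # up but not open
        SHUTDOWN)        echo PLANNED_OFFLINE ;;    # operator-initiated
        CRASHED)         echo UNPLANNED_OFFLINE ;;  # may be failed over
        *)               echo UNKNOWN ;;
    esac
}

state_for_status MOUNTED   # prints PARTIAL
```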
[Flowchart: Installation Problem Triage]
System provisioning per Doc IDs 810394.1, 1096952.1, and 169706.1. Check whether the prerequisites are met (CVU fixup jobs can help); if not, engage the appropriate team: DBAs, sysadmin, networking, storage, OS vendor, HW vendor, Oracle Support, etc. Determine whether this is a fresh install or an upgrade, consult the Top 5 issues notes, and proceed with the install. If the install does not succeed, collect diagnostics with TFA Collector, resolve, and retry until the prerequisites are met and the install succeeds.
References
- RAC and Oracle Clusterware Best Practices .. (Platform Independent) (Doc ID 810394.1)
- Master Note for Real Application Clusters (RAC) Oracle Clusterware .. (Doc ID 1096952.1)
- Oracle Database .. Operating Systems Installation and Configuration .. (Doc ID 169706.1)
- Troubleshoot 11gR2 Grid Infrastructure/RAC Database runInstaller Issues (Doc ID 1056322.1)
- Top 5 CRS/Grid Infrastructure Install issues (Doc ID 1367631.1)
- How to Proceed from Failed 11gR2 Grid Infrastructure (CRS) Installation (Doc ID 942166.1)
- How to Proceed When Upgrade to 11.2 Grid Infrastructure Cluster Fails (Doc ID 1364947.1)
- How To Proceed After The Failed Upgrade .. In Standalone Environments (Doc ID 1121573.1)
- Top 11gR2 Grid Infrastructure Upgrade Issues (Doc ID 1366558.1)
- TFA Collector - Tool for Enhanced Diagnostic Gathering (Doc ID 1513912.1)
Installation logs:
- installActions${TIMESTAMP}.log
- oraInstall${TIMESTAMP}.err
- oraInstall${TIMESTAMP}.out
Relink errors appear in installActions*.log when RPMs are missing on Linux, e.g.:
  Error: /usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../libpthread.so when searching for -lpthread
  /usr/bin/ld: skipping incompatible /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../libpthread.a when searching for -lpthread
  /usr/bin/ld: cannot find -lpthread
  collect2: ld returned 1 exit status
Affected versions: 10.2 on RHEL3 (x86-64), RHEL4 (x86-64), and RHEL5 (x86-64)
RPM missing: glibc-devel (64-bit)
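A quick pre-install check for this missing-RPM case can be scripted against `rpm -qa` output. A sketch with an invented helper name; the required-package list here is illustrative and would come from the install guide for your platform:

```shell
# Print each required package that does not appear in an
# `rpm -qa`-style listing. Prefix match, so "glibc-devel" is
# satisfied by e.g. "glibc-devel-2.5-123.el5".
missing_rpms() {
    installed=$1; shift
    for pkg in "$@"; do
        printf '%s\n' "$installed" | grep -q "^$pkg" || echo "$pkg"
    done
}

missing_rpms "$(rpm -qa 2>/dev/null)" glibc-devel libaio-devel sysstat
```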
Problem Avoidance
Standard builds with proper configuration baked in; pre-flight checklist.
ssh configuration:
- Follow How To Configure SSH for a RAC Installation (Doc ID 300548.1). Some customers do not follow the guidelines in the note.
- Manual checking of ssh ($ ssh hostname date) and CVU checks of ssh pass, but Oracle Universal Installer fails with messages about the ssh configuration.
- Sanity check and verify the way OUI expects:
  $ /usr/bin/ssh -o FallBackToRsh=no -o PasswordAuthentication=no -o StrictHostKeyChecking=yes -o NumberOfPasswordPrompts=0 <hostname> date
  Tue Jan 14 12:49:48 PST 2014
Installations/Upgrades: Cluster Verification Utility (CVU).
Upgrades: raccheck/orachk in pre-upgrade mode (./orachk -u -o pre).
Symptom: Failed to start Cluster Synchronization Service in clustered mode at /u01/app/crs/11.2.0.2/crs/install/crsconfig_lib.pm line 1016.
Cause: Improper multicast configuration for the cluster interconnect network.
Solution: Prior to install, follow Grid Infrastructure Startup During Patching, Install or Upgrade May Fail Due to Multicasting Requirement (Doc ID 1212703.1).
Symptom: GI install failure when running root.sh.
Cause: Known issues for which fixes already exist.
Solution: In-flight application of the most recent PSU. Proceed with the install up to the step requiring root.sh; before running the root.sh script, apply the PSU. In general you'll want the latest PSUs anyway, but this step may help avoid problems. For upgrades, run ./raccheck -u -o pre prior to beginning; it checks for prerequisite patches.
#3: How to complete a GI installation if the OUI session has died while running root.sh on the cluster nodes
Symptom: Incomplete or interrupted installation.
Cause: Unexpected reboot/failure of the node on which the OUI session was running, before confirmation that root.sh had been run on all the nodes and before the assistants were run.
Solution: As the grid user, execute "$GRID_HOME/cfgtoollogs/configToolAllCommands" on the first node (only).
Symptom: Clusterware startup problems; individual clusterware component startup problems.
Cause: Improper network configuration for the public and/or private network.
Solution: Prior to installation, follow:
- How to Validate Network and Name Resolution Setup for the Clusterware and RAC (Doc ID 1054902.1)
- Grid Infrastructure Startup During Patching, Install or Upgrade May Fail Due to Multicasting Requirement (Doc ID 1212703.1)
Symptom: Rolling upgrade failure.
Cause: Potential ASM bugs.
Solution: Prior to the rolling GI upgrade, run ./raccheck -u -o pre (checks for prerequisite patches). Install the prerequisite patches to avoid the ASM bugs, and if a complete cluster outage is allowable, optionally perform a non-rolling GI upgrade.
References
- Top 5 CRS/Grid Infrastructure Install issues (Doc ID 1367631.1) for more details
- Things to Consider Before Upgrading to 11.2.0.3/11.2.0.4 Grid Infrastructure/ASM (Doc ID 1363369.1)
What is it?
- Not something you would ordinarily need to worry about; part of the plumbing of Cache Fusion.
- Optimizations to speed access to data and reduce interconnect traffic: DRM - Dynamic Resource Management (Doc ID 390483.1).
- Lock element (LE) resources for the data blocks of objects are hashed and mastered across all nodes in the cluster.
- Access statistics are collected and compared to policies in the database (50:1 access pattern).
- Depending on workload access patterns, resource mastership may migrate to other nodes: resources are automatically remastered to the node where they are most often accessed.
- The LMON, LMD, and LMS processes are responsible for DRM.
- DRMs can be seen in LMON trace files and in gv$dynamic_remaster_stats.
- Insert/update/delete operations continue without interruption.
- Example use case that might trigger DRM: hybrid workloads (OLTP vs. batch).
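The 50:1 access pattern mentioned above can be illustrated with a toy decision function. This is a sketch of the idea only; the real policy and its thresholds are internal to the database, and the helper below is invented:

```shell
# Toy illustration of an access-ratio remastering policy:
# remaster when another instance's accesses reach `ratio` times
# the current master's. Not Oracle's actual algorithm.
should_remaster() {
    other=$1 master=$2 ratio=${3:-50}
    if [ "$other" -ge $((master * ratio)) ]; then
        echo yes
    else
        echo no
    fi
}

should_remaster 600 10   # prints yes (600 >= 50 * 10)
should_remaster 40 10    # prints no
```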
Affinity locks
- Optimization introduced in 10.2 with object affinity to manage buffers.
- Smaller and more efficient than fusion locks (LE): less memory required, fewer instructions performed.
- The master node grants affinity locks.
- Affinity locks can be expanded to fusion locks if another instance needs to access the block, or if mastership is changed.
- Affinity locks apply to data and undo segment blocks.
Example sequence:
1. A GCS lock (LE) is mastered on instance 2.
2. Instance 1 accesses buffers for this object 50x more than instance 2.
3. The LEs are dissolved and affinity locks created; mastership is stored in memory.
4. Instance 1 can now cheaply read/write these buffers.
5. If instance 2 accesses the buffers, the affinity locks are expanded back to fusion locks (LE).
High DRM-related wait events:
- gcs drm freeze in enter server mode
  - Script to Collect DRM Information (drmdiag.sql) (Doc ID 1492990.1); open an SR and submit the diagnostics collected by the script.
- With a large buffer cache (> 100 GB): gcs resource directory to be unfrozen, gcs remaster waits
  - Bug 12879027 - LMON gets stuck in DRM quiesce causing intermittent pseudo reconfiguration (Doc ID 12879027.8)
  - DRM hang causes frequent RAC Instances Reconfiguration (Doc ID 1528362.1)
- Database slowdowns that correlate with DRMs
  - Script to Collect DRM Information (drmdiag.sql) (Doc ID 1492990.1); open an SR and submit the diagnostics collected by the script.
Questions & Answers