

Everything you ever wanted to know about the Cluster Health Monitor (CHM)
Markus Michalewicz
Principal Product Manager Oracle RAC & Oracle Clusterware
The preceding is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into any
contract. It is not a commitment to deliver any
material, code, or functionality, and should not be
relied upon in making purchasing decisions.
The development, release, and timing of any
features or functionality described for Oracle’s
products remains at the sole discretion of Oracle.
Agenda
• Introduction
– Cluster Health Monitor (CHM): What is it? Why use it? Where to get it?

• Installation
– Of the Tool
– Of the GUI

• CHM in Action

• Administration

• FAQ & More Information


– OTN Migration
What is the Cluster Health Monitor (CHM)?
Introduction
• The Cluster Health Monitor (CHM)
(formerly known as Instantaneous Problem Detector for Clusters or IPD/OS)
is designed to
– detect and analyze operating system (OS) and cluster resource related
degradation and failures
– in order to bring more explanatory power to many issues that occur in clusters
in which Oracle Clusterware and / or Oracle RAC are used, e.g. node evictions.
• It is Oracle Clusterware and Oracle RAC independent in the current release.
Why should you use CHM?
Because there is a Monday morning for example
• Assume the following scenario:
– Leaving the office Friday night
– Getting an email that one node in the cluster rebooted on Sunday morning
– Getting a question from your manager on Monday: why did that node reboot?

• Typical way of addressing this question:


– Gather and analyze Oracle Clusterware and operating system logs
(e.g. following MOS doc 330358.1 - CRS 10gR2/ 11gR1/ 11gR2 Diagnostic Collection Guide)
– Open a Service Request with Oracle Support

• Possible outcomes:
– Oracle Support finds the answer in one of the logs
– Oracle Support needs more node specific information to answer the question

• For the latter: This is why you need Cluster Health Monitor (CHM), for example
Why should you use CHM?
Because you want to prevent another incident
• Based on the previous scenario:
– It is determined that the reboot was caused by
an abnormally high CPU load in conjunction with extreme IO waits.
– Your manager asks you:
What caused the high CPU load? What can we do to prevent this in future?

• For the latter: CHM provides a historical view of the collected data for analysis
– >crfgui -d "00:05:00" -m 192.168.2.8
– Cluster Health Analyzer V1.10 Look for Loggerd via node 192.168.2.8
...reading 300 sec from the past
Connected to Loggerd on rac1
Note: Node rac1 is now up
Cluster 'MyCluster', 2 nodes. Ext time=2010-08-18 23:22:30
Where can I get CHM?
Free Download
• Direct download link:
– http://www.oracle.com/technetwork/database/clustering/downloads/ipd-download-homepage-087212.html

• But better go to:


• http://www.oracle.com/goto/rac
– On this page, follow this link:
Cluster Health Monitor – Download

• Reason: OTN Migration (later…)



Installation
How to Install CHM?
Use the documentation
• Overview of Cluster Health Monitor (CHM)
(http://www.oracle.com/technetwork/database/enterprise-edition/ipd-overview-130032.pdf)

Summary of installation steps:


1. Download the software
2. Unzip the downloaded file
• Do not install from a shared file system
3. Set up an OS-user for CHM
• The user must have passwordless SSH access to all nodes
• The user can be the same as the Oracle Grid Infrastructure-owner
4. Install the software
• $CHM_install_DIR/install/crfinst.pl -i {node1,node2…} -b /BDBdirectory
• Do not use a shared destination for the location of the BDBdirectory
• The software is distributed across all nodes specified under -i automatically
• Define one of the nodes as the master node
• Run “crfinst.pl -f -b /BDBdirectory” as root on all nodes to enable the tool
How to Install CHM?
Tips and tricks part 1
• Set up an OS-user for CHM
• The user must have passwordless SSH access to all nodes
• The user can be the same as the Oracle Grid Infrastructure-owner

• The “passwordless SSH” setup basically follows the configuration that


you would use for the Oracle Grid Infrastructure (Oracle Clusterware
11g Release 2) setup, which allows setting up the passwordless SSH
automatically.

• If you plan on deploying Oracle Grid


Infrastructure on this system, you
might want to do it first and then install
Cluster Health Monitor.
How to Install CHM?
Tips and tricks part 2
• Install the software
• $CHM_install_DIR/install/crfinst.pl -i {node1,node2…} -b /BDBdirectory
• Do not use a shared destination for the location of the BDBdirectory
• The software is distributed across all nodes specified under -i automatically

• For the BDBdirectory (Berkeley Database directory) some rules apply:


• Do not use a shared destination. Local storage is preferred.
• If you want to use a shared location, make sure that you have a different
folder / directory for the BDB for each node in the cluster (you can specify
this in the final “crfinst.pl -f -b /orachmbdbrac/” run as root).
• This directory must not be in the root FS
• It must be of sufficient size per node (2GB* # of nodes in the cluster):

ERROR: Enough space not available on /u01/orachmbdb.
Space available 4002196 KB, but required 4194304 KB
Please rerun with a valid storage location for BDB.
The location should be a path on a volume with at least
2GB per node space available and writable by root only.
It is recommended to not create it on root filesystem.
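
The sizing rule above (2GB per node in the cluster) can be sketched as a small pre-flight check. This is only an illustration: the function names bdb_required_kb and bdb_space_ok are hypothetical helpers and not part of CHM, and the numbers are the ones from the error message for a 2-node cluster.

```shell
#!/bin/sh
# Sketch: size check for the CHM Berkeley DB (BDB) location.
# Rule taken from the slide: 2 GB per node in the cluster.
# Function names are hypothetical helpers, not part of CHM.

# Print the required space in KB for a given number of nodes.
bdb_required_kb() {
    nodes=$1
    echo $(( nodes * 2 * 1024 * 1024 ))
}

# Succeed (exit 0) if the available KB covers the requirement.
bdb_space_ok() {
    avail_kb=$1
    nodes=$2
    [ "$avail_kb" -ge "$(bdb_required_kb "$nodes")" ]
}

# Example with the numbers from the error message (2-node cluster):
if bdb_space_ok 4002196 2; then
    echo "enough space"
else
    echo "need $(bdb_required_kb 2) KB, have 4002196 KB"
fi
```

On a real node, the available space could presumably be read with something like `df -Pk /u01/orachmbdb | awk 'NR==2 {print $4}'` and passed in as the first argument.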
After the Installation
Enable the stack
• The tool is installed under: /usr/lib/oracrf/ on Linux

• After the stack is installed, you need to enable it on each node:


[root@rac1 ~]# /etc/init.d/init.crfd
Usage: /etc/init.d/init.crfd {start|stop|restart|status|disable|enable}

[root@rac1 ~]# /etc/init.d/init.crfd enable


[root@rac1 ~]# ssh rac2
Last login: Tue Aug 17 21:21:52 2010 from rac1
[root@rac2 ~]# /etc/init.d/init.crfd enable
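
Instead of logging in to every node by hand, the enable step could be looped over the node list. A minimal sketch, assuming root can reach each node via passwordless SSH (the slides only require this for the CHM user, so this is an assumption); enable_chm_on_nodes and the CHM_SSH override are hypothetical names used here for a dry run.

```shell
#!/bin/sh
# Sketch: enable the CHM stack on several nodes in one pass.
# Assumptions (not from the slides): root can reach each node via
# passwordless SSH; CHM_SSH is a hypothetical override for dry runs.

enable_chm_on_nodes() {
    for node in "$@"; do
        # Same init script call as in the transcript, run remotely.
        ${CHM_SSH:-ssh} "root@$node" /etc/init.d/init.crfd enable
    done
}

# Dry run: print what would be executed instead of connecting.
CHM_SSH=echo enable_chm_on_nodes rac1 rac2
```

Dropping the CHM_SSH=echo prefix would run the commands for real.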

• What you will find:

[root@rac2 ~]# ps -ef |grep oracrf


root 28025 27949 0 21:26 ? 00:00:00 /bin/sh /usr/lib/oracrf/bin/crfcheck
root 28028 27949 0 21:26 ? 00:00:00 /usr/lib/oracrf/bin/osysmond
oracle 28089 1 0 21:26 ? 00:00:00 /usr/lib/oracrf/bin/oproxyd
root 28127 1 0 21:26 ? 00:00:00 /usr/lib/oracrf/bin/ologgerd -m rac1 -r -d /u01/orachmbdb/
After the Installation
What was installed on a 2-node cluster?
• 3 daemons are installed
– osysmond is the monitoring and OS metric collection daemon on every node
– ologgerd follows a master / standby paradigm if more than 1 node in the cluster;
the master manages the OS metric database in BDB
and interacts with the standby to manage a replica of the master metrics
– oproxyd is a proxy on all nodes which handles connection to the public interface
(default port: 61027, not configurable in current release / as of 11.2.0.1)

(Figure: a 2-node cluster; each node runs osysmond, ologgerd, and oproxyd.)
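
A quick way to confirm that all three daemons are present on a node is to scan a process listing for them. The following is a sketch only: check_chm_daemons is a hypothetical helper, and the /oracrf/bin/ path fragment assumes the Linux install location mentioned earlier.

```shell
#!/bin/sh
# Sketch: report which of the three CHM daemons appear in a process listing.
# check_chm_daemons is a hypothetical helper; it reads "ps -ef" style text
# on stdin so it can be tried without a running cluster.

check_chm_daemons() {
    listing=$(cat)
    missing=0
    for daemon in osysmond ologgerd oproxyd; do
        if printf '%s\n' "$listing" | grep -q "/oracrf/bin/$daemon"; then
            echo "$daemon: running"
        else
            echo "$daemon: MISSING"
            missing=1
        fi
    done
    return "$missing"
}

# Example using lines shaped like the ps output shown earlier.
check_chm_daemons <<'EOF'
root 28028 27949 0 21:26 ? 00:00:00 /usr/lib/oracrf/bin/osysmond
oracle 28089 1 0 21:26 ? 00:00:00 /usr/lib/oracrf/bin/oproxyd
root 28127 1 0 21:26 ? 00:00:00 /usr/lib/oracrf/bin/ologgerd -m rac1 -r -d /u01/orachmbdb/
EOF
```

On a live node the input would come from `ps -ef | check_chm_daemons`; the function returns a non-zero status if any daemon is missing.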
How to install the GUI?
Use the documentation + tips & tricks
• The GUI needs to be installed separately.
• It is recommended to install the GUI on a separate (client) machine
– If necessary, the GUI can also be installed on one or more nodes of the cluster
• If your client is a Windows client, download the Windows version of the tool
• Unzip and install the GUI using:

Usage: crfinst.pl -a [<nodelist>]
                  -c [<nodelist>]
                  -d
                  -f [-b <bdb loc>]
                  -g <ui install dir>
                  -h
                  -i <nodelist> -b <bdb loc> [-m <master>]
                  -N ClusterName.

• Note that the installation is performed using a perl script


– An Oracle client installation typically includes a perl version you can use

Cluster Health Monitor


in Action
Cluster Health Monitor in Action
Get started or “Overview”
D:\chmgui\bin>crfgui -m 192.168.2.8
Cluster Health Analyzer V1.10
Look for Loggerd via node 192.168.2.8
...Connected to Loggerd on rac1
Note: Node rac1 is now up
Cluster 'MyCluster', 2 nodes. Ext time=2010-08-19 01:01:25
Making Window: IPD Cluster Monitor V1.10 on mmichale-lap, Logger V1.04.20091223,
Cluster "MyCluster" (View 0), Refresh rate: 1 sec
Cluster Health Monitor in Action
Overview in detail
Cluster Health Monitor in Action
Details as required
Cluster Health Monitor in Action
Alerts based on pre-defined thresholds
Cluster Health Monitor in Action
Alerts and detailed views on the situation
Cluster Health Monitor in Action
Information on processes, CPU usage, network, IO, etc.
Cluster Health Monitor in Action
Even more details for IO and network
Back to: Why should you use CHM?
Because of the Monday morning…
• Assume it’s Monday and you are:
– getting a question from your manager what caused the high CPU load on…
• Current view (assume Monday morning) – no issues:

• Let’s “go back in time” (crfgui -d "00:35:00" -m 192.168.2.8):



Administration
Administration part 1
The main administration tool for CHM: oclumon

[oracle@rac1 ~]$ oclumon -h

For help from command line : oclumon <verb> -h


For help in interactive mode : <verb> -h
Currently supported verbs are :
showtrail, showobjects, dumpnodeview, manage, version, debug, quit and help

[oracle@rac1 ~]$ oclumon version

Instantaneous Problem Detection - OS Tool, Version 1.04.20091223 - Production


Copyright 2009 Oracle. All rights reserved.
Administration part 2
How long can I “go back in time”?
• Reviewing historical data is limited by the size of the Berkeley DB
– By default the database retains the node views from all the nodes
for the last 24 hours in a circular manner.
– This limit can be increased to 72 hours by using the following oclumon command:

'oclumon manage -bdb resize 259200'.

– ‘resize’ is set in seconds


– In the current release (as of 11.2.0.1) you cannot query the current retention time
– You can, however, set it to the time that you think is appropriate / reasonable
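
Since ‘resize’ takes seconds, a small conversion avoids arithmetic slips. A minimal sketch; hours_to_seconds is a hypothetical helper and not an oclumon feature, and the command is only printed here, not executed.

```shell
#!/bin/sh
# Sketch: convert a retention period in hours into the seconds value
# expected by "oclumon manage -bdb resize".
# hours_to_seconds is a hypothetical helper, not an oclumon feature.

hours_to_seconds() {
    echo $(( $1 * 60 * 60 ))
}

# 72 hours, as in the example; print the command rather than run it.
echo "oclumon manage -bdb resize $(hours_to_seconds 72)"
```

This prints the same command as quoted on the slide (259200 seconds for 72 hours).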

• Whenever time is specified in the format “HH:MM:SS”, it refers to the
amount of time that you want to go back (in hours, minutes, seconds).
• This command: crfgui -d "00:35:00" -m 192.168.2.8
– Views the data from 35 minutes before “now”.
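
If the look-back period is known in seconds or minutes, it can be converted into the “HH:MM:SS” form with a few lines of shell. This is a sketch; to_hhmmss is a hypothetical helper name, and the crfgui command is only printed.

```shell
#!/bin/sh
# Sketch: turn a number of seconds into the "HH:MM:SS" form used by
# crfgui -d. to_hhmmss is a hypothetical helper.

to_hhmmss() {
    total=$1
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# 35 minutes back, as in the crfgui example.
echo "crfgui -d \"$(to_hhmmss $(( 35 * 60 )))\" -m 192.168.2.8"
```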
Administration part 3
Get me information on the command line
> oclumon dumpnodeview -v -n rac1 -last "00:00:03"   (= 3 seconds)

----------------------------------------
Node: rac1 Clock: '08-19-10 03.53.53 UTC' SerialNo:63193
----------------------------------------

SYSTEM:
#cpus: 2 cpu: 4.5 cpuq: 1 physmemfree: 13896 mcache: 959952 swapfree: 1900208 ior: 0 iow: 297 ios: 17 netr: 57.9 netw: 43.56 procs: 187 rtprocs: 11 #fds: 2658 #sysfdlimit: 6815744 #disks: 7 #nics: 4 nicErrors: 0

TOP CONSUMERS:
topcpu: 'osysmond(13446) 0.66' topprivmem: 'ologgerd(13532) 102260' topshm: 'ologgerd(13532) 46680' topfd: 'crsd.bin(10754) 102' topthread: 'crsd.bin(10754) 58'

PROCESSES:

name: 'osysmond' pid: 13446 #procfdlimit: 1024 cpuusage: 0.66 memusage: 78912 shm: 41196 #fd: 22 #threads: 9 priority: 139
name: 'orarootagent.bi' pid: 10890 #procfdlimit: 65536 cpuusage: 0.66 memusage: 6420 shm: 10032 #fd: 7 #threads: 34 priority: 19
name: 'ologgerd' pid: 13532 #procfdlimit: 1024 cpuusage: 0.0 memusage: 102260 shm: 46680 #fd: 19 #threads: 9 priority: 139

DEVICES:
sdf ior: 0.0 iow: 0.0 ios: 0 qlen: 0 wait: 0 type: SYS
sdf1 ior: 0.0 iow: 0.0 ios: 0 qlen: - wait: - type: SYS
sde ior: 0.0 iow: 0.0 ios: 0 qlen: 0 wait: 0 type: SYS
sde1 ior: 0.0 iow: 0.0 ios: 0 qlen: - wait: - type: SYS
sdd ior: 0.0 iow: 0.0 ios: 0 qlen: 0 wait: 0 type: SYS

NICS:
lo netrr: 21.3 netwr: 21.3 neteff: 42.7 nicerrors: 0 pktsin: 7 pktsout: 7 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 7 innonunicast: 0 type: PUBLIC
eth0 netrr: 25.65 netwr: 15.94 neteff: 41.60 nicerrors: 0 pktsin: 13 pktsout: 13 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 13 innonunicast: 0 type: PRIVATE latency: <1
eth1 netrr: 10.27 netwr: 6.58 neteff: 16.85 nicerrors: 0 pktsin: 30 pktsout: 22 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 30 innonunicast: 0 type: PRIVATE latency: <1
eth2 netrr: 0.12 netwr: 0.0 neteff: 0.12 nicerrors: 0 pktsin: 0 pktsout: 0 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 0 innonunicast: 0 type: PUBLIC latency: <1

PROTOCOL ERRORS:
IPHdrErr: 0 IPAddrErr: 0 IPUnkProto: 0 IPReasFail: 0 IPFragFail: 0 TCPFailedConn: 50 TCPEstRst: 13 TCPRetraSeg: 69 UDPUnkPort: 41 UDPRcvErr: 0

End of data
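
Because the node view is printed as whitespace-separated "key: value" pairs, single metrics can be extracted with standard tools. A minimal sketch, assuming that layout holds; get_metric is a hypothetical helper and returns the first match only (keys such as ior: also appear in the DEVICES section).

```shell
#!/bin/sh
# Sketch: pull a single "key: value" metric out of dumpnodeview text.
# get_metric is a hypothetical helper; it assumes the whitespace-separated
# "key: value" layout seen in the SYSTEM section.

get_metric() {
    # $1 = metric name without the colon, input on stdin
    awk -v key="$1:" '{
        for (i = 1; i < NF; i++)
            if ($i == key) { print $(i + 1); exit }
    }'
}

# Example with a shortened SYSTEM line from the output.
echo "#cpus: 2 cpu: 4.5 cpuq: 1 physmemfree: 13896 swapfree: 1900208" |
    get_metric cpu
```

On a cluster node, the input would come from a call such as `oclumon dumpnodeview -n rac1 -last "00:00:03" | get_metric cpu`.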
Administration part 4
Time is crucial – “the clock”
> oclumon dumpnodeview -n rac1 -s "2010-08-19 02.00.01" -e "2010-08-19 02.00.03"

----------------------------------------
Node: rac1 Clock: '08-19-10 02.00.01 UTC' SerialNo:58695
----------------------------------------

SYSTEM:
#cpus: 2 cpu: 4.20 cpuq: 4 physmemfree: 17728 mcache: 953248 swapfree: 1900208 ior: 0 iow: 103
ios: 7 netr: 46.36 netw: 39.29 procs: 187 rtprocs: 11 #fds: 2658 #sysfdlimit: 6815744
#disks: 7 #nics: 4 nicErrors: 0

TOP CONSUMERS:
topcpu: 'osysmond(13446) 1.31' topprivmem: 'ologgerd(13532) 102260' topshm: 'ologgerd(13532) 46680' topfd: 'crsd.bin(10754) 102' topthread: 'crsd.bin(10754) 58'

End of data

Alternative:
> oclumon dumpnodeview -allnodes -s "2010-08-19 02.00.01" -e "2010-08-19 02.00.03"

Note: The "Clock:" in the oclumon output is printed in the
time zone in which the master daemon is running.
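
The -s and -e values use the form "YYYY-MM-DD HH.MM.SS". One way to produce them for a given time zone is sketched below. Assumptions: GNU date is available (the -d "@seconds" syntax is not portable to every Unix), the master daemon runs in UTC as in the sample output, and chm_timestamp is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: format epoch seconds as "YYYY-MM-DD HH.MM.SS" for the -s / -e
# options of "oclumon dumpnodeview".
# Assumes GNU date; chm_timestamp is a hypothetical helper.

chm_timestamp() {
    # $1 = seconds since the epoch, $2 = time zone name (default UTC)
    TZ="${2:-UTC}" date -d "@$1" '+%Y-%m-%d %H.%M.%S'
}

# Rebuild the 3-second window from the example (times in UTC).
start=$(chm_timestamp 1282183201)
end=$(chm_timestamp 1282183203)
echo "oclumon dumpnodeview -n rac1 -s \"$start\" -e \"$end\""
```

If the master daemon runs in a different time zone, that zone name would be passed as the second argument so the window matches the "Clock:" values.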
Administration part 5
Sampling data and refresh rate
• Two independent rates to distinguish:
1. The sampling rate of the tool
2. The refresh rate of the GUI

• The sampling rate of the tool depends on the currently active processes
and the devices on the system. With up to a total of 1000 active processes and
disks on an ideal system, the sampling interval is approximately 1 second.

• The refresh rate of the GUI is 1 second by default, but a longer refresh
interval can be specified using the -r parameter followed by the time in seconds.
– Example: crfgui -r 5 -m 192.168.2.8

Frequently Asked
Questions (FAQ)
The most common FAQs…
…Are answered in the tool readme
• Direct download link:
– http://www.oracle.com/technetwork/database/clustering/downloads/ipd-download-homepage-087212.html

• But better go to:


• http://www.oracle.com/goto/rac
– On this page, follow this link:
Cluster Health Monitor – Download

• and then “Readme”


FAQ #1
Is CHM a CVU (Cluster Verification Utility) replacement?

• NO
• CVU is a separate tool with
a completely different purpose.

• CVU neither gathers nor
provides the same data that
CHM provides.

• For more information on CVU
go to:
– http://www.oracle.com/goto/rac
– On this page, follow this link:
Cluster Verification Utility - Download
FAQ #2
Can CHM be used as an OS Watcher replacement?

• YES

• OS Watcher (OSW) is a collection of UNIX shell scripts(*) intended to


– collect and archive operating system and network metrics
– to aid support in diagnosing performance issues.

• OSW is provided by and made available through Oracle Support.


– For more information see MOS doc ID 301137.1 - OS Watcher User Guide

• Note: In some specific environments, OS Watcher may provide additional
information (e.g. Version 3.0 of OS Watcher adds additional collections
for Exadata, as per the MOS note mentioned).

(*) on Unix – there is also a Windows version of OS Watcher


FAQ #3
Is CHM the standard tool to be used?

• YES

• Oracle RAC Development recommends using CHM whenever possible:


– When using Oracle Clusterware, Oracle Grid Infrastructure, or Oracle RAC
– The current release is available on Linux and Windows – both 32 and 64bit.

• CHM will be the standard tool moving forward


• Therefore, more OSs will be supported in future

More Information
Future Development of CHM
What you will find in Oracle Grid Infrastructure 11.2.0.2

• Cluster Health Monitor is planned to be integrated with
Oracle Grid Infrastructure starting with 11.2.0.2 as follows:
– The data gathering part of the tool will be part of the standard installation
– CHM will therefore be installed into the Oracle Grid Infrastructure home
– The Berkeley DB will be installed in the Oracle Grid Infrastructure home (default)
– The GUI remains as a separately downloadable item
– Changes in some parts of the architecture are possible, but the principles remain
– The tool will provide more configuration options, for example on the command line
– The tool will be enabled by default with a default retention time (adjustable)

• Going forward, all operating systems supported for Oracle Grid Infrastructure
will be supported for Cluster Health Monitor.
– More Operating Systems are planned to be supported for CHM as 11.2.0.2
becomes available on these Operating Systems (last planned for 11.2.0.3)
More Information

• http://www.oracle.com/goto/rac
– Download link: Cluster Health Monitor - Download

• http://www.oracle.com/goto/clusterware
– Technical White Paper
Oracle Clusterware 11g Release 2 Technical Overview

• For OS Watcher
– My Oracle Support doc ID 301137.1 - OS Watcher User Guide
OTN Migration
A migration with some impact
• Note that Oracle Technology Network (also known as OTN) was migrated
– URLs containing http://otn.oracle.com/ are moved
– Individual items (e.g. papers) are migrated to a new Content Management System
– Direct links using the old URL to those items may therefore not work anymore

• Some links to main pages should be redirected to some new pages – e.g.:
– http://otn.oracle.com/rac (might go away over time) ->
http://www.oracle.com/technetwork/database/clustering/overview/index.html

• New short links available:


– www.oracle.com/goto/rac
– www.oracle.com/goto/clusterware
– www.oracle.com/goto/asm

• Items are linked on the main pages to the new URLs


• Tip: follow the links on the main pages until migration is complete
