
Veritas Product Overview

There are several VERITAS products that all work together when running a VERITAS
cluster. These are:

Veritas File System


This is the filesystem software. One could use the standard UFS filesystem with logging enabled; however, the VERITAS journaling filesystem makes filesystem checks MUCH faster after a reboot or crash.

Veritas Volume Manager (for partitioning disks)


This software is like a RAID manager with more bells and whistles: you can stripe, mirror, concatenate, and partition disk groups. One could use DiskSuite (free) instead. However, using multiple vendors can be problematic when debugging emergency issues. Additionally, Sun keeps saying that it will discontinue DiskSuite.

Veritas Cluster Server (High Availability Cluster)


This software enables one system to fail over to the other. All related software processes are simply moved from one system to the other with minimal downtime. One could use Sun's Cluster product instead; however, Veritas is less platform dependent and does not require a terminal concentrator.
You can find out what products you are running with:
You can find out what products you are running with:

/usr/bin/pkginfo -x | grep VRTS
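
The output will look something like the following (a made-up listing -- the exact package set and versions vary by installation):

VRTSvxfs       VERITAS File System
               (sparc) 3.3.3
VRTSvxvm       VERITAS Volume Manager, Binaries
               (sparc) 3.0.4
VRTSvcs        VERITAS Cluster Server
               (sparc) 1.1.2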


You can find out your licenses by running:
/usr/sbin/vxlicense -p
Veritas FileSystem Overview

VERITAS filesystem is a journaling filesystem: transactions are logged so that if a failure occurs, the transactions can be rolled forward. This prevents filesystem corruption and allows the system to run its filesystem checks (fsck) much faster.

Normally, at boot-up after a crash, a system must check the integrity of all of its filesystems. With VERITAS, it replays the journal logs and then comes up. This can save 30-60 minutes on large filesystems.

Generally, maintenance of the VERITAS file system is done with VERITAS Volume Manager.

Further documentation for Veritas FileSystem is in: /opt/VRTSfsdoc

--------------------------------------------------------------------------------
Veritas Volume Manager Overview
Veritas Volume Manager divides disks into disk groups and partitions these groups as desired. There is a nice GUI that helps a lot; you can even pull up a command window to see what the GUI is running. The newest version of the GUI is vmsa.

General Commands:

Veritas Volume Manager license info:
/usr/sbin/vxlicense -p

List the disk groups:
vxdg list

Import a disk group (see details under cluster debugging):
vxdg import oradg

Detailed info on a specific disk group:
vxprint -ht

See what Veritas is doing (if another command is hanging):
vxtask list
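
As a quick illustration of how the pieces fit together, here is a minimal sketch of creating and mounting a striped VxFS volume; the disk group, volume name, size, and mount point are all illustrative:

vxassist -g oradg make oravol 2g layout=stripe   [create a 2 GB striped volume]
mkfs -F vxfs /dev/vx/rdsk/oradg/oravol           [put a VxFS filesystem on it]
mount -F vxfs /dev/vx/dsk/oradg/oravol /ora01    [mount it]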

--------------------------------------------------------------------------------
Veritas Cluster Overview
Veritas Cluster enables one system to fail over to the other. All related software processes are simply moved from one system to the other with minimal downtime.

Veritas Cluster does NOT have both boxes up at once servicing requests; it only offers a hot standby system. This lets the service keep running (with a short transfer period) if a machine fails or system maintenance needs to be done.

What a Cluster Does at Startup

Here is what the cluster does at startup:

The node checks whether the other node is already started; if so, it stays OFFLINE.

If no other machine is running, it checks communication (gabconfig). This may need sysadmin intervention if the cluster requires both nodes to be available (/sbin/gabconfig -c -x).

Once communication between the machines is open -- or gabconfig has been started -- it sets up the network (NIC & IP address) and starts the cluster server.

It also brings up the volume manager, the filesystem, and then Oracle.

If any of the critical processes fail, the whole system is faulted. The most common reason for failing is expired licenses, so check the licenses with vxlicense -p before doing any work.
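
You can check the communication state by hand with /sbin/gabconfig -a. On a healthy two-node cluster you should see membership on both port a (GAB itself) and port h (the had daemon); the generation numbers below are made up:

# gabconfig -a
GAB Port Memberships
===============================================
Port a gen   a36e0003 membership 01
Port h gen   fd570002 membership 01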

File Locations (Logs, Conf, Executables)

Log location: /var/VRTSvcs/log

There are several logs in this directory:

hashadow-log_A: hashadow checks whether the HA cluster daemon (had) is up and restarts it if needed. This is the log of that process.

engine.log_A: primary log, usually what you will be reading for debugging.

Oracle_A: Oracle process log (related to the cluster only)

Sqlnet_A: SQL*Net process log (related to the cluster only)

IP_A: related to the shared IP

Volume_A: related to Volume Manager

Mount_A: related to mounting the actual filesystems

DiskGroup_A: related to Volume Manager/Cluster Server

NIC_A: related to the actual network device

Look at the most recent ones for debugging purposes (ls -ltr).
Conf files:

LLT conf: /etc/llttab [should NOT need to access this]

Network conf: /etc/gabtab
If it contains /sbin/gabconfig -c -n2, you will need to run /sbin/gabconfig -c -x when only one system comes up after both systems were down.

Cluster conf: /etc/VRTSvcs/conf/config/main.cf
This has the exact details of what the cluster contains.

Most executables are in: /opt/VRTSvcs/bin or /sbin
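
For reference, a typical two-node /etc/llttab looks something like this sketch (device paths and numbers are illustrative -- the install section below covers the real procedure):

set-node 0
set-cluster 0
link hme0 /dev/hme:0 - ether - -
link qfe0 /dev/qfe:0 - ether - -
link-lowpri qfe1 /dev/qfe:1 - ether - -
start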

Changing Configurations
Changing Configurations

ALWAYS be very careful when changing the cluster configurations. The only time I
needed to change the cluster configuration was when Vipul upgraded Oracle versions
and ORACLE_HOME changed directories. This is a very dangerous thing to do.

There are two ways of changing the configuration. The first method is used if the system is up (the cluster is running on at least one node, preferably on both):

haconf -makerw
run the needed commands (e.g. hasys ...)
haconf -dump -makero
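
For example, the ORACLE_HOME change mentioned above could be made online with something like the following; the resource name ora-db and the new path are hypothetical (check main.cf for the real resource name -- the VCS Oracle agent keeps ORACLE_HOME in its Home attribute):

haconf -makerw
hares -modify ora-db Home /apps/oracle/product/8.1.6 [hypothetical new ORACLE_HOME]
haconf -dump -makero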
If both systems are down:

hastop -all (shouldn't need this as cluster is down)


cp main.cf main.cf.save
vi main.cf
hacf -verify /etc/VRTSvcs/conf/config
hacf -generate /etc/VRTSvcs/conf/config
hastart

--------------------------------------------------------------------------------

Veritas Cluster Install Preparation & Planning

This is the MOST important part of the Veritas cluster installation process. If you skip a step, you pay for it later!

Machines racked
Machines jumpstarted or equivalent (the last four E3500s needed to be done by hand)
On one machine, set scsi-initiator-id to 7
Array installed; verify the disks can be seen on both machines
Additional patches for Veritas installed (download the latest Sun recommended patches) -- these are not in jumpstart!
Send off for the Veritas license -- if it is not there in time, get a temp license
Software ready for install
Hardware meets specifications
Veritas checklist filled out & ready to go
Internet/network link changed to qfe1; crossover cables on hme0 & qfe0 (REMOVE /etc/hostname.hme0 and /etc/hostname.qfe0)

Veritas Cluster Install - Day 1


Verify that the preplanning & preparation were completed properly. If not, do NOT proceed.

If preparation was completed, on BOTH machines:

/etc/init.d/volmgt start
insert the cdrom for DB Edition 2.1 [or higher]
cd /cdrom/cdrom0
./installDBED [answer default/yes to all EXCEPT say no to single/group]
[may need to change cdroms here - not sure]
installvx
remove the cdrom
vxdiskadm -- encapsulate root, specifying your 2 root drives
the process will reboot twice
create the mount points for your db

Once done, on ONE machine:

vxdiskadm - initialize the drives in the array
vxassist -g oradg -maxsize drive1 drive2 - set up the array drives
mount the disks
add them to vfstab (a sample entry follows)
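
A vfstab entry for a VxVM volume looks roughly like this one-line sketch (the disk group, volume, and mount point are illustrative; note that Day 3 takes these entries back out, since the cluster manages the mounts from then on):

/dev/vx/dsk/oradg/oravol /dev/vx/rdsk/oradg/oravol /ora01 vxfs 3 yes -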

Veritas Cluster Install - Day 2


DBAs install Oracle on ONE machine; update /etc/system on BOTH machines; add the table for VCS.

Veritas Cluster Install - Day 3


On BOTH machines:

/etc/init.d/volmgt start
insert the cluster server cd
cd /cdrom/cdrom0
pkgadd -d . (add packages 3,2,5,1,4,6; yes to everything)
eject cdrom; mount the oracle cluster agent cd
cd /cdrom/cdrom0
pkgadd -d .
eject cdrom
cd /opt/VRTSllt
cp llttab /etc
cd /etc
vi /etc/llttab -- uncomment/change the following:
  set-node 0 [on one machine set to 1]
  set-cluster 0
  link hme0 & qfe0
  link-lowpri qfe1
  start (at the bottom)
vi /etc/gabtab
  uncomment gabconfig -c -n 2
cd /etc/rc2.d
start llt and gab on both machines
/sbin/lltconfig -a list [check for all 3 interfaces]
gabconfig -a [check for membership]
add /sbin and /opt/VRTSvcs/bin to the PATH in /.profile

On ONE machine:

mkdir /etc/VRTSvcs/conf/config
cd /etc/VRTSvcs/conf
cp *.cf config
cd sample-oracle
cp main.cf ../config [may need to copy other files - check]
cd ../config
vi main.cf
  update systema and systemb
  update SystemList and AutoStartList
  add diskgroups
  IP qfe1 nic-qfe1
  add mountpoints
  update oracle info
  build the dependencies (a hedged main.cf sketch follows this list):
    listener
    oracle
    mount
    volumes
    diskgroup
    vip
    nic
hacf -verify .
manually stop the listener [lsnrctl stop]
manually stop the db [svrmgrl; connect internal; shutdown immediate]
take the mountpoints out of vfstab
hastart [start the cluster]
hagrp -switch oragrp -to systemb [test the switchover]
run the Veritas testing on both machines
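
The dependency section of main.cf ends up as a set of "requires" lines. Here is a minimal sketch using illustrative resource names (your actual names come from the sample-oracle template -- check your main.cf):

// listener depends on oracle and the virtual IP; oracle depends on the
// mounts, which depend on the volumes, which depend on the disk group
listener requires oracle
listener requires vip
oracle requires ora-mnt
ora-mnt requires ora-vol
ora-vol requires ora-dg
vip requires nic-qfe1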

Veritas Cluster Debugging Tips

Initial Notes * Veritas Cluster Overview
Cluster Not Up - HELP * Find Current Status
Starting Single System NOT Faulted * Check License
Clear Faults * Clearing Faults that will NOT Clear
Reviewing Logfiles * Calling Support
Restart Services * DB Maintenance
Admin Network Maintenance
Shutting Down Machines * Manual Startup for Emergency
Initial Notes

Veritas Cluster Server is a high availability server: processes switch between servers when a server fails. All database processes are run through this server, and as such, it needs to run smoothly. Note that the Oracle process should only actually be running on the server that is active. On monitoring tools, the procs light for whichever box is secondary should be yellow, because Oracle is not running there. Yet the cluster itself is running on both systems.

Cluster Not Up -- HELP

The normal debugging steps include: checking status, restarting if there are no faults, checking licenses, clearing faults if needed, and checking logs.

To find out the current status:

/opt/VRTSvcs/bin/hastatus -summary
This gives the general status of each machine and its processes.

/opt/VRTSvcs/bin/hares -display
This gives much more detail, down to the resource level.

If hastatus fails on both machines (it reports that the cluster is not up, or returns nothing), try to start the cluster:

/opt/VRTSvcs/bin/hastart
/opt/VRTSvcs/bin/hastatus -summary
hastatus -summary will tell you if the processes started properly. It will NOT start processes on a FAULTED system.
Starting Single System NOT Faulted

If the system is NOT FAULTED and only one system is up, the cluster probably needs
to have gabconfig manually started. Do this by running:
/sbin/gabconfig -c -x
/opt/VRTSvcs/bin/hastart

/opt/VRTSvcs/bin/hastatus -summary

If the system is faulted, check licenses and clear the faults as described next.
To check licenses:

vxlicense -p
Make sure all licenses are current - and NOT expired! If they are expired, that is your
problem. Call VERITAS to get temporary licenses.

There is a BUG with veritas licences. Veritas will not run if there are ANY expired
licenses -- even if you have the valid ones you need. To get veritas to run, you will
need to MOVE the expired licenses. [Note: you will minimally need VXFS, VxVM and
RAID licenses to NOT be expired from what I understand.]

vxlicense -p
Note the NUMBER after the license (ie: Feature name: DATABASE_EDITION [100])
cd /etc/vx/elm
mkdir old
mv lic.number old [do this for all expired licenses]
vxlicense -p [Make sure there are no expired licenses AND your good licenses are
there]
hastart

If still fails, call veritas for temp licenses. Otherwise, be certain to do the same on
your second machine.

To clear FAULTS:

hares -display
For each resource that is faulted, run:
hares -clear resource-name -sys faulted-system
(A rough loop for clearing several faulted resources at once is sketched after these commands.)

If all of these clear, run hastatus -summary and make sure they show clear. If some don't clear, you MAY be able to clear them at the group level. Only do this as a last resort:
hagrp -disableresources groupname
hagrp -flush group -sys sysname
hagrp -enableresources groupname
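
If several resources are faulted, a rough Bourne shell loop saves typing. This is only a sketch: it assumes faulted resources appear in hares -display with FAULTED in the fourth (value) column, which you should verify on your system first, and faulted-system is a placeholder:

for r in `hares -display | awk '$4 == "FAULTED" {print $1}' | sort -u`
do
    hares -clear $r -sys faulted-system
done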

To get a group to go online:


hagrp -online group -sys desired-system
If it did NOT clear, did you check licenses?

Bringing up Machines when fault will NOT clear:

System has the following EXACT status:

gedb002# hastatus -summary

-- SYSTEM STATE
-- System        State      Frozen

A  gedb001       RUNNING    0
A  gedb002       RUNNING    0

-- GROUP STATE
-- Group    System    Probed    AutoDisabled    State

B  oragrp   gedb001   Y         N               OFFLINE
B  oragrp   gedb002   Y         N               OFFLINE

gedb002# hares -display | grep ONLINE
nic-qfe3    State    gedb001    ONLINE
nic-qfe3    State    gedb002    ONLINE

gedb002# vxdg list
NAME       STATE      ID
rootdg     enabled    957265489.1025.gedb002

gedb001# vxdg list
NAME       STATE      ID
rootdg     enabled    957266358.1025.gedb001

Recovery Commands:

hastop -all
on one machine: hastart
wait a few minutes
on the other machine: hastart
Reviewing Log Files:

If you are still having trouble, look at the logs in /var/VRTSvcs/log. Look at the most recent ones for debugging purposes (ls -ltr). Here is a short description of the logs in /var/VRTSvcs/log:

hashadow-log_A: hashadow checks whether the HA cluster daemon (had) is up and restarts it if needed. This is the log of that process.

engine.log_A: primary log, usually what you will be reading for debugging.

Oracle_A: Oracle process log (related to the cluster only)

Sqlnet_A: SQL*Net process log (related to the cluster only)

IP_A: related to the shared IP

Volume_A: related to Volume Manager

Mount_A: related to mounting the actual filesystems

DiskGroup_A: related to Volume Manager/Cluster Server

NIC_A: related to the actual network device

By looking at the most recent logs, you can tell what failed last (or most recently). You can also tell what did NOT run, which may be just as much of a clue. Of course, if none of this helps, open a call with Veritas tech support.

Calling Tech Support:

If you have tried the previously described debugging methods, call Veritas tech
support: 800-634-4747. Your company needs to have a Veritas support contract.

Restarting Services:

If a system is gracefully shut down while running Oracle or other high availability services, it will NOT transfer them; it only transfers services when the system crashes or has an error.

hastart
hastatus -summary

hastatus -summary will tell you if the processes started properly. It will NOT start processes on a FAULTED system. If the system is faulted, clear the faults as described above.
Doing Maintenance on DBs:

BEFORE working on the DB:

Run hastop -all -force

AFTER working on the DBs:

You MUST bring up Oracle on the same machine. Once Oracle is up, run:

hastart on the same machine where you did the work (the first one, i.e. the system with Oracle running)
wait 3-5 minutes
then run hastart on the other system

If you need the instance to run on the other system, you can run:
hagrp -switch oragrp -to othersystem
Shutting down db machines:

If you shut down the machine that is running the Veritas cluster, it will NOT start on the other machine; it only fails over if the machine crashes. You need to manually switch the services before you shut down the machine. To switch processes:

Find out which groups to transfer:
hagrp -display

Switch over each group:
hagrp -switch group-to-move -to new-system

Then shut down the machine as desired. When rebooted, it will start the cluster daemon automatically.
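
Using the machine names from the examples above, shutting down gedb001 would be preceded by something like:

hagrp -display [confirm oragrp is ONLINE on gedb001]
hagrp -switch oragrp -to gedb002 [move the oracle group to the surviving node]

Then shut gedb001 down normally.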
Doing Maintenance on Admin Network:

If the admin network that the Veritas cluster uses is brought down, Veritas WILL fault both machines AND bring down Oracle (nicely). You will need to do the following to recover:

hastop -all
On ONE machine: hastart
wait 5 minutes
On the other machine: hastart
Manual start/stop WITHOUT veritas cluster:

THIS IS ONLY USED WHEN THERE ARE DB FAILURES

If possible, use the section on DB Maintenance. Only use this if the system fails on coming up AND you KNOW that it is due to a db configuration error. If you manually start up filesystems/oracle, manually shut them down and restart using hastart when done.

To start up:

Make sure ONLY the rootdg volume group is active on BOTH NODES. This is EXTREMELY important: if the oracle disk group is active on both nodes, corruption occurs. [i.e. oradg or xxoradg is NOT present]

vxdg list
hastatus (stop on both, as you are faulted on both machines)
hastop -all (if either was active, make sure you are truly shut down!)

Once you have confirmed that the oracle disk group is not active, on ONE machine do the following:

vxdg import oradg [this may be xxoradg, where xx is the client 2-char code]
vxvol -g oradg startall
mount -F vxfs /dev/vx/dsk/oradg/name /mountpoint [find the volumes and mount points in /etc/VRTSvcs/conf/config/main.cf]

Let the DBAs do their stuff.

To shut down:

umount /mountpoint [for each mountpoint]
vxvol -g oradg stopall
vxdg deport oradg

clear the faults; start the cluster as described above

--------------------------------------------------------------------------------

Testing Veritas Cluster


0. Check Veritas Licenses - for FileSystem, Volume Manager AND Cluster

vxlicense -p

If any licenses are not valid or are expired -- get them FIXED before continuing! All licenses should say "No expiration". If ANY license has an actual expiration date, the test failed. Permanent licenses do NOT have an expiration date. Non-essential licenses may be moved -- however, a senior admin should do this.

1. Hand check SystemList & AutoStartList

On either machine:

grep SystemList /etc/VRTSvcs/conf/config/main.cf
You should get:
SystemList = { system1, system2 }

grep AutoStartList /etc/VRTSvcs/conf/config/main.cf
You should get:
AutoStartList = { system1, system2 }

Each list should contain both machines. If not, many of the next tests will fail, and you will probably need to modify the lists with the commands that follow.

more /etc/VRTSvcs/conf/config/main.cf (see if it is reasonable; it is likely that the systems aren't fully set up)
haconf -makerw (this lets you write the conf file)
hagrp -modify oragrp SystemList system1 0 system2 1
hagrp -modify oragrp AutoStartList system1 system2
haconf -dump -makero (this makes the conf file read-only again)

2. Verify Cluster is Running

First verify that Veritas is up & running:

hastatus -summary

If this command could NOT be found, add the following to root's PATH in /.profile:
vi /.profile
add /opt/VRTSvcs/bin to your PATH variable
If /.profile does not already exist, use this one:
PATH=/usr/bin:/usr/sbin:/usr/ucb:/usr/local/bin:/opt/VRTSvcs/bin:/sbin:$PATH
export PATH
. /.profile
Re-verify that the command now runs if you changed /.profile:
hastatus -summary

Here is the expected result (your SYSTEMs/GROUPs may vary). One system should be OFFLINE and one system should be ONLINE, e.g.:

# hastatus -summary

-- SYSTEM STATE
-- System        State      Frozen

A  e4500a        RUNNING    0
A  e4500b        RUNNING    0

-- GROUP STATE
-- Group    System    Probed    AutoDisabled    State

B  oragrp   e4500a    Y         N               ONLINE
B  oragrp   e4500b    Y         N               OFFLINE

If your systems do not show the above status, try these debugging steps:

If NO systems are up, run hastart on both systems and run hastatus -summary again.

If only one system is shown, start the other system with hastart. Note: one system should ALWAYS be OFFLINE for the way we configure systems here. (If we ran Oracle Parallel Server, this could change -- but currently we run standard Oracle server.)

If both systems are up but OFFLINE, hastart did NOT correct the problem, and the oracle filesystems are not running on either system, the cluster needs to be reset. (This happens under strange network situations with GE Access.) [You ran hastart and that wasn't enough to get the full cluster to work.]
Verify that the systems have the following EXACT status (though your machine names will vary for other customers):
gedb002# hastatus -summary

-- SYSTEM STATE
-- System        State      Frozen

A  gedb001       RUNNING    0
A  gedb002       RUNNING    0

-- GROUP STATE
-- Group    System    Probed    AutoDisabled    State

B  oragrp   gedb001   Y         N               OFFLINE
B  oragrp   gedb002   Y         N               OFFLINE

gedb002# hares -display | grep ONLINE
nic-qfe3    State    gedb001    ONLINE
nic-qfe3    State    gedb002    ONLINE

gedb002# vxdg list
NAME       STATE      ID
rootdg     enabled    957265489.1025.gedb002

gedb001# vxdg list
NAME       STATE      ID
rootdg     enabled    957266358.1025.gedb001

Recovery Commands:

hastop -all
on one machine: hastart
wait a few minutes
on the other machine: hastart
hastatus -summary (make sure one is OFFLINE && one is ONLINE)

If none of these steps resolved the situation, contact Lorraine or Luke (possibly Russ
Button or Jen Redman if they made it to Veritas Cluster class) or a Veritas
Consultant.

3. Verify Services Can Switch Between Systems

Once hastatus -summary works, note the GROUP name used. Usually it will be "oragrp", but the installer can use any name, so please determine its name.

First check that the group can switch back and forth. On the system that is running (system1), switch Veritas to the other system (system2):

hagrp -switch groupname -to system2 [e.g. hagrp -switch oragrp -to e4500b]
Watch the failover with hastatus -summary. Once it has failed over, switch it back:
hagrp -switch groupname -to system1
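
hastatus with no arguments runs continuously, but if you prefer the summary view, a simple loop works for watching the switchover (the interval is arbitrary):

while true
do
    hastatus -summary
    sleep 10
done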

4. Verify the OTHER System Can Go Up & Down Smoothly For Maintenance

On the system that is OFFLINE (should be system2 at this point), reboot the computer:

ssh system2
/usr/sbin/shutdown -i6 -g0 -y

Make sure the system comes up & is running after the reboot. That is, when the reboot is finished, the second system should say it is OFFLINE in hastatus:
hastatus -summary

Once this is done, switch the group to system2 and repeat the reboot for the other system:
hagrp -switch groupname -to system2
ssh system1
/usr/sbin/shutdown -i6 -g0 -y
Verify that system1 is in the cluster once rebooted:
hastatus -summary

5. Test Actual Failover For System 2 (and pray the db is okay)

To do this, we will kill off the listener process, which should force a failover. This test SHOULD be okay for the db (that is why we chose the LISTENER), but there is a very small chance things will go wrong .. hence the "pray" part :).

On the system that is online (should be system2), kill off the ORACLE LISTENER process:

ps -ef | grep LISTENER

Output should be like:
root 1415 600 0 20:43:58 pts/0 0:00 grep LISTENER
oracle 831 1 0 20:27:06 ? 0:00 /apps/oracle/product/8.1.5/bin/tnslsnr LISTENER -inherit

kill -9 process-id (the first number in the list -- in this case 831)

Failover will take a few minutes. You will note that system2 is faulted -- and system1 is now online.

You need to CLEAR the fault before trying to fail back over:

hares -display | grep FAULT
for the resource that has failed (in this case, the LISTENER)
Clear the fault:
hares -clear resource-name -sys faulted-system [e.g. hares -clear LISTENER -sys e4500b]

6. Test Actual Failover For System 1 (and pray the db is okay)

Now we do the same thing for the other system. First verify that the other system is NOT faulted:

hastatus -summary

Now do the same thing on this system. To do this, we will kill off the listener process, which should force a failover. On the system that is online (should be system1 now), kill off the ORACLE LISTENER process:

ps -ef | grep LISTENER

Output should be like:
oracle 987 1 0 20:49:19 ? 0:00 /apps/oracle/product/8.1.5/bin/tnslsnr LISTENER -inherit
root 1330 631 0 20:58:29 pts/0 0:00 grep LISTENER

kill -9 process-id (the first number in the list -- in this case 987)

Failover will take a few minutes. You will note that system1 is faulted -- and system2 is now online.

You need to CLEAR the fault before trying to fail back over:

hares -display | grep FAULT for the resource that has failed (in this case, the LISTENER)
Clear the fault:
hares -clear resource-name -sys faulted-system [e.g. hares -clear LISTENER -sys e4500a]

Run:
hastatus -summary
to make sure everything is okay.
