Introduction
This is a short paper to give you some feel for how to deal with the
Solaris operating system when you are faced with a fault, and need to
restore the system to healthy operation quickly without resorting to
pulling your CV from the bottom drawer and calling the local
employment agencies.
I will break the paper up into four main areas. The first is failed
system analysis - the machine is down, what can I do? The second is
system recovery - how do I get the machine back up and running?
Thirdly, a brief flirt with some of the failure analysis resources,
including tools to look at what has happened, and what is happening
now. The fourth section covers preventative maintenance - things to do
to either avoid crashes or survive them with minimal grief.
One final note before I launch into this. This paper barely
scratches the surface of the available debugging tools and utilities
within the Solaris operating system. I have tried to cover enough
detail to help you navigate through the pitfalls of a severe software
failure, throw in a few pointers to the odd piece of useful
information that is not generally known, and give a few insights that
may help you realise some of the trade-offs involved when building a
machine.
However, things are not always so obvious. The machine could be hung,
the network down, or some services broken - or the machine could even
be fine, and your users' expectations wrong, so that they perceive the
machine as down.
If you can ping the machine, can you telnet or rlogin into the
machine? If you can, the machine is probably generally healthy, and
it is time to find out what the user observed, and what is causing
this.
There are several possibilities as to why you may not be able to login
to the machine via the network. If the machine is in single user mode,
you usually get a message telling you that you cannot login at this
time. Similarly, running out of pseudo terminals will normally give a
message indicating that the machine cannot create another device.
Next, check the console. Make sure it has power. If it is off, be
aware that turning it on can send a break and halt the machine. If
the machine has a key and you can put the machine into lock mode, do
so before turning on the power to the terminal.
Read the messages. Carefully. They are trying to tell you something.
If there are error messages, note them down. They may not survive a
reboot. They may be the only clue to the failure, and losing them
could well mean the difference between "Please do this to fix the
fault" and "Please phone us next time the problem reoccurs."
If the machine does not respond, you are rapidly running out of
options. The next action might be a surprise - go and observe the
machine. Listen to it - is the disk hammering away? Are the lights
flashing on the machine? Are the disk lights flashing? Are the
network lights flashing? Is the machine exhibiting any signs of
life?
There are a number of network services that can hang the machine.
Make sure the
network connection is firmly plugged in. Make sure the hub has power.
See if the link light is on at the hub. If the link light is on, is
the active light on, flashing, or doing nothing? Possibly try a
different port in the hub. Unplug the machine from the hub and see if
the console gets any messages. Doing this investigation may give vital
clues as to what the failure is. These can be vital if you find that
rebooting the machine does not fix the failure, and are searching for
some external agent impacting the machine.
If this seems like a lot of things to look at and relatively few
direct fixes, I apologise, but the operating system is over 5.5
million lines of code at last count, and given this size and the
accompanying flexibility of the system, the range of failures that
can impact the machine is simply immeasurable. This is the same for
any Unix, and for all modern operating systems. Quick sales spiel - the sheer
range of possible failures is a good reason to have a maintenance
contract so that you have people specializing in this stuff backing you
up when you really need them.
We want to get to the prom prompt to tell the machine to dump the
memory image into swap space. Hopefully you have adequate swap space
to cope with the dump, or we may simply not be able to capture all of
the dump, and the dump will be worthless. Even then, you will need to
have savecore turned on, and your swap space must not be striped, to
avoid going into single user mode and manually capturing the savecore.
More on this later.
First, if you are on a graphics head, hit the Stop and A keys
simultaneously. If you are on a terminal, generate a break. Since
this is terminal dependent, this exercise is left to the student.
Also, if you are on a machine with a key switch on the front, for
example an Enterprise server or a SPARCcenter, make sure the key
switch is not in the locked position, as this disables the console
break.
All going well, you will be rewarded with an ok prompt. This means the
machine was "soft" hung, and there is a good chance that we will be
able to get a core for subsequent analysis. If you do not have an ok
prompt, time to get a bit more brutal. Unplug the keyboard, then plug
it back in if you are on a graphic head. If you are using a terminal,
powercycle it. Please note - any messages on the screen will be lost,
so make sure you have noted them down beforehand.
If you have got an ok prompt, type sync and hit return. The machine
should then dump the memory image to swap. Usually, the only reason
this will fail is gross SCSI chain failure. You will usually know
this because either the machine will hang on the sync or you will get
major scsi errors on the write. This means the core is not likely to
be captured, but we do have more clues as to the nature of the
failure.
The immediate choice you need to make at this point is whether to bring
the machine up to multiuser first or into single user and look around,
possibly doing some repairs first. Going to single user will take
more time, but is generally safer, and if you need to recover the
savecore and have some issues as to why you may not be able to do this
on a normal boot, you may need to bring the machine up in single user.
A third option exists for really sick machines - booting off the
network or cdrom and running off an in memory, minimal image of unix.
You will need to boot to an in memory image of the operating system to
recover root or user from backup, repair the device tree or do some
serious debugging of why the machine cannot boot to single user mode.
If you are on a machine with a key switch, and you are not sure that
the hardware is behaving correctly, you can always set the switch into
diagnostic mode. Diagnostic mode will print all of the power on self
test information as the machine boots, giving you a chance to view what
is happening within the core of the machine at boot time, and hopefully
highlighting any component failures on the machine. Almost all
component failures will be caught by either the power on self test or
be seen by the operating system and logged in the system messages files.
You can also turn the diag-switch? option on in the NVRAM, but unless
you change the diag-device entry, the machine will attempt to boot off
the net. Break to the ok prompt at this point and set diag-switch? to
false to complete booting.
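As a sketch, the relevant NVRAM variables can be inspected and changed
either from the running system with the eeprom command or at the ok
prompt (exact variable names and output vary by PROM revision):

```shell
# From a running Solaris system, as root:
eeprom diag-switch?             # show the current setting
eeprom diag-switch?=true        # run full POST diagnostics on next boot

# From the ok prompt:
#   printenv diag-switch?
#   setenv diag-switch? false
#   setenv diag-device disk     # boot from disk, not net, in diag mode
```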
The NVRAM and boot PROM on SPARC hardware are reasonably
sophisticated, capable of a range of testing, system diagnosis and
system configuration. You can view the system device tree to check
for items you are expecting to see, configure device aliases, load
firmware, configure boot options, etc. This is all documented in the
Solaris System Administrator AnswerBook.
There are also a range of switches for the boot command, most notably
-v - verbose option - report the device testing at boot time
-r - reconfiguration boot - rebuild the device tree on boot
-s - boot to single user mode
-w - do not start up any windowing, normally used for in-memory boots
-a - check with the user which of the boot files to use on boot.
I will cover the cdrom or network boot of the machine first, as this
is the most drastic and low level boot, and has the most powerful
recovery options. If you have a network boot server configured for
the local machine,
boot net -sw
will boot you up to a single user shell, no password required. If you
do not have this previously configured, pull out the manual on
jumpstart and start reading. If the machine has a Sun cdrom drive,
load the operating system cdrom into the drive and enter
boot cdrom -sw
at the ok prompt.
If you are worried about the security implications of the above, either
lock the machine in a secure room (in real terms, there is never any
security for a physically accessible machine) or enable the PROM
password. Do not forget it. If you do forget it, the only way to unset it
without knowing the password is via the eeprom command when the machine
is running and you are logged in as root. If you can't get the machine
up, you are history.
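A minimal sketch of setting and clearing the PROM password from the
running system (the security-mode value shown is one of several
options; check the eeprom man page for your release):

```shell
# As root on the running machine:
eeprom security-password=       # prompts for the new PROM password
eeprom security-mode=command    # require the password for all but default boot
# To unset it later, while logged in as root:
eeprom security-mode=none
```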
Other things to note - when you boot up to a memory image, the device
tree is not built with historic information. This means that the
device tree may differ radically from that on a machine that has been
reconfigured several times, as the running machine will attempt to
retain device numbers over time. For example, you always want your
database to live on c5t2d0s3, even if you remove controller four from
the system.
Normally, the first thing to do here is to fsck the root and user file
systems, unless you are running a mirrored root or user file system and
do not intend to make any changes on it. This is the only time when
you can safely fsck the raw root and user file systems. This can fix
seriously corrupt file systems.
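A sketch of the fsck step, assuming example slice names (substitute
the actual raw devices for your root and user file systems):

```shell
fsck /dev/rdsk/c0t0d0s0     # raw root file system
fsck /dev/rdsk/c0t0d0s6     # raw usr file system (s6 is a common layout)
```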
You can then mount the root file system on the mount point /a and the
user file system as /a/usr. Do an ls to verify that they are in fact
the correct file systems. If user is wrong, at least you can look in
/a/etc/vfstab and see what it should be. At this point, we can chroot
to /a and perform maintenance on the disk based operating system root
filesytem image while it is quiescent. Note, however, that you have
not mounted /var or other file systems. If you need them, fsck them
then mount them before you do the chroot. You will want /var if you
intend to use the vi editor on the root file system. Also note that
you will need to set the TERM type when in single user mode.
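Pulling the steps above together, a sketch with example slice names
(adjust the devices and terminal type to your machine):

```shell
mount /dev/dsk/c0t0d0s0 /a            # root
mount /dev/dsk/c0t0d0s6 /a/usr        # usr
fsck /dev/rdsk/c0t0d0s3 \
  && mount /dev/dsk/c0t0d0s3 /a/var   # only if you need /var, e.g. for vi
TERM=vt100; export TERM
chroot /a /bin/sh
```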
If you have to recover the root and user file systems, this is the
point to do it. Assuming that you have a ufsdump of the file systems,
newfs the raw partition, mount it as /a and ufsrestore to it. You will
also need to run the installboot command to tell the boot loader where
to find the boot block. Otherwise, the machine will complain that the
boot file does not appear to be executable when you try to bring the
machine up.
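The restore sequence can be sketched as follows, assuming a ufsdump of
root on the first tape drive and example slice names:

```shell
newfs /dev/rdsk/c0t0d0s0
mount /dev/dsk/c0t0d0s0 /a
cd /a
ufsrestore rf /dev/rmt/0
# Reinstall the boot block so the disk is bootable again:
installboot /usr/platform/`uname -i`/lib/fs/ufs/bootblk /dev/rdsk/c0t0d0s0
```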
If you are dealing with a mirrored root or user file system, you
immediately need to edit /a/etc/vfstab and comment out the mirrored
device entries, replace both the raw and cooked mirrored file system
entries with the raw and cooked disk device entries, then edit
/a/etc/system and comment out the metadevice entries if they exist.
When the machine is back up, you will need to rebuild the devices to
get the machine fully operational. Generally, if you are in this
situation, call your support provider before attempting the recovery
and get help. The penalties for failure are high enough that you want
to verify that you are taking the correct measures. If you are not
sure what you are doing, you are in over your head.
Running on the in-memory unix image is the best time to rebuild the
operating system device tree, especially in the event of corruption of
various nodes under /dev. /dev under Solaris is merely a series of
symbolic links into /devices. The entries in /devices are complex
device pointers into the in-memory device tree.
To rebuild the device tree and the /dev structures, you need to know if
the operating system is carrying device history. I will cover how to
rebuild the device tree without history first. Firstly, chroot onto the
root file system by
chroot /a /bin/sh
cd to /dev and remove the rmt, dsk and rdsk directories. Remove any
other suspect entries in /dev, but do not remove the following entries:
/dev/null, /dev/zero, /dev/mem, /dev/kmem and /dev/ksyms. cd to
/devices and remove all the directories except pseudo. /devices/pseudo
has the pointers for the previously mentioned devices. Next, rebuild the
/devices entries, then the symlinks for the various required devices via
the following commands:
drvconfig
devlinks
disks
This will rebuild the devices required for rebooting the operating
system. Touch /reconfigure to get the operating system to complete the
device tree rebuild on the next boot.
If you are trying to rebuild the device tree on a machine with some
historic device entries, which is generally most important for disk
controller numbers, you will need to retain the first entry in
/dev/rdsk and /dev/dsk for that disk chain, e.g. c5t0d0s0, before you
rebuild the device tree. If you wish to change a disk controller
number, you use a similar mechanism. For example, to
change a controller from c4 to c5 where there is no c5 currently, cd to
/dev/rdsk, move c4t0d0s0 to c5t0d0s0, remove all the other c4 entries,
do the same in /dev/dsk and then run the command disks.
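The rename can be sketched in a scratch directory - the real work
happens in /dev/rdsk and /dev/dsk, and the device names here are
purely illustrative:

```shell
d=/tmp/rdsk.demo                 # stand-in for /dev/rdsk
mkdir -p "$d" && cd "$d"
touch c4t0d0s0 c4t0d0s7 c4t1d0s0 # fake old-controller entries
mv c4t0d0s0 c5t0d0s0             # keep the first entry, renumbered
rm -f c4t*                       # remove the remaining c4 entries
ls                               # only c5t0d0s0 remains
# On the real machine, repeat in /dev/dsk, then run: disks
```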
You can also turn on savecore at this time. I will cover this in a
later section, as this is certainly not the best time to turn on
savecore. You may also want to turn on various boot up debugging at
this time. The most useful tool I find for debugging boot failures is
a quick hack to the /etc/rc scripts. The main script of interest is
/etc/rc2. If you have a quick look through this script, you will find
a small section of code that runs the S scripts in the directory
/etc/rc2.d with the start option. I suggest that you place an echo of
the script name just before the case statement in which this is done,
so that you can see which script is hanging or failing. You can
also run the scripts with the -x option to debug more gnarly failures,
or set the -x option within a problem script. Turning the -x option
on for all scripts will overload your screen with information, and is
generally not a good idea.
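The hack can be sketched against a scratch directory; the real loop
lives in /etc/rc2 and differs slightly between releases, so everything
below is illustrative:

```shell
mkdir -p /tmp/rc2.d.demo                  # stand-in for /etc/rc2.d
printf '#!/bin/sh\necho demo service started\n' \
    > /tmp/rc2.d.demo/S10demo
chmod +x /tmp/rc2.d.demo/S10demo
for f in /tmp/rc2.d.demo/S*; do
    echo "Running $f"                     # the suggested debugging echo
    case "$f" in
        *) /bin/sh "$f" start ;;          # use /bin/sh -x for gnarly cases
    esac
done
```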
When you have finished with the root file system, exit out of chroot,
cd to / and unmount the usr, var and root file systems before rebooting
the machine.
To boot the machine to single user mode, get to the ok prompt and type
boot -sw
This should bring the machine up to the point where you are asked for
the root password to proceed with system maintenance. If the machine
does not accept the root password at this point, you may well find that
the shadow or password file has been damaged or destroyed by the system
failure. The only recovery option you have is to boot to an in-memory
version of the operating system, and repair these files. Remember the
issues involved if you are using mirroring.
Another situation that would normally cause the machine to enter single
user mode is the failure of a file system to fsck cleanly when
booting. The machine will usually have the root and user file systems
mounted read only when this occurs, and the mnttab file will usually
indicate that all file systems are mounted. When this occurs, you
first need to remount the root and user file systems as writeable, then
clean up the mnttab file before proceeding with other work. Start by
fscking the root and user file systems, or just run fsck to fsck all
the filesystems with fsck entries in the vfstab. Then remount the root
and user file systems as writeable via the following commands
mount -o remount,rw /
and
mount -o remount,rw /usr
Follow this by a umountall command, which should clean up the mnttab
file.
With the machine in single user mode, you can dump the savecore image
into a file before swap is mounted and possibly trashes the image.
Since the savecore image can be up to the size of physical memory, you
will want to dump this on a partition that has enough space. By
default, the operating system attempts to dump this in the /var/crash
directory. You will need to mount the partition you wish to dump the
image in, if it is not already mounted. Create a directory for the
image if it does not already exist, and use the command
savecore image-directory
to dump the image.
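As a sketch, assuming /var lives on its own slice and using the
machine's hostname for the crash directory:

```shell
mount /dev/dsk/c0t0d0s3 /var     # mount the target partition if needed
mkdir -p /var/crash/`uname -n`
savecore /var/crash/`uname -n`
# savecore writes a unix.N and vmcore.N pair into the directory
```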
Once you have finished whatever single user administration work you
wish to do, you can exit single user mode to immediately boot to
multiuser mode. There are two places where this process will normally
hang.
The first and most common is when there is a nfs mount in the vfstab
file, and the remote server is not running. The simple work around to
this is to add either the bg or soft mount option in the vfstab, or use
the automounter. When you are on a server, the automounter needs to be
used with great care if you are using direct maps. Personally, I am
religiously opposed to direct maps, but they are a common choice. The
problem arises in the fact that the automounter takes control of a
mount point in a direct map.
Suppose, for example, that you wish to mount /usr/local from the main
server as /usr/local on all of your client machines. If you are using
nis or nisplus maps to manage the automounter, when the server runs
the automounter, the automounter tries to take control of /usr/local,
which already has a file system mounted under it, and in use. This has
caused the mount point to hang when I have run into this problem. I
prefer to use the automounter in combination with symbolic links to get
around this, so that at work, my /usr/local is a symbolic link to
/wgtn/apps/local. This map system also allows me to do some neat
things with lan and wan links with one consistent set of maps.
Another fault that can cause a hang at boot is a problem in the device
tree. One common bug is that the ps command will hang due to
corruption in the device tree caused by the buttons and dials
package. If
you have the device /dev/bd.off, remove it, then get rid of the buttons
and dials software via
pkgrm SUNWdialh SUNWdial
unless you have a buttons and dials board on your machine. I am not aware
of any of these in New Zealand.
Now that you have the machine up and running again, it is time for a
quick look around to see what possibly caused the failure.
The first place to look are the system message files. The current
messages can be pulled from the machine via the dmesg command. If the
machine has been rebooted recently, the boot details of the machine,
including the hardware configuration at that time, can be seen. These
messages are also saved in the messages files in /var/adm, the older
files having a numeric postfix on them. Hardware problems seen by the
operating system will also be seen in these files. It is worthwhile
monitoring these files and being aware of what the messages in them
mean on your machine.
These files are updated via the syslog daemon. How the daemon works
and what it logs can be modified via the syslog.conf file. If, for
example, you find that the file is continuously being filled with
sendmail messages, you may elect to put the sendmail messages in a
separate file and reserve the system messages file for more critical
messages. As a side note, if the last message from the previous boot
is from syslogd indicating it was shut down with a signal 15, this
tells you that someone halted the machine manually. Use the last
command to give you clues as to who the culprit may be.
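For example, sendmail traffic can be split into its own file with a
fragment like this in /etc/syslog.conf (the maillog path is my choice,
and the separators must be tabs, not spaces):

```
mail.debug	/var/adm/maillog
*.err;kern.debug;daemon.notice	/var/adm/messages
```

After editing the file, send syslogd a HUP signal so that it rereads
its configuration.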
One very useful tool to consider using if you are not already is sar,
or system accounting. As the adm user, you can use cron to run the
command
/usr/lib/sa/sa1
periodically to check point various operating system parameters,
including cpu usage, system memory and swap usage, networking and disk
utilization, etc. You can then use the sar command to interrogate the
files captured to get information and trends from these files. The
trends are particularly useful for system performance tuning and system
debugging. This is often the only way to spot a slow memory leak in
the kernel.
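The stock Solaris crontab entries for this (shipped commented out in
the sys account's crontab) look like the following; uncomment or adapt
them to taste:

```
0 * * * 0-6 /usr/lib/sa/sa1
20,40 8-17 * * 1-5 /usr/lib/sa/sa1
5 18 * * 1-5 /usr/lib/sa/sa2 -s 8:00 -e 18:01 -i 1200 -A
```

sa1 writes the binary data files under /var/adm/sa, and sa2 generates
a daily report from them.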
Two things to be aware of with sar. It can chew through space in the
/var partition, and it is not particularly good at cleaning up after
itself. The second problem is that it often does not report correctly
after a reboot. The data is being saved, but sar does not report it.
I believe you can use the time options on sar to get around this.
You have captured the system crash and now want to look at it. There
are two tools that ship with the operating system to do this, crash and
adb. Neither is pretty, but crash is the more user friendly, which is
not saying much. Crash at least has a help command. Once again, grovelling
around in the kernel is for the trained expert, and not much I can say
will help here. There are some internal tools being developed to do
light weight crash dump analysis, but I am not sure when or if these
will be available outside Sun. Generally, call your service provider
and they should be able to analyse the core file and figure out why
the machine failed. This can be a very time consuming exercise, so
please be patient. Digging through hundreds of threads to find which
has the critical mutex can be very painful.
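Invocation, for what it is worth, looks roughly like this (file names
as written by savecore; flags as per the Solaris 2.x man pages):

```shell
cd /var/crash/`uname -n`
crash -d vmcore.0 -n unix.0    # interactive; type ? for a command list
adb -k unix.0 vmcore.0         # for the truly brave
```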
Another useful tool that you can turn on is streams error reporting.
To turn this on, create a directory /var/adm/streams and run the
command strerr. You can run this in the background at boot to get the
machine to continually record the messages. While primarily useful
for network problems, streams errors can highlight other system
problems and can pick up the odd intermittent fault that has otherwise
been missed. Run it, have a look at what it produces.
SunSolve
Patches
Both SunSolve mechanisms also have full access to the Sun patch
database, enabling you to install any patches indicated by SunSolve,
or the current recommended patch cluster to bring your machine up to
date with the current recommended revision of your operating system.
Patching the machine to the latest recommended revisions is required
before SunService can escalate a problem over to Engineering for
analysis and repair. If your machine does become unstable, I would
generally recommend bringing the patches up to date in case the
problem is known and a fix has already been released for it, which is
often the case.
The recommended and security patches are also available for anonymous
download to all Sun customers from sunsolve1.sun.com. However, access
to the non-recommended patches is limited to machines which are covered
under a maintenance contract for legal reasons.
To check which patches are on the machine, you can use the command
showrev -p
on solaris machines. Under SunOS machines, patches are installed
manually, so you need to keep accurate records as to what has been
installed. Patching is managed by a wrapper over the package system
at the present moment, although this is likely to change with Solaris
2.6.
rather than an upgrade when you next come to upgrade the operating
system. If showrev core dumps on you, you will normally find that a
core file has been dropped in the /var/sadm directories somewhere. It
is safe to clean this up.
When you install patches, these will install as packages with a numeric
postfix to the main packages that they update. You should be able to
see this on a pkginfo command.
Much of this section borders on what are religious issues for most
system administrators. I will cover the least contentious issue first.
Two major schools of thought come into play when discussing the
operating system file system layout, those who believe disk space is a
problem and want to put everything in the root, and those who like to
compartmentalise the operating system, putting everything into its own
partition. There have been several major debates about this within
Sun, and the general consensus is that the single partition is good
for workstations, but not appropriate for servers.
The most important file systems to the operating system are the root
and user file systems, whether as distinct partitions or combined.
These two file systems should generally be relatively static. On the
other hand is the /var file system, which is where the operating system
does a lot of scribbling and creating files. In the event of a crash,
the /var partition will almost always need an fsck to clean it up
before remounting. Since fsck can also be a liability, you do not want
to run it on the root or user file systems if possible. Therefore, it
is strongly advised that you keep /var, and any volatile file systems,
as distinct file systems from the root and user partitions. This also
makes the machine much quicker to recover in the event of major
corruption where you need to recover those two partitions.
The other separate partition you should have is swap. Although you
can run the machine without a dedicated swap partition, swapping onto
a file within a file system instead, you are not going to catch a dump
in the event of failure. This means that you may not find out why the
machine is crashing, so you may not get a fix.
Even if you have the solstice backup product, aka networker, you want
to separately backup these partitions via ufsdump, and include the /opt
partition. You need networker installed to recover networker backups,
and the operating system cd does not have this option, so you would
have to build the operating system, then install networker, then
license and configure it, before you could begin to recover the
machine. With the ufsdump backups, you recover the root, usr and opt
file systems from an in-memory version of unix, run installboot, then
boot the machine and recover the rest of the system.
This comes back to the old rule, which always applies to computers as
much as anything else, which is Keep it simple.
Conclusion
There is a lot of complexity within the Solaris operating system,
as there is in virtually all modern operating systems. Knowledge and
planning are the keys to managing this complexity, especially in times
of crisis. Take the time to learn about your machine, talk to your
users, observe your machine and do a little exploring, reading and
thinking, and you will have a less stressful work life. Of course,
how you make time for this while you are running around dealing with
the current disasters is the big question.
Thanks for your time, and I hope you get some value from this course,
and that you feel a bit more in control next time everything turns to
jello and 400 users are ringing to find out why they can't surf the
web ...