Chapter 13

Troubleshooting the Operating System 

13.1 - Identifying and Locating Symptoms and Problems
13.2 - LILO Boot Errors
13.3 - Various Reasons for Package Dependency Problems
13.4 - Troubleshooting Network Problems
13.5 - Disaster Recovery
Identifying and Locating
Symptoms and Problems
Hardware Problems

• Although a few problems are
due to a combination of
factors, most can be isolated in
origin to one of these:
– Hardware, Kernel, Application
Software, Configuration, and
User Error.
• Some hardware faults leave traces
that the kernel detects and records.
• If an error does not crash the
system, evidence might be left in
the log file /var/log/messages, with
the message prefixed by the word
oops.
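A minimal sketch of searching a log for such messages. The function name find_oops_lines and the sample log line are hypothetical, and the match is a simple case-insensitive search for the word "oops", which is an assumption about how the message appears.

```python
from typing import Iterable, List


def find_oops_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention a kernel oops (case-insensitive)."""
    return [line.rstrip("\n") for line in lines if "oops" in line.lower()]


# Illustrative input only; real entries in /var/log/messages will differ.
sample_log = [
    "Jan  1 00:00:01 host kernel: Oops: 0002 [#1]\n",
    "Jan  1 00:00:02 host sshd[42]: session opened\n",
]
for hit in find_oops_lines(sample_log):
    print(hit)
```

To scan the real log, pass an open file object for /var/log/messages instead of the sample list. Reading that file usually requires root privileges.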
Kernel Problems

• Released Linux kernels are remarkably stable,
unless experimental versions are used or
individual modifications are made.
• Loadable kernel modules are considered part
of the kernel as well, at least for the time
period they are loaded.
• Sometimes these can cause difficulties, too.
• The good news with modules is that they can
be unloaded and replaced with fixed versions
while the system is still running.
Application Software

• Errors in application packages are most easily identified
because they occur only when running the application.
• This is in contrast to hardware and kernel conditions
that affect an entire system.
• Some common signs of application bugs are failure to
execute and program crashes.
• An application may consume so much system
memory that the system begins to swap heavily and
the whole system is affected.
• Other errors are caused by conditions within
the running program itself.
Configuration

• Configuration problems tend to affect whole subsystems,
such as the graphics, printing, or networking subsystems.
• If the system is rebooted and a remote file system that
was once present is not, the first place to look is in the
configuration file /etc/fstab to see if the file system is
supposed to be mounted at boot time.
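A rough sketch of that check, assuming the usual whitespace-separated fstab layout (device, mount point, file system type, options) and treating the noauto option as "not mounted at boot". The names FstabEntry, parse_fstab, and mounted_at_boot, and the sample entries, are hypothetical.

```python
from typing import List, NamedTuple


class FstabEntry(NamedTuple):
    device: str
    mount_point: str
    fs_type: str
    options: List[str]


def parse_fstab(text: str) -> List[FstabEntry]:
    """Parse fstab-style text, skipping comments, blank and malformed lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not enough fields to be a usable entry
        entries.append(
            FstabEntry(fields[0], fields[1], fields[2], fields[3].split(","))
        )
    return entries


def mounted_at_boot(entry: FstabEntry) -> bool:
    """Entries carrying the 'noauto' option are skipped at boot time."""
    return "noauto" not in entry.options


# Illustrative content only, not taken from a real system.
sample = (
    "# device        mount point  type  options    dump pass\n"
    "server:/export  /mnt/data    nfs   noauto,rw  0 0\n"
    "/dev/sda1       /            ext3  defaults   1 1\n"
)
for entry in parse_fstab(sample):
    print(entry.mount_point, entry.fs_type, mounted_at_boot(entry))
```

If the remote file system is missing from the file, or is listed with noauto, that would explain why it did not come back after the reboot.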
User Error

• It is not unforgivable to make a mistake in using a
computer program, nor is it to be ignorant of the right
way to do something. It is only unforgivable to insist
on remaining stubbornly so.

• There is more to know about the ins and outs of
operating almost any software package than
everyday users will ever care or attempt to learn.
Using System Utilities and System Status Tools

• Linux operating systems
provide various system
utilities and system status
tools.
• The setserial utility
reports and sets options
for the serial ports on the
system.
• The lpq command helps
resolve printing problems
by displaying all the jobs
that are waiting to be
printed.
Using System Utilities and System Status Tools

• The ifconfig command can
be entered at the shell to
return the current network
interface configuration of
the system.
• The route command
displays or sets the
system’s routing table,
which the system uses to
send packets to
particular IP addresses.
Unresponsive Programs and Processes

• Sometimes programs and
processes become unresponsive,
or “lock up”, for various reasons.
• Sometimes only the program or process itself
locks up; at other times it causes the
entire system to become unresponsive.
• One way to deal with an unresponsive
program is to identify and locate it, and
then kill or restart the process or program.
When to Start, Stop, 
or Restart a Process
• It is easiest to terminate a program by using the kill
command.
• Other processes need to be terminated through their
SysV startup scripts.
• When restarting a program, service, or daemon, it is
best to first consult the documentation because
different programs have to be restarted in different
ways.
• Some support a restart command, some
need to be stopped completely and then started
again, and others can simply reread their
configuration files without being stopped or
restarted.
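A hedged sketch of what the kill command does by default: it sends SIGTERM, which asks the process to exit. The helper stop_process and its grace period are hypothetical, the escalation to SIGKILL is an added assumption rather than something the slides prescribe, and the sketch assumes a POSIX system.

```python
import signal
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Send SIGTERM to a child process; fall back to SIGKILL if it ignores it."""
    proc.send_signal(signal.SIGTERM)  # same signal that plain `kill <pid>` sends
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be caught or ignored
        return proc.wait()


# Stand-in for an unresponsive program: a child that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(child)
print("child exited with", exit_code)
```

A daemon that supports rereading its configuration would be sent a different signal instead, which is why the documentation should be consulted first.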
Troubleshooting Persistent Problems

• The best way to fix programs that crash repeatedly is
to replace them with new software or with a different
kind of software that performs the same task.
• If possible, try using the software in a different
way. If a particular keystroke or command
causes the program to fail, stop using it.
• In most cases replacement software will be
available.
• If a daemon is crashing regularly, try
other methods of starting and running it.
Examining Log Files

• Some of the more important log
files on a Linux system are
/var/log/messages,
/var/log/secure, and
/var/log/syslog.
• The system’s log files can be
used to monitor system loads
such as how many pages a
web server has served.
• They can also be used to check
for security breaches such as
intrusion attempts, verify that
the system is functioning
properly, and note any errors
that might be generated by
software or programs.
Examining Log Files

• Several different types of
information are good to know and
make identifying problems using the log files
a little easier.
• Some of these are listed below:
– Monitoring System Loads
– Intrusion Attempts and Detection
– Normal System Functioning
– Missing Entries
– Error Messages 
The dmesg Command

• The dmesg command can
be used to display the
kernel ring buffer, which
holds the most recent
kernel messages.
• These messages contain
information about the
hardware installed in the
system and the drivers.
• The information in these
messages relates to
whether the drivers are
being loaded successfully
and what devices the
drivers are controlling.
Troubleshooting Problems 
Based on User Feedback
• There are several
different types of
problems that users
report.
• Some of the most
common ones are:
– Login Problems
– File Permission
Problems
– Removable Media
Problems
– E-mail Problems
– Program Errors
– Shutdown Problems
LILO Boot Errors
Error Codes

• The LILO boot loader is the
first piece of code that takes
control of the boot process
from the BIOS. It loads the
Linux kernel, and then
passes control entirely to the
Linux kernel.

• When there is a problem
with LILO, an error code will
be displayed:
– None, L error-code, LI,
LI101010…, LIL, LIL?, LIL-,
LILO
Booting a Linux System 
without LILO
• The “LILO on a floppy”
method is the least useful,
but it can help in some
instances.

• From this screen a LILO 
boot floppy disk can be 
created which can be used 
to boot Linux from LILO 
using the floppy disk.
Emergency Boot System

• Linux provides an
emergency copy
of LILO, which can be
used to boot Linux in the
event that the original LILO
boot loader has errors or is
not working.

• This is known as the
Emergency Boot System.

• To use this copy of LILO,
configuration changes
must be made in lilo.conf.
Using an Emergency 
Boot Disk in Linux
• There are several
reasons and errors that
can cause a Linux
system not to boot,
besides LILO problems.
• The emergency boot
disk should have the
necessary disk utilities
such as fdisk, mkfs,
and fsck, which can be
used to format a hard
drive so that Linux can
be installed on it. 
Using an Emergency 
Boot Disk in Linux
• It is always important to
include some sort of
backup software utility.
• If a change or repair to
some configuration files
needs to be made, first
back them up.
• Most distributions come
with some sort of backup
utility like tar, restore,
cpio, and possibly others.
Recognizing Common Errors
Various Reasons for 
Package Dependency Problems
• When a package is installed in a Linux system there
might be other packages that need to be installed for
that particular package to work properly.

• The dependency package may have certain files
which need to be in place or it may run certain
services which need to be started before the package
that is to be installed can work.

• The package manager will often notify the user if they
are installing a package that has dependencies so that
they can be installed as well.
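A small sketch of the idea, not of how any particular package manager works. The function missing_dependencies, the package names, and the dictionary-based dependency table are all hypothetical.

```python
from typing import Dict, List, Set


def missing_dependencies(
    package: str, requires: Dict[str, List[str]], installed: Set[str]
) -> List[str]:
    """List the not-yet-installed dependencies of `package`, in install order."""
    order: List[str] = []
    seen: Set[str] = {package}  # also protects against circular dependencies

    def visit(name: str) -> None:
        for dep in requires.get(name, []):
            if dep in seen:
                continue
            seen.add(dep)
            visit(dep)  # a dependency's own dependencies must come first
            if dep not in installed:
                order.append(dep)

    visit(package)
    return order


# Hypothetical dependency table and set of installed packages.
requires = {
    "webapp": ["libssl", "httpd"],
    "httpd": ["libssl", "libapr"],
}
installed = {"libssl"}
print(missing_dependencies("webapp", requires, installed))
```

Installing the returned packages in the listed order satisfies each package's requirements before the package itself is installed.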
Solutions to Package 
Dependency Problems
• One way around a package dependency problem is to
simply ignore the error message and forcibly install the
package anyway, although the package may then fail to run.

• The correct and recommended method
is to modify the system so that it has the
dependencies the package needs to run properly.

• It may be necessary to rebuild the package from source code
if dependency error messages keep showing up.

• The easiest way is to locate a different version of the package
that is causing the problems.

• Another option is to look for a newer version of the package.


Backup and Restore Errors

• Backup and Restore errors can occur at different points.

• Some errors will occur when the system is actually
performing the backup.

• Other errors will occur during the restore process when the
system is attempting to recover data.

• Some of the most common types of problems:
– Driver problems
– Tape drive access errors
– File access errors
– Media errors
– Files not found errors 
Application Failure 
on Linux Servers
• There are several things that can provide some
indication of an application failure or software problem
on a Linux server:
– Failure to Start
– Failure to Respond
– Slow Responses
– Unexpected Responses
– Crashing Application or Server

• A good general rule is to check the system’s logs.

• The system’s log files are usually the place to find most
error messages that are generated because they are not
always displayed on the screen.
Troubleshooting Network Problems
Loss of Connectivity

• Loss of connectivity can be hardware and/or software
related. The first rule of troubleshooting is to check
for physical connectivity.

• Ensure that the cables are properly plugged in at
both ends, that the network adapter is functioning by
checking the link light on the NIC, that the hub's
status lights are on, and that the communication
problem is not a simple hardware malfunction.
Operator Error

• Be sure that users are using the correct username and
password and that their accounts are not restricted in a
way that prevents them from being able to connect to the
network.

• Software settings might have been changed by the
installation routine of a recently installed program, or the
user might have been experimenting with settings.

• Users may accidentally or purposely delete files, and power
surges or abrupt shutdowns can
damage file data.

• Viruses can also damage system files or user data.
Using TCP/IP Utilities

• The first step in checking for a
suspected connectivity
problem is to ping the host.
• If a reply is received, the
physical connection between
the two computers is intact and
working.
• If the pinged host is on the
Internet, a successful reply also
signifies that the calling system
can reach the Internet.
• The term ping time refers to
the amount of time that
elapses between the sending
of the Echo Request and
receipt of the Echo Reply.
• A low ping time indicates a fast
connection.
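A minimal sketch of pulling ping times out of captured ping output. It assumes reply lines containing "time=<value> ms", as printed by common Linux ping implementations; the function ping_times_ms and the sample output (using a documentation-range address) are hypothetical.

```python
import re
from typing import List

# Assumed reply format: "... icmp_seq=1 ttl=64 time=12.5 ms"
_PING_TIME = re.compile(r"time[=<]([\d.]+)\s*ms")


def ping_times_ms(output: str) -> List[float]:
    """Extract every per-reply ping time, in milliseconds, from ping output."""
    return [float(match.group(1)) for match in _PING_TIME.finditer(output)]


# Illustrative output only, not captured from a real host.
sample_output = (
    "64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=12.5 ms\n"
    "64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=11.5 ms\n"
    "2 packets transmitted, 2 received, 0% packet loss, time 1001ms\n"
)
times = ping_times_ms(sample_output)
average = sum(times) / len(times)
print(times, average)
```

An empty result means no Echo Reply was seen in the output, which points back to checking physical connectivity.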
Using TCP/IP Utilities

• Tracing utilities are used to
discover the route taken by a
packet to reach its destination.
• On UNIX systems, packet
routing is determined with the
traceroute command.
• Traceroute shows all the
routers through which the
packet passes as it travels
through the network from
sending computer to destination
computer.
• This is useful for determining at
what point connectivity is lost or
slowed.
Using TCP/IP Utilities

• The ipconfig
command is used in
Windows NT and
Windows 2000 to
display the IP address,
subnet mask, and
default gateway for
which a network
adapter is configured.

• For more detailed
information, the /all
switch is used.
Problem-Solving Guidelines

• Troubleshooting a network requires problem-solving
skills.

• The use of a structured method to detect, analyze,
and address each problem as it is encountered
increases the likelihood of successful
troubleshooting.

• These steps should be followed:
– Gather information
– Analyze the information
– Formulate and implement a "treatment" plan
– Test to verify the results of the treatment
– Document everything
Windows 2000 Diagnostic Tools

• The network diagnostic
tools for Microsoft
Windows 2000 Server
include Ipconfig,
Nbtstat, Netstat,
Nslookup, Ping, and
Tracert.

• Windows 2000 Server
also includes the
Netdiag and Pathping
commands.
Wake-on-LAN

• Some network interface cards support a technology
known as Wake-On-LAN (WOL).
• The purpose of WOL technology is to enable a
network administrator to power up a computer by
sending a signal to its WOL-capable NIC.
• The signal is called a magic packet.
• When the magic packet is received by the NIC, it
will power up the computer in which it is installed.
• When fully powered up, the remote computer can
be accessed through normal remote diagnostic
software.
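A short sketch of a magic packet: 6 bytes of 0xFF followed by the target MAC address repeated 16 times. The function names and the MAC address are hypothetical, and sending the packet as a UDP broadcast to port 9 is a common convention assumed here, not something the slides specify.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a magic packet: 6 x 0xFF, then the 6-byte MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(
    mac: str, broadcast: str = "255.255.255.255", port: int = 9
) -> None:
    """Broadcast the magic packet over UDP (address and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))


packet = build_magic_packet("00:11:22:33:44:55")  # hypothetical MAC address
print(len(packet))
```

Calling send_magic_packet with the MAC address of the sleeping machine's NIC would put the packet on the local network. Whether the machine wakes depends on its BIOS and NIC settings.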
Disaster Recovery
Risk Analysis

A good risk analysis can be broken into the following
four parts:
• Identify business processes and their associated
infrastructure.
• Identify the threats associated with each of the
business processes and associated infrastructure.
• Define the level of risk associated with each threat.
• Rank the risks based on severity and likelihood.
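One possible way to do the final ranking step is to score each threat as severity multiplied by likelihood. That scoring rule, the 1 to 5 scales, the Risk class, and the sample threats are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Risk:
    threat: str
    severity: int    # assumed scale: 1 (minor) to 5 (catastrophic)
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)

    @property
    def score(self) -> int:
        return self.severity * self.likelihood


def rank_risks(risks: List[Risk]) -> List[Risk]:
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


# Hypothetical threats for a single business process.
risks = [
    Risk("Disk failure on file server", severity=4, likelihood=3),
    Risk("Building fire", severity=5, likelihood=1),
    Risk("Power outage", severity=3, likelihood=4),
]
ranked = rank_risks(risks)
for risk in ranked:
    print(risk.score, risk.threat)
```

Ties keep their original order because Python's sort is stable, so threats can be listed in a preferred order before ranking.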
Understanding Redundancy

• Redundancy is the ability to
continue providing service
when something fails.

• RAID 0 - also known as disk
striping, writes data
across two or more physical
drives and has no
redundancy.
Understanding Redundancy
• RAID 1 - also known as disk
mirroring, requires the use of two
disk drives and one disk controller
to provide redundancy.
• To increase performance add a
second controller, one for each disk
drive.
• RAID 5 - also known as disk striping
with parity. Parity is redundant
information calculated from the data
that allows the contents of a failed
drive to be rebuilt.
• RAID 5 is similar to RAID 0 in that it
writes data across disks but it adds
parity information for redundancy.
• At least three drives are required to
implement this type of RAID.
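A simplified sketch of how parity provides redundancy, using XOR over equal-sized blocks. It models one stripe across two data drives and one parity drive; real RAID 5 also rotates the parity across the drives, which is not shown. The function name xor_blocks and the block contents are hypothetical.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: two data blocks plus the parity block computed from them.
data_blocks = [b"ABCD", b"EFGH"]
parity = xor_blocks(data_blocks)

# Simulate losing the first drive: rebuild its block from the survivors.
rebuilt = xor_blocks([data_blocks[1], parity])
print(rebuilt)
```

Because XOR is its own inverse, any single missing block in the stripe can be recovered from the remaining blocks. Losing two drives at once cannot be recovered this way.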
Understanding Redundancy
• RAID 0+1 offers the best of
both worlds, the performance
of RAID 0 and the
redundancy of RAID 1.
• This is an expensive solution
because of the number of
drives it requires.
• A number of other
components in the server can
be configured in a redundant
manner:
– Power supplies, Cooling
fans, Network interface
adapters, Processors,
Uninterruptible power supply
(UPS)
Clustering

• A cluster is a group of
independent computers
working together as a
single system.
• This system is used to
ensure that mission-critical
applications and resources
are as highly available as
possible.
• The advantages of running
a clustered configuration:
– Fault tolerance, High
availability, Scalability, Easier
manageability 
Scalability

• Scalability refers to how well
a hardware or software
system can adapt to
increased demands.
• The question is how much
extra capacity should be built
in, and how much additional
capacity can be added once
the server is installed.
• It is a good idea to add an
additional 25% of capacity to
any new server configuration
to ensure scalability.
High Availability

• High availability is the designing and configuring of a
server to provide continuous use of critical data and
applications.

• Highly available systems are required to provide
access to the enterprise applications that keep
businesses up and running, regardless of planned or
unplanned interruption.

• It is not uncommon for mission critical applications to
have an availability requirement of 99.999%.
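A quick worked calculation of what such a requirement allows. It assumes a 365-day year, and the function name allowed_downtime_minutes is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year permitted by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


five_nines = allowed_downtime_minutes(99.999)
print(round(five_nines, 2))  # about 5.26 minutes per year
```

At 99.999%, a single reboot for a planned upgrade could use up most of a year's downtime budget.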
Hot Swapping, 
Warm Swapping, and Hot Spares
The types of components that might be kept on hand in case of
a problem are broken into these three basic categories:

1. A hot-swap component has the capability to be added
and removed from a computer while the computer is
running and have the operating system automatically
recognize the change.
2. Warm swaps are generally done in conjunction with the
failure of a hard drive. In this case, it is necessary to shut
down the disk array before the drive can be replaced.
3. A hot-spare component is a component that can be kept
on hand in case of an equipment failure.
Creating a Disaster Recovery Plan 
Based on Fault Tolerance/Recovery
The first piece of the disaster-recovery plan is the
fault-tolerance portion. To create it, follow these steps:

1. From the risk analysis, identify the hardware
failure-related threats.
2. From the list of components, identify the components
that place the data at the most risk if they were to fail.
3. Take each component and make a list of the methods
that could be used to implement it in a fault-tolerant
configuration. List approximate costs for each solution
and the estimated outage time in the event of a failure
for each component.
Creating a Disaster Recovery Plan Based on 
Fault Tolerance/Recovery
4. Take any components that can be implemented in a
cost-effective manner and start documenting the
configuration.
5. Take any components that either cannot be
implemented in a fault-tolerant configuration or for
which a fault-tolerant configuration would be
cost-prohibitive, and determine whether a spare part should
be kept on hand in the event of an outage.
6. The disaster-recovery plan should include documented
contingencies for any of the threats identified as part of
the risk analysis.
7. After all this information has been documented, place
the orders and get ready to start configuring the server.
Testing the Plan

Some of the things that should be tested for include the following:

• Check the documentation to ensure that it is understandable
and complete.
• Do a “dry run” of each of the components of the plan. Make
sure spare drives can be located, if applicable, and that
replacement parts can be ordered from the vendor.
• Test the notification processes. It should be documented
who is to be notified in case of an outage.
• Check the locations of any hot spare equipment or servers.
• Verify that any support contracts that are on equipment are
still in effect, and that all the contact numbers are available.
• Test the tape backups at least once a week.
• Test the RAID configuration at least twice a year.
Hot and Cold Sites

Two types of disaster-recovery sites are commonly used:

1. A hot site is a commercial facility available for
systems backup. For a fee, these vendors provide a
facility with server hardware, telecommunications
hardware, and personnel to assist a company with
restoring critical business applications and services
in the event of an interruption in normal operations.

2. A cold site, also known as a shell site, is a facility
that has been prepared to receive computer
equipment in the event that the primary facility is
destroyed or becomes unusable.
