
CSE 398: System Administration

Debugging

Learn the customer's problem

Find the root cause and fix it

Have the right tools

Fixing things once

Fix things once, rather than over and over

Avoid the temporary fix trap

Learning from carpenters

Spring 2004

2004 Brian D. Davison

Learn the customer's problem

Step one: understand (at a high level) what the user
is trying to do, and what part is failing

The customer expects a particular result from some
action, but is getting something else

Ex:

My mail program is broken

I can't reach the mail server

My mailbox disappeared!

Any could be true, but the real problem could be
DNS, a power failure, a network problem, etc.

When complete, make sure the customer agrees!
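
A reported symptom such as "I can't reach the mail server" can be narrowed down by checking one layer at a time. The following is a minimal sketch, not a complete diagnostic: the function name `triage`, the host `mail.example.com`, and the injectable `resolve`/`connect` parameters are assumptions made for illustration.

```python
import socket


def triage(host, port, resolve=socket.getaddrinfo,
           connect=socket.create_connection, timeout=3.0):
    """Report the first layer that fails when reaching host:port.

    resolve and connect are injectable so the logic can be exercised
    without a network; by default they are the real socket functions.
    """
    try:
        resolve(host, port)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"DNS: cannot resolve {host} ({exc})"
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError as exc:  # refused, unreachable, timed out, ...
        return f"network: resolved {host} but cannot connect to port {port} ({exc})"
    conn.close()
    return f"ok: {host}:{port} is reachable; look at the application next"


if __name__ == "__main__":
    # Hypothetical host name; replace with the server the customer uses.
    print(triage("mail.example.com", 25))
```

Whatever the result, it is only a hypothesis about the customer's problem until the customer confirms that it matches what they were trying to do.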


Example #1: tape failures
As every System Administrator knows, reliable backups are a
must. Because of this my team became suitably concerned when
the operators handling our central database servers started to
report "tape failures." The failures soon became regular, and
required regular manual intervention to keep operational. In
investigating the cause of this problem, corporate security and
production floor rules forced us to depend on the operators for
information. The operations staff placed the blame on the
off-site tape storage service's jostling tapes during transport, and
requests for samples of failed tapes gave no indication as to the
cause.


Example #1 continued
The root cause of the problem didn't become obvious until this
had been going on for a couple of months. During a large
system upgrade, my team was able to observe the operators at
work. The operations staff had been outsourced to a low-cost
contracting firm that apparently contained a large percentage of
fans of the local professional hockey team. The operators were
skidding the 8mm tapes across the computer room floor like a
hockey puck instead of carrying them across the floor. Adding a
rule prohibiting throwing, skipping, and sliding of backup tapes
quickly restored backups to a reliable state.
Tape Hockey, by Allen Peckham


Find the problem's cause and fix it

Workarounds are good, but fixing the root
cause is much better

Rebooting/restarting is a common workaround

E.g., solution for full disk problem is not to
delete old log files

Improving the speed of reboots is not really
the solution either!
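
For the full-disk case, looking for the root cause starts with finding out what is actually consuming the space. A minimal sketch, assuming the investigation begins at some directory `root`; the function name `largest_dirs` is hypothetical:

```python
import os


def largest_dirs(root, top=5):
    """Return (name, bytes) for the largest immediate subdirectories of root."""
    totals = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            size = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        size += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
            totals[entry.name] = size
    # Largest consumers first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Knowing which directory grows, and how fast, points to whatever is writing there, which is the thing to fix. Deleting old files only resets the clock.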


Example #2: mail problems
An ISP noticed that email service was particularly slow one day
and they were getting complaints that it took up to 4 hours to
deliver messages that were sent through the SMTP server.
The quick (and easy) solution would be to restart the server and
flush out whatever was slowing it down. That would have
masked the problem, however.
Instead, they monitored the service and noticed that they were
getting repeated accesses from the same site. Hundreds, no
thousands of emails flowing into us from one source. This
indicated that someone was spamming through the ISP.


Example #2 continued
With the knowledge of their IP address, they were able to track
down who they were and block them from the system, thereby
stopping them from spamming through it anymore.
(Yes, they had spam blocking in place; this user was a customer
and therefore was allowed to use the SMTP server. Their
Acceptable Use Policy, however, forbade using it to send
unsolicited commercial bulk email, so they banned the user.)
Source: Lecture notes of Scott Heffner, Keene State College
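
Spotting "repeated accesses from the same site" amounts to counting log entries per source. A minimal sketch follows; the `client=` log field and the sample lines are invented for illustration and do not correspond to any particular mail server's log format.

```python
import re
from collections import Counter

# Hypothetical log field; adapt the pattern to the real log format.
CLIENT_RE = re.compile(r"client=(\S+)")


def top_senders(log_lines, limit=3):
    """Count log entries per client address, busiest first."""
    counts = Counter()
    for line in log_lines:
        match = CLIENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(limit)


# Made-up sample lines using documentation-only IP addresses.
sample = [
    "09:00:01 client=192.0.2.10 msg=a1",
    "09:00:01 client=192.0.2.10 msg=a2",
    "09:00:02 client=198.51.100.7 msg=b1",
    "09:00:02 client=192.0.2.10 msg=a3",
]
print(top_senders(sample))  # [('192.0.2.10', 3), ('198.51.100.7', 1)]
```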


Example #3: missing files
In the middle of the night, all the machines went down,
with varying amounts of stuff missing.
Nobody knew what was going on! The systems were
restored from backup, and things seemed to be going OK,
until the next night.
This time, Corporate Security was called in, and the
admin group's supervisor was called back from his
vacation (I think there's something in there about a
helicopter picking the guy up from a rafting trip in the
Grand Canyon).


Example #3 continued
By chance, somebody checked the cron scripts, and all
was well for the next night...
Why? What happened:
We have a home-grown admin system that controls
accounts on all of our machines. It has a remove-user
operation that removes the user from all machines at
the same time in the middle of the night.
Well, one night, the thing goes off and tries to remove
a user with the home directory '/'...


Organization: AT&T Bell Labs, Murray Hill, NJ, USA
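
The story does not say how the remove-user operation was repaired. One plausible permanent fix, shown here purely as an assumption, is to validate the home directory before anything is deleted. The names `safe_to_remove` and `PROTECTED` and the `/home` base are hypothetical.

```python
import posixpath

# Illustrative list only; a real system would maintain its own.
PROTECTED = {"/", "/bin", "/etc", "/home", "/tmp", "/usr", "/var"}


def safe_to_remove(home, base="/home"):
    """Return True only if home is a proper subdirectory of base."""
    path = posixpath.normpath(home)  # collapses "..", ".", and repeated slashes
    if not posixpath.isabs(path) or path in PROTECTED:
        return False
    base = posixpath.normpath(base)
    # The directory must live strictly below base, never be base itself.
    return posixpath.commonpath([path, base]) == base and path != base


print(safe_to_remove("/home/alice"))  # True
print(safe_to_remove("/"))            # False
```

A check like this fails closed: an empty, relative, or unexpected home directory is refused rather than passed to the deletion step.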

How to find the cause

Be systematic

Form hypotheses, test them, note the results, make changes
based on those results

Use

Process of elimination

Successive refinement

Real problem is most often associated with the most
recent change made to the host, network, or whatever
is broken


From a lack of testing

Process of elimination

Remove different parts of the system until the
problem disappears

Problem was in last part removed

Common technique for hardware problems

Swap or remove pieces until it works

Also works for software


Successive refinement

Add one new component at a time and verify that it
works correctly

traceroute works this way

May require examining intermediate stages of output

For systems/processes with many components, the
process of elimination and successive refinement may
take a while.


Why? What is an alternative?
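
The slide leaves this question open. One plausible answer, offered here as an assumption, is that testing one component at a time costs a number of tests proportional to the number of components, while halving the search space each time needs only about log2(n) tests. A minimal sketch, assuming the changes are ordered and that once the system breaks it stays broken; `first_bad_change` and `works_with` are hypothetical names.

```python
def first_bad_change(changes, works_with):
    """Find the first change that breaks the system by repeated halving.

    works_with(n) reports whether the system works with the first n
    changes applied. Assumes works_with(0) is True, that
    works_with(len(changes)) is False, and that the system stays
    broken once the bad change is in.
    Returns (index of the bad change, number of tests run).
    """
    good, bad = 0, len(changes)  # invariant: works_with(good), not works_with(bad)
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if works_with(mid):
            good = mid
        else:
            bad = mid
    return bad - 1, tests


# 16 changes; the one at index 10 is the culprit, so any prefix of
# 11 or more changes fails.
print(first_bad_change(list(range(16)), lambda n: n <= 10))  # (10, 4)
```

Four tests instead of up to sixteen; the gap widens quickly as the number of components grows.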


Have the right tools

Diagnostic tools let you see into devices or systems to
see inner workings

Still need to interpret what you see

Packet sniffers are easy to use

Understanding the protocols captured requires knowledge
and training (e.g., networking courses)

Understand how the tool works

It may draw the wrong conclusion

Simple tools are often best


ping, traceroute, telnet

Take a scientific approach

Given an unusual, recurring
behavior/problem


Collect data

Visualize [optional, but often helpful]

Discern patterns

Hypothesize source of patterns

Test for such sources

Apply solution

Test solution
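
The first three steps can be as simple as bucketing incident times and looking at the result. A minimal sketch with made-up timestamps; the function names are hypothetical and the data is not from any real incident.

```python
from collections import Counter
from datetime import datetime


def incidents_by_hour(timestamps):
    """Collect: count incidents per hour of day from ISO 8601 strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)


def text_histogram(counts):
    """Visualize: one line per hour that has at least one incident."""
    return "\n".join(
        f"{hour:02d}:00 {'#' * n}" for hour, n in sorted(counts.items())
    )


# Made-up incident times, for illustration only.
observed = [
    "2004-03-01T02:05:00", "2004-03-02T02:10:00",
    "2004-03-03T02:01:00", "2004-03-03T14:30:00",
]
print(text_histogram(incidents_by_hour(observed)))
```

A cluster at one hour suggests a hypothesis, such as a scheduled job running at that time, which can then be tested directly.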

End-to-end understanding helps
A customer reports that some of his files were disappearing:
he had about 100 MB in his home directory, and all but
2 MB had disappeared.
He restored his files. A couple of days later it happened
again.
This had been happening for a few weeks, but he was
embarrassed to tell the system admins.
Theory 1: Virus. Scans revealed nothing.
Theory 2: Prank, or bad cron job.
He was given pager numbers and told to call the next time it happened.
Network sniffers were put into place.


Example #4 continued
It happens again. He was asked what he had done last: he had
used a lab machine to surf the Web.
Extra knowledge helps: a sysadmin remembered that
Web browsers kept a cache and pruned it to stay under a
certain limit (such as 2 MB).
The lab workstation was misconfigured; the browser found an
invalid/missing cache directory and used the user's home
directory instead.


Fix things once,
rather than over and over again

When something seems trivial or temporary,
it is easy to ignore it, or use a quick fix

A little effort will often pay for itself

Rule: Fix it once


Corollary A: Fix the problem permanently

Corollary B: Leverage what others have done;
don't reinvent the wheel

Corollary C: Fix a problem for all hosts at the
same time

Avoid the temporary fix trap

Sometimes a complete fix is impossible in that
situation

It's important that a temporary fix be followed by a
permanent one

Record the actions taken for a temporary problem!

Put the full solution on a trouble ticket!

Fixing the same small things is habit-forming: we
get good at the keystrokes needed!

We get used to the quick fix, and don't realize how
much time we have lost as a result.


Mailing list example

Running a mailing list seems easy.

E.g., automated subscribe and unsubscribe

Book author ran many mailing lists, and had to deal with
bounced messages.

Dealing with bounces takes time. Wrote scripts to help
manage it: collect bounces, figure out who was bouncing,
delete subscriber if error persisted. Still took ~1 hour a day!

Better solution was other software that handled bounces
or made list owners deal with them


He ignored bounces for a week; stayed late to install new
software without interruption. Cost: 5 hours; savings: 4
hours per week.
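
The trade-off can be checked with simple arithmetic, using only the figures on the slide (5 hours of cost, 4 hours saved per week). The function names and the 52-week horizon are chosen for illustration.

```python
def break_even_weeks(cost_hours, savings_per_week):
    """Weeks until the one-time cost is paid back."""
    return cost_hours / savings_per_week


def net_hours_saved(cost_hours, savings_per_week, weeks):
    """Hours gained after the given number of weeks, net of the cost."""
    return savings_per_week * weeks - cost_hours


print(break_even_weeks(5, 4))      # 1.25
print(net_hours_saved(5, 4, 52))   # 203
```

The permanent fix pays for itself in under two weeks, and over a year it returns roughly 200 hours.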

Learning from carpenters

Measure twice, cut once

A little extra care is a small price compared to the
potential damage of a mistake.

Carpenters copy a length by reusing the
original piece over and over again


Reuse working scripts rather than rewriting
them

Use command-line shell shortcuts rather than
re-typing

Example #5: rm folly
My mistake on SunOS (with OpenWindows) was to try
and clean up all the '.*' directories in /tmp. Obviously "rm
-rf /tmp/*" missed these, so I was very careful and made
sure I was in /tmp and then executed
"rm -rf ./.*"
I will never do this again. If I am in any doubt as to how a
wildcard will expand I will echo it first.
Organization: DataCAD Ltd, Hamilton, Scotland
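
The danger is that the pattern `.*` also matches `.` and `..`, so the command reaches the parent directory. The sketch below previews which names a pattern matches before anything is deleted. It applies `fnmatch` to a hand-written listing because Python's `glob` module never returns `.` and `..`; this is an illustration of the matching rule, not a reimplementation of the shell's expansion. The entry names other than `.` and `..` are made up.

```python
import fnmatch


def preview(pattern, entries):
    """Return the entries a wildcard pattern would match, deleting nothing."""
    return fnmatch.filter(entries, pattern)


# A directory listing as a traditional shell sees it, with "." and ".."
# included. The remaining names are invented for illustration.
entries = [".", "..", ".X11-unix", ".cache", "report.txt"]

print(preview(".*", entries))  # ['.', '..', '.X11-unix', '.cache']
```

Seeing `..` in the preview is the warning the author wished for; in the shell itself, `echo ./.*` serves the same purpose.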


Summary

Understand the problem

Fixes should be permanent

Leverage others' fixes

Fixes should be global

Test your solution!
