Debugging
Learn the customer's problem
Find the root cause and fix it
Have the right tools
Fixing things once
Fix things once, rather than over and over
Avoid the temporary fix trap
Learning from carpenters
Spring 2004
CSE 398: System Administration
2004 Brian D. Davison
Learn the customer's problem
Step one: understand (at a high level) what the user is trying to do, and what part is failing
The customer expects a particular result from some action, but is getting something else
Ex:
My mail program is broken
I can't reach the mail server
My mailbox disappeared!
Any could be true, but the real problem could be DNS, a power failure, a network problem, etc.
When complete, make sure the customer agrees!
Example #1: tape failures
As every System Administrator knows, reliable backups are a must. Because of this my team became suitably concerned when the operators handling our central database servers started to report "tape failures." The failures soon became regular, and required regular manual intervention to keep operational. In investigating the cause of this problem, corporate security and production floor rules forced us to depend on the operators for information. The operations staff placed the blame on the off-site tape storage service's jostling tapes during transport, and requests for samples of failed tapes gave no indication as to the cause.
Example #1 continued
The root cause of the problem didn't become obvious until this had been going on for a couple of months. During a large system upgrade, my team was able to observe the operators at work. The operations staff had been outsourced to a low-cost contracting firm that apparently contained a large percentage of fans of the local professional hockey team. The operators were skidding the 8mm tapes across the computer room floor like a hockey puck instead of carrying them across the floor. Adding a rule prohibiting throwing, skipping, and sliding of backup tapes quickly restored backups to a reliable state.
Tape Hockey, by Allen Peckham
Find the problem's cause and fix it
Workarounds are good, but fixing the root cause is much better
Rebooting/restarting is a common workaround
E.g., solution for full disk problem is not to delete old log files
Improving the speed of reboots is not really the solution either!
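
For the full-disk case, one way to look for the root cause instead of deleting old logs yet again is to see which directories are actually consuming the space. The following is a minimal sketch, assuming Python is available on the host; the function names `usage_by_subdir` and `biggest` are illustrative and do not come from the lecture.

```python
import os

def usage_by_subdir(root):
    """Return {immediate subdirectory name: total bytes of files beneath it}.

    Sizes are apparent file sizes (st_size), not allocated disk blocks.
    """
    totals = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        # os.walk does not follow symlinked directories by default.
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                try:
                    total += os.lstat(os.path.join(dirpath, name)).st_size
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        totals[entry.name] = total
    return totals

def biggest(root, n=5):
    """List the n largest subdirectories of root, largest first."""
    totals = usage_by_subdir(root)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Running `biggest("/var")` on successive days would show which directory is growing, which points at the process that needs fixing.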
Example #2: mail problems
An ISP noticed that email service was particularly slow one day and they were getting complaints that it took up to 4 hours to deliver messages that were sent through the SMTP server.
The quick (and easy) solution would be to restart the server and flush out whatever was slowing it down. That would have masked the problem, however.
Instead, they monitored the service and noticed that they were getting repeated accesses from the same site. Hundreds, no, thousands of emails were flowing in from one source. This indicated that someone was spamming through the ISP.
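
The monitoring step amounts to counting messages per source. A minimal sketch follows, assuming a hypothetical log format of `<timestamp> <source-ip> <recipient>`; real mail server logs differ, so the parsing line would need adjusting. The sample data is made up.

```python
from collections import Counter

def top_sources(log_lines, n=3):
    """Count messages per source IP and return the n busiest sources."""
    counts = Counter()
    for line in log_lines:
        # Hypothetical format: "<timestamp> <source-ip> <recipient>"
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts.most_common(n)

# Made-up sample using documentation-reserved addresses.
sample = [
    "2004-02-01T09:00:01 192.0.2.7 a@example.com",
    "2004-02-01T09:00:02 192.0.2.7 b@example.com",
    "2004-02-01T09:00:02 198.51.100.4 c@example.com",
    "2004-02-01T09:00:03 192.0.2.7 d@example.com",
]
```

Here `top_sources(sample, 1)` singles out the one address responsible for most of the traffic.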
Example #2 continued
With the knowledge of their IP address, they were able to track down who they were and block them from the system, thereby stopping them from spamming through it anymore.
(Yes, they had spam blocking in place; this user was a customer and therefore was allowed to use the SMTP server. Their Acceptable Use Policy, however, forbade using it to send unsolicited commercial bulk email, so they banned the user.)
Source: Lecture notes of Scott Heffner, Keene State College
Example #3: missing files
In the middle of the night, all the machines went down, with varying amounts of stuff missing.
Nobody knew what was going on! The systems were restored from backup, and things seemed to be going OK, until the next night.
This time, Corporate Security was called in, and the admin group's supervisor was called back from his vacation (I think there's something in there about a helicopter picking the guy up from a rafting trip in the Grand Canyon).
Example #3 continued
By chance, somebody checked the cron scripts, and all was well for the next night...
Why? What happened:
We have a home-grown admin system that controls accounts on all of our machines. It has a remove-user operation that removes the user from all machines at the same time in the middle of the night.
Well, one night, the thing goes off and tries to remove a user with the home directory '/' ...
Organization: AT&T Bell Labs, Murray Hill, NJ, USA
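
A remove-user tool can refuse to act on a home directory that is not plausibly a home directory. The sketch below is an illustration only, not the system described in the story; the `PROTECTED` set, the `/home` root, and the function name are assumptions.

```python
import posixpath

# Hypothetical policy: paths that must never be treated as a user's home.
PROTECTED = {"/", "/bin", "/etc", "/home", "/usr", "/var"}

def safe_to_remove_home(home, homes_root="/home"):
    """Return True only if home is a proper subdirectory of homes_root."""
    if not home or not posixpath.isabs(home):
        return False
    # Collapse "..", ".", and repeated slashes before checking anything.
    path = posixpath.normpath(home)
    root = posixpath.normpath(homes_root)
    if path in PROTECTED:
        return False
    return path != root and posixpath.commonpath([path, root]) == root
```

With this check, `safe_to_remove_home("/")` is false, so the nightly job would skip the entry and could report it instead of deleting from the root.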
How to find the cause
Be systematic
Form hypotheses, test them, note the results, make changes based on those results
Use:
Process of elimination
Successive refinement
Real problem is most often associated with the most recent change made to the host, network, or whatever is broken
From a lack of testing
Process of elimination
Remove different parts of the system until the problem disappears
Common technique for hardware problems
Problem was in last part removed
Swap or remove pieces until it works
Also works for software
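
The procedure above can be sketched as a loop. This is a minimal illustration, assuming a single faulty component and a caller-supplied check `problem_occurs(active)` that reports whether the problem still shows up with the given components in place; both names are hypothetical.

```python
def find_faulty(components, problem_occurs):
    """Remove components one at a time until the problem disappears.

    Returns the last component removed, or None if the problem never went away.
    """
    active = list(components)
    while active:
        removed = active.pop()
        if not problem_occurs(active):
            # The problem vanished right after this removal.
            return removed
    return None

# Made-up example: the fault follows the cable.
parts = ["nic", "cable", "switch", "router"]
culprit = find_faulty(parts, lambda active: "cable" in active)
```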
Successive refinement
Add one new component at a time and verify that it works correctly
traceroute works this way
May require examining intermediate stages of output
For systems/processes with many components, the process of elimination and successive refinement may take a while.
Why? What is an alternative?
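
Both techniques check one component per step, so the work grows linearly with the number of components. One possible alternative, which the slides do not name, is to bisect: test the midpoint and discard half the candidates each time. The sketch below assumes the components form an ordered pipeline, that `output_ok(k)` (a hypothetical caller-supplied check) reports whether the output after the first `k` stages is correct, and that once the output goes wrong it stays wrong downstream.

```python
def first_bad_linear(stages, output_ok):
    """Successive refinement: add one stage at a time and verify after each."""
    for k in range(1, len(stages) + 1):
        if not output_ok(k):
            return stages[k - 1]
    return None

def first_bad_bisect(stages, output_ok):
    """Binary search for the first stage whose output is wrong."""
    lo, hi = 0, len(stages)  # output after lo stages is good; after hi it is bad
    if output_ok(hi):
        return None          # the whole pipeline works
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if output_ok(mid):
            lo = mid
        else:
            hi = mid
    return stages[hi - 1]
```

For eight stages the linear version may need up to eight checks, while the bisecting version needs about four (one for the whole pipeline, then three halvings).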
Have the right tools
Diagnostic tools let you see into devices or systems to see inner workings
Still need to interpret what you see
Packet sniffers are easy to use
Understand how the tool works
Understanding the protocols captured requires knowledge and training (e.g., networking courses)
It may draw the wrong conclusion
Simple tools are often best
ping, traceroute, telnet
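
A common use of telnet as a diagnostic is simply to see whether a TCP port accepts connections. A minimal sketch of the same check using Python's standard `socket` module follows; the function name `port_open` and the host name in the usage note are illustrative.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `port_open("mail.example.com", 25)` tells you whether anything answers on the SMTP port. As with any tool, the result still needs interpreting: a False could mean the server is down, a firewall is in the way, or DNS is broken.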
Take a scientific approach
Given an unusual, recurring behavior/problem:
Collect data
Visualize [optional, but often helpful]
Discern patterns
Hypothesize source of patterns
Test for such sources
Apply solution
Test solution
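
The first three steps (collect, visualize, discern patterns) can be sketched in a few lines. This is an illustration under assumptions: incident times are recorded as ISO 8601 strings, the function names are hypothetical, and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime

def incidents_by_hour(timestamps):
    """Count incidents per hour of day from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)

def histogram(counts):
    """Render a crude text histogram, one row per hour that has incidents."""
    return [f"{hour:02d}:00 {'#' * n}" for hour, n in sorted(counts.items())]

# Made-up observations of a recurring failure.
observed = [
    "2004-03-01T02:05:00",
    "2004-03-02T02:07:00",
    "2004-03-03T02:01:00",
    "2004-03-03T14:30:00",
]
```

A cluster around 02:00 would suggest a hypothesis worth testing, such as a nightly cron job.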
End-to-end understanding helps
A customer reports that some of his files were disappearing: he had about 100 MB in his home directory, and all but 2 MB had disappeared.
He restored his files. A couple of days later it happened again.
This had been happening for a few weeks, but he was embarrassed to tell the system admins.
Theory 1: Virus. Scans revealed nothing.
Theory 2: Prank, or bad cron job.
He was given pager numbers and told to call the next time it happened
Network sniffers were put into place
Example #4 continued
It happens again. Asked what he last did, he said he had used a lab machine to surf the Web.
Extra knowledge helps: a sysadmin remembered that Web browsers kept a cache and pruned it to stay under a certain limit (such as 2 MB).
The lab workstation was misconfigured; the browser found an invalid/missing cache directory and used the user's home directory instead.
Fix things once, rather than over and over again
When something seems trivial or temporary, it is easy to ignore it, or use a quick fix
A little effort will often pay for itself
Rule: Fix it once
Corollary A: Fix the problem permanently
Corollary B: Leverage what others have done; don't reinvent the wheel
Corollary C: Fix a problem for all hosts at the same time
Avoid the temporary fix trap
Sometimes a complete fix is impossible in that situation
It's important that a temporary fix be followed by a permanent one
Record the actions taken for a temporary problem!
Put the full solution on a trouble ticket!
Fixing the same small things is habit forming: we get good at the keystrokes needed!
We get used to the quick fix, and don't realize how much time we have lost as a result.
Mailing list example
Running a mailing list seems easy.
E.g., automated subscribe and unsubscribe
Book author ran many mailing lists, and had to deal with bounced messages.
Dealing with bounces takes time. Wrote scripts to help manage it: collect bounces, figure out who was bouncing, delete subscriber if error persisted. Still took ~1 hour a day!
Better solution was other software that handled bounces or made list owners deal with them
He ignored bounces for a week; stayed late to install new software without interruption. Cost: 5 hours; savings: 4 hours per week.
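
The payoff of the permanent fix is simple arithmetic. A small sketch using the figures from the slide (5 hours of cost, 4 hours saved per week) follows; the function names and the 52-week horizon are illustrative choices.

```python
def break_even_weeks(one_time_cost_hours, savings_hours_per_week):
    """Weeks until the time saved equals the time invested."""
    return one_time_cost_hours / savings_hours_per_week

def net_hours_saved(one_time_cost_hours, savings_hours_per_week, weeks):
    """Hours gained after the given number of weeks, net of the up-front cost."""
    return savings_hours_per_week * weeks - one_time_cost_hours

weeks_to_recoup = break_even_weeks(5, 4)       # 1.25 weeks
saved_in_a_year = net_hours_saved(5, 4, 52)    # 203 hours
```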
Learning from carpenters
Measure twice, cut once
A little extra care is a small price compared to the potential damage of a mistake.
Carpenters copy a length by reusing the original piece over and over again
Reuse working scripts rather than rewriting them
Use command line shell shortcuts rather than re-typing
Example #5: rm folly
My mistake on SunOS (with OpenWindows) was to try and clean up all the '.*' directories in /tmp. Obviously "rm -rf /tmp/*" missed these, so I was very careful and made sure I was in /tmp and then executed "rm -rf ./.*"
I will never do this again. If I am in any doubt as to how a wildcard will expand I will echo it first.
Organization: DataCAD Ltd, Hamilton, Scotland
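
The likely reason this went wrong is that in many shells of that era the pattern `.*` also matches `.` and `..`, so `./.*` includes the parent directory. The sketch below previews a pattern against a list of names without touching any files. It uses Python's `fnmatch`, whose matching is similar to but not identical to a shell's (for instance, `*` in `fnmatch` also matches names starting with a dot), and the entry names are made up.

```python
import fnmatch

def preview(pattern, names):
    """Return the names a shell-style pattern matches, without touching files."""
    return fnmatch.filter(names, pattern)

# Directory entries as a shell that expands '.' and '..' would see them.
# The file names are hypothetical.
entries = [".", "..", ".X11-unix", ".cache", "scratch.txt"]

matched = preview(".*", entries)
```

Here `matched` contains `..` alongside the intended dot-directories, which is exactly what echoing the wildcard first would have revealed.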
Summary
Understand the problem
Fixes should be permanent
Leverage others' fixes
Fixes should be global
Test your solution!