12/13/2014

Designing Watchdog Timers for Embedded Systems

Great Watchdog Timers For Embedded Systems
Version 1.3, updated September, 2011
Launched in January of 1994, the Clementine spacecraft spent two very successful months mapping the moon before leaving lunar orbit to head towards near-Earth asteroid Geographos.

A dual-processor Honeywell 1750 system handled telemetry and various spacecraft functions. Though the 1750 could control Clementine's thrusters, it did so only in emergency situations; all routine thruster operations were under ground control.
On May 7, 1994, the 1750 experienced a floating point exception. This wasn't unusual; some 3000 prior exceptions had been detected and handled properly. But immediately after the May 7 event downlinked data started varying wildly and nonsensically. Then the data froze. Controllers spent 20 minutes trying to bring the system back to life by sending software resets to the 1750; all were ignored. A hardware reset command finally brought Clementine back online.
Alive, yes, even communicating with the ground, but with virtually no fuel left.
The evidence suggests that the 1750 locked up, probably due to a software crash. While hung the processor turned on one or more thrusters, dumping fuel and setting the spacecraft spinning at 80 RPM. In other words, it appears the code ran wild, firing thrusters it should never have enabled; they kept firing till the tanks ran nearly dry and the hardware reset closed the valves.
The mission to Geographos had to be abandoned.

Designers had worried about this sort of problem and implemented a software thruster timeout. That, of course, failed when the firmware hung.
The 1750's built-in watchdog timer hardware was not used, over the objections of the lead software designer. With no automatic "reset" button, success of the mission rested in the abilities of the controllers on Earth to detect problems quickly and send a hardware reset. For the lack of a few lines of watchdog code the mission was lost.
Though such a fuel dump had never occurred on Clementine before, roughly 16 times before the May 7 event hardware resets from the ground had been required to bring the spacecraft's firmware back to life. One might also wonder why some 3000 previous floating point exceptions were part of the mission's normal firmware profile.
Not surprisingly, the software team wished they had indeed used the watchdog timer, and had not implemented the thruster timeout in firmware. They also noted, though, that a normal, simple watchdog may not have been robust enough to catch the failure mode.
Contrast this with Pathfinder, a mission whose software also famously hung, but which was saved by a reliable watchdog. The software team found and fixed the bug, uploading new code to a target system 40 million miles away, enabling an amazing roving scientific mission on Mars.
Watchdog timers (WDTs) are our failsafe, our last line of defense, an option taken only when all else fails, right? These missions (Clementine had been reset 16 times prior to the failure) and so many others suggest to me that WDTs are not emergency outs, but integral parts of our systems. The WDT is as important as main() or the runtime library; it's an asset that is likely to be used, and maybe used a lot.
Outer space is a hostile environment, of course, with high intensity radiation fields, thermal extremes and vibrations we'd never see on Earth. Do we have these worries when designing Earth-bound systems?
Maybe so. Intel revealed that the McKinley processor's ultra fine design rules and huge transistor budget mean cosmic rays may flip on-chip bits. The Itanium 2 processor, also sporting an astronomical transistor budget and small geometry, includes an onboard system management unit to handle transient hardware failures. The hardware ain't what it used to be, even if our software were perfect.
http://www.ganssle.com/watchdogs.htm
But too much (all?) firmware is not perfect. Consider this unfortunately true story from Ed VanderPloeg:
"The world has reached a new embedded software milestone: I had to reboot my hood fan. That's right, the range exhaust fan in the kitchen. It's a simple model from a popular North American company. It has six buttons on the front: 3 for low, medium, and high fan speeds and 3 more for low, medium, and high light levels. Press a button once and the hood fan does what the button says. Press the same button again and the fan or lights turn off. That's it. Nothing fancy. And it needed rebooting via the breaker panel."
"Apparently the thing has a micro to control the light levels and fan speeds, and it also has a temperature sensor to automatically switch the fan to high speed if the temperature exceeds some fixed threshold. Well, one day we were cooking dinner as usual, steaming a pot of potatoes, and suddenly the fan kicks into high speed and the lights start flashing. 'Hmm, flaky sensor or buggy sensor software', I think to myself."
"The food happened to be done so I turned off the stove and tried to turn off the fan, but I suppose it wanted things to cool off first. Fine. So after ten minutes or so the fan and lights turned off on their own. I then went to turn on the lights, but instead they flashed continuously, with the flash rate depending on the brightness level I selected."
"So just for fun I tried turning on the fan, but any of the three fan speed buttons produced only high speed. 'What "smart" feature is this?', I wondered to myself. Maybe it needed to rest a while. So I turned off the fan & lights and went back to finish my dinner. For the rest of the evening the fan & lights would turn on and off at random intervals and random levels, so I gave up on the idea that it would self-correct. So with a heavy heart I went over to the breaker panel, flipped the hood fan breaker to & fro, and the hood fan was once again well-behaved."
"For the next few days, my wife said that I was moping around as if someone had died. I would tell everyone I met, even complete strangers, about what happened: 'Hey, know what? I had to reboot my hood fan the other night!'. The responses were varied, ranging from 'Freak!' to 'Sounds like what happened to my toaster...'. Fellow programmers would either chuckle or stare in common disbelief."
"What's the embedded world coming to? Will programmers and companies everywhere realize the cost of their mistakes and clean up their act? Or will the entire world become accustomed to occasionally rebooting everything they own? Would the expensive embedded devices then come with a 'reset' button, advertised as a feature? Or will programmer jokes become as common and ruthless as lawyer jokes? I wish I knew the answer. I can only hope for the best, but I fear the worst."
One developer admitted to me that his consumer products company couldn't care less about the correctness of firmware. Reboot; who cares? Customers are used to this, trained by decades of desktop computer disappointments. Hit the reset switch, cycle power, remove the batteries for 15 minutes; even pre-teens know the tricks of coping with legions of embedded devices.
Crummy firmware is the norm, but in my opinion is totally unacceptable. Shipping a defective product in any other field is like opening the door to torts. So far the embedded world has been mostly immune from predatory lawyers, but that Brigadoon-like isolation is unlikely to continue. Besides, it's simply unethical to produce junk.
But it's hard, even impossible, to produce perfect firmware. We must strive to make the code correct, but also design our systems to cleanly handle failures. In other words, a healthy dose of paranoia leads to better systems.
A Watchdog Timer is an important line of defense in making reliable products. Decent embedded systems design means that, if your system needs a WDT, it better be of exceptionally high quality.
Well-designed watchdog timers fire off a lot, daily and quietly saving systems and lives without the esteem offered to other, human, heroes. Perhaps the developers producing such reliable WDTs deserve a parade. Poorly-designed WDTs fire off a lot, too, sometimes saving things, sometimes making them worse. A simple-minded watchdog implemented in a non-safety-critical system won't threaten health or lives, but can result in systems that hang and do strange things that tick off our customers. No business can tolerate unhappy customers, so unless your code is perfect (whose is?) it's best in all but the most cost-sensitive apps to build a really great WDT.
An effective WDT is far more than a timer that drives reset. Such simplicity might have saved Clementine, but would it fire when the code tumbles into a really weird mode like that experienced by Ed's hood fan?

Internal WDTs
Internal watchdogs are those that are built into the processor chip. Virtually all highly integrated embedded processors include a wealth of peripherals, often with some sort of watchdog. Most are brain-dead watchdog timers suitable for only the lowest-end applications.
Let's look at a few.
Some of TI's Stellaris series of ARM controllers, like the LM4F110B2QR (how's that for a part number!) have a lockout register. Program the right code into register WDTLOCK and the code cannot change any WDT control register. That's a great feature! Alas, it's easy enough to unlock the WDT registers. A pessimist will note that a program running insanely could simply disable the watchdog.
Toshiba's TMP96141AF is part of their TLCS-900 family of quite nice microprocessors, which offers a wide range of extremely versatile onboard peripherals. All have pretty much the same watchdog circuit. As the data sheet says, "The TMP96141AF is containing watchdog timer of Runaway detecting."
Ahem. And I thought the days of Jinglish were over. Anyway, the part generates a nonmaskable interrupt when the watchdog times out, which is either a very, very bad idea or a wonderfully clever one. It's clever only if the system produces an NMI, waits a while, and only then asserts reset, which the Toshiba part unhappily cannot do. Reset and NMI are synchronous.
A nice feature is that it takes two different I/O operations to disable the watchdog timer, so there are slim chances of a runaway program turning off this protective feature.
Motorola's widely-used 68332 variant of their CPU32 family (like most of these 68k embedded parts) also includes a watchdog. It's a simple-minded thing meant for low-reliability applications only. Unlike a lot of WDTs, user code must write two different values (0x55 and 0xaa) to the WDT control register to ensure the device does not time out. This is a very good thing; it limits the chances of rogue software accidentally issuing the command needed to appease the watchdog. I'm not thrilled with the fact that any amount of time may elapse between the two writes (up to the timeout period). Two back-to-back writes would further reduce the chances of random watchdog tickles, though one would have to ensure no interrupt could preempt the paired writes. And the 0x55/0xaa twosome is often used in RAM tests; since the 68k I/O registers are memory mapped, a runaway RAM test could keep the device from resetting.
The 68332's WDT drives reset, not some exception-handling interrupt or NMI. This makes a lot of sense, since any software failure that causes the stack pointer to go odd will crash the code, and a further exception-handling interrupt of any sort would drive the part into a "double bus fault". The hardware is such that it takes a reset to exit this condition.
Motorola's popular ColdFire parts are similar. The MCF5204, for instance, will let the code write to the WDT control registers only once. Cool! Crashing code, which might do all sorts of silly things, cannot reprogram the protective mechanism. However, it's possible to change the reset interrupt vector at any time, pretty much invalidating the clever write-once design.
Like the CPU32 parts, a 0x55/0xaa sequence keeps the WDT from timing out, and back-to-back writes aren't required. The ColdFire datasheet touts this as an advantage since it can handle interrupts between the two tickle instructions, but I'd prefer less of a window. The ColdFire has a fault-on-fault condition much like the CPU32's double bus fault, so reset is also the only option when the watchdog timer fires, which is a good thing.
There's no external indication that the WDT timed out, perhaps to save pins. That means your hardware/software must be designed so at a warm boot the code can issue a from-the-ground-up reset to every peripheral to clear weird modes that may accompany a WDT timeout.
Philips' XA processors require two sequential writes of 0xa5 and 0x5a to the WDT. But like the ColdFire there's no external indication of a timeout, and it appears the watchdog reset isn't even a complete CPU restart; the docs suggest it's just a reload of the program counter. Yikes! What if the processor's internal states were in disarray from code running amok or a hardware glitch?

External Watchdog Timers
Many of the supervisory chips we buy to manage a processor's reset line include built-in WDTs.
TI's UCC3946 is one of many nice power supervisor parts that does an excellent job of driving reset only when Vcc is legal. In a nice small 8 pin SMT package it eats practically no PCB real estate. It's not connected to the CPU's clock, so the WDT will output a reset to the hardware safing mechanisms even if there's a crystal failure. But it's too darn simple: to avoid a timeout just wiggle the input bit once in a while. Crashed code could do this in any of a million ways.
TI isn't the only purveyor of simplistic WDTs. Maxim's MAX823 and many other versions are similar. The catalogs of a dozen other vendors list equally dull and ineffective watchdogs.
But both TI and Maxim do offer more sophisticated devices. Consider TI's TPS3813 and Maxim's MAX6323. Both are "Window Watchdogs". Unlike the internal versions described above that avoid timeouts using two different data writes (like a 0x55 and then 0xaa), these require tickling within certain time bands. Toggle the WDT input too slowly, too fast, or not at all, and a timeout will occur. That greatly reduces the chances that a program run amok will create the precise timing needed to satisfy the watchdog. Since a crashed program will likely speed up or bog down if it does anything at all, errant strobing of the tickle bit will almost certainly be outside the time band required.
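
To make the time-band idea concrete, here is a minimal C sketch of the check a windowed watchdog applies to its tickle input. It is a software model for illustration only, not the actual behavior of the TPS3813 or MAX6323; the names (window_wdt_t, wwdt_init, wwdt_tickle, wwdt_poll) and the millisecond timebase are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a window watchdog; all names are illustrative. */
typedef struct {
    uint32_t min_ms;   /* a tickle sooner than this after the last one is "too fast" */
    uint32_t max_ms;   /* a gap longer than this is "too slow" (or no tickle at all) */
    uint32_t last_ms;  /* time of the previous tickle */
    bool     reset;    /* latched once the window has been violated */
} window_wdt_t;

void wwdt_init(window_wdt_t *w, uint32_t min_ms, uint32_t max_ms, uint32_t now_ms)
{
    w->min_ms  = min_ms;
    w->max_ms  = max_ms;
    w->last_ms = now_ms;
    w->reset   = false;
}

/* Call whenever the tickle input toggles. */
void wwdt_tickle(window_wdt_t *w, uint32_t now_ms)
{
    uint32_t elapsed = now_ms - w->last_ms;  /* unsigned math tolerates counter wrap */

    if (elapsed < w->min_ms || elapsed > w->max_ms)
        w->reset = true;                     /* outside the band: demand a reset */
    w->last_ms = now_ms;
}

/* Call periodically so that total silence is also caught. */
void wwdt_poll(window_wdt_t *w, uint32_t now_ms)
{
    if (now_ms - w->last_ms > w->max_ms)
        w->reset = true;
}
```

With a hypothetical 10 to 100 ms window, a tickle 50 ms after the last one is accepted, a second tickle 5 ms later latches the reset, and so does silence lasting longer than 100 ms.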

Characteristics of Great Watchdog Timers
What's the rationale behind an awesome watchdog timer? The perfect WDT should detect all erratic and insane software modes. It must not make any assumptions about the condition of the software or the hardware; in the real world anything that can go wrong will. It must bring the system back to normal operation no matter what went wrong, whether from a software defect, RAM glitch, or bit flip from cosmic rays.
It's impossible to recover from a hardware failure that keeps the computer from running properly, but at the least the WDT must put the system into a safe state. Finally, it should leave breadcrumbs behind, generating debug information for the developers. After all, a watchdog timeout is the yin and yang of an embedded system. It saves the system, keeping customers happy, yet demonstrates an inherent design flaw that should be addressed. Without debug info, troubleshooting these infrequent and erratic events is close to impossible.
What does this mean in practice?
An effective watchdog is independent from the main system. Though all WDTs are a blend of interacting hardware and software, something external to the processor must always be poised, like the sword of Damocles, ready to intervene as soon as a crash occurs. Pure software implementations are simply not reliable.
There's only one kind of intervention that's effective: an immediate reset to the processor and all connected peripherals. Many embedded systems have a watchdog that initiates a nonmaskable interrupt. Designers figure that firing off NMI rather than reset preserves some of the system's context. It's easy to seed debugging assets in the NMI handler (like a stack capture) to aid in resolving the crash's root cause. That's a great idea, except that it does not work.
All we really know when the WDT fires is that something truly awful happened. Software bug? Perhaps. Hardware glitch? Also possible. Can you ensure that the error wasn't something that totally scrambled the processor's internal logic states? I worked with one system where a motor in another room induced so much EMF that our instrument sometimes went bonkers. We tracked this down to a sub-nanosecond glitch on one CPU input, a glitch so short that the processor went into an undocumented weird mode. Only a reset brought it back to life.
Some CPUs, notably the 68k and ColdFire, will throw an exception if a software crash causes the stack pointer to go odd. That's not bad... except that any watchdog circuit that then drives the CPU's nonmaskable interrupt will unavoidably invoke code that pushes the system's context, creating a second stack fault. The CPU halts, staying halted till a reset, and only a reset, comes along.
Drive reset; it's the only reliable way to bring a confused microprocessor back to lucidity. Some clever designers, though, build circuits that drive NMI first, and then after a short delay pound on reset. If the NMI works then its exception handler can log debug information and then halt. It may also signal other connected devices that this unit is going offline for a while. The pending reset guarantees an utterly clean restart of the code. Don't be tempted to use the NMI handler to safe dangerous hardware; that task always, in every system, belongs to a circuit external to the possibly confused CPU.
Don't forget to reset the whole computer system; a simple CPU restart may not be enough. Are the peripherals absolutely, positively, in a sane mode? Maybe not. Runaway code may have issued all sorts of I/O instructions that placed complex devices in insane modes. Give every peripheral a hardware reset; software resets may get lost in all of the I/O chatter.
Consider what the system must do to be totally safe after a failure. Maybe a pacemaker needs to reboot in a heartbeat (so to speak)... or maybe backup hardware should issue a few ticks if reboots are slow.
One thickness gauge that beams high energy gamma rays through 4 inches of hot steel failed in a spectacular way. Defective hardware crashed the code. The watchdog properly closed the protective lead shutter, blocking off the 5 curie cesium source. I was present, and watched incredulously as the engineering VP put his head in the path of the beam; the crashed code, still executing something, tricked the watchdog into opening the shutter, beaming high intensity radiation through the veep's forehead. I wonder to this day what eventually became of the man.
A really effective watchdog cannot use the CPU's clock, which may fail. A bad solder joint on the crystal, poor design that doesn't work well over temperature extremes, or numerous other problems can shut down the oscillator. This suggests that no WDT internal to the CPU is really safe. Most share the processor's clock.
Under no circumstances should the software be able to reprogram the WDT or any of its necessary components (like reset vectors, I/O pins used by the watchdog, etc). Assume runaway code runs under the guidance of a malevolent deity.
Build a watchdog that monitors the entire system's operation. Don't assume that things are fine just because some loop or ISR runs often enough to tickle the watchdog timer. A software-only watchdog should look at a variety of parameters to ensure the product is healthy, kicking the dog only if everything is OK. What is a software crash, after all? Occasionally the system executes a HALT and stops, but more often the code vectors off to a random location, continuing to run instructions. Maybe only one task crashed. Perhaps only one is still alive; no doubt that which kicks the dog.
Think about what can go wrong in your system. Take corrective action when that's possible, but initiate a reset when it's not. For instance, can your system recover from exceptions like floating point overflows or divides by zero? If not, these conditions may well signal the early stages of a crash. Either handle these competently or initiate a WDT timeout. For the cost of a handful of lines of code you may keep a 60 Minutes camera crew from appearing at your door.
It's a good idea to flash an LED or otherwise indicate that the WDT kicked. A lot of devices automatically recover from timeouts; they quickly come back to life with the customer totally unaware a crash occurred. Unless you have a debug LED how do you know if your precious creation is working properly, or occasionally invisibly resetting? One outfit complained that over time, and with several thousand units in the field, their product's response time to user inputs degraded noticeably. A bit of research showed that their system's watchdog properly drove the CPU's reset signal, and the code then recognized a warm boot, going directly to the application with no indication to the users that the timeout had occurred. We tracked the problem down to a floating input on the CPU that caused the software to crash up to several thousand times per second. The processor was spending most of its time resetting, leading to apparently slow user response. An LED would have shown the problem during debug, long before customers started yelling.
Everyone knows we should include a jumper to disable the WDT during debugging. But few folks think this through. The jumper should be inserted to enable debugging, and removed for normal operation. Otherwise if manufacturing forgets to install the jumper, or if it falls out during shipment, the WDT won't function. And there's no production test to check the watchdog's operation.
Design the logic so the jumper disconnects the watchdog timer from the reset line (possibly through an inverter so an inserted jumper sets debug mode). Then the watchdog continues to function even while debugging the system. It won't reset the processor but will flash the LED. The light will blink a lot when breakpointing and single stepping, but should never come on during full-speed testing.

Using an Internal Watchdog Timer
Most embedded processors that include high integration peripherals have some sort of built-in WDT. Avoid these except in the most cost-sensitive or benign systems. Internal units offer minimal protection from rogue code. Runaway software may reprogram the WDT controller, many internal watchdogs will not generate a proper reset, and any failure of the processor will make it impossible to put the hardware into a safe state. A great WDT must be independent of the CPU it's trying to protect.
However, in systems that really must use the internal versions, there's plenty we can do to make them more reliable. The conventional model of kicking a simple timer at erratic intervals is too easily spoofed by runaway code.
A pair of design rules leads to decent WDTs: kick the dog only after your code has done several unrelated good things, and make sure that erratic execution streams that wander into your watchdog routine won't issue incorrect tickles.
This is a great place to use a simple state machine. Suppose we define a global variable named "state". At the beginning of the main loop set state to 0x5555. Call watchdog routine A, which adds an offset, say 0x1111, to state and then ensures the variable is now 0x6666. Return if the compare matches; otherwise halt or take other action that will cause the WDT to fire.
Later, maybe at the end of the main loop, add another offset to state, say 0x2222. Call watchdog routine B, which makes sure state is now 0x8888. Set state to zero. Kick the dog if the compare worked. Return. Halt otherwise.
Suppose the code crashes and, for inscrutable reasons probably having to do with Murphy's Law and the perversity of nature, vectors into wdt_b() just before the kick_dog command. The protection mechanism of the state machine won't help.
Perhaps it's safe to assume that the code will again crash when wdt_b() returns, so the system will miss the next watchdog tickle. But... perhaps not; who knows what evil lurks in the mind of runaway software?
Is this fear paranoid? You bet. But the WDT might be the last line of defense between deflecting the Earth-bound asteroid and utter disaster, or at least in rebooting the pacemaker before grandpa collapses. Assuming that crashed code will operate in any benign mode is naive.
So wdt_b() double checks variable "state" to ensure the system halts (so the watchdog can issue a reset) even if rogue code wandered into wdt_b() just before issuing the kick_dog.
This is a trivial bit of code, but now runaway code that stumbles into any of the tickling routines cannot errantly kick the dog. Further, no tickles will occur unless the entire main loop executes in the proper sequence. If the code just calls routine B repeatedly, no tickles will occur because it sets state to zero before exiting.
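
The routines described above might look something like the following sketch. It follows the description in the text rather than reproducing any original listing. Here halt() and kick_dog() are stand-ins so the logic can be exercised on a desktop: real firmware would spin forever in halt() and write to the watchdog hardware in kick_dog(). The name main_loop_once() is a hypothetical label for one pass through the main loop.

```c
#include <stdbool.h>
#include <stdint.h>

/* volatile so the compiler keeps every check, including the one after the kick */
static volatile uint16_t state;

static bool     halted;  /* stand-in: real firmware would loop forever here   */
static unsigned kicks;   /* stand-in: counts tickles sent to the hardware WDT */

static void halt(void)     { halted = true; }
static void kick_dog(void) { if (!halted) kicks++; }

/* Watchdog routine A: add the first offset and verify the result. */
void wdt_a(void)
{
    state += 0x1111;
    if (state != 0x6666)
        halt();
}

/* Watchdog routine B: verify the full sequence, clear state, kick, re-check. */
void wdt_b(void)
{
    if (state != 0x8888) {
        halt();
        return;          /* only reachable in this simulation */
    }
    state = 0;
    kick_dog();
    /* Double check: runaway code that vectored in just before kick_dog()
       skipped the assignment above and will very likely fail this test. */
    if (state != 0)
        halt();
}

/* One pass through a hypothetical main loop. */
void main_loop_once(void)
{
    state = 0x5555;
    wdt_a();
    /* ... the application's real work goes here ... */
    state += 0x2222;
    wdt_b();
}
```

One normal pass produces exactly one tickle. Calling wdt_b() again without running the rest of the loop finds state at zero, so it halts without kicking the dog.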
Add additional intermediate states as your fear of litigation dictates.
Normally I detest global variables, but this is a perfect application. Cruddy code that mucks with the variable, errant tasks doing strange things, or any error that steps on the global will make the WDT time out.
Do put these actions in the program's main loop, not inside an ISR. It's fun to watch a multitasking product crash; the entire system might be hung, but one task still responds to interrupts. If your watchdog tickler stays alive as the world collapses around the rest of the code, then the watchdog serves no useful purpose.
If the WDT doesn't generate an external reset pulse (some processors handle the restart internally) make sure the code issues a hardware reset to all peripherals immediately after start-up. That may mean working with the EEs so an output bit resets every resettable peripheral.
If you must take action to safe dangerous hardware, well, since there's no way to guarantee the code will come back to life, stay away from internal watchdogs. Broken hardware will obviously cause this... but so can lousy code. A digital camera was recalled recently when users found that turning the device off when in a certain mode meant it could never be turned on again. The code wrote faulty info to flash memory that created a permanent crash.

An External Watchdog Timer
The best watchdog is one that doesn't rely on the processor or its software. It's external to the CPU, shares no resources, and is utterly simple, thus devoid of latent defects.
Use a PIC, a Z8, or other similar dirt-cheap processor as a system health monitor. These parts have an independent clock, onboard memory, and the built-in timers we need to build a truly great WDT. Because the part is external, you can connect one of its outputs to hardware interlocks that put dangerous machinery into safe states.
But when selecting a watchdog CPU check the part's specs carefully. Tying the tickle to the watchdog CPU's interrupt input, for instance, may not work reliably. A slow part like most PICs may not respond to a tickle of short duration. Consider TI's MSP430 family of processors. They're a very inexpensive (half a buck or so) series of 16 bit processors that use virtually no power and no PCB real estate.

Tickle it using the same sort of state machine described above. Like the windowed watchdogs (TI's TPS3813 and Maxim's MAX6323), define min and max tickle intervals, to further limit the chances that a runaway program deludes the WDT into avoiding a reset.
Perhaps it seems extreme to add an entire computer just for the sake of a decent watchdog. We'd be fools to add extra hardware to a highly cost-constrained product. Most of us, though, build lower volume, higher margin systems. A fifty cent part that prevents the loss of an expensive mission, or that even saves the cost of one customer support call, might make a lot of sense.
In a multiprocessor system it's easy to turn all of the processors into watchdogs. Have them exchange "I'm OK" messages periodically. The receiver resets the transmitter if it stops speaking. This approach checks a lot of hardware and software, and requires little circuitry.
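
The receiving side of that "I'm OK" exchange can be very small. The sketch below is one possible shape for it, assuming a free-running millisecond counter and a message handler that calls peer_heard(); the names and the timeout value are illustrative, and how the reset line is actually driven is left to the hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping one processor keeps about a neighbor. */
typedef struct {
    uint32_t last_heard_ms;  /* when the last "I'm OK" message arrived    */
    uint32_t timeout_ms;     /* how long a silence is tolerated           */
} peer_monitor_t;

void peer_init(peer_monitor_t *p, uint32_t timeout_ms, uint32_t now_ms)
{
    p->timeout_ms    = timeout_ms;
    p->last_heard_ms = now_ms;
}

/* Call from the message handler each time the peer says "I'm OK". */
void peer_heard(peer_monitor_t *p, uint32_t now_ms)
{
    p->last_heard_ms = now_ms;
}

/* Poll periodically; true means the peer went quiet and should be reset. */
bool peer_needs_reset(const peer_monitor_t *p, uint32_t now_ms)
{
    return (now_ms - p->last_heard_ms) > p->timeout_ms;  /* wrap-safe */
}
```

Each processor would run the same check on its neighbor, so a hang on either side is noticed by the other.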

Watchdog Timers for Multitasking
Tasking turns a linear bit of software into a multidimensional mix of tasks competing for processor time. Each runs more or less independently of the others... which means each can crash on its own, without bringing the entire system to its knees.
You can learn a lot about a system's design just by observing its operation. Consider a simple instrument with a display and various buttons. Press a button and hold it down; if the display continues to update, odds are the system multitasks.
Yet in the same system a software crash might go undetected by conventional watchdog strategies. If the display or keyboard tasks die, the mainline code or a WDT task may continue to run.
Any system that uses an ISR or a special task to tickle the watchdog but that does not examine the health of all other tasks is not robust. Success lies in weaving the watchdog into the fabric of all of the system's tasks, which is happily much easier than it sounds.
First, build a watchdog task. It's the only part of the software allowed to tickle the WDT. If your system has an MMU, mask off all I/O accesses to the WDT except those from this task, so rogue code traps on an errant attempt to output to the watchdog.
Next, create a data structure that has one entry per task, with each entry being just an integer.
When a task starts it increments its entry in the structure. Tasks that only start once and stay active forever can increment the appropriate value each time through their main loops.
Increment the data atomically, in a way that cannot be interrupted with the data half-changed. ++TASK[i] (if TASK is an integer array) on an 8 bit CPU might not be atomic, though it's almost certainly OK on a 16 or 32 bitter. The safest way to both encapsulate and ensure atomic access to the data structure is to hide it behind another task. Use a semaphore to eliminate concurrent shared accesses. Send increment messages to the task, using the RTOS's messaging resources.
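
Where the toolchain supports C11's <stdatomic.h>, another option is to make the increment explicitly atomic instead of relying on the CPU's word size. The sketch below is an alternative to the message-passing approach described above, not a replacement for it; NUM_TASKS, task_checkin() and read_and_clear() are illustrative names. How the compiler implements these operations varies by target, so it is worth checking the generated code on a small part.

```c
#include <stdatomic.h>

#define NUM_TASKS 4u  /* hypothetical number of monitored tasks */

/* One counter per task; static atomics start out zeroed. */
static atomic_uint task_count[NUM_TASKS];

/* Called by a task each time it starts, or once per pass of its main loop. */
void task_checkin(unsigned task_id)
{
    if (task_id < NUM_TASKS)
        atomic_fetch_add(&task_count[task_id], 1u);  /* indivisible increment */
}

/* Called by the watchdog task: fetch the count and zero it in one step,
   so no check-in can be lost between the read and the clear. */
unsigned read_and_clear(unsigned task_id)
{
    return atomic_exchange(&task_count[task_id], 0u);
}
```

The exchange in read_and_clear() matters as much as the increment: reading and then zeroing in two separate steps would open a window in which a check-in could vanish.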
As the program runs the number of counts for each task advances. Infrequently but at regular intervals the watchdog task runs. Perhaps once a second, or maybe once a msec; it's all a function of your paranoia and the implications of a failure.
The watchdog task scans the structure, checking that the count stored for each task is reasonable. One that runs often should have a high count; another which executes infrequently will produce a smaller value. Part of the trick is determining what's reasonable for each task; stick with me, we'll look at that shortly.
If the counts are unreasonable, halt and let the watchdog time out. If everything is OK, set all of the counts to zero and exit.
Why is this robust? Obviously, the watchdog monitors every task in the system. But it's also impossible for code that's running amok to stumble into the watchdog timer task and errantly tickle the dog; by zeroing the array we guarantee it's in a "bad" state.
I skipped over a critical step: how do we decide what's a reasonable count for each task? It might be possible to determine this analytically. If the WDT task runs once a second, and one of the monitored tasks starts every 50 msec, then surely a count of around 20 is reasonable.
Other activities are much harder to ascertain. What about a task that responds to asynchronous inputs from other computers, say data packets that come at irregular intervals? Even in cases of periodic events, if these drive a low-priority task they may be suspended for rather long intervals by higher-priority problems.
The solution is to broaden the data structure that maintains count information. Add min and max fields to each entry. Each task must run at least min, but no more than max times.
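
A minimal sketch of that broadened structure and the scan the watchdog task performs is shown below. The type and function names are assumptions, and the sketch ignores the atomic-access concerns discussed earlier; the caller is expected to tickle the hardware watchdog only when wdt_task_scan() returns true, and to halt otherwise.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per monitored task (illustrative layout). */
typedef struct {
    uint32_t count;  /* incremented by the task, zeroed by the watchdog task  */
    uint32_t min;    /* fewest runs allowed per watchdog interval             */
    uint32_t max;    /* most runs allowed per watchdog interval               */
} task_entry_t;

/* Returns true if every task ran a reasonable number of times.
   On success the counts are zeroed, which leaves the table in a "bad"
   state until the tasks run again. On failure nothing is changed and
   the caller should halt so the hardware watchdog times out. */
bool wdt_task_scan(task_entry_t *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].count < tasks[i].min || tasks[i].count > tasks[i].max)
            return false;
    }
    for (size_t i = 0; i < n; i++)
        tasks[i].count = 0;
    return true;
}
```

Note that any entry whose min is at least one makes a second scan fail if it runs before the tasks have checked in again, which is the property that keeps runaway code from tickling the dog by wandering into the watchdog task.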
Now redesign the watchdog task to run in one of two modes. The first is the one already described, and is used during normal system operation.
The second mode is a debug environment, enabled by a compile-time switch, that collects min and max data. Each time the WDT task runs it looks at the incremented counts and sets new min and max values as needed. It tickles the watchdog each time it executes.
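
The profiling mode could be sketched along these lines. The names are hypothetical, and the compile-time switch itself (for example, wrapping the call in an #ifdef) is omitted so the logic stays easy to read; in real firmware profile_scan() would also tickle the watchdog unconditionally before returning.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-task record used only while profiling (illustrative layout). */
typedef struct {
    uint32_t count;  /* incremented by the task during the interval */
    uint32_t min;    /* smallest count seen in any interval so far  */
    uint32_t max;    /* largest count seen in any interval so far   */
} task_profile_t;

/* Start with min as high and max as low as possible so the first
   interval always sets both. */
void profile_init(task_profile_t *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        tasks[i].count = 0;
        tasks[i].min   = UINT32_MAX;
        tasks[i].max   = 0;
    }
}

/* Run once per watchdog interval in profiling mode. */
void profile_scan(task_profile_t *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].count < tasks[i].min)
            tasks[i].min = tasks[i].count;
        if (tasks[i].count > tasks[i].max)
            tasks[i].max = tasks[i].count;
        tasks[i].count = 0;  /* begin the next interval from zero */
    }
}
```

The collected values are raw observations; the limits that go into the production build should still be widened by hand to leave some margin.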
Run the product's full test suite with this mode enabled. Maybe the system needs to operate for a day or a week to get a decent profile of the min/max values. When you're satisfied that the tests are representative of the system's real operation, manually examine the collected data and adjust the parameters as seems necessary to give adequate margins to the data.
What a pain! But by taking this step you'll get a great watchdog and a deep look into your system's timing. I've observed that few developers have much sense of how their creations perform in the time domain. "It seems to work" tells us little. Looking at the data acquired by this profiling, though, might tell a lot. Is it a surprise that task A runs 400 times a second? That might explain a previously-unknown performance bottleneck.
In a real time system we must manage and measure time; it's every bit as important as procedural issues, yet is oft ignored till a nagging problem turns into an unacceptable symptom. This watchdog scheme forces you to think in the time domain, and by its nature profiles, admittedly with coarse granularity, the time operation of your system.
There's yet one more kink, though. Some tasks run so infrequently or erratically that any sort of automated profiling will fail. A watchdog that runs once a second will miss tasks that start only hourly. It's not unreasonable to exclude these from watchdog monitoring. Or, we can add a bit of complexity to the code to initiate a watchdog timeout if, say, the slow tasks don't start even after a number of hours elapse.
Some Decent WDTs on MCUs
MCU watchdogs are getting better. Take, for instance, Freescale's 32 bit ColdFire+ line, like the MCF51Qx. Instead of "watchdog" Freescale prefers the awkward phrase "Computer Operating Properly" (COP). But it does offer a very intriguing feature. In general, one pets the watchdog, uh, COP, by writing 0x55 and then 0xaa to the control register. But in one mode that sequence must be sent in the last 25% of the COP timeout period. A premature write results in a reset. Odds of an errant program getting the timing Goldilocks-correct (not too often, nor too infrequently) are tiny.
The part also generates a reset if any attempt is made to execute an illegal instruction. That's somewhat different from most CPUs, which issue an illegal opcode interrupt. I rather like Freescale's approach, since interrupt handlers are not guaranteed to work if the code crashes. A blown stack, a corrupt PC (on some CPUs a fault is taken if the PC is odd), or a changed vector base register can each defeat them. This also suggests that it's a good idea to fill unused flash at link time with an illegal opcode, and on power-up fill all of RAM with a similar instruction, so that errant code waltzing through memory is likely to generate a reset.
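
The RAM half of that suggestion is only a few lines. The sketch below assumes 0x4AFC, the ILLEGAL opcode on 68k and ColdFire parts; check your own CPU's manual for the right value. The function name is illustrative. In a real system this would run in the startup code before the stack and variables are live, and the fill of unused flash would be handled by the linker or programming tool, which is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 0x4AFC is the ILLEGAL instruction on 68k/ColdFire CPUs.
   Substitute the opcode your processor treats as illegal. */
#define ILLEGAL_OPCODE 0x4AFCu

/* Fill a region with the illegal opcode, one 16-bit word at a time. */
void fill_with_illegal(uint16_t *start, size_t words)
{
    volatile uint16_t *p = start;  /* volatile keeps the stores from being optimized away */

    while (words--)
        *p++ = (uint16_t)ILLEGAL_OPCODE;
}
```

Code that later wanders into untouched memory then executes an illegal instruction almost immediately instead of running through whatever bytes happened to be there.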
Anothernicetouchisthattheresetpinisopendrainandisassertedwhenanyoftheseerrorsoccur.Tieit
totheperipheralresetinputs.Evenifwanderingcodeissuesoutputinstructionstheirpotentially
scrambledlittlebrainswillbestraightenedout.
STMicroelectronics has a line of Cortex-M3 devices. The M3 has become extremely popular for lower-end
embedded devices, and ST's STM32F is representative of these parts (though the WDT is an ST add-on,
and does not necessarily mirror other vendors' implementations). The STM32F has two different
protection mechanisms. An "Independent Watchdog" is a pretty vanilla design that has little going for it
other than ease of use. But their Window Watchdog offers more robust protection. When a countdown
timer expires, a reset is generated, which can be prevented by reloading the timer. Nothing special there.
But if the reload happens too quickly, the system will also reset. In this case "too quickly" is determined
by a value one programs into a control register.
Another cool feature: it can generate an interrupt just before resetting. Write a bit of code to snag the
interrupt and you can take some action to, for instance, put the system in a safe state or to snapshot data
for debugging purposes. ST suggests using the ISR to reload the watchdog (that is, kick the dog) so a
reset does not occur. Don't take their advice. If the program crashes the interrupt handlers may very well
continue to function normally. And using an ISR to reload the WDT invalidates the entire reason for a
window watchdog.
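A handler along these lines does the useful part and nothing more. This is a sketch, not ST's API: the handler name, the snapshot layout, the magic number, and the variables standing in for the scheduler state and the output port are all hypothetical. On real hardware the snapshot would need to live in RAM that survives the reset. Note what's missing: the handler never reloads the watchdog.

```c
#include <stdint.h>

#define SNAPSHOT_MAGIC 0xDEADC0DEu  /* marks a valid record after reboot */

typedef struct {
    uint32_t magic;
    uint32_t last_task_id;
    uint32_t uptime_ticks;
} wdt_snapshot_t;

/* Assumption: placed in memory the startup code does not clear. */
static wdt_snapshot_t wdt_snapshot;

/* Hypothetical state maintained elsewhere in the application. */
static volatile uint32_t current_task_id;
static volatile uint32_t uptime_ticks;
static volatile uint8_t  outputs = 0xFFu;  /* stand-in for an output port */

/* Hypothetical pre-reset interrupt handler. It makes the outputs safe and
   records what was running, then lets the reset happen. */
void wdt_early_warning_isr(void)
{
    outputs = 0x00u;                       /* assumed safe state: all off */
    wdt_snapshot.last_task_id = current_task_id;
    wdt_snapshot.uptime_ticks = uptime_ticks;
    wdt_snapshot.magic = SNAPSHOT_MAGIC;   /* written last: record complete */
}
```

After the reboot, startup code can look for the magic number and log the record before clearing it.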
The WDT cannot be disabled once enabled (good thinking, folks!). But oddly, the other configuration
registers can be changed at will, which can make the watchdog behave incorrectly.
Summary and Other Thoughts
For some thoughts on the possible future of watchdogs, see this.
I remain troubled by the fan failure described earlier. It's easy to dismiss this as a glitch, an unexplained
failure caused by a hardware or software bug, cosmic rays, or meddling by aliens. But others have written
about identical situations with their vent fans, all apparently made by the same vendor.
When we blow off a failure, calling it a "glitch" as if that name explains something, we're basically
professing our ignorance. There are no glitches in our macroscopically deterministic world. Things
happen for a reason.
The fan failures didn't make the evening news and hurt no one. So why worry?
Surely the customers were irritated, and the possible future sales of that company at least somewhat
diminished. The company escalated the general rudeness level of the world, and thus the sum total
incipient anger level, by treating their customers with contempt. Maybe a couple more Valiums were
popped, a few spouses yelled at, some kids cowered till dad calmed down. In the grand scheme of things
perhaps these are insignificant blips. Yet we must remember the purpose of embedded control is to help
people, to improve lives, not to help therapists garner new patients.
What concerns me is that if we cannot even build reliable fan controllers, what hope is there for more
mission-critical applications?
I don't know what went wrong with those fan controllers, and I have no idea if a watchdog timer (well
designed or not) is part of the system. I do know, though, that the failures are unacceptable and
avoidable. But maybe not avoidable by the use of a conventional watchdog. A WDT tells us the code is
running. A windowing WDT tells us it's running with pretty much the right timing. No watchdog, though,
flags software executing with corrupt data structures, unless the data is so bad it grossly affects the
execution stream.
Why would a data structure become corrupt? Bugs, surely. Strange conditions the designers never
anticipated will also create problems, like the never-ending flood of buffer overflow conditions that
plague the net, or unexpected user inputs ("we never thought the user would press all 4 buttons at the
same time!").
Is another layer of self-defense, beyond watchdogs, wise? Safety-critical apps, where the cost of a failure
is frighteningly high, should definitely include integrity checks on the data. Low-threat equipment like
this oven fan can and should have at least a minimal amount of code for trapping possible failure
conditions.
Some might argue it makes no sense to "waste" time writing defensive code for a dumb fan application.
Yet the simpler the system, the easier and quicker it is to plug in a bit of code to look for program and
data errors.
Very simple systems tend to translate inputs to outputs. Their primary data structures are the I/O ports.
Often several unrelated output bits get multiplexed to a single port. To change one bit means either
reading the port's current status, or maintaining a copy of the port in RAM. Both approaches are
problematic.
Computers are deterministic, so it's reasonable to expect that, in the absence of bugs, they'll produce
correct results all the time. So it's apparently safe to read a port's current status, AND off the unwanted
bits, OR in new ones, and output the result. This is a state machine: the outputs evolve over time to deal
with changing inputs. But the process works only if the state machine never incorrectly flips a bit.
Unfortunately, output ports are connected to the hostile environment of the real world. It's entirely
possible that a bit of energy from starting the fan's highly inductive motor will alter the port's setting. I've
seen this happen many times.
So maybe it's more reliable to maintain a memory image of the port. The downside is that a program bug
might corrupt the image. Most of the time these are stored as global variables, so any bit of sloppy code
can accidentally trash the location. Encapsulation solves that problem, but not the one of a wandering
pointer walking over the data, or of a latent reentrancy issue corrupting things. You might argue that
writing correct code means we shouldn't worry about a location changing, but we added a WDT to, in
part, deal with bugs. Similar concerns about our data are warranted.
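One cheap way to act on that concern, offered as a sketch rather than a prescription: keep the image together with its bitwise complement, check the pair before every use, and periodically rewrite the port from the image so a noise-flipped output bit doesn't stick. The complement trick is my addition for illustration. The names are hypothetical, and fake_port stands in for a real memory-mapped register.

```c
#include <stdbool.h>
#include <stdint.h>

static volatile uint8_t fake_port;  /* stand-in for the hardware output port */

static uint8_t port_image;          /* RAM copy of the port */
static uint8_t port_image_inv;      /* bitwise complement of the copy */

/* Store a new value in the image, its complement, and the port. */
static void port_commit(uint8_t value)
{
    port_image = value;
    port_image_inv = (uint8_t)~value;
    fake_port = value;
}

void port_init(void)
{
    port_commit(0x00u);
}

/* True while the image and its complement still agree. */
bool port_image_ok(void)
{
    return port_image == (uint8_t)~port_image_inv;
}

/* Clear the bits in clear_mask, then set the bits in set_mask. Returns
   false, leaving the port untouched, if the image has been corrupted. */
bool port_update(uint8_t clear_mask, uint8_t set_mask)
{
    if (!port_image_ok()) {
        return false;
    }
    port_commit((uint8_t)((port_image & (uint8_t)~clear_mask) | set_mask));
    return true;
}

/* Rewrite the port from the image, undoing any bit that noise flipped. */
bool port_refresh(void)
{
    if (!port_image_ok()) {
        return false;
    }
    fake_port = port_image;
    return true;
}
```

What to do when a check fails is a design decision: go to a safe state, or simply stop kicking the watchdog and let it clean up.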
In a simple system look for a design that resets data structures from time to time. In the case of the oven
fan, whenever the user selects a fan speed reset all I/O ports and data structures. It's that simple.
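For the oven fan that could be as little as the following. The bit assignments, the two port variables, and the state structure are invented for the sketch. The point is only that a speed selection rebuilds everything from scratch instead of modifying whatever happens to be there.

```c
#include <stdint.h>

typedef enum { FAN_OFF, FAN_LOW, FAN_HIGH } fan_speed_t;

/* Hypothetical bit assignments on a single output port. */
#define FAN_LOW_BIT      0x01u
#define FAN_HIGH_BIT     0x02u
#define FAN_DIR_OUTPUTS  0x03u  /* both fan pins configured as outputs */

/* Stand-ins for the memory-mapped direction and data registers. */
static volatile uint8_t fan_port_dir;
static volatile uint8_t fan_port_out;

static struct {
    fan_speed_t speed;
    uint8_t     port_image;
} fan_state;

/* Every user selection rebuilds the state and reprograms the port. Nothing
   is read back, so earlier corruption cannot survive a button press. */
void fan_select_speed(fan_speed_t speed)
{
    uint8_t image;

    switch (speed) {
    case FAN_LOW:  image = FAN_LOW_BIT;  break;
    case FAN_HIGH: image = FAN_HIGH_BIT; break;
    default:       image = 0x00u; speed = FAN_OFF; break; /* unknown: off */
    }

    fan_state.speed = speed;
    fan_state.port_image = image;
    fan_port_dir = FAN_DIR_OUTPUTS;  /* reinitialize the direction bits too */
    fan_port_out = image;
}
```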
In a more complicated system the best approach is the oldest trick in software engineering: check the
parameters passed to functions for reasonableness. In the embedded world we chose not to do this for
three reasons: speed, memory costs, and laziness. Of these, the third reason is the real culprit most of the
time.
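The cost is usually a compare and a branch. A sketch, with a hypothetical duty-cycle setter and limit:

```c
#include <stdbool.h>
#include <stdint.h>

#define FAN_DUTY_MAX_PERCENT 100u

static uint8_t fan_duty_percent;

/* Reject an unreasonable argument instead of acting on it. Returns false
   and leaves the current setting alone when the value is out of range. */
bool fan_set_duty(uint32_t percent)
{
    if (percent > FAN_DUTY_MAX_PERCENT) {
        return false;
    }
    fan_duty_percent = (uint8_t)percent;
    return true;
}
```

A caller that gets false back knows something upstream is broken, which is a fine reason to log the event or stop kicking the watchdog.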

The Ganssle Group, info@ganssle.com. Copyright The Ganssle Group.