CHAPTER 1

Using neural nets to recognize handwritten digits

By Michael Nielsen / Jan 2017

The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:

Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.

The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.

Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples,

and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.

In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.

We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.

Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.

Perceptrons
What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output:

In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, 0 or 1, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:

$$\mbox{output} = \begin{cases} 0 & \mbox{if } \sum_j w_j x_j \leq \mbox{threshold} \\ 1 & \mbox{if } \sum_j w_j x_j > \mbox{threshold} \end{cases} \tag{1}$$

That's all there is to how a perceptron works!

That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:

1. Is the weather good?
2. Does your boyfriend or girlfriend want to accompany you?
3. Is the festival near public transit? (You don't own a car).

We can represent these three factors by corresponding binary variables $x_1, x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2 = 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if not. And similarly again for $x_3$ and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of 5 for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting 1 whenever the weather is good, and 0 whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.
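To make the arithmetic concrete, here's a minimal sketch of this festival-decision perceptron (an illustration, not code from the book), using the weights and threshold just described:

import numpy as np

def perceptron_output(x, w, threshold):
    # Output 1 if the weighted sum of the inputs exceeds the threshold, else 0.
    return 1 if np.dot(w, x) > threshold else 0

w = np.array([6, 2, 2])   # the weather matters much more than the other two factors
threshold = 5
print(perceptron_output(np.array([1, 0, 0]), w, threshold))  # good weather alone: outputs 1
print(perceptron_output(np.array([0, 1, 1]), w, threshold))  # bad weather: outputs 0, whatever else is true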

By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of 3. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.

Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:

In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.

Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \mbox{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\mbox{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten:

$$\mbox{output} = \begin{cases} 0 & \mbox{if } w \cdot x + b \leq 0 \\ 1 & \mbox{if } w \cdot x + b > 0 \end{cases} \tag{2}$$

You can think of the bias as a measure of how easy it is to get the perceptron to output a 1. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a 1. But if the bias is very negative, then it's difficult for the perceptron to output a 1. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.

I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of 3. Here's our perceptron:

Then we see that input 00 produces output 1, since $(-2)*0 + (-2)*0 + 3 = 3$ is positive. Here, I've introduced the $*$ symbol to make the multiplications explicit. Similar calculations show that the inputs 01 and 10 produce output 1. But the input 11 produces output 0, since $(-2)*1 + (-2)*1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!
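As a quick check, here's a sketch (an illustration, not the book's code) that evaluates this NAND perceptron on all four possible inputs:

import numpy as np

def nand_perceptron(x1, x2):
    # Perceptron with weights -2, -2 and bias 3, as described above.
    w, b = np.array([-2, -2]), 3
    return 1 if np.dot(w, np.array([x1, x2])) + b > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(nand_perceptron(x1, x2))   # prints 1, 1, 1, 0 - the NAND truth table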

The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to 1 when both $x_1$ and $x_2$ are 1, i.e., the carry bit is just the bitwise product $x_1 x_2$:

To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of 3. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:

One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines, into a single connection with a weight of $-4$ instead of two connections with $-2$ weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to $-2$, all biases equal to 3, and a single weight of $-4$, as marked:

Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs:

This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output 1 if $b > 0$, and 0 if $b \leq 0$. That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \ldots$.

The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.

The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!

However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

Sigmoid neurons
Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):


If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.

We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just 0 or 1, these inputs can also take on any values between 0 and 1. So, for instance, 0.638... is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not 0 or 1. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function*, and is defined by:

$$\sigma(z) \equiv \frac{1}{1 + e^{-z}}. \tag{3}$$

*Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology.

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$, and bias $b$ is

$$\frac{1}{1 + \exp(-\sum_j w_j x_j - b)}. \tag{4}$$

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output from the sigmoid neuron is approximately 1, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x + b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x + b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x + b$ is of modest size that there's much deviation from the perceptron model.
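A quick numerical check of this limiting behaviour (a sketch using Numpy, not code from the book):

import numpy as np

def sigmoid(z):
    # The sigmoid function of Equation (3).
    return 1.0/(1.0 + np.exp(-z))

print(sigmoid(20.0))    # ~1.0: for large positive z the neuron behaves like a perceptron outputting 1
print(sigmoid(-20.0))   # ~0.0: for very negative z it behaves like a perceptron outputting 0
print(sigmoid(0.5))     # ~0.62: for modest z the output deviates from the 0/1 behaviour of a perceptron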

What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important - what really matters is the shape of the function when plotted. Here's the shape:

[Plot: the sigmoid function, rising smoothly from 0.0 to 1.0 as $z$ runs from $-4$ to $4$]

This shape is a smoothed out version of a step function:

[Plot: the step function, jumping from 0 to 1 at $z = 0$]

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be 1 or 0 depending on whether $w \cdot x + b$ was positive or negative*. By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by

$$\Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b, \tag{5}$$

where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} / \partial b$ denote partial derivatives of the output with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.

*Actually, when $w \cdot x + b = 0$ the perceptron outputs 0, while the step function outputs 1. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.
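To see Equation (5) in action, here's a small numerical sketch (not from the book). It nudges one weight of a single sigmoid neuron and compares the actual change in the output with the linear estimate, using the fact that $\partial \, \mbox{output} / \partial w_j = \sigma'(z) x_j$ with $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))

w, b = np.array([0.5, -0.3]), 0.1
x = np.array([1.0, 2.0])
z = np.dot(w, x) + b

delta_w = np.array([0.001, 0.0])                        # a tiny change to the first weight
actual = sigmoid(np.dot(w + delta_w, x) + b) - sigmoid(z)
linear = sigmoid(z)*(1 - sigmoid(z))*x[0]*delta_w[0]    # (d output / d w_1) * Delta w_1
print(actual, linear)                                   # the two agree to several decimal places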

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly used in work on neural nets, and is the activation function we'll use most often in this book.

How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output 0 or 1. They can have as output any real number between 0 and 1, so values such as 0.173... and 0.689... are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a 0 or a 1, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least 0.5 as indicating a "9", and any output less than 0.5 as indicating "not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.

Exercises

Sigmoid neurons simulating perceptrons, part I
Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$. Show that the behaviour of the network doesn't change.

Sigmoid neurons simulating perceptrons, part II
Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$. Show that in the limit as $c \rightarrow \infty$ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when $w \cdot x + b = 0$ for one of the perceptrons?

The architecture of neural networks
In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:

As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs. The term "hidden" perhaps sounds a little mysterious - the first time I heard the term I thought it must have some deep philosophical or mathematical significance - but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers:


Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.

The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a 64 by 64 greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between 0 and 1. The output layer will contain just a single neuron, with output values of less than 0.5 indicating "input image is not a 9", and values greater than 0.5 indicating "input image is a 9".

While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.

Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network - information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.

Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely used feedforward networks.

A simple network to classify handwritten digits
Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image

into six separate images,


We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above,

is a 5.

We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network:


The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the 784 input neurons in the diagram above. The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in between values representing gradually darkening shades of grey.

The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.

The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a 0. If the second neuron fires then that will indicate that the network thinks the digit is a 1. And so on. A little more precisely, we number the output neurons from 0 through 9, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number 6, then our network will guess that the input digit was a 6. And so on for the other output neurons.
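In code, reading off the network's guess is just a matter of finding the index of the largest of the ten output activations. A small sketch (the activation values here are made up for illustration; np.argmax is the Numpy routine that returns that index):

import numpy as np

# Hypothetical activations of the ten output neurons for one input image.
output_activations = np.array([0.02, 0.01, 0.05, 0.01, 0.00,
                               0.03, 0.91, 0.02, 0.04, 0.01])
print(np.argmax(output_activations))   # prints 6: the network's guess for the digit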

You might wonder why we use 10 output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just 4 output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to 0 or to 1. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use 10 neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with 10 output neurons learns to recognize digits better than the network with 4 output neurons. But that leaves us wondering why using 10 output neurons works better. Is there some heuristic that would tell us in advance that we should use the 10-output encoding instead of the 4-output encoding?

To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use 10 output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a 0. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:

It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

As you may have guessed, these four images together make up the 0 image that we saw in the line of digits shown earlier:


So if all four of these hidden neurons are firing then we can conclude that the digit is a 0. Of course, that's not the only sort of evidence we can use to conclude that the image was a 0 - we could legitimately get a 0 in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a 0.

Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have 10 outputs from the network, rather than 4. If we had 4 outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only 4 output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.

Exercise

There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01.

Learning with gradient descent
Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here's a few images from MNIST:

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.

We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a 10-dimensional vector. For example, if a particular training image, $x$, depicts a 6, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function*:

$$C(w, b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2. \tag{6}$$

*Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks.

Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w, b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w, b)$ becomes small, i.e., $C(w, b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w, b) \approx 0$. By contrast, it's not doing so well when $C(w, b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w, b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
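As a concrete illustration, here's a sketch (with made-up numbers, not real network outputs) of how the quadratic cost of Equation (6) is computed over a couple of training examples, using the vectorized desired outputs $y(x)$ described above:

import numpy as np

def quadratic_cost(outputs, desired):
    # C(w,b) = (1/2n) * sum over x of ||y(x) - a||^2, for matching lists of vectors.
    n = len(outputs)
    return sum(np.linalg.norm(y - a)**2 for a, y in zip(outputs, desired)) / (2.0*n)

# Two hypothetical training examples: network outputs and one-hot desired outputs.
a1 = np.array([0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.1, 0.0])   # an image of a 6
y1 = np.zeros(10); y1[6] = 1.0
a2 = np.array([0.7, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0])   # an image of a 0
y2 = np.zeros(10); y2[0] = 1.0
print(quadratic_cost([a1, a2], [y1, y2]))   # small when the outputs are close to y(x)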


Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.

Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.

Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:

What we'd like is to find where $C$ achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)

Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows:

$$\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}$$

We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T$. We denote the gradient vector by $\nabla C$, i.e.:

$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}$$

In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.

With these definitions, the expression (7) for $\Delta C$ can be rewritten as

$$\Delta C \approx \nabla C \cdot \Delta v. \tag{9}$$

This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose

$$\Delta v = -\eta \nabla C, \tag{10}$$

where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx \nabla C \cdot \Delta v = -\eta \nabla C \cdot \nabla C = -\eta \| \nabla C \|^2$. Because $\| \nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9)). This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:

$$v \rightarrow v' = v - \eta \nabla C. \tag{11}$$

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:

Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.

I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1, \ldots, v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is

$$\Delta C \approx \nabla C \cdot \Delta v, \tag{12}$$

where the gradient $\nabla C$ is the vector

$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m} \right)^T. \tag{13}$$

Just as for the two variable case, we can choose

$$\Delta v = -\eta \nabla C, \tag{14}$$

and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule

$$v \rightarrow v' = v - \eta \nabla C. \tag{15}$$

You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.
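Here's a minimal sketch (not from the book's code) of the update rule (15) applied to a simple two-variable function, $C(v) = v_1^2 + v_2^2$, whose gradient $\nabla C = (2v_1, 2v_2)$ we can write down by hand:

import numpy as np

def C(v):
    return v[0]**2 + v[1]**2

def grad_C(v):
    return np.array([2*v[0], 2*v[1]])

eta = 0.1                        # the learning rate
v = np.array([2.0, -3.0])        # an arbitrary starting position
for step in range(30):
    v = v - eta*grad_C(v)        # the update rule v -> v' = v - eta * grad C
print(v, C(v))                   # v is now very close to the minimum at (0, 0)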

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\| \Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \| \nabla C \|$ is determined by the size constraint $\| \Delta v \| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.

Exercises

Prove the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

I explained gradient descent when $C$ is a function of two variables, and when it's a function of more than two variables. What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C / \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives*! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.

*Actually, more like half a trillion, since $\partial^2 C / \partial v_j \partial v_k = \partial^2 C / \partial v_k \partial v_j$. Still, you get the point.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have

$$w_k \rightarrow w_k' = w_k - \eta \frac{\partial C}{\partial w_k} \tag{16}$$
$$b_l \rightarrow b_l' = b_l - \eta \frac{\partial C}{\partial b_l}. \tag{17}$$

By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.

There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\| y(x) - a \|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,

$$\frac{\sum_{j=1}^m \nabla C_{X_j}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}$$

where the second sum is over the entire set of training data. Swapping sides we get

$$\nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_j}, \tag{19}$$

confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,

$$w_k \rightarrow w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}$$
$$b_l \rightarrow b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}$$

where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
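The bookkeeping for an epoch of stochastic gradient descent is simple to sketch. The fragment below is an illustration rather than the book's implementation: update_mini_batch is a placeholder name for whatever routine applies the rules (20) and (21) to a single mini-batch.

import random

def sgd_epoch(training_data, mini_batch_size, update_mini_batch):
    # One epoch: shuffle the training data, carve it into mini-batches,
    # and take one gradient-descent step per mini-batch.
    random.shuffle(training_data)
    mini_batches = [training_data[k:k+mini_batch_size]
                    for k in range(0, len(training_data), mini_batch_size)]
    for mini_batch in mini_batches:
        update_mini_batch(mini_batch)   # placeholder: applies rules (20) and (21)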

Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of 6,000 speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

Exercise

An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \, \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \, \partial C_x / \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.

Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost $C$ is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions". And they may start to worry: "I can't think in four dimensions, let alone five (or five million)". Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of $\Delta C$ to figure out how to move so as to decrease $C$. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.

Implementing our network to classify digits
Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here.

Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000-image validation set. We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000-image data set, not the original 60,000-image data set*.

*As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link).

Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.

Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object:

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

net = Network([2, 3, 1])

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.
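
For example, here's a quick check you can run in a Python shell once the full network.py listing given later in the chapter has been executed (so that np and the Network class are defined); notice that there are only two bias vectors for a three-layer network, since the input layer gets none:

>>> net = Network([2, 3, 1])
>>> [b.shape for b in net.biases]
[(3, 1), (1, 1)]
>>> [w.shape for w in net.weights]
[(3, 2), (1, 3)]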

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix w. It's a matrix such that w_jk is the weight for the connection between the kth neuron in the second layer, and the jth neuron in the third layer. This ordering of the j and k indices may seem strange - surely it'd make more sense to swap the j and k indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:

a′ = σ(wa + b).   (22)

There's quite a bit going on in this equation, so let's unpack it piece by piece. a is the vector of activations of the second layer of neurons. To obtain a′ we multiply a by the weight matrix w, and add the vector b of biases. We then apply the function σ elementwise to every entry in the vector wa + b. (This is called vectorizing the function σ.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.

Exercise

Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.
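
If you'd like a numerical sanity check to accompany the exercise, here is a small self-contained sketch (the particular numbers are arbitrary, and this is not code from the book) comparing the vectorized rule (22) with an explicit component-by-component computation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([[0.2, -0.5, 1.0],
              [0.7, 0.1, -0.3]])     # weights from a 3-neuron layer to a 2-neuron layer
b = np.array([[0.5], [-1.0]])        # biases for the 2-neuron layer
a = np.array([[0.9], [0.1], [0.4]])  # activations of the previous layer

# Vectorized form, Equation (22): a' = sigmoid(w a + b)
a_prime = sigmoid(np.dot(w, a) + b)

# Component form: a'_j = sigmoid(sum_k w_jk a_k + b_j)
for j in range(2):
    z_j = sum(w[j][k] * a[k][0] for k in range(3)) + b[j][0]
    assert abs(sigmoid(z_j) - a_prime[j][0]) < 1e-12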

With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form.

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output*. All the method does is apply Equation (22) for each layer:

def feedforward(self, a):
    """Return the output of the network if "a" is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a

*It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feed forward multiple inputs at once, and that is sometimes convenient.
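
As a quick illustration of that shape requirement (again, a made-up snippet assuming the definitions above have been pasted into a session where Numpy is imported as np):

>>> net = Network([2, 3, 1])
>>> x = np.array([[0.5], [0.8]])   # a (2, 1) ndarray, not a (2,) vector
>>> net.feedforward(x).shape
(1, 1)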

Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.

def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):
    """Train the neural network using mini-batch stochastic
    gradient descent.  The "training_data" is a list of tuples
    "(x, y)" representing the training inputs and the desired
    outputs.  The other non-optional parameters are
    self-explanatory.  If "test_data" is provided then the
    network will be evaluated against the test data after each
    epoch, and partial progress printed out.  This is useful for
    tracking progress, but slows things down substantially."""
    if test_data: n_test = len(test_data)
    n = len(training_data)
    for j in xrange(epochs):
        random.shuffle(training_data)
        mini_batches = [
            training_data[k:k+mini_batch_size]
            for k in xrange(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print "Epoch {0}: {1} / {2}".format(
                j, self.evaluate(test_data), n_test)
        else:
            print "Epoch {0} complete".format(j)

The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect - the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, η. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.

The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method:

def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The "mini_batch" is a list of tuples "(x, y)", and "eta"
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.

I'm not going to show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example x.

Let's look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory - all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the σ function, and self.cost_derivative, which I won't describe here. You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. We'll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub here.

"""
network.py
~~~~~~~~~~

Amoduletoimplementthestochasticgradientdescentlearning
algorithmforafeedforwardneuralnetwork.Gradientsarecalculated
usingbackpropagation.NotethatIhavefocusedonmakingthecode
simple,easilyreadable,andeasilymodifiable.Itisnotoptimized,
andomitsmanydesirablefeatures.
"""

####Libraries
#Standardlibrary
importrandom

#Thirdpartylibraries
importnumpyasnp

classNetwork(object):

def__init__(self,sizes):
"""Thelist``sizes``containsthenumberofneuronsinthe
respectivelayersofthenetwork.Forexample,ifthelist
was[2,3,1]thenitwouldbeathreelayernetwork,withthe
firstlayercontaining2neurons,thesecondlayer3neurons,
andthethirdlayer1neuron.Thebiasesandweightsforthe
networkareinitializedrandomly,usingaGaussian
distributionwithmean0,andvariance1.Notethatthefirst

http://neuralnetworksanddeeplearning.com/chap1.html 37/50
03/04/2017 Neuralnetworksanddeeplearning
layerisassumedtobeaninputlayer,andbyconventionwe
won'tsetanybiasesforthoseneurons,sincebiasesareonly
everusedincomputingtheoutputsfromlaterlayers."""
self.num_layers=len(sizes)
self.sizes=sizes
self.biases=[np.random.randn(y,1)foryinsizes[1:]]
self.weights=[np.random.randn(y,x)
forx,yinzip(sizes[:1],sizes[1:])]

deffeedforward(self,a):
"""Returntheoutputofthenetworkif``a``isinput."""
forb,winzip(self.biases,self.weights):
a=sigmoid(np.dot(w,a)+b)
returna

defSGD(self,training_data,epochs,mini_batch_size,eta,
test_data=None):
"""Traintheneuralnetworkusingminibatchstochastic
gradientdescent.The``training_data``isalistoftuples
``(x,y)``representingthetraininginputsandthedesired
outputs.Theothernonoptionalparametersare
selfexplanatory.If``test_data``isprovidedthenthe
networkwillbeevaluatedagainstthetestdataaftereach
epoch,andpartialprogressprintedout.Thisisusefulfor
trackingprogress,butslowsthingsdownsubstantially."""
iftest_data:n_test=len(test_data)
n=len(training_data)
forjinxrange(epochs):
random.shuffle(training_data)
mini_batches=[
training_data[k:k+mini_batch_size]
forkinxrange(0,n,mini_batch_size)]
formini_batchinmini_batches:
self.update_mini_batch(mini_batch,eta)
iftest_data:
print"Epoch{0}:{1}/{2}".format(
j,self.evaluate(test_data),n_test)
else:
print"Epoch{0}complete".format(j)

defupdate_mini_batch(self,mini_batch,eta):
"""Updatethenetwork'sweightsandbiasesbyapplying
gradientdescentusingbackpropagationtoasingleminibatch.
The``mini_batch``isalistoftuples``(x,y)``,and``eta``
isthelearningrate."""
nabla_b=[np.zeros(b.shape)forbinself.biases]
nabla_w=[np.zeros(w.shape)forwinself.weights]
forx,yinmini_batch:
delta_nabla_b,delta_nabla_w=self.backprop(x,y)
nabla_b=[nb+dnbfornb,dnbinzip(nabla_b,delta_nabla_b)]
nabla_w=[nw+dnwfornw,dnwinzip(nabla_w,delta_nabla_w)]
self.weights=[w(eta/len(mini_batch))*nw
forw,nwinzip(self.weights,nabla_w)]
self.biases=[b(eta/len(mini_batch))*nb
forb,nbinzip(self.biases,nabla_b)]

defbackprop(self,x,y):
"""Returnatuple``(nabla_b,nabla_w)``representingthe
gradientforthecostfunctionC_x.``nabla_b``and
``nabla_w``arelayerbylayerlistsofnumpyarrays,similar
to``self.biases``and``self.weights``."""
nabla_b=[np.zeros(b.shape)forbinself.biases]
nabla_w=[np.zeros(w.shape)forwinself.weights]
#feedforward
activation=x
activations=[x]#listtostorealltheactivations,layerbylayer
zs=[]#listtostoreallthezvectors,layerbylayer
forb,winzip(self.biases,self.weights):

http://neuralnetworksanddeeplearning.com/chap1.html 38/50
03/04/2017 Neuralnetworksanddeeplearning
z=np.dot(w,activation)+b
zs.append(z)
activation=sigmoid(z)
activations.append(activation)
#backwardpass
delta=self.cost_derivative(activations[1],y)*\
sigmoid_prime(zs[1])
nabla_b[1]=delta
nabla_w[1]=np.dot(delta,activations[2].transpose())
#Notethatthevariablelintheloopbelowisusedalittle
#differentlytothenotationinChapter2ofthebook.Here,
#l=1meansthelastlayerofneurons,l=2isthe
#secondlastlayer,andsoon.It'sarenumberingofthe
#schemeinthebook,usedheretotakeadvantageofthefact
#thatPythoncanusenegativeindicesinlists.
forlinxrange(2,self.num_layers):
z=zs[l]
sp=sigmoid_prime(z)
delta=np.dot(self.weights[l+1].transpose(),delta)*sp
nabla_b[l]=delta
nabla_w[l]=np.dot(delta,activations[l1].transpose())
return(nabla_b,nabla_w)

defevaluate(self,test_data):
"""Returnthenumberoftestinputsforwhichtheneural
networkoutputsthecorrectresult.Notethattheneural
network'soutputisassumedtobetheindexofwhichever
neuroninthefinallayerhasthehighestactivation."""
test_results=[(np.argmax(self.feedforward(x)),y)
for(x,y)intest_data]
returnsum(int(x==y)for(x,y)intest_results)

defcost_derivative(self,output_activations,y):
"""Returnthevectorofpartialderivatives\partialC_x/
\partialafortheoutputactivations."""
return(output_activationsy)

####Miscellaneousfunctions
defsigmoid(z):
"""Thesigmoidfunction."""
return1.0/(1.0+np.exp(z))

defsigmoid_prime(z):
"""Derivativeofthesigmoidfunction."""
returnsigmoid(z)*(1sigmoid(z))

How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, to be described below. We execute the following commands in a Python shell,

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

Of course, this could also be done in a separate Python program, but if you're following along it's probably easiest to do in a Python shell.

After loading the MNIST data, we'll set up a Network with 30 hidden neurons. We do this after importing the Python program listed above, which is named network,

>>> import network
>>> net = network.Network([784, 30, 10])

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of η = 3.0,

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Note that if you're running the code as you read along, it will take some time to execute - for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device. In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow,

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

That is, the trained network gives us a classification rate of about 95 percent - 95.42 percent at its peak ("Epoch 28")! That's quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we'll be initializing our network using (different) random weights and biases. To generate results in this chapter I've taken best-of-three runs.

Let's rerun the above experiment, changing the number of hidden neurons to 100. As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Sure enough, this improves the results to 96.59 percent. At least in this case, using more hidden neurons helps us get better results*.

*Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks.

Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, η. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be η = 0.001,

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

The results are much less encouraging,

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to η = 0.01. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like η = 1.0 (and perhaps fine tune to 3.0), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.
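
If you'd like to automate a little of this trial and error, a crude sweep over candidate learning rates works fine. This isn't code from the book's repository, just something you might type into the same Python shell; I've cut the number of epochs to 5 and used the 30-hidden-neuron network so each trial runs reasonably quickly:

>>> for eta in [0.001, 0.01, 0.1, 1.0, 3.0]:
...     net = network.Network([784, 30, 10])
...     print "Learning rate: {0}".format(eta)
...     net.SGD(training_data, 5, 10, eta, test_data=test_data)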

In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to η = 100.0:

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

At this point we've actually gone too far, and the learning rate is too high:

Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you're coming to a problem for the first time, you're not always sure.

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.

Exercise

Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?
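
If you'd like a starting point, the setup is just a small change to the commands we ran earlier; whether the same hyper-parameters are a sensible choice for this architecture is part of what the exercise asks you to explore:

>>> net = network.Network([784, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)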

Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings - it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):

"""
mnist_loader
~~~~~~~~~~~~

AlibrarytoloadtheMNISTimagedata.Fordetailsofthedata
structuresthatarereturned,seethedocstringsfor``load_data``
and``load_data_wrapper``.Inpractice,``load_data_wrapper``isthe
functionusuallycalledbyourneuralnetworkcode.
"""

####Libraries
#Standardlibrary
importcPickle
importgzip

#Thirdpartylibraries
importnumpyasnp

defload_data():
"""ReturntheMNISTdataasatuplecontainingthetrainingdata,
thevalidationdata,andthetestdata.

The``training_data``isreturnedasatuplewithtwoentries.
Thefirstentrycontainstheactualtrainingimages.Thisisa
numpyndarraywith50,000entries.Eachentryis,inturn,a
numpyndarraywith784values,representingthe28*28=784
pixelsinasingleMNISTimage.

Thesecondentryinthe``training_data``tupleisanumpyndarray
containing50,000entries.Thoseentriesarejustthedigit
values(0...9)forthecorrespondingimagescontainedinthefirst
entryofthetuple.

The``validation_data``and``test_data``aresimilar,except
eachcontainsonly10,000images.

Thisisanicedataformat,butforuseinneuralnetworksit's
helpfultomodifytheformatofthe``training_data``alittle.
That'sdoneinthewrapperfunction``load_data_wrapper()``,see
below.
"""
f=gzip.open('../data/mnist.pkl.gz','rb')
training_data,validation_data,test_data=cPickle.load(f)
f.close()
return(training_data,validation_data,test_data)

defload_data_wrapper():
"""Returnatuplecontaining``(training_data,validation_data,
test_data)``.Basedon``load_data``,buttheformatismore
convenientforuseinourimplementationofneuralnetworks.

Inparticular,``training_data``isalistcontaining50,000

http://neuralnetworksanddeeplearning.com/chap1.html 43/50
03/04/2017 Neuralnetworksanddeeplearning
2tuples``(x,y)``.``x``isa784dimensionalnumpy.ndarray
containingtheinputimage.``y``isa10dimensional
numpy.ndarrayrepresentingtheunitvectorcorrespondingtothe
correctdigitfor``x``.

``validation_data``and``test_data``arelistscontaining10,000
2tuples``(x,y)``.Ineachcase,``x``isa784dimensional
numpy.ndarrycontainingtheinputimage,and``y``isthe
correspondingclassification,i.e.,thedigitvalues(integers)
correspondingto``x``.

Obviously,thismeanswe'reusingslightlydifferentformatsfor
thetrainingdataandthevalidation/testdata.Theseformats
turnouttobethemostconvenientforuseinourneuralnetwork
code."""
tr_d,va_d,te_d=load_data()
training_inputs=[np.reshape(x,(784,1))forxintr_d[0]]
training_results=[vectorized_result(y)foryintr_d[1]]
training_data=zip(training_inputs,training_results)
validation_inputs=[np.reshape(x,(784,1))forxinva_d[0]]
validation_data=zip(validation_inputs,va_d[1])
test_inputs=[np.reshape(x,(784,1))forxinte_d[0]]
test_data=zip(test_inputs,te_d[1])
return(training_data,validation_data,test_data)

defvectorized_result(j):
"""Returna10dimensionalunitvectorwitha1.0inthejth
positionandzeroeselsewhere.Thisisusedtoconvertadigit
(0...9)intoacorrespondingdesiredoutputfromtheneural
network."""
e=np.zeros((10,1))
e[j]=1.0
returne

I said above that our program gets pretty good results. What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!

What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how dark an image is. For instance, an image of a 2 will typically be quite a bit darker than an image of a 1, just because more pixels are blackened out, as the following examples illustrate:

This suggests using the training data to compute average darknesses for each digit, 0, 1, 2, ..., 9. When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness. This is a simple procedure, and is easy to code up, so I won't explicitly write out the code - if you're interested it's in the GitHub repository. But it's a big improvement over random guessing, getting 2,225 of the 10,000 test images correct, i.e., 22.25 percent accuracy.
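
Here, for what it's worth, is a rough sketch of how such a darkness-based classifier might look. It isn't a copy of the repository's version, just an illustration of the idea, written against the (images, digits) tuples returned by load_data in the mnist_loader module shown above, and following the book's Python 2.7 conventions:

#### Baseline: classify a digit by its average darkness (illustrative sketch)
from collections import defaultdict

import mnist_loader

def avg_darknesses(training_data):
    """Return a dict mapping each digit 0..9 to the average total
    darkness of the training images of that digit."""
    counts = defaultdict(int)
    darknesses = defaultdict(float)
    for image, digit in zip(training_data[0], training_data[1]):
        counts[digit] += 1
        darknesses[digit] += image.sum()
    return {digit: darknesses[digit] / counts[digit] for digit in counts}

def guess_digit(image, avgs):
    """Guess whichever digit has average darkness closest to the image's."""
    darkness = image.sum()
    return min(avgs, key=lambda digit: abs(avgs[digit] - darkness))

training_data, validation_data, test_data = mnist_loader.load_data()
avgs = avg_darknesses(training_data)
num_correct = sum(int(guess_digit(image, avgs) == digit)
                  for image, digit in zip(test_data[0], test_data[1]))
print "Baseline classifier using average darkness of image."
print "{0} of 10000 values correct.".format(num_correct)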

It's not difficult to find other ideas which achieve accuracies in the 20 to 50 percent range. If you work a bit harder you can get up over 50 percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let's try using one of the best known algorithms, the support vector machine or SVM. If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work. Instead, we'll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.

If we run scikit-learn's SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available here.) That's a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.
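
In case you want to try this yourself, here is a minimal sketch of what such a run looks like. It assumes scikit-learn is installed and uses the (images, digits) tuples from mnist_loader.load_data; the repository's script does essentially this, but the listing below is only an illustration, not a copy of it (and be warned that fitting an SVM on 50,000 images takes a while):

#### Baseline: an SVM classifier with scikit-learn's default settings
import mnist_loader
from sklearn import svm

training_data, validation_data, test_data = mnist_loader.load_data()
clf = svm.SVC()                      # default settings
clf.fit(training_data[0], training_data[1])
predictions = [int(a) for a in clf.predict(test_data[0])]
num_correct = sum(int(a == y) for a, y in zip(predictions, test_data[1]))
print "Baseline classifier using an SVM."
print "{0} of 10000 values correct.".format(num_correct)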

That's not the end of the story, however. The 9,435 of 10,000 result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. I won't explicitly do this search, but instead refer you to this blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up above 98.5 percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?

In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We'll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:

I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems:

sophisticated algorithm ≤ simple learning algorithm + good training data.

Toward deep learning

While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically. And that means we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits? And, given such principles, can we do better?

To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don't understand, because they've been learned automatically. In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!

To address these questions, let's think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether an image shows a human face or not:

[Example face images omitted. Credits: 1. Ester Inbar. 2. Unknown. 3. NASA, ESA, G. Illingworth, D. Magee, and P. Oesch (University of California, Santa Cruz), R. Bouwens (Leiden University), and the HUDF09 Team.]

We could attack this problem the same way we attacked handwriting recognition - by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either "Yes, it's a face" or "No, it's not a face".

Let's suppose we do this, but that we're not using a learning algorithm. Instead, we're going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.

If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face.

Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair. Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face-detection, by combining the networks for the sub-problems. Here's a possible architecture, with rectangles denoting the sub-networks. Note that this isn't intended as a realistic approach to solving the face-detection problem; rather, it's to help us build intuition about how networks function. Here's the architecture:

It's also plausible that the sub-networks can be decomposed. Suppose we're considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?" "Are there eyelashes?" "Is there an iris?" and so on. Of course, these questions should really include positional information, as well - "Is the eyebrow in the top left, and above the iris?", that kind of thing - but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed:

Those questions too can be broken down, further and further through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.

The end result is a network which breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks.

Of course, I haven't said how to do this recursive decomposition into sub-networks. It certainly isn't practical to hand-design the weights and biases in the network. Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases - and thus, the hierarchy of concepts - from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, and in practice often too slowly to be useful.

Since 2006, a set of techniques has been developed that enable learning in deep neural nets. These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. These techniques have enabled much deeper (and larger) networks to be trained - people now routinely train networks with 5 to 10 hidden layers. And, it turns out that these perform far better on many problems than shallow neural networks, i.e., networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts. It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs. Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped-down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.

In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.
