CHAPTER 1

Using neural nets to recognize handwritten digits

By Michael Nielsen / Jan 2017

The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:

Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.

The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.

Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples,

and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.

In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.

We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.

Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.

Perceptrons
What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output:

In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, 0 or 1, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:

$$\mbox{output} = \begin{cases} 0 & \mbox{if } \sum_j w_j x_j \leq \mbox{threshold} \\ 1 & \mbox{if } \sum_j w_j x_j > \mbox{threshold} \end{cases} \tag{1}$$

That's all there is to how a perceptron works!

That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:

1. Is the weather good?
2. Does your boyfriend or girlfriend want to accompany you?
3. Is the festival near public transit? (You don't own a car).

We can represent these three factors by corresponding binary variables $x_1, x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2 = 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if not. And similarly again for $x_3$ and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of 5 for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting 1 whenever the weather is good, and 0 whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.
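To make the arithmetic concrete, here's a minimal sketch of this festival-decision perceptron (an illustration, not code from the book), using the weights and threshold just described:

import numpy as np

def perceptron_output(x, w, threshold):
    # Output 1 if the weighted sum of the inputs exceeds the threshold, else 0.
    return 1 if np.dot(w, x) > threshold else 0

w = np.array([6, 2, 2])   # the weather matters much more than the other two factors
threshold = 5
print(perceptron_output(np.array([1, 0, 0]), w, threshold))  # good weather alone: outputs 1
print(perceptron_output(np.array([0, 1, 1]), w, threshold))  # bad weather: outputs 0, whatever else is true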

By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of 3. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.

Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:

In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.

Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \mbox{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\mbox{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten:

$$\mbox{output} = \begin{cases} 0 & \mbox{if } w \cdot x + b \leq 0 \\ 1 & \mbox{if } w \cdot x + b > 0 \end{cases} \tag{2}$$

You can think of the bias as a measure of how easy it is to get the perceptron to output a 1. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a 1. But if the bias is very negative, then it's difficult for the perceptron to output a 1. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.

I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of 3. Here's our perceptron:

Then we see that input 00 produces output 1, since $(-2)*0 + (-2)*0 + 3 = 3$ is positive. Here, I've introduced the $*$ symbol to make the multiplications explicit. Similar calculations show that the inputs 01 and 10 produce output 1. But the input 11 produces output 0, since $(-2)*1 + (-2)*1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!
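As a quick check, here's a sketch (an illustration, not the book's code) that evaluates this NAND perceptron on all four possible inputs:

import numpy as np

def nand_perceptron(x1, x2):
    # Perceptron with weights -2, -2 and bias 3, as described above.
    w, b = np.array([-2, -2]), 3
    return 1 if np.dot(w, np.array([x1, x2])) + b > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(nand_perceptron(x1, x2))   # prints 1, 1, 1, 0 - the NAND truth table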

The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to 1 when both $x_1$ and $x_2$ are 1, i.e., the carry bit is just the bitwise product $x_1 x_2$:

To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of 3. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:

One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines, into a single connection with a weight of $-4$ instead of two connections with $-2$ weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to $-2$, all biases equal to 3, and a single weight of $-4$, as marked:

Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs:

This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output 1 if $b > 0$, and 0 if $b \leq 0$. That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \ldots$.

The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.

The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!

However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

Sigmoid neurons
Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):


If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.

We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just 0 or 1, these inputs can also take on any values between 0 and 1. So, for instance, 0.638... is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not 0 or 1. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function*, and is defined by:

$$\sigma(z) \equiv \frac{1}{1 + e^{-z}}. \tag{3}$$

*Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology.

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$, and bias $b$ is

$$\frac{1}{1 + \exp(-\sum_j w_j x_j - b)}. \tag{4}$$

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output from the sigmoid neuron is approximately 1, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x + b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x + b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x + b$ is of modest size that there's much deviation from the perceptron model.
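A quick numerical check of this limiting behaviour (a sketch using Numpy, not code from the book):

import numpy as np

def sigmoid(z):
    # The sigmoid function of Equation (3).
    return 1.0/(1.0 + np.exp(-z))

print(sigmoid(20.0))    # ~1.0: for large positive z the neuron behaves like a perceptron outputting 1
print(sigmoid(-20.0))   # ~0.0: for very negative z it behaves like a perceptron outputting 0
print(sigmoid(0.5))     # ~0.62: for modest z the output deviates from the 0/1 behaviour of a perceptron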

What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important - what really matters is the shape of the function when plotted. Here's the shape:

[Plot: the sigmoid function, rising smoothly from 0.0 to 1.0 as $z$ runs from $-4$ to $4$]

This shape is a smoothed out version of a step function:

[Plot: the step function, jumping from 0 to 1 at $z = 0$]

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be 1 or 0 depending on whether $w \cdot x + b$ was positive or negative*. By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by

$$\Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b, \tag{5}$$

where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} / \partial b$ denote partial derivatives of the output with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.

*Actually, when $w \cdot x + b = 0$ the perceptron outputs 0, while the step function outputs 1. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.
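To see Equation (5) in action, here's a small numerical sketch (not from the book). It nudges one weight of a single sigmoid neuron and compares the actual change in the output with the linear estimate, using the fact that $\partial \, \mbox{output} / \partial w_j = \sigma'(z) x_j$ with $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

import numpy as np

def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))

w, b = np.array([0.5, -0.3]), 0.1
x = np.array([1.0, 2.0])
z = np.dot(w, x) + b

delta_w = np.array([0.001, 0.0])                        # a tiny change to the first weight
actual = sigmoid(np.dot(w + delta_w, x) + b) - sigmoid(z)
linear = sigmoid(z)*(1 - sigmoid(z))*x[0]*delta_w[0]    # (d output / d w_1) * Delta w_1
print(actual, linear)                                   # the two agree to several decimal places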

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly used in work on neural nets, and is the activation function we'll use most often in this book.

How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output 0 or 1. They can have as output any real number between 0 and 1, so values such as 0.173... and 0.689... are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a 0 or a 1, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least 0.5 as indicating a "9", and any output less than 0.5 as indicating "not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.

Exercises

Sigmoid neurons simulating perceptrons, part I
Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$. Show that the behaviour of the network doesn't change.

Sigmoid neurons simulating perceptrons, part II
Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$. Show that in the limit as $c \rightarrow \infty$ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when $w \cdot x + b = 0$ for one of the perceptrons?

The architecture of neural networks
In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:

As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs. The term "hidden" perhaps sounds a little mysterious - the first time I heard the term I thought it must have some deep philosophical or mathematical significance - but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers:


Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.

The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a 64 by 64 greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between 0 and 1. The output layer will contain just a single neuron, with output values of less than 0.5 indicating "input image is not a 9", and values greater than 0.5 indicating "input image is a 9".

While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.

Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network - information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.

Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely used feedforward networks.

A simple network to classify handwritten digits
Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image

into six separate images,


We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above,

is a 5.

We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network:


The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the 784 input neurons in the diagram above. The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in between values representing gradually darkening shades of grey.

The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.

The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a 0. If the second neuron fires then that will indicate that the network thinks the digit is a 1. And so on. A little more precisely, we number the output neurons from 0 through 9, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number 6, then our network will guess that the input digit was a 6. And so on for the other output neurons.
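In code, reading off the network's guess is just a matter of finding the index of the largest of the ten output activations. A small sketch (the activation values here are made up for illustration; np.argmax is the Numpy routine that returns that index):

import numpy as np

# Hypothetical activations of the ten output neurons for one input image.
output_activations = np.array([0.02, 0.01, 0.05, 0.01, 0.00,
                               0.03, 0.91, 0.02, 0.04, 0.01])
print(np.argmax(output_activations))   # prints 6: the network's guess for the digit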

You might wonder why we use 10 output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just 4 output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to 0 or to 1. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use 10 neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with 10 output neurons learns to recognize digits better than the network with 4 output neurons. But that leaves us wondering why using 10 output neurons works better. Is there some heuristic that would tell us in advance that we should use the 10-output encoding instead of the 4-output encoding?

To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use 10 output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a 0. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:

It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

As you may have guessed, these four images together make up the 0 image that we saw in the line of digits shown earlier:


So if all four of these hidden neurons are firing then we can conclude that the digit is a 0. Of course, that's not the only sort of evidence we can use to conclude that the image was a 0 - we could legitimately get a 0 in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a 0.

Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have 10 outputs from the network, rather than 4. If we had 4 outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only 4 output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.

Exercise

There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01.

Learning with gradient descent
Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here's a few images from MNIST:

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.

We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a 10-dimensional vector. For example, if a particular training image, $x$, depicts a 6, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function*:

$$C(w, b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2. \tag{6}$$

*Sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks.

Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w, b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w, b)$ becomes small, i.e., $C(w, b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w, b) \approx 0$. By contrast, it's not doing so well when $C(w, b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w, b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.
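As a concrete illustration, here's a sketch (with made-up numbers, not real network outputs) of how the quadratic cost of Equation (6) is computed over a couple of training examples, using the vectorized desired outputs $y(x)$ described above:

import numpy as np

def quadratic_cost(outputs, desired):
    # C(w,b) = (1/2n) * sum over x of ||y(x) - a||^2, for matching lists of vectors.
    n = len(outputs)
    return sum(np.linalg.norm(y - a)**2 for a, y in zip(outputs, desired)) / (2.0*n)

# Two hypothetical training examples: network outputs and one-hot desired outputs.
a1 = np.array([0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.1, 0.0])   # an image of a 6
y1 = np.zeros(10); y1[6] = 1.0
a2 = np.array([0.7, 0.1, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 0.0])   # an image of a 0
y2 = np.zeros(10); y2[0] = 1.0
print(quadratic_cost([a1, a2], [y1, y2]))   # small when the outputs are close to y(x)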


Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.

Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.

Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:

What we'd like is to find where $C$ achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)

Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows:

$$\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}$$

We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T$. We denote the gradient vector by $\nabla C$, i.e.:

$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}$$

In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.

With these definitions, the expression (7) for $\Delta C$ can be rewritten as

$$\Delta C \approx \nabla C \cdot \Delta v. \tag{9}$$

This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose

$$\Delta v = -\eta \nabla C, \tag{10}$$

where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx \nabla C \cdot \Delta v = -\eta \nabla C \cdot \nabla C = -\eta \| \nabla C \|^2$. Because $\| \nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9)). This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:

$$v \rightarrow v' = v - \eta \nabla C. \tag{11}$$

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:

Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.

I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1, \ldots, v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is

$$\Delta C \approx \nabla C \cdot \Delta v, \tag{12}$$

where the gradient $\nabla C$ is the vector

$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m} \right)^T. \tag{13}$$

Just as for the two variable case, we can choose

$$\Delta v = -\eta \nabla C, \tag{14}$$

and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule

$$v \rightarrow v' = v - \eta \nabla C. \tag{15}$$

You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.
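Here's a minimal sketch (not from the book's code) of the update rule (15) applied to a simple two-variable function, $C(v) = v_1^2 + v_2^2$, whose gradient $\nabla C = (2v_1, 2v_2)$ we can write down by hand:

import numpy as np

def C(v):
    return v[0]**2 + v[1]**2

def grad_C(v):
    return np.array([2*v[0], 2*v[1]])

eta = 0.1                        # the learning rate
v = np.array([2.0, -3.0])        # an arbitrary starting position
for step in range(30):
    v = v - eta*grad_C(v)        # the update rule v -> v' = v - eta * grad C
print(v, C(v))                   # v is now very close to the minimum at (0, 0)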

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\| \Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \| \nabla C \|$ is determined by the size constraint $\| \Delta v \| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.

Exercises

Prove the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

I explained gradient descent when $C$ is a function of two variables, and when it's a function of more than two variables. What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C / \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives*! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.

*Actually, more like half a trillion, since $\partial^2 C / \partial v_j \partial v_k = \partial^2 C / \partial v_k \partial v_j$. Still, you get the point.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have

$$w_k \rightarrow w_k' = w_k - \eta \frac{\partial C}{\partial w_k} \tag{16}$$
$$b_l \rightarrow b_l' = b_l - \eta \frac{\partial C}{\partial b_l}. \tag{17}$$

By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.

There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\| y(x) - a \|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,

$$\frac{\sum_{j=1}^m \nabla C_{X_j}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}$$

where the second sum is over the entire set of training data. Swapping sides we get

$$\nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_j}, \tag{19}$$

confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,

$$w_k \rightarrow w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}$$
$$b_l \rightarrow b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}$$

where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.
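The bookkeeping for an epoch of stochastic gradient descent is simple to sketch. The fragment below is an illustration rather than the book's implementation: update_mini_batch is a placeholder name for whatever routine applies the rules (20) and (21) to a single mini-batch.

import random

def sgd_epoch(training_data, mini_batch_size, update_mini_batch):
    # One epoch: shuffle the training data, carve it into mini-batches,
    # and take one gradient-descent step per mini-batch.
    random.shuffle(training_data)
    mini_batches = [training_data[k:k+mini_batch_size]
                    for k in range(0, len(training_data), mini_batch_size)]
    for mini_batch in mini_batches:
        update_mini_batch(mini_batch)   # placeholder: applies rules (20) and (21)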

Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of 6,000 speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

Exercise

An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \, \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \, \partial C_x / \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.

Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost $C$ is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions". And they may start to worry: "I can't think in four dimensions, let alone five (or five million)". Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of $\Delta C$ to figure out how to move so as to decrease $C$. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.

Implementing our network to classify digits
Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here.

Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000-image validation set. We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm. Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000-image data set, not the original 60,000-image data set*.

*As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States' National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link).

Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.

Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object:

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code:

net = Network([2, 3, 1])

The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.
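
For example, here's a quick check you can run in a Python shell once the full network.py listing given later in the chapter has been executed (so that np and the Network class are defined); notice that there are only two bias vectors for a three-layer network, since the input layer gets none:

>>> net = Network([2, 3, 1])
>>> [b.shape for b in net.biases]
[(3, 1), (1, 1)]
>>> [w.shape for w in net.weights]
[(3, 2), (1, 3)]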

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix w. It's a matrix such that w_jk is the weight for the connection between the kth neuron in the second layer, and the jth neuron in the third layer. This ordering of the j and k indices may seem strange - surely it'd make more sense to swap the j and k indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:

a′ = σ(wa + b).   (22)

There's quite a bit going on in this equation, so let's unpack it piece by piece. a is the vector of activations of the second layer of neurons. To obtain a′ we multiply a by the weight matrix w, and add the vector b of biases. We then apply the function σ elementwise to every entry in the vector wa + b. (This is called vectorizing the function σ.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.

Exercise

Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.
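
If you'd like a numerical sanity check to accompany the exercise, here is a small self-contained sketch (the particular numbers are arbitrary, and this is not code from the book) comparing the vectorized rule (22) with an explicit component-by-component computation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([[0.2, -0.5, 1.0],
              [0.7, 0.1, -0.3]])     # weights from a 3-neuron layer to a 2-neuron layer
b = np.array([[0.5], [-1.0]])        # biases for the 2-neuron layer
a = np.array([[0.9], [0.1], [0.4]])  # activations of the previous layer

# Vectorized form, Equation (22): a' = sigmoid(w a + b)
a_prime = sigmoid(np.dot(w, a) + b)

# Component form: a'_j = sigmoid(sum_k w_jk a_k + b_j)
for j in range(2):
    z_j = sum(w[j][k] * a[k][0] for k in range(3)) + b[j][0]
    assert abs(sigmoid(z_j) - a_prime[j][0]) < 1e-12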

With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function:

def sigmoid(z):
    return 1.0/(1.0+np.exp(-z))

Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form.

We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output*. All the method does is apply Equation (22) for each layer:

def feedforward(self, a):
    """Return the output of the network if "a" is input."""
    for b, w in zip(self.biases, self.weights):
        a = sigmoid(np.dot(w, a)+b)
    return a

*It is assumed that the input a is an (n, 1) Numpy ndarray, not a (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feed forward multiple inputs at once, and that is sometimes convenient.
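
As a quick illustration of that shape requirement (again, a made-up snippet assuming the definitions above have been pasted into a session where Numpy is imported as np):

>>> net = Network([2, 3, 1])
>>> x = np.array([[0.5], [0.8]])   # a (2, 1) ndarray, not a (2,) vector
>>> net.feedforward(x).shape
(1, 1)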

Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. Here's the code. It's a little mysterious in a few places, but I'll break it down below, after the listing.

def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):
    """Train the neural network using mini-batch stochastic
    gradient descent.  The "training_data" is a list of tuples
    "(x, y)" representing the training inputs and the desired
    outputs.  The other non-optional parameters are
    self-explanatory.  If "test_data" is provided then the
    network will be evaluated against the test data after each
    epoch, and partial progress printed out.  This is useful for
    tracking progress, but slows things down substantially."""
    if test_data: n_test = len(test_data)
    n = len(training_data)
    for j in xrange(epochs):
        random.shuffle(training_data)
        mini_batches = [
            training_data[k:k+mini_batch_size]
            for k in xrange(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
        if test_data:
            print "Epoch {0}: {1} / {2}".format(
                j, self.evaluate(test_data), n_test)
        else:
            print "Epoch {0} complete".format(j)

The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect - the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, η. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.

The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. Here's the code for the update_mini_batch method:

def update_mini_batch(self, mini_batch, eta):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation to a single mini batch.
    The "mini_batch" is a list of tuples "(x, y)", and "eta"
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.weights = [w-(eta/len(mini_batch))*nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b-(eta/len(mini_batch))*nb
                   for b, nb in zip(self.biases, nabla_b)]

Most of the work is done by the line

delta_nabla_b, delta_nabla_w = self.backprop(x, y)

This invokes something called the backpropagation algorithm, which is a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.

I'm not going to show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example x.

Let's look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory - all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the σ function, and self.cost_derivative, which I won't describe here. You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. We'll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub here.

"""
network.py
~~~~~~~~~~

Amoduletoimplementthestochasticgradientdescentlearning
algorithmforafeedforwardneuralnetwork.Gradientsarecalculated
usingbackpropagation.NotethatIhavefocusedonmakingthecode
simple,easilyreadable,andeasilymodifiable.Itisnotoptimized,
andomitsmanydesirablefeatures.
"""

####Libraries
#Standardlibrary
importrandom

#Thirdpartylibraries
importnumpyasnp

classNetwork(object):

def__init__(self,sizes):
"""Thelist``sizes``containsthenumberofneuronsinthe
respectivelayersofthenetwork.Forexample,ifthelist
was[2,3,1]thenitwouldbeathreelayernetwork,withthe
firstlayercontaining2neurons,thesecondlayer3neurons,
andthethirdlayer1neuron.Thebiasesandweightsforthe
networkareinitializedrandomly,usingaGaussian
distributionwithmean0,andvariance1.Notethatthefirst

http://neuralnetworksanddeeplearning.com/chap1.html 37/50
03/04/2017 Neuralnetworksanddeeplearning
layerisassumedtobeaninputlayer,andbyconventionwe
won'tsetanybiasesforthoseneurons,sincebiasesareonly
everusedincomputingtheoutputsfromlaterlayers."""
self.num_layers=len(sizes)
self.sizes=sizes
self.biases=[np.random.randn(y,1)foryinsizes[1:]]
self.weights=[np.random.randn(y,x)
forx,yinzip(sizes[:1],sizes[1:])]

deffeedforward(self,a):
"""Returntheoutputofthenetworkif``a``isinput."""
forb,winzip(self.biases,self.weights):
a=sigmoid(np.dot(w,a)+b)
returna

defSGD(self,training_data,epochs,mini_batch_size,eta,
test_data=None):
"""Traintheneuralnetworkusingminibatchstochastic
gradientdescent.The``training_data``isalistoftuples
``(x,y)``representingthetraininginputsandthedesired
outputs.Theothernonoptionalparametersare
selfexplanatory.If``test_data``isprovidedthenthe
networkwillbeevaluatedagainstthetestdataaftereach
epoch,andpartialprogressprintedout.Thisisusefulfor
trackingprogress,butslowsthingsdownsubstantially."""
iftest_data:n_test=len(test_data)
n=len(training_data)
forjinxrange(epochs):
random.shuffle(training_data)
mini_batches=[
training_data[k:k+mini_batch_size]
forkinxrange(0,n,mini_batch_size)]
formini_batchinmini_batches:
self.update_mini_batch(mini_batch,eta)
iftest_data:
print"Epoch{0}:{1}/{2}".format(
j,self.evaluate(test_data),n_test)
else:
print"Epoch{0}complete".format(j)

defupdate_mini_batch(self,mini_batch,eta):
"""Updatethenetwork'sweightsandbiasesbyapplying
gradientdescentusingbackpropagationtoasingleminibatch.
The``mini_batch``isalistoftuples``(x,y)``,and``eta``
isthelearningrate."""
nabla_b=[np.zeros(b.shape)forbinself.biases]
nabla_w=[np.zeros(w.shape)forwinself.weights]
forx,yinmini_batch:
delta_nabla_b,delta_nabla_w=self.backprop(x,y)
nabla_b=[nb+dnbfornb,dnbinzip(nabla_b,delta_nabla_b)]
nabla_w=[nw+dnwfornw,dnwinzip(nabla_w,delta_nabla_w)]
self.weights=[w(eta/len(mini_batch))*nw
forw,nwinzip(self.weights,nabla_w)]
self.biases=[b(eta/len(mini_batch))*nb
forb,nbinzip(self.biases,nabla_b)]

defbackprop(self,x,y):
"""Returnatuple``(nabla_b,nabla_w)``representingthe
gradientforthecostfunctionC_x.``nabla_b``and
``nabla_w``arelayerbylayerlistsofnumpyarrays,similar
to``self.biases``and``self.weights``."""
nabla_b=[np.zeros(b.shape)forbinself.biases]
nabla_w=[np.zeros(w.shape)forwinself.weights]
#feedforward
activation=x
activations=[x]#listtostorealltheactivations,layerbylayer
zs=[]#listtostoreallthezvectors,layerbylayer
forb,winzip(self.biases,self.weights):

http://neuralnetworksanddeeplearning.com/chap1.html 38/50
03/04/2017 Neuralnetworksanddeeplearning
z=np.dot(w,activation)+b
zs.append(z)
activation=sigmoid(z)
activations.append(activation)
#backwardpass
delta=self.cost_derivative(activations[1],y)*\
sigmoid_prime(zs[1])
nabla_b[1]=delta
nabla_w[1]=np.dot(delta,activations[2].transpose())
#Notethatthevariablelintheloopbelowisusedalittle
#differentlytothenotationinChapter2ofthebook.Here,
#l=1meansthelastlayerofneurons,l=2isthe
#secondlastlayer,andsoon.It'sarenumberingofthe
#schemeinthebook,usedheretotakeadvantageofthefact
#thatPythoncanusenegativeindicesinlists.
forlinxrange(2,self.num_layers):
z=zs[l]
sp=sigmoid_prime(z)
delta=np.dot(self.weights[l+1].transpose(),delta)*sp
nabla_b[l]=delta
nabla_w[l]=np.dot(delta,activations[l1].transpose())
return(nabla_b,nabla_w)

defevaluate(self,test_data):
"""Returnthenumberoftestinputsforwhichtheneural
networkoutputsthecorrectresult.Notethattheneural
network'soutputisassumedtobetheindexofwhichever
neuroninthefinallayerhasthehighestactivation."""
test_results=[(np.argmax(self.feedforward(x)),y)
for(x,y)intest_data]
returnsum(int(x==y)for(x,y)intest_results)

defcost_derivative(self,output_activations,y):
"""Returnthevectorofpartialderivatives\partialC_x/
\partialafortheoutputactivations."""
return(output_activationsy)

####Miscellaneousfunctions
defsigmoid(z):
"""Thesigmoidfunction."""
return1.0/(1.0+np.exp(z))

defsigmoid_prime(z):
"""Derivativeofthesigmoidfunction."""
returnsigmoid(z)*(1sigmoid(z))

How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, to be described below. We execute the following commands in a Python shell,

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()

Of course, this could also be done in a separate Python program, but if you're following along it's probably easiest to do in a Python shell.

After loading the MNIST data, we'll set up a Network with 30 hidden neurons. We do this after importing the Python program listed above, which is named network,

>>> import network
>>> net = network.Network([784, 30, 10])

Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of η = 3.0,

>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Note that if you're running the code as you read along, it will take some time to execute - for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device. In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow,

Epoch 0: 9129 / 10000
Epoch 1: 9295 / 10000
Epoch 2: 9348 / 10000
...
Epoch 27: 9528 / 10000
Epoch 28: 9542 / 10000
Epoch 29: 9534 / 10000

That is, the trained network gives us a classification rate of about 95 percent - 95.42 percent at its peak ("Epoch 28")! That's quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we'll be initializing our network using (different) random weights and biases. To generate results in this chapter I've taken best-of-three runs.

Let's rerun the above experiment, changing the number of hidden neurons to 100. As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes.

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)

Sure enough, this improves the results to 96.59 percent. At least in this case, using more hidden neurons helps us get better results*.

*Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks.

Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, η. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be η = 0.001,

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 0.001, test_data=test_data)

The results are much less encouraging,

Epoch 0: 1139 / 10000
Epoch 1: 1136 / 10000
Epoch 2: 1135 / 10000
...
Epoch 27: 2101 / 10000
Epoch 28: 2123 / 10000
Epoch 29: 2142 / 10000

However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to η = 0.01. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like η = 1.0 (and perhaps fine tune to 3.0), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.
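
If you'd like to automate a little of this trial and error, a crude sweep over candidate learning rates works fine. This isn't code from the book's repository, just something you might type into the same Python shell; I've cut the number of epochs to 5 and used the 30-hidden-neuron network so each trial runs reasonably quickly:

>>> for eta in [0.001, 0.01, 0.1, 1.0, 3.0]:
...     net = network.Network([784, 30, 10])
...     print "Learning rate: {0}".format(eta)
...     net.SGD(training_data, 5, 10, eta, test_data=test_data)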

In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to η = 100.0:

>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 100.0, test_data=test_data)

At this point we've actually gone too far, and the learning rate is too high:

Epoch 0: 1009 / 10000
Epoch 1: 1009 / 10000
Epoch 2: 1009 / 10000
Epoch 3: 1009 / 10000
...
Epoch 27: 982 / 10000
Epoch 28: 982 / 10000
Epoch 29: 982 / 10000

Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you're coming to a problem for the first time, you're not always sure.

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.

Exercise

Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?
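
If you'd like a starting point, the setup is just a small change to the commands we ran earlier; whether the same hyper-parameters are a sensible choice for this architecture is part of what the exercise asks you to explore:

>>> net = network.Network([784, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)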

Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings - it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays):

"""
mnist_loader
~~~~~~~~~~~~

AlibrarytoloadtheMNISTimagedata.Fordetailsofthedata
structuresthatarereturned,seethedocstringsfor``load_data``
and``load_data_wrapper``.Inpractice,``load_data_wrapper``isthe
functionusuallycalledbyourneuralnetworkcode.
"""

####Libraries
#Standardlibrary
importcPickle
importgzip

#Thirdpartylibraries
importnumpyasnp

defload_data():
"""ReturntheMNISTdataasatuplecontainingthetrainingdata,
thevalidationdata,andthetestdata.

The``training_data``isreturnedasatuplewithtwoentries.
Thefirstentrycontainstheactualtrainingimages.Thisisa
numpyndarraywith50,000entries.Eachentryis,inturn,a
numpyndarraywith784values,representingthe28*28=784
pixelsinasingleMNISTimage.

Thesecondentryinthe``training_data``tupleisanumpyndarray
containing50,000entries.Thoseentriesarejustthedigit
values(0...9)forthecorrespondingimagescontainedinthefirst
entryofthetuple.

The``validation_data``and``test_data``aresimilar,except
eachcontainsonly10,000images.

Thisisanicedataformat,butforuseinneuralnetworksit's
helpfultomodifytheformatofthe``training_data``alittle.
That'sdoneinthewrapperfunction``load_data_wrapper()``,see
below.
"""
f=gzip.open('../data/mnist.pkl.gz','rb')
training_data,validation_data,test_data=cPickle.load(f)
f.close()
return(training_data,validation_data,test_data)

defload_data_wrapper():
"""Returnatuplecontaining``(training_data,validation_data,
test_data)``.Basedon``load_data``,buttheformatismore
convenientforuseinourimplementationofneuralnetworks.

Inparticular,``training_data``isalistcontaining50,000

http://neuralnetworksanddeeplearning.com/chap1.html 43/50
03/04/2017 Neuralnetworksanddeeplearning
2tuples``(x,y)``.``x``isa784dimensionalnumpy.ndarray
containingtheinputimage.``y``isa10dimensional
numpy.ndarrayrepresentingtheunitvectorcorrespondingtothe
correctdigitfor``x``.

``validation_data``and``test_data``arelistscontaining10,000
2tuples``(x,y)``.Ineachcase,``x``isa784dimensional
numpy.ndarrycontainingtheinputimage,and``y``isthe
correspondingclassification,i.e.,thedigitvalues(integers)
correspondingto``x``.

Obviously,thismeanswe'reusingslightlydifferentformatsfor
thetrainingdataandthevalidation/testdata.Theseformats
turnouttobethemostconvenientforuseinourneuralnetwork
code."""
tr_d,va_d,te_d=load_data()
training_inputs=[np.reshape(x,(784,1))forxintr_d[0]]
training_results=[vectorized_result(y)foryintr_d[1]]
training_data=zip(training_inputs,training_results)
validation_inputs=[np.reshape(x,(784,1))forxinva_d[0]]
validation_data=zip(validation_inputs,va_d[1])
test_inputs=[np.reshape(x,(784,1))forxinte_d[0]]
test_data=zip(test_inputs,te_d[1])
return(training_data,validation_data,test_data)

defvectorized_result(j):
"""Returna10dimensionalunitvectorwitha1.0inthejth
positionandzeroeselsewhere.Thisisusedtoconvertadigit
(0...9)intoacorrespondingdesiredoutputfromtheneural
network."""
e=np.zeros((10,1))
e[j]=1.0
returne

I said above that our program gets pretty good results. What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!

What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how dark an image is. For instance, an image of a 2 will typically be quite a bit darker than an image of a 1, just because more pixels are blackened out, as the following examples illustrate:

This suggests using the training data to compute average darknesses for each digit, 0, 1, 2, ..., 9. When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness. This is a simple procedure, and is easy to code up, so I won't explicitly write out the code - if you're interested it's in the GitHub repository. But it's a big improvement over random guessing, getting 2,225 of the 10,000 test images correct, i.e., 22.25 percent accuracy.
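
Here, for what it's worth, is a rough sketch of how such a darkness-based classifier might look. It isn't a copy of the repository's version, just an illustration of the idea, written against the (images, digits) tuples returned by load_data in the mnist_loader module shown above, and following the book's Python 2.7 conventions:

#### Baseline: classify a digit by its average darkness (illustrative sketch)
from collections import defaultdict

import mnist_loader

def avg_darknesses(training_data):
    """Return a dict mapping each digit 0..9 to the average total
    darkness of the training images of that digit."""
    counts = defaultdict(int)
    darknesses = defaultdict(float)
    for image, digit in zip(training_data[0], training_data[1]):
        counts[digit] += 1
        darknesses[digit] += image.sum()
    return {digit: darknesses[digit] / counts[digit] for digit in counts}

def guess_digit(image, avgs):
    """Guess whichever digit has average darkness closest to the image's."""
    darkness = image.sum()
    return min(avgs, key=lambda digit: abs(avgs[digit] - darkness))

training_data, validation_data, test_data = mnist_loader.load_data()
avgs = avg_darknesses(training_data)
num_correct = sum(int(guess_digit(image, avgs) == digit)
                  for image, digit in zip(test_data[0], test_data[1]))
print "Baseline classifier using average darkness of image."
print "{0} of 10000 values correct.".format(num_correct)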

It's not difficult to find other ideas which achieve accuracies in the 20 to 50 percent range. If you work a bit harder you can get up over 50 percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let's try using one of the best known algorithms, the support vector machine or SVM. If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work. Instead, we'll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.

If we run scikit-learn's SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available here.) That's a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.
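
In case you want to try this yourself, here is a minimal sketch of what such a run looks like. It assumes scikit-learn is installed and uses the (images, digits) tuples from mnist_loader.load_data; the repository's script does essentially this, but the listing below is only an illustration, not a copy of it (and be warned that fitting an SVM on 50,000 images takes a while):

#### Baseline: an SVM classifier with scikit-learn's default settings
import mnist_loader
from sklearn import svm

training_data, validation_data, test_data = mnist_loader.load_data()
clf = svm.SVC()                      # default settings
clf.fit(training_data[0], training_data[1])
predictions = [int(a) for a in clf.predict(test_data[0])]
num_correct = sum(int(a == y) for a, y in zip(predictions, test_data[1]))
print "Baseline classifier using an SVM."
print "{0} of 10000 values correct.".format(num_correct)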

That's not the end of the story, however. The 9,435 of 10,000 result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. I won't explicitly do this search, but instead refer you to this blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up above 98.5 percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?

In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We'll see most of the techniques they used later in the book. At that level the performance is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence, for example:

I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems:

sophisticated algorithm ≤ simple learning algorithm + good training data.

Toward deep learning

While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically. And that means we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits? And, given such principles, can we do better?

To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don't understand, because they've been learned automatically. In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!

To address these questions, let's think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether an image shows a human face or not:

[Example face images omitted. Credits: 1. Ester Inbar. 2. Unknown. 3. NASA, ESA, G. Illingworth, D. Magee, and P. Oesch (University of California, Santa Cruz), R. Bouwens (Leiden University), and the HUDF09 Team.]

We could attack this problem the same way we attacked handwriting recognition - by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either "Yes, it's a face" or "No, it's not a face".

Let's suppose we do this, but that we're not using a learning algorithm. Instead, we're going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.

If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face.

Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair. Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face-detection, by combining the networks for the sub-problems. Here's a possible architecture, with rectangles denoting the sub-networks. Note that this isn't intended as a realistic approach to solving the face-detection problem; rather, it's to help us build intuition about how networks function. Here's the architecture:

It's also plausible that the sub-networks can be decomposed. Suppose we're considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?" "Are there eyelashes?" "Is there an iris?" and so on. Of course, these questions should really include positional information, as well - "Is the eyebrow in the top left, and above the iris?", that kind of thing - but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed:

Those questions too can be broken down, further and further through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.

The end result is a network which breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks.

Of course, I haven't said how to do this recursive decomposition into sub-networks. It certainly isn't practical to hand-design the weights and biases in the network. Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases - and thus, the hierarchy of concepts - from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, and in practice often too slowly to be useful.

Since 2006, a set of techniques has been developed that enable learning in deep neural nets. These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. These techniques have enabled much deeper (and larger) networks to be trained - people now routinely train networks with 5 to 10 hidden layers. And, it turns out that these perform far better on many problems than shallow neural networks, i.e., networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts. It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs. Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped-down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.

In academic work, please cite this book as: Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License. This means you're free to copy, share, and build on this book, but not to sell it. If you're interested in commercial use, please contact me.
