2. Neural Networks

2.1 Motivation and Definition

Which Method to Choose?
We have seen linear regression, kernel regression, regularization, K-PCA, K-SVM, ... and there exist a zillion other methods. Is there a universally best method?

"No Free Lunch" Theorem [Wolpert (1996)], Informal Version
Of course not!!

"Proof" of the "Theorem"
If the underlying function is completely arbitrary and nothing is known, we cannot possibly infer anything about it from samples $((x_i, y_i))_{i=1}^m$... Every algorithm will have a specific preference, for example as specified through the hypothesis class $\mathcal{H}$: all "categories" are artificial!

We want our algorithm to reproduce the artificial categories produced by our brain, so let's build a hypothesis class that mimics our thinking!

Neuroscience: The Brain as Biological Neural Network
"In neuroscience, a biological neural network is a series of interconnected neurons whose activation defines a recognizable linear pathway. The interface through which neurons interact with their neighbors usually consists of several axon terminals connected via synapses to dendrites on other neurons. If the sum of the input signals into one neuron surpasses a certain threshold, the neuron sends an action potential (AP) at the axon hillock and transmits this electrical signal along the axon." (Source: Wikipedia)

Neurons
Recall: "If the sum of the input signals into one neuron surpasses a certain threshold, [...] the neuron transmits this [...] signal [...]."

Artificial Neurons
Figure: An artificial neuron. Inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ and threshold $b$; the activation outputs $y = 1$ if $\sum_i x_i w_i - b \geq 0$ and $y = 0$ if $\sum_i x_i w_i - b < 0$.
Artificial Neuron (Definition)
An artificial neuron with weights $w_1, \dots, w_s$, bias $b$ and activation function $\rho : \mathbb{R} \to \mathbb{R}$ is defined as the function
$$f(x_1, \dots, x_s) = \rho\left(\sum_{i=1}^{s} x_i w_i - b\right).$$
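Here is a minimal code sketch of this definition (the names `heaviside` and `neuron` are our own, not from the lecture):

```python
import numpy as np

def heaviside(t):
    # Heaviside activation: 1 if t >= 0, else 0 (as in the biological motivation)
    return np.where(t >= 0, 1.0, 0.0)

def neuron(x, w, b, rho=heaviside):
    # f(x_1, ..., x_s) = rho(sum_i x_i w_i - b)
    return rho(np.dot(x, w) - b)

# Example: weighted input 0.6 stays below the threshold b = 1, so the neuron does not fire
print(neuron(np.array([1.0, 0.5, 2.0]), np.array([0.2, 0.4, 0.1]), b=1.0))  # 0.0
```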
Activation Functions
Figure: Heaviside activation function (as in the biological motivation).
Figure: Sigmoid activation function $\sigma(x) = \frac{1}{1+e^{-x}}$.

From now on we use Python!

```python
import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    # componentwise sigmoid 1/(1 + e^(-x)); works for scalars and numpy arrays
    return 1 / (1 + np.exp(-x))

x = np.arange(-10., 10., 0.2)
sig = sigmoid(x)

plt.plot(x, sig)
plt.savefig('sig.png')
```

Artificial Neural Networks
Artificial neural networks consist of a graph connecting artificial neurons! Their dynamics are difficult to model, due to loops, etc...
Artificial Feedforward Neural Networks
Use a directed, acyclic graph!

Definition
Let $L, d, N_1, \dots, N_L \in \mathbb{N}$. A map $\Phi : \mathbb{R}^d \to \mathbb{R}^{N_L}$ given by
$$\Phi(x) = A_L(\rho(A_{L-1}(\rho(\dots \rho(A_1(x)))))), \qquad x \in \mathbb{R}^d,$$
is called a neural network. It is composed of affine linear maps $A_\ell : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$, $1 \leq \ell \leq L$ (where $N_0 = d$), and non-linear functions $\rho$ (often referred to as the activation function) acting component-wise. Here, $d$ is the dimension of the input layer, $L$ denotes the number of layers, $N_1, \dots, N_{L-1}$ stand for the dimensions of the $L-1$ hidden layers, and $N_L$ is the dimension of the output layer.

An affine map $A : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$ is given by $x \mapsto Wx + b$ with weight matrix $W \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and bias vector $b \in \mathbb{R}^{N_\ell}$.
Artificial Feedforward Neural Networks: A Python Class

```python
import random
import numpy as np

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        # one bias vector per non-input layer, one weight matrix per pair of layers
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]
```
Figure: Initialization

```python
def feedforward(self, a):
    """Return the output of the network if ``a`` is input."""
    # the returned value is the last pre-activation a0 = W_L a + b_L,
    # so the output layer is affine, matching the definition above
    for b, w in zip(self.biases, self.weights):
        a0 = np.dot(w, a) + b
        a = sigmoid(a0)
    return a0
```
Figure: Feedforward computation

```python
Phi = net.Network([3, 2, 1])   # assuming the class is saved in a module named net
x = np.array([[1], [0.5], [2]])
y = Phi.feedforward(x)
print(y)
```
Figure: Evaluation
Visualizing (Small) Neural Networks
Figure: Thickness represents the size of the weight.

```python
Phi = net.Network([10, 4, 4])
Phi.draw()
```
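The `draw` method is not shown in the lecture's `Network` class; a minimal sketch of such a visualization with matplotlib (all layout choices are our own) could look as follows:

```python
import matplotlib.pyplot as plt

def draw(self):
    # nodes per layer in columns; edge thickness proportional to |weight|
    fig, ax = plt.subplots()
    pos = {(l, j): (l, j - (n - 1) / 2)
           for l, n in enumerate(self.sizes) for j in range(n)}
    for l, W in enumerate(self.weights):
        for j in range(W.shape[0]):        # node j in layer l+1
            for k in range(W.shape[1]):    # node k in layer l
                (x0, y0), (x1, y1) = pos[(l, k)], pos[(l + 1, j)]
                ax.plot([x0, x1], [y0, y1], 'b-', linewidth=abs(W[j, k]))
    for x, y in pos.values():
        ax.plot(x, y, 'ko', markersize=8)
    ax.axis('off')
    plt.show()

Network.draw = draw   # attach the sketch to the class defined above
```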
On the Biological Motivation
Artificial (feedforward) neural networks should not be confused with a model for our brain:
- Neurons are more complicated than simply weighted linear combinations.
- Our brain is not "feedforward".
- Biological neural networks evolve with time (neuronal plasticity).
- ...
Artificial feedforward neural networks constitute a mathematically and computationally convenient, but very simplistic, construct which is inspired by our understanding of how the brain works.
Terminology
"Neural Network Learning": Use neural networks of a fixed "topology" as hypothesis class for regression or classification tasks. This requires optimizing the weights and bias parameters.
"Deep Learning": Neural network learning with neural networks consisting of many (e.g., $\geq 3$) layers.

2.2 Universal Approximation
Approximation Question

Main Approximation Problem
Under which conditions on the activation function $\rho$ can every (continuous, or measurable) function $f : \mathbb{R}^d \to \mathbb{R}^{N_L}$ be arbitrarily well approximated by a neural network, provided that we choose $N_1, \dots, N_{L-1}, L$ large enough?

Surely not for every $\rho$! Suppose that $\rho$ is a polynomial of degree $r$. Then $\rho(Ax)$ is a polynomial of degree $\leq r$ for all affine maps $A$, and therefore any neural network with activation function $\rho$ and a fixed number of layers is itself a polynomial of bounded degree; polynomials of bounded degree cannot approximate every continuous function.
Universal Approximation Theorem

Theorem
Suppose that $\rho : \mathbb{R} \to \mathbb{R}$ is continuous and not a polynomial, and fix $d \geq 1$, $L \geq 2$, $N_L \geq 1$ and a compact subset $K \subset \mathbb{R}^d$. Then for any continuous $f : \mathbb{R}^d \to \mathbb{R}^{N_L}$ and any $\varepsilon > 0$ there exist $N_1, \dots, N_{L-1} \in \mathbb{N}$ and affine linear maps $A_\ell : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$, $1 \leq \ell \leq L$, such that the neural network
$$\Phi(x) = A_L(\rho(A_{L-1}(\rho(\dots \rho(A_1(x)))))), \qquad x \in \mathbb{R}^d,$$
approximates $f$ to within accuracy $\varepsilon$, i.e.,
$$\sup_{x \in K} |f(x) - \Phi(x)| \leq \varepsilon.$$
Neural networks are "universal approximators", and already one hidden layer ($L = 2$) is enough if the number of nodes is sufficiently large!

Proof of the Universal Approximation Theorem
For simplicity we only treat the case of one hidden layer, i.e., $L = 2$, and one output neuron, i.e., $N_L = 1$:
$$\Phi(x) = \sum_{i=1}^{N_1} c_i \, \rho(\langle w_i, x \rangle - b_i), \qquad w_i \in \mathbb{R}^d, \; c_i, b_i \in \mathbb{R}.$$

We will show the following.

Theorem
For $d \in \mathbb{N}$ and $\rho : \mathbb{R} \to \mathbb{R}$ continuous, consider
$$\mathcal{R}(\rho; d) := \operatorname{span}\left\{ \rho(\langle w, x \rangle - b) : w \in \mathbb{R}^d, \, b \in \mathbb{R} \right\}.$$
Then $\mathcal{R}(\rho; d)$ is dense in $C(\mathbb{R}^d)$ if and only if $\rho$ is not a polynomial.
Proof for $d = 1$ and $\rho$ smooth
If $\rho$ is not a polynomial, there exists $x_0 \in \mathbb{R}$ with $\rho^{(k)}(x_0) \neq 0$ for all $k \in \mathbb{N}$.
- Constant functions can be approximated because $\rho(hx + x_0) \to \rho(x_0) \neq 0$ as $h \to 0$.
- Linear functions can be approximated because
$$\frac{1}{h}\big(\rho((\lambda + h)x + x_0) - \rho(\lambda x + x_0)\big) \to x \, \rho'(x_0), \qquad h, \lambda \to 0.$$
- The same argument (with higher-order difference quotients) shows that polynomials in $x$ can be approximated.
- The Stone-Weierstrass Theorem yields the result.
General $d$
Note that the functions
$$\operatorname{span}\{ g(\langle w, x \rangle - b) : w \in \mathbb{R}^d, \, b \in \mathbb{R}, \, g \in C(\mathbb{R}) \text{ arbitrary} \}$$
are dense in $C(\mathbb{R}^d)$ (just take $g$ as $\sin(\langle w, x \rangle)$, $\cos(\langle w, x \rangle)$, exactly as in the Fourier series case). First approximate $f \in C(\mathbb{R}^d)$ by
$$\sum_{i=1}^{N} d_i \, g_i(\langle v_i, x \rangle - e_i), \qquad v_i \in \mathbb{R}^d, \; d_i, e_i \in \mathbb{R}, \; g_i \in C(\mathbb{R}).$$
Then apply our univariate result to approximate the univariate functions $t \mapsto g_i(t - e_i)$ using neural networks.

The case that $\rho$ is nonsmooth
Pick a family $(g_\varepsilon)_{\varepsilon > 0}$ of mollifiers, i.e., $\lim_{\varepsilon \to 0} \rho * g_\varepsilon \to \rho$ uniformly on compacta. Apply the previous result to the smooth function $\rho * g_\varepsilon$ and let $\varepsilon \to 0$. $\square$
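To see the theorem in action, here is a small numerical sketch (our own illustration, not from the lecture): draw random inner weights $w_i$ and biases $b_i$ for a single hidden layer and fit only the outer coefficients $c_i$ by least squares; the sup-norm error on a grid shrinks as $N_1$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)              # target function on K = [-1, 1]
sigma = lambda t: 1 / (1 + np.exp(-t))   # sigmoid activation (not a polynomial)

x = np.linspace(-1, 1, 200)
N1 = 50                                   # number of hidden nodes
w = rng.normal(scale=5, size=N1)          # random weights w_i
b = rng.normal(scale=5, size=N1)          # random biases b_i

# Features rho(w_i x - b_i); Phi(x) = sum_i c_i rho(w_i x - b_i)
A = sigma(np.outer(x, w) - b)             # shape (200, N1)
c, *_ = np.linalg.lstsq(A, f(x), rcond=None)
print(np.max(np.abs(A @ c - f(x))))       # sup-norm error on the grid
```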
2.3 Backpropagation

Regression/Classification with Neural Networks

Neural Network Hypothesis Class
Given $d, L, N_1, \dots, N_L$ and $\rho$, define the associated hypothesis class
$$\mathcal{H}_{d, L, N_1, \dots, N_L, \rho} := \left\{ A_L(\rho(A_{L-1}(\dots \rho(A_1(x))))) : A_\ell : \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell} \text{ affine linear} \right\}.$$

Typical Regression/Classification Task
Given data $z = ((x_i, y_i))_{i=1}^m \subset \mathbb{R}^d \times \mathbb{R}^{N_L}$, find the empirical regression function
$$f_z \in \operatorname*{argmin}_{f \in \mathcal{H}_{d, L, N_1, \dots, N_L, \rho}} \sum_{i=1}^{m} \mathcal{L}(f, x_i, y_i),$$
where $\mathcal{L} : C(\mathbb{R}^d) \times \mathbb{R}^d \times \mathbb{R}^{N_L} \to \mathbb{R}_+$ is the loss function (in least squares problems we have $\mathcal{L}(f, x, y) = |f(x) - y|^2$).
Example: Handwritten Digits
MNIST database for handwritten digit recognition: http://yann.lecun.com/exdb/mnist/
- Every image is given as a $28 \times 28$ matrix $x \in \mathbb{R}^{28 \times 28} \cong \mathbb{R}^{784}$.
- Every label is given as a 10-dimensional vector $y \in \mathbb{R}^{10}$ describing the 'probability' of each digit.
- Given labeled training data $((x_i, y_i))_{i=1}^m \subset \mathbb{R}^{784} \times \mathbb{R}^{10}$.
- Fix the network topology, e.g., the number of layers (for example $L = 3$) and the numbers of neurons ($N_1 = 20$, $N_2 = 20$).
The learning goal is to find the empirical regression function $f_z \in \mathcal{H}_{784, 3, 20, 20, 10, \rho}$. ??? how ??? Non-linear, non-convex!
Gradient Descent: The Simplest Optimization Method

Gradient Descent
The gradient of $F : \mathbb{R}^N \to \mathbb{R}$ is defined by
$$\nabla F(u) = \left( \frac{\partial F(u)}{\partial u_1}, \dots, \frac{\partial F(u)}{\partial u_N} \right)^T.$$
Gradient descent with stepsize $\eta > 0$ is defined by
$$u_{n+1} \leftarrow u_n - \eta \nabla F(u_n).$$
It converges (slowly) to a stationary point of $F$.
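A minimal sketch of this iteration on a toy objective (our own example, not from the lecture):

```python
import numpy as np

def gradient_descent(grad_F, u0, eta, n_steps):
    # u_{n+1} = u_n - eta * grad F(u_n)
    u = u0
    for _ in range(n_steps):
        u = u - eta * grad_F(u)
    return u

# Toy example: F(u) = |u|^2 / 2, so grad F(u) = u; the minimizer is u = 0.
u = gradient_descent(lambda u: u, np.array([1.0, -2.0]), eta=0.1, n_steps=100)
print(u)  # close to [0, 0]
```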
Backprop
In our problem: $F = \sum_{i=1}^m \mathcal{L}(f, x_i, y_i)$ and $u = ((W_\ell, b_\ell))_{\ell=1}^L$. Since
$$\nabla_{((W_\ell, b_\ell))_{\ell=1}^L} F = \sum_{i=1}^m \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i),$$
we need to determine (for $(x, y) \in \mathbb{R}^d \times \mathbb{R}^{N_L}$ fixed)
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial (W_\ell)_{i,j}}, \quad \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_\ell)_{i}}, \qquad \ell = 1, \dots, L.$$
For simplicity suppose that $\mathcal{L}(f, x, y) = (f(x) - y)^2$, so that
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial (W_\ell)_{i,j}} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (W_\ell)_{i,j}}, \qquad \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_\ell)_{i}} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (b_\ell)_{i}}.$$
A worked example with two-dimensional input and two hidden layers:
$$x = \begin{pmatrix} (x)_1 \\ (x)_2 \end{pmatrix}, \qquad a_1 = \rho(z_1) = \rho(W_1 x + b_1), \quad W_1 = \begin{pmatrix} (W_1)_{1,1} & (W_1)_{1,2} \\ (W_1)_{2,1} & (W_1)_{2,2} \\ (W_1)_{3,1} & (W_1)_{3,2} \end{pmatrix}, \quad b_1 = \begin{pmatrix} (b_1)_1 \\ (b_1)_2 \\ (b_1)_3 \end{pmatrix},$$
$$a_2 = \rho(z_2) = \rho(W_2 a_1 + b_2), \quad W_2 = \begin{pmatrix} (W_2)_{1,1} & (W_2)_{1,2} & (W_2)_{1,3} \\ (W_2)_{2,1} & (W_2)_{2,2} & (W_2)_{2,3} \\ (W_2)_{3,1} & (W_2)_{3,2} & (W_2)_{3,3} \end{pmatrix}, \quad b_2 = \begin{pmatrix} (b_2)_1 \\ (b_2)_2 \\ (b_2)_3 \end{pmatrix},$$
$$\Phi(x) = z_3 = W_3 a_2 + b_3, \quad W_3 = \begin{pmatrix} (W_3)_{1,1} & (W_3)_{1,2} & (W_3)_{1,3} \\ (W_3)_{2,1} & (W_3)_{2,2} & (W_3)_{2,3} \end{pmatrix}, \quad b_3 = \begin{pmatrix} (b_3)_1 \\ (b_3)_2 \end{pmatrix}.$$

Then
$$\frac{\partial (z_3)_1}{\partial (W_3)_{1,2}} = \frac{\partial}{\partial (W_3)_{1,2}}\big( (W_3)_{1,1}(a_2)_1 + (W_3)_{1,2}(a_2)_2 + (W_3)_{1,3}(a_2)_3 \big) = (a_2)_2,$$
$$\frac{\partial (z_3)_2}{\partial (W_3)_{1,2}} = \frac{\partial}{\partial (W_3)_{1,2}}\big( (W_3)_{2,1}(a_2)_1 + (W_3)_{2,2}(a_2)_2 + (W_3)_{2,3}(a_2)_3 \big) = 0,$$
$$\frac{\partial (z_3)_k}{\partial (W_3)_{i,j}} = \begin{cases} (a_2)_j & i = k \\ 0 & i \neq k \end{cases}, \qquad \frac{\partial (z_3)_k}{\partial (b_3)_i} = \begin{cases} 1 & i = k \\ 0 & i \neq k \end{cases},$$
$$\frac{\partial \Phi(x)}{\partial W_3} = \begin{pmatrix} (a_2)_1 & (a_2)_2 & (a_2)_3 & 0 & 0 & 0 \\ 0 & 0 & 0 & (a_2)_1 & (a_2)_2 & (a_2)_3 \end{pmatrix}, \qquad \frac{\partial \Phi(x)}{\partial b_3} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
Backprop: Last Layer
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial (W_L)_{i,j}} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (W_L)_{i,j}}, \qquad \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_L)_{i}} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (b_L)_{i}}.$$
Let $f(x) = W_L \, \rho(W_{L-1}(\dots) + b_{L-1}) + b_L$. It follows that
$$\frac{\partial f(x)}{\partial (W_L)_{i,j}} = \big(0, \dots, \underbrace{\rho(W_{L-1}(\dots) + b_{L-1})_j}_{i\text{-th entry}}, \dots, 0\big)^T, \qquad \frac{\partial f(x)}{\partial (b_L)_{i}} = \big(0, \dots, \underbrace{1}_{i\text{-th entry}}, \dots, 0\big)^T,$$
and hence
$$2(f(x) - y)^T \frac{\partial f(x)}{\partial (W_L)_{i,j}} = 2(f(x) - y)_i \, \rho(W_{L-1}(\dots) + b_{L-1})_j, \qquad 2(f(x) - y)^T \frac{\partial f(x)}{\partial (b_L)_{i}} = 2(f(x) - y)_i.$$
In matrix notation, with $z_{L-1} := W_{L-1}(\dots) + b_{L-1}$ and $a_{L-1} := \rho(z_{L-1})$:
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial W_L} = \underbrace{2(f(x) - y)}_{=: \delta_L} \, a_{L-1}^T, \qquad \frac{\partial \mathcal{L}(f, x, y)}{\partial b_L} = 2(f(x) - y).$$
Backprop: Second-to-last Layer
Define $a_{\ell+1} = \rho(z_{\ell+1})$ where $z_{\ell+1} = W_{\ell+1} a_\ell + b_{\ell+1}$, $a_0 = x$, $f(x) = z_L$. We have computed $\frac{\partial \mathcal{L}(f,x,y)}{\partial W_L}$ and $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_L}$. Then, use the chain rule:
$$\frac{\partial \mathcal{L}(f, x, y)}{\partial W_{L-1}} = \frac{\partial \mathcal{L}(f, x, y)}{\partial a_{L-1}} \frac{\partial a_{L-1}}{\partial W_{L-1}} = 2(f(x) - y)^T W_L \frac{\partial a_{L-1}}{\partial W_{L-1}} = 2(f(x) - y)^T W_L \operatorname{diag}(\rho'(z_{L-1})) \frac{\partial z_{L-1}}{\partial W_{L-1}}$$
(the last factor is the same computation as before!)
$$= \underbrace{\operatorname{diag}(\rho'(z_{L-1})) \, W_L^T \, 2(f(x) - y)}_{=: \delta_{L-1}} \, a_{L-2}^T.$$
Similar arguments yield $\frac{\partial \mathcal{L}(f, x, y)}{\partial b_{L-1}} = \delta_{L-1}$.
The Backprop Algorithm
1. Calculate $a_\ell, z_\ell$ for $\ell = 0, \dots, L$ (forward pass).
2. Set $\delta_L = 2(f(x) - y)$.
3. Then $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_L} = \delta_L$ and $\frac{\partial \mathcal{L}(f,x,y)}{\partial W_L} = \delta_L a_{L-1}^T$.
4. For $\ell$ from $L-1$ to $1$ do: $\delta_\ell = \operatorname{diag}(\rho'(z_\ell)) W_{\ell+1}^T \delta_{\ell+1}$. Then $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_\ell} = \delta_\ell$ and $\frac{\partial \mathcal{L}(f,x,y)}{\partial W_\ell} = \delta_\ell a_{\ell-1}^T$.
5. Return $\frac{\partial \mathcal{L}(f,x,y)}{\partial b_\ell}, \frac{\partial \mathcal{L}(f,x,y)}{\partial W_\ell}$, $\ell = 1, \dots, L$.

```python
def backprop(self, x, y):
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # forward pass: store all pre-activations z and activations a
    activations = [x]
    zs = []
    for b, w in zip(self.biases, self.weights):
        zs.append(np.dot(w, activations[-1]) + b)
        activations.append(sigmoid(zs[-1]))
    # delta_L = 2 (f(x) - y); recall f(x) = z_L (affine output layer)
    delta = self.cost_derivative(zs[-1], y)
    nabla_b[-1] = delta
    nabla_w[-1] = np.dot(delta, activations[-2].transpose())
    # backward pass through the hidden layers
    for l in range(2, self.num_layers):
        delta = np.dot(self.weights[-l+1].transpose(), delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta
        nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
    return (nabla_b, nabla_w)
```
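The helpers `cost_derivative` and `sigmoid_prime` are used above but never shown; for the squared loss $\mathcal{L}(f,x,y) = (f(x)-y)^2$ and the sigmoid activation, matching sketches would be:

```python
def cost_derivative(self, output, y):
    # delta_L = 2 (f(x) - y) for the squared loss; here output = z_L = f(x)
    return 2 * (output - y)

def sigmoid_prime(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)), with the componentwise sigmoid above
    return sigmoid(z) * (1 - sigmoid(z))
```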
2.4 Stochastic Gradient Descent

The Complexity of Gradient Descent
Recall that one gradient descent step requires the calculation of
$$\sum_{i=1}^m \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i),$$
and each of the summands requires one backpropagation run. Thus, the total complexity of one gradient descent step is equal to
$$m \cdot \text{complexity(backprop)}.$$
The complexity of backprop is asymptotically equal to the number of degrees of freedom (DOFs) of the network:
$$\text{complexity(backprop)} \sim \sum_{\ell=1}^{L} \left( N_{\ell-1} N_\ell + N_\ell \right).$$

An Example
The ImageNet database consists of ~1.2m images and 1000 categories. AlexNet, a neural network with ~160m DOFs, is one of the most successful annotation methods. One step of gradient descent requires ~$2 \times 10^{14}$ flops (and memory units)!!
Stochastic Gradient Descent (SGD)
Approximate
$$\sum_{i=1}^m \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i)$$
by $\nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i)$ for some $i$ chosen uniformly at random from $\{1, \dots, m\}$. In expectation we have
$$\mathbb{E}\, \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i) = \frac{1}{m} \sum_{i=1}^m \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i).$$

The SGD Algorithm
Goal: Find a stationary point of the function $F = \sum_{i=1}^m F_i : \mathbb{R}^N \to \mathbb{R}$.
1. Set starting value $u_0$ and $n = 0$.
2. While (error is large) do: pick $i \in \{1, \dots, m\}$ uniformly at random; update $u_{n+1} = u_n - \eta \nabla F_i(u_n)$; set $n = n + 1$.
3. Return $u_n$.
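A sketch of this loop on a toy objective (our own example; the stopping criterion is simplified to a fixed number of steps):

```python
import random
import numpy as np

def sgd(grad_F_i, m, u0, eta, n_steps):
    # pick i uniformly at random and update u_{n+1} = u_n - eta * grad F_i(u_n)
    u = u0
    for _ in range(n_steps):
        i = random.randrange(m)
        u = u - eta * grad_F_i(i, u)
    return u

# Toy example: F_i(u) = |u - c_i|^2 / 2, so grad F_i(u) = u - c_i;
# the stationary point of F = sum_i F_i is the mean of the c_i.
c = np.array([0.0, 1.0, 2.0, 3.0])
u = sgd(lambda i, u: u - c[i], m=4, u0=0.0, eta=0.01, n_steps=5000)
print(u)  # fluctuates around the mean 1.5
```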
Typical Behavior
Figure: Comparison between GD and SGD, where $m$ steps of SGD are counted as one iteration. Initially very fast convergence, followed by stagnation!

Minibatch SGD
For every $\{i_1, \dots, i_K\} \subset \{1, \dots, m\}$ chosen uniformly at random, it holds that
$$\mathbb{E}\, \frac{1}{K} \sum_{l=1}^{K} \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_{i_l}, y_{i_l}) = \frac{1}{m} \sum_{i=1}^m \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i),$$
i.e., we have an unbiased estimator for the gradient. $K = 1$ gives SGD; $K > 1$ gives minibatch SGD with batch size $K$.
Some Heuristics
The sample mean $\frac{1}{K} \sum_{l=1}^{K} \nabla \mathcal{L}(f, x_{i_l}, y_{i_l})$ is itself a random variable with expected value $\frac{1}{m} \sum_{i=1}^m \nabla \mathcal{L}(f, x_i, y_i)$. To assess the deviation of the sample mean from its expected value we may compute its standard deviation $\sigma / \sqrt{K}$, where $\sigma$ is the standard deviation of $i \mapsto \nabla \mathcal{L}(f, x_i, y_i)$. Increasing the batch size by a factor 100 therefore improves the standard deviation only by a factor 10, while the complexity increases by a factor 100! Common batch size for large models: $K = 16, 32$.
```python
def SGD(self, training_data, epochs, mini_batch_size, eta):
    n = len(training_data)
    for j in range(epochs):
        # reshuffle the data and split it into minibatches once per epoch
        random.shuffle(training_data)
        mini_batches = [training_data[k:k+mini_batch_size]
                        for k in range(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, eta)
```

```python
def update_mini_batch(self, mini_batch, eta):
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    # accumulate the gradient over the minibatch via backprop
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    # gradient step with the averaged minibatch gradient
    self.weights = [w - (eta / len(mini_batch)) * nw
                    for w, nw in zip(self.weights, nabla_w)]
    self.biases = [b - (eta / len(mini_batch)) * nb
                   for b, nb in zip(self.biases, nabla_b)]
```
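Putting the pieces together for MNIST (a sketch: `load_mnist` is a hypothetical loader returning lists of pairs (x, y) with shapes (784, 1) and (10, 1)), matching the parameters in the results reported below:

```python
training_data, test_data = load_mnist()   # hypothetical loader, not shown here
net = Network([784, 30, 10])
net.SGD(training_data, epochs=30, mini_batch_size=10, eta=3.0)
```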
2.5 Summary

The Basic Neural Network Recipe for Learning
1. Neuro-inspired model
2. Backprop
3. Minibatch SGD
Now let's try classifying handwritten digits!

Results
30 epochs, learning rate $\eta = 3.0$, minibatch size $K = 10$, network size $[784, 30, 10]$. Training set size $m = 50000$, test set size $10000$. Classification accuracy: 95-96%.
Network size $[784, 30, 30, 10]$: classification accuracy 96-97%.
Network size $[784, 30, 30, 30, 10]$: classification accuracy 95-96%.
Deep learning might not help after all...

2.6 Going Deep (?)

Problems with Deep Networks
- Overfitting (as usual...)
- Vanishing/Exploding Gradient Problem
Dealing with Overfitting: Regularization
Rather than minimizing $\sum_{i=1}^m \mathcal{L}(f, x_i, y_i)$, minimize
$$\sum_{i=1}^m \mathcal{L}(f, x_i, y_i) + \mathcal{P}\big((W_\ell)_{\ell=1}^L\big), \qquad \text{for example} \quad \mathcal{P}\big((W_\ell)_{\ell=1}^L\big) = \sum_{\ell, i, j} |(W_\ell)_{i,j}|^p.$$
The gradient update has to be augmented by
$$\frac{\partial}{\partial (W_\ell)_{i,j}} \mathcal{P}\big((W_\ell)_{\ell=1}^L\big) = p \, |(W_\ell)_{i,j}|^{p-1} \operatorname{sgn}\big((W_\ell)_{i,j}\big).$$
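In code, the penalty only adds a term to the weight gradient; a minimal sketch (the function name `penalty_grad` is our own):

```python
import numpy as np

def penalty_grad(W, p):
    # d/dW_ij of sum_ij |W_ij|^p = p |W_ij|^(p-1) sgn(W_ij)
    return p * np.abs(W) ** (p - 1) * np.sign(W)

# For p = 2 this is just 2W (weight decay): each gradient step shrinks the weights.
W = np.array([[0.5, -1.0], [2.0, 0.0]])
print(penalty_grad(W, p=2))   # equals 2 * W
```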
Sparsity-Promoting Regularization
Since
$$\lim_{p \to 0} \sum_{\ell, i, j} |(W_\ell)_{i,j}|^p = \#\{\text{nonzero weights}\},$$
regularization with $p \leq 1$ promotes sparse connectivity (and hence small memory requirements)!

Dropout
During each feedforward/backprop step, drop nodes with probability $p$. After training, multiply all weights with $p$. The final output is an "average" over many sparse network models.
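A minimal sketch of the training-time forward pass with dropout (our own addition; `keep` denotes the probability that a node is retained, the convention under which trained weights are later rescaled by the retention probability):

```python
import numpy as np

def feedforward_dropout(self, a, keep):
    # hidden layers: usual affine map + sigmoid, then randomly
    # zero out each node with probability 1 - keep
    for b, w in zip(self.biases[:-1], self.weights[:-1]):
        a = sigmoid(np.dot(w, a) + b)
        a = a * (np.random.rand(*a.shape) < keep)
    # affine output layer, as in feedforward above
    return np.dot(self.weights[-1], a) + self.biases[-1]

Network.feedforward_dropout = feedforward_dropout   # attach to the class above
```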
The Vanishing Gradient Problem
Figure: An extremely deep network,
$$\Phi(x) = w_5 \, \sigma(w_4 \, \sigma(w_3 \, \sigma(w_2 \, \sigma(w_1 x + b_1) + b_2) + b_3) + b_4) + b_5.$$
Here
$$\frac{\partial \Phi(x)}{\partial b_1} = \prod_{\ell=2}^{L} w_\ell \prod_{\ell=1}^{L-1} \sigma'(z_\ell).$$
If $\sigma(x) = \frac{1}{1+e^{-x}}$, it holds that $|\sigma'(x)| \leq 2 e^{-|x|}$, and thus
$$\left| \frac{\partial \Phi(x)}{\partial b_1} \right| \leq 2^{L-1} \prod_{\ell=2}^{L} |w_\ell| \; e^{-\sum_{\ell=1}^{L-1} |z_\ell|}.$$
Bottom layers will learn much slower than top layers and not contribute to learning. Depth is a nuisance!?

Dealing with the Vanishing Gradient Problem
Use an activation function with a 'large' gradient.

ReLU
The Rectified Linear Unit is defined as
$$\mathrm{ReLU}(x) := \begin{cases} x & x \geq 0 \\ 0 & \text{else.} \end{cases}$$

Exercise: Implement Dropout.
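As a starting point for experiments with the ReLU, a sketch of ReLU and its derivative as drop-in replacements for `sigmoid` and `sigmoid_prime` above (our own addition):

```python
import numpy as np

def relu(z):
    # ReLU(z) = z for z >= 0, 0 else
    return np.maximum(z, 0)

def relu_prime(z):
    # derivative is 1 for z > 0 and 0 for z < 0 (any value may be chosen at 0)
    return (z > 0).astype(float)
```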