
2. Neural Networks

2.1 Motivation and Definition

Which Method to Choose?
We have seen linear regression, kernel regression, regularization, K-PCA, K-SVM, ... and there exist a zillion other methods. Is there a universally best method?

No Free Lunch Theorem
"No Free Lunch" Theorem [Wolpert (1996)], informal version: there is no universally best learning method.
Of course not!!

"Proof" of the "Theorem": If the underlying function is completely arbitrary and nothing is known about it, we cannot possibly infer anything about it from the samples $((x_i, y_i))_{i=1}^m$. Every algorithm will have a specific preference, for example as specified through the hypothesis class $\mathcal{H}$: all "categories" are artificial!

s"arearti cial!Wewantouralgorithmto
s"arearti cial!Wewantouralgorithmtoreproducethearti cialcategoriesproducedbyourbrain{solet'sbuildahypothesisclassthatmimicksourthinking!NeuroscienceTheBrainasBiologicalNeuralNetwork\Inneuroscience,abiologicalneuralnetworkisaseriesofinterconnectedneuronswho

seactivationde nesarecognizablelinea
seactivationde nesarecognizablelinearpathway.Theinterfacethroughwhichneuronsinteractwiththeirneighborsusuallyconsistsofseveralaxonterminalsconnectedviasynapsestodendritesonotherneurons.Ifthesumoftheinputsignalsintooneneuronsurpassesacertainthreshold,theneuronsends

anactionpotential(AP)attheaxonhillockand
anactionpotential(AP)attheaxonhillockandtransmitsthiselectricalsignalalongtheaxon."Source:WikipediaNeuronsrecall:\Ifthesumoftheinputsignalsintooneneuronsurpassesacertainthreshold,[...]theneurontransmitsthis[...]signal[...]."Arti cialNeuronsx2w2n1Pixiwi

Artificial Neurons
Figure: An artificial neuron: inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ and threshold $b$; the output $y$ is obtained by applying the activation to $\sum_i x_i w_i - b$ (one branch for $\sum_i x_i w_i - b \ge 0$, one for $< 0$).

Artificial Neuron
An artificial neuron with weights $w_1, \dots, w_s$, bias $b$ and activation function $\sigma \colon \mathbb{R} \to \mathbb{R}$ is defined as the function
\[
  f(x_1, \dots, x_s) = \sigma\Big(\sum_{i=1}^s x_i w_i - b\Big).
\]
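As a quick numerical illustration of this definition (my own sketch, not from the slides), here is a minimal numpy example of a single artificial neuron; the weights, bias and input are made-up values, and the sigmoid activation anticipates the next slide.

import numpy as np

def sigmoid(t):
    # logistic activation, sigma(t) = 1 / (1 + e^{-t})
    return 1.0 / (1.0 + np.exp(-t))

w = np.array([0.2, -0.5, 1.0])   # weights w_1, ..., w_s
b = 0.1                          # bias / threshold
x = np.array([1.0, 0.5, 2.0])    # inputs x_1, ..., x_s

y = sigmoid(np.dot(x, w) - b)    # f(x) = sigma(sum_i x_i w_i - b)
print(y)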

Activation Functions
Figure: Heaviside activation function (as in the biological motivation).
Figure: Sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$.

From now on we use Python!

import math
import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    # element-wise logistic function, returned as a list (used here only for plotting)
    a = []
    for item in x:
        a.append(1 / (1 + math.exp(-item)))
    return a

x = np.arange(-10., 10., 0.2)
sig = sigmoid(x)

plt.plot(x, sig)
plt.savefig('sig.png')

Artificial Neural Networks
Artificial neural networks consist of a graph connecting artificial neurons! The dynamics are difficult to model, due to loops, etc.

Artificial Feedforward Neural Networks
Use a directed, acyclic graph!

Artificial Feedforward Neural Networks
Definition. Let $L, d, N_1, \dots, N_L \in \mathbb{N}$. A map $\Phi \colon \mathbb{R}^d \to \mathbb{R}^{N_L}$ given by
\[
  \Phi(x) = A_L\big(\sigma(A_{L-1}(\sigma(\dots \sigma(A_1(x)))))\big), \qquad x \in \mathbb{R}^d,
\]
is called a neural network. It is composed of affine linear maps $A_\ell \colon \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$, $1 \le \ell \le L$ (where $N_0 = d$), and non-linear functions $\sigma$, often referred to as the activation function, acting component-wise. Here, $d$ is the dimension of the input layer, $L$ denotes the number of layers, $N_1, \dots, N_{L-1}$ stand for the dimensions of the $L-1$ hidden layers, and $N_L$ is the dimension of the output layer.
An affine map $A_\ell \colon \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$ is given by $x \mapsto W_\ell x + b_\ell$ with weight matrix $W_\ell \in \mathbb{R}^{N_\ell \times N_{\ell-1}}$ and bias vector $b_\ell \in \mathbb{R}^{N_\ell}$.

Artificial Feedforward Neural Networks: A Python Class

import random
import numpy as np

class Network(object):

    def __init__(self, sizes):
        # sizes = [N_0, N_1, ..., N_L]: number of neurons per layer
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

Figure: Initialization

A Python Class

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a0 = np.dot(w, a) + b      # affine map: z = W a + b
            a = sigmoid(a0)            # component-wise activation
        return a0                      # no activation after the last layer

Figure: Feedforward Computation

Phi = net.Network([3, 2, 1])
x = np.array([[1], [0.5], [2]])
y = Phi.feedforward(x)
print(y)

Figure: Evaluation
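Note that feedforward applies sigmoid component-wise to numpy arrays, so in practice a vectorized sigmoid is needed rather than the list-returning plotting version above; the backprop listing below also uses its derivative sigmoid_prime. A minimal sketch of both (my addition, not from the slides):

def sigmoid(z):
    # vectorized logistic function, acts component-wise on numpy arrays
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative sigma'(z) = sigma(z) * (1 - sigma(z))
    return sigmoid(z) * (1.0 - sigmoid(z))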

Visualizing (Small) Neural Networks

Phi = net.Network([10, 4, 4])
Phi.draw()

Figure: Thickness represents the size of the weight.

On the Biological Motivation
Artificial (feedforward) neural networks should not be confused with a model for our brain:
- Neurons are more complicated than simply weighted linear combinations.
- Our brain is not "feedforward".
- Biological neural networks evolve with time (neuronal plasticity) ...
Artificial feedforward neural networks constitute a mathematically and computationally convenient but very simplistic construct which is inspired by our understanding of how the brain works.

Terminology
"Neural Network Learning": Use neural networks of a fixed "topology" as hypothesis class for regression or classification tasks. This requires optimizing the weight and bias parameters.
"Deep Learning": Neural network learning with neural networks consisting of many (e.g., $\ge 3$) layers.

2.2 Universal Approximation

Approximation Question
Main Approximation Problem: Under which conditions on the activation function $\sigma$ can every (continuous, or measurable) function $f \colon \mathbb{R}^d \to \mathbb{R}^{N_L}$ be arbitrarily well approximated by a neural network, provided that we choose $N_1, \dots, N_{L-1}, L$ large enough?

Surely not for every $\sigma$! Suppose that $\sigma$ is a polynomial of degree $r$. Then $\sigma(Ax)$ is a polynomial of degree $\le r$ for every affine map $A$, and therefore any neural network with activation function $\sigma$ is itself a polynomial of bounded degree.

Universal Approximation Theorem
Theorem. Suppose that $\sigma \colon \mathbb{R} \to \mathbb{R}$ is continuous and not a polynomial, and fix $d \ge 1$, $L \ge 2$, $N_L \ge 1$ and a compact subset $K \subset \mathbb{R}^d$. Then for every continuous $f \colon \mathbb{R}^d \to \mathbb{R}^{N_L}$ and every $\varepsilon > 0$ there exist $N_1, \dots, N_{L-1} \in \mathbb{N}$ and affine linear maps $A_\ell \colon \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell}$, $1 \le \ell \le L$, such that the neural network
\[
  \Phi(x) = A_L\big(\sigma(A_{L-1}(\sigma(\dots \sigma(A_1(x)))))\big), \qquad x \in \mathbb{R}^d,
\]
approximates $f$ to within accuracy $\varepsilon$, i.e.,
\[
  \sup_{x \in K} |f(x) - \Phi(x)| \le \varepsilon.
\]

Neural networks are "universal approximators", and already one hidden layer ($L = 2$) is enough if the number of nodes is sufficiently large!

Proof of the Universal Approximation Theorem
For simplicity we only treat the case of one hidden layer, i.e., $L = 2$, and one output neuron, i.e., $N_L = 1$:
\[
  \Phi(x) = \sum_{i=1}^{N_1} c_i \, \sigma(w_i \cdot x - b_i), \qquad w_i \in \mathbb{R}^d, \; c_i, b_i \in \mathbb{R}.
\]

We will show the following.
Theorem. For $d \in \mathbb{N}$ and $\sigma \colon \mathbb{R} \to \mathbb{R}$ continuous, consider
\[
  \mathcal{R}(\sigma, d) := \operatorname{span}\big\{ \sigma(w \cdot x - b) : w \in \mathbb{R}^d, \; b \in \mathbb{R} \big\}.
\]
Then $\mathcal{R}(\sigma, d)$ is dense in $C(\mathbb{R}^d)$ if and only if $\sigma$ is not a polynomial.

Proof for $d = 1$ and $\sigma$ smooth:
- If $\sigma$ is not a polynomial, there exists $x_0 \in \mathbb{R}$ with $\sigma^{(k)}(-x_0) \ne 0$ for all $k \in \mathbb{N}$.
- Constant functions can be approximated because $\sigma(hx - x_0) \to \sigma(-x_0) \ne 0$ as $h \to 0$.
- Linear functions can be approximated because
  \[
    \frac{1}{h}\big(\sigma((\lambda + h)x - x_0) - \sigma(\lambda x - x_0)\big) \to x\,\sigma'(-x_0), \qquad h, \lambda \to 0.
  \]
- The same argument shows that all polynomials in $x$ can be approximated.
- The Stone-Weierstrass Theorem yields the result.

General $d$: Note that the functions
\[
  \operatorname{span}\big\{ g(w \cdot x - b) : w \in \mathbb{R}^d, \; b \in \mathbb{R}, \; g \in C(\mathbb{R}) \text{ arbitrary} \big\}
\]
are dense in $C(\mathbb{R}^d)$ (just take $g$ as $\sin(w \cdot x)$, $\cos(w \cdot x)$, as in the Fourier series case). First approximate $f \in C(\mathbb{R}^d)$ by
\[
  \sum_{i=1}^{N} d_i \, g_i(v_i \cdot x - e_i), \qquad v_i \in \mathbb{R}^d, \; d_i, e_i \in \mathbb{R}, \; g_i \in C(\mathbb{R}).
\]
Then apply the univariate result to approximate the univariate functions $t \mapsto g_i(t - e_i)$ by neural networks.

The case that $\sigma$ is non-smooth: pick a family $(g_\varepsilon)_{\varepsilon > 0}$ of mollifiers, i.e., $g_\varepsilon * \sigma \to \sigma$ uniformly on compacta as $\varepsilon \to 0$. Apply the previous result to the smooth function $g_\varepsilon * \sigma$ and let $\varepsilon \to 0$.
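To see the one-hidden-layer statement numerically, here is a small sketch (my own illustration, not from the slides) that approximates $f(x) = \sin(x)$ on $[-\pi, \pi]$ by $\Phi(x) = \sum_i c_i \sigma(w_i x - b_i)$: the inner weights and biases are drawn at random, and only the outer coefficients $c_i$ are fitted by least squares.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N1 = 50                                   # number of hidden neurons
w = rng.normal(scale=3.0, size=N1)        # inner weights w_i (random)
b = rng.uniform(-3.0, 3.0, size=N1)       # inner biases b_i (random)

x = np.linspace(-np.pi, np.pi, 200)
f = np.sin(x)                             # target function

A = sigmoid(np.outer(x, w) - b)           # A[k, i] = sigma(w_i * x_k - b_i)
c, *_ = np.linalg.lstsq(A, f, rcond=None) # outer coefficients c_i by least squares
phi = A @ c                               # Phi(x_k) = sum_i c_i sigma(w_i x_k - b_i)

print("max error:", np.max(np.abs(f - phi)))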

2.3 Backpropagation

Regression/Classification with Neural Networks
Neural Network Hypothesis Class: Given $d, L, N_1, \dots, N_L$ and $\sigma$, define the associated hypothesis class
\[
  \mathcal{H}_{d, L, N_1, \dots, N_L, \sigma} := \big\{ A_L(\sigma(A_{L-1}(\sigma(\dots \sigma(A_1(x)))))) \;:\; A_\ell \colon \mathbb{R}^{N_{\ell-1}} \to \mathbb{R}^{N_\ell} \text{ affine linear} \big\}.
\]

Typical Regression/Classification Task: Given data $z = ((x_i, y_i))_{i=1}^m \subset \mathbb{R}^d \times \mathbb{R}^{N_L}$, find the empirical regression function
\[
  f_z \in \operatorname*{argmin}_{f \in \mathcal{H}_{d, L, N_1, \dots, N_L, \sigma}} \sum_{i=1}^m \mathcal{L}(f, x_i, y_i),
\]
where $\mathcal{L} \colon C(\mathbb{R}^d) \times \mathbb{R}^d \times \mathbb{R}^{N_L} \to \mathbb{R}_+$ is the loss function (in least squares problems we have $\mathcal{L}(f, x, y) = |f(x) - y|^2$).

Example: Handwritten Digits
MNIST database for handwritten digit recognition: http://yann.lecun.com/exdb/mnist/
Every image is given as a $28 \times 28$ matrix $x \in \mathbb{R}^{28 \times 28} \cong \mathbb{R}^{784}$. Every label is given as a 10-dimensional vector $y \in \mathbb{R}^{10}$ describing the `probability' of each digit.
Given labeled training data $(x_i, y_i)_{i=1}^m \subset \mathbb{R}^{784} \times \mathbb{R}^{10}$.
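As a concrete illustration of this data format (my own sketch, not from the slides; the arrays images and digits are made-up stand-ins for whatever data source one actually uses), images are flattened to 784-dimensional column vectors and the integer labels are one-hot encoded into 10-dimensional `probability' vectors:

import numpy as np

# made-up stand-ins: two 28x28 images and their integer labels
images = np.random.rand(2, 28, 28)
digits = np.array([3, 7])

# flatten each image to a column vector in R^784
xs = [img.reshape(784, 1) for img in images]

# one-hot encode each label as a vector in R^10
ys = []
for d in digits:
    y = np.zeros((10, 1))
    y[d] = 1.0
    ys.append(y)

training_data = list(zip(xs, ys))   # (x, y) pairs as expected by the SGD code below
print(xs[0].shape, ys[0].T)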

Fix the network topology, e.g., the number of layers (for example $L = 3$) and the numbers of neurons ($N_1 = 20$, $N_2 = 20$). The learning goal is to find the empirical regression function $f_z \in \mathcal{H}_{784, 3, 20, 20, 10, \sigma}$.
??? how ??? The problem is non-linear and non-convex!

Gradient Descent: The Simplest Optimization Method
The gradient of $F \colon \mathbb{R}^N \to \mathbb{R}$ is defined by
\[
  \nabla F(u) = \Big( \frac{\partial F(u)}{\partial (u)_1}, \dots, \frac{\partial F(u)}{\partial (u)_N} \Big)^T.
\]
Gradient descent with stepsize $\eta > 0$ is defined by
\[
  u_{n+1} \leftarrow u_n - \eta \nabla F(u_n).
\]
It converges (slowly) to a stationary point of $F$.
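A minimal sketch of this iteration on a toy objective (my own example; the quadratic F, the stepsize and the iteration count are arbitrary choices):

import numpy as np

def F(u):
    return 0.5 * np.sum(u**2)      # toy objective with minimum at u = 0

def grad_F(u):
    return u                       # its gradient

eta = 0.1                          # stepsize
u = np.array([3.0, -2.0])          # starting value u_0
for n in range(100):
    u = u - eta * grad_F(u)        # u_{n+1} = u_n - eta * grad F(u_n)

print(u, F(u))                     # close to the stationary point 0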

Backprop
In our problem, $F = \sum_{i=1}^m \mathcal{L}(f, x_i, y_i)$ and $u = ((W_\ell, b_\ell))_{\ell=1}^L$. Since
\[
  \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} F = \sum_{i=1}^m \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i),
\]
we need to determine (for $(x, y) \in \mathbb{R}^d \times \mathbb{R}^{N_L}$ fixed)
\[
  \frac{\partial \mathcal{L}(f, x, y)}{\partial (W_\ell)_{i,j}}, \qquad \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_\ell)_i}, \qquad \ell = 1, \dots, L.
\]
For simplicity suppose that $\mathcal{L}(f, x, y) = (f(x) - y)^2$, so that
\[
  \frac{\partial \mathcal{L}(f, x, y)}{\partial (W_\ell)_{i,j}} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (W_\ell)_{i,j}}, \qquad
  \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_\ell)_i} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (b_\ell)_i}.
\]

Example (a small network with two inputs, two hidden layers of three neurons each, and two outputs):
\[
  x = \big((x)_1, (x)_2\big)^T, \qquad
  a_1 = \sigma(z_1) = \sigma(W_1 x + b_1), \quad W_1 \in \mathbb{R}^{3 \times 2}, \; b_1 \in \mathbb{R}^3,
\]
\[
  a_2 = \sigma(z_2) = \sigma(W_2 a_1 + b_2), \quad W_2 \in \mathbb{R}^{3 \times 3}, \; b_2 \in \mathbb{R}^3, \qquad
  \Phi(x) = z_3 = W_3 a_2 + b_3, \quad W_3 \in \mathbb{R}^{2 \times 3}, \; b_3 \in \mathbb{R}^2.
\]
Then, for instance,
\[
  \frac{\partial (z_3)_1}{\partial (W_3)_{1,2}} = \frac{\partial}{\partial (W_3)_{1,2}}\big((W_3)_{1,1}(a_2)_1 + (W_3)_{1,2}(a_2)_2 + (W_3)_{1,3}(a_2)_3\big) = (a_2)_2,
\]
\[
  \frac{\partial (z_3)_2}{\partial (W_3)_{1,2}} = \frac{\partial}{\partial (W_3)_{1,2}}\big((W_3)_{2,1}(a_2)_1 + (W_3)_{2,2}(a_2)_2 + (W_3)_{2,3}(a_2)_3\big) = 0,
\]
and in general
\[
  \frac{\partial (z_3)_k}{\partial (W_3)_{i,j}} = \begin{cases} (a_2)_j & i = k \\ 0 & i \ne k \end{cases}, \qquad
  \frac{\partial (z_3)_k}{\partial (b_3)_i} = \begin{cases} 1 & i = k \\ 0 & i \ne k \end{cases}.
\]
Collecting these entries gives
\[
  \frac{\partial \Phi(x)}{\partial W_3} =
  \begin{pmatrix}
    (a_2)_1 & 0 \\ (a_2)_2 & 0 \\ (a_2)_3 & 0 \\ 0 & (a_2)_1 \\ 0 & (a_2)_2 \\ 0 & (a_2)_3
  \end{pmatrix}, \qquad
  \frac{\partial \Phi(x)}{\partial b_3} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},
\]
where the columns collect the derivatives of $(z_3)_1$ and $(z_3)_2$ with respect to the entries of $W_3$ (listed row-wise) and of $b_3$.

Backprop: Last Layer
Recall
\[
  \frac{\partial \mathcal{L}(f, x, y)}{\partial (W_L)_{i,j}} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (W_L)_{i,j}}, \qquad
  \frac{\partial \mathcal{L}(f, x, y)}{\partial (b_L)_i} = 2(f(x) - y)^T \frac{\partial f(x)}{\partial (b_L)_i}.
\]
Let $f(x) = W_L \, \sigma(W_{L-1}(\dots) + b_{L-1}) + b_L$. It follows that
\[
  \frac{\partial f(x)}{\partial (W_L)_{i,j}} = \big(0, \dots, \underbrace{\sigma(W_{L-1}(\dots) + b_{L-1})_j}_{\text{position } i}, \dots, 0\big)^T, \qquad
  \frac{\partial f(x)}{\partial (b_L)_i} = \big(0, \dots, \underbrace{1}_{\text{position } i}, \dots, 0\big)^T,
\]
and hence
\[
  2(f(x) - y)^T \frac{\partial f(x)}{\partial (W_L)_{i,j}} = 2(f(x) - y)_i \, \sigma(W_{L-1}(\dots) + b_{L-1})_j, \qquad
  2(f(x) - y)^T \frac{\partial f(x)}{\partial (b_L)_i} = 2(f(x) - y)_i.
\]
In matrix notation, with $z_{L-1} := W_{L-1}(\dots) + b_{L-1}$, $a_{L-1} := \sigma(z_{L-1})$ and $\delta_L := 2(f(x) - y)$,
\[
  \frac{\partial \mathcal{L}(f, x, y)}{\partial W_L} = \delta_L \, a_{L-1}^T, \qquad
  \frac{\partial \mathcal{L}(f, x, y)}{\partial b_L} = \delta_L.
\]

Backprop: Second-to-last Layer
Define $a_{\ell+1} = \sigma(z_{\ell+1})$ where $z_{\ell+1} = W_{\ell+1} a_\ell + b_{\ell+1}$, $a_0 = x$, $f(x) = z_L$. We have computed $\partial \mathcal{L}(f, x, y) / \partial W_L$ and $\partial \mathcal{L}(f, x, y) / \partial b_L$. Then, use the chain rule:

\[
  \frac{\partial \mathcal{L}(f, x, y)}{\partial W_{L-1}}
  = \frac{\partial \mathcal{L}(f, x, y)}{\partial a_{L-1}} \, \frac{\partial a_{L-1}}{\partial W_{L-1}}
  = 2(f(x) - y)^T W_L \, \frac{\partial a_{L-1}}{\partial W_{L-1}}
  = 2(f(x) - y)^T W_L \, \mathrm{diag}(\sigma'(z_{L-1})) \, \frac{\partial z_{L-1}}{\partial W_{L-1}}
\]
(same structure as before!), which gives
\[
  \frac{\partial \mathcal{L}(f, x, y)}{\partial W_{L-1}}
  = \underbrace{\mathrm{diag}(\sigma'(z_{L-1})) \, W_L^T \, 2(f(x) - y)}_{=: \, \delta_{L-1}} \; a_{L-2}^T.
\]
Similar arguments yield $\partial \mathcal{L}(f, x, y) / \partial b_{L-1} = \delta_{L-1}$.

The Backprop Algorithm
1. Calculate $a_\ell, z_\ell$ for $\ell = 0, \dots, L$ (forward pass).
2. Set $\delta_L = 2(f(x) - y)$.
3. Then $\partial \mathcal{L}(f, x, y) / \partial b_L = \delta_L$ and $\partial \mathcal{L}(f, x, y) / \partial W_L = \delta_L \, a_{L-1}^T$.
4. For $\ell$ from $L-1$ down to $1$ do: $\delta_\ell = \mathrm{diag}(\sigma'(z_\ell)) \, W_{\ell+1}^T \, \delta_{\ell+1}$; then $\partial \mathcal{L}(f, x, y) / \partial b_\ell = \delta_\ell$ and $\partial \mathcal{L}(f, x, y) / \partial W_\ell = \delta_\ell \, a_{\ell-1}^T$.
5. Return $\partial \mathcal{L}(f, x, y) / \partial b_\ell$, $\partial \mathcal{L}(f, x, y) / \partial W_\ell$, $\ell = 1, \dots, L$.

    def backprop(self, x, y):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # forward pass: store all z_l and a_l
        activations = [x]
        zs = []
        for b, w in zip(self.biases, self.weights):
            zs.append(np.dot(w, activations[-1]) + b)
            activations.append(sigmoid(zs[-1]))
        # delta_L and the gradients of the last layer
        delta = self.cost_derivative(zs[-1], y)
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # backward pass through the remaining layers
        for l in range(2, self.num_layers):
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sigmoid_prime(zs[-l])
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)
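One way to convince yourself that such a backprop implementation is correct (my own suggestion, not part of the slides) is a finite-difference gradient check: perturb a single weight and compare the difference quotient of the loss against the corresponding backprop entry. A minimal sketch, assuming a Network instance net with the methods above and a cost_derivative that returns 2 * (output - y), matching delta_L = 2(f(x) - y):

import numpy as np

def loss(net, x, y):
    # squared-error loss |f(x) - y|^2 with f = net.feedforward
    return np.sum((net.feedforward(x) - y) ** 2)

def gradient_check(net, x, y, l=0, i=0, j=0, h=1e-6):
    # compare backprop's dL/d(W_l)_{i,j} with a central difference quotient
    nabla_b, nabla_w = net.backprop(x, y)
    net.weights[l][i, j] += h
    loss_plus = loss(net, x, y)
    net.weights[l][i, j] -= 2 * h
    loss_minus = loss(net, x, y)
    net.weights[l][i, j] += h            # restore the original weight
    numeric = (loss_plus - loss_minus) / (2 * h)
    print(numeric, nabla_w[l][i, j])     # the two values should nearly agree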

2.4 Stochastic Gradient Descent

The Complexity of Gradient Descent
Recall that one gradient descent step requires the calculation of
\[
  \sum_{i=1}^m \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i),
\]
and each of the summands requires one backpropagation run. Thus, the total complexity of one gradient descent step is equal to
\[
  m \cdot \mathrm{complexity}(\mathrm{backprop}).
\]
The complexity of backprop is asymptotically equal to the number of degrees of freedom (DOFs) of the network:
\[
  \mathrm{complexity}(\mathrm{backprop}) \asymp \sum_{\ell=1}^L \big( N_{\ell-1} N_\ell + N_\ell \big).
\]

An Example
The ImageNet database consists of 1.2m images and 1000 categories. AlexNet, a neural network with 160m DOFs, is one of the most successful annotation methods. One step of gradient descent requires roughly $2 \cdot 10^{14}$ flops (and memory units)!!
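A tiny helper (my addition, not from the slides) that counts the degrees of freedom of a network from its layer sizes [N_0, ..., N_L]:

def count_dofs(sizes):
    # sum over layers of (N_{l-1} * N_l) weights plus N_l biases
    return sum(n_prev * n + n for n_prev, n in zip(sizes[:-1], sizes[1:]))

print(count_dofs([784, 30, 10]))   # DOFs of the MNIST network used below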

Stochastic Gradient Descent (SGD)
Approximate
\[
  \sum_{i=1}^m \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i)
\]
by $\nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i)$ for a single index $i$ chosen uniformly at random from $\{1, \dots, m\}$. In expectation we have
\[
  \mathbb{E}\Big[ \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i) \Big]
  = \frac{1}{m} \sum_{i=1}^m \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i).
\]

The SGD Algorithm
Goal: Find a stationary point of the function $F = \sum_{i=1}^m F_i \colon \mathbb{R}^N \to \mathbb{R}$.
1. Set the starting value $u_0$ and $n = 0$.
2. While (the error is large) do: pick $i \in \{1, \dots, m\}$ uniformly at random; update $u_{n+1} = u_n - \eta \nabla F_i(u_n)$; set $n = n + 1$.
3. Return $u_n$.

Typical Behavior
Figure: Comparison between GD and SGD, where $m$ steps of SGD are counted as one iteration. Initially very fast convergence, followed by stagnation!
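A minimal sketch of this algorithm on a toy least squares problem $F = \sum_i F_i$ (my own example; the data, stepsize and stopping rule are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
m = 1000
x_data = rng.normal(size=m)
y_data = 2.0 * x_data + rng.normal(scale=0.1, size=m)   # noisy line with slope 2

def grad_Fi(u, i):
    # gradient of F_i(u) = (u * x_i - y_i)^2 with respect to u
    return 2.0 * (u * x_data[i] - y_data[i]) * x_data[i]

eta = 0.01
u = 0.0                                  # starting value u_0
for n in range(10000):                   # stands in for "while the error is large"
    i = rng.integers(m)                  # pick i uniformly at random
    u = u - eta * grad_Fi(u, i)          # u_{n+1} = u_n - eta * grad F_i(u_n)

print(u)                                 # approximately 2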

Minibatch SGD
For every subset $\{i_1, \dots, i_K\} \subset \{1, \dots, m\}$ chosen uniformly at random it holds that
\[
  \mathbb{E}\Big[ \frac{1}{K} \sum_{l=1}^K \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_{i_l}, y_{i_l}) \Big]
  = \frac{1}{m} \sum_{i=1}^m \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i),
\]
i.e., we have an unbiased estimator for the gradient.
$K = 1$: SGD. $K > 1$: minibatch SGD with batch size $K$.

Some Heuristics
The sample mean $\frac{1}{K} \sum_{l=1}^K \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_{i_l}, y_{i_l})$ is itself a random variable with expected value $\frac{1}{m} \sum_{i=1}^m \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i)$. In order to assess the deviation of the sample mean from its expected value we may compute its standard deviation $\sigma / \sqrt{K}$, where $\sigma$ is the standard deviation of $i \mapsto \nabla_{((W_\ell, b_\ell))_{\ell=1}^L} \mathcal{L}(f, x_i, y_i)$. Increasing the batch size by a factor of 100 yields an improvement of the standard deviation by a factor of 10, while the complexity increases by a factor of 100! Common batch sizes for large models: $K = 16, 32$.

    def SGD(self, training_data, epochs, mini_batch_size, eta):
        n = len(training_data)
        for j in range(epochs):
            # reshuffle the data and split it into minibatches
            random.shuffle(training_data)
            mini_batches = [training_data[k:k+mini_batch_size]
                            for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)

    def update_mini_batch(self, mini_batch, eta):
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # accumulate the gradients of all samples in the minibatch
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        # gradient step with stepsize eta / K
        self.weights = [w - (eta / len(mini_batch)) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / len(mini_batch)) * nb
                       for b, nb in zip(self.biases, nabla_b)]
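Putting the pieces together, training then looks like the following sketch (my own usage example, not from the slides; it assumes the Network class above is in scope and that training_data is a list of (x, y) pairs in the format shown earlier, e.g. MNIST images and one-hot labels):

net = Network([784, 30, 10])                  # d = 784, one hidden layer, 10 outputs
net.SGD(training_data, epochs=30, mini_batch_size=10, eta=3.0)

prediction = net.feedforward(training_data[0][0])
print(np.argmax(prediction))                  # index of the most likely digit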

2.5 Summary

The Basic Neural Network Recipe for Learning
1. Neuro-inspired model.
2. Backprop.
3. Minibatch SGD.
Now let's try classifying handwritten digits!

Results
30 epochs, learning rate $\eta = 3.0$, minibatch size $K = 10$, network size $[784, 30, 10]$. Training set size $m = 50000$, test set size $10000$. Classification accuracy 95-96%.
Network size $[784, 30, 30, 10]$: classification accuracy 96-97%.
Network size $[784, 30, 30, 30, 10]$: classification accuracy 95-96%.
Deep learning might not help after all...

2.6 Going Deep (?)

Problems with Deep Networks
- Overfitting (as usual...)
- Vanishing/Exploding Gradient Problem

Dealing with Overfitting: Regularization
Rather than minimizing $\sum_{i=1}^m \mathcal{L}(f, x_i, y_i)$, minimize
\[
  \sum_{i=1}^m \mathcal{L}(f, x_i, y_i) + \mathcal{R}\big((W_\ell)_{\ell=1}^L\big),
  \qquad \text{for example} \qquad
  \mathcal{R}\big((W_\ell)_{\ell=1}^L\big) = \sum_{\ell, i, j} |(W_\ell)_{i,j}|^p.
\]
The gradient update has to be augmented by
\[
  \frac{\partial}{\partial (W_\ell)_{i,j}} \mathcal{R}\big((W_\ell)_{\ell=1}^L\big) = p \, |(W_\ell)_{i,j}|^{p-1} \, \mathrm{sgn}\big((W_\ell)_{i,j}\big).
\]
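For $p = 2$ the extra gradient term is simply $2 (W_\ell)_{i,j}$ (the usual weight decay). A minimal sketch of how the minibatch update above could be augmented accordingly (my own illustration, not from the slides; lmbda is a hypothetical weight on the penalty):

    def update_mini_batch_regularized(self, mini_batch, eta, lmbda):
        # as update_mini_batch, but with the additional gradient term of the
        # penalty lmbda * sum_{l,i,j} |(W_l)_{i,j}|^2, i.e. lmbda * 2 * W_l
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w - (eta / len(mini_batch)) * nw - eta * lmbda * 2 * w
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / len(mini_batch)) * nb
                       for b, nb in zip(self.biases, nabla_b)]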

Sparsity-Promoting Regularization
Since
\[
  \lim_{p \to 0} \sum_{\ell, i, j} |(W_\ell)_{i,j}|^p = \#\{\text{nonzero weights}\},
\]
regularization with $p \le 1$ promotes sparse connectivity (and hence small memory requirements)!

Dropout
During each feedforward/backprop step, drop nodes with probability $p$. After training, multiply all weights by $p$. The final output is an "average" over many sparse network models.

The Vanishing Gradient Problem
Figure: An extremely deep (scalar) network
\[
  \Phi(x) = w_5 \, \sigma(w_4 \, \sigma(w_3 \, \sigma(w_2 \, \sigma(w_1 x + b_1) + b_2) + b_3) + b_4) + b_5.
\]
\[
  \frac{\partial \Phi(x)}{\partial b_1} = \prod_{\ell=2}^{L} w_\ell \; \prod_{\ell=1}^{L-1} \sigma'(z_\ell).
\]
If $\sigma(x) = \frac{1}{1 + e^{-x}}$, it holds that $|\sigma'(x)| \le 2 e^{-|x|}$, and thus
\[
  \Big| \frac{\partial \Phi(x)}{\partial b_1} \Big| \le 2^{L-1} \prod_{\ell=2}^{L} |w_\ell| \; e^{-\sum_{\ell=1}^{L-1} |z_\ell|}.
\]
Bottom layers will learn much more slowly than top layers and will not contribute to learning. Depth is a nuisance!?

Dealing with the Vanishing Gradient Problem
Use an activation function with a `large' gradient.
ReLU: The Rectified Linear Unit is defined as
\[
  \mathrm{ReLU}(x) := \begin{cases} x & x \ge 0 \\ 0 & \text{else.} \end{cases}
\]

Exercise: Implement Dropout.
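For completeness, a one-line vectorized ReLU and its derivative (my own sketch, not from the slides; the derivative could replace sigmoid_prime in the backprop code):

def relu(z):
    # ReLU(z) = max(z, 0), applied component-wise
    return np.maximum(z, 0.0)

def relu_prime(z):
    # derivative: 1 for z > 0, 0 otherwise (the value at 0 is chosen as 0)
    return (z > 0).astype(float)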