Review of Probability Theory
Arian Maleki and Tom Do
Stanford University
Through this class we will be relying on concepts from probability theory for deriving machine learning algorithms. These notes attempt to cover the basics of probability theory at a level appropriate for CS 229.


1.1 Conditional probability and independence

Let B be an event with non-zero probability. The conditional probability of any event A given B is defined as

    P(A | B) ≜ P(A ∩ B) / P(B)

In other words, P(A | B) is the probability measure of the event A after observing the occurrence of event B. Two events are called independent if and only if P(A ∩ B) = P(A)P(B) (or equivalently, P(A | B) = P(A)). Therefore, independence is equivalent to saying that observing B does not have any effect on the probability of A.

2 Random variables

Consider an experiment in which we flip 10 coins, and we want to know the number of coins that come up heads. Here, the elements of the sample space Ω are length-10 sequences of heads and tails. For example, we might have ω₀ = ⟨H, H, T, H, T, H, H, T, T, T⟩ ∈ Ω. However, in practice, we usually do not care about the probability of obtaining any particular sequence of heads and tails. Instead we usually care about real-valued functions of outcomes, such as the number of heads that appear among our 10 tosses, or the length of the longest run of tails. These functions, under some technical conditions, are known as random variables.

More formally, a random variable X is a function X : Ω → R.² Typically, we will denote random variables using upper case letters X(ω), or more simply X (where the dependence on the random outcome ω is implied). We will denote the value that a random variable may take on using lower case letters x.

Example: In our experiment above, suppose that X(ω) is the number of heads which occur in the sequence of tosses ω. Given that only 10 coins are tossed, X(ω) can take only a finite number of values, so it is known as a discrete random variable. Here, the probability of the set associated with a random variable X taking on some specific value k is

    P(X = k) := P({ω : X(ω) = k}).

Example: Suppose that X(ω) is a random variable indicating the amount of time it takes for a radioactive particle to decay. In this case, X(ω) takes on an infinite number of possible values, so it is called a continuous random variable. We denote the probability that X takes on a value between two real constants a and b (where a < b) as

    P(a ≤ X ≤ b) := P({ω : a ≤ X(ω) ≤ b}).

2.1 Cumulative distribution functions

In order to specify the probability measures used when dealing with random variables, it is often convenient to specify alternative functions (CDFs, PDFs, and PMFs) from which the probability measure governing an experiment immediately follows. In this section and the next two sections, we describe each of these types of functions in turn.

A cumulative distribution function (CDF) is a function F_X : R → [0, 1] which specifies a probability measure as

    F_X(x) ≜ P(X ≤ x).    (1)

By using this function one can calculate the probability of any event in F.³ Figure ?? shows a sample CDF function.

Properties:

• 0 ≤ F_X(x) ≤ 1.
• lim_{x → −∞} F_X(x) = 0.
• lim_{x → ∞} F_X(x) = 1.
• x ≤ y implies F_X(x) ≤ F_X(y).

²Technically speaking, not every function is acceptable as a random variable. From a measure-theoretic perspective, random variables must be Borel-measurable functions. Intuitively, this restriction ensures that given a random variable and its underlying outcome space, one can implicitly define each of the events of the event space as being sets of outcomes ω ∈ Ω for which X(ω) satisfies some property (e.g., the event {ω : X(ω) ≥ 3}).

³This is a remarkable fact and is actually a theorem that is proved in more advanced courses.

Example (for X distributed uniformly on [0, 1]):

    E[X²] = ∫_{−∞}^{∞} x² f_X(x) dx = ∫_0^1 x² dx = 1/3.

    Var[X] = E[X²] − E[X]² = 1/3 − 1/4 = 1/12.

Example: Suppose that g(x) = 1{x ∈ A} for some subset A ⊆ Ω. What is E[g(X)]?

Discrete case:

    E[g(X)] = Σ_{x ∈ Val(X)} 1{x ∈ A} p_X(x) = Σ_{x ∈ A} p_X(x) = P(x ∈ A).

Continuous case:

    E[g(X)] = ∫_{−∞}^{∞} 1{x ∈ A} f_X(x) dx = ∫_{x ∈ A} f_X(x) dx = P(x ∈ A).

2.6 Some common random variables

Discrete random variables

• X ~ Bernoulli(p) (where 0 ≤ p ≤ 1): one if a coin with heads probability p comes up heads, zero otherwise.

    p(x) = p if x = 1;  1 − p if x = 0

• X ~ Binomial(n, p) (where 0 ≤ p ≤ 1): the number of heads in n independent flips of a coin with heads probability p.

    p(x) = (n choose x) p^x (1 − p)^(n − x)

• X ~ Geometric(p) (where p > 0): the number of flips of a coin with heads probability p until the first heads.

    p(x) = p (1 − p)^(x − 1)

• X ~ Poisson(λ) (where λ > 0): a probability distribution over the nonnegative integers used for modeling the frequency of rare events.

    p(x) = e^(−λ) λ^x / x!

Continuous random variables

• X ~ Uniform(a, b) (where a < b): equal probability density to every value between a and b on the real line.

    f(x) = 1 / (b − a) if a ≤ x ≤ b;  0 otherwise

• X ~ Exponential(λ) (where λ > 0): decaying probability density over the nonnegative reals.

    f(x) = λ e^(−λx) if x ≥ 0;  0 otherwise

• X ~ Normal(μ, σ²): also known as the Gaussian distribution,

    f(x) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))
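As a quick sanity check on the uniform moments computed above (E[X] = 1/2, Var[X] = 1/12 for X uniform on [0, 1]), here is a minimal Monte Carlo sketch; it is an illustration, not part of the original notes, and the sample size is an arbitrary choice.

```python
import random

# Estimate E[X] and Var[X] for X ~ Uniform(0, 1) by simulation and
# compare with the closed forms E[X] = 1/2 and
# Var[X] = E[X^2] - E[X]^2 = 1/3 - 1/4 = 1/12.
random.seed(0)
n = 200_000
samples = [random.random() for _ in range(n)]

mean = sum(samples) / n                          # estimates E[X] ≈ 0.5
second_moment = sum(x * x for x in samples) / n  # estimates E[X^2] ≈ 1/3
var = second_moment - mean ** 2                  # estimates Var[X] ≈ 1/12
```

With 200,000 samples the estimates land within about a thousandth of the exact values, which is what the Central Limit Theorem (discussed later in these notes) leads us to expect.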
3.4 Conditional distributions

Conditional distributions seek to answer the question: what is the probability distribution over Y, when we know that X must take on a certain value x? In the discrete case, the conditional probability mass function of Y given X is simply

    p_{Y|X}(y|x) = p_{XY}(x, y) / p_X(x),

assuming that p_X(x) ≠ 0.

In the continuous case, the situation is technically a little more complicated because the probability that a continuous random variable X takes on a specific value x is equal to zero.⁴ Ignoring this technical point, we simply define, by analogy to the discrete case, the conditional probability density of Y given X = x to be

    f_{Y|X}(y|x) = f_{XY}(x, y) / f_X(x),

provided f_X(x) ≠ 0.

3.5 Bayes's rule

A useful formula that often arises when trying to derive expressions for the conditional probability of one variable given another is Bayes's rule. In the case of discrete random variables X and Y,

    p_{Y|X}(y|x) = p_{XY}(x, y) / p_X(x) = p_{X|Y}(x|y) p_Y(y) / Σ_{y' ∈ Val(Y)} p_{X|Y}(x|y') p_Y(y').

If the random variables X and Y are continuous,

    f_{Y|X}(y|x) = f_{XY}(x, y) / f_X(x) = f_{X|Y}(x|y) f_Y(y) / ∫_{−∞}^{∞} f_{X|Y}(x|y') f_Y(y') dy'.

3.6 Independence

Two random variables X and Y are independent if F_{XY}(x, y) = F_X(x) F_Y(y) for all values of x and y. Equivalently,

• For discrete random variables, p_{XY}(x, y) = p_X(x) p_Y(y) for all x ∈ Val(X), y ∈ Val(Y).
• For discrete random variables, p_{Y|X}(y|x) = p_Y(y) whenever p_X(x) ≠ 0, for all y ∈ Val(Y).
• For continuous random variables, f_{XY}(x, y) = f_X(x) f_Y(y) for all x, y ∈ R.
• For continuous random variables, f_{Y|X}(y|x) = f_Y(y) whenever f_X(x) ≠ 0, for all y ∈ R.
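The discrete form of Bayes's rule above can be exercised on a toy example. The scenario and all numbers below are hypothetical, chosen only to illustrate the computation: Y is a disease status with a rare prior, X is a test result, and the denominator is the normalizing sum over Val(Y).

```python
# Hypothetical prior p_Y(y) and likelihood p_{X|Y}(x|y); not from the notes.
p_y = {"sick": 0.01, "healthy": 0.99}
p_x_given_y = {
    ("pos", "sick"): 0.95, ("neg", "sick"): 0.05,
    ("pos", "healthy"): 0.10, ("neg", "healthy"): 0.90,
}

def posterior(y, x):
    """p_{Y|X}(y|x) = p_{X|Y}(x|y) p_Y(y) / sum_{y'} p_{X|Y}(x|y') p_Y(y')."""
    evidence = sum(p_x_given_y[(x, yp)] * p_y[yp] for yp in p_y)
    return p_x_given_y[(x, y)] * p_y[y] / evidence

# P(sick | positive test) = 0.95*0.01 / (0.95*0.01 + 0.10*0.99) ≈ 0.0876
p_sick_given_pos = posterior("sick", "pos")
```

Note how the small prior dominates: even a fairly accurate test yields a posterior under 9%, which is exactly the kind of conclusion Bayes's rule makes easy to read off.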
⁴To get around this, a more reasonable way to calculate the conditional CDF is

    F_{Y|X}(y, x) = lim_{Δx → 0} P(Y ≤ y | x ≤ X ≤ x + Δx).

It can be easily seen that if F(x, y) is differentiable in both x and y, then

    F_{Y|X}(y, x) = ∫_{−∞}^{y} f_{X,Y}(x, α) / f_X(x) dα,

and therefore we define the conditional PDF of Y given X = x in the following way:

    f_{Y|X}(y|x) = f_{XY}(x, y) / f_X(x).

4.1 Basic properties

We can define the joint distribution function of X1, X2, ..., Xn, the joint probability density function of X1, X2, ..., Xn, the marginal probability density function of X1, and the conditional probability density function of X1 given X2, ..., Xn, as

    F_{X1,...,Xn}(x1, ..., xn) = P(X1 ≤ x1, X2 ≤ x2, ..., Xn ≤ xn)

    f_{X1,...,Xn}(x1, ..., xn) = ∂ⁿ F_{X1,...,Xn}(x1, ..., xn) / (∂x1 ... ∂xn)

    f_{X1}(x1) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f_{X1,...,Xn}(x1, ..., xn) dx2 ... dxn

    f_{X1|X2,...,Xn}(x1 | x2, ..., xn) = f_{X1,...,Xn}(x1, ..., xn) / f_{X2,...,Xn}(x2, ..., xn)

To calculate the probability of an event A ⊆ Rⁿ we have

    P((x1, ..., xn) ∈ A) = ∫_{(x1,...,xn) ∈ A} f_{X1,...,Xn}(x1, ..., xn) dx1 dx2 ... dxn    (4)

Chain rule: From the definition of conditional probabilities for multiple random variables, one can show that

    f(x1, x2, ..., xn) = f(xn | x1, ..., x(n−1)) f(x1, ..., x(n−1))
                       = f(xn | x1, ..., x(n−1)) f(x(n−1) | x1, ..., x(n−2)) f(x1, ..., x(n−2))
                       = ··· = f(x1) ∏_{i=2}^{n} f(xi | x1, ..., x(i−1)).

Independence: For multiple events A1, ..., Ak, we say that A1, ..., Ak are mutually independent if for any subset S ⊆ {1, 2, ..., k} we have

    P(∩_{i ∈ S} Ai) = ∏_{i ∈ S} P(Ai).

Likewise, we say that random variables X1, ..., Xn are independent if

    f(x1, ..., xn) = f(x1) f(x2) ··· f(xn).

Here, the definition of mutual independence is simply the natural generalization of independence of two random variables to multiple random variables.

Independent random variables arise often in machine learning algorithms where we assume that the training examples belonging to the training set represent independent samples from some unknown probability distribution. To make the significance of independence clear, consider a "bad" training set in which we first sample a single training example (x^(1), y^(1)) from some unknown distribution, and then add m − 1 copies of the exact same training example to the training set. In this case, we have (with some abuse of notation)

    P((x^(1), y^(1)), ..., (x^(m), y^(m))) ≠ ∏_{i=1}^{m} P(x^(i), y^(i)).

Despite the fact that the training set has size m, the examples are not independent! While clearly the procedure described here is not a sensible method for building a training set for a machine learning algorithm, it turns out that in practice, non-independence of samples does come up often, and it has the effect of reducing the "effective size" of the training set.

We write this as X ~ N(μ, Σ). Notice that in the case n = 1, this reduces to the regular definition of a normal distribution with mean parameter μ1 and variance Σ11.

Generally speaking, Gaussian random variables are extremely useful in machine learning and statistics for two main reasons. First, they are extremely common when modeling "noise" in statistical algorithms. Quite often, noise can be considered to be the accumulation of a large number of small independent random perturbations affecting the measurement process; by the Central Limit Theorem, summations of independent random variables will tend to "look Gaussian." Second, Gaussian random variables are convenient for many analytical manipulations, because many of the integrals involving Gaussian distributions that arise in practice have simple closed-form solutions. We will encounter this later in the course.

5 Other resources

A good textbook on probability at the level needed for CS 229 is A First Course in Probability by Sheldon Ross.
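The chain rule and the product test for independence from Section 4.1 can both be checked mechanically on a small discrete joint distribution. The joint table below is a made-up example (not from the notes), chosen so that the chain rule holds exactly while the independence factorization fails.

```python
import itertools

# Hypothetical joint PMF p(x1, x2) over {0, 1} x {0, 1}; entries sum to 1.
p_joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.24, (1, 1): 0.36,
}

def marginal_x1(x1):
    """p_{X1}(x1), summing the joint over x2."""
    return sum(p for (a, _), p in p_joint.items() if a == x1)

def marginal_x2(x2):
    """p_{X2}(x2), summing the joint over x1."""
    return sum(p for (_, b), p in p_joint.items() if b == x2)

def cond_x2_given_x1(x2, x1):
    """p_{X2|X1}(x2|x1) = p(x1, x2) / p_{X1}(x1)."""
    return p_joint[(x1, x2)] / marginal_x1(x1)

# Chain rule: p(x1, x2) = p(x1) * p(x2 | x1) holds for every cell.
for x1, x2 in itertools.product([0, 1], repeat=2):
    assert abs(p_joint[(x1, x2)] - marginal_x1(x1) * cond_x2_given_x1(x2, x1)) < 1e-12

# Independence would require p(x1, x2) = p(x1) * p(x2) for every cell;
# here p(1, 1) = 0.36 while p(1) * p(1) = 0.6 * 0.66 = 0.396, so X1 and X2
# are NOT independent.
not_independent = abs(p_joint[(1, 1)] - marginal_x1(1) * marginal_x2(1)) > 1e-9
```

The chain rule is an identity, so it holds for any joint table; the independence factorization is an extra assumption, which this particular table violates.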