Throughout this class we will be relying on concepts from probability theory for deriving machine learning algorithms. These notes attempt to cover the basics of probability theory at a level appropriate for CS 229.
1.1 Conditional probability and independence

Let B be an event with non-zero probability. The conditional probability of any event A given B is defined as

    P(A | B) ≜ P(A ∩ B) / P(B).

In other words, P(A | B) is the probability measure of the event A after observing the occurrence of event B. Two events are called independent if and only if P(A ∩ B) = P(A) P(B) (or, equivalently, P(A | B) = P(A)). Therefore, independence is equivalent to saying that observing B does not have any effect on the probability of A.

2 Random variables

Consider an experiment in which we flip 10 coins, and we want to know the number of coins that come up heads. Here, the elements of the sample space Ω are 10-length sequences of heads and tails. For example, we might have ω₀ = ⟨H, H, T, H, T, H, H, T, T, T⟩ ∈ Ω. However, in practice, we usually do not care about the probability of obtaining any particular sequence of heads and tails. Instead we usually care about real-valued functions of outcomes, such as the number of heads that appear among our 10 tosses, or the length of the longest run of tails. These functions, under some technical conditions, are known as random variables.

More formally, a random variable X is a function X : Ω → ℝ. [2] Typically, we will denote random variables using upper case letters X(ω) or more simply X (where the dependence on the random outcome ω is implied). We will denote the value that a random variable may take on using lower case letters x.

Example: In our experiment above, suppose that X(ω) is the number of heads which occur in the sequence of tosses ω. Given that only 10 coins are tossed, X(ω) can take only a finite number of values, so it is known as a discrete random variable. Here, the probability of the set associated with a random variable X taking on some specific value k is

    P(X = k) := P({ω : X(ω) = k}).

Example: Suppose that X(ω) is a random variable indicating the amount of time it takes for a radioactive particle to decay. In this case, X(ω) takes on an infinite number of possible values, so it is called a continuous random variable. We denote the probability that X takes on a value between two real constants a and b (where a ≤ b) as

    P(a ≤ X ≤ b) := P({ω : a ≤ X(ω) ≤ b}).

2.1 Cumulative distribution functions

In order to specify the probability measures used when dealing with random variables, it is often convenient to specify alternative functions (CDFs, PDFs, and PMFs) from which the probability measure governing an experiment immediately follows.
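The coin-flipping setup above can be made concrete in code. The sketch below is our own illustration (the function names are not from the notes): it enumerates the sample space Ω of 10 flips, treats the number of heads as a random variable X : Ω → ℝ, and computes P(X = k) as the measure of the event {ω : X(ω) = k}, assuming a fair coin so that the measure is uniform over outcomes.

```python
import itertools
from fractions import Fraction

# Sample space Omega: all 2^10 length-10 sequences of heads (H) and tails (T).
omega = list(itertools.product("HT", repeat=10))

# A random variable is a function X: Omega -> R; here, the number of heads.
def X(w):
    return w.count("H")

# P(X = k) := P({w : X(w) = k}); a fair coin gives each outcome measure 1/|Omega|.
def prob_X_equals(k):
    favorable = sum(1 for w in omega if X(w) == k)
    return Fraction(favorable, len(omega))

print(prob_X_equals(5))  # C(10,5)/2^10 = 252/1024 = 63/256
```

Summing `prob_X_equals(k)` over k = 0, …, 10 returns exactly 1, since the events {ω : X(ω) = k} partition Ω.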
In this section and the next two sections, we describe each of these types of functions in turn.

A cumulative distribution function (CDF) is a function F_X : ℝ → [0, 1] which specifies a probability measure as

    F_X(x) ≜ P(X ≤ x).    (1)

By using this function one can calculate the probability of any event in F. [3] Figure ?? shows a sample CDF function.

Properties:

[2] Technically speaking, not every function is acceptable as a random variable. From a measure-theoretic perspective, random variables must be Borel-measurable functions. Intuitively, this restriction ensures that given a random variable and its underlying outcome space, one can implicitly define each of the events of the event space as being sets of outcomes ω ∈ Ω for which X(ω) satisfies some property (e.g., the event {ω : X(ω) ≥ 3}).

[3] This is a remarkable fact and is actually a theorem that is proved in more advanced courses.

    E[X²] = ∫_{−∞}^{∞} x² f_X(x) dx = ∫_0^1 x² dx = 1/3.

    Var[X] = E[X²] − E[X]² = 1/3 − 1/4 = 1/12.

Example: Suppose that g(x) = 1{x ∈ A} for some subset A ⊆ Ω. What is E[g(X)]?

Discrete case:

    E[g(X)] = Σ_{x ∈ Val(X)} 1{x ∈ A} p_X(x) = Σ_{x ∈ A} p_X(x) = P(X ∈ A).

Continuous case:

    E[g(X)] = ∫_{−∞}^{∞} 1{x ∈ A} f_X(x) dx = ∫_{x ∈ A} f_X(x) dx = P(X ∈ A).

2.6 Some common random variables

Discrete random variables

- X ∼ Bernoulli(p) (where 0 ≤ p ≤ 1): one if a coin with heads probability p comes up heads, zero otherwise.

      p(x) = p if x = 1;  1 − p if x = 0.

- X ∼ Binomial(n, p) (where 0 ≤ p ≤ 1): the number of heads in n independent flips of a coin with heads probability p.

      p(x) = (n choose x) p^x (1 − p)^{n − x}.

- X ∼ Geometric(p) (where p > 0): the number of flips of a coin with heads probability p until the first heads.

      p(x) = p (1 − p)^{x − 1}.

- X ∼ Poisson(λ) (where λ > 0): a probability distribution over the nonnegative integers used for modeling the frequency of rare events.

      p(x) = e^{−λ} λ^x / x!.

Continuous random variables

- X ∼ Uniform(a, b) (where a < b): equal probability density to every value between a and b on the real line.

      f(x) = 1 / (b − a) if a ≤ x ≤ b;  0 otherwise.

- X ∼ Exponential(λ) (where λ > 0): decaying probability density over the nonnegative reals.

      f(x) = λ e^{−λx} if x ≥ 0;  0 otherwise.

- X ∼ Normal(μ, σ²): also known as the Gaussian distribution,

      f(x) = (1 / (√(2π) σ)) e^{−(x − μ)² / (2σ²)}.
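The moments computed above for X ∼ Uniform(0, 1) are easy to sanity-check numerically. The following Monte Carlo sketch is our own illustration (not part of the notes); the sample estimates should land near E[X²] = 1/3 and Var[X] = 1/12.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible
samples = [random.random() for _ in range(200_000)]  # draws from Uniform(0, 1)

e_x2 = statistics.fmean(x * x for x in samples)  # estimate of E[X^2], expect ~1/3
var_x = statistics.pvariance(samples)            # estimate of Var[X], expect ~1/12

print(f"E[X^2] estimate: {e_x2:.4f}, Var[X] estimate: {var_x:.4f}")
```

With 200,000 samples the standard error of both estimates is well under 0.01, so the printed values agree with 1/3 ≈ 0.3333 and 1/12 ≈ 0.0833 to a couple of decimal places.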
3.4 Conditional distributions

Conditional distributions seek to answer the question, what is the probability distribution over Y, when we know that X must take on a certain value x? In the discrete case, the conditional probability mass function of Y given X is simply

    p_{Y|X}(y | x) = p_{XY}(x, y) / p_X(x),

assuming that p_X(x) ≠ 0.

In the continuous case, the situation is technically a little more complicated because the probability that a continuous random variable X takes on a specific value x is equal to zero. [4] Ignoring this technical point, we simply define, by analogy to the discrete case, the conditional probability density of Y given X = x to be

    f_{Y|X}(y | x) = f_{XY}(x, y) / f_X(x),

provided f_X(x) ≠ 0.

3.5 Bayes's rule

A useful formula that often arises when trying to derive expressions for the conditional probability of one variable given another is Bayes's rule.

In the case of discrete random variables X and Y,

    P_{Y|X}(y | x) = P_{XY}(x, y) / P_X(x) = P_{X|Y}(x | y) P_Y(y) / Σ_{y' ∈ Val(Y)} P_{X|Y}(x | y') P_Y(y').

If the random variables X and Y are continuous,

    f_{Y|X}(y | x) = f_{XY}(x, y) / f_X(x) = f_{X|Y}(x | y) f_Y(y) / ∫_{−∞}^{∞} f_{X|Y}(x | y') f_Y(y') dy'.

3.6 Independence

Two random variables X and Y are independent if F_{XY}(x, y) = F_X(x) F_Y(y) for all values of x and y. Equivalently,

- For discrete random variables, p_{XY}(x, y) = p_X(x) p_Y(y) for all x ∈ Val(X), y ∈ Val(Y).
- For discrete random variables, p_{Y|X}(y | x) = p_Y(y) whenever p_X(x) ≠ 0, for all y ∈ Val(Y).
- For continuous random variables, f_{XY}(x, y) = f_X(x) f_Y(y) for all x, y ∈ ℝ.
- For continuous random variables, f_{Y|X}(y | x) = f_Y(y) whenever f_X(x) ≠ 0, for all y ∈ ℝ.
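As a worked instance of the discrete form of Bayes's rule, the snippet below (all numbers are made up for illustration) computes P_{Y|X}(y | x) from a prior P_Y and a likelihood P_{X|Y}, expanding the denominator as the sum over Val(Y) exactly as in the formula above.

```python
# Hypothetical numbers: Y is a latent label, X = "positive" is an observed test result.
p_y = {"sick": 0.01, "healthy": 0.99}             # prior P_Y(y)
p_pos_given_y = {"sick": 0.95, "healthy": 0.05}   # likelihood P_{X|Y}(positive | y)

# Bayes's rule: P(Y = sick | X = positive)
numerator = p_pos_given_y["sick"] * p_y["sick"]
evidence = sum(p_pos_given_y[y] * p_y[y] for y in p_y)  # marginal P(X = positive)
posterior = numerator / evidence

print(f"P(sick | positive) = {posterior:.3f}")  # 0.161, despite the 95% sensitivity
```

The small posterior despite an accurate test is the classic effect of a low prior: the denominator is dominated by false positives from the large "healthy" population.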
[4] To get around this, a more reasonable way to calculate the conditional CDF is

    F_{Y|X}(y, x) = lim_{Δx → 0} P(Y ≤ y | x ≤ X ≤ x + Δx).

It can easily be seen that if F(x, y) is differentiable in both x and y, then

    F_{Y|X}(y, x) = ∫_{−∞}^{y} f_{X,Y}(x, α) / f_X(x) dα,

and therefore we define the conditional PDF of Y given X = x in the following way:

    f_{Y|X}(y | x) = f_{XY}(x, y) / f_X(x).

4.1 Basic properties

We can define the joint distribution function of X₁, X₂, …, Xₙ, the joint probability density function of X₁, X₂, …, Xₙ, the marginal probability density function of X₁, and the conditional probability density function of X₁ given X₂, …, Xₙ, as

    F_{X₁,X₂,…,Xₙ}(x₁, x₂, …, xₙ) = P(X₁ ≤ x₁, X₂ ≤ x₂, …, Xₙ ≤ xₙ)

    f_{X₁,X₂,…,Xₙ}(x₁, x₂, …, xₙ) = ∂ⁿ F_{X₁,X₂,…,Xₙ}(x₁, x₂, …, xₙ) / (∂x₁ ⋯ ∂xₙ)

    f_{X₁}(x₁) = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f_{X₁,X₂,…,Xₙ}(x₁, x₂, …, xₙ) dx₂ ⋯ dxₙ

    f_{X₁|X₂,…,Xₙ}(x₁ | x₂, …, xₙ) = f_{X₁,X₂,…,Xₙ}(x₁, x₂, …, xₙ) / f_{X₂,…,Xₙ}(x₂, …, xₙ)

To calculate the probability of an event A ⊆ ℝⁿ we have

    P((x₁, x₂, …, xₙ) ∈ A) = ∫_{(x₁,…,xₙ) ∈ A} f_{X₁,…,Xₙ}(x₁, …, xₙ) dx₁ dx₂ ⋯ dxₙ.    (4)

Chain rule: From the definition of conditional probabilities for multiple random variables, one can show that

    f(x₁, x₂, …, xₙ) = f(xₙ | x₁, x₂, …, xₙ₋₁) f(x₁, x₂, …, xₙ₋₁)
                     = f(xₙ | x₁, x₂, …, xₙ₋₁) f(xₙ₋₁ | x₁, x₂, …, xₙ₋₂) f(x₁, x₂, …, xₙ₋₂)
                     = ⋯ = f(x₁) ∏_{i=2}^{n} f(xᵢ | x₁, …, xᵢ₋₁).

Independence: For multiple events A₁, …, A_k, we say that A₁, …, A_k are mutually independent if for any subset S ⊆ {1, 2, …, k}, we have

    P(∩_{i ∈ S} Aᵢ) = ∏_{i ∈ S} P(Aᵢ).

Likewise, we say that random variables X₁, …, Xₙ are independent if

    f(x₁, …, xₙ) = f(x₁) f(x₂) ⋯ f(xₙ).

Here, the definition of mutual independence is simply the natural generalization of independence of two random variables to multiple random variables.

Independent random variables arise often in machine learning algorithms where we assume that the training examples belonging to the training set represent independent samples from some unknown probability distribution. To make the significance of independence clear, consider a "bad" training set in which we first sample a single training example (x⁽¹⁾, y⁽¹⁾) from some unknown distribution, and then add m − 1 copies of the exact same training example to the training set. In this case, we have (with some abuse of notation)

    P((x⁽¹⁾, y⁽¹⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)) ≠ ∏_{i=1}^{m} P(x⁽ⁱ⁾, y⁽ⁱ⁾).

Despite the fact that the training set has size m, the examples are not independent!
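The chain rule above can be checked mechanically on any joint PMF. In this sketch (the joint table is invented for illustration) we recover p(x₁) and p(x₁, x₂) by marginalization and confirm that p(x₁) p(x₂ | x₁) p(x₃ | x₁, x₂) reproduces the joint.

```python
from collections import Counter

# A made-up joint PMF over three binary variables (probabilities sum to 1).
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def marginal(keep):
    """Marginalize the joint PMF onto the coordinates listed in `keep`."""
    out = Counter()
    for xs, p in joint.items():
        out[tuple(xs[i] for i in keep)] += p
    return out

p1, p12 = marginal((0,)), marginal((0, 1))

# Chain rule: p(x1, x2, x3) = p(x1) * p(x2 | x1) * p(x3 | x1, x2).
for xs, p in joint.items():
    p_x1 = p1[(xs[0],)]
    p_x2_given = p12[(xs[0], xs[1])] / p_x1            # p(x2 | x1)
    p_x3_given = p / p12[(xs[0], xs[1])]               # p(x3 | x1, x2)
    assert abs(p_x1 * p_x2_given * p_x3_given - p) < 1e-12

print("chain-rule factorization matches the joint PMF")
```

The same marginalization helper also makes it easy to test the independence condition: X₁, X₂, X₃ are independent exactly when the joint factors into the product of the three one-variable marginals at every point.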
While clearly the procedure described here is not a sensible method for building a training set for a machine learning algorithm, it turns out that in practice, non-independence of samples does come up often, and it has the effect of reducing the effective size of the training set.

We write this as X ∼ N(μ, Σ). Notice that in the case n = 1, this reduces to the regular definition of a normal distribution with mean parameter μ₁ and variance Σ₁₁.

Generally speaking, Gaussian random variables are extremely useful in machine learning and statistics for two main reasons. First, they are extremely common when modeling "noise" in statistical algorithms. Quite often, noise can be considered to be the accumulation of a large number of small independent random perturbations affecting the measurement process; by the Central Limit Theorem, summations of independent random variables will tend to "look Gaussian." Second, Gaussian random variables are convenient for many analytical manipulations, because many of the integrals involving Gaussian distributions that arise in practice have simple closed-form solutions. We will encounter this later in the course.

5 Other resources

A good textbook on probability at the level needed for CS 229 is A First Course in Probability by Sheldon Ross.
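The Central Limit Theorem intuition above is easy to see empirically. This sketch is our own illustration (the constants are arbitrary): it sums many small independent uniform perturbations and checks that the resulting sums have roughly the mean and variance a Gaussian approximation would predict.

```python
import random
import statistics

random.seed(1)
n_terms, n_sums = 200, 5_000

# Each perturbation ~ Uniform(-0.5, 0.5): mean 0, variance 1/12.
# The sum of n_terms of them therefore has mean 0 and variance n_terms / 12.
sums = [sum(random.uniform(-0.5, 0.5) for _ in range(n_terms))
        for _ in range(n_sums)]

mean = statistics.fmean(sums)
var_ratio = statistics.pvariance(sums) / (n_terms / 12)  # should be near 1

print(f"sample mean: {mean:.3f}, variance ratio: {var_ratio:.3f}")
```

A histogram of `sums` would show the familiar bell shape, even though each individual term is uniform rather than Gaussian.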