
Review of Probability Theory
Arian Maleki and Tom Do
Stanford University

Probability theory is the study of uncertainty.

Through this class we will be relying on concepts from probability theory for deriving machine learning algorithms. These notes attempt to cover the basics of probability theory at a level appropriate for CS 229. The mathematical theory of probability is very sophisticated, and delves into a branch of analysis known as measure theory; these notes provide only a basic treatment that does not address those finer details.



1.1 Conditional probability and independence

Let B be an event with non-zero probability. The conditional probability of any event A given B is defined as

    P(A|B) ≜ P(A ∩ B) / P(B).

In other words, P(A|B) is the probability measure of the event A after observing the occurrence of event B. Two events are called independent if and only if P(A ∩ B) = P(A)P(B) (or, equivalently, P(A|B) = P(A)). Therefore, independence is equivalent to saying that observing B does not have any effect on the probability of A.

2 Random variables

Consider an experiment in which we flip 10 coins, and we want to know the number of coins that come up heads. Here, the elements of the sample space Ω are 10-length sequences of heads and tails. For example, we might have ω₀ = ⟨H, H, T, H, T, H, H, T, T, T⟩ ∈ Ω. However, in practice, we usually do not care about the probability of obtaining any particular sequence of heads and tails. Instead we usually care about real-valued functions of outcomes, such as the number of heads that appear among our 10 tosses, or the length of the longest run of tails. These functions, under some technical conditions, are known as random variables.

More formally, a random variable X is a function X : Ω → R.² Typically, we will denote random variables using upper-case letters X(ω), or more simply X (where the dependence on the random outcome ω is implied). We will denote the value that a random variable may take on using lower-case letters x.

Example: In our experiment above, suppose that X(ω) is the number of heads which occur in the sequence of tosses ω. Given that only 10 coins are tossed, X(ω) can take only a finite number of values, so it is known as a discrete random variable. Here, the probability of the set associated with a random variable X taking on some specific value k is

    P(X = k) := P({ω : X(ω) = k}).

Example: Suppose that X(ω) is a random variable indicating the amount of time it takes for a radioactive particle to decay. In this case, X(ω) takes on an infinite number of possible values, so it is called a continuous random variable. We denote the probability that X takes on a value between two real constants a and b (where a ≤ b) as

    P(a ≤ X ≤ b) := P({ω : a ≤ X(ω) ≤ b}).

2.1 Cumulative distribution functions

In order to specify the probability measures used when dealing with random variables, it is often convenient to specify alternative functions (CDFs, PDFs, and PMFs) from which the probability measure governing an experiment immediately follows. In this section and the next two sections, we describe each of these types of functions in turn.

A cumulative distribution function (CDF) is a function F_X : R → [0, 1] which specifies a probability measure as

    F_X(x) ≜ P(X ≤ x).                                          (1)

By using this function one can calculate the probability of any event in F.³ Figure ?? shows a sample CDF.

Properties:
- 0 ≤ F_X(x) ≤ 1.
- lim_{x → −∞} F_X(x) = 0.
- lim_{x → ∞} F_X(x) = 1.
- x ≤ y ⟹ F_X(x) ≤ F_X(y).

[2] Technically speaking, not every function is acceptable as a random variable. From a measure-theoretic perspective, random variables must be Borel-measurable functions. Intuitively, this restriction ensures that given a random variable and its underlying outcome space, one can implicitly define each of the events of the event space as being sets of outcomes ω ∈ Ω for which X(ω) satisfies some property (e.g., the event {ω : X(ω) ≥ 3}).

[3] This is a remarkable fact and is actually a theorem that is proved in more advanced courses.

Example (X ∼ Uniform(0, 1)):

    E[X²] = ∫_{−∞}^{∞} x² f_X(x) dx = ∫₀¹ x² dx = 1/3,

    Var[X] = E[X²] − E[X]² = 1/3 − 1/4 = 1/12.

Example: Suppose that g(x) = 1{x ∈ A} for some subset A ⊆ Ω. What is E[g(X)]?

Discrete case:

    E[g(X)] = Σ_{x ∈ Val(X)} 1{x ∈ A} p_X(x) = Σ_{x ∈ A} p_X(x) = P(X ∈ A).

Continuous case:

    E[g(X)] = ∫_{−∞}^{∞} 1{x ∈ A} f_X(x) dx = ∫_{x ∈ A} f_X(x) dx = P(X ∈ A).

2.6 Some common random variables

Discrete random variables

- X ∼ Bernoulli(p) (where 0 ≤ p ≤ 1): one if a coin with heads probability p comes up heads, zero otherwise.
      p(x) = p        if x = 1
             1 − p    if x = 0
- X ∼ Binomial(n, p) (where 0 ≤ p ≤ 1): the number of heads in n independent flips of a coin with heads probability p.
      p(x) = C(n, x) pˣ (1 − p)ⁿ⁻ˣ
- X ∼ Geometric(p) (where p > 0): the number of flips of a coin with heads probability p until the first heads.
      p(x) = p (1 − p)ˣ⁻¹
- X ∼ Poisson(λ) (where λ > 0): a probability distribution over the nonnegative integers used for modeling the frequency of rare events.
      p(x) = e⁻λ λˣ / x!

Continuous random variables

- X ∼ Uniform(a, b) (where a < b): equal probability density to every value between a and b on the real line.
      f(x) = 1/(b − a)   if a ≤ x ≤ b
             0           otherwise
- X ∼ Exponential(λ) (where λ > 0): decaying probability density over the nonnegative reals.
      f(x) = λ e^{−λx}   if x ≥ 0
             0           otherwise
- X ∼ Normal(μ, σ²): also known as the Gaussian distribution.
      f(x) = (1 / (√(2π) σ)) e^{−(x − μ)² / (2σ²)}
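The moments computed in the Uniform(0, 1) example above, the estimated CDF, and the Binomial coin-flip example can all be checked by simulation. The following is a minimal sketch using only the Python standard library; the sample sizes and seed are arbitrary choices, not part of the notes.

```python
import random
import statistics

random.seed(0)
n = 200_000

# X ~ Uniform(0, 1): E[X^2] = 1/3 and Var[X] = 1/3 - (1/2)^2 = 1/12.
xs = [random.random() for _ in range(n)]
second_moment = sum(x * x for x in xs) / n
variance = statistics.pvariance(xs)
print(second_moment)  # close to 1/3
print(variance)       # close to 1/12

# The CDF F_X(x) = P(X <= x), estimated as the fraction of samples <= x.
cdf_at_quarter = sum(x <= 0.25 for x in xs) / n
print(cdf_at_quarter)  # close to F_X(0.25) = 0.25

# X ~ Binomial(10, 0.5): number of heads in 10 fair coin flips
# (the running example from Section 2).
heads = [sum(random.random() < 0.5 for _ in range(10)) for _ in range(20_000)]
print(sum(heads) / len(heads))  # close to n*p = 5
```

Monte Carlo estimates like these converge at rate O(1/√n), so with a few hundred thousand samples the empirical values agree with the closed forms to roughly two decimal places.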
3.4 Conditional distributions

Conditional distributions seek to answer the question: what is the probability distribution over Y, when we know that X must take on a certain value x? In the discrete case, the conditional probability mass function of Y given X is simply

    p_{Y|X}(y|x) = p_{XY}(x, y) / p_X(x),

assuming that p_X(x) ≠ 0.

In the continuous case, the situation is technically a little more complicated because the probability that a continuous random variable X takes on a specific value x is equal to zero.⁴ Ignoring this technical point, we simply define, by analogy to the discrete case, the conditional probability density of Y given X = x to be

    f_{Y|X}(y|x) = f_{XY}(x, y) / f_X(x),

provided f_X(x) ≠ 0.

3.5 Bayes's rule

A useful formula that often arises when trying to derive an expression for the conditional probability of one variable given another is Bayes's rule. In the case of discrete random variables X and Y,

    P_{Y|X}(y|x) = P_{XY}(x, y) / P_X(x) = P_{X|Y}(x|y) P_Y(y) / Σ_{y′ ∈ Val(Y)} P_{X|Y}(x|y′) P_Y(y′).

If the random variables X and Y are continuous,

    f_{Y|X}(y|x) = f_{XY}(x, y) / f_X(x) = f_{X|Y}(x|y) f_Y(y) / ∫_{−∞}^{∞} f_{X|Y}(x|y′) f_Y(y′) dy′.

3.6 Independence

Two random variables X and Y are independent if F_{XY}(x, y) = F_X(x) F_Y(y) for all values of x and y. Equivalently:

- For discrete random variables, p_{XY}(x, y) = p_X(x) p_Y(y) for all x ∈ Val(X), y ∈ Val(Y).
- For discrete random variables, p_{Y|X}(y|x) = p_Y(y) whenever p_X(x) ≠ 0, for all y ∈ Val(Y).
- For continuous random variables, f_{XY}(x, y) = f_X(x) f_Y(y) for all x, y ∈ R.
- For continuous random variables, f_{Y|X}(y|x) = f_Y(y) whenever f_X(x) ≠ 0, for all y ∈ R.
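The discrete form of Bayes's rule can be made concrete with a small numeric sketch. The probabilities below (a rare condition Y and a noisy binary test X) are hypothetical numbers chosen for illustration, not taken from the notes.

```python
# Hypothetical prior P(Y = y): the condition is rare.
p_y = {0: 0.99, 1: 0.01}

# Hypothetical likelihood P(X = 1 | Y = y): the test is noisy.
p_x1_given_y = {1: 0.95, 0: 0.05}

# Bayes's rule: P(Y=1 | X=1) = P(X=1|Y=1) P(Y=1) / sum_y' P(X=1|Y=y') P(Y=y')
numerator = p_x1_given_y[1] * p_y[1]
denominator = sum(p_x1_given_y[y] * p_y[y] for y in p_y)
posterior = numerator / denominator
print(posterior)  # 0.0095 / 0.059, roughly 0.161
```

Even with a fairly accurate test, the posterior probability of the condition given a positive result is only about 16%, because the sum in the denominator is dominated by false positives from the large Y = 0 population.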
[4] To get around this, a more reasonable way to calculate the conditional CDF is

    F_{Y|X}(y, x) = lim_{Δx → 0} P(Y ≤ y | x ≤ X ≤ x + Δx).

It can easily be seen that if F(x, y) is differentiable in both x and y, then

    F_{Y|X}(y, x) = ∫_{−∞}^{y} f_{X,Y}(x, α) / f_X(x) dα,

and therefore we define the conditional PDF of Y given X = x in the following way:

    f_{Y|X}(y|x) = f_{XY}(x, y) / f_X(x).

4.1 Basic properties

We can define the joint distribution function of X₁, X₂, ..., Xₙ, the joint probability density function of X₁, X₂, ..., Xₙ, the marginal probability density function of X₁, and the conditional probability density function of X₁ given X₂, ..., Xₙ, as

    F_{X₁,...,Xₙ}(x₁, ..., xₙ) = P(X₁ ≤ x₁, X₂ ≤ x₂, ..., Xₙ ≤ xₙ)

    f_{X₁,...,Xₙ}(x₁, ..., xₙ) = ∂ⁿ F_{X₁,...,Xₙ}(x₁, ..., xₙ) / (∂x₁ ⋯ ∂xₙ)

    f_{X₁}(x₁) = ∫_{−∞}^{∞} ⋯ ∫_{−∞}^{∞} f_{X₁,...,Xₙ}(x₁, ..., xₙ) dx₂ ⋯ dxₙ

    f_{X₁|X₂,...,Xₙ}(x₁ | x₂, ..., xₙ) = f_{X₁,...,Xₙ}(x₁, ..., xₙ) / f_{X₂,...,Xₙ}(x₂, ..., xₙ)

To calculate the probability of an event A ⊆ Rⁿ we have

    P((x₁, ..., xₙ) ∈ A) = ∫_{(x₁,...,xₙ) ∈ A} f_{X₁,...,Xₙ}(x₁, ..., xₙ) dx₁ dx₂ ⋯ dxₙ.        (4)

Chain rule: From the definition of conditional probabilities for multiple random variables, one can show that

    f(x₁, ..., xₙ) = f(xₙ | x₁, ..., x_{n−1}) f(x₁, ..., x_{n−1})
                   = f(xₙ | x₁, ..., x_{n−1}) f(x_{n−1} | x₁, ..., x_{n−2}) f(x₁, ..., x_{n−2})
                   = ⋯ = f(x₁) ∏_{i=2}^{n} f(xᵢ | x₁, ..., x_{i−1}).

Independence: For multiple events A₁, ..., A_k, we say that A₁, ..., A_k are mutually independent if for any subset S ⊆ {1, 2, ..., k}, we have

    P(∩_{i ∈ S} Aᵢ) = ∏_{i ∈ S} P(Aᵢ).

Likewise, we say that random variables X₁, ..., Xₙ are independent if

    f(x₁, ..., xₙ) = f(x₁) f(x₂) ⋯ f(xₙ).

Here, the definition of mutual independence is simply the natural generalization of independence of two random variables to multiple random variables.

Independent random variables arise often in machine learning algorithms where we assume that the training examples belonging to the training set represent independent samples from some unknown probability distribution. To make the significance of independence clear, consider a "bad" training set in which we first sample a single training example (x⁽¹⁾, y⁽¹⁾) from some unknown distribution, and then add m − 1 copies of the exact same training example to the training set. In this case, we have (with some abuse of notation)

    P((x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᵐ⁾, y⁽ᵐ⁾)) ≠ ∏_{i=1}^{m} P(x⁽ⁱ⁾, y⁽ⁱ⁾).

Despite the fact that the training set has size m, the examples are not independent! While clearly the procedure described here is not a sensible method for building a training set for a machine learning algorithm, it turns out that in practice, non-independence of samples does come up often, and it has the effect of reducing the "effective size" of the training set.

4.3 The multivariate Gaussian distribution

A vector-valued random variable X = [X₁ ⋯ Xₙ]ᵀ with mean μ ∈ Rⁿ and symmetric positive definite covariance matrix Σ ∈ Rⁿˣⁿ is said to have a multivariate Gaussian (or multivariate normal) distribution if its joint density is

    f(x; μ, Σ) = (1 / ((2π)^{n/2} |Σ|^{1/2})) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)).

We write this as X ∼ N(μ, Σ). Notice that in the case n = 1, this reduces to the regular definition of a normal distribution with mean parameter μ₁ and variance Σ₁₁.

Generally speaking, Gaussian random variables are extremely useful in machine learning and statistics for two main reasons. First, they are extremely common when modeling "noise" in statistical algorithms. Quite often, noise can be considered to be the accumulation of a large number of small independent random perturbations affecting the measurement process; by the Central Limit Theorem, summations of independent random variables will tend to "look Gaussian." Second, Gaussian random variables are convenient for many analytical manipulations, because many of the integrals involving Gaussian distributions that arise in practice have simple closed-form solutions. We will encounter this later in the course.

5 Other resources

A good textbook on probability at the level needed for CS229 is A First Course in Probability by Sheldon Ross.
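The Central Limit Theorem intuition above ("summations of independent random variables will tend to look Gaussian") can be illustrated numerically. The sketch below sums i.i.d. Uniform(0, 1) variables, standardizes the sums, and compares the tail mass against the standard normal values; the choice of 48 summands and the sample sizes are arbitrary illustration parameters.

```python
import random

random.seed(1)

# Each Uniform(0, 1) variable has mean 1/2 and variance 1/12, so a sum of
# k independent copies has mean k/2 and variance k/12.
k, trials = 48, 50_000
sums = [sum(random.random() for _ in range(k)) for _ in range(trials)]
mean, sd = k / 2, (k / 12) ** 0.5
z = [(s - mean) / sd for s in sums]

# For a standard normal Z, P(|Z| <= 1) is about 0.6827 and P(|Z| <= 2)
# is about 0.9545; the standardized sums should match closely.
frac_within_1 = sum(abs(v) <= 1 for v in z) / trials
frac_within_2 = sum(abs(v) <= 2 for v in z) / trials
print(frac_within_1)
print(frac_within_2)
```

Note that the standardization uses the k/2 mean and k/12 variance that follow from the independence of the summands: the variance of a sum of independent random variables is the sum of their variances.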