Technical Report CS2007-0898
Department of Computer Science and Engineering
University of California, San Diego

A general agnostic active learning algorithm

Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni¹

Abstract

We present a simple, agnostic active learning algorithm that works for any hypothesis class of bounded VC dimension, and any data distribution. Our algorithm extends a scheme of Cohn, Atlas, and Ladner [6] to the agnostic setting, by (1) reformulating it using a reduction to supervised learning and (2) showing how to apply generalization bounds even for the non-i.i.d. samples that result from selective sampling. We provide a general characterization of the label complexity of our algorithm. This quantity is never more than the usual PAC sample complexity of supervised learning, and is exponentially smaller for some hypothesis classes and distributions. We also demonstrate improvements experimentally.

1 Introduction

Active learning addresses the issue that, in many applications, labeled data typically comes at a higher cost (e.g. in time, effort) than unlabeled data. An active learner is given unlabeled data and must pay to view any label. The hope is that significantly fewer labeled examples are used than in the supervised (non-active) learning model. Active learning applies to a range of data-rich problems such as genomic sequence annotation and speech recognition.

In this paper we formalize, extend, and provide label complexity guarantees for one of the earliest and simplest approaches to active learning, one due to Cohn, Atlas, and Ladner [6]. The scheme of [6] examines data one by one in a stream and requests the label of any data point about which it is currently unsure. For example, suppose the hypothesis class consists of linear separators in the plane, and assume that the data is linearly separable. Let the first six data points be labeled as follows.

[Figure: six linearly separable labeled points in the plane, with an arrow marking a seventh, unlabeled point on the positive side.]

The learner does not need to request the label of the seventh point (indicated by the arrow) because it is not unsure about the label: any straight line with the positively and negatively labeled points on opposite sides puts the seventh point on the positive side. Put another way, the point is not in the region of uncertainty [6], the portion of the data space for which there is disagreement among hypotheses consistent with the present labeled data. Although very elegant and intuitive, this approach to active learning faces two problems:

1. Explicitly maintaining the region of uncertainty can be computationally cumbersome.
2. Data is usually not perfectly separable.

Our main contribution is to address these problems. We provide a simple generalization of the selective sampling scheme of [6] that tolerates adversarial noise and never requests many more labels than a standard agnostic supervised learner would to learn a hypothesis with the same error.

¹ Email: {dasgupta,djhsu,cmontel}@cs.ucsd.edu

In the previous example, an agnostic active learner (one that does not assume a perfect separator exists) is actually still uncertain about the label of the seventh point, because all six of the previous labels could be inconsistent with the best separator. Therefore, it should still request the label. On the other hand, after enough points have been labeled, if an unlabeled point occurs at the position shown below, chances are its label is not needed.

[Figure: a larger labeled sample, with an arrow marking an unlabeled point lying well inside one labeled region.]
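Returning for a moment to the separable example above, the following minimal sketch (ours, not the report's; the stream, the oracle, and the interval representation of the version space are illustrative assumptions) shows what Cohn-Atlas-Ladner-style selective sampling looks like for the simplest class, thresholds on the line. A label is requested only when hypotheses consistent with the labels seen so far disagree on the new point.

# Illustrative sketch of separable-case selective sampling (Cohn-Atlas-Ladner style)
# for thresholds h_t(x) = +1 if x >= t else -1. All names are hypothetical.
import random

def cal_selective_sampling(stream, oracle):
    """Request a label only when thresholds consistent with past labels disagree on x."""
    lo, hi = 0.0, 1.0              # version space: consistent thresholds lie in [lo, hi]
    labeled = []
    for x in stream:
        # For thresholds, the region of uncertainty is exactly the interval (lo, hi).
        if lo < x < hi:
            y = oracle(x)          # pay for a label
            labeled.append((x, y))
            if y == +1:
                hi = min(hi, x)    # threshold must be at or below x
            else:
                lo = max(lo, x)    # threshold must be above x
        # else: every consistent threshold agrees on x, so no label is needed
    return labeled, (lo, hi)

if __name__ == "__main__":
    random.seed(0)
    oracle = lambda x: +1 if x >= 0.5 else -1          # separable data, target threshold 0.5
    stream = [random.random() for _ in range(1000)]
    labeled, version_space = cal_selective_sampling(stream, oracle)
    print(len(labeled), "labels requested out of", len(stream))

For this toy class the number of labels requested grows only roughly logarithmically with the stream length, but the sketch also exhibits the two limitations listed above: the interval (lo, hi) is an explicit representation of the region of uncertainty, and the update is only valid because the data is perfectly separable.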
To extend the notion of uncertainty to the agnostic setting, we divide the observed data points into two groups, Ŝ and T. Set Ŝ contains the data for which we did not request labels. We keep these points around and assign them the label we think they should have. Set T contains the data for which we explicitly requested labels. We will manage things in such a way that the data in Ŝ are always consistent with the best separator in the class. Thus, somewhat counter-intuitively, the labels in Ŝ are completely reliable whereas the labels in T could be inconsistent with the best separator.

To decide whether we are uncertain about the label of a new point x, we reduce to supervised learning: we learn hypotheses h₊₁ and h₋₁ such that h₊₁ is consistent with all the labels in Ŝ ∪ {(x, +1)} and has minimal empirical error on T, while h₋₁ is consistent with all the labels in Ŝ ∪ {(x, −1)} and has minimal empirical error on T. If, say, the true error of the hypothesis h₊₁ is much larger than that of h₋₁, we can safely infer that the best separator must also label x with −1 without requesting a label; if the error difference is only modest, we explicitly request a label. Standard generalization bounds for an i.i.d. sample let us perform this test by comparing empirical errors on Ŝ ∪ T.

The last claim may sound awfully suspicious, because Ŝ ∪ T is not i.i.d.! Indeed, this is in a sense the core sampling problem that has always plagued active learning: the labeled sample T might not be i.i.d. (due to the filtering of examples based on an adaptive criterion), while Ŝ only contains unlabeled examples (with made-up labels). Nevertheless, we prove that in our case, it is in fact correct to effectively pretend Ŝ ∪ T is an i.i.d. sample. A direct consequence is that the label complexity of our algorithm (the number of labels requested before achieving a desired error) is never much more than the usual sample complexity of supervised learning (and in some cases, is significantly less).

An important algorithmic detail is the specific choice of generalization bound we use in deciding whether to request a label or not. A small polynomial difference in generalization rates (between n^(−1/2) and n^(−1), say) can get magnified into an exponential difference in label complexity, so it is crucial for us to use a good bound. We use a normalized bound that takes into account the empirical error (computed on Ŝ ∪ T, again not an i.i.d. sample) of the hypothesis in question.

Earlier work on agnostic active learning [1, 12] has been able to upper bound label complexity in terms of a parameter of the hypothesis class (and data distribution) called the disagreement coefficient. We give label complexity bounds for our method based on this same quantity, and we get a better dependence on it, linear rather than quadratic.

To summarize, in this paper we present and analyze a simple agnostic active learning algorithm for general hypothesis classes of bounded VC dimension. It extends the selective sampling scheme of Cohn et al. [6] to the agnostic setting, using normalized generalization bounds, which we apply in a simple but subtle manner. For certain hypothesis classes and distributions, our analysis yields improved label complexity guarantees over the standard sample complexity of supervised learning. We also demonstrate such improvements experimentally.
1.1 Related work

A large number of algorithms have been proposed for active learning, under a variety of learning models. In this section, we consider only methods whose generalization behavior has been rigorously analyzed.

An early landmark result, from 1989, was the selective sampling scheme of Cohn, Atlas, and Ladner [6] described above. This simple active learning algorithm, designed for separable data, has been the inspiration for a lot of subsequent work. A few years later, the seminal work of Freund, Seung, Shamir, and Tishby [9] analyzed an algorithm called query-by-committee that operates in a Bayesian setting and uses an elegant sampling trick for deciding when to query points. The core primitive required by this algorithm is the ability to sample randomly from the posterior over the hypothesis space. In some cases this can be achieved efficiently [10], for instance when the hypothesis class consists of linear separators in R^d (with a uniform prior) and the data is distributed uniformly over the surface of the unit sphere in R^d. In this particular setting, the authors showed that the number of labels required to achieve generalization error ε is just O(d log(1/ε)), exponentially lower than the usual supervised sample complexity of O(d/ε). Subsequently, Dasgupta, Kalai, and Monteleoni [8] showed that a simple variant of the perceptron algorithm also achieves this label complexity, even for a worst-case (non-Bayesian) choice of target hypothesis.

All the work mentioned so far assumes separable data. This case was studied abstractly by Dasgupta [7], who found that a parameter called the splitting index loosely characterizes the label complexity of actively learning hypothesis classes of bounded VC dimension. As yet, it is not known how to realize this label complexity in a computationally efficient way, except in special cases.

A natural way to formulate active learning in the agnostic setting is to ask the learner to return a hypothesis with error at most ν + ε (where ν is the error of the best hypothesis in the specified class) using as few labels as possible. A basic constraint on the label complexity was pointed out by Kääriäinen [14], who showed that for any ν ∈ (0, 1/2), there are data distributions that force any active learner that achieves error at most ν + ε to request Ω((ν/ε)²) labels.

The first rigorously-analyzed agnostic active learning algorithm, called A², was developed recently by Balcan, Beygelzimer, and Langford [1]. Like Cohn-Atlas-Ladner [6], this algorithm uses a region of uncertainty, although the lack of separability complicates matters and A² ends up explicitly maintaining an ε-net of the hypothesis space. Subsequently, Hanneke [12] characterized the label complexity of the A² algorithm in terms of a parameter called the disagreement coefficient. Another thread of work focuses on agnostic learning of thresholds for data that lie on a line; in this case, a precise characterization of label complexity can be given [4, 5].

These previous results either make strong distributional assumptions (such as separability, or a uniform input distribution) [2, 6-9, 13], or else they are computationally prohibitive in general [1, 7, 9]. Our work was inspired by both [6] and [1], and we have built heavily upon their insights.² We bound the label complexity of our method in terms of the same parameter as used for A² [12], and get a somewhat better dependence (linear rather than quadratic).

A common feature of Cohn-Atlas-Ladner, A², and our method is that they are all fairly non-aggressive in their choice of query points. They are content with querying all points on which there is even a small amount of uncertainty, rather than, for instance, pursuing the maximally uncertain point. Recently, Balcan, Broder, and Zhang [2] showed that for the hypothesis class of linear separators, under distributional assumptions on the data (for instance, a uniform distribution over the unit sphere), a more aggressive strategy can yield better label complexity.
² It has been noted that the Cohn-Atlas-Ladner scheme can easily be made tractable using a reduction to supervised learning in the separable case [16, p. 68]. Although our algorithm is most naturally seen as an extension of Cohn-Atlas-Ladner, a similar reduction to supervised learning (in the agnostic setting) can be used for A², as we demonstrate in Appendix B.

2 Preliminaries

2.1 Learning framework and uniform convergence

Let X be the input space, D a distribution over X × {±1}, and H a class of hypotheses h : X → {±1} with VC dimension vcdim(H) = d. Recall that the nth shattering coefficient S(H, n) is defined as the maximum number of ways in which H can label a set of n points; by Sauer's lemma, this is at most O(n^d) [3, p. 175]. We denote by D_X the marginal of D over X.

In our active learning model, the learner receives unlabeled data sampled from D_X; for any sampled point x, it can optionally request the label y sampled from the conditional distribution at x. This process can be viewed as sampling (x, y) from D and revealing only x to the learner, keeping the label y hidden unless the learner explicitly requests it. The error of a hypothesis h under D is err_D(h) = Pr_{(x,y)∼D}[h(x) ≠ y], and on a finite sample Z ⊂ X × {±1}, the empirical error of h is

  err(h, Z) = (1/|Z|) Σ_{(x,y)∈Z} 1l[h(x) ≠ y],

where 1l[·] is the 0-1 indicator function. We assume for simplicity that the minimal error ν = inf{err_D(h) : h ∈ H} is achieved by a hypothesis h* ∈ H.

Our algorithm and analysis use the following normalized uniform convergence bound [3, p. 200].

Lemma 1 (Vapnik and Chervonenkis [17]). Let F be a family of measurable functions f : Z → {0, 1} over a space Z. Denote by E_Z[f] the empirical average of f over a subset Z ⊆ Z. Let α_n = √((4/n) ln(8 S(F, 2n)/δ)). If Z is an i.i.d. sample of size n from a fixed distribution over Z, then, with probability at least 1 − δ, for all f ∈ F,

  −min( α_n √(E_Z[f]), α_n² + α_n √(E[f]) ) ≤ E[f] − E_Z[f] ≤ min( α_n² + α_n √(E_Z[f]), α_n √(E[f]) ).

2.2 Disagreement coefficient

The active learning algorithm we will shortly describe is not very aggressive: rather than seeking out points that are maximally informative, it queries every point that it is somewhat unsure about. The early work of Cohn-Atlas-Ladner [6] and the recent A² algorithm [1] are similarly mellow in their querying strategy. The label complexity improvements achievable by such algorithms are nicely captured by a parameter called the disagreement coefficient, introduced recently by Hanneke [12] in his analysis of A².

To motivate the disagreement coefficient, imagine that we are in the midst of learning, and that our current hypothesis h_t has error at most β. Suppose we even know the value of β. Then the only candidate hypotheses we still need to consider are those that differ from h_t on at most a 2β fraction of the input distribution, because all other hypotheses must have error more than β. To make this a bit more formal, we impose a (pseudo-)metric on the space of hypotheses, as follows.

Definition 1. The disagreement pseudo-metric on H is defined by ρ(h, h′) = Pr_{D_X}[h(x) ≠ h′(x)] for h, h′ ∈ H. Let B(h, r) = {h′ ∈ H : ρ(h, h′) ≤ r} be the ball centered around h of radius r.

Returning to our earlier scenario, we need only consider hypotheses in B(h_t, 2β), and thus, when we see a new data point x, there is no sense in asking for its label if all of B(h_t, 2β) agrees on what this label should be. The only points we potentially need to query are

  {x : h(x) ≠ h′(x) for some h, h′ ∈ B(h_t, 2β)}.

Intuitively, the disagreement coefficient captures how the measure of this set grows with β. The following is a slight variation of the original definition of Hanneke [12].

Definition 2. The disagreement coefficient θ = θ(D, H, ε) ≥ 0 is

  θ = sup { Pr_{D_X}[∃ h ∈ B(h*, r) such that h(x) ≠ h*(x)] / r : r ≥ ν + ε },

where h* = arg inf_{h∈H} err_D(h) and ν = err_D(h*).

Clearly, θ ≤ 1/(ν + ε); furthermore, it is a constant bounded independently of 1/(ν + ε) in several cases previously considered in the literature [12]. For example, if H is homogeneous linear separators and D_X is the uniform distribution over the unit sphere in R^d, then θ = Θ(√d).

Algorithm 1
  Input: stream (x₁, x₂, ..., x_m) i.i.d. from D_X.
  Initially, Ŝ₀ := ∅ and T₀ := ∅.
  For n = 1, 2, ..., m:
    1. For each ŷ ∈ {±1}, let h_ŷ := LEARN_H(Ŝ_{n−1} ∪ {(x_n, ŷ)}, T_{n−1}).
    2. If err(h_{−ŷ}, Ŝ_{n−1} ∪ T_{n−1}) − err(h_ŷ, Ŝ_{n−1} ∪ T_{n−1}) > Δ_{n−1} (or if no such h_{−ŷ} is found) for some ŷ ∈ {±1}, then Ŝ_n := Ŝ_{n−1} ∪ {(x_n, ŷ)} and T_n := T_{n−1}.
    3. Else request y_n; Ŝ_n := Ŝ_{n−1} and T_n := T_{n−1} ∪ {(x_n, y_n)}.
  Return h_f := LEARN_H(Ŝ_m, T_m).

Figure 1: The agnostic selective sampling algorithm. See (1) for a possible setting of Δ_n.
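As a concrete rendering of Figure 1, here is a minimal Python sketch (ours, not part of the report). It treats LEARN_H as an abstract callable learn_h(A, B) returning a hypothesis consistent with A with minimal empirical error on B, or None if no consistent hypothesis exists (Section 3.1 below), and it instantiates the threshold with the adaptive setting (1) of Section 3.2 below, using a crude (2n)^d bound on S(H, 2n). All identifiers and the confidence default are illustrative assumptions.

# Illustrative sketch of Algorithm 1 (Figure 1). learn_h(A, B) is assumed to return a
# hypothesis in H consistent with A and of minimal empirical error on B, or None.
import math

def empirical_error(h, sample):
    return sum(h(x) != y for x, y in sample) / max(1, len(sample))

def delta(n, d, confidence=0.05, err_plus=0.0, err_minus=0.0):
    """Threshold of equation (1), with a crude (2n)^d bound used for S(H, 2n)."""
    if n == 0:
        return float("inf")
    beta = math.sqrt((4.0 / n) * (math.log(8 * (n * n + n) / confidence)
                                  + 2 * d * math.log(2 * n)))
    return beta ** 2 + beta * (math.sqrt(err_plus) + math.sqrt(err_minus))

def algorithm1(stream, oracle, learn_h, d):
    S_hat, T = [], []                       # inferred-label set and queried-label set
    for n, x in enumerate(stream, start=1):
        h = {y: learn_h(S_hat + [(x, y)], T) for y in (+1, -1)}
        inferred = None
        for y in (+1, -1):
            if h[y] is None:
                continue
            if h[-y] is None:               # no consistent hypothesis disagrees: infer y
                inferred = y
                break
            err_y = empirical_error(h[y], S_hat + T)
            err_other = empirical_error(h[-y], S_hat + T)
            if err_other - err_y > delta(n - 1, d, err_plus=err_y, err_minus=err_other):
                inferred = y                # the best hypothesis in H must label x with y
                break
        if inferred is not None:
            S_hat.append((x, inferred))     # keep x with its inferred label
        else:
            T.append((x, oracle(x)))        # request the true label
    return learn_h(S_hat, T)

As in the report, only differences of empirical errors on Ŝ ∪ T are ever compared against the threshold; the true labels of points placed in Ŝ are never needed.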
3 Agnostic selective sampling

Here we state and analyze our general algorithm for agnostic active learning. The main techniques employed by the algorithm are reductions to a supervised learning task and generalization bounds applied to differences of empirical errors.

3.1 A general algorithm for agnostic active learning

Figure 1 states our algorithm in full generality. The input is a stream of m unlabeled examples drawn i.i.d. from D_X; for the time being, m can be thought of as ~O((d/ε)(1 + ν/ε)), where ε is the accuracy parameter.³ The algorithm operates by reduction to a special kind of supervised learning that includes hard constraints. For A, B ⊆ X × {±1}, let LEARN_H(A, B) denote a supervised learner that returns a hypothesis h ∈ H consistent with A, and with minimum error on B. If there is no hypothesis consistent with A, it reports this. For some simple hypothesis classes like intervals on the line, or rectangles in R², it is easy to construct such a learner (a concrete sketch for the simplest such class appears at the end of this subsection). For more complex classes like linear separators, the main bottleneck is the hardness of minimizing the 0-1 loss on B (that is, the hardness of agnostic supervised learning). If a convex upper bound on this loss function is used instead, as in the case of soft-margin support vector machines, it is straightforward to incorporate hard constraints; but at present the rigorous guarantees accompanying our algorithm apply only if 0-1 loss is used.

³ The ~O(·) notation suppresses log(1/δ) and terms polylogarithmic in those that appear.

Algorithm 1 maintains two sets of labeled examples, Ŝ and T, each of which is initially empty. Upon receiving x_n, it learns two⁴ hypotheses, h_ŷ = LEARN_H(Ŝ ∪ {(x_n, ŷ)}, T) for ŷ ∈ {±1}, and then compares their empirical errors on Ŝ ∪ T. If the difference is large enough, it is possible to infer how h* labels x_n (as we show in Lemma 3). In this case, the algorithm adds x_n, with this inferred label, to Ŝ. Otherwise, the algorithm requests the label y_n and adds (x_n, y_n) to T. Thus, Ŝ contains examples with inferred labels consistent with h*, and T contains examples with their requested labels. Because h* might err on some examples in T, we just insist that LEARN_H find a hypothesis with minimal error on T. Meanwhile, by construction, h* is consistent with Ŝ (as we shall see), so we require LEARN_H to only consider hypotheses consistent with Ŝ.

⁴ If LEARN_H cannot find a hypothesis consistent with Ŝ ∪ {(x_n, y)} for some y, then assuming h* is consistent with Ŝ, it must be that h*(x_n) = −y. In this case, we simply add (x_n, −y) to Ŝ, regardless of the error difference.
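To illustrate how such a constrained learner can be built for the simplest case, here is a sketch (ours; an assumption-laden toy, not the report's implementation) of LEARN_H for one-dimensional thresholds h_t(x) = sign(x − t): it enumerates the finitely many distinct thresholds induced by the data, discards candidates violating the hard constraints in A, and returns one minimizing 0-1 error on B.

# Illustrative LEARN_H for thresholds on the line, h_t(x) = +1 iff x >= t.
# A carries hard constraints (must be classified correctly); 0-1 error is minimized on B.
def learn_thresholds(A, B):
    xs = sorted({x for x, _ in A} | {x for x, _ in B})
    candidates = [float("-inf")] + xs + [float("inf")]   # covers all distinct labelings of the data
    best = None
    for t in candidates:
        h = lambda x, t=t: 1 if x >= t else -1
        if any(h(x) != y for x, y in A):
            continue                                      # violates a hard constraint
        mistakes = sum(h(x) != y for x, y in B)
        if best is None or mistakes < best[0]:
            best = (mistakes, h)
    return None if best is None else best[1]

An analogous enumeration over pairs of endpoints (for intervals) or corner coordinates (for axis-aligned rectangles) works the same way; for richer classes one would substitute a constrained empirical risk minimizer, subject to the caveats about 0-1 loss discussed above.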
3.2 Bounds for error differences

We still need to specify Δ_n, the threshold value for error differences that determines whether the algorithm requests a label or not. Intuitively, Δ_n should reflect how closely empirical errors on a sample approximate true errors on the distribution D. Note that our algorithm is modular with respect to the choice of Δ_n, so, for example, it can be customized for a particular input distribution and hypothesis class. Below we provide a simple and adaptive setting that works for any distribution and hypothesis class with finite VC dimension.

The setting of Δ_n can only depend on observable quantities, so we first clarify the distinction between empirical errors on Ŝ_n ∪ T_n and those with respect to the true (hidden) labels.

Definition 3. Let Ŝ_n and T_n be as defined in Algorithm 1. Let S_n (shedding the hat accent) be the set of labeled examples identical to those in Ŝ_n, except with the true hidden labels swapped in. Thus, for example, S_n ∪ T_n is an i.i.d. sample from D of size n. Finally, let err_n(h) = err(h, S_n ∪ T_n) and êrr_n(h) = err(h, Ŝ_n ∪ T_n).

It is straightforward to apply Lemma 1 to empirical errors on S_n ∪ T_n, i.e. to err_n(h), but we cannot use such bounds algorithmically: we do not request the true labels for points in Ŝ_n and thus cannot reliably compute err_n(h). What we can compute are error differences êrr_n(h) − êrr_n(h′) for pairs of hypotheses (h, h′) that agree on (and thus make the same mistakes on) Ŝ_n, since for such pairs, we have êrr_n(h) − êrr_n(h′) = err_n(h) − err_n(h′).⁵

⁵ This observation is enough to immediately justify the use of additive generalization bounds for Δ_n. However, we need to use normalized (multiplicative) bounds to achieve a better label complexity.

These empirical error differences are means of {−1, 0, +1}-valued random variables. We need to rewrite them in terms of {0, 1}-valued random variables for some of the concentration bounds we will be using.

Definition 4. For a pair (h, h′) ∈ H × H, define γ⁺_{h,h′}(x, y) = 1l[h(x) ≠ y ∧ h′(x) = y] and γ⁻_{h,h′}(x, y) = 1l[h(x) = y ∧ h′(x) ≠ y].

With this notation, we have err(h, Z) − err(h′, Z) = E_Z[γ⁺_{h,h′}] − E_Z[γ⁻_{h,h′}] for any Z ⊆ X × {±1}. Now, applying Lemma 1 to G = {γ⁺_{h,h′} : (h, h′) ∈ H × H} ∪ {γ⁻_{h,h′} : (h, h′) ∈ H × H}, and noting that S(G, n) ≤ S(H, n)², gives the following lemma.

Lemma 2. Let β_n = √((4/n) ln(8 S(H, 2n)²/δ)). With probability at least 1 − δ over an i.i.d. sample Z of size n from D, we have for all (h, h′) ∈ H × H,

  err(h, Z) − err(h′, Z) ≤ err_D(h) − err_D(h′) + β_n² + β_n ( √(E_Z[γ⁺_{h,h′}]) + √(E_Z[γ⁻_{h,h′}]) ).

With Z = S_n ∪ T_n, the error difference on the left-hand side is err_n(h) − err_n(h′), which can be empirically determined because it is equal to êrr_n(h) − êrr_n(h′). But the terms in the square roots on the right-hand side still pose a problem, which we fix next.

Corollary 1. Let β_n = √((4/n) ln(8(n² + n) S(H, 2n)²/δ)). Then, with probability at least 1 − δ, for all n ≥ 1 and all (h, h′) ∈ H × H consistent with Ŝ_n, we have

  êrr_n(h) − êrr_n(h′) ≤ err_D(h) − err_D(h′) + β_n² + β_n ( √(êrr_n(h)) + √(êrr_n(h′)) ).

Proof. For each n ≥ 1, we apply Lemma 2 using Z = S_n ∪ T_n and confidence δ/(n² + n). Then, we apply a union bound over all n ≥ 1. Thus, with probability at least 1 − δ, the bounds in Lemma 2 hold simultaneously for all n ≥ 1 and all (h, h′) ∈ H² with S_n ∪ T_n in place of Z. The corollary follows because êrr_n(h) − êrr_n(h′) = err_n(h) − err_n(h′), and because E_{S_n∪T_n}[γ⁺_{h,h′}] ≤ êrr_n(h) and E_{S_n∪T_n}[γ⁻_{h,h′}] ≤ êrr_n(h′). To see the first of these expectation bounds, witness that because h and h′ agree on S_n,

  E_{S_n∪T_n}[γ⁺_{h,h′}] = (1/n) Σ_{(x,y)∈T_n} 1l[h(x) ≠ y ∧ h′(x) = y] ≤ (1/n) Σ_{(x,y)∈T_n} 1l[h(x) ≠ y] = êrr_n(h).

The second bound is similar. ∎

Corollary 1 implies that we can effectively apply the normalized uniform convergence bounds from Lemma 1 to empirical error differences on Ŝ_n ∪ T_n, even though Ŝ_n ∪ T_n is not an i.i.d. sample from D. In light of this, we use the following setting of Δ_n:

  Δ_n := β_n² + β_n ( √(êrr_n(h₊₁)) + √(êrr_n(h₋₁)) )        (1)

where β_n = √((4/n) ln(8(n² + n) S(H, 2n)²/δ)) = ~O(√((d log n)/n)) as per Corollary 1.

3.3 Correctness and fall-back analysis

We now justify our setting of Δ_n with a correctness proof and fall-back guarantee. The following lemma elucidates how the inferred labels in Ŝ serve as a mechanism for implicitly maintaining a candidate set of hypotheses that always includes h*. The fall-back guarantee then follows almost immediately.

Lemma 3. With probability at least 1 − δ, the hypothesis h* = arg inf_{h∈H} err_D(h) is consistent with Ŝ_n for all n ≥ 0 in Algorithm 1.

Proof. Apply the bounds in Corollary 1 (they hold with probability at least 1 − δ) and proceed by induction on n. The base case is trivial since Ŝ₀ = ∅. Now assume h* is consistent with Ŝ_n. Suppose upon receiving x_{n+1}, we discover êrr_n(h₊₁) − êrr_n(h₋₁) > Δ_n. We will show that h*(x_{n+1}) = −1 (assume both h₊₁ and h₋₁ exist, since it is clear h*(x_{n+1}) = −1 if h₊₁ does not exist). Suppose for the sake of contradiction that h*(x_{n+1}) = +1. We know that êrr_n(h*) ≥ êrr_n(h₊₁) (by the inductive hypothesis h* is consistent with Ŝ_n, and yet the learner chose h₊₁ in preference to it) and êrr_n(h₊₁) − êrr_n(h₋₁) > β_n² + β_n(√(êrr_n(h₊₁)) + √(êrr_n(h₋₁))). In particular, êrr_n(h₊₁) > β_n². Therefore,

  êrr_n(h*) − êrr_n(h₋₁) = (êrr_n(h*) − êrr_n(h₊₁)) + (êrr_n(h₊₁) − êrr_n(h₋₁))
    > √(êrr_n(h₊₁)) (√(êrr_n(h*)) − √(êrr_n(h₊₁))) + β_n² + β_n(√(êrr_n(h₊₁)) + √(êrr_n(h₋₁)))
    ≥ β_n (√(êrr_n(h*)) − √(êrr_n(h₊₁))) + β_n² + β_n(√(êrr_n(h₊₁)) + √(êrr_n(h₋₁)))
    = β_n² + β_n(√(êrr_n(h*)) + √(êrr_n(h₋₁))).

Now Corollary 1 implies that err_D(h*) > err_D(h₋₁), a contradiction. ∎
Theorem 1. Let ν = inf_{h∈H} err_D(h) and d = vcdim(H). There exists a constant c > 0 such that the following holds. If Algorithm 1 is given a stream of m unlabeled examples, then with probability at least 1 − δ, the algorithm returns a hypothesis with error at most

  ν + c ( (1/m)(d log m + log(1/δ)) + √( (ν/m)(d log m + log(1/δ)) ) ).

Proof. Lemma 3 implies that h* is consistent with Ŝ_m with probability at least 1 − δ. Using the same bounds from Corollary 1 (already applied in Lemma 3) on h* and h_f, together with the fact êrr_m(h_f) ≤ êrr_m(h*), we have err_D(h_f) ≤ ν + β_m² + β_m √ν + β_m √(err_D(h_f)), which in turn implies err_D(h_f) ≤ ν + 3β_m² + 2β_m √ν. ∎

So, Algorithm 1 returns a hypothesis with error at most ν + ε when m = ~O((d/ε)(1 + ν/ε)); this is (asymptotically) the usual sample complexity of supervised learning. Since the algorithm requests at most m labels, its label complexity is always at most ~O((d/ε)(1 + ν/ε)).

3.4 Label complexity analysis

We can also bound the label complexity of our algorithm in terms of the disagreement coefficient θ. This yields tighter bounds when θ is bounded independently of 1/(ν + ε).

The key to deriving our label complexity bounds based on θ is noting that the probability of requesting the (n+1)st label is intimately related to θ and β_n.

Lemma 4. There exist constants c₁, c₂ > 0 such that, with probability at least 1 − 2δ, the following holds for all n ≥ 1. Let ŷ = h*(x_{n+1}), where h* = arg inf_{h∈H} err_D(h). Then the probability that Algorithm 1 requests the label y_{n+1} satisfies

  Pr_{x_{n+1}∼D_X}[request y_{n+1}] ≤ Pr_{x_{n+1}∼D_X}[ err_D(h_{−ŷ}) ≤ c₁ν + c₂β_n² ],

where β_n is as defined in Corollary 1 and ν = inf_{h∈H} err_D(h).

Proof. See Appendix A. ∎

Lemma 5. In the same setting as Lemma 4, there exists a constant c > 0 such that Pr_{x_{n+1}∼D_X}[request y_{n+1}] ≤ c θ (ν + β_n²), where θ = θ(D, H, 3β_m² + 2β_m√ν) is the disagreement coefficient, ν = inf_{h∈H} err_D(h), and β_n is as defined in Corollary 1.

Proof. Suppose h*(x_{n+1}) = −1. By the triangle inequality, we have err_D(h₊₁) ≥ ρ(h₊₁, h*) − ν, where ρ is the disagreement pseudo-metric on H (Definition 1). By Lemma 4, this implies that the probability of requesting y_{n+1} is at most the probability that ρ(h₊₁, h*) ≤ (c₁ + 1)ν + c₂β_n², for the constants c₁, c₂ > 0 of Lemma 4. We can choose the constants so that (c₁ + 1)ν + c₂β_n² ≥ ν + 3β_m² + 2β_m√ν. Then the definition of the disagreement coefficient gives the conclusion that Pr_{x_{n+1}∼D_X}[ρ(h₊₁, h*) ≤ (c₁ + 1)ν + c₂β_n²] ≤ θ · ((c₁ + 1)ν + c₂β_n²). ∎

Now we give our main label complexity bound for agnostic active learning.

Theorem 2. Let m be the number of unlabeled data given to Algorithm 1, d = vcdim(H), ν = inf_{h∈H} err_D(h), β_m as defined in Corollary 1, and θ = θ(D, H, 3β_m² + 2β_m√ν). There exists a constant c₁ > 0 such that for any c₂ ≥ 1, with probability at least 1 − 2δ:

1. If ν ≤ (c₂ − 1)β_m², Algorithm 1 returns a hypothesis with error as bounded in Theorem 1, and the expected number of labels requested is at most

  1 + c₁c₂θ ( d log² m + log(1/δ) · log m ).

2. Else, the same holds except the expected number of labels requested is at most

  1 + c₁c₂θ ( νm + d log² m + log(1/δ) · log m ).

Furthermore, if L is the expected number of labels requested as per above, then with probability at least 1 − δ′, the algorithm requests no more than L + √(3L log(1/δ′)) labels.

Proof. Follows from Lemma 5 and a Chernoff bound for the Poisson trials 1l[request y_n]. ∎
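To make explicit the substitution carried out in the discussion that follows, here is a rough calculation (an illustration only, with logarithmic factors treated loosely) of what case 1 of Theorem 2 gives when m is set as in Theorem 1 and the disagreement coefficient is θ = Θ(√d):

% m = ~O((d/eps)(1 + nu/eps)) makes log m = ~O(log(1/eps)), so case 1 of Theorem 2 reads
\[
  1 + c_1 c_2\,\theta\left(d\log^2 m + \log\tfrac{1}{\delta}\,\log m\right)
  \;=\; \tilde{O}\!\left(\theta\, d\,\log^2\tfrac{1}{\epsilon}\right)
  \;=\; \tilde{O}\!\left(d^{3/2}\log^2\tfrac{1}{\epsilon}\right)
  \quad\text{when } \theta = \Theta(\sqrt{d}).
\]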
With the substitution ε = 3β_m² + 2β_m√ν as per Theorem 1, Theorem 2 entails that for any hypothesis class and data distribution for which the disagreement coefficient θ(D, H, ε) is bounded independently of 1/(ν + ε) (see [12] for some examples), Algorithm 1 only needs ~O(θ d log²(1/ε)) labels to achieve error ε, and ~O(θ d (log²(1/ε) + (ν/ε)²)) labels to achieve error ν + ε. The latter matches the dependence on ν/ε in the Ω((ν/ε)²) lower bound [14]. The linear dependence on θ improves on the quadratic dependence shown for A² [12].⁶ For an illustrative consequence of this, suppose D_X is the uniform distribution on the sphere in R^d and H is homogeneous linear separators; in this case, θ = Θ(√d). Then the label complexity of A² depends at least quadratically on the dimension, whereas the corresponding dependence for our algorithm is d^{3/2}. A specially-designed setting of Δ_n (say, specific to the input distribution and hypothesis class) may be able to further reduce the dependence to d (see [2]).

⁶ It may be possible to reduce A²'s quadratic dependence to a linear dependence by using normalized bounds, as we do here.

4 Experiments

We implemented Algorithm 1 in a few simple cases to experimentally demonstrate the label complexity improvements. In each case, the data distribution D_X was uniform over [0, 1]; the stream length was m = 10000, and each experiment was repeated 20 times with different random seeds.

Our first experiment studied linear thresholds on the line. The target hypothesis was fixed to be h*(x) = sign(x − 0.5). For this hypothesis class, we used two different noise models, each of which ensured inf_{h∈H} err_D(h) = err_D(h*) = ν for a pre-specified ν ∈ [0, 1]. The first model was random misclassification: for each point x ∼ D_X, we independently labeled it h*(x) with probability 1 − ν and −h*(x) with probability ν. In the second model (also used in [4]), for each point x ∼ D_X, we independently labeled it +1 with probability (x − 0.5)/(4ν) + 0.5 and −1 otherwise, thus concentrating the noise near the boundary.

Our second experiment studied intervals on the line. Here, we only used random misclassification, but we varied the target interval length p₊ = Pr_{D_X}[h*(x) = +1].

The results show that the number of labels requested by Algorithm 1 was exponentially smaller than the total number of data seen (m) under the first noise model, and was polynomially smaller under the second noise model (see Figure 2; we verified the polynomial vs. exponential distinction on separate log-log scale plots). In the case of intervals, we observe an initial phase (of duration roughly 1/p₊) in which every label is requested, followed by a more efficient phase, confirming the known active-learnability of this class [7, 12]. These improvements show that our algorithm needed significantly fewer labels to achieve the same error as a standard supervised algorithm that uses labels for all points seen.

As a sanity check, we examined the locations of data for which Algorithm 1 requested a label. We looked at two particular runs of the algorithm: the first was with H = intervals, p₊ = 0.2, m = 10000, and ν = 0.1; the second was with H = boxes (d = 2), p₊ = 0.49, m = 1000, and ν = 0.01. In each case, the data distribution was uniform over [0, 1]^d, and the noise model was random misclassification. Figure 3 shows that, early on, labels were requested everywhere. But as the algorithm progressed, label requests concentrated near the boundary of the target hypothesis.
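For concreteness, the following sketch (ours, not the paper's implementation; the clipping of the +1-probability to [0, 1] is our reading of the boundary noise model, since the stated formula can leave [0, 1] for small ν) generates the labeled stream for the thresholds-on-the-line experiment under the two noise models above.

# Illustrative data generators for the thresholds experiment. h*(x) = sign(x - 0.5);
# nu is the error of the best hypothesis in the class.
import random

def h_star(x):
    return 1 if x >= 0.5 else -1

def sample_random_misclassification(nu):
    """Label h*(x), flipped independently with probability nu."""
    x = random.random()
    y = h_star(x) if random.random() > nu else -h_star(x)
    return x, y

def sample_boundary_noise(nu):
    """Label +1 with probability (x - 0.5)/(4*nu) + 0.5, clipped to [0, 1],
    which concentrates the noise near the decision boundary at 0.5."""
    x = random.random()
    p_plus = min(1.0, max(0.0, (x - 0.5) / (4 * nu) + 0.5))
    y = 1 if random.random() < p_plus else -1
    return x, y

if __name__ == "__main__":
    random.seed(0)
    stream = [sample_boundary_noise(nu=0.1) for _ in range(10000)]
    err = sum(y != h_star(x) for x, y in stream) / len(stream)
    print("empirical error of h* on this stream:", err)   # should be close to nu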
Figure 2: Labeling rate plots. The plots show the number of labels requested (vertical axis) versus the total number of points seen (labeled + unlabeled, horizontal axis) using Algorithm 1. (a) H = thresholds: under random misclassification noise with ν = 0 (solid), 0.1 (dashed), 0.2 (dot-dashed); under the boundary noise model with ν = 0.1 (lower dotted), 0.2 (upper dotted). (b) H = intervals: under random misclassification with (p₊, ν) = (0.2, 0.0) (solid), (0.1, 0.0) (dashed), (0.2, 0.1) (dot-dashed), (0.1, 0.1) (dotted).

Figure 3: Locations of label requests. (a) H = intervals, h* = [0.4, 0.6]. The top histogram shows the locations of the first 400 label requests (the x-axis is the unit interval); the bottom histogram is for all (2141) label requests. (b) H = boxes, h* = [0.15, 0.85]². The first 200 requests, the next 200 requests, and the final 109 requests are shown with three different markers.

5 Conclusion and future work

We have presented a simple and natural approach to agnostic active learning. Our extension of the selective sampling scheme of Cohn-Atlas-Ladner [6]

1. simplifies the maintenance of the region of uncertainty with a reduction to supervised learning, and
2. guards against noise with a suitable algorithmic application of generalization bounds.

Our algorithm relies on a threshold parameter Δ_n for comparing empirical errors. We prescribe a very simple and natural choice for Δ_n (a normalized generalization bound from supervised learning), but one could hope for a more clever or aggressive choice, akin to those in [2] for linear separators.

Finding consistent hypotheses when data is separable is often a simple task. In such cases, reduction-based active learning algorithms can be relatively efficient (answering some questions posed in [15]). On the other hand, agnostic supervised learning is computationally intractable for many hypothesis classes (e.g. [11]), and of course, agnostic active learning is at least as hard in the worst case. Our reduction to supervised learning is benign in the sense that the learning problems we need to solve are over samples from the original distribution, so we do not create pathologically hard instances (like those arising from hardness reductions [11]) unless they are inherent in the data. Nevertheless, an important research direction is to develop consistent active learning algorithms that only require solving tractable (e.g. convex) optimization problems. A similar reduction-based scheme may be possible.

6 Acknowledgements

We are grateful to the Engineering Institute (a research and educational partnership between Los Alamos National Laboratory and U.C. San Diego) for supporting the second author with a graduate fellowship, and to the NSF for support under grants IIS-0347646 and IIS-0713540.

References

[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.
[2] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.
[3] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture Notes in Artificial Intelligence, 3176:169-207, 2004.
[4] R. Castro and R. Nowak. Upper and lower bounds for active learning. In Allerton Conference on Communication, Control and Computing, 2006.
[5] R. Castro and R. Nowak. Minimax bounds for active learning. In COLT, 2007.
[6] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
[7] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.
[8] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In COLT, 2005.
[9] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2):133-168, 1997.
[10] R. Gilad-Bachrach, A. Navot, and N. Tishby. Query by committee made real. In NIPS, 2005.
[11] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In FOCS, 2006.
[12] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
[13] S. Hanneke. Teaching dimension and the complexity of active learning. In COLT, 2007.
[14] M. Kääriäinen. Active learning in the non-realizable case. In ALT, 2006.
[15] C. Monteleoni. Efficient algorithms for general active learning. In COLT. Open problem, 2006.
[16] C. Monteleoni. Learning with online constraints: shifting concepts and active learning. PhD Thesis, MIT Computer Science and Artificial Intelligence Laboratory, 2006.
[17] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264-280, 1971.

A Proof of Lemma 4

Let γ_n = √((4/n) ln(8(n² + n) S(H, 2n)/δ)) (which is at most β_n). With probability at least 1 − 2δ:

(Lemma 1) For all n ≥ 1 and all h ∈ H, we have −γ_n² − γ_n √(err_D(h)) ≤ err_D(h) − err_n(h) ≤ γ_n √(err_D(h)).

(Lemma 3) h* is consistent with Ŝ_n for all n ≥ 0.

Throughout, we make repeated use of the simple fact that A ≤ B + C√A implies A ≤ B + C² + C√B for non-negative A, B, C.

Now suppose h*(x_{n+1}) = −1 and Algorithm 1 requests the label y_{n+1}. We need to show that err_D(h₊₁) ≤ c₁ν + c₂β_n² for some positive constants c₁ and c₂. Since the algorithm requests a label, we have

  êrr_n(h₊₁) − êrr_n(h₋₁) ≤ β_n² + β_n ( √(êrr_n(h₊₁)) + √(êrr_n(h₋₁)) ).

We bound the LHS from below with êrr_n(h₊₁) − êrr_n(h₋₁) ≥ êrr_n(h₊₁) − êrr_n(h*) = err_n(h₊₁) − err_n(h*), and the RHS from above using êrr_n(h₊₁) ≤ err_n(h₊₁) and êrr_n(h₋₁) ≤ err_n(h*). Therefore,

  err_n(h₊₁) ≤ err_n(h*) + β_n² + β_n √(err_n(h₊₁)) + β_n √(err_n(h*)),

which, in turn, implies

  err_n(h₊₁) ≤ err_n(h*) + 2β_n² + β_n √(err_n(h*)) + β_n √( err_n(h*) + β_n² + β_n √(err_n(h*)) ).

Uniform convergence of errors allows the bounds err_n(h₊₁) ≥ err_D(h₊₁) − γ_n √(err_D(h₊₁)) and err_n(h*) ≤ ν + γ_n² + γ_n √ν; substituting these into the display above and simplifying with the same fact (and with γ_n ≤ β_n) yields err_D(h₊₁) ≤ 3ν + (12 + 2√3) β_n², as needed. ∎
B Recasting A² with reductions to supervised learning

Here we recast the A² algorithm [1] with reductions to supervised learning, making it more straightforwardly implementable. Our reduction uses the subroutine LEARN_H(S, T), a supervised learner that returns a hypothesis h ∈ H consistent with S and with minimal error on T (or fails if none exists). The original A² algorithm explicitly maintains version spaces and regions of uncertainty; these sets are implicitly maintained with our reduction:

1. The first argument S to LEARN_H forces the algorithm to only consider hypotheses in the current version space.
2. Upon receiving a labeled sample T from the current region of uncertainty, we use LEARN_H to determine which unlabeled data the algorithm is still uncertain of.

Reduction-based A²
  Input: X := {x₁, x₂, ..., x_m} i.i.d. from D_X.
  Initially, U₁ := X, S₁ := ∅, n₁ := 0.
  For phase i = 1, 2, ...:
    1. Let T_{i,n_i} be a random subset of U_i of size 2^{n_i}, and let T^O_{i,n_i} := {(x, O(x)) : x ∈ T_{i,n_i}}, requesting labels as needed (some may have already been requested in previous repeats of this phase). If (|U_i|/|X|) Δ_{i,n_i} ≤ ε, then return h_f := LEARN_H(S_i, T^O_{i,n_i}).
    2. Initialize temporary variables C′ := ∅, S′ := ∅.
    3. For each x ∈ U_i:
       (a) For each ŷ ∈ {±1}, let h_ŷ := LEARN_H(S_i ∪ {(x, ŷ)}, T^O_{i,n_i}).
       (b) If err(h_{−ŷ}, T^O_{i,n_i}) − err(h_ŷ, T^O_{i,n_i}) > Δ_{i,n_i} (or if no such h_{−ŷ} exists) for some ŷ ∈ {±1}, then C′ := C′ ∪ {x} and S′ := S′ ∪ {(x, ŷ)}.
    4. U_{i+1} := U_i \ C′, S_{i+1} := S_i ∪ S′. If |U_{i+1}|/|X| ≤ ε, then return h_f := LEARN_H(S_{i+1}, ∅).
    5. If |U_{i+1}| > (1/2)|U_i|, then set n_i := n_i + 1 and repeat phase i.
    6. Else set n_{i+1} := 0 and continue to phase i + 1.

Figure 4: A² recast with reductions to supervised learning. The setting of Δ_{i,n_i} is discussed in the analysis.

B.1 Algorithm

The algorithm, specified in Figure 4, uses an initial unlabeled sample X ⊂ X of size m = ~O(d/ε²) drawn i.i.d. from the input distribution D_X. This set X serves as a proxy for D_X, so that computing the mass of the current region of uncertainty can be performed simply by counting. Consequently, the algorithm will learn a hypothesis with error close to the minimal error achievable on X. However, this can easily be translated to errors with respect to the full distribution by standard uniform convergence arguments.

We can think of the initial sample X from D_X as just the x parts of a labeled sample Z = {(x₁, y₁), ..., (x_m, y_m)} from D; the labels y₁, ..., y_m remain hidden unless the algorithm requests them. Let O : X → {±1} be the mapping from the unlabeled data in X to their respective hidden labels. Also, for any unlabeled subset T of X, let T^O = {(x, O(x)) : x ∈ T} be the corresponding labeled subset of Z.

The algorithm proceeds in phases; the goal of each phase i is to cut the current region of uncertainty U_i in half. Toward this end, the algorithm labels a random subset of U_i and uses LEARN_H to check which points in U_i it is still uncertain about. The algorithm is uncertain about a point x ∈ U_i if it cannot infer h*(x), where h* = arg min_{h∈H} err(h, X^O).

B.2 Analysis

Let h* = arg min_{h∈H} err(h, X^O) and let H_i = {h ∈ H : h(x) = ŷ for all (x, ŷ) ∈ S_i}. We assume that with probability at least 1 − δ, the following holds for all phases i ≥ 1 (including repeats) and all h ∈ H_i:

  | ( err(h, T^O_{i,n_i}) − err(h*, T^O_{i,n_i}) ) − ( err(h, U_i^O) − err(h*, U_i^O) ) | ≤ Δ_{i,n_i}.

For example, standard generalization bounds can be used here for the Δ_{i,n_i}, with appropriate δ-sharing.

Lemma 6. With probability at least 1 − δ, for all i ≥ 1, we have h* ∈ H_i and err(h, U_i^O) ≥ err(h*, U_i^O) for all h ∈ H_i.

Proof. By induction. The base case is trivial. So assume the claim is true for i; we will show it is true for i + 1. Suppose for the sake of contradiction that some (x, ŷ) is added to S′ in step 3, but h*(x) ≠ ŷ. Then

  err(h*, U_i^O) − err(h_ŷ, U_i^O) ≥ err(h*, T^O_{i,n_i}) − err(h_ŷ, T^O_{i,n_i}) − Δ_{i,n_i}
                                    ≥ err(h_{−ŷ}, T^O_{i,n_i}) − err(h_ŷ, T^O_{i,n_i}) − Δ_{i,n_i}
                                    > Δ_{i,n_i} − Δ_{i,n_i} = 0,

so err(h*, U_i^O) > err(h_ŷ, U_i^O), a contradiction of the inductive hypothesis. Therefore, h*(x) = ŷ for all (x, ŷ) ∈ S′ (and thus for those ultimately added to S_{i+1}), so h* ∈ H_{i+1}.
The error of a hypothesis h ∈ H_{i+1} on X^O decomposes as follows:

  err(h, X^O) = (|X \ U_{i+1}| / |X|) err(h, (X \ U_{i+1})^O) + (|U_{i+1}| / |X|) err(h, U_{i+1}^O)
              = (|X \ U_{i+1}| / |X|) err(h*, (X \ U_{i+1})^O) + (|U_{i+1}| / |X|) err(h, U_{i+1}^O)
              = err(h*, X^O) + (|U_{i+1}| / |X|) ( err(h, U_{i+1}^O) − err(h*, U_{i+1}^O) )

(the second equality follows because hypotheses in H_{i+1} agree on X \ U_{i+1} = {x : (x, ŷ) ∈ S_{i+1}}). Since err(h, X^O) ≥ err(h*, X^O), we have err(h, U_{i+1}^O) ≥ err(h*, U_{i+1}^O) for all h ∈ H_{i+1}. ∎

Lemma 7. If Reduction-based A² returns a hypothesis h_f, then h_f has error err(h_f, X^O) ≤ err(h*, X^O) + ε.

Proof. We use the same decomposition of err(h, X^O) for h ∈ H_i as in the previous lemma. If h = arg min_{h′∈H_i} err(h′, T^O_{i,n_i}), then err(h, U_i^O) ≤ err(h*, U_i^O) + Δ_{i,n_i}, and thus err(h, X^O) ≤ err(h*, X^O) + (|U_i|/|X|) Δ_{i,n_i}. Also, any hypothesis h ∈ H_i has error at most err(h*, X^O) + |U_i|/|X|. The exit conditions ensure that these upper bounds are always at most err(h*, X^O) + ε. ∎

Both the fall-back label complexity guarantee as well as those for certain specific hypothesis classes and distributions (e.g. thresholds on the line, homogeneous linear separators under the uniform distribution) carry over from A² proper to Reduction-based A².
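As a companion to Figure 4, here is a compact sketch (ours; learn_h, oracle, and the choice of delta_i stand in for the components described above and are assumptions) of a single phase of Reduction-based A² in Python.

# Illustrative sketch of one phase of Reduction-based A^2 (Figure 4).
# learn_h(S, T) returns a hypothesis consistent with S and of minimal error on T, or None.
import random

def err(h, labeled):
    return sum(h(x) != y for x, y in labeled) / max(1, len(labeled))

def a2_phase(U, S, n_i, learn_h, oracle, delta_i):
    """Label a random subset of the uncertainty region U, then use the constrained
    learner to decide which points of U already have an inferable label (step 3)."""
    T = random.sample(U, min(len(U), 2 ** n_i))
    T_labeled = [(x, oracle(x)) for x in T]        # request labels as needed
    inferred, still_uncertain = [], []
    for x in U:
        h = {y: learn_h(S + [(x, y)], T_labeled) for y in (+1, -1)}
        label = None
        for y in (+1, -1):
            if h[y] is None:
                continue
            if h[-y] is None or err(h[-y], T_labeled) - err(h[y], T_labeled) > delta_i:
                label = y                          # the opposite label is ruled out
                break
        if label is None:
            still_uncertain.append(x)              # stays in U_{i+1}
        else:
            inferred.append((x, label))            # joins S_{i+1}
    return still_uncertain, S + inferred, T_labeled

The surrounding control flow of Figure 4 (the two exit conditions, the halving check on |U_{i+1}|, and the doubling of the labeled sample when a phase is repeated) would wrap calls to this routine.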
