
Technical Report CS2007-0898
Department of Computer Science and Engineering
University of California, San Diego

A general agnostic active learning algorithm

Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni¹

Abstract

We present a simple, agnostic active learning algorithm that works for any hypothesis class of bounded VC dimension, and any data distribution. Our algorithm extends a scheme of Cohn, Atlas, and Ladner [6] to the agnostic setting, by (1) reformulating it using a reduction to supervised learning and (2) showing how to apply generalization bounds even for the non-i.i.d. samples that result from selective sampling. We provide a general characterization of the label complexity of our algorithm. This quantity is never more than the usual PAC sample complexity of supervised learning, and is exponentially smaller for some hypothesis classes and distributions. We also demonstrate improvements experimentally.

1 Introduction

Active learning addresses the issue that, in many applications, labeled data typically comes at a higher cost (e.g. in time, effort) than unlabeled data. An active learner is given unlabeled data and must pay to view any label. The hope is that significantly fewer labeled examples are used than in the supervised (non-active) learning model. Active learning applies to a range of data-rich problems such as genomic sequence annotation and speech recognition. In this paper we formalize, extend, and provide label complexity guarantees for one of the earliest and simplest approaches to active learning, one due to Cohn, Atlas, and Ladner [6].

The scheme of [6] examines data one by one in a stream and requests the label of any data point about which it is currently unsure. For example, suppose the hypothesis class consists of linear separators in the plane, and assume that the data is linearly separable. Let the first six data points be labeled as in the figure. [Figure omitted: six labeled points in the plane, with an arrow marking a seventh point.] The learner does not need to request the label of the seventh point (indicated by the arrow) because it is not unsure about the label: any straight line with the positives and negatives on opposite sides places the seventh point with the positives. Put another way, the point is not in the region of uncertainty [6], the portion of the data space for which there is disagreement among hypotheses consistent with the present labeled data.

Although very elegant and intuitive, this approach to active learning faces two problems:

1. Explicitly maintaining the region of uncertainty can be computationally cumbersome.
2. Data is usually not perfectly separable.

Our main contribution is to address these problems. We provide a simple generalization of the selective sampling scheme of [6] that tolerates adversarial noise and never requests many more labels than a standard agnostic supervised learner would to learn a hypothesis with the same error.

¹ Email: {dasgupta,djhsu,cmontel}@cs.ucsd.edu

In the previous example, an agnostic active learner (one that does not assume a perfect separator exists) is actually still uncertain about the label of the seventh point, because all six of the previous labels could be inconsistent with the best separator. Therefore, it should still request the label. On the other hand, after enough points have been labeled, if an unlabeled point occurs at the position shown in the figure, chances are its label is not needed. [Figure omitted: a point well inside a large region of identically labeled points.]
To extend the notion of uncertainty to the agnostic setting, we divide the observed data points into two groups, Ŝ and T:

- Set Ŝ contains the data for which we did not request labels. We keep these points around and assign them the label we think they should have.
- Set T contains the data for which we explicitly requested labels.

We will manage things in such a way that the data in Ŝ are always consistent with the best separator in the class. Thus, somewhat counter-intuitively, the labels in Ŝ are completely reliable, whereas the labels in T could be inconsistent with the best separator. To decide whether we are uncertain about the label of a new point x, we reduce to supervised learning: we learn hypotheses h_{+1} and h_{-1} such that h_{+1} is consistent with all the labels in Ŝ ∪ {(x, +1)} and has minimal empirical error on T, while h_{-1} is consistent with all the labels in Ŝ ∪ {(x, -1)} and has minimal empirical error on T. If, say, the true error of the hypothesis h_{+1} is much larger than that of h_{-1}, we can safely infer that the best separator must also label x with -1 without requesting a label; if the error difference is only modest, we explicitly request a label. Standard generalization bounds for an i.i.d. sample let us perform this test by comparing empirical errors on Ŝ ∪ T.

The last claim may sound awfully suspicious, because Ŝ ∪ T is not i.i.d.! Indeed, this is in a sense the core sampling problem that has always plagued active learning: the labeled sample T might not be i.i.d. (due to the filtering of examples based on an adaptive criterion), while Ŝ only contains unlabeled examples (with made-up labels). Nevertheless, we prove that in our case, it is in fact correct to effectively pretend Ŝ ∪ T is an i.i.d. sample. A direct consequence is that the label complexity of our algorithm (the number of labels requested before achieving a desired error) is never much more than the usual sample complexity of supervised learning (and in some cases, is significantly less).

An important algorithmic detail is the specific choice of generalization bound we use in deciding whether to request a label or not. A small polynomial difference in generalization rates (between n^{-1/2} and n^{-1}, say) can get magnified into an exponential difference in label complexity, so it is crucial for us to use a good bound. We use a normalized bound that takes into account the empirical error (computed on Ŝ ∪ T, again not an i.i.d. sample) of the hypothesis in question.

Earlier work on agnostic active learning [1, 12] has been able to upper bound label complexity in terms of a parameter of the hypothesis class (and data distribution) called the disagreement coefficient. We give label complexity bounds for our method based on this same quantity, and we get a better dependence on it: linear, rather than quadratic.

To summarize, in this paper we present and analyze a simple agnostic active learning algorithm for general hypothesis classes of bounded VC dimension. It extends the selective sampling scheme of Cohn et al. [6] to the agnostic setting, using normalized generalization bounds, which we apply in a simple but subtle manner. For certain hypothesis classes and distributions, our analysis yields improved label complexity guarantees over the standard sample complexity of supervised learning. We also demonstrate such improvements experimentally.
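A minimal Python sketch may help make the per-point test concrete. It is not from the report: `learn_H` is an assumed constrained learner that returns a hypothesis consistent with its first argument and with minimal empirical error on its second (or `None` if no consistent hypothesis exists), and `delta` stands in for the threshold specified later in equation (1).

```python
def empirical_error(h, examples):
    """Fraction of labeled examples (x, y) that hypothesis h misclassifies."""
    if not examples:
        return 0.0
    return sum(1 for x, y in examples if h(x) != y) / len(examples)

def process_point(x, S_hat, T, learn_H, delta):
    """One step of agnostic selective sampling on a new unlabeled point x.

    S_hat: examples with inferred labels (hard constraints).
    T:     examples with requested labels (error to be minimized).
    Returns ('infer', y_hat) if the label can be inferred, else ('request', None).
    """
    h, err = {}, {}
    for y_hat in (+1, -1):
        h[y_hat] = learn_H(S_hat + [(x, y_hat)], T)            # constrained ERM
        err[y_hat] = (empirical_error(h[y_hat], S_hat + T)
                      if h[y_hat] is not None else float('inf'))
    for y_hat in (+1, -1):
        # If forcing the opposite label costs much more, infer y_hat for x.
        if err[-y_hat] - err[y_hat] > delta:
            return 'infer', y_hat
    return 'request', None
```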
1.1 Related work

A large number of algorithms have been proposed for active learning, under a variety of learning models. In this section, we consider only methods whose generalization behavior has been rigorously analyzed.

An early landmark result, from 1989, was the selective sampling scheme of Cohn, Atlas, and Ladner [6] described above. This simple active learning algorithm, designed for separable data, has been the inspiration for a lot of subsequent work. A few years later, the seminal work of Freund, Seung, Shamir, and Tishby [9] analyzed an algorithm called query-by-committee that operates in a Bayesian setting and uses an elegant sampling trick for deciding when to query points. The core primitive required by this algorithm is the ability to sample randomly from the posterior over the hypothesis space. In some cases this can be achieved efficiently [10], for instance when the hypothesis class consists of linear separators in R^d (with a uniform prior) and the data is distributed uniformly over the surface of the unit sphere in R^d. In this particular setting, the authors showed that the number of labels required to achieve generalization error ε is just O(d log 1/ε), exponentially lower than the usual supervised sample complexity of O(d/ε). Subsequently, Dasgupta, Kalai, and Monteleoni [8] showed that a simple variant of the perceptron algorithm also achieves this label complexity, even for a worst-case (non-Bayesian) choice of target hypothesis.

All the work mentioned so far assumes separable data. This case was studied abstractly by Dasgupta [7], who found that a parameter called the splitting index loosely characterizes the label complexity of actively learning hypothesis classes of bounded VC dimension. As yet, it is not known how to realize this label complexity in a computationally efficient way, except in special cases.

A natural way to formulate active learning in the agnostic setting is to ask the learner to return a hypothesis with error at most ν + ε (where ν is the error of the best hypothesis in the specified class) using as few labels as possible. A basic constraint on the label complexity was pointed out by Kääriäinen [14], who showed that for any ν ∈ (0, 1/2), there are data distributions that force any active learner that achieves error at most ν + ε to request Ω((ν/ε)²) labels.

The first rigorously-analyzed agnostic active learning algorithm, called A², was developed recently by Balcan, Beygelzimer, and Langford [1]. Like Cohn-Atlas-Ladner [6], this algorithm uses a region of uncertainty, although the lack of separability complicates matters and A² ends up explicitly maintaining an ε-net of the hypothesis space. Subsequently, Hanneke [12] characterized the label complexity of the A² algorithm in terms of a parameter called the disagreement coefficient. Another thread of work focuses on agnostic learning of thresholds for data that lie on a line; in this case, a precise characterization of label complexity can be given [4, 5].

These previous results either make strong distributional assumptions (such as separability, or a uniform input distribution) [2, 6-9, 13], or else they are computationally prohibitive in general [1, 7, 9]. Our work was inspired by both [6] and [1], and we have built heavily upon their insights.² We bound the label complexity of our method in terms of the same parameter as used for A² [12], and get a somewhat better dependence (linear rather than quadratic).

A common feature of Cohn-Atlas-Ladner, A², and our method is that they are all fairly non-aggressive in their choice of query points. They are content with querying all points on which there is even a small amount of uncertainty, rather than, for instance, pursuing the maximally uncertain point. Recently, Balcan, Broder, and Zhang [2] showed that for the hypothesis class of linear separators, under distributional assumptions on the data (for instance, a uniform distribution over the unit sphere), a more aggressive strategy can yield better label complexity.
² It has been noted that the Cohn-Atlas-Ladner scheme can easily be made tractable using a reduction to supervised learning in the separable case [16, p. 68]. Although our algorithm is most naturally seen as an extension of Cohn-Atlas-Ladner, a similar reduction to supervised learning (in the agnostic setting) can be used for A², as we demonstrate in Appendix B.

2 Preliminaries

2.1 Learning framework and uniform convergence

Let X be the input space, D a distribution over X × {±1}, and H a class of hypotheses h : X → {±1} with VC dimension vcdim(H) = d < ∞. Recall that the nth shattering coefficient S(H, n) is defined as the maximum number of ways in which H can label a set of n points; by Sauer's lemma, this is at most O(n^d) [3, p. 175]. We denote by D_X the marginal of D over X.

In our active learning model, the learner receives unlabeled data sampled from D_X; for any sampled point x, it can optionally request the label y sampled from the conditional distribution at x. This process can be viewed as sampling (x, y) from D and revealing only x to the learner, keeping the label y hidden unless the learner explicitly requests it. The error of a hypothesis h under D is err_D(h) = Pr_{(x,y)~D}[h(x) ≠ y], and on a finite sample Z ⊂ X × {±1}, the empirical error of h is

    err(h, Z) = (1/|Z|) Σ_{(x,y)∈Z} 1[h(x) ≠ y],

where 1[·] is the 0-1 indicator function. We assume for simplicity that the minimal error ν = inf{err_D(h) : h ∈ H} is achieved by a hypothesis h* ∈ H.

Our algorithm and analysis use the following normalized uniform convergence bound [3, p. 200].

Lemma 1 (Vapnik and Chervonenkis [17]). Let F be a family of measurable functions f : 𝒵 → {0, 1} over a space 𝒵. Denote by Ê_Z f the empirical average of f over a subset Z ⊆ 𝒵. Let α_n = √((4/n) ln(8 S(F, 2n)/δ)). If Z is an i.i.d. sample of size n from a fixed distribution over 𝒵, then, with probability at least 1 − δ, for all f ∈ F:

    −min( α_n √(Ê_Z f), α_n² + α_n √(E f) ) ≤ E f − Ê_Z f ≤ min( α_n² + α_n √(Ê_Z f), α_n √(E f) ).

2.2 Disagreement coefficient

The active learning algorithm we will shortly describe is not very aggressive: rather than seeking out points that are maximally informative, it queries every point that it is somewhat unsure about. The early work of Cohn-Atlas-Ladner [6] and the recent A² algorithm [1] are similarly mellow in their querying strategy. The label complexity improvements achievable by such algorithms are nicely captured by a parameter called the disagreement coefficient, introduced recently by Hanneke [12] in his analysis of A².

To motivate the disagreement coefficient, imagine that we are in the midst of learning, and that our current hypothesis h_t has error at most β. Suppose we even know the value of β. Then the only candidate hypotheses we still need to consider are those that differ from h_t on at most a 2β fraction of the input distribution, because all other hypotheses must have error more than β. To make this a bit more formal, we impose a (pseudo-)metric ρ on the space of hypotheses, as follows.

Definition 1. The disagreement pseudo-metric ρ on H is defined by ρ(h, h′) = Pr_{D_X}[h(x) ≠ h′(x)] for h, h′ ∈ H. Let B(h, r) = {h′ ∈ H : ρ(h, h′) ≤ r} be the ball centered around h of radius r.

Returning to our earlier scenario, we need only consider hypotheses in B(h_t, 2β), and thus, when we see a new data point x, there is no sense in asking for its label if all of B(h_t, 2β) agrees on what this label should be. The only points we potentially need to query are

    {x : h(x) ≠ h′(x) for some h, h′ ∈ B(h_t, 2β)}.

Intuitively, the disagreement coefficient captures how the measure of this set grows with β. The following is a slight variation of the original definition of Hanneke [12].

Algorithm 1
Input: stream (x_1, x_2, ..., x_m) i.i.d. from D_X.
Initially, Ŝ_0 := ∅ and T_0 := ∅.
For n = 1, 2, ..., m:
  1. For each ŷ ∈ {±1}, let h_ŷ := LEARN_H(Ŝ_{n-1} ∪ {(x_n, ŷ)}, T_{n-1}).
  2. If err(h_{-ŷ}, Ŝ_{n-1} ∪ T_{n-1}) − err(h_ŷ, Ŝ_{n-1} ∪ T_{n-1}) > Δ_{n-1} (or if no such h_{-ŷ} is found) for some ŷ ∈ {±1}, then Ŝ_n := Ŝ_{n-1} ∪ {(x_n, ŷ)} and T_n := T_{n-1}.
  3. Else, request y_n; Ŝ_n := Ŝ_{n-1} and T_n := T_{n-1} ∪ {(x_n, y_n)}.
Return h_f := LEARN_H(Ŝ_m, T_m).

Figure 1: The agnostic selective sampling algorithm. See (1) for a possible setting of Δ_n.

Definition 2. The disagreement coefficient θ = θ(D, H, ε) ≥ 0 is
    θ = sup { Pr_{D_X}[∃h ∈ B(h*, r) s.t. h(x) ≠ h*(x)] / r : r ≥ ν + ε },

where h* = arg inf_{h∈H} err_D(h) and ν = err_D(h*).

Clearly, θ ≤ 1/(ν + ε); furthermore, it is a constant bounded independently of 1/(ν + ε) in several cases previously considered in the literature [12]. For example, if H is homogeneous linear separators and D_X is the uniform distribution over the unit sphere in R^d, then θ = Θ(√d).

3 Agnostic selective sampling

Here we state and analyze our general algorithm for agnostic active learning. The main techniques employed by the algorithm are reductions to a supervised learning task and generalization bounds applied to differences of empirical errors.

3.1 A general algorithm for agnostic active learning

Figure 1 states our algorithm in full generality. The input is a stream of m unlabeled examples drawn i.i.d. from D_X; for the time being, m can be thought of as Õ((d/ε)(1 + ν/ε)), where ε is the accuracy parameter.³

The algorithm operates by reduction to a special kind of supervised learning that includes hard constraints. For A, B ⊂ X × {±1}, let LEARN_H(A, B) denote a supervised learner that returns a hypothesis h ∈ H consistent with A, and with minimum error on B. If there is no hypothesis consistent with A, it reports this. For some simple hypothesis classes, like intervals on the line or rectangles in R², it is easy to construct such a learner. For more complex classes like linear separators, the main bottleneck is the hardness of minimizing the 0-1 loss on B (that is, the hardness of agnostic supervised learning). If a convex upper bound on this loss function is used instead, as in the case of soft-margin support vector machines, it is straightforward to incorporate hard constraints; but at present the rigorous guarantees accompanying our algorithm apply only if 0-1 loss is used.

³ The Õ(·) notation suppresses log(1/δ) and terms polylogarithmic in those that appear.

Algorithm 1 maintains two sets of labeled examples, Ŝ and T, each of which is initially empty. Upon receiving x_n, it learns two⁴ hypotheses, h_ŷ = LEARN_H(Ŝ ∪ {(x_n, ŷ)}, T) for ŷ ∈ {±1}, and then compares their empirical errors on Ŝ ∪ T. If the difference is large enough, it is possible to infer how h* labels x_n (as we show in Lemma 3). In this case, the algorithm adds x_n, with this inferred label, to Ŝ. Otherwise, the algorithm requests the label y_n and adds (x_n, y_n) to T. Thus, Ŝ contains examples with inferred labels consistent with h*, and T contains examples with their requested labels. Because h* might err on some examples in T, we just insist that LEARN_H find a hypothesis with minimal error on T. Meanwhile, by construction, h* is consistent with Ŝ (as we shall see), so we require LEARN_H to only consider hypotheses consistent with Ŝ.

3.2 Bounds for error differences

We still need to specify Δ_n, the threshold value for error differences that determines whether the algorithm requests a label or not. Intuitively, Δ_n should reflect how closely empirical errors on a sample approximate true errors on the distribution D. Note that our algorithm is modular with respect to the choice of Δ_n, so, for example, it can be customized for a particular input distribution and hypothesis class. Below we provide a simple and adaptive setting that works for any distribution and hypothesis class with finite VC dimension.

The setting of Δ_n can only depend on observable quantities, so we first clarify the distinction between empirical errors on Ŝ_n ∪ T_n and those with respect to the true (hidden) labels.

Definition 3. Let Ŝ_n and T_n be as defined in Algorithm 1. Let S_n (shedding the hat accent) be the set of labeled examples identical to those in Ŝ_n, except with the true hidden labels swapped in. Thus, for example, S_n ∪ T_n is an i.i.d. sample from D of size n. Finally, let err_n(h) = err(h, S_n ∪ T_n) and êrr_n(h) = err(h, Ŝ_n ∪ T_n).

It is straightforward to apply Lemma 1 to empirical errors on S_n ∪ T_n, i.e. to err_n(h), but we cannot use such bounds algorithmically: we do not request the true labels for points in Ŝ_n and thus cannot reliably compute err_n(h). What we can compute are error differences êrr_n(h) − êrr_n(h′) for pairs of hypotheses (h, h′) that agree on (and thus make the same mistakes on) Ŝ_n, since for such pairs we have err_n(h) − err_n(h′) = êrr_n(h) − êrr_n(h′).⁵
These empirical error differences are means of {−1, 0, +1}-valued random variables. We need to rewrite them in terms of {0, 1}-valued random variables for some of the concentration bounds we will be using.

Definition 4. For a pair (h, h′) ∈ H × H, define g⁺_{h,h′}(x, y) = 1[h(x) ≠ y ∧ h′(x) = y] and g⁻_{h,h′}(x, y) = 1[h(x) = y ∧ h′(x) ≠ y].

With this notation, we have err(h, Z) − err(h′, Z) = Ê_Z[g⁺_{h,h′}] − Ê_Z[g⁻_{h,h′}] for any Z ⊂ X × {±1}. Now, applying Lemma 1 to G = {g⁺_{h,h′} : (h, h′) ∈ H × H} ∪ {g⁻_{h,h′} : (h, h′) ∈ H × H}, and noting that S(G, n) ≤ S(H, n)², gives the following lemma.

Lemma 2. Let α_n = √((4/n) ln(8 S(H, 2n)²/δ)). With probability at least 1 − δ over an i.i.d. sample Z of size n from D, we have for all (h, h′) ∈ H × H,

    err(h, Z) − err(h′, Z) ≤ err_D(h) − err_D(h′) + α_n² + α_n( √(Ê_Z[g⁺_{h,h′}]) + √(Ê_Z[g⁻_{h,h′}]) ).

With Z = S_n ∪ T_n, the error difference on the left-hand side is err_n(h) − err_n(h′), which can be empirically determined because it is equal to êrr_n(h) − êrr_n(h′). But the terms in the square root on the right-hand side still pose a problem, which we fix next.

⁴ If LEARN_H cannot find a hypothesis consistent with Ŝ ∪ {(x_n, ŷ)} for some ŷ, then assuming h* is consistent with Ŝ, it must be that h*(x_n) = −ŷ. In this case, we simply add (x_n, −ŷ) to Ŝ, regardless of the error difference.

⁵ This observation is enough to immediately justify the use of additive generalization bounds for Δ_n. However, we need to use normalized (multiplicative) bounds to achieve a better label complexity.

Corollary 1. Let β_n = √((4/n) ln(8(n² + n) S(H, 2n)²/δ)). Then, with probability at least 1 − δ, for all n ≥ 1 and all (h, h′) ∈ H × H consistent with Ŝ_n, we have

    êrr_n(h) − êrr_n(h′) ≤ err_D(h) − err_D(h′) + β_n² + β_n( √(êrr_n(h)) + √(êrr_n(h′)) ).

Proof. For each n ≥ 1, we apply Lemma 2 using Z = S_n ∪ T_n and δ/(n² + n). Then, we apply a union bound over all n ≥ 1. Thus, with probability at least 1 − δ, the bounds in Lemma 2 hold simultaneously for all n ≥ 1 and all (h, h′) ∈ H² with S_n ∪ T_n in place of Z. The corollary follows because err_n(h) − err_n(h′) = êrr_n(h) − êrr_n(h′), and because Ê_{S_n∪T_n}[g⁺_{h,h′}] ≤ êrr_n(h) and Ê_{S_n∪T_n}[g⁻_{h,h′}] ≤ êrr_n(h′). To see the first of these expectation bounds, witness that because h and h′ agree on S_n,

    Ê_{S_n∪T_n}[g⁺_{h,h′}] = (1/n) Σ_{(x,y)∈T_n} 1[h(x) ≠ y ∧ h′(x) = y] ≤ (1/n) Σ_{(x,y)∈T_n} 1[h(x) ≠ y] = êrr_n(h).

The second bound is similar.

Corollary 1 implies that we can effectively apply the normalized uniform convergence bounds from Lemma 1 to empirical error differences on Ŝ_n ∪ T_n, even though Ŝ_n ∪ T_n is not an i.i.d. sample from D. In light of this, we use the following setting of Δ_n:

    Δ_n := β_n² + β_n( √(êrr_n(h_{+1})) + √(êrr_n(h_{-1})) )    (1)

where β_n = √((4/n) ln(8(n² + n) S(H, 2n)²/δ)) = Õ(√(d log n / n)) as per Corollary 1.

3.3 Correctness and fall-back analysis

We now justify our setting of Δ_n with a correctness proof and fall-back guarantee. The following lemma elucidates how the inferred labels in Ŝ serve as a mechanism for implicitly maintaining a candidate set of hypotheses that always includes h*. The fall-back guarantee then follows almost immediately.

Lemma 3. With probability at least 1 − δ, the hypothesis h* = arg inf_{h∈H} err_D(h) is consistent with Ŝ_n for all n ≥ 0 in Algorithm 1.

Proof. Apply the bounds in Corollary 1 (they hold with probability at least 1 − δ) and proceed by induction on n. The base case is trivial since Ŝ_0 = ∅. Now assume h* is consistent with Ŝ_n. Suppose, upon receiving x_{n+1}, we discover êrr_n(h_{+1}) − êrr_n(h_{-1}) > Δ_n. We will show that h*(x_{n+1}) = -1 (assume both h_{+1} and h_{-1} exist, since it is clear h*(x_{n+1}) = -1 if h_{+1} does not exist). Suppose for the sake of contradiction that h*(x_{n+1}) = +1. We know that êrr_n(h*) ≥ êrr_n(h_{+1}) (by the inductive hypothesis h* is consistent with Ŝ_n, and yet the learner chose h_{+1} in preference to it) and êrr_n(h_{+1}) − êrr_n(h_{-1}) > β_n² + β_n(√(êrr_n(h_{+1})) + √(êrr_n(h_{-1}))). In particular, êrr_n(h_{+1}) > β_n². Therefore,

    êrr_n(h*) − êrr_n(h_{-1})
      = (êrr_n(h*) − êrr_n(h_{+1})) + (êrr_n(h_{+1}) − êrr_n(h_{-1}))
      > √(êrr_n(h_{+1})) (√(êrr_n(h*)) − √(êrr_n(h_{+1}))) + β_n² + β_n(√(êrr_n(h_{+1})) + √(êrr_n(h_{-1})))
      > β_n (√(êrr_n(h*)) − √(êrr_n(h_{+1}))) + β_n² + β_n(√(êrr_n(h_{+1})) + √(êrr_n(h_{-1})))
      = β_n² + β_n(√(êrr_n(h*)) + √(êrr_n(h_{-1}))).

Now Corollary 1 implies that err_D(h*) > err_D(h_{-1}), a contradiction.
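As a concrete reading of equation (1), the following sketch (not part of the report) computes the adaptive threshold, replacing the exact shattering coefficient with the Sauer's lemma bound S(H, 2n) ≤ (2n + 1)^d; the VC dimension `d`, confidence `delta`, and the two empirical errors are supplied by the caller.

```python
import math

def beta_n(n, d, delta):
    """beta_n = sqrt((4/n) ln(8 (n^2 + n) S(H, 2n)^2 / delta)),
    with S(H, 2n) upper-bounded by (2n + 1)^d via Sauer's lemma."""
    log_shatter = d * math.log(2 * n + 1)          # ln S(H, 2n) <= d ln(2n + 1)
    return math.sqrt((4.0 / n) * (math.log(8.0 * (n * n + n) / delta)
                                  + 2.0 * log_shatter))

def delta_n(n, d, delta, err_plus, err_minus):
    """Threshold (1): Delta_n = beta_n^2 + beta_n (sqrt(err_plus) + sqrt(err_minus)),
    where err_plus / err_minus are empirical errors of h_{+1} / h_{-1} on S_hat ∪ T."""
    b = beta_n(n, d, delta)
    return b * b + b * (math.sqrt(err_plus) + math.sqrt(err_minus))
```

With this threshold in hand, step 2 of Algorithm 1 infers a label whenever one of the two empirical error differences exceeds `delta_n(...)`, and requests the label otherwise.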
Theorem 1. Let ν = inf_{h∈H} err_D(h) and d = vcdim(H). There exists a constant c > 0 such that the following holds. If Algorithm 1 is given a stream of m unlabeled examples, then with probability at least 1 − δ, the algorithm returns a hypothesis with error at most

    ν + c( (1/m)(d log m + log(1/δ)) + √( (ν/m)(d log m + log(1/δ)) ) ).

Proof. Lemma 3 implies that h* is consistent with Ŝ_m with probability at least 1 − δ. Using the same bounds from Corollary 1 (already applied in Lemma 3) on h* and h_f, together with the fact êrr_m(h_f) ≤ êrr_m(h*), we have err_D(h_f) ≤ ν + β_m² + β_m√ν + β_m√(err_D(h_f)), which in turn implies err_D(h_f) ≤ ν + 3β_m² + 2β_m√ν.

So, Algorithm 1 returns a hypothesis with error at most ν + ε when m = Õ((d/ε)(1 + ν/ε)); this is (asymptotically) the usual sample complexity of supervised learning. Since the algorithm requests at most m labels, its label complexity is always at most Õ((d/ε)(1 + ν/ε)).

3.4 Label complexity analysis

We can also bound the label complexity of our algorithm in terms of the disagreement coefficient θ. This yields tighter bounds when θ is bounded independently of 1/(ν + ε). The key to deriving our label complexity bounds based on θ is noting that the probability of requesting the (n+1)st label is intimately related to θ and ν.

Lemma 4. There exist constants c₁, c₂ > 0 such that, with probability at least 1 − 2δ, for all n ≥ 1, the following holds. Let h*(x_{n+1}) = ŷ, where h* = arg inf_{h∈H} err_D(h). Then, the probability that Algorithm 1 requests the label y_{n+1} is

    Pr_{x_{n+1}~D_X}[Request y_{n+1}] ≤ Pr_{x_{n+1}~D_X}[err_D(h_{-ŷ}) ≤ c₁ν + c₂β_n²],

where β_n is as defined in Corollary 1 and ν = inf_{h∈H} err_D(h).

Proof. See Appendix A.

Lemma 5. In the same setting as Lemma 4, there exists a constant c > 0 such that Pr_{x_{n+1}~D_X}[Request y_{n+1}] ≤ cθ(ν + β_n²), where θ = θ(D, H, 3β_m² + 2β_m√ν) is the disagreement coefficient, ν = inf_{h∈H} err_D(h), and β_n is as defined in Corollary 1.

Proof. Suppose h*(x_{n+1}) = -1. By the triangle inequality, we have that err_D(h_{+1}) ≥ ρ(h_{+1}, h*) − ν, where ρ is the disagreement metric on H (Definition 1). By Lemma 4, this implies that the probability of requesting y_{n+1} is at most the probability that ρ(h_{+1}, h*) ≤ (1 + c₁)ν + c₂β_n² for some constants c₁, c₂ > 0. We can choose the constants so that (1 + c₁)ν + c₂β_n² ≥ ν + 3β_m² + 2β_m√ν. Then, the definition of the disagreement coefficient gives the conclusion that Pr_{x_{n+1}~D_X}[ρ(h_{+1}, h*) ≤ (1 + c₁)ν + c₂β_n²] ≤ θ·((1 + c₁)ν + c₂β_n²).

Now we give our main label complexity bound for agnostic active learning.

Theorem 2. Let m be the number of unlabeled data given to Algorithm 1, d = vcdim(H), ν = inf_{h∈H} err_D(h), β_m as defined in Corollary 1, and θ = θ(D, H, 3β_m² + 2β_m√ν). There exists a constant c₁ > 0 such that for any c₂ ≥ 1, with probability at least 1 − 2δ:

1. If ν ≤ (c₂ − 1)β_m², Algorithm 1 returns a hypothesis with error as bounded in Theorem 1, and the expected number of labels requested is at most 1 + c₁c₂θ(d log² m + log(1/δ) log m).

2. Else, the same holds, except the expected number of labels requested is at most 1 + c₁c₂θ·νm·(d log² m + log(1/δ) log m).

Furthermore, if L is the expected number of labels requested as per above, then with probability at least 1 − δ′, the algorithm requests no more than L + √(3L log(1/δ′)) labels.

Proof. Follows from Lemma 5 and a Chernoff bound for the Poisson trials 1[Request y_n].
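For intuition, here is a standard illustration not worked through in the report; it is a sketch under the assumption that D_X is a continuous distribution on the line. Take H to be thresholds, h_t(x) = sign(x − t), with best hypothesis h* = h_{t*}. Any h_t ∈ B(h*, r) disagrees with h* only on the set of points whose D_X-mass-distance from t* is at most r, on one side or the other, so

    Pr_{D_X}[∃h ∈ B(h*, r) s.t. h(x) ≠ h*(x)] ≤ 2r,   and hence   θ(D, H, ε) ≤ 2.

Plugging a constant θ into case 1 of Theorem 2 (the regime ν ≤ (c₂ − 1)β_m², with d = 1 here) gives an expected label count of O(log² m + log(1/δ) log m), polylogarithmic in the stream length m, whereas a passive learner would use all m = Õ(d/ε) labels; this is the exponential gap claimed in the abstract.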
With the substitution ε = 3β_m² + 2β_m√ν as per Theorem 1, Theorem 2 entails that for any hypothesis class and data distribution for which the disagreement coefficient θ = θ(D, H, ε) is bounded independently of 1/(ν + ε) (see [12] for some examples), Algorithm 1 only needs Õ(θ d log²(1/ε)) labels to achieve error ε, and Õ(θ d (log²(1/ε) + (ν/ε)²)) labels to achieve error ν + ε. The latter matches the dependence on ν/ε in the Ω((ν/ε)²) lower bound [14]. The linear dependence on θ improves on the quadratic dependence shown for A² [12].⁶ For an illustrative consequence of this, suppose D_X is the uniform distribution on the sphere in R^d and H is homogeneous linear separators; in this case, θ = Θ(√d). Then the label complexity of A² depends at least quadratically on the dimension, whereas the corresponding dependence for our algorithm is d^{3/2}. A specially-designed setting of Δ_n (say, specific to the input distribution and hypothesis class) may be able to further reduce the dependence to d (see [2]).

4 Experiments

We implemented Algorithm 1 in a few simple cases to experimentally demonstrate the label complexity improvements. In each case, the data distribution D_X was uniform over [0, 1]; the stream length was m = 10000, and each experiment was repeated 20 times with different random seeds.

Our first experiment studied linear thresholds on the line. The target hypothesis was fixed to be h*(x) = sign(x − 0.5). For this hypothesis class, we used two different noise models, each of which ensured inf_{h∈H} err_D(h) = err_D(h*) = ν for a pre-specified ν ∈ [0, 1]. The first model was random misclassification: for each point x ~ D_X, we independently labeled it h*(x) with probability 1 − ν and −h*(x) with probability ν. In the second model (also used in [4]), for each point x ~ D_X, we independently labeled it +1 with probability (x − 0.5)/(4ν) + 0.5 and −1 otherwise, thus concentrating the noise near the boundary. Our second experiment studied intervals on the line. Here, we only used random misclassification, but we varied the target interval length p₊ = Pr_{D_X}[h*(x) = +1].

The results show that the number of labels requested by Algorithm 1 was exponentially smaller than the total number of data seen (m) under the first noise model, and was polynomially smaller under the second noise model (see Figure 2; we verified the polynomial vs. exponential distinction on separate log-log scale plots). In the case of intervals, we observe an initial phase (of duration roughly 1/p₊) in which every label is requested, followed by a more efficient phase, confirming the known active-learnability of this class [7, 12]. These improvements show that our algorithm needed significantly fewer labels to achieve the same error as a standard supervised algorithm that uses labels for all points seen.

As a sanity check, we examined the locations of data for which Algorithm 1 requested a label. We looked at two particular runs of the algorithm: the first was with H = intervals, p₊ = 0.2, m = 10000, and ν = 0.1; the second was with H = boxes (d = 2), p₊ = 0.49, m = 1000, and ν = 0.01. In each case, the data distribution was uniform over [0, 1]^d, and the noise model was random misclassification. Figure 3 shows that, early on, labels were requested everywhere. But as the algorithm progressed, label requests concentrated near the boundary of the target hypothesis.
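The sketch below shows how such a run can be simulated for the thresholds-on-the-line case with random misclassification noise. It is an illustration rather than the authors' implementation: it reuses the `empirical_error` and `delta_n` helpers sketched earlier, uses a smaller default stream length than the m = 10000 of the experiments to keep the naive O(n) learner fast, and the parameter names are ours.

```python
import random

def make_threshold(t):
    """Hypothesis h_t(x) = +1 if x >= t, else -1."""
    return lambda x, t=t: +1 if x >= t else -1

def learn_thresholds(S_hat, T):
    """Constrained ERM for H = thresholds on [0, 1]: a hypothesis consistent with
    S_hat (hard constraints) and with minimal error on T, or None if none is consistent."""
    xs = sorted({x for x, _ in S_hat} | {x for x, _ in T})
    candidates = [0.0] + [x + 1e-9 for x in xs] + [1.0 + 1e-9]
    best, best_err = None, float('inf')
    for t in candidates:
        h = make_threshold(t)
        if any(h(x) != y for x, y in S_hat):
            continue                                  # violates a hard constraint
        e = sum(1 for x, y in T if h(x) != y)
        if e < best_err:
            best, best_err = h, e
    return best

def run_once(m=2000, nu=0.1, delta=0.05, seed=0):
    """Algorithm 1 for thresholds, target h*(x) = sign(x - 0.5), uniform D_X on [0, 1],
    random misclassification with rate nu.  Returns the number of labels requested."""
    rng = random.Random(seed)
    target = make_threshold(0.5)
    S_hat, T, requests = [], [], 0
    for n in range(1, m + 1):
        x = rng.random()
        y = target(x) if rng.random() > nu else -target(x)   # noisy hidden label
        h = {s: learn_thresholds(S_hat + [(x, s)], T) for s in (+1, -1)}
        err = {s: empirical_error(h[s], S_hat + T) if h[s] else float('inf')
               for s in (+1, -1)}
        thr = delta_n(n, d=1, delta=delta,
                      err_plus=min(err[+1], 1.0), err_minus=min(err[-1], 1.0))
        inferred = next((s for s in (+1, -1) if err[-s] - err[s] > thr), None)
        if inferred is not None:
            S_hat.append((x, inferred))               # label inferred, no query
        else:
            T.append((x, y))                          # query the label
            requests += 1
    return requests
```

Plotting `requests` against `n` for several noise rates reproduces the qualitative behavior of Figure 2(a): the request curve flattens out under random misclassification, while a passive learner's curve would grow linearly.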
⁶ It may be possible to reduce A²'s quadratic dependence to a linear dependence by using normalized bounds, as we do here.

[Figure 2 omitted: two labeling-rate plots.] Figure 2: Labeling rate plots. The plots show the number of labels requested (vertical axis) versus the total number of points seen (labeled + unlabeled, horizontal axis) using Algorithm 1. (a) H = thresholds: under random misclassification noise with ν = 0 (solid), 0.1 (dashed), 0.2 (dot-dashed); under the boundary noise model with ν = 0.1 (lower dotted), 0.2 (upper dotted). (b) H = intervals: under random misclassification with (p₊, ν) = (0.2, 0.0) (solid), (0.1, 0.0) (dashed), (0.2, 0.1) (dot-dashed), (0.1, 0.1) (dotted).

[Figure 3 omitted: histograms and a scatter plot of label-request locations.] Figure 3: Locations of label requests. (a) H = intervals, h* = [0.4, 0.6]. The top histogram shows the locations of the first 400 label requests (the x-axis is the unit interval); the bottom histogram is for all (2141) label requests. (b) H = boxes, h* = [0.15, 0.85]². The first 200 requests, the next 200, and the final 109 are shown with three different markers.

5 Conclusion and future work

We have presented a simple and natural approach to agnostic active learning. Our extension of the selective sampling scheme of Cohn-Atlas-Ladner [6]:

1. simplifies the maintenance of the region of uncertainty with a reduction to supervised learning, and
2. guards against noise with a suitable algorithmic application of generalization bounds.

Our algorithm relies on a threshold parameter Δ_n for comparing empirical errors. We prescribe a very simple and natural choice for Δ_n (a normalized generalization bound from supervised learning), but one could hope for a more clever or aggressive choice, akin to those in [2] for linear separators.

Finding consistent hypotheses when data is separable is often a simple task. In such cases, reduction-based active learning algorithms can be relatively efficient (answering some questions posed in [15]). On the other hand, agnostic supervised learning is computationally intractable for many hypothesis classes (e.g. [11]), and of course, agnostic active learning is at least as hard in the worst case. Our reduction to supervised learning is benign in the sense that the learning problems we need to solve are over samples from the original distribution, so we do not create pathologically hard instances (like those arising from hardness reductions [11]) unless they are inherent in the data. Nevertheless, an important research direction is to develop consistent active learning algorithms that only require solving tractable (e.g. convex) optimization problems. A similar reduction-based scheme may be possible.

6 Acknowledgements

We are grateful to the Engineering Institute (a research and educational partnership between Los Alamos National Laboratory and U.C. San Diego) for supporting the second author with a graduate fellowship, and to the NSF for support under grants IIS-0347646 and IIS-0713540.

References

[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.
[2] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.
[3] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture Notes in Artificial Intelligence, 3176:169-207, 2004.
[4] R. Castro and R. Nowak. Upper and lower bounds for active learning. In Allerton Conference on Communication, Control and Computing, 2006.
[5] R. Castro and R. Nowak. Minimax bounds for active learning. In COLT, 2007.
[6] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
[7] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.
[8] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In COLT, 2005.
[9] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2):133-168, 1997.
[10] R. Gilad-Bachrach, A. Navot, and N. Tishby. Query by committee made real. In NIPS, 2005.
[11] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In FOCS, 2006.
[12] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
[13] S. Hanneke. Teaching dimension and the complexity of active learning. In COLT, 2007.
[14] M. Kääriäinen. Active learning in the non-realizable case. In ALT, 2006.
[15] C. Monteleoni. Efficient algorithms for general active learning. In COLT. Open problem, 2006.
[16] C. Monteleoni. Learning with online constraints: shifting concepts and active learning. PhD Thesis, MIT Computer Science and Artificial Intelligence Laboratory, 2006.
[17] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264-280, 1971.

A Proof of Lemma 4

Let γ_n = √((4/n) ln(8(n² + n) S(H, 2n)/δ)) (which is at most β_n). With probability at least 1 − 2δ:

- (Lemma 1) For all n ≥ 1 and all h ∈ H, we have −γ_n² − γ_n√(err_D(h)) ≤ err_D(h) − err_n(h) ≤ γ_n√(err_D(h)).
- (Lemma 3) h* is consistent with Ŝ_n for all n ≥ 0.

Throughout, we make repeated use of the simple fact that A ≤ B + C√A implies A ≤ B + C² + C√B for non-negative A, B, C.

Now suppose h*(x_{n+1}) = -1 and Algorithm 1 requests the label y_{n+1}. We need to show that err_D(h_{+1}) ≤ c₁ν + c₂β_n² for some positive constants c₁ and c₂. Since the algorithm requests a label, we have

    êrr_n(h_{+1}) − êrr_n(h_{-1}) ≤ β_n² + β_n( √(êrr_n(h_{+1})) + √(êrr_n(h_{-1})) ).

We bound the LHS from below with êrr_n(h_{+1}) − êrr_n(h_{-1}) ≥ êrr_n(h_{+1}) − êrr_n(h*) = err_n(h_{+1}) − err_n(h*), and the RHS from above using êrr_n(h_{+1}) ≤ err_n(h_{+1}) and êrr_n(h_{-1}) ≤ err_n(h*). Therefore,

    err_n(h_{+1}) ≤ err_n(h*) + β_n² + β_n√(err_n(h_{+1})) + β_n√(err_n(h*)),

which, in turn, implies

    err_n(h_{+1}) ≤ err_n(h*) + 2β_n² + β_n√(err_n(h*)) + β_n√( err_n(h*) + β_n² + β_n√(err_n(h*)) ).

Uniform convergence of errors allows the bounds err_n(h_{+1}) ≥ err_D(h_{+1}) − γ_n√(err_D(h_{+1})) and err_n(h*) ≤ ν + γ_n² + γ_n√ν, so it follows that

    err_D(h_{+1}) ≤ γ_n√(err_D(h_{+1})) + ν + 3β_n² + β_n√ν + ⋯ ≤ γ_n√(err_D(h_{+1})) + ν + (4 + √3)β_n² + 3β_n√ν.

This implies err_D(h_{+1}) ≤ 3ν + (12 + 2√3)β_n², as needed.
B Recasting A² with reductions to supervised learning

Here we recast the A² algorithm [1] with reductions to supervised learning, making it more straightforwardly implementable. Our reduction uses the subroutine LEARN_H(S, T), a supervised learner that returns a hypothesis h ∈ H consistent with S and with minimal error on T (or fails if none exist). The original A² algorithm explicitly maintains version spaces and regions of uncertainty; these sets are implicitly maintained with our reduction:

1. The first argument S to LEARN_H forces the algorithm to only consider hypotheses in the current version space.
2. Upon receiving a labeled sample T from the current region of uncertainty, we use LEARN_H to determine which unlabeled data the algorithm is still uncertain of.

Reduction-based A²
Input: X := {x_1, x_2, ..., x_m} i.i.d. from D_X.
Initially, U_1 := X, S_1 := ∅, n_1 := 0.
For phase i = 1, 2, ...:
  1. Let T_{i,n_i} be a random subset of U_i of size 2^{n_i}, and let T^O_{i,n_i} := {(x, O(x)) : x ∈ T_{i,n_i}}, requesting labels as needed (some may have already been requested in previous repeats of this phase). If (|U_i|/|X|)·Δ_{i,n_i} ≤ ε, then return h_f := LEARN_H(S_i, T^O_{i,n_i}).
  2. Initialize temporary variables C′ := ∅, S′ := ∅.
  3. For each x ∈ U_i:
     (a) For each ŷ ∈ {±1}, let h_ŷ := LEARN_H(S_i ∪ {(x, ŷ)}, T^O_{i,n_i}).
     (b) If err(h_{-ŷ}, T^O_{i,n_i}) − err(h_ŷ, T^O_{i,n_i}) > Δ_{i,n_i} (or if no such h_{-ŷ} exists) for some ŷ ∈ {±1}, then C′ := C′ ∪ {x} and S′ := S′ ∪ {(x, ŷ)}.
  4. U_{i+1} := U_i \ C′, S_{i+1} := S_i ∪ S′. If |U_{i+1}|/|X| ≤ ε, then return h_f := LEARN_H(S_{i+1}, ∅).
  5. If |U_{i+1}| > (1/2)|U_i|, then set n_i := n_i + 1 and repeat phase i.
  6. Else set n_{i+1} := 0 and continue to phase i + 1.

Figure 4: A² recast with reductions to supervised learning. The setting of Δ_{i,n_i} is discussed in the analysis.

B.1 Algorithm

The algorithm, specified in Figure 4, uses an initial unlabeled sample X ⊂ 𝒳 of size m = Õ(d/ε²) drawn i.i.d. from the input distribution D_X. This set X serves as a proxy for D_X so that computing the mass of the current region of uncertainty can be performed simply by counting. Consequently, the algorithm will learn a hypothesis with error close to the minimal error achievable on X. However, this can easily be translated to errors with respect to the full distribution by standard uniform convergence arguments.

We can think of the initial sample X from D_X as just the x parts of a labeled sample Z = {(x_1, y_1), ..., (x_m, y_m)} from D; the labels y_1, ..., y_m remain hidden unless the algorithm requests them. Let O : X → {±1} be the mapping from the unlabeled data in X to their respective hidden labels. Also, for any unlabeled subset T of X, let T^O = {(x, O(x)) : x ∈ T} be the corresponding labeled subset of Z.

The algorithm proceeds in phases; the goal of each phase i is to cut the current region of uncertainty U_i in half. Toward this end, the algorithm labels a random subset of U_i and uses LEARN_H to check which points in U_i it is still uncertain about. The algorithm is uncertain about a point x ∈ U_i if it cannot infer h*(x), where h* = arg min_{h∈H} err(h, X^O).

B.2 Analysis

Let h* = arg min_{h∈H} err(h, X^O) and let H_i = {h ∈ H : h(x) = ŷ for all (x, ŷ) ∈ S_i}. We assume that with probability at least 1 − δ, the following holds for all phases i ≥ 1 (including repeats): for all h ∈ H_i,

    | (err(h, T^O_{i,n_i}) − err(h*, T^O_{i,n_i})) − (err(h, U^O_i) − err(h*, U^O_i)) | ≤ Δ_{i,n_i}.

For example, standard generalization bounds can be used here for the Δ_{i,n_i}, with appropriate δ-sharing.

Lemma 6. With probability at least 1 − δ, for all i ≥ 1, we have h* ∈ H_i and err(h, U^O_i) ≥ err(h*, U^O_i) for all h ∈ H_i.

Proof. By induction. The base case is trivial. So assume it is true for i; we show it is true for i + 1. Suppose for the sake of contradiction that some (x, ŷ) is added to S′ in step 3, but h*(x) ≠ ŷ. Then

    err(h*, U^O_i) − err(h_ŷ, U^O_i) ≥ err(h*, T^O_{i,n_i}) − err(h_ŷ, T^O_{i,n_i}) − Δ_{i,n_i} ≥ err(h_{-ŷ}, T^O_{i,n_i}) − err(h_ŷ, T^O_{i,n_i}) − Δ_{i,n_i} > Δ_{i,n_i} − Δ_{i,n_i} = 0,

so err(h*, U^O_i) > err(h_ŷ, U^O_i), a contradiction of the inductive hypothesis. Therefore, h*(x) = ŷ for all (x, ŷ) ∈ S′ (and thus those ultimately added to S_{i+1}), so h* ∈ H_{i+1}. The error of a hypothesis h ∈ H_{i+1} on X^O decomposes as follows:
    err(h, X^O) = (|X \ U_{i+1}|/|X|) err(h, (X \ U_{i+1})^O) + (|U_{i+1}|/|X|) err(h, U^O_{i+1})
                = (|X \ U_{i+1}|/|X|) err(h*, (X \ U_{i+1})^O) + (|U_{i+1}|/|X|) err(h*, U^O_{i+1}) + (|U_{i+1}|/|X|)(err(h, U^O_{i+1}) − err(h*, U^O_{i+1}))
                = err(h*, X^O) + (|U_{i+1}|/|X|)(err(h, U^O_{i+1}) − err(h*, U^O_{i+1}))

(the second equality follows because hypotheses in H_{i+1} agree on X \ U_{i+1} = {x : (x, y) ∈ S_{i+1}}). Since err(h, X^O) ≥ err(h*, X^O), we have err(h, U^O_{i+1}) ≥ err(h*, U^O_{i+1}) for all h ∈ H_{i+1}.

Lemma 7. If Reduction-based A² returns a hypothesis h_f, then h_f has error err(h_f, X^O) ≤ err(h*, X^O) + ε.

Proof. We use the same decomposition of err(h, X^O) for h ∈ H_i as in the previous lemma. If h ∈ arg min_{h′∈H_i} err(h′, T^O_{i,n_i}), then err(h, U^O_i) ≤ err(h*, U^O_i) + Δ_{i,n_i} and thus err(h, X^O) ≤ err(h*, X^O) + (|U_i|/|X|)Δ_{i,n_i}. Also, any hypothesis h ∈ H_i has error at most err(h*, X^O) + |U_i|/|X|. The exit conditions ensure that these upper bounds are always at most err(h*, X^O) + ε.

Both the fall-back label complexity guarantee as well as those for certain specific hypothesis classes and distributions (e.g. thresholds on the line, homogeneous linear separators under the uniform distribution) carry over from A² proper to Reduction-based A².