Technical Report CS2007-0898
Department of Computer Science and Engineering
University of California, San Diego

A general agnostic active learning algorithm

Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni¹

Abstract

We present a simple, agnostic active learning algorithm that works for any hypothesis class of bounded VC dimension, and any data distribution. Our algorithm extends a scheme of Cohn, Atlas, and Ladner [6] to the agnostic setting, by (1) reformulating it using a reduction to supervised learning and (2) showing how to apply generalization bounds even for the non-i.i.d. samples that result from selective sampling. We provide a general characterization of the label complexity of our algorithm. This quantity is never more than the usual PAC sample complexity of supervised learning, and is exponentially smaller for some hypothesis classes and distributions. We also demonstrate improvements experimentally.

1 Introduction

Active learning addresses the issue that, in many applications, labeled data typically comes at a higher cost (e.g. in time, effort) than unlabeled data. An active learner is given unlabeled data and must pay to view any label. The hope is that significantly fewer labeled examples are used than in the supervised (non-active) learning model. Active learning applies to a range of data-rich problems such as genomic sequence annotation and speech recognition. In this paper we formalize, extend, and provide label complexity guarantees for one of the earliest and simplest approaches to active learning, one due to Cohn, Atlas, and Ladner [6].

The scheme of [6] examines data one by one in a stream and requests the label of any data point about which it is currently unsure. For example, suppose the hypothesis class consists of linear separators in the plane, and assume that the data is linearly separable. Let the first six data points be labeled as follows.

[Figure: six labeled points in the plane, with a seventh point indicated by an arrow.]

The learner does not need to request the label of the seventh point (indicated by the arrow) because it is not unsure about the label: any straight line with the +'s and −'s on opposite sides has the seventh point on the same side as the +'s. Put another way, the point is not in the region of uncertainty [6], the portion of the data space for which there is disagreement among hypotheses consistent with the present labeled data.

Although very elegant and intuitive, this approach to active learning faces two problems:

1. Explicitly maintaining the region of uncertainty can be computationally cumbersome.
2. Data is usually not perfectly separable.

Our main contribution is to address these problems. We provide a simple generalization of the selective sampling scheme of [6] that tolerates adversarial noise and never requests many more labels than a standard agnostic supervised learner would to learn a hypothesis with the same error.

In the previous example, an agnostic active learner (one that does not assume a perfect separator exists) is actually still uncertain about the label of the seventh point, because all six of the previous labels could be inconsistent with the best separator. Therefore, it should still request the label. On the other hand, after enough points have been labeled, if an unlabeled point occurs at the position shown below, chances are its label is not needed.

[Figure: a larger labeled sample, with a new unlabeled point far from the decision boundary.]

¹ Email: {dasgupta,djhsu,cmontel}@cs.ucsd.edu
To extend the notion of uncertainty to the agnostic setting, we divide the observed data points into two groups, $\hat S$ and $T$. Set $\hat S$ contains the data for which we did not request labels. We keep these points around and assign them the label we think they should have. Set $T$ contains the data for which we explicitly requested labels. We will manage things in such a way that the data in $\hat S$ are always consistent with the best separator in the class. Thus, somewhat counter-intuitively, the labels in $\hat S$ are completely reliable, whereas the labels in $T$ could be inconsistent with the best separator.

To decide whether we are uncertain about the label of a new point $x$, we reduce to supervised learning: we learn hypotheses $h_{+1}$ and $h_{-1}$ such that $h_{+1}$ is consistent with all the labels in $\hat S \cup \{(x, +1)\}$ and has minimal empirical error on $T$, while $h_{-1}$ is consistent with all the labels in $\hat S \cup \{(x, -1)\}$ and has minimal empirical error on $T$. If, say, the true error of the hypothesis $h_{+1}$ is much larger than that of $h_{-1}$, we can safely infer that the best separator must also label $x$ with $-1$, without requesting a label; if the error difference is only modest, we explicitly request a label. Standard generalization bounds for an i.i.d. sample let us perform this test by comparing empirical errors on $\hat S \cup T$.

The last claim may sound awfully suspicious, because $\hat S \cup T$ is not i.i.d.! Indeed, this is in a sense the core sampling problem that has always plagued active learning: the labeled sample $T$ might not be i.i.d. (due to the filtering of examples based on an adaptive criterion), while $\hat S$ only contains unlabeled examples (with made-up labels). Nevertheless, we prove that in our case it is in fact correct to effectively pretend $\hat S \cup T$ is an i.i.d. sample. A direct consequence is that the label complexity of our algorithm (the number of labels requested before achieving a desired error) is never much more than the usual sample complexity of supervised learning (and in some cases, is significantly less).

An important algorithmic detail is the specific choice of generalization bound we use in deciding whether to request a label or not. A small polynomial difference in generalization rates (between $n^{-1/2}$ and $n^{-1}$, say) can get magnified into an exponential difference in label complexity, so it is crucial for us to use a good bound. We use a normalized bound that takes into account the empirical error (computed on $\hat S \cup T$, again not an i.i.d. sample) of the hypothesis in question.

Earlier work on agnostic active learning [1, 12] has been able to upper bound label complexity in terms of a parameter of the hypothesis class (and data distribution) called the disagreement coefficient. We give label complexity bounds for our method based on this same quantity, and we get a better dependence on it: linear, rather than quadratic.

To summarize, in this paper we present and analyze a simple agnostic active learning algorithm for general hypothesis classes of bounded VC dimension. It extends the selective sampling scheme of Cohn et al. [6] to the agnostic setting, using normalized generalization bounds, which we apply in a simple but subtle manner. For certain hypothesis classes and distributions, our analysis yields improved label complexity guarantees over the standard sample complexity of supervised learning. We also demonstrate such improvements experimentally.
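To make the reduction concrete, here is a minimal sketch of the uncertainty test just described, for the toy class of thresholds on the line. Everything in it (the grid-based constrained learner `learn_h`, the helper names, and the fixed threshold `delta`) is our own illustration, not part of the report; Section 3 specifies the actual, adaptive threshold $\Delta_n$.

```python
import numpy as np

# Toy hypothesis class: thresholds on [0, 1], with h_t(x) = +1 iff x >= t.
THRESHOLDS = np.linspace(0.0, 1.0, 201)

def predict(t, x):
    return 1 if x >= t else -1

def emp_err(t, examples):
    """Empirical 0-1 error of threshold t on a list of (x, y) pairs."""
    return (sum(predict(t, x) != y for x, y in examples) / len(examples)
            if examples else 0.0)

def learn_h(A, B):
    """Crude LEARN_H: a threshold consistent with every example in A that
    has minimal empirical error on B, or None if none is consistent."""
    ok = [t for t in THRESHOLDS if all(predict(t, x) == y for x, y in A)]
    return min(ok, key=lambda t: emp_err(t, B)) if ok else None

def infer_or_request(x, S_hat, T, delta):
    """Return an inferred label for x, or None to signal a label request."""
    h_plus = learn_h(S_hat + [(x, +1)], T)
    h_minus = learn_h(S_hat + [(x, -1)], T)
    if h_plus is None:
        return -1                  # only -1 is consistent with S_hat
    if h_minus is None:
        return +1
    Z = S_hat + T
    gap = emp_err(h_minus, Z) - emp_err(h_plus, Z)
    if gap > delta:
        return +1                  # the best separator must label x with +1
    if -gap > delta:
        return -1
    return None                    # genuinely uncertain: pay for the label
```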
1.1 Related work

A large number of algorithms have been proposed for active learning, under a variety of learning models. In this section, we consider only methods whose generalization behavior has been rigorously analyzed.

An early landmark result, from 1989, was the selective sampling scheme of Cohn, Atlas, and Ladner [6] described above. This simple active learning algorithm, designed for separable data, has been the inspiration for a lot of subsequent work. A few years later, the seminal work of Freund, Seung, Shamir, and Tishby [9] analyzed an algorithm called query-by-committee that operates in a Bayesian setting and uses an elegant sampling trick for deciding when to query points. The core primitive required by this algorithm is the ability to sample randomly from the posterior over the hypothesis space. In some cases this can be achieved efficiently [10], for instance when the hypothesis class consists of linear separators in $\mathbb{R}^d$ (with a uniform prior) and the data is distributed uniformly over the surface of the unit sphere in $\mathbb{R}^d$. In this particular setting, the authors showed that the number of labels required to achieve generalization error $\varepsilon$ is just $O(d \log 1/\varepsilon)$, exponentially lower than the usual supervised sample complexity of $O(d/\varepsilon)$. Subsequently, Dasgupta, Kalai, and Monteleoni [8] showed that a simple variant of the perceptron algorithm also achieves this label complexity, even for a worst-case (non-Bayesian) choice of target hypothesis.

All the work mentioned so far assumes separable data. This case was studied abstractly by Dasgupta [7], who found that a parameter called the splitting index loosely characterizes the label complexity of actively learning hypothesis classes of bounded VC dimension. As yet, it is not known how to realize this label complexity in a computationally efficient way, except in special cases.

A natural way to formulate active learning in the agnostic setting is to ask the learner to return a hypothesis with error at most $\nu + \varepsilon$ (where $\nu$ is the error of the best hypothesis in the specified class) using as few labels as possible. A basic constraint on the label complexity was pointed out by Kääriäinen [14], who showed that for any $\nu \in (0, 1/2)$, there are data distributions that force any active learner that achieves error at most $\nu + \varepsilon$ to request $\Omega((\nu/\varepsilon)^2)$ labels.

The first rigorously-analyzed agnostic active learning algorithm, called $A^2$, was developed recently by Balcan, Beygelzimer, and Langford [1]. Like Cohn-Atlas-Ladner [6], this algorithm uses a region of uncertainty, although the lack of separability complicates matters and $A^2$ ends up explicitly maintaining an $\varepsilon$-net of the hypothesis space. Subsequently, Hanneke [12] characterized the label complexity of the $A^2$ algorithm in terms of a parameter called the disagreement coefficient. Another thread of work focuses on agnostic learning of thresholds for data that lie on a line; in this case, a precise characterization of label complexity can be given [4, 5].

These previous results either make strong distributional assumptions (such as separability, or a uniform input distribution) [2, 6-9, 13], or else they are computationally prohibitive in general [1, 7, 9]. Our work was inspired by both [6] and [1], and we have built heavily upon their insights.² We bound the label complexity of our method in terms of the same parameter as used for $A^2$ [12], and get a somewhat better dependence (linear rather than quadratic).

A common feature of Cohn-Atlas-Ladner, $A^2$, and our method is that they are all fairly non-aggressive in their choice of query points. They are content with querying all points on which there is even a small amount of uncertainty, rather than, for instance, pursuing the maximally uncertain point. Recently, Balcan, Broder, and Zhang [2] showed that for the hypothesis class of linear separators, under distributional assumptions on the data (for instance, a uniform distribution over the unit sphere), a more aggressive strategy can yield better label complexity.
² It has been noted that the Cohn-Atlas-Ladner scheme can easily be made tractable using a reduction to supervised learning in the separable case [16, p. 68]. Although our algorithm is most naturally seen as an extension of Cohn-Atlas-Ladner, a similar reduction to supervised learning (in the agnostic setting) can be used for $A^2$, as we demonstrate in Appendix B.

2 Preliminaries

2.1 Learning framework and uniform convergence

Let $\mathcal{X}$ be the input space, $D$ a distribution over $\mathcal{X} \times \{\pm 1\}$, and $H$ a class of hypotheses $h : \mathcal{X} \to \{\pm 1\}$ with VC dimension $\mathrm{vcdim}(H) = d < \infty$. Recall that the $n$th shattering coefficient $S(H, n)$ is defined as the maximum number of ways in which $H$ can label a set of $n$ points; by Sauer's lemma, this is at most $O(n^d)$ [3, p. 175]. We denote by $D_X$ the marginal of $D$ over $\mathcal{X}$.

In our active learning model, the learner receives unlabeled data sampled from $D_X$; for any sampled point $x$, it can optionally request the label $y$ sampled from the conditional distribution at $x$. This process can be viewed as sampling $(x, y)$ from $D$ and revealing only $x$ to the learner, keeping the label $y$ hidden unless the learner explicitly requests it. The error of a hypothesis $h$ under $D$ is $\mathrm{err}_D(h) = \Pr_{(x,y) \sim D}[h(x) \neq y]$, and on a finite sample $Z \subset \mathcal{X} \times \{\pm 1\}$, the empirical error of $h$ is
\[ \mathrm{err}(h, Z) = \frac{1}{|Z|} \sum_{(x,y) \in Z} \mathbf{1}[h(x) \neq y], \]
where $\mathbf{1}[\cdot]$ is the 0-1 indicator function. We assume for simplicity that the minimal error $\nu = \inf\{\mathrm{err}_D(h) : h \in H\}$ is achieved by a hypothesis $h^* \in H$.

Our algorithm and analysis use the following normalized uniform convergence bound [3, p. 200].

Lemma 1 (Vapnik and Chervonenkis [17]). Let $\mathcal{F}$ be a family of measurable functions $f : \mathcal{Z} \to \{0, 1\}$ over a space $\mathcal{Z}$. Denote by $\hat{\mathbb{E}}_Z f$ the empirical average of $f$ over a subset $Z \subseteq \mathcal{Z}$. Let $\alpha_n = \sqrt{(4/n) \ln(8 S(\mathcal{F}, 2n)/\delta)}$. If $Z$ is an i.i.d. sample of size $n$ from a fixed distribution over $\mathcal{Z}$, then, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$:
\[ -\min\left( \alpha_n^2 + \alpha_n\sqrt{\hat{\mathbb{E}}_Z f},\; \alpha_n\sqrt{\mathbb{E} f} \right) \;\leq\; \hat{\mathbb{E}}_Z f - \mathbb{E} f \;\leq\; \min\left( \alpha_n^2 + \alpha_n\sqrt{\mathbb{E} f},\; \alpha_n\sqrt{\hat{\mathbb{E}}_Z f} \right). \]

2.2 Disagreement coefficient

The active learning algorithm we will shortly describe is not very aggressive: rather than seeking out points that are maximally informative, it queries every point that it is somewhat unsure about. The early work of Cohn-Atlas-Ladner [6] and the recent $A^2$ algorithm [1] are similarly mellow in their querying strategy. The label complexity improvements achievable by such algorithms are nicely captured by a parameter called the disagreement coefficient, introduced recently by Hanneke [12] in his analysis of $A^2$.

To motivate the disagreement coefficient, imagine that we are in the midst of learning, and that our current hypothesis $h_t$ has error at most $\beta$. Suppose we even know the value of $\beta$. Then the only candidate hypotheses we still need to consider are those that differ from $h_t$ on at most a $2\beta$ fraction of the input distribution, because all other hypotheses must have error more than $\beta$. To make this a bit more formal, we impose a (pseudo-)metric on the space of hypotheses, as follows.

Definition 1. The disagreement pseudo-metric $\rho$ on $H$ is defined by $\rho(h, h') = \Pr_{D_X}[h(x) \neq h'(x)]$ for $h, h' \in H$. Let $B(h, r) = \{h' \in H : \rho(h, h') \leq r\}$ be the ball centered around $h$ of radius $r$.

Returning to our earlier scenario, we need only consider hypotheses in $B(h_t, 2\beta)$, and thus, when we see a new data point $x$, there is no sense in asking for its label if all of $B(h_t, 2\beta)$ agrees on what this label should be. The only points we potentially need to query are
\[ \{x : h(x) \neq h'(x) \text{ for some } h, h' \in B(h_t, 2\beta)\}. \]
Intuitively, the disagreement coefficient captures how the measure of this set grows with $\beta$. The following is a slight variation of the original definition of Hanneke [12].

Definition 2. The disagreement coefficient $\theta = \theta(D, H, \varepsilon) > 0$ is
\[ \theta = \sup\left\{ \frac{\Pr_{D_X}[\exists h \in B(h^*, r) \text{ s.t. } h(x) \neq h^*(x)]}{r} \;:\; r \geq \varepsilon + \nu \right\}, \]
where $h^* = \arg\inf_{h \in H} \mathrm{err}_D(h)$ and $\nu = \mathrm{err}_D(h^*)$.

Clearly, $\theta \leq 1/(\varepsilon + \nu)$; furthermore, it is a constant bounded independently of $1/(\varepsilon + \nu)$ in several cases previously considered in the literature [12]. For example, if $H$ is the class of homogeneous linear separators and $D_X$ is the uniform distribution over the unit sphere in $\mathbb{R}^d$, then $\theta = \Theta(\sqrt{d})$.

Algorithm 1
Input: stream $(x_1, x_2, \ldots, x_m)$ i.i.d. from $D_X$.
Initially, $\hat S_0 := \emptyset$ and $T_0 := \emptyset$.
For $n = 1, 2, \ldots, m$:
1. For each $\hat y \in \{\pm 1\}$, let $h_{\hat y} := \mathrm{LEARN}_H(\hat S_{n-1} \cup \{(x_n, \hat y)\},\, T_{n-1})$.
2. If $\mathrm{err}(h_{-\hat y}, \hat S_{n-1} \cup T_{n-1}) - \mathrm{err}(h_{\hat y}, \hat S_{n-1} \cup T_{n-1}) > \Delta_{n-1}$ (or if no such $h_{-\hat y}$ is found) for some $\hat y \in \{\pm 1\}$, then $\hat S_n := \hat S_{n-1} \cup \{(x_n, \hat y)\}$ and $T_n := T_{n-1}$.
3. Else request $y_n$; $\hat S_n := \hat S_{n-1}$ and $T_n := T_{n-1} \cup \{(x_n, y_n)\}$.
Return $h_f = \mathrm{LEARN}_H(\hat S_m, T_m)$.

Figure 1: The agnostic selective sampling algorithm. See (1) for a possible setting for $\Delta_n$.
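For concreteness, the loop in Figure 1 translates directly into code. The sketch below is our own rendering, not part of the report: `learn`, `err`, and `delta_fn` are assumed callables playing the roles of $\mathrm{LEARN}_H$, empirical error, and the threshold $\Delta_n$ from (1).

```python
from typing import Callable, List, Optional, Sequence, Tuple

Example = Tuple[object, int]  # (x, y) with y in {-1, +1}

def algorithm1(xs: Sequence,
               oracle: Callable[[object], int],
               learn: Callable[[List[Example], List[Example]], Optional[object]],
               err: Callable[[object, List[Example]], float],
               delta_fn: Callable[..., float]):
    """Agnostic selective sampling (Figure 1), as a sketch.
    oracle(x) is charged one label query each time it is called."""
    S_hat: List[Example] = []  # inferred labels, kept consistent with h*
    T: List[Example] = []      # explicitly requested labels
    for n, x in enumerate(xs, start=1):
        h_plus = learn(S_hat + [(x, +1)], T)
        h_minus = learn(S_hat + [(x, -1)], T)
        if h_plus is None:                 # label is forced (see footnote 4)
            S_hat.append((x, -1))
        elif h_minus is None:
            S_hat.append((x, +1))
        else:
            Z = S_hat + T
            gap = err(h_minus, Z) - err(h_plus, Z)
            delta = delta_fn(n, h_plus, h_minus, Z)
            if gap > delta:
                S_hat.append((x, +1))      # infer +1 without a query
            elif -gap > delta:
                S_hat.append((x, -1))      # infer -1 without a query
            else:
                T.append((x, oracle(x)))   # uncertain: request y_n
    return learn(S_hat, T)                 # h_f
```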
3 Agnostic selective sampling

Here we state and analyze our general algorithm for agnostic active learning. The main techniques employed by the algorithm are reductions to a supervised learning task and generalization bounds applied to differences of empirical errors.

3.1 A general algorithm for agnostic active learning

Figure 1 states our algorithm in full generality. The input is a stream of $m$ unlabeled examples drawn i.i.d. from $D_X$; for the time being, $m$ can be thought of as $\tilde O((d/\varepsilon)(1 + \nu/\varepsilon))$, where $\varepsilon$ is the accuracy parameter.³

The algorithm operates by reduction to a special kind of supervised learning that includes hard constraints. For $A, B \subseteq \mathcal{X} \times \{\pm 1\}$, let $\mathrm{LEARN}_H(A, B)$ denote a supervised learner that returns a hypothesis $h \in H$ consistent with $A$, and with minimum error on $B$. If there is no hypothesis consistent with $A$, it reports this. For some simple hypothesis classes, like intervals on the line or rectangles in $\mathbb{R}^2$, it is easy to construct such a learner. For more complex classes like linear separators, the main bottleneck is the hardness of minimizing the 0-1 loss on $B$ (that is, the hardness of agnostic supervised learning). If a convex upper bound on this loss function is used instead, as in the case of soft-margin support vector machines, it is straightforward to incorporate hard constraints; but at present, the rigorous guarantees accompanying our algorithm apply only if the 0-1 loss is used.

³ The $\tilde O$ notation suppresses $\log(1/\delta)$ and terms polylogarithmic in those that appear.

Algorithm 1 maintains two sets of labeled examples, $\hat S$ and $T$, each of which is initially empty. Upon receiving $x_n$, it learns two⁴ hypotheses, $h_{\hat y} = \mathrm{LEARN}_H(\hat S \cup \{(x_n, \hat y)\}, T)$ for $\hat y \in \{\pm 1\}$, and then compares their empirical errors on $\hat S \cup T$. If the difference is large enough, it is possible to infer how $h^*$ labels $x_n$ (as we show in Lemma 3). In this case, the algorithm adds $x_n$, with this inferred label, to $\hat S$. Otherwise, the algorithm requests the label $y_n$ and adds $(x_n, y_n)$ to $T$. Thus, $\hat S$ contains examples with inferred labels consistent with $h^*$, and $T$ contains examples with their requested labels. Because $h^*$ might err on some examples in $T$, we just insist that $\mathrm{LEARN}_H$ find a hypothesis with minimal error on $T$. Meanwhile, by construction, $h^*$ is consistent with $\hat S$ (as we shall see), so we require $\mathrm{LEARN}_H$ to only consider hypotheses consistent with $\hat S$.

3.2 Bounds for error differences

We still need to specify $\Delta_n$, the threshold value for error differences that determines whether the algorithm requests a label or not. Intuitively, $\Delta_n$ should reflect how closely empirical errors on a sample approximate true errors on the distribution $D$. Note that our algorithm is modular with respect to the choice of $\Delta_n$; so, for example, it can be customized for a particular input distribution and hypothesis class. Below we provide a simple and adaptive setting that works for any distribution and hypothesis class with finite VC dimension.

The setting of $\Delta_n$ can only depend on observable quantities, so we first clarify the distinction between empirical errors on $\hat S_n \cup T_n$ and those with respect to the true (hidden) labels.

Definition 3. Let $\hat S_n$ and $T_n$ be as defined in Algorithm 1. Let $S_n$ (shedding the hat accent) be the set of labeled examples identical to those in $\hat S_n$, except with the true hidden labels swapped in. Thus, for example, $S_n \cup T_n$ is an i.i.d. sample from $D$ of size $n$. Finally, let
\[ \mathrm{err}_n(h) = \mathrm{err}(h, S_n \cup T_n) \quad \text{and} \quad \widehat{\mathrm{err}}_n(h) = \mathrm{err}(h, \hat S_n \cup T_n). \]

It is straightforward to apply Lemma 1 to empirical errors on $S_n \cup T_n$, i.e. to $\mathrm{err}_n(h)$, but we cannot use such bounds algorithmically: we do not request the true labels for points in $\hat S_n$ and thus cannot reliably compute $\mathrm{err}_n(h)$. What we can compute are error differences $\widehat{\mathrm{err}}_n(h) - \widehat{\mathrm{err}}_n(h')$ for pairs of hypotheses $(h, h')$ that agree on (and thus make the same mistakes on) $\hat S_n$, since for such pairs we have $\widehat{\mathrm{err}}_n(h) - \widehat{\mathrm{err}}_n(h') = \mathrm{err}_n(h) - \mathrm{err}_n(h')$.⁵
These empirical error differences are means of $\{-1, 0, +1\}$-valued random variables. We need to rewrite them in terms of $\{0, 1\}$-valued random variables for some of the concentration bounds we will be using.

Definition 4. For a pair $(h, h') \in H \times H$, define
\[ f^+_{h,h'}(x, y) = \mathbf{1}[h(x) \neq y \wedge h'(x) = y] \quad \text{and} \quad f^-_{h,h'}(x, y) = \mathbf{1}[h(x) = y \wedge h'(x) \neq y]. \]

With this notation, we have $\mathrm{err}(h, Z) - \mathrm{err}(h', Z) = \hat{\mathbb{E}}_Z[f^+_{h,h'}] - \hat{\mathbb{E}}_Z[f^-_{h,h'}]$ for any $Z \subseteq \mathcal{X} \times \{\pm 1\}$. Now, applying Lemma 1 to $\mathcal{G} = \{f^+_{h,h'} : (h, h') \in H \times H\} \cup \{f^-_{h,h'} : (h, h') \in H \times H\}$, and noting that $S(\mathcal{G}, n) \leq S(H, n)^2$, gives the following lemma.

Lemma 2. Let $\beta_n = \sqrt{(4/n) \ln(8 S(H, 2n)^2/\delta)}$. With probability at least $1 - \delta$ over an i.i.d. sample $Z$ of size $n$ from $D$, we have for all $(h, h') \in H \times H$:
\[ \mathrm{err}(h, Z) - \mathrm{err}(h', Z) \;\leq\; \mathrm{err}_D(h) - \mathrm{err}_D(h') + \beta_n^2 + \beta_n\left( \sqrt{\hat{\mathbb{E}}_Z[f^+_{h,h'}]} + \sqrt{\hat{\mathbb{E}}_Z[f^-_{h,h'}]} \right). \]

With $Z = S_n \cup T_n$, the error difference on the left-hand side is $\mathrm{err}_n(h) - \mathrm{err}_n(h')$, which can be empirically determined for pairs agreeing on $\hat S_n$, because it is then equal to $\widehat{\mathrm{err}}_n(h) - \widehat{\mathrm{err}}_n(h')$. But the terms in the square roots on the right-hand side still pose a problem, which we fix next.

⁴ If $\mathrm{LEARN}_H$ cannot find a hypothesis consistent with $\hat S \cup \{(x_n, \hat y)\}$ for some $\hat y$, then, assuming $h^*$ is consistent with $\hat S$, it must be that $h^*(x_n) = -\hat y$. In this case, we simply add $(x_n, -\hat y)$ to $\hat S$, regardless of the error difference.

⁵ This observation is enough to immediately justify the use of additive generalization bounds for $\Delta_n$. However, we need to use normalized (multiplicative) bounds to achieve a better label complexity.

Corollary 1. Let $\beta_n = \sqrt{(4/n) \ln(8(n^2 + n) S(H, 2n)^2/\delta)}$. Then, with probability at least $1 - \delta$, for all $n \geq 1$ and all $(h, h') \in H \times H$ consistent with $\hat S_n$, we have
\[ \mathrm{err}_n(h) - \mathrm{err}_n(h') \;\leq\; \mathrm{err}_D(h) - \mathrm{err}_D(h') + \beta_n^2 + \beta_n\left( \sqrt{\widehat{\mathrm{err}}_n(h)} + \sqrt{\widehat{\mathrm{err}}_n(h')} \right). \]

Proof. For each $n \geq 1$, we apply Lemma 2 using $Z = S_n \cup T_n$ and $\delta_n = \delta/(n^2 + n)$. Then we apply a union bound over all $n \geq 1$ (note that $\sum_{n \geq 1} \delta/(n^2 + n) = \delta$). Thus, with probability at least $1 - \delta$, the bounds in Lemma 2 hold simultaneously for all $n \geq 1$ and all $(h, h') \in H^2$ with $S_n \cup T_n$ in place of $Z$. The corollary follows because $\mathrm{err}_n(h) - \mathrm{err}_n(h') = \widehat{\mathrm{err}}_n(h) - \widehat{\mathrm{err}}_n(h')$ for pairs consistent with $\hat S_n$, and because $\hat{\mathbb{E}}_{S_n \cup T_n}[f^+_{h,h'}] \leq \widehat{\mathrm{err}}_n(h)$ and $\hat{\mathbb{E}}_{S_n \cup T_n}[f^-_{h,h'}] \leq \widehat{\mathrm{err}}_n(h')$. To see the first of these expectation bounds, witness that because $h$ and $h'$ agree on $S_n$,
\[ \hat{\mathbb{E}}_{S_n \cup T_n}[f^+_{h,h'}] = \frac{1}{n} \sum_{(x,y) \in T_n} \mathbf{1}[h(x) \neq y \wedge h'(x) = y] \;\leq\; \frac{1}{n} \sum_{(x,y) \in T_n} \mathbf{1}[h(x) \neq y] = \widehat{\mathrm{err}}_n(h). \]
The second bound is similar. □

Corollary 1 implies that we can effectively apply the normalized uniform convergence bounds from Lemma 1 to empirical error differences on $\hat S_n \cup T_n$, even though $\hat S_n \cup T_n$ is not an i.i.d. sample from $D$. In light of this, we use the following setting of $\Delta_n$:
\[ \Delta_n := \beta_n^2 + \beta_n\left( \sqrt{\widehat{\mathrm{err}}_n(h_{+1})} + \sqrt{\widehat{\mathrm{err}}_n(h_{-1})} \right) \tag{1} \]
where $\beta_n = \sqrt{(4/n) \ln(8(n^2 + n) S(H, 2n)^2/\delta)} = \tilde O(\sqrt{d \log n / n})$ as per Corollary 1.

3.3 Correctness and fall-back analysis

We now justify our setting of $\Delta_n$ with a correctness proof and fall-back guarantee. The following lemma elucidates how the inferred labels in $\hat S$ serve as a mechanism for implicitly maintaining a candidate set of hypotheses that always includes $h^*$. The fall-back guarantee then follows almost immediately.

Lemma 3. With probability at least $1 - \delta$, the hypothesis $h^* = \arg\inf_{h \in H} \mathrm{err}_D(h)$ is consistent with $\hat S_n$ for all $n \geq 0$ in Algorithm 1.

Proof. Apply the bounds in Corollary 1 (they hold with probability at least $1 - \delta$) and proceed by induction on $n$. The base case is trivial since $\hat S_0 = \emptyset$. Now assume $h^*$ is consistent with $\hat S_n$. Suppose, upon receiving $x_{n+1}$, we discover $\widehat{\mathrm{err}}_n(h_{+1}) - \widehat{\mathrm{err}}_n(h_{-1}) > \Delta_n$. We will show that $h^*(x_{n+1}) = -1$ (assume both $h_{+1}$ and $h_{-1}$ exist, since it is clear that $h^*(x_{n+1}) = -1$ if $h_{+1}$ does not exist). Suppose for the sake of contradiction that $h^*(x_{n+1}) = +1$. We know that $\widehat{\mathrm{err}}_n(h^*) \geq \widehat{\mathrm{err}}_n(h_{+1})$ (by the inductive hypothesis, $h^*$ is consistent with $\hat S_n \cup \{(x_{n+1}, +1)\}$, and yet the learner chose $h_{+1}$ in preference to it) and $\widehat{\mathrm{err}}_n(h_{+1}) - \widehat{\mathrm{err}}_n(h_{-1}) > \beta_n^2 + \beta_n(\sqrt{\widehat{\mathrm{err}}_n(h_{+1})} + \sqrt{\widehat{\mathrm{err}}_n(h_{-1})})$. In particular, $\widehat{\mathrm{err}}_n(h_{+1}) > \beta_n^2$. Therefore,
\[ \widehat{\mathrm{err}}_n(h^*) - \widehat{\mathrm{err}}_n(h_{-1}) = (\widehat{\mathrm{err}}_n(h^*) - \widehat{\mathrm{err}}_n(h_{+1})) + (\widehat{\mathrm{err}}_n(h_{+1}) - \widehat{\mathrm{err}}_n(h_{-1})) \]
\[ > \sqrt{\widehat{\mathrm{err}}_n(h_{+1})}\left(\sqrt{\widehat{\mathrm{err}}_n(h^*)} - \sqrt{\widehat{\mathrm{err}}_n(h_{+1})}\right) + \beta_n^2 + \beta_n\left(\sqrt{\widehat{\mathrm{err}}_n(h_{+1})} + \sqrt{\widehat{\mathrm{err}}_n(h_{-1})}\right) \]
\[ \geq \beta_n^2 + \beta_n\left(\sqrt{\widehat{\mathrm{err}}_n(h^*)} + \sqrt{\widehat{\mathrm{err}}_n(h_{-1})}\right). \]
Now Corollary 1 implies that $\mathrm{err}_D(h^*) > \mathrm{err}_D(h_{-1})$, a contradiction. □
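For concreteness, the threshold (1) can be computed as follows. This sketch is ours, not the report's; it bounds the shattering coefficient via Sauer's lemma as $S(H, 2n) \leq (2n + 1)^d$, which only loosens $\beta_n$.

```python
import math

def beta_n(n: int, d: int, delta: float) -> float:
    """The scale beta_n from Corollary 1, with S(H, 2n) crudely bounded
    by (2n + 1)^d via Sauer's lemma."""
    ln_shatter = d * math.log(2 * n + 1.0)  # upper bound on ln S(H, 2n)
    inside = (math.log(8.0) + math.log(n * n + n)
              + 2 * ln_shatter + math.log(1.0 / delta))
    return math.sqrt(4.0 * inside / n)

def delta_threshold(n, d, delta, emp_err_plus, emp_err_minus):
    """Delta_n from equation (1), given the empirical errors of h_{+1}
    and h_{-1} on the current S_hat union T."""
    b = beta_n(n, d, delta)
    return b * b + b * (math.sqrt(emp_err_plus) + math.sqrt(emp_err_minus))
```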
Theorem 1. Let $\nu = \inf_{h \in H} \mathrm{err}_D(h)$ and $d = \mathrm{vcdim}(H)$. There exists a constant $c > 0$ such that the following holds. If Algorithm 1 is given a stream of $m$ unlabeled examples, then with probability at least $1 - \delta$, the algorithm returns a hypothesis with error at most
\[ \nu + c\left( \frac{1}{m}\left(d \log m + \log\frac{1}{\delta}\right) + \sqrt{\frac{\nu}{m}\left(d \log m + \log\frac{1}{\delta}\right)} \right). \]

Proof. Lemma 3 implies that $h^*$ is consistent with $\hat S_m$ with probability at least $1 - \delta$. Using the same bounds from Corollary 1 (already applied in Lemma 3) on $h^*$ and $h_f$, together with the fact $\widehat{\mathrm{err}}_m(h_f) \leq \widehat{\mathrm{err}}_m(h^*)$, we have $\mathrm{err}_D(h_f) \leq \nu + \beta_m^2 + \beta_m\sqrt{\nu} + \beta_m\sqrt{\mathrm{err}_D(h_f)}$, which in turn implies $\mathrm{err}_D(h_f) \leq \nu + 3\beta_m^2 + 2\beta_m\sqrt{\nu}$. □

So, Algorithm 1 returns a hypothesis with error at most $\nu + \varepsilon$ when $m = \tilde O((d/\varepsilon)(1 + \nu/\varepsilon))$; this is (asymptotically) the usual sample complexity of supervised learning. Since the algorithm requests at most $m$ labels, its label complexity is always at most $\tilde O((d/\varepsilon)(1 + \nu/\varepsilon))$.

3.4 Label complexity analysis

We can also bound the label complexity of our algorithm in terms of the disagreement coefficient $\theta$. This yields tighter bounds when $\theta$ is bounded independently of $1/(\nu + \varepsilon)$.

The key to deriving our label complexity bounds based on $\theta$ is noting that the probability of requesting the $(n+1)$st label is intimately related to $\theta$ and $\Delta_n$.

Lemma 4. There exist constants $c_1, c_2 > 0$ such that, with probability at least $1 - 2\delta$, for all $n \geq 1$, the following holds. Let $\hat y = h^*(x_{n+1})$, where $h^* = \arg\inf_{h \in H} \mathrm{err}_D(h)$. Then, the probability that Algorithm 1 requests the label $y_{n+1}$ is
\[ \Pr_{x_{n+1} \sim D_X}[\text{Request } y_{n+1}] \;\leq\; \Pr_{x_{n+1} \sim D_X}\left[ \mathrm{err}_D(h_{-\hat y}) \leq c_1 \nu + c_2 \beta_n^2 \right], \]
where $\beta_n$ is as defined in Corollary 1 and $\nu = \inf_{h \in H} \mathrm{err}_D(h)$.

Proof. See Appendix A. □

Lemma 5. In the same setting as Lemma 4, there exists a constant $c > 0$ such that
\[ \Pr_{x_{n+1} \sim D_X}[\text{Request } y_{n+1}] \;\leq\; c\,\theta\,(\nu + \beta_n^2), \]
where $\theta = \theta(D, H, 3\beta_m^2 + 2\beta_m\sqrt{\nu})$ is the disagreement coefficient, $\nu = \inf_{h \in H} \mathrm{err}_D(h)$, and $\beta_n$ is as defined in Corollary 1.

Proof. Suppose $h^*(x_{n+1}) = -1$. By the triangle inequality, we have $\mathrm{err}_D(h_{+1}) \geq \rho(h_{+1}, h^*) - \nu$, where $\rho$ is the disagreement metric on $H$ (Definition 1). By Lemma 4, this implies that the probability of requesting $y_{n+1}$ is at most the probability that $\rho(h_{+1}, h^*) \leq (c_1 + 1)\nu + c_2\beta_n^2$, for some constants $c_1, c_2 > 0$. We can choose the constants so that $(c_1 + 1)\nu + c_2\beta_n^2 \geq \nu + 3\beta_m^2 + 2\beta_m\sqrt{\nu}$. Then, the definition of the disagreement coefficient gives the conclusion that
\[ \Pr_{x_{n+1} \sim D_X}\left[ \rho(h_{+1}, h^*) \leq (c_1 + 1)\nu + c_2\beta_n^2 \right] \;\leq\; \theta \cdot \left( (c_1 + 1)\nu + c_2\beta_n^2 \right). \]
□

Now we give our main label complexity bound for agnostic active learning.

Theorem 2. Let $m$ be the number of unlabeled data given to Algorithm 1, $d = \mathrm{vcdim}(H)$, $\nu = \inf_{h \in H} \mathrm{err}_D(h)$, $\beta_m$ as defined in Corollary 1, and $\theta = \theta(D, H, 3\beta_m^2 + 2\beta_m\sqrt{\nu})$. There exists a constant $c_1 > 0$ such that for any $c_2 \geq 1$, with probability at least $1 - 2\delta$:

1. If $\nu \leq (c_2 - 1)\beta_m^2$, Algorithm 1 returns a hypothesis with error as bounded in Theorem 1, and the expected number of labels requested is at most
\[ 1 + c_1 c_2 \theta \left( d \log^2 m + \log\frac{1}{\delta} \log m \right). \]
2. Else, the same holds, except the expected number of labels requested is at most
\[ 1 + c_1 c_2 \theta \left( \nu m + d \log^2 m + \log\frac{1}{\delta} \log m \right). \]

Furthermore, if $L$ is the expected number of labels requested as per above, then with probability at least $1 - \delta'$, the algorithm requests no more than $L + \sqrt{3 L \log(1/\delta')}$ labels.

Proof. Follows from Lemma 5 and a Chernoff bound for the Poisson trials $\mathbf{1}[\text{Request } y_n]$. □
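The Chernoff step in the proof of Theorem 2 is compressed; the following instantiation is our reading of it, assuming the standard multiplicative bound for sums of 0/1 trials with mean at most $L$, and assuming $L \geq 3\log(1/\delta')$ so that the deviation parameter is at most 1:
\[ \Pr[N \geq (1+\eta)L] \;\leq\; e^{-L\eta^2/3}, \qquad \eta := \sqrt{\frac{3\log(1/\delta')}{L}} \;\Longrightarrow\; \Pr\left[ N \geq L + \sqrt{3L\log(1/\delta')} \right] \;\leq\; \delta', \]
where $N = \sum_n \mathbf{1}[\text{Request } y_n]$ has mean at most $L$.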
With the substitution $\varepsilon = 3\beta_m^2 + 2\beta_m\sqrt{\nu}$ as per Theorem 1, Theorem 2 entails that for any hypothesis class and data distribution for which the disagreement coefficient $\theta = \theta(D, H, \varepsilon)$ is bounded independently of $1/(\nu + \varepsilon)$ (see [12] for some examples), Algorithm 1 only needs $\tilde O(\theta d \log^2(1/\varepsilon))$ labels to achieve error $\varepsilon$, and $\tilde O(\theta d (\log^2(1/\varepsilon) + (\nu/\varepsilon)^2))$ labels to achieve error $\nu + \varepsilon$. The latter matches the dependence on $\nu/\varepsilon$ in the $\Omega((\nu/\varepsilon)^2)$ lower bound [14].

The linear dependence on $\theta$ improves on the quadratic dependence shown for $A^2$ [12].⁶ For an illustrative consequence of this, suppose $D_X$ is the uniform distribution on the sphere in $\mathbb{R}^d$ and $H$ is the class of homogeneous linear separators; in this case, $\theta = \Theta(\sqrt{d})$. Then the label complexity of $A^2$ depends at least quadratically on the dimension, whereas the corresponding dependence for our algorithm is $d^{3/2}$. A specially-designed setting of $\Delta_n$ (say, specific to the input distribution and hypothesis class) may be able to further reduce the dependence to $d$ (see [2]).

4 Experiments

We implemented Algorithm 1 in a few simple cases to experimentally demonstrate the label complexity improvements. In each case, the data distribution $D_X$ was uniform over $[0, 1]$; the stream length was $m = 10000$, and each experiment was repeated 20 times with different random seeds.

Our first experiment studied linear thresholds on the line. The target hypothesis was fixed to be $h^*(x) = \mathrm{sign}(x - 0.5)$. For this hypothesis class, we used two different noise models, each of which ensured $\inf_{h \in H} \mathrm{err}_D(h) = \mathrm{err}_D(h^*) = \nu$ for a pre-specified $\nu \in [0, 1]$. The first model was random misclassification: for each point $x \sim D_X$, we independently labeled it $h^*(x)$ with probability $1 - \nu$ and $-h^*(x)$ with probability $\nu$. In the second model (also used in [4]), for each point $x \sim D_X$, we independently labeled it $+1$ with probability $(x - 0.5)/(4\nu) + 0.5$ and $-1$ otherwise, thus concentrating the noise near the boundary.

Our second experiment studied intervals on the line. Here, we only used random misclassification, but we varied the target interval length $p_+ = \Pr_{D_X}[h^*(x) = +1]$.

The results show that the number of labels requested by Algorithm 1 was exponentially smaller than the total number of data seen ($m$) under the first noise model, and was polynomially smaller under the second noise model (see Figure 2; we verified the polynomial vs. exponential distinction on separate log-log scale plots). In the case of intervals, we observe an initial phase (of duration roughly $1/p_+$) in which every label is requested, followed by a more efficient phase, confirming the known active-learnability of this class [7, 12]. These improvements show that our algorithm needed significantly fewer labels to achieve the same error as a standard supervised algorithm that uses labels for all points seen.

As a sanity check, we examined the locations of data for which Algorithm 1 requested a label. We looked at two particular runs of the algorithm: the first was with $H$ = intervals, $p_+ = 0.2$, $m = 10000$, and $\nu = 0.1$; the second was with $H$ = boxes ($d = 2$), $p_+ = 0.49$, $m = 1000$, and $\nu = 0.01$. In each case, the data distribution was uniform over $[0, 1]^d$, and the noise model was random misclassification. Figure 3 shows that, early on, labels were requested everywhere. But as the algorithm progressed, label requests concentrated near the boundary of the target hypothesis.
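For reference, the two noise models are easy to state in code. This sketch is our own; in particular, clipping the label probability to $[0, 1]$ in the boundary model is our assumption, made so that the probability is well-defined for all $x \in [0, 1]$.

```python
import numpy as np
rng = np.random.default_rng(0)

def h_star(x):
    return 1 if x >= 0.5 else -1    # target threshold at 0.5

def label_random_misclassification(x, nu):
    """Flip h*(x) independently with probability nu."""
    y = h_star(x)
    return -y if rng.random() < nu else y

def label_boundary_noise(x, nu):
    """P(+1 | x) = (x - 0.5)/(4 nu) + 0.5, clipped to [0, 1]:
    noise concentrated near the decision boundary (cf. [4])."""
    p = float(np.clip((x - 0.5) / (4.0 * nu) + 0.5, 0.0, 1.0))
    return 1 if rng.random() < p else -1
```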
⁶ It may be possible to reduce $A^2$'s quadratic dependence to a linear dependence by using normalized bounds, as we do here.

[Figure 2 (plots omitted): Labeling rate plots. The plots show the number of labels requested (vertical axis) versus the total number of points seen (labeled + unlabeled, horizontal axis) using Algorithm 1. (a) $H$ = thresholds: under random misclassification noise with $\nu = 0$ (solid), $0.1$ (dashed), $0.2$ (dot-dashed); under the boundary noise model with $\nu = 0.1$ (lower dotted), $0.2$ (upper dotted). (b) $H$ = intervals: under random misclassification with $(p_+, \nu) = (0.2, 0.0)$ (solid), $(0.1, 0.0)$ (dashed), $(0.2, 0.1)$ (dot-dashed), $(0.1, 0.1)$ (dotted).]

[Figure 3 (plots omitted): Locations of label requests. (a) $H$ = intervals, $h^* = [0.4, 0.6]$. The top histogram shows the locations of the first 400 label requests (the x-axis is the unit interval); the bottom histogram is for all (2141) label requests. (b) $H$ = boxes, $h^* = [0.15, 0.85]^2$. The first 200 requests occurred at one marker type, the next 200 at a second, and the final 109 at a third.]

5 Conclusion and future work

We have presented a simple and natural approach to agnostic active learning. Our extension of the selective sampling scheme of Cohn-Atlas-Ladner [6]:

1. simplifies the maintenance of the region of uncertainty with a reduction to supervised learning, and
2. guards against noise with a suitable algorithmic application of generalization bounds.

Our algorithm relies on a threshold parameter $\Delta_n$ for comparing empirical errors. We prescribe a very simple and natural choice for $\Delta_n$ (a normalized generalization bound from supervised learning), but one could hope for a more clever or aggressive choice, akin to those in [2] for linear separators.

Finding consistent hypotheses when data is separable is often a simple task. In such cases, reduction-based active learning algorithms can be relatively efficient (answering some questions posed in [15]). On the other hand, agnostic supervised learning is computationally intractable for many hypothesis classes (e.g. [11]), and of course, agnostic active learning is at least as hard in the worst case. Our reduction to supervised learning is benign in the sense that the learning problems we need to solve are over samples from the original distribution, so we do not create pathologically hard instances (like those arising from hardness reductions [11]) unless they are inherent in the data. Nevertheless, an important research direction is to develop consistent active learning algorithms that only require solving tractable (e.g. convex) optimization problems. A similar reduction-based scheme may be possible.

6 Acknowledgements

We are grateful to the Engineering Institute (a research and educational partnership between Los Alamos National Laboratory and U.C. San Diego) for supporting the second author with a graduate fellowship, and to the NSF for support under grants IIS-0347646 and IIS-0713540.

References

[1] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, 2006.
[2] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.
[3] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Lecture Notes in Artificial Intelligence, 3176:169-207, 2004.
[4] R. Castro and R. Nowak. Upper and lower bounds for active learning. In Allerton Conference on Communication, Control and Computing, 2006.
[5] R. Castro and R. Nowak. Minimax bounds for active learning. In COLT, 2007.
[6] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
[7] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.
[8] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In COLT, 2005.
[9] Y. Freund, H. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2):133-168, 1997.
[10] R. Gilad-Bachrach, A. Navot, and N. Tishby. Query by committee made real. In NIPS, 2005.
[11] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In FOCS, 2006.
[12] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.
[13] S. Hanneke. Teaching dimension and the complexity of active learning. In COLT, 2007.
[14] M. Kääriäinen. Active learning in the non-realizable case. In ALT, 2006.
[15] C. Monteleoni. Efficient algorithms for general active learning. Open problem, In COLT, 2006.
[16] C. Monteleoni. Learning with online constraints: shifting concepts and active learning. PhD thesis, MIT Computer Science and Artificial Intelligence Laboratory, 2006.
[17] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264-280, 1971.

A Proof of Lemma 4

Let $\gamma_n = \sqrt{(4/n)\ln(8(n^2 + n) S(H, 2n)/\delta)}$ (which is at most $\beta_n$). With probability at least $1 - 2\delta$:

(Lemma 1) For all $n \geq 1$ and all $h \in H$, we have
\[ -\gamma_n^2 - \gamma_n\sqrt{\mathrm{err}_D(h)} \;\leq\; \mathrm{err}_D(h) - \mathrm{err}_n(h) \;\leq\; \gamma_n\sqrt{\mathrm{err}_D(h)}. \]

(Lemma 3) $h^*$ is consistent with $\hat S_n$ for all $n \geq 0$.

Throughout, we make repeated use of the simple fact that if $A \leq B + C\sqrt{A}$ for non-negative $A, B, C$, then $A \leq B + C^2 + C\sqrt{B}$.

Now suppose $h^*(x_{n+1}) = -1$ and Algorithm 1 requests the label $y_{n+1}$. We need to show that $\mathrm{err}_D(h_{+1}) \leq c_1\nu + c_2\beta_n^2$ for some positive constants $c_1$ and $c_2$. Since the algorithm requests a label, we have
\[ \widehat{\mathrm{err}}_n(h_{+1}) - \widehat{\mathrm{err}}_n(h_{-1}) \;\leq\; \beta_n^2 + \beta_n\left( \sqrt{\widehat{\mathrm{err}}_n(h_{+1})} + \sqrt{\widehat{\mathrm{err}}_n(h_{-1})} \right). \]
We bound the left-hand side from below with
\[ \widehat{\mathrm{err}}_n(h_{+1}) - \widehat{\mathrm{err}}_n(h_{-1}) \;\geq\; \widehat{\mathrm{err}}_n(h_{+1}) - \widehat{\mathrm{err}}_n(h^*) = \mathrm{err}_n(h_{+1}) - \mathrm{err}_n(h^*) \]
(here $\widehat{\mathrm{err}}_n(h_{-1}) \leq \widehat{\mathrm{err}}_n(h^*)$ because $h^*$ is feasible for the constrained learning problem that $h_{-1}$ solves optimally), and the right-hand side from above using $\widehat{\mathrm{err}}_n(h_{+1}) \leq \mathrm{err}_n(h_{+1})$ and $\widehat{\mathrm{err}}_n(h_{-1}) \leq \widehat{\mathrm{err}}_n(h^*) \leq \mathrm{err}_n(h^*)$. Therefore,
\[ \mathrm{err}_n(h_{+1}) \;\leq\; \mathrm{err}_n(h^*) + \beta_n^2 + \beta_n\sqrt{\mathrm{err}_n(h_{+1})} + \beta_n\sqrt{\mathrm{err}_n(h^*)}, \]
which, by the simple fact, in turn implies
\[ \mathrm{err}_n(h_{+1}) \;\leq\; \mathrm{err}_n(h^*) + 3\beta_n^2 + 2\beta_n\sqrt{\mathrm{err}_n(h^*)}. \]
Uniform convergence of errors allows the bounds $\mathrm{err}_n(h_{+1}) \geq \mathrm{err}_D(h_{+1}) - \gamma_n\sqrt{\mathrm{err}_D(h_{+1})}$ and $\mathrm{err}_n(h^*) \leq \nu + \gamma_n^2 + \gamma_n\sqrt{\nu}$, so, using $\gamma_n \leq \beta_n$ and $\sqrt{\nu + \gamma_n^2 + \gamma_n\sqrt{\nu}} \leq \sqrt{\nu} + \gamma_n$, it follows that
\[ \mathrm{err}_D(h_{+1}) \;\leq\; \nu + 6\beta_n^2 + 3\beta_n\sqrt{\nu} + \beta_n\sqrt{\mathrm{err}_D(h_{+1})}. \]
Applying the simple fact once more and collecting terms (using $2\beta_n\sqrt{\nu} \leq \nu + \beta_n^2$), this implies $\mathrm{err}_D(h_{+1}) \leq 3\nu + (12 + 2\sqrt{3})\beta_n^2$, as needed. □
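The "simple fact" above is used without proof; for completeness, here is a one-line derivation (our addition), solving the quadratic in $\sqrt{A}$ and then using $\sqrt{C^2 + 4B} \leq C + 2\sqrt{B}$:
\[ A \leq B + C\sqrt{A} \;\Longrightarrow\; \sqrt{A} \leq \frac{C + \sqrt{C^2 + 4B}}{2} \;\Longrightarrow\; A \leq \frac{2C^2 + 4B + 2C\sqrt{C^2 + 4B}}{4} \leq B + \frac{C^2}{2} + \frac{C}{2}\left(C + 2\sqrt{B}\right) = B + C^2 + C\sqrt{B}. \]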
B Recasting $A^2$ with reductions to supervised learning

Here we recast the $A^2$ algorithm [1] with reductions to supervised learning, making it more straightforwardly implementable. Our reduction uses the subroutine $\mathrm{LEARN}_H(S, T)$, a supervised learner that returns a hypothesis $h \in H$ consistent with $S$ and with minimal error on $T$ (or fails if none exists). The original $A^2$ algorithm explicitly maintains version spaces and regions of uncertainty; these sets are implicitly maintained with our reduction:

1. The first argument $S$ to $\mathrm{LEARN}_H$ forces the algorithm to only consider hypotheses in the current version space.
2. Upon receiving a labeled sample $T$ from the current region of uncertainty, we use $\mathrm{LEARN}_H$ to determine which unlabeled data the algorithm is still uncertain of.

Reduction-based $A^2$
Input: $X := \{x_1, x_2, \ldots, x_m\}$ i.i.d. from $D_X$.
Initially, $U_1 := X$, $S_1 := \emptyset$, $n_1 := 0$.
For phase $i = 1, 2, \ldots$:
1. Let $T_{i,n_i}$ be a random subset of $U_i$ of size $2^{n_i}$, and let $T^O_{i,n_i} := \{(x, O(x)) : x \in T_{i,n_i}\}$, requesting labels as needed (some may have already been requested in previous repeats of this phase). If $(|U_i|/|X|)\,\Delta_{i,n_i} \leq \varepsilon$, then return $h_f := \mathrm{LEARN}_H(S_i, T^O_{i,n_i})$.
2. Initialize temporary variables $C' := \emptyset$, $S' := \emptyset$.
3. For each $x \in U_i$:
   (a) For each $\hat y \in \{\pm 1\}$, let $h_{\hat y} := \mathrm{LEARN}_H(S_i \cup \{(x, \hat y)\}, T^O_{i,n_i})$.
   (b) If $\mathrm{err}(h_{-\hat y}, T^O_{i,n_i}) - \mathrm{err}(h_{\hat y}, T^O_{i,n_i}) > \Delta_{i,n_i}$ (or if no such $h_{-\hat y}$ exists) for some $\hat y \in \{\pm 1\}$, then $C' := C' \cup \{x\}$ and $S' := S' \cup \{(x, \hat y)\}$.
4. $U_{i+1} := U_i \setminus C'$, $S_{i+1} := S_i \cup S'$. If $|U_{i+1}|/|X| \leq \varepsilon$, then return $h_f := \mathrm{LEARN}_H(S_{i+1}, \emptyset)$.
5. If $|U_{i+1}| > (1/2)|U_i|$, then set $n_i := n_i + 1$ and repeat phase $i$.
6. Else set $n_{i+1} := 0$ and continue to phase $i + 1$.

Figure 4: $A^2$ recast with reductions to supervised learning. The setting of $\Delta_{i,n_i}$ is discussed in the analysis.

B.1 Algorithm

The algorithm, specified in Figure 4, uses an initial unlabeled sample $X \subset \mathcal{X}$ of size $m = \tilde O(d/\varepsilon^2)$ drawn i.i.d. from the input distribution $D_X$. This set $X$ serves as a proxy for $D_X$, so that computing the mass of the current region of uncertainty can be performed simply by counting. Consequently, the algorithm will learn a hypothesis with error close to the minimal error achievable on $X$. However, this can easily be translated to errors with respect to the full distribution by standard uniform convergence arguments.

We can think of the initial sample $X$ from $D_X$ as just the $x$ parts of a labeled sample $Z = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ from $D$; the labels $y_1, \ldots, y_m$ remain hidden unless the algorithm requests them. Let $O : X \to \{\pm 1\}$ be the mapping from the unlabeled data in $X$ to their respective hidden labels. Also, for any unlabeled subset $T$ of $X$, let $T^O = \{(x, O(x)) : x \in T\}$ be the corresponding labeled subset of $Z$.

The algorithm proceeds in phases; the goal of each phase $i$ is to cut the current region of uncertainty $U_i$ in half. Toward this end, the algorithm labels a random subset of $U_i$ and uses $\mathrm{LEARN}_H$ to check which points in $U_i$ it is still uncertain about. The algorithm is uncertain about a point $x \in U_i$ if it cannot infer $h^*(x)$, where $h^* = \arg\min_{h \in H} \mathrm{err}(h, X^O)$.

B.2 Analysis

Let $h^* = \arg\min_{h \in H} \mathrm{err}(h, X^O)$ and let $H_i = \{h \in H : h(x) = \hat y \;\forall (x, \hat y) \in S_i\}$. We assume that with probability at least $1 - \delta$, the following holds for all phases $i \geq 1$ (including repeats) and all $h \in H_i$:
\[ \left| \left( \mathrm{err}(h, T^O_{i,n_i}) - \mathrm{err}(h^*, T^O_{i,n_i}) \right) - \left( \mathrm{err}(h, U^O_i) - \mathrm{err}(h^*, U^O_i) \right) \right| \;\leq\; \Delta_{i,n_i}. \]
For example, standard generalization bounds can be used here for the $\Delta_{i,n_i}$, with appropriate $\delta$-sharing.

Lemma 6. With probability at least $1 - \delta$, for all $i \geq 1$, we have $h^* \in H_i$ and $\mathrm{err}(h^*, U^O_i) \leq \mathrm{err}(h, U^O_i)$ for all $h \in H_i$.

Proof. By induction. The base case is trivial. So assume it is true for $i$; we show it is true for $i + 1$. Suppose for the sake of contradiction that some $(x, \hat y)$ is added to $S'$ in step 3, but $h^*(x) = -\hat y$. Then $h^*$ is feasible for the constrained problem solved by $h_{-\hat y}$ (so if $h_{-\hat y}$ does not exist, we already have a contradiction), whence $\mathrm{err}(h_{-\hat y}, T^O_{i,n_i}) \leq \mathrm{err}(h^*, T^O_{i,n_i})$ and
\[ \mathrm{err}(h^*, U^O_i) - \mathrm{err}(h_{\hat y}, U^O_i) \;\geq\; \mathrm{err}(h^*, T^O_{i,n_i}) - \mathrm{err}(h_{\hat y}, T^O_{i,n_i}) - \Delta_{i,n_i} \;\geq\; \mathrm{err}(h_{-\hat y}, T^O_{i,n_i}) - \mathrm{err}(h_{\hat y}, T^O_{i,n_i}) - \Delta_{i,n_i} \;>\; \Delta_{i,n_i} - \Delta_{i,n_i} = 0, \]
so $\mathrm{err}(h^*, U^O_i) > \mathrm{err}(h_{\hat y}, U^O_i)$, a contradiction of the inductive hypothesis. Therefore, $h^*(x) = \hat y$ for all $(x, \hat y) \in S'$ (and thus those ultimately added to $S_{i+1}$), so $h^* \in H_{i+1}$.

The error of a hypothesis $h \in H_{i+1}$ on $X^O$ decomposes as follows:
\[ \mathrm{err}(h, X^O) = \frac{|X \setminus U_{i+1}|}{|X|}\,\mathrm{err}(h, (X \setminus U_{i+1})^O) + \frac{|U_{i+1}|}{|X|}\,\mathrm{err}(h, U^O_{i+1}) = \mathrm{err}(h^*, X^O) + \frac{|U_{i+1}|}{|X|}\left( \mathrm{err}(h, U^O_{i+1}) - \mathrm{err}(h^*, U^O_{i+1}) \right) \]
(the second equality follows because hypotheses in $H_{i+1}$ agree on $X \setminus U_{i+1} = \{x : (x, \hat y) \in S_{i+1}\}$). Since $\mathrm{err}(h^*, X^O) \leq \mathrm{err}(h, X^O)$, we have $\mathrm{err}(h^*, U^O_{i+1}) \leq \mathrm{err}(h, U^O_{i+1})$ for all $h \in H_{i+1}$. □

Lemma 7. If Reduction-based $A^2$ returns a hypothesis $h_f$, then $h_f$ has error $\mathrm{err}(h_f, X^O) \leq \mathrm{err}(h^*, X^O) + \varepsilon$.

Proof. We use the same decomposition of $\mathrm{err}(h, X^O)$ for $h \in H_i$ as in the previous lemma. If $h = \arg\min_{h' \in H_i} \mathrm{err}(h', T^O_{i,n_i})$, then $\mathrm{err}(h, U^O_i) \leq \mathrm{err}(h^*, U^O_i) + \Delta_{i,n_i}$, and thus $\mathrm{err}(h, X^O) \leq \mathrm{err}(h^*, X^O) + (|U_i|/|X|)\,\Delta_{i,n_i}$. Also, any hypothesis $h \in H_i$ has error at most $\mathrm{err}(h^*, X^O) + |U_i|/|X|$. The exit conditions ensure that these upper bounds are always at most $\mathrm{err}(h^*, X^O) + \varepsilon$. □

Both the fall-back label complexity guarantee, as well as those for certain specific hypothesis classes and distributions (e.g. thresholds on the line, homogeneous linear separators under the uniform distribution), carry over from $A^2$ proper to Reduction-based $A^2$.
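As with Algorithm 1, step 3 of Figure 4 is easy to phrase as code. The sketch below is our illustrative rendering of that step, under the same assumed `learn`/`err` interfaces as the earlier sketches; it is not part of the original report.

```python
def filter_uncertain(U, S, T_labeled, delta, learn, err):
    """Step 3 of Reduction-based A^2 (sketch): for each x still in the
    region of uncertainty U, try to infer its label via the constrained
    learner; return the points that stay uncertain and the inferred pairs.

    learn(A, B) -> hypothesis in H consistent with A minimizing error on B,
                   or None if no consistent hypothesis exists (LEARN_H).
    err(h, Z)   -> empirical error of h on the labeled set Z.
    """
    still_uncertain, inferred = [], []
    for x in U:
        h = {y: learn(S + [(x, y)], T_labeled) for y in (+1, -1)}
        decided = None
        for y in (+1, -1):
            rival = h[-y]
            if h[y] is not None and (
                rival is None
                or err(rival, T_labeled) - err(h[y], T_labeled) > delta
            ):
                decided = y
                break
        if decided is None:
            still_uncertain.append(x)      # x stays in U_{i+1}
        else:
            inferred.append((x, decided))  # (x, y-hat) joins S'
    return still_uncertain, inferred
```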