Journal of Machine Learning Research 5 (2004) 575-604. Submitted 1/04; Published 6/04.

A Universal Well-Calibrated Algorithm for On-line Classification

Vladimir Vovk (VOVK@CS.RHUL.AC.UK)
Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK

Editors: Kristin Bennett and Nicolò Cesa-Bianchi

Abstract

We study the problem of on-line classification in which the prediction algorithm, for each significance level $\delta$, is required to output as its prediction a range of labels (intuitively, those labels deemed compatible with the available data at the level $\delta$) rather than just one label; as usual, the examples are assumed to be generated independently from the same probability distribution $P$. The prediction algorithm is said to be well-calibrated for $P$ and $\delta$ if the long-run relative frequency of errors does not exceed $\delta$ almost surely w.r.t. $P$. For well-calibrated algorithms we take the number of uncertain predictions (i.e., those containing more than one label) as the principal measure of predictive performance. The main result of this paper is the construction of a prediction algorithm which, for any (unknown) $P$ and any $\delta$: (a) makes errors independently and with probability $\delta$ at every trial (in particular, is well-calibrated for $P$ and $\delta$); (b) makes in the long run no more uncertain predictions than any other prediction algorithm that is well-calibrated for $P$ and $\delta$; (c) processes example $n$ in time $O(\log n)$.

Keywords: Transductive Confidence Machine, on-line prediction

1. Introduction

Typical machine learning algorithms output a point prediction for the label of an unknown object. This paper continues the study of an algorithm called the Transductive Confidence Machine (TCM), introduced by Saunders et al. (1999) and Vovk et al. (1999), that complements its predictions with some measures of confidence. There are different ways of presenting TCM's output; in this paper (as in the related Vovk, 2002a,b) we use TCM as a region predictor, in the sense that it outputs a nested family of prediction regions (indexed by the significance level $\delta$) rather than a point prediction.

Any TCM is well-calibrated when used in the on-line mode: for any significance level $\delta$ the long-run relative frequency of erroneous predictions does not exceed $\delta$. What makes this feature of TCM especially appealing is that it is far from being just an asymptotic phenomenon: a slight modification of TCM called randomized TCM (rTCM) makes errors independently at different trials and with probability $\delta$ at each trial. (Randomization is needed to break ties and deal efficiently with borderline cases.) The property of being well-calibrated then immediately follows by the Borel strong law of large numbers. Figure 1 shows the cumulative numbers of errors at the significance levels 1%-5% made on the well-known USPS data set of hand-written digits (randomly permuted); as expected, these are straight lines with the slope approximately equal to the significance level. For proofs and further information, see Vovk (2002a).

© 2004 Vladimir Vovk.

Figure 1: TCM's cumulative errors at the significance levels 1%-5% on the USPS data set.

The justification of the study of TCM given by Vovk (2002a) was its good performance on real-world and standard benchmark data sets. For example, Figure 2 shows that for the significance levels between 1% and 5% most examples in the USPS data set can be predicted categorically (by a simple 1-Nearest Neighbour TCM, used in all experiments reported in this paper): the prediction region contains only one label.

This paper presents theoretical results about TCM's performance in the problem of classification, where the number of possible labels is finite; we show that there exists a universal rTCM which, for any significance level $\delta$ and without knowing the true distribution $P$ generating the examples:

- produces, asymptotically, no more uncertain predictions than any other prediction algorithm that is well-calibrated for $P$ and $\delta$;
- produces, asymptotically, at least as many empty predictions as any other prediction algorithm that is well-calibrated for $P$ and $\delta$ and whose percentage of uncertain predictions is optimal (in the sense of the previous item).

The importance of the first item is obvious: we want to minimize the number of uncertain predictions. This asymptotic criterion ceases to work, however, when the number of uncertain predictions stabilizes, as in Figure 2 for significance levels 3%-5%. In such cases the number of empty predictions becomes important: empty predictions (automatically leading to an error) provide a warning that the object is atypical (looks very different from the previous objects), and one would like to be warned as often as possible, taking into account that the relative frequency of errors (including empty predictions) is guaranteed not to exceed $\delta$ in the long run. Remember that TCM outputs a whole family of prediction regions, so the fact that at some significance level the prediction region becomes empty does not mean that all potential labels for a new object become equally likely: we should just shift our attention to other significance levels. Figure 3 shows the cumulative numbers of empty predictions for the USPS data set.

Figure 2: Cumulative number of uncertain predictions (i.e., prediction regions containing more than one label) made by the 1-Nearest Neighbour TCM at the significance levels 1%-5% on the USPS data set.

The full prediction output by a TCM is a complicated mathematical object: for each significance level $\delta$ we have a prediction region. In practice, a good starting point might be first to look at the prediction regions corresponding to two or three conventional significance levels, such as 1% and 5% (afterwards, of course, the prediction regions at other significance levels should be looked at). For example, denoting $\Gamma^\delta$ the prediction region at significance level $\delta$, we could say that the prediction is highly certain if $|\Gamma^{1\%}| \le 1$ and certain if $|\Gamma^{5\%}| \le 1$; similarly, we could say that the new object (whose label is being predicted) is highly atypical if $|\Gamma^{1\%}| = 0$ and atypical if $|\Gamma^{5\%}| = 0$. In the case of classification, the family of prediction regions $\Gamma^\delta$ can be summarized by reporting: the confidence $\sup\{1 - \delta : |\Gamma^\delta| \le 1\}$; the credibility $\inf\{\delta : |\Gamma^\delta| = 0\}$; and the prediction $\Gamma^\delta$, where $1 - \delta$ is the confidence (in the case of TCM, $|\Gamma^\delta| \le 1$ and usually $|\Gamma^\delta| = 1$ when $1 - \delta$ is the confidence). Reporting the prediction, confidence, and credibility, as in Saunders et al. (1999) and Vovk et al. (1999), is analogous to reporting the observed level of significance (Cox and Hinkley, 1974, p. 66) in statistics.

Figure 3: Cumulative number of empty predictions made by the 1-Nearest Neighbour TCM at the significance levels 1%-5% on the USPS data set (there are no empty predictions for 1% and 2%).

This paper's result elaborates on Vovk (2002b), where it was shown that an optimal randomized TCM exists when the distribution $P$ generating the examples is known. In the rest of this paper we consider only randomized TCM, so we drop the adjective "randomized".

The two areas of mainstream machine learning that are most closely connected with this paper are PAC learning theory and Bayesian learning theory. Whereas we often use the rich arsenal of mathematical tools developed in these fields, they do not provide the same kind of guarantees (the right probability of error at each significance level, with errors at different trials independent) under unknown $P$; for more details, see Vovk (2002a) and references therein. Several papers (such as Rivest and Sloan, 1988; Freund et al., 2004) extend the standard PAC framework by allowing the prediction algorithm to abstain from making a prediction at some trials. Our results show that for any significance level $\delta$ there exists a prediction algorithm that: (a) makes a wrong prediction with relative frequency at most $\delta$; (b) has an optimal frequency of abstentions among the prediction algorithms that satisfy property (a) (for details, see the Remark at the end of §2). The paper by Freund et al. (2004) is especially close to the approach of this paper, defining a very natural TCM in the situation where a hypothesis class is given (the empirical log-ratio of Freund et al. (2004), taken with the appropriate sign, can be used as an individual strangeness measure, as defined in §3).

2. Main Result

In our learning protocol, Reality outputs pairs $(x_1, y_1), (x_2, y_2), \ldots$ called examples. Each example $(x_i, y_i)$ consists of an object $x_i$ and its label $y_i$. The objects are chosen from a measurable space $\mathbf{X}$ called the object space and the labels are elements of a measurable space $\mathbf{Y}$ called the label space. In this paper we assume that $\mathbf{Y}$ is finite (and endowed with the $\sigma$-algebra of all subsets).

The protocol includes variables $\mathrm{Err}^\delta_n$ (the total number of errors made up to and including trial $n$ at significance level $\delta$) and $\mathrm{err}^\delta_n$ (the binary variable showing whether an error is made at trial $n$). It also includes the analogous variables $\mathrm{Unc}^\delta_n$, $\mathrm{unc}^\delta_n$, $\mathrm{Emp}^\delta_n$, $\mathrm{emp}^\delta_n$ for uncertain and empty predictions:

$\mathrm{Err}^\delta_0 := 0$, $\mathrm{Unc}^\delta_0 := 0$, $\mathrm{Emp}^\delta_0 := 0$ for all $\delta \in (0,1)$;
FOR $n = 1, 2, \ldots$:
  Reality outputs $x_n \in \mathbf{X}$;
  Predictor outputs $\Gamma^\delta_n \subseteq \mathbf{Y}$ for all $\delta \in (0,1)$;
  Reality outputs $y_n \in \mathbf{Y}$;
  $\mathrm{err}^\delta_n := 1$ if $y_n \notin \Gamma^\delta_n$ and $0$ otherwise; $\mathrm{Err}^\delta_n := \mathrm{Err}^\delta_{n-1} + \mathrm{err}^\delta_n$ for all $\delta \in (0,1)$;
  $\mathrm{unc}^\delta_n := 1$ if $|\Gamma^\delta_n| > 1$ and $0$ otherwise; $\mathrm{Unc}^\delta_n := \mathrm{Unc}^\delta_{n-1} + \mathrm{unc}^\delta_n$ for all $\delta \in (0,1)$;
  $\mathrm{emp}^\delta_n := 1$ if $|\Gamma^\delta_n| = 0$ and $0$ otherwise; $\mathrm{Emp}^\delta_n := \mathrm{Emp}^\delta_{n-1} + \mathrm{emp}^\delta_n$ for all $\delta \in (0,1)$
ENDFOR.

We will use the notation $\mathbf{Z} := \mathbf{X} \times \mathbf{Y}$ for the example space; $\Gamma^\delta_n$ will be called the prediction region (or just prediction). We will assume that each example $z_n = (x_n, y_n)$, $n = 1, 2, \ldots$, is output according to a probability distribution $P$ in $\mathbf{Z}$ and the examples are independent of each other (so the sequence $z_1 z_2 \ldots$ is output by the power distribution $P^\infty$). This is Reality's randomized strategy.

A region predictor is a measurable function
$$\Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x_n, \tau_n), \qquad (1)$$
where $\delta \in (0,1)$, $n = 1, 2, \ldots$, the $(x_i, y_i) \in \mathbf{Z}$, $i = 1, \ldots, n-1$, are examples, $x_n \in \mathbf{X}$ is an object, and $\tau_i \in [0,1]$ ($i = 1, \ldots, n$), which satisfies
$$\Gamma^{\delta_1}(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x_n, \tau_n) \supseteq \Gamma^{\delta_2}(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x_n, \tau_n)$$
whenever $\delta_1 \le \delta_2$. The measurability of (1) means that for each $n$ the set
$$\left\{(\delta, x_1, \tau_1, y_1, \ldots, x_n, \tau_n, y_n) : y_n \in \Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x_n, \tau_n)\right\} \subseteq (0,1) \times (\mathbf{X} \times [0,1] \times \mathbf{Y})^n$$
is measurable. Since we are interested in prediction with confidence, the region predictor (1) is given an extra input $\delta \in (0,1)$, which we call the significance level (typically it is close to 0, standard values being 1% and 5%); the complementary value $1 - \delta$ is called the confidence level. We will always assume that the $\tau_n$ are independent random variables uniformly distributed in $[0,1]$. This makes a region predictor a family (indexed by $\delta \in (0,1)$) of Predictor's randomized strategies.

We will often use the notation $\mathrm{err}^\delta_n$, $\mathrm{unc}^\delta_n$, etc., in the case where Reality and Predictor are using given randomized strategies. For example, $\mathrm{err}^\delta_n(P^\infty, \Gamma)$ is the random variable equal to 0 if Predictor is right at trial $n$ and at significance level $\delta$ and equal to 1 otherwise. It is always assumed that the random numbers $\tau_n$ used by $\Gamma$ and the random examples $z_n$ chosen by Reality are independent.

We say that a region predictor $\Gamma$ is (conservatively) well-calibrated for a probability distribution $P$ in $\mathbf{Z}$ and a significance level $\delta \in (0,1)$ if
$$\limsup_{n\to\infty} \frac{\mathrm{Err}^\delta_n(P^\infty, \Gamma)}{n} \le \delta \quad \text{a.s.}$$
We say (as in Vovk, 2002b) that $\Gamma$ is optimal for $P$ and $\delta$ if, for any region predictor $\tilde\Gamma$ which is well-calibrated for $P$ and $\delta$,
$$\limsup_{n\to\infty} \frac{\mathrm{Unc}^\delta_n(P^\infty, \tilde\Gamma)}{n} \ge \liminf_{n\to\infty} \frac{\mathrm{Unc}^\delta_n(P^\infty, \Gamma)}{n} \quad \text{a.s.} \qquad (2)$$
(It is natural to assume in this and other similar definitions that the random numbers used by $\Gamma$ and $\tilde\Gamma$ are independent, but this assumption is not needed for our mathematical results and we do not make it.) Of course, the definition of optimality is natural only for well-calibrated $\Gamma$. A region predictor $\Gamma$ is universal well-calibrated if:

- it is well-calibrated for any $P$ and $\delta$;
- it is optimal for any $P$ and $\delta$;
- for any $P$, any $\delta$, and any region predictor $\tilde\Gamma$ which is well-calibrated and optimal for $P$ and $\delta$,
$$\liminf_{n\to\infty} \frac{\mathrm{Emp}^\delta_n(P^\infty, \Gamma)}{n} \ge \limsup_{n\to\infty} \frac{\mathrm{Emp}^\delta_n(P^\infty, \tilde\Gamma)}{n} \quad \text{a.s.}$$

Recall that a measurable space $\mathbf{X}$ is Borel if it is isomorphic to a measurable subset of the interval $[0,1]$. The class of Borel spaces is very rich; for example, all Polish spaces (such as the finite-dimensional Euclidean spaces $\mathbb{R}^n$, $\mathbb{R}^\infty$, and the functional spaces $C$ and $D$) are Borel.

Theorem 1. Suppose the object space $\mathbf{X}$ is Borel. There exists a universal well-calibrated region predictor.

This is the main result of the paper. In §3 we construct a universal well-calibrated region predictor (processing example $n$ in time $O(\log n)$) and in §4 outline the idea of the proof that it indeed satisfies the required properties. Technical details will be given in §5.

Remark. The protocol of Rivest and Sloan (1988) and Freund et al. (2004) is in fact a restriction of our protocol, in which Predictor is only allowed to output a one-element set or the whole of $\mathbf{Y}$; the latter is interpreted as abstention. (And in the situation where the numbers of errors and uncertain predictions are of primary interest, as in this paper, the difference between these two protocols is not significant.) The universal well-calibrated region predictor can be adapted to the restricted protocol by replacing an uncertain prediction with $\mathbf{Y}$ and replacing an empty prediction with a randomly chosen label. In this way we obtain a prediction algorithm in the restricted protocol which is well-calibrated and has an optimal frequency of abstentions, in the sense of (2), among the well-calibrated algorithms.

3. Construction of a Universal Well-Calibrated Region Predictor

In this section we first define the general notion of Transductive Confidence Machine, and then we specialize it using a nearest neighbours procedure to obtain a universal well-calibrated region predictor.

3.1 Preliminaries

If $\tau$ is a number in $[0,1]$, we split it into two numbers $\tau', \tau'' \in [0,1]$ as follows: if the binary expansion of $\tau$ is $0.a_1 a_2 \ldots$ (redefine the binary expansion of 1 to be $0.11\ldots$), set $\tau' := 0.a_1 a_3 a_5 \ldots$ and $\tau'' := 0.a_2 a_4 a_6 \ldots$. If $\tau$ is distributed uniformly in $[0,1]$, then both $\tau'$ and $\tau''$ are, and they are independent of each other.

We will often apply our procedures (e.g., the individual strangeness measure in §3.2 and the Nearest Neighbours rule in §3.3) not to the original objects $x \in \mathbf{X}$ but to extended objects $(x, \sigma) \in \bar{\mathbf{X}} := \mathbf{X} \times [0,1]$, where $x$ is complemented by a random number $\sigma$ (to be extracted from one of the $\tau_n$). In other words, along with examples $(x, y)$ we will also consider extended examples $(x, \sigma, y) \in \bar{\mathbf{Z}} := \mathbf{X} \times [0,1] \times \mathbf{Y}$.

Let us set $\mathbf{X} := [0,1]$; we can do this without loss of generality since $\mathbf{X}$ is Borel. This makes the extended object space $\bar{\mathbf{X}} = [0,1]^2$ a linearly ordered set with the lexicographic order: $(x_1, \sigma_1) \le (x_2, \sigma_2)$ means that either $x_1 = x_2$ and $\sigma_1 \le \sigma_2$ or $x_1 < x_2$. We say that $(x_1, \sigma_1)$ is nearer to $(x_3, \sigma_3)$ than $(x_2, \sigma_2)$ is if
$$|x_1 - x_3, \sigma_1 - \sigma_3| \le |x_2 - x_3, \sigma_2 - \sigma_3|, \qquad (3)$$
where
$$|x, \sigma| := \begin{cases} (x, \sigma) & \text{if } (x, \sigma) \ge (0, 0) \\ (-x, -\sigma) & \text{otherwise.} \end{cases} \qquad (4)$$
The value $|x_1 - x_2, \sigma_1 - \sigma_2|$ plays the role of the distance between the extended objects $(x_1, \sigma_1)$ and $(x_2, \sigma_2)$. Despite such distances being two-dimensional, they are still always comparable using the lexicographic order.

Our construction will be based on the Nearest Neighbours algorithm, which is known to be strongly universally consistent in the traditional theory of pattern recognition (see, e.g., Devroye et al., 1996, Chapter 11); the random components $\sigma$ are needed for tie-breaking.

3.2 Transductive Confidence Machines

The Transductive Confidence Machine, or TCM, is a way of transition from what we call an individual strangeness measure to a region predictor. A family of measurable functions $\{A_n : n = 1, 2, \ldots\}$, where $A_n : \mathbf{Z}^n \to \mathbb{R}^n$ for all $n$, is called an individual strangeness measure if, for any $n = 1, 2, \ldots$, each $\alpha_i$ in
$$A_n : (w_1, \ldots, w_n) \mapsto (\alpha_1, \ldots, \alpha_n) \qquad (5)$$
is determined by $w_i$ and the multiset $\langle w_1, \ldots, w_n \rangle$. (The difference between a multiset $\langle w_1, \ldots, w_n \rangle$ and a set $\{w_1, \ldots, w_n\}$ is that the former can contain several copies of the same element.)

The TCM associated with an individual strangeness measure $A_n$ is the following region predictor $\Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x_n, \tau_n)$: at any trial $n$ and for any label $y \in \mathbf{Y}$, define
$$(\alpha_1, \ldots, \alpha_n) := A_n\left((x_1, \tau'_1, y_1), \ldots, (x_{n-1}, \tau'_{n-1}, y_{n-1}), (x_n, \tau'_n, y)\right),$$
and include $y$ in $\Gamma^\delta$ if and only if
$$\tau''_n \ge \frac{n\delta - \#\{i = 1, \ldots, n : \alpha_i > \alpha_n\}}{\#\{i = 1, \ldots, n : \alpha_i = \alpha_n\}} \qquad (6)$$
(in particular, include $y$ in $\Gamma^\delta$ if $\#\{i = 1, \ldots, n : \alpha_i > \alpha_n\} \ge n\delta$ and do not include $y$ in $\Gamma^\delta$ if $\#\{i = 1, \ldots, n : \alpha_i \ge \alpha_n\} < n\delta$). A TCM is the TCM associated with some individual strangeness measure. It was shown in Vovk (2002a) that:

Proposition 2. Every TCM is well-calibrated for every $P$ and $\delta$.

The definition of TCM can be illustrated by the following simple example of an individual strangeness measure, the one used in producing Figures 1-3: the mapping (5) can be defined, in the spirit of the 1-Nearest Neighbour algorithm, as (assuming the objects are vectors in a Euclidean space)
$$\alpha_i := \frac{\min_{j \ne i : y_j = y_i} d(x_i, x_j)}{\min_{j \ne i : y_j \ne y_i} d(x_i, x_j)},$$
where $d$ is the Euclidean distance (i.e., an object is considered strange if it is in the middle of objects labelled in a different way and is far from the objects labelled in the same way).

3.3 Universal TCM

Fix a monotonically non-decreasing sequence of integer numbers $K_n$, $n = 1, 2, \ldots$, such that
$$K_n \to \infty, \qquad K_n = o\left(\sqrt{n / \ln n}\right) \qquad (7)$$
as $n \to \infty$. The Nearest Neighbours TCM is defined as follows. Let $w_1, \ldots, w_n$ be a sequence of extended examples, $w_i = (x_i, \sigma_i, y_i)$. To define the corresponding $\alpha$s, as in (5), we first define Nearest Neighbours approximations $P^{\ne}_n(y \mid x_i, \sigma_i)$ to the true (but unknown) conditional probabilities $P(y \mid x_i)$: for every extended example $(x_i, \sigma_i, y_i)$ in the sequence,
$$P^{\ne}_n(y \mid x_i, \sigma_i) := N^{\ne}(x_i, \sigma_i, y) / K_n, \qquad (8)$$
where $N^{\ne}(x_i, \sigma_i, y)$ is the number of $j = 1, \ldots, n$ such that $y_j = y$ and $(x_j, \sigma_j)$ is one of the $K_n$ nearest neighbours, in the sense of (3), of $(x_i, \sigma_i)$ in the sequence
$$\left((x_1, \sigma_1), \ldots, (x_{i-1}, \sigma_{i-1}), (x_{i+1}, \sigma_{i+1}), \ldots, (x_n, \sigma_n)\right).$$
(The upper index $\ne$ reminds us of the fact that $(x_i, \sigma_i)$ is not counted as one of its own nearest neighbours in this definition.) If $K_n \ge n$ or $K_n \le 0$, this definition does not work, so set, e.g., $P^{\ne}_n(y \mid x_i, \sigma_i) := 1/|\mathbf{Y}|$ for all $y$ and $i$ (this particular convention is not essential since, by (7), $0 < K_n < n$ from some $n$ on). If the expression "$K_n$ nearest neighbours" is not defined because of distance ties, we again set $P^{\ne}_n(y \mid x_i, \sigma_i) := 1/|\mathbf{Y}|$ for all $y$ and $i$ (this convention is not essential since distance ties happen with probability zero). Define the empirical predictability function $f^{\ne}_n$ by
$$f^{\ne}_n(x_i, \sigma_i) := \max_{y \in \mathbf{Y}} P^{\ne}_n(y \mid x_i, \sigma_i). \qquad (9)$$
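As an illustrative aside, the inclusion rule (6) together with the simple 1-Nearest Neighbour strangeness measure of §3.2 can be sketched in a few lines of Python. This is a toy sketch under our own conventions, not the paper's $O(\log n)$ implementation: objects are one-dimensional numbers with the absolute-difference distance, the helper names are our own, and rule (6) is applied in the equivalent smoothed form "include $y$ iff $\#\{i : \alpha_i > \alpha_n\} + \tau'' \cdot \#\{i : \alpha_i = \alpha_n\} \ge n\delta$":

```python
import math

def strangeness_1nn(xs, ys):
    # 1-NN strangeness ratio from Section 3.2: distance to the nearest
    # same-label example divided by distance to the nearest other-label
    # example (larger value = stranger example).
    alphas = []
    for i in range(len(xs)):
        same = min((abs(xs[i] - xs[j]) for j in range(len(xs))
                    if j != i and ys[j] == ys[i]), default=math.inf)
        diff = min((abs(xs[i] - xs[j]) for j in range(len(xs))
                    if j != i and ys[j] != ys[i]), default=math.inf)
        alphas.append(same / diff)
    return alphas

def tcm_region(train, x_new, tau2, delta, labels):
    # Smoothed TCM region: try each candidate label y for the new object,
    # score the extended bag, and keep y when the randomized rank of the
    # new example's strangeness is compatible with significance level delta.
    region = set()
    for y in labels:
        xs = [x for x, _ in train] + [x_new]
        ys = [lab for _, lab in train] + [y]
        a = strangeness_1nn(xs, ys)
        n = len(a)
        gt = sum(1 for ai in a if ai > a[-1])    # strictly stranger examples
        eq = sum(1 for ai in a if ai == a[-1])   # ties, shared out by tau2
        if gt + tau2 * eq >= n * delta:          # smoothed form of rule (6)
            region.add(y)
    return region
```

For example, with four training examples clustered by label, `tcm_region([(0.0, 'a'), (0.1, 'a'), (0.9, 'b'), (1.0, 'b')], 0.05, 0.5, 0.2, ['a', 'b'])` returns `{'a'}`: the candidate label `'b'` would make the new example by far the strangest point in the bag, so it is excluded at the 20% significance level.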
For each $(x_i, \sigma_i)$ fix some
$$\hat{y}_n(x_i, \sigma_i) \in \arg\max_y P^{\ne}_n(y \mid x_i, \sigma_i) \qquad (10)$$
(e.g., take the first element of $\arg\max_y P^{\ne}_n(y \mid x_i, \sigma_i)$ in a fixed ordering of $\mathbf{Y}$) and define the mapping (5) (where $w_i = (x_i, \sigma_i, y_i)$, $i = 1, \ldots, n$) setting
$$\alpha_i := \begin{cases} -f^{\ne}_n(x_i, \sigma_i) & \text{if } y_i = \hat{y}_n(x_i, \sigma_i) \\ f^{\ne}_n(x_i, \sigma_i) & \text{otherwise.} \end{cases} \qquad (11)$$
This completes the definition of the Nearest Neighbours TCM, which will later be shown to be universal.

Proposition 3. Let $D \subseteq (0,1)$ be finite. If $\mathbf{X} = [0,1]$ and $K_n \to \infty$ sufficiently slowly, the Nearest Neighbours TCM can be implemented for significance levels $\delta \in D$ so that the computations at trial $n$ are performed in time $O(\log n)$.

Proposition 3 assumes a computational model that allows operations (such as comparison) with real numbers. If $\mathbf{X}$ is an arbitrary Borel space, for this proposition to be applicable $\mathbf{X}$ should be embedded in $[0,1]$ first; e.g., if $\mathbf{X} \subseteq [0,1]^n$, an $x = (x_1, \ldots, x_n) \in \mathbf{X}$ can be represented as the number
$$0.x_{1,1} x_{2,1} \ldots x_{n,1} x_{1,2} x_{2,2} \ldots x_{n,2} \ldots \in [0,1],$$
where $0.x_{i,1} x_{i,2} \ldots$ is the binary expansion of $x_i$. We use the expression "can be implemented" in a wide sense, only requiring that the implementation should give the correct results almost surely.

4. Fine Details of Region Prediction

In this section we make first steps towards the proof of Theorem 1. Let $P$ be the true distribution in $\mathbf{Z}$ generating the examples. We denote by $P_X$ the marginal distribution of $P$ in $\mathbf{X}$ (i.e., $P_X(E) := P(E \times \mathbf{Y})$) and by $P_{Y|X}(y \mid x)$ the conditional probability that, for a random example $(X, Y)$ chosen from $P$, $Y = y$ provided $X = x$ (we fix arbitrarily a regular version of this conditional probability). We will often omit the lower indices $X$ and $Y|X$, and $P$ itself, from our notation. The predictability of an object $x \in \mathbf{X}$ is
$$f(x) := \max_{y \in \mathbf{Y}} P(y \mid x)$$
and the predictability distribution function is the function $F : [0,1] \to [0,1]$ defined by
$$F(\beta) := P\{x : f(x) \le \beta\}.$$
An example of such a function $F$ is given in Figure 4 (left), where the graph of $F$ is the thick line. The success curve $S_P$ of $P$ is defined by the equality
$$S_P(\delta) = \inf\left\{B \in [0,1] : \int_0^1 (F(\beta) - B)^+ \, d\beta \le \delta\right\}, \qquad (12)$$
where $t^+$ stands for $\max(t, 0)$; the function $S_P$ is also of the type $[0,1] \to [0,1]$. Geometrically, $S_P(\delta)$ is defined from the graph of $F$ as follows (see Figure 4, left; we often drop the lower index $P$): move the point $B$ from $A$ to $Z$ until the area of the curvilinear triangle $ABC$ becomes $\delta$ or $B$ reaches $Z$; the ordinate of $B$ is then $S(\delta)$.

Figure 4: The predictability distribution function $F$ and the success curve $S(\delta)$ (left); the complementary success curve $C(\delta)$ (right).

The complementary success curve $C_P$ of $P$ is defined by
$$C_P(\delta) = \sup\left\{B \in [0,1] : B + \int_0^1 (F(\beta) - B)^+ \, d\beta \le \delta\right\}, \qquad (13)$$
where $\sup \emptyset$ is interpreted as 0. Similarly to the case of $S(\delta)$, $C(\delta)$ is defined as the value such that the area of the part of the box $AZOD$ below the thick line in Figure 4 (right) is $\delta$ ($C(\delta) = 0$ if such a value does not exist). Define the critical significance level $\delta_0$ as
$$\delta_0 := \int_0^1 F(\beta) \, d\beta. \qquad (14)$$
It is clear that
$$\delta \le \delta_0 \implies \int_0^1 (F(\beta) - S(\delta))^+ \, d\beta = \delta \;\;\&\;\; C(\delta) = 0,$$
$$\delta \ge \delta_0 \implies S(\delta) = 0 \;\;\&\;\; C(\delta) + \int_0^1 (F(\beta) - C(\delta))^+ \, d\beta = \delta.$$
The following result is proved in Vovk (2002b).

Proposition 4. Let $P$ be a probability distribution in $\mathbf{Z}$ and $\delta \in (0,1)$ be a significance level. If a region predictor $\Gamma$ is well-calibrated for $P$ and $\delta$, then
$$\liminf_{n\to\infty} \frac{\mathrm{Unc}^\delta_n(P^\infty, \Gamma)}{n} \ge S_P(\delta) \quad \text{a.s.} \qquad (15)$$

In this paper we complement Proposition 4 with:

Proposition 5. Let $P$ be a probability distribution in $\mathbf{Z}$ and $\delta \in (0,1)$ be a significance level. If a region predictor $\Gamma$ is well-calibrated for $P$ and $\delta$ and satisfies
$$\limsup_{n\to\infty} \frac{\mathrm{Unc}^\delta_n(P^\infty, \Gamma)}{n} \le S_P(\delta) \quad \text{a.s.,} \qquad (16)$$
then
$$\limsup_{n\to\infty} \frac{\mathrm{Emp}^\delta_n(P^\infty, \Gamma)}{n} \le C_P(\delta) \quad \text{a.s.}$$

Theorem 1 immediately follows from Propositions 2, 4, 5 and the following proposition.

Proposition 6. Suppose $\mathbf{X}$ is Borel. The Nearest Neighbours TCM constructed in §3.3 satisfies, for any $P$ and any significance level $\delta$,
$$\limsup_{n\to\infty} \frac{\mathrm{Unc}^\delta_n(P^\infty, \Gamma)}{n} \le S_P(\delta) \quad \text{a.s.} \qquad (17)$$
and
$$\liminf_{n\to\infty} \frac{\mathrm{Emp}^\delta_n(P^\infty, \Gamma)}{n} \ge C_P(\delta) \quad \text{a.s.} \qquad (18)$$

5. Proofs

In this section we will assume that all extended objects $(x_i, \tau'_i) \in [0,1]^2$, where the $x_i$ are output by Reality and the $\tau_i$ are the random numbers used, are different and that all pairwise distances between them are also different (this is true with probability one, since the $\tau'_i$ are independent random numbers uniformly distributed in $[0,1]$).

5.1 Proof Sketch of Proposition 3

Without loss of generality we assume that $D$ contains only one significance level $\delta$, which will be omitted from our notation. Our computational model has an operation of splitting $\tau \in [0,1]$ into $\tau'$ and $\tau''$ (or is allowed to generate both $\tau'_n$ and $\tau''_n$ at every trial $n$). We will use two main data structures in our implementation of the Nearest Neighbours TCM:

- a red-black binary search tree (see, e.g., Cormen et al., 2001, Chapters 12-14; the only two operations on red-black trees we need in this paper are the query SEARCH and the modifying operation INSERT);
- a growing array of nonnegative integers indexed by $k \in \{-K_n, -K_n + 1, \ldots, K_n\}$ (where $n$ is the ordinal number of the example being processed).

Immediately after processing the $n$th extended example $(x_n, \tau_n, y_n)$ the contents of these data structures are as follows:

- The search tree contains $n$ vertices, corresponding to the extended examples $(x_i, \tau_i, y_i)$ seen so far. The key of vertex $i$ is the extended object $(x_i, \tau'_i) \in [0,1]^2$; the linear order on the keys is the lexicographic order. The other information contained in vertex $i$ is the random number $\tau''_i$, the label $y_i$, the set $\{P^{\ne}_n(y \mid x_i, \tau'_i) : y \in \mathbf{Y}\}$ of conditional probability estimates (8), the pointer to the following vertex (i.e., the vertex that has the smallest key greater than $(x_i, \tau'_i)$; if there is no greater key, the pointer is NIL), and the pointer to the previous vertex (i.e., the vertex that has the greatest key smaller than $(x_i, \tau'_i)$; if $(x_i, \tau'_i)$ is the smallest key, the pointer is NIL).
- The array contains the numbers $N(k) := \#\{i = 1, \ldots, n : \alpha_i = k/K_n\}$ (the $\alpha_i$ are defined by (11) with $\sigma_i := \tau'_i$).

Notice that the information contained in vertex $i$ of the search tree is sufficient to find $\hat{y}_n(x_i, \tau'_i)$ and $\alpha_i$ in time $O(1)$. We will say that an extended object $(x_j, \tau'_j)$ is in the vicinity of an extended object $(x_i, \tau'_i)$, $i \ne j$, if there are fewer than $K_n$ extended objects $(x_k, \tau'_k)$ (strictly) between $(x_i, \tau'_i)$ and $(x_j, \tau'_j)$.

When a new object $x_n$ becomes known, the algorithm does the following:

- Generates $\tau'_n$ and $\tau''_n$.
- Locates the successor and predecessor of $(x_n, \tau'_n)$ in the search tree (using the query SEARCH and the pointers to the following and previous vertices); this requires time $O(\log n)$.
- Computes the estimated conditional probabilities $\{P^{\ne}_n(y \mid x_n, \tau'_n) : y \in \mathbf{Y}\}$; this also gives $\hat{y}_n(x_n, \tau'_n)$. This involves scanning the vicinity of $(x_n, \tau'_n)$ for the $K_n$ nearest neighbours of $(x_n, \tau'_n)$, which can be done in time $O(K_n)$: the $K_n$ nearest neighbours can be extracted from the vicinity of $(x_n, \tau'_n)$ sorted in the order of increasing distances from $(x_n, \tau'_n)$; since initially the vicinity consists of two sorted lists (to the left and to the right of $(x_n, \tau'_n)$), the procedure MERGE used in the merge-sort algorithm (see, e.g., Cormen et al., 2001, §2.3.1) will sort the whole vicinity in time $O(K_n)$. Therefore, the required time is $O(K_n) = O(\log n)$.
- For each $y \in \mathbf{Y}$ looks at what happens if the $n$th example is $(x_n, \tau_n, y_n) = (x_n, \tau_n, y)$: computes $\alpha_n$ and updates (if necessary) the $\alpha_i$ for $(x_i, \tau'_i)$ in the vicinity of $(x_n, \tau'_n)$; using the array and $\tau''_n$, finds whether $y \in \Gamma_n$. This requires time $O(K_n^2) = O(\log n)$, since there are $O(K_n)$ $\alpha_i$'s in the vicinity of $(x_n, \tau'_n)$ and each of them can be computed in time $O(K_n)$.
- Outputs the prediction region $\Gamma_n$ (time $O(1)$).

When the label $y_n$ arrives, the algorithm:

- Inserts the new vertex $(x_n, \tau'_n, \tau''_n, y_n, \{P^{\ne}_n(y \mid x_n, \tau'_n) : y \in \mathbf{Y}\})$ in the search tree, repairs the pointers to the following and previous elements for $(x_n, \tau'_n)$'s left and right neighbours, initializes the pointers to the following and previous elements for $(x_n, \tau'_n)$ itself, and rebalances the tree (time $O(\log n)$).
- Updates (if necessary) the conditional probabilities $\{P^{\ne}_{n-1}(y \mid x_i, \tau'_i) : y \in \mathbf{Y}\} \mapsto \{P^{\ne}_n(y \mid x_i, \tau'_i) : y \in \mathbf{Y}\}$ for the $2K_n$ existing vertices $(x_i, \tau'_i)$ in the vicinity of $(x_n, \tau'_n)$; this requires time $O(K_n^2) = O(\log n)$. The conditional probabilities for the other $(x_i, \tau'_i)$, $i = 1, \ldots, n-1$, do not change.
- Updates the array, changing $N(K_n \alpha_i)$ for the $(x_i, \tau'_i) \ne (x_n, \tau'_n)$ in the vicinity of $(x_n, \tau'_n)$, for both the old and new values of $\alpha_i$, and changing $N(K_n \alpha_n)$ (time $O(K_n) = O(\log n)$).

In conclusion we discuss how to do the updates required when $K_n$ changes. At the critical trials $n$ when $K_n$ changes, the array and the estimated conditional probabilities $P^{\ne}_n(y \mid x_i, \tau'_i)$ have to be recomputed, which, if done naively, would require time $\Theta(n K_n)$. The assumption we have made about $K_n$ so far is that $K_n = O(\sqrt{\log n})$. We now also assume that $K_n$ is monotonically non-decreasing and
$$\#\{n : K_n \le c\} = O(\#\{n : K_n = c\}) \qquad (19)$$
as $c \to \infty$. This is the full explication of "$K_n \to \infty$ sufficiently slowly" in the statement of the proposition, as used in this proof.

An epoch is defined to be a maximal sequence of $n$s with the same $K_n$. Since the changes that need to be done when a new epoch starts are substantial, they will be spread over the whole preceding epoch; we will only discuss updating the estimated conditional probabilities $P^{\ne}_n(y \mid x_i, \tau'_i)$: the array is treated similarly. An epoch is odd if the corresponding $K_n$ is odd and even if $K_n$ is even. At every step in an epoch we prepare the ground for the next epoch. By the end of the epoch $n = A+1, A+2, \ldots, B$ we need to change $B$ sets $\{P^{\ne}_n(y \mid x_i, \tau'_i) : y \in \mathbf{Y}\}$ in $B - A$ steps (the duration of the epoch). Therefore, each vertex of the search tree should contain not only $\{P^{\ne}_n(y \mid x_i, \tau'_i)\}$ for the current epoch but also $\{P^{\ne}_n(y \mid x_i, \tau'_i)\}$ for the next epoch (two structures for holding $\{P^{\ne}_n(y \mid x_i, \tau'_i)\}$ will suffice, one for even epochs and one for odd epochs). Our assumption of the slow growth of $K_n$, as in (19), implies that $B = O(B - A)$. This means that at each step $O(1)$ sets $\{P^{\ne}_n(y \mid x_i, \tau'_i)\}$ for the next epoch should be added. This will take time $O(K_n) = O(\log n)$. As soon as a set $\{P^{\ne}_n(y \mid x_i, \tau'_i) : y \in \mathbf{Y}\}$ for the next epoch is added at some trial, both sets (for the current and next epoch) will have to be updated for each new example.

5.2 Proof Sketch of Proposition 5

The proof of Proposition 5 is similar to (but more complicated than) the proof of Theorems 1 and 1r in Vovk (2002b); this proof sketch can be made rigorous using the Neyman-Pearson lemma, as in Vovk (2002b). We will use the notations $g'_{\mathrm{left}}$ and $g'_{\mathrm{right}}$ for the left and right derivatives, respectively, of a function $g$. The following lemma parallels Lemma 2 in Vovk (2002b), which deals with $S(\delta)$.

Lemma 7. The complementary success curve $C : [0,1] \to [0,1]$ always satisfies these properties:
1. There is a point $\delta_0 \in [0,1]$ (namely, the critical significance level) such that $C(\delta) = 0$ for $\delta \le \delta_0$ and $C(\delta)$ is concave for $\delta \ge \delta_0$.
2. $C'_{\mathrm{right}}(\delta_0) \le \infty$ and $C'_{\mathrm{left}}(1) \ge 1$; therefore, for $\delta \in (\delta_0, 1)$, $1 \le C'_{\mathrm{right}}(\delta) \le C'_{\mathrm{left}}(\delta) \le \infty$ and the function $C(\delta)$ is increasing.
3. $C(\delta)$ is continuous at $\delta = \delta_0$; therefore, it is continuous everywhere in $[0,1]$.
If a function $C : [0,1] \to [0,1]$ satisfies these properties, there exist a measurable space $\mathbf{X}$, a finite set $\mathbf{Y}$, and a probability distribution $P$ in $\mathbf{X} \times \mathbf{Y}$ for which $C$ is the complementary success curve.

Proof sketch. The statement of the lemma follows from the fact that the complementary success curve $C$ can be obtained from the predictability distribution function $F$ using these steps (labelling the horizontal and vertical axes as $x$ and $y$ respectively):
1. Invert $F$: $F_1 := F^{-1}$.
2. Integrate $F_1$: $F_2(x) := \int_0^x F_1(t) \, dt$.
3. Increase $F_2$: $F_3(x) := F_2(x) + \delta_0$, where $\delta_0 := \int_0^1 F(x) \, dx$.
4. Invert $F_3$: $F_4 := F_3^{-1}$.
It can be shown that $C = F_4$, if we define $g^{-1}(y) := \sup\{x : g(x) \le y\}$ for non-decreasing $g$ (so that $g^{-1}$ is continuous on the right).

Complement the protocol of §2, in which Reality plays $P^\infty$ and Predictor plays $\Gamma$, with the following variables:
$$\overline{\mathrm{err}}_n := (P \times U)\left\{(x, y, \tau) : y \notin \Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x, \tau)\right\},$$
$$\overline{\mathrm{unc}}_n := (P_X \times U)\left\{(x, \tau) : |\Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x, \tau)| > 1\right\},$$
$$\overline{\mathrm{emp}}_n := (P_X \times U)\left\{(x, \tau) : |\Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x, \tau)| = 0\right\},$$
$\delta$ being fixed and $U$ standing for the uniform distribution in $[0,1]$, and
$$\overline{\mathrm{Err}}_n := \sum_{i=1}^n \overline{\mathrm{err}}_i, \quad \overline{\mathrm{Unc}}_n := \sum_{i=1}^n \overline{\mathrm{unc}}_i, \quad \overline{\mathrm{Emp}}_n := \sum_{i=1}^n \overline{\mathrm{emp}}_i.$$
By the martingale strong law of large numbers, to prove the proposition it suffices to consider only these predictable versions of $\mathrm{Err}_n$, $\mathrm{Unc}_n$, and $\mathrm{Emp}_n$: indeed, since $\mathrm{Err}_n - \overline{\mathrm{Err}}_n$, $\mathrm{Unc}_n - \overline{\mathrm{Unc}}_n$, and $\mathrm{Emp}_n - \overline{\mathrm{Emp}}_n$ are martingales (with increments bounded by 1 in absolute value) with respect to the filtration $\mathcal{F}_n$, $n = 0, 1, \ldots$, where each $\mathcal{F}_n$ is generated by $(x_1, \tau_1, y_1), \ldots, (x_n, \tau_n, y_n)$, we have
$$\lim_{n\to\infty} \frac{\mathrm{Err}_n - \overline{\mathrm{Err}}_n}{n} = 0 \;\text{a.s.}, \quad \lim_{n\to\infty} \frac{\mathrm{Unc}_n - \overline{\mathrm{Unc}}_n}{n} = 0 \;\text{a.s.}, \quad \lim_{n\to\infty} \frac{\mathrm{Emp}_n - \overline{\mathrm{Emp}}_n}{n} = 0 \;\text{a.s.}$$
(See, e.g., Shiryaev, 1996, Theorem VII.5.4.)

Without loss of generality we can assume that Predictor's move $\Gamma_n$ at trial $n$ is $\{\hat{y}(x_n)\}$ (where $x \mapsto \hat{y}(x) \in \arg\max_y P(y \mid x)$ is a fixed choice function), or the empty set $\emptyset$, or the whole label space $\mathbf{Y}$. Furthermore, we can assume that, at every trial, the predictions are certain for the new objects above the straight line $BC$ in Figure 5 (more formally, predictions are certain for new extended objects $(x, \tau)$ satisfying $\bar{F}(x, \tau) := F(f(x)-) + \tau\,(F(f(x)+) - F(f(x)-)) \ge S(\overline{\mathrm{err}}_n - \overline{\mathrm{emp}}_n)$; intuitively, considering extended objects makes the vertical axis infinitely divisible) and that the predictions are empty for the objects below the straight line $DG$ in Figure 5 (indeed, predictions of this kind are admissible in the sense that we cannot improve $\overline{\mathrm{unc}}_n$ and $\overline{\mathrm{emp}}_n$ simultaneously, and all admissible predictions are equivalent to predictions of this kind; a formal argument for the case where the $\overline{\mathrm{emp}}_n$ are omitted is given in Vovk, 2002b).

Figure 5: An admissible region predictor. The thick line is the predictability distribution function $F$; the area of the curvilinear triangle $ABC$ is $\overline{\mathrm{err}}_n - \overline{\mathrm{emp}}_n$; the area of the rectangle $DZOG$ is $\overline{\mathrm{emp}}_n$; the (non-negative) area of the curvilinear quadrangle $BDEC$ is denoted $e_n$.

It is clear that for the region predictor to satisfy (16) it must hold that
$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n (e_i \wedge \overline{\mathrm{emp}}_i) = 0$$
(otherwise $\overline{\mathrm{Unc}}_n$ can be decreased substantially, which contradicts (15); the $e_i$ are defined in the caption of Figure 5), and so we can assume, without loss of generality, that either $e_n = 0$ or $\overline{\mathrm{emp}}_n = 0$ at every trial $n$, i.e., that
$$\overline{\mathrm{unc}}_n = S(\overline{\mathrm{err}}_n), \qquad \overline{\mathrm{emp}}_n = C(\overline{\mathrm{err}}_n)$$
at every trial. Let us check that to achieve (16) the region predictor must satisfy
$$\delta \le \delta_0 \implies \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n (\overline{\mathrm{err}}_i - \delta_0)^+ = 0 \qquad (20)$$
$$\delta \ge \delta_0 \implies \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n (\delta_0 - \overline{\mathrm{err}}_i)^+ = 0, \qquad (21)$$
where the convergence is, as usual, almost certain. It was shown in Vovk (2002b, Lemma 2) that the success curve $S$ is convex, non-increasing, continuous, and has slope at most $-1$ before it hits the $x$ axis at $\delta = \delta_0$. The second implication, (21), now immediately follows from the fact that, under $\delta \ge \delta_0$ and (16),
$$0 = \limsup_{n\to\infty} \frac{\overline{\mathrm{Unc}}_n}{n} = \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n S(\overline{\mathrm{err}}_i) \ge \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n (\delta_0 - \overline{\mathrm{err}}_i)^+.$$
The first implication, (20), can be extracted from the chain
$$\frac{\overline{\mathrm{Unc}}_n}{n} = \frac{1}{n}\sum_{i=1}^n \overline{\mathrm{unc}}_i = \frac{1}{n}\sum_{i=1}^n S(\overline{\mathrm{err}}_i) \ge S\left(\frac{1}{n}\sum_{i=1}^n \overline{\mathrm{err}}_i\right) = S\left(\frac{\overline{\mathrm{Err}}_n}{n}\right) \ge S(\delta) - \epsilon \qquad (22)$$
(with the last inequality holding almost surely for an arbitrary $\epsilon > 0$ from some $n$ on) used by Vovk (2002b, in the proof of Theorems 1 and 1r). Indeed, it can be seen from (22) that, assuming the predictor is well-calibrated and optimal and $\delta \le \delta_0$, $\overline{\mathrm{Err}}_n / n \to \delta$ a.s. and, therefore,
$$S(\delta) \ge \limsup_{n\to\infty} \frac{\overline{\mathrm{Unc}}_n}{n} = \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n S(\overline{\mathrm{err}}_i) = \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n S(\overline{\mathrm{err}}_i \wedge \delta_0) \ge \limsup_{n\to\infty} S\left(\frac{1}{n}\sum_{i=1}^n (\overline{\mathrm{err}}_i \wedge \delta_0)\right)$$
$$= \limsup_{n\to\infty} S\left(\frac{\overline{\mathrm{Err}}_n}{n} - \frac{1}{n}\sum_{i=1}^n (\overline{\mathrm{err}}_i - \delta_0)^+\right) = \limsup_{n\to\infty} S\left(\delta - \frac{1}{n}\sum_{i=1}^n (\overline{\mathrm{err}}_i - \delta_0)^+\right) = S\left(\delta - \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n (\overline{\mathrm{err}}_i - \delta_0)^+\right)$$
almost surely. This proves (20).

Using (20), (21), and the fact that the complementary success curve $C$ is concave, increasing, and (uniformly) continuous for $\delta \ge \delta_0$ (see Lemma 7), we obtain: if $\delta \le \delta_0$,
$$\frac{\overline{\mathrm{Emp}}_n}{n} = \frac{1}{n}\sum_{i=1}^n \overline{\mathrm{emp}}_i = \frac{1}{n}\sum_{i=1}^n C(\overline{\mathrm{err}}_i) \le \frac{1}{n}\, C'(\delta_0) \sum_{i=1}^n (\overline{\mathrm{err}}_i - \delta_0)^+ \to 0 \quad (n \to \infty);$$
if $\delta \ge \delta_0$,
$$\frac{\overline{\mathrm{Emp}}_n}{n} = \frac{1}{n}\sum_{i=1}^n C(\overline{\mathrm{err}}_i) = \frac{1}{n}\sum_{i=1}^n C(\overline{\mathrm{err}}_i \vee \delta_0) \le C\left(\frac{1}{n}\sum_{i=1}^n (\overline{\mathrm{err}}_i \vee \delta_0)\right) = C\left(\frac{1}{n}\sum_{i=1}^n \overline{\mathrm{err}}_i + \frac{1}{n}\sum_{i=1}^n (\delta_0 - \overline{\mathrm{err}}_i)^+\right)$$
$$\le C\left(\frac{1}{n}\sum_{i=1}^n \overline{\mathrm{err}}_i\right) + o(1) \le C(\delta) + \epsilon,$$
the last inequality holding almost surely for an arbitrary $\epsilon > 0$ from some $n$ on, $\delta$ being the significance level used.
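As a side illustration of the quantities manipulated in this proof sketch, the success curve (12) and the complementary success curve (13) can be approximated numerically from a predictability distribution function $F$. The sketch below is ours, not part of the paper's construction: the helper names are of our choosing, the integral is a midpoint rule, and the inf/sup are found by a simple grid search over $B$:

```python
def _excess(F, B, grid=1000):
    # Midpoint-rule approximation of the integral of (F(b) - B)^+ over [0, 1].
    return sum(max(F((i + 0.5) / grid) - B, 0.0) for i in range(grid)) / grid

def success_curve(F, delta, step=0.001):
    # S(delta) = inf{B in [0,1]: integral (F(b) - B)^+ db <= delta}   (eq. 12)
    # The excess area is non-increasing in B, so the first admissible B works.
    B = 0.0
    while B <= 1.0 and _excess(F, B) > delta:
        B += step
    return min(B, 1.0)

def complementary_success_curve(F, delta, step=0.001):
    # C(delta) = sup{B in [0,1]: B + integral (F(b) - B)^+ db <= delta},
    # with sup of the empty set read as 0                              (eq. 13)
    best, B = 0.0, 0.0
    while B <= 1.0:
        if B + _excess(F, B) <= delta:
            best = B
        B += step
    return best
```

For the toy choice $F(\beta) = \beta$ (so $\delta_0 = 1/2$), the defining conditions can be solved in closed form: $S(\delta) = 1 - \sqrt{2\delta}$ for $\delta \le 1/2$ and $C(\delta) = \sqrt{2\delta - 1}$ for $\delta \ge 1/2$, which the grid approximations reproduce to within the step size; in particular, $S(\delta)$ vanishes and $C(\delta)$ starts growing exactly at the critical significance level $\delta_0$.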
UNIVERSALWELL-CALIBRATEDCLASSIFICATION5.3ProofSketchofProposition6LetusrstmodifyandextendthenotationP6=n(yjxi;si)introducedin(8).Considerthesequenceofextendedexampleswi=(xi;t0;yi),i=1;:::;n((xi;yi)aretherstnexampleschosenbyRealityandtiaretherandomnumbersusedbyPredictor).WedenetheNearestNeighboursapproximationsPn(yjx;s)totheconditionalprobabilitiesP(yjx)asfollows:forevery(x;s;y)2Z,Pn(yjx;s):=N(x;s;y)=Kn;(23)whereN(x;s;y)isthenumberofi=1;:::;nsuchthat(xi;t0i)isamongtheKnnearestneighboursof(x;s)andyi=y(thistime(xi;t0)isnotpreventedfrombeingcountedasoneoftheKnnearestneighboursof(x;s)if(xi;t0)=(x;s)).Wedenetheempiricalpredictabilityfunctionfnbyfn(x;s):=maxy2YPn(yjx;s):(24)Theproofwillbebasedonthefollowingversionofawell-knownfundamentalresult.Lemma8SupposeKn!¥,Kn=o(n),andY=f0;1g.Foranye0andlargeenoughn,PZjP(1jx) Pn(1jx;s)jPX(dx)U(ds)ee ne2=40;wheretheoutermostprobabilitydistributionP(essentially(PU)¥)generatestheextendedexam-ples(xi;ti;yi),whichdeterminetheempiricaldistributionsPn.ProofThisisalmostaspecialcaseofDevroyeetal.'s(1994)Theorem1.Thereis,however,animportantdifferencebetweenthewaywebreakdistancetiesandthewayDevroyeetal.(1994)dothis.Inthatwork,insteadofour(3),(jx1 x3j;js1 s3j)(jx2 x3j;js2 s3j)isused.(Ourwayofbreakingtiesbetteragreeswiththelexicographicorderon[0;1]2,whichisusefulintheproofofProposition3and,lessimportantly,intheproofofLemma10.)ItiseasytocheckthattheproofgivenbyDevroyeetal.(1994)alsoworks(andbecomessimpler)forourwayofbreakingdistanceties.Lemma9SupposeKn!¥andKn=o(n).Foranye0thereexistsane0suchthat,forlargeenoughn,P(PXU)(x;s):maxy2YjPn(yjx;s) P(yjx)jeee en;inparticular,P(PXU)f(x;s):jfn(x;s) f(x)jege e en:ProofWeapplyLemma8tothebinaryclassicationproblemobtainedfromourclassicationproblembyreplacinglabely2Ywith1andreplacingallotherlabelswith0:PZjP(yjx) Pn(yjx;s)jPX(dx)U(ds)ee ne2=40:591 VOVKByMarkov'sinequalitythisimpliesP(PXU)fjP(yjx) Pn(yjx;s)jpegpe e ne2=40;which,inturn,impliesP(PXU)maxy2YjP(yjx) Pn(yjx;s)jpejYjpee 
ne2=40:Thiscompletestheproof,sincewecantaketheeinthelastequationarbitrarilysmallascomparedtotheeinthestatementofthelemma.Wewillusetheshorthand8¥nforfromsomenon.Lemma10SupposeKn!¥andKn=o(n).Foranye0thereexistsane0suchthat,forlargeenoughn,P8#ni:maxyP(yjxi) P6=n(yjxi;t0)eone9e en:Inparticular,8¥n:P8#ni:f(xi) f6=n(xi;t0)eone9e en:ProofSinceP6=n(yjxi;t0) Pn(yjxi;t0i)1Kn=o(1);wecan,andwill,ignoretheupperindices6=inthestatementofthelemma.DeneIn(x;s):=80ifmaxyjP(yjx) Pn(yjx;s)je1ifmaxyjP(yjx) Pn(yjx;s)j2e(maxyjP(yjx) Pn(yjx;s)j e)=eotherwise(intuitively,In(x;s)isasoftversionofIfmaxyjP(yjx) Pn(yjx;s)jeg).Themaintoolinthisproof(andseveralotherproofsinthissection)willbeMcDiarmid'stheo-rem(see,e.g.,Devroyeetal.,1996,Theorem9.2).Firstwecheckthepossibilityofitsapplication.Ifwereplaceanextendedobject(xj;t0j)byanotherextendedobject(xj;tj),theexpressionnåi=1In(xi;t0)willchangeasfollows:theaddendIn(xi;t0)fori=jchangesby1atmost;theaddendsIn(xi;t0i)fori6=jsuchthatneither(xj;t0j)nor(xj;tj)areamongtheKnnearestneighboursof(xi;t0)donotchangeatall;592 UNIVERSALWELL-CALIBRATEDCLASSIFICATIONthesumovertheatmost4Kn(seebelow)addendsIn(xi;t0)fori6=jsuchthateither(xj;t0j)or(xj;tj)(orboth)areamongtheKnnearestneighboursof(xi;t0i)canchangebyatmost4Kn1e1Kn=4e:(25)Theleft-handsideof(25)reectsthefollowingfacts:thechangeinPn(yjxi;t0)fori6=jisatmost1=Kn;thenumberofi6=jsuchthat(xj;t0j)isamongtheKnnearestneighboursof(xi;t0)doesnotexceed2Kn(sincetheextendedobjectsarelinearlyorderedand(3)isusedforbreakingdistanceties);analogously,thenumberofi6=jsuchthat(xj;tj)isamongtheKnnearestneighboursof(xi;t0)doesnotexceed2Kn.Therefore,byMcDiarmid'stheorem,P(1nnåi=1In(xi;t0) E 1nnåi=1In(xi;t0)!e)exp 2e2n=(1+4=e)2=exp 2e4(4+e)2n:(26)Nextwend:E 1nnåi=1In(xi;t0)!=E In(xn;t0n)E In 1(xn;t0n)+o(1)E(PXU)f(x;s):maxyjP(yjx) Pn 1(yjx;s)jeg+o(1)e en+e+o(1)2e(thepenultimateinequalityfollowsfromLemma9)fromsomenon.Incombinationwith(26)thisimplies8¥n:P(1nnåi=1In(xi;t0)3e)exp 2e4(4+e)2n;inparticularP#fi:maxyjP(yjxi) 
We say that an extended example $(x_i,\tau_i,y_i)$, $i=1,\dots,n$, is $n$-strange if $y_i \ne \hat y_n(x_i,\tau'_i)$; otherwise, $(x_i,\tau_i,y_i)$ will be called $n$-ordinary. We will assume that $(f^{\neq}_n(x_i,\tau'_i),\tau''_i)$, $i=1,\dots,n$, are all different for all $n$; even more than that, we will assume that the $\tau''_i$ are all different (we can do so since the probability of this event is one).

Figure 6: Cases $F(c)=S(\delta)$ (left) and $F(c)>S(\delta)$ (right). The vertical bands of width $\varepsilon$ determine the division of the first $n$ extended examples into five classes.

Lemma 11 Suppose (7) is satisfied and $\delta\le\delta_0$. With probability one, the $\lfloor(1-S(\delta))n\rfloor$ extended examples with the largest (in the sense of the lexicographic order) $(f^{\neq}_n(x_i,\tau'_i),\tau''_i)$ among $(x_1,\tau_1,y_1),\dots,(x_n,\tau_n,y_n)$ contain at most $n\delta+o(n)$ $n$-strange extended examples as $n\to\infty$.

Proof Define
\[ c := \sup\{\beta: F(\beta)\le S(\delta)\}. \]
It is clear that $0\le c\le 1$. Our proof will work both in the case $F(c)=S(\delta)$ and in the case $F(c)>S(\delta)$, as illustrated in Figure 6. Let $\varepsilon>0$ be a small constant (we will let $\varepsilon\to 0$ eventually). Define a threshold $(c'_n,c''_n)\in[0,1]^2$ requiring that
\[ \mathbb{P}\bigl\{ f(x_n)=c,\ (f_{n-1}(x_n,\tau'_n),\tau''_n) \ge (c'_n,c''_n) \bigr\} = F(c)-S(\delta)-\varepsilon \tag{27} \]
if $F(c)>S(\delta)$; we assume that $\varepsilon$ is small enough for
\[ 2\varepsilon < F(c)-S(\delta) \tag{28} \]
to hold. Among other things this will ensure the validity of the definition (27). If $F(c)=S(\delta)$, we set $(c'_n,c''_n):=(c+\varepsilon,0)$; in any case, we will have
\[ \mathbb{P}\bigl\{ f(x_n)=c,\ (f_{n-1}(x_n,\tau'_n),\tau''_n) \ge (c'_n,c''_n) \bigr\} \ge F(c)-S(\delta)-\varepsilon. \tag{29} \]
Let us say that an extended example $(x_i,\tau_i,y_i)$ is above the threshold if $(f^{\neq}_n(x_i,\tau'_i),\tau''_i)\ge(c'_n,c''_n)$; otherwise, we say it is below the threshold. Divide the first $n$ extended examples $(x_i,\tau_i,y_i)$, $i=1,\dots,n$, into five classes:

Class I: those satisfying $f(x_i)\le c-2\varepsilon$.

Class II: those that satisfy $f(x_i)=c$ and are below the threshold.

Class III: those satisfying $c-2\varepsilon<f(x_i)<c+2\varepsilon$ but not $f(x_i)=c$.

Class IV: those that satisfy $f(x_i)=c$ and are above the threshold.

Class V: those satisfying $f(x_i)\ge c+2\varepsilon$.

First we explain the general idea of the proof. The threshold $(c'_n,c''_n)$ was chosen so that approximately $\lfloor(1-S(\delta))n\rfloor$ of the available extended examples will be above the threshold.
Because of this, the extended examples above the threshold will essentially be the $\lfloor(1-S(\delta))n\rfloor$ extended examples with the largest $(f^{\neq}_n(x_i,\tau'_i),\tau''_i)$ referred to in the statement of the lemma. For each of the five classes we will be interested in the following questions: How many extended examples are there in the class? How many of those are above the threshold? How many of those above the threshold are $n$-strange? If the sum of the answers to the last question does not exceed $n\delta$ by too much, we are done.

With this plan in mind, we start the formal proof. (Of course, we will not be following the plan literally: for example, if a class is very small, we do not need to answer the second and third questions.) The first step is to show that
\[ c-\varepsilon \le c'_n \le c+\varepsilon \tag{30} \]
from some $n$ on; this will ensure that the classes are conveniently separated from each other. We only need to consider the case $F(c)>S(\delta)$. The inequality $c'_n\le c+\varepsilon$ follows from
\[ \forall^\infty n:\quad \mathbb{P}\bigl\{f(x_n)=c,\ f_{n-1}(x_n,\tau'_n)\ge c+\varepsilon\bigr\} \le \varepsilon < F(c)-S(\delta)-\varepsilon; \]
simply combine Lemma 9 with (28). The inequality $c-\varepsilon\le c'_n$ follows in a similar way from
\[ \forall^\infty n:\quad \mathbb{P}\bigl\{f(x_n)=c,\ f_{n-1}(x_n,\tau'_n)\ge c-\varepsilon\bigr\} = \mathbb{P}\{f(x_n)=c\} - \mathbb{P}\bigl\{f(x_n)=c,\ f_{n-1}(x_n,\tau'_n)< c-\varepsilon\bigr\} \ge F(c)-F(c-)-\varepsilon \ge F(c)-S(\delta)-\varepsilon. \]

Now we are ready to analyze the composition of our five classes. Among the Class I extended examples at most
\[ \varepsilon n \tag{31} \]
will be above the threshold from some $n$ on almost surely (by Lemma 10 and the Borel–Cantelli lemma). None of the Class II extended examples will be above the threshold, by definition. The fraction of Class III extended examples among the first $n$ extended examples will tend to
\[ F(c+2\varepsilon)-F(c)+F(c-)-F(c-2\varepsilon) \tag{32} \]
as $n\to\infty$ almost surely.

To estimate the number $N^{\mathrm{IV}}_n$ of Class IV extended examples among the first $n$ extended examples, we use McDiarmid's theorem. If one extended example is replaced by another, $N^{\mathrm{IV}}_n$ will change by at most $2K_n+1$ (since this extended example can affect $f^{\neq}_n(x_i,\tau'_i)$ for at most $2K_n$ other extended examples $(x_i,\tau_i,y_i)$). Therefore,
\[ \mathbb{P}\Bigl\{ \Bigl|\frac1n N^{\mathrm{IV}}_n - \frac1n \mathbb{E} N^{\mathrm{IV}}_n\Bigr| \ge \varepsilon \Bigr\} \le 2\exp\bigl(-2\varepsilon^2 n/(2K_n+1)^2\bigr); \]
the assumption $K_n=o\bigl(\sqrt{n/\ln n}\bigr)$ and the Borel–Cantelli lemma imply that
\[ \Bigl|\frac1n N^{\mathrm{IV}}_n - \frac1n \mathbb{E} N^{\mathrm{IV}}_n\Bigr| < \varepsilon \]
from some $n$ on almost surely. Since
\[ \frac1n \mathbb{E} N^{\mathrm{IV}}_n = \mathbb{P}\bigl\{ f(x_n)=c,\ (f_{n-1}(x_n,\tau'_n),\tau''_n)\ge(c'_n,c''_n) \bigr\} \ge F(c)-S(\delta)-\varepsilon, \]
as in (29), we have
\[ N^{\mathrm{IV}}_n \ge (F(c)-S(\delta)-2\varepsilon)n \tag{33} \]
from some $n$ on almost surely. Of course, all these examples are above the threshold.

Now we estimate the number $N^{\mathrm{IV,str}}_n$ of $n$-strange extended examples of Class IV. Again McDiarmid's theorem implies that
\[ \Bigl|\frac1n N^{\mathrm{IV,str}}_n - \frac1n \mathbb{E} N^{\mathrm{IV,str}}_n\Bigr| < \varepsilon \]
from some $n$ on almost surely. Now, from some $n$ on,
\[ \frac1n \mathbb{E} N^{\mathrm{IV,str}}_n = \mathbb{P}\bigl\{ f(x_n)=c,\ (f_{n-1}(x_n,\tau'_n),\tau''_n)\ge(c'_n,c''_n),\ \hat y_n(x_n,\tau'_n)\ne y_n \bigr\} = \mathbb{E}\Bigl( \bigl(1-P_{Y|X}(\hat y_n(x_n,\tau'_n)\,|\,x_n)\bigr)\, I\bigl\{f(x_n)=c,\ (f_{n-1}(x_n,\tau'_n),\tau''_n)\ge(c'_n,c''_n)\bigr\} \Bigr) \]
\[ \le e^{-\varepsilon' n}+\varepsilon+(1-c+2\varepsilon)\,\mathbb{P}\bigl\{f(x_n)=c,\ (f_{n-1}(x_n,\tau'_n),\tau''_n)\ge(c'_n,c''_n)\bigr\} = e^{-\varepsilon' n}+\varepsilon+(1-c+2\varepsilon)\bigl(F(c)-S(\delta)-\varepsilon\bigr) \tag{34} \]
\[ \le (F(c)-S(\delta))(1-c)+4\varepsilon \tag{35} \]
in the case $F(c)>S(\delta)$; the first inequality in this chain follows from Lemma 9: indeed, this lemma implies that, unless an event of the small probability $e^{-\varepsilon' n}+\varepsilon$ happens,
\[ P\bigl(\hat y_n(x_n,\tau'_n)\,\big|\,x_n\bigr) \ge P_{n-1}\bigl(\hat y_n(x_n,\tau'_n)\,\big|\,x_n,\tau'_n\bigr)-\varepsilon = f_{n-1}(x_n,\tau'_n)-\varepsilon \ge f(x_n)-2\varepsilon. \tag{36} \]
If $F(c)=S(\delta)$, the lines (34) and (35) of that chain have to be changed to
\[ e^{-\varepsilon' n}+\varepsilon+(1-c+2\varepsilon)\,\mathbb{P}\bigl\{f(x_n)=c,\ f_{n-1}(x_n,\tau'_n)\ge c+\varepsilon\bigr\} \le e^{-\varepsilon' n}+\varepsilon+(1-c+2\varepsilon)\bigl(e^{-\bar\varepsilon n}+\bar\varepsilon\bigr) \le 4\varepsilon \]
(where the obvious modification of Lemma 9, with all $\varepsilon$ changed to $\bar\varepsilon$, is used), but the inequality between the extreme terms of the chain still holds. Therefore, the number of $n$-strange Class IV extended examples does not exceed
\[ \bigl((F(c)-S(\delta))(1-c)+5\varepsilon\bigr)n \tag{37} \]
from some $n$ on almost surely.

By the Borel strong law of large numbers, the fraction of Class V extended examples among the first $n$ extended examples will tend to
\[ 1-F(c+2\varepsilon) \tag{38} \]
as $n\to\infty$ almost surely. By Lemma 10, the Borel–Cantelli lemma, and (30), almost surely from some $n$ on at least
\[ (1-F(c+2\varepsilon)-2\varepsilon)n \tag{39} \]
extended examples in Class V will be above the threshold.

Finally, we estimate the number $N^{\mathrm{V,str}}_n$ of $n$-strange extended examples of Class V among the first $n$ extended examples. By McDiarmid's theorem,
\[ \Bigl|\frac1n N^{\mathrm{V,str}}_n - \frac1n \mathbb{E} N^{\mathrm{V,str}}_n\Bigr| < \varepsilon \]
from some $n$ on almost surely. Now
\[ \frac1n \mathbb{E} N^{\mathrm{V,str}}_n = \mathbb{P}\bigl\{ f(x_n)\ge c+2\varepsilon,\ \hat y_n(x_n,\tau'_n)\ne y_n \bigr\} = \mathbb{E}\Bigl( \bigl(1-P_{Y|X}(\hat y_n(x_n,\tau'_n)\,|\,x_n)\bigr)\, I\{f(x_n)\ge c+2\varepsilon\} \Bigr) \]
\[ \le e^{-\varepsilon' n}+\varepsilon+\mathbb{E}\bigl( (1-f(x_n)+2\varepsilon)\, I\{f(x_n)\ge c+2\varepsilon\} \bigr) \le e^{-\varepsilon' n}+3\varepsilon+\mathbb{E}\bigl( (1-f(x_n))\, I\{f(x_n)\ge c+2\varepsilon\} \bigr) \]
\[ = e^{-\varepsilon' n}+3\varepsilon+\int_0^1 \bigl(F(\beta)-F(c+2\varepsilon)\bigr)^+\,d\beta \le \int_0^1 \bigl(F(\beta)-F(c)\bigr)^+\,d\beta + 4\varepsilon \]
from some $n$ on. The first inequality follows from Lemma 9, as in (36). Therefore,
\[ \frac1n N^{\mathrm{V,str}}_n \le \int_0^1 \bigl(F(\beta)-F(c)\bigr)^+\,d\beta + 5\varepsilon \tag{40} \]
from some $n$ on almost surely.

Summarizing, we can see that the total number of extended examples above the threshold among the first $n$ extended examples will be at least
\[ \bigl(F(c)-S(\delta)-2\varepsilon+1-F(c+2\varepsilon)-2\varepsilon\bigr)n = \bigl(1-S(\delta)+F(c)-F(c+2\varepsilon)-4\varepsilon\bigr)n \tag{41} \]
(see (33) and (39)) from some $n$ on almost surely. The number of $n$-strange extended examples among them will not exceed
\[ \Bigl( \varepsilon + F(c+2\varepsilon)-F(c)+F(c-)-F(c-2\varepsilon) + \varepsilon + (F(c)-S(\delta))(1-c)+5\varepsilon + \int_0^1 \bigl(F(\beta)-F(c)\bigr)^+\,d\beta+5\varepsilon \Bigr)n \]
\[ = \Bigl( F(c+2\varepsilon)-F(c)+F(c-)-F(c-2\varepsilon) + (F(c)-S(\delta))(1-c) + \int_0^1 \bigl(F(\beta)-F(c)\bigr)^+\,d\beta + 12\varepsilon \Bigr)n \tag{42} \]
(see (31), (32), (37), and (40)) from some $n$ on almost surely. Combining (41) and (42), we can see that the number of $n$-strange extended examples among the $\lfloor(1-S(\delta))n\rfloor$ extended examples with the largest $(f^{\neq}_n(x_i,\tau'_i),\tau''_i)$ does not exceed
\[ \Bigl( F(c+2\varepsilon)-F(c)+F(c-)-F(c-2\varepsilon)+(F(c)-S(\delta))(1-c)+\int_0^1 \bigl(F(\beta)-F(c)\bigr)^+\,d\beta+12\varepsilon \Bigr)n + \bigl(F(c+2\varepsilon)-F(c)+4\varepsilon\bigr)n \]
\[ = \Bigl( 2\bigl(F(c+2\varepsilon)-F(c)\bigr)+\bigl(F(c-)-F(c-2\varepsilon)\bigr)+(F(c)-S(\delta))(1-c)+\int_0^1 \bigl(F(\beta)-F(c)\bigr)^+\,d\beta+16\varepsilon \Bigr)n \]
from some $n$ on almost surely. Since $\varepsilon$ can be arbitrarily small, the coefficient in front of $n$ in the last expression can be made arbitrarily close to
\[ (F(c)-S(\delta))(1-c)+\int_0^1 \bigl(F(\beta)-F(c)\bigr)^+\,d\beta = \int_0^1 \bigl(F(\beta)-S(\delta)\bigr)^+\,d\beta = \delta, \]
which completes the proof. □

Lemma 12 Suppose (7) is satisfied. The fraction of $n$-strange extended examples among the first $n$ extended examples $(x_i,\tau_i,y_i)$ approaches $\delta_0$ asymptotically with probability one.

Proof sketch The lemma is not difficult to prove using McDiarmid's theorem and the fact that, by Lemma 10, $P(\hat y_n(x_i,\tau'_i)\,|\,x_i)$ will typically differ little from $f(x_i)$. Notice, however, that the part that we really need in this paper (that the fraction of $n$-strange extended examples does not exceed $\delta_0+o(1)$ as $n\to\infty$ with probability one) is just a special case of Lemma 11, corresponding to $\delta=\delta_0$.

Lemma 13 Suppose (7) is satisfied and $\delta\ge\delta_0$. The fraction of $n$-ordinary extended examples among the $\lfloor C(\delta)n\rfloor$ extended examples $(x_i,\tau_i,y_i)$, $i=1,\dots,n$, with the lowest $(f^{\neq}_n(x_i,\tau'_i),\tau''_i)$ does not exceed $\delta-\delta_0+o(1)$ as $n\to\infty$ with probability one.

Lemma 13 can be proved analogously to Lemma 11.

Lemma 14 Let $F_1\supseteq F_2\supseteq\cdots$ be a decreasing sequence of $\sigma$-algebras and $\xi_1,\xi_2,\dots$ be a bounded adapted (in the sense that $\xi_n$ is $F_n$-measurable for all $n$) sequence of random variables such that
\[ \limsup_{n\to\infty} \mathbb{E}(\xi_n\,|\,F_{n+1}) \le 0 \quad\text{a.s.} \]
Then
\[ \limsup_{n\to\infty} \frac1n\sum_{i=1}^n \xi_i \le 0 \quad\text{a.s.} \]
Proof Replacing, if necessary, $\xi_n$ by $\xi_n-\mathbb{E}(\xi_n\,|\,F_{n+1})$, we reduce our task to the following special case (a reverse Borel strong law of large numbers): if $\xi_1,\xi_2,\dots$ is a bounded reverse martingale difference, in the sense of being adapted and satisfying $\forall n: \mathbb{E}(\xi_n\,|\,F_{n+1})=0$, then
\[ \lim_{n\to\infty} \frac1n\sum_{i=1}^n \xi_i = 0 \quad\text{a.s.} \tag{43} \]
Fix a bounded reverse martingale difference $\xi_1,\xi_2,\dots$; our goal is to prove (43). By the martingale version of Hoeffding's inequality (Devroye et al., 1996, Theorem 9.1) applied to the martingale difference $(\xi_i,F_i)$, $i=n,\dots,1$,
\[ \mathbb{P}\Bigl\{ \Bigl|\frac1n\sum_{i=1}^n \xi_i\Bigr| \ge \varepsilon \Bigr\} \le 2e^{-2\varepsilon^2 n/(2C)^2}, \tag{44} \]
where $C$ is an upper bound on $\sup_n|\xi_n|$. Combined with the Borel–Cantelli–Lévy lemma, (44) implies (43). □

Now we can sketch the proof of Proposition 6. Define $F_n$, $n=1,2,\dots$, to be the $\sigma$-algebra on $\bar Z^\infty$ generated by the multiset of the first $n-1$ extended examples $(x_i,\tau_i,y_i)$, $i=1,\dots,n-1$, and the sequence of extended examples $(x_i,\tau_i,y_i)$, $i=n,n+1,\dots$ (starting from the $n$th extended example).

Suppose first that $\delta<\delta_0$. Consider the $\lfloor(1-S(\delta-\varepsilon))n\rfloor$ extended examples with the largest $(f^{\neq}_n(x_i,\tau'_i),\tau''_i)$ among $(x_1,\tau_1,y_1),\dots,(x_n,\tau_n,y_n)$, where $\varepsilon\in(0,\delta)$ is a small constant. Let us show that each of these examples will be predicted with certainty from the other extended examples in the sequence $(x_1,\tau_1,y_1),\dots,(x_n,\tau_n,y_n)$, from some $n$ on. We will be assuming $n$ large enough. Let $(x_k,\tau_k,y_k)$ be the extended example with the $(\lfloor(\delta-\varepsilon/2)n\rfloor+1)$th largest (in the sense of the lexicographic order) $(f^{\neq}_n(x_i,\tau'_i),\tau''_i)$ among all $n$-strange extended examples $(x_i,\tau_i,y_i)$, $i=1,\dots,n$. (Remember that all $\tau''_i$ are assumed to be different.) Let $(x_j,\tau_j,y_j)$ be one of the $\lfloor(1-S(\delta-\varepsilon))n\rfloor$ extended examples with the largest $(f^{\neq}_n(x_i,\tau'_i),\tau''_i)$ and let $y\in Y$ be a label different from $\hat y_n(x_j,\tau'_j)$. It suffices to prove that
\[ \tau''_j \ge \frac{\#\{i=1,\dots,n: \alpha^y_i \ge \alpha^y_j\} - n\delta}{\#\{i=1,\dots,n: \alpha^y_i = \alpha^y_j\}} \tag{45} \]
(cf. (6) on p. 582), where all $\alpha^y_i$ are computed as the $\alpha_i$ in (11) from the sequence $(x_1,\tau_1,y_1),\dots,(x_n,\tau_n,y_n)$ with $y_j$ replaced by $y$. It will be more convenient to write (45) in the form
\[ \#\{i: \alpha^y_i > \alpha^y_j\} + (1-\tau''_j)\,\#\{i: \alpha^y_i = \alpha^y_j\} \le n\delta. \]
Since $\alpha^y_j = f^{\neq}_n(x_j,\tau'_j)$ and $\alpha^y_i \ne \alpha_i$ for at most $2K_n+1$ values of $i$ (indeed, changing $y_j$ will affect at most $2K_n+1$ of the $\alpha$s), it suffices to prove
\[ \#\bigl\{i: \alpha_i > f^{\neq}_n(x_j,\tau'_j)\bigr\} + (1-\tau''_j)\,\#\bigl\{i: \alpha_i = f^{\neq}_n(x_j,\tau'_j)\bigr\} \le n(\delta-\bar\varepsilon), \tag{46} \]
where $\bar\varepsilon<\varepsilon/4$ is a positive constant. Since $(f^{\neq}_n(x_j,\tau'_j),\tau''_j)\ge(\alpha_k,\tau''_k)$ (indeed, by Lemma 11, there are fewer than $(\delta-\varepsilon/2)n$ $n$-strange extended examples among the $\lfloor(1-S(\delta-\varepsilon))n\rfloor$ extended examples with the largest $(f^{\neq}_n(x_i,\tau'_i),\tau''_i)$),
(46) will follow from
\[ \#\{i: \alpha_i > \alpha_k\} + (1-\tau''_k)\,\#\{i: \alpha_i = \alpha_k\} \le n(\delta-\bar\varepsilon). \tag{47} \]
If $\#\{i: \alpha_i = \alpha_k\} \le \varepsilon^3 n$, the left-hand side of (47) does not exceed
\[ (\delta-\varepsilon/2)n + \varepsilon^3 n \le n(\delta-\bar\varepsilon); \]
so we can, and will, assume without loss of generality that
\[ \#\{i: \alpha_i = \alpha_k\} \ge \varepsilon^3 n. \tag{48} \]
Since the $\tau''_i$ for the extended examples satisfying $\alpha_i=\alpha_k$ are output according to the uniform distribution $U$, the expected value of $1-\tau''_k$ is about
\[ \frac{(\delta-\varepsilon/2)n - \#\{i: \alpha_i > \alpha_k\}}{\#\{i: \alpha_i = \alpha_k\}}, \]
and so by Hoeffding's inequality and the Borel–Cantelli lemma we will have (from some $n$ on)
\[ 1-\tau''_k \le \frac{(\delta-\varepsilon/2)n - \#\{i: \alpha_i > \alpha_k\}}{\#\{i: \alpha_i = \alpha_k\}} + \bar\varepsilon, \tag{49} \]
remembering (48). Inequality (47) will hold because its left-hand side can be transformed using (49) as
\[ \#\{i: \alpha_i > \alpha_k\} + (1-\tau''_k)\,\#\{i: \alpha_i = \alpha_k\} \le (\delta-\varepsilon/2)n + \bar\varepsilon\,\#\{i: \alpha_i = \alpha_k\} \le (\delta-\varepsilon/2+\bar\varepsilon)n \le (\delta-\bar\varepsilon)n. \]

The assertion we have just proved means that, almost surely from some $n$ on,
\[ \mathbb{P}\bigl(\{\mathrm{unc}_n=0\}\,\big|\,F_{n+1}\bigr) \ge \frac{\lfloor(1-S(\delta-\varepsilon))n\rfloor}{n} \ge 1-S(\delta-\varepsilon)-\frac1n. \]
Since $\varepsilon$ can be arbitrarily small and $S$ is continuous (Vovk, 2002b, Lemma 2), this implies
\[ \limsup_{n\to\infty} \mathbb{E}(\mathrm{unc}_n\,|\,F_{n+1}) \le S(\delta) \quad\text{a.s.} \]
By Lemma 14 this implies, in turn,
\[ \limsup_{n\to\infty} \frac1n\sum_{i=1}^n \mathrm{unc}_i \le S(\delta) \quad\text{a.s.}, \]
which coincides with (17). If $\delta\ge\delta_0$, Lemma 12 implies that
\[ \lim_{n\to\infty} \mathbb{E}(\mathrm{unc}_n\,|\,F_{n+1}) = 0 \quad\text{a.s.} \]
(and actually $\mathbb{E}(\mathrm{unc}_n\,|\,F_{n+1})=0$ from some $n$ on if $\delta>\delta_0$); in combination with Lemma 14 this again implies (17).

Inequality (18) is treated in a similar way to (17). Lemmas 12 and 13 imply that
\[ \liminf_{n\to\infty} \mathbb{E}(\mathrm{emp}_n\,|\,F_{n+1}) \ge C(\delta) \quad\text{a.s.} \tag{50} \]
(this inequality is vacuously true when $\delta<\delta_0$). Another application of Lemma 14 gives
\[ \liminf_{n\to\infty} \frac1n\sum_{i=1}^n \mathrm{emp}_i \ge C(\delta) \quad\text{a.s.}, \]
i.e., (18).

Remark The derivation of Proposition 6 from Lemmas 11–14 would be very simple if we defined the individual strangeness measure by, say,
\[ \alpha_i := \begin{cases} \bigl(-f^{\neq}_n(x_i,s_i),\,s_i\bigr) & \text{if } y_i=\hat y_n(x_i,s_i) \\ \bigl(f^{\neq}_n(x_i,s_i),\,s_i\bigr) & \text{otherwise} \end{cases} \]
(with the lexicographic order on the $\alpha$s) instead of (11) (in which case the denominator of (6) would be 1 almost surely). Our definition (11), however, is simpler and, most importantly, facilitates the proof of Proposition 3. Another simplification would be to use Lemma 11 (applied to $\delta:=\delta-C(\delta)$) instead of Lemma 13 in the derivation of (50); we preferred a more symmetric picture.
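The calibration side of the picture is easy to check numerically. The sketch below is my own illustration, not the paper's Nearest Neighbours TCM: i.i.d. integer scores stand in for the strangeness values $\alpha_i$, and the smoothed p-value plays the role of the left-hand side of a condition like (45). With randomized tie-breaking the smoothed p-value is exactly uniform on $[0,1]$, so the frequency of p-values at most $\delta$, i.e. of errors, approaches $\delta$.

```python
import random

def smoothed_p_value(alphas, tau):
    """Smoothed p-value of the last score in alphas: the fraction of
    scores strictly larger than it, plus a tau-fraction of the ties
    (cf. the left-hand side of (45))."""
    a_last = alphas[-1]
    gt = sum(1 for a in alphas if a > a_last)
    eq = sum(1 for a in alphas if a == a_last)
    return (gt + tau * eq) / len(alphas)

random.seed(0)
delta, trials, errors = 0.2, 2000, 0
for _ in range(trials):
    alphas = [random.randint(0, 9) for _ in range(50)]  # i.i.d. scores
    if smoothed_p_value(alphas, random.random()) <= delta:
        errors += 1
# the long-run relative frequency of "errors" stays close to delta
```

Conditioning on the multiset of scores, the p-value is uniform on the interval determined by the tie group, and mixing over groups gives the uniform distribution overall; this is the mechanism behind the finitary well-calibratedness stressed in the Conclusion below.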
6. Conclusion

We have shown that there exist universal well-calibrated region predictors, thus satisfying, to some degree, the desiderata mentioned in §1: well-calibratedness and optimal performance. Notice, however, that the ways in which these two desiderata are satisfied are very different: the well-calibratedness holds in a very specific finitary sense, since the errors have probability $\delta$ and are independent, whereas the optimal performance is achieved only asymptotically.

An important direction of further research is to obtain non-asymptotic results about TCM's optimality. A natural setting is where we have a Bayesian model for Reality's strategy, $\{P_\theta: \theta\in\Theta\}$ with a prior $\mu(d\theta)$ on $\Theta$, and our goal is to minimize $\mathrm{Unc}^\delta_n$ under this model. The intuition behind this setting is that we do not really believe that the data is generated from our model and so prefer a predictor that is well-calibrated regardless of the correctness of the model; but if the model is correct, we would like to have an optimal performance. A special case of this setting, with $\mu(d\theta)$ concentrated at one point, was considered in Vovk (2002b); however, all results in that paper are asymptotic.

Acknowledgments

This is a full version of the conference paper (Vovk, 2003); I am grateful to both sets of anonymous referees (for the conference and journal versions), whose comments and suggestions helped to improve the quality of presentation and correct several mistakes. This work was partially supported by EPSRC (grant GR/R46670/01), BBSRC (grant 111/BIO14428), and EU (grant IST-1999-10226).

Appendix A. Notation

The following table contains, strictly speaking, not only the notation used in this paper but also the preferred use of symbols.

$X$ — object space
$Y$ — label space
$Z$ — example space ($Z = X\times Y$)
$P$ — the probability distribution in $Z$ generating individual examples $z_1=(x_1,y_1), z_2=(x_2,y_2),\dots$
$\delta$ — significance level
$\Gamma^\delta_n$ — prediction region
$\mathrm{err}^\delta_n$ — indicator of error at trial $n$
$\mathrm{unc}^\delta_n$ — indicator of uncertain prediction at trial $n$
$\mathrm{emp}^\delta_n$ — indicator of empty prediction at trial $n$
$\mathrm{Err}^\delta_n$ — cumulative number of errors up to trial $n$
$\mathrm{Unc}^\delta_n$ — cumulative number of uncertain predictions up to trial $n$
$\mathrm{Emp}^\delta_n$ — cumulative number of empty predictions up to trial $n$
$\tau_n$ — the $n$th random number used by a region predictor
$\tau'_n$, $\tau''_n$ — two components of $\tau_n$, as defined in §3.1
$\bar X$ — the extended object space $X\times[0,1]$
$\bar Z$ — the extended example space $X\times[0,1]\times Y$
$\le$, $<$ — may refer to the lexicographic order on $[0,1]^2$, as defined on p. 581
$|(x,s)|$ — the absolute value of $(x,s)\in[0,1]^2$, as defined in (4)
$A$ — individual strangeness measure
$\alpha_i$ — values taken by an individual strangeness measure
$\#E$ — the size of set $E$
$K_n$ — the number of nearest neighbours taken into account at trial $n$
$P^{\neq}_n(y|x_i,s_i)$ — empirical estimate of $P(y|x_i)$ without taking $y_i$ into account, as defined in (8)
$f^{\neq}_n(x_i,s_i)$ — corresponding empirical predictability function, (9)
$\hat y_n(x_i,s_i)$ — choice function, as defined in (10)
$D$ — finite set of significance levels
$P_X$ — the marginal distribution of $P$ in $X$
$P_{Y|X}$ — the regular conditional distribution of $y\in Y$ given $x\in X$, where $(x,y)$ is distributed as $P$
$f(x)$ — predictability of object $x$
$F(\beta)$ — predictability distribution function
$S(\delta)$ — success curve, defined in (12)
$C(\delta)$ — complementary success curve, defined in (13)
$\delta_0$ — critical significance level, defined in (14)
$\overline{\mathrm{err}}$, $\overline{\mathrm{unc}}$, $\overline{\mathrm{emp}}$ — predictable versions of err, unc, emp, as defined on p. 588
$\overline{\mathrm{Err}}$, $\overline{\mathrm{Unc}}$, $\overline{\mathrm{Emp}}$ — predictable versions of Err, Unc, Emp
$F(t-)$ — the limit of $F(u)$ as $u$ approaches $t$ from below
$F(t+)$ — the limit of $F(u)$ as $u$ approaches $t$ from above
$u\vee v$ — the maximum of $u$ and $v$, also denoted $\max(u,v)$
$u\wedge v$ — the minimum of $u$ and $v$, also denoted $\min(u,v)$
$t^+$ — $t\vee 0$
$t^-$ — $(-t)\vee 0$
$U$ — the uniform probability distribution on $[0,1]$
$P_n(y|x,s)$ — empirical estimate of $P(y|x)$, defined by (23)
$f_n(x,s)$ — corresponding empirical predictability function, defined by (24)
$\mathbb{P}$ — probability
$\mathbb{E}$ — expectation
$\forall^\infty n$ — from some $n$ on
$I_E$ — the indicator function of set $E$

References

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, second edition, 2001.

David R. Cox and David V. Hinkley. Theoretical Statistics. Chapman and Hall, London, 1974.

Luc Devroye, László Györfi, Adam Krzyżak, and Gábor Lugosi. On the strong universal consistency of nearest neighbor regression function estimates. Annals of Statistics, 22:1371–1385, 1994.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.

Yoav Freund, Yishay Mansour, and Robert E. Schapire. Generalization bounds for averaged classifiers. Annals of Statistics, 32(4), 2004.

Ronald L. Rivest and Robert H. Sloan. Learning complicated concepts reliably and usefully. In Proceedings of the First Annual Conference on Computational Learning Theory, pages 69–79, San Mateo, CA, 1988. Morgan Kaufmann.

Craig Saunders, Alex Gammerman, and Vladimir Vovk. Transduction with confidence and credibility. In Thomas Dean, editor, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, volume 2, pages 722–726. Morgan Kaufmann, 1999.
Albert N. Shiryaev. Probability. Springer, New York, second edition, 1996.

Vladimir Vovk. On-line Confidence Machines are well-calibrated. In Proceedings of the Forty Third Annual Symposium on Foundations of Computer Science, pages 187–196, Los Alamitos, CA, 2002a. IEEE Computer Society.

Vladimir Vovk. Asymptotic optimality of Transductive Confidence Machine. In Proceedings of the Thirteenth International Conference on Algorithmic Learning Theory, volume 2533 of Lecture Notes in Artificial Intelligence, pages 336–350, Berlin, 2002b. Springer.

Vladimir Vovk. Universal well-calibrated algorithm for on-line classification. In Bernhard Schölkopf and Manfred K. Warmuth, editors, Learning Theory and Kernel Machines: Sixteenth Annual Conference on Learning Theory and Seventh Kernel Workshop, volume 2777 of Lecture Notes in Artificial Intelligence, pages 358–372, Berlin, 2003. Springer.

Vladimir Vovk, Alex Gammerman, and Craig Saunders. Machine-learning applications of algorithmic randomness. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 444–453, San Francisco, CA, 1999. Morgan Kaufmann.