Journal of Machine Learning Research 5 (2004) 575-604. Submitted 1/04; Published 6/04.

A Universal Well-Calibrated Algorithm for On-line Classification

Vladimir Vovk (VOVK@CS.RHUL.AC.UK)
Computer Learning Research Centre, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK

Editors: Kristin Bennett and Nicolò Cesa-Bianchi

Abstract

We study the problem of on-line classification in which the prediction algorithm, for each "significance level" $\delta$, is required to output as its prediction a range of labels (intuitively, those labels deemed compatible with the available data at the level $\delta$) rather than just one label; as usual, the examples are assumed to be generated independently from the same probability distribution $P$. The prediction algorithm is said to be "well-calibrated" for $P$ and $\delta$ if the long-run relative frequency of errors does not exceed $\delta$ almost surely w.r. to $P$. For well-calibrated algorithms we take the number of "uncertain" predictions (i.e., those containing more than one label) as the principal measure of predictive performance. The main result of this paper is the construction of a prediction algorithm which, for any (unknown) $P$ and any $\delta$: (a) makes errors independently and with probability $\delta$ at every trial (in particular, is well-calibrated for $P$ and $\delta$); (b) makes in the long run no more uncertain predictions than any other prediction algorithm that is well-calibrated for $P$ and $\delta$; (c) processes example $n$ in time $O(\log n)$.

Keywords: Transductive Confidence Machine, on-line prediction

1. Introduction

Typical machine learning algorithms output a point prediction for the label of an unknown object. This paper continues the study of an algorithm called the Transductive Confidence Machine (TCM), introduced by Saunders et al. (1999) and Vovk et al. (1999), that complements its predictions with some measures of confidence. There are different ways of presenting TCM's output; in this paper (as in the related Vovk, 2002a,b) we use TCM as a "region predictor", in the sense that it outputs a nested family of prediction regions (indexed by the significance level $\delta$) rather than a point prediction.

Any TCM is well-calibrated when used in the on-line mode: for any significance level $\delta$ the long-run relative frequency of erroneous predictions does not exceed $\delta$. What makes this feature of TCM especially appealing is that it is far from being just an asymptotic phenomenon: a slight modification of TCM called randomized TCM (rTCM) makes errors independently at different trials and with probability $\delta$ at each trial.[1] The property of being well-calibrated then immediately follows by the Borel strong law of large numbers. Figure 1 shows the cumulative numbers of errors at the significance levels 1%-5% made on the well-known USPS data set of hand-written digits (randomly permuted); as expected, these are straight lines with the slope approximately equal to the significance level. For proofs and further information, see Vovk (2002a).

[1] Randomization is needed to break ties and deal efficiently with borderline cases.
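As an illustration of the calibration property (this sketch is ours, not part of the paper), an rTCM's error indicators at level $\delta$ are independent Bernoulli($\delta$) variables, so simulating them reproduces the straight lines of Figure 1:

```python
import random

def cumulative_errors(n_examples, levels, seed=0):
    """For an rTCM, an error at significance level delta occurs independently
    with probability delta at each trial, so the cumulative error count grows
    like a straight line of slope delta (cf. Figure 1)."""
    rng = random.Random(seed)
    totals = {d: 0 for d in levels}
    for _ in range(n_examples):
        u = rng.random()      # one uniform number drives all levels, which
        for d in levels:      # mirrors the nesting of the prediction regions
            totals[d] += u < d
    return totals

print(cumulative_errors(10000, [0.01, 0.02, 0.03, 0.04, 0.05]))
# roughly 100, 200, 300, 400 and 500 errors respectively
```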
[Figure 1: TCM's cumulative errors at the significance levels 1%-5% on the USPS data set]

The justification of the study of TCM given by Vovk (2002a) was its good performance on real-world and standard benchmark data sets. For example, Figure 2 shows that for the significance levels between 1% and 5% most examples in the USPS data set can be predicted categorically (by a simple 1-Nearest Neighbour TCM, used in all experiments reported in this paper): the prediction region contains only one label.

This paper presents theoretical results about TCM's performance in the problem of classification, where the number of possible labels is finite; we show that there exists a universal rTCM, which, for any significance level $\delta$ and without knowing the true distribution $P$ generating the examples:

- produces, asymptotically, no more uncertain predictions than any other prediction algorithm that is well-calibrated for $P$ and $\delta$;
- produces, asymptotically, at least as many empty predictions as any other prediction algorithm that is well-calibrated for $P$ and $\delta$ and whose percentage of uncertain predictions is optimal (in the sense of the previous item).

The importance of the first item is obvious: we want to minimize the number of uncertain predictions. This asymptotic criterion ceases to work, however, when the number of uncertain predictions stabilizes, as in Figure 2 for significance levels 3%-5%. In such cases the number of empty predictions becomes important: empty predictions (automatically leading to an error) provide a warning that the object is atypical (looks very different from the previous objects), and one would like to be warned as often as possible, taking into account that the relative frequency of errors (including empty predictions) is guaranteed not to exceed $\delta$ in the long run. Remember that TCM outputs a whole family of prediction regions, so the fact that at some significance level the prediction region becomes empty does not mean that all potential labels for a new object become equally likely: we should just shift our attention to other significance levels. Figure 3 shows the cumulative numbers of empty predictions for the USPS data set.

[Figure 2: Cumulative number of "uncertain" predictions (i.e., prediction regions containing more than one label) made by the 1-Nearest Neighbour TCM at the significance levels 1%-5% on the USPS data set]

The full prediction output by a TCM is a complicated mathematical object: for each significance level $\delta$ we have a prediction region. In practice, a good starting point might be first to look at the prediction regions corresponding to two or three conventional significance levels, such as 1% and 5% (afterwards, of course, the prediction regions at other significance levels should be looked at). For example, denoting $\Gamma^\delta$ the prediction region at significance level $\delta$, we could say that the prediction is "highly certain" if $|\Gamma^{1\%}| \le 1$ and "certain" if $|\Gamma^{5\%}| \le 1$; similarly, we could say that the new object (whose label is being predicted) is "highly atypical" if $|\Gamma^{1\%}| = 0$ and "atypical" if $|\Gamma^{5\%}| = 0$. In the case of classification, the family of prediction regions $\Gamma^\delta$ can be summarized by reporting the confidence $\sup\{1 - \delta : |\Gamma^\delta| \le 1\}$, the credibility $\inf\{\delta : |\Gamma^\delta| = 0\}$, and the prediction $\Gamma^\delta$, where $1 - \delta$ is the confidence (in the case of TCM, $|\Gamma^\delta| \le 1$ and usually $|\Gamma^\delta| = 1$ when $1 - \delta$ is the confidence). Reporting the prediction, confidence, and credibility, as in Saunders et al. (1999) and Vovk et al. (1999), is analogous to reporting the observed level of significance (Cox and Hinkley, 1974, p. 66) in statistics.
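A minimal sketch (ours) of this summary, assuming each label comes with a p-value and that the region at level $\delta$ consists of the labels whose p-value exceeds $\delta$:

```python
def region(p_values, delta):
    """Prediction region: labels whose p-value exceeds the significance level."""
    return {y for y, p in p_values.items() if p > delta}

def summarize(p_values):
    """Prediction, confidence and credibility as defined above."""
    ps = sorted(p_values.values(), reverse=True)
    prediction = max(p_values, key=p_values.get)
    confidence = 1 - (ps[1] if len(ps) > 1 else 0.0)  # 1 - second-largest p-value
    credibility = ps[0]                               # largest p-value
    return prediction, confidence, credibility

# e.g. a digit with p-value 0.60 for '7' and 0.02 for every other label:
p = {str(d): 0.02 for d in range(10)}; p['7'] = 0.60
print(region(p, 0.05), summarize(p))   # {'7'} ('7', 0.98, 0.6)
```

The second-largest p-value is the smallest $\delta$ at which the region shrinks to at most one label, and the largest p-value is the smallest $\delta$ at which it becomes empty, which is exactly what the two suprema/infima above express.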
[Figure 3: Cumulative number of empty predictions made by the 1-Nearest Neighbour TCM at the significance levels 1%-5% on the USPS data set (there are no empty predictions for 1% and 2%)]

This paper's result elaborates on Vovk (2002b), where it was shown that an optimal randomized TCM exists when the distribution $P$ generating the examples is known. In the rest of this paper we consider only randomized TCM, so we drop the adjective "randomized".

The two areas of mainstream machine learning that are most closely connected with this paper are PAC learning theory and Bayesian learning theory. Whereas we often use the rich arsenal of mathematical tools developed in these fields, they do not provide the same kind of guarantees (the right probability of error at each significance level, with errors at different trials independent) under unknown $P$; for more details, see Vovk (2002a) and references therein. Several papers (such as Rivest and Sloan, 1988; Freund et al., 2004) extend the standard PAC framework by allowing the prediction algorithm to abstain from making a prediction at some trials. Our results show that for any significance level $\delta$ there exists a prediction algorithm that: (a) makes a wrong prediction with relative frequency at most $\delta$; (b) has an optimal frequency of abstentions among the prediction algorithms that satisfy property (a) (for details, see the Remark in §2). The paper by Freund et al. (2004) is especially close to the approach of this paper, defining a very natural TCM in the situation where a hypothesis class is given (the "empirical log-ratio" of Freund et al. (2004), taken with the appropriate sign, can be used as an "individual strangeness measure", as defined in §3).

2. Main Result

In our learning protocol, Reality outputs pairs $(x_1, y_1), (x_2, y_2), \ldots$ called examples. Each example $(x_i, y_i)$ consists of an object $x_i$ and its label $y_i$. The objects are chosen from a measurable space $X$ called the object space and the labels are elements of a measurable space $Y$ called the label space. In this paper we assume that $Y$ is finite (and endowed with the $\sigma$-algebra of all subsets).

The protocol includes variables $\mathrm{Err}^\delta_n$ (the total number of errors made up to and including trial $n$ at significance level $\delta$) and $\mathrm{err}^\delta_n$ (the binary variable showing whether an error is made at trial $n$). It also includes the analogous variables $\mathrm{Unc}^\delta_n$, $\mathrm{unc}^\delta_n$, $\mathrm{Emp}^\delta_n$, $\mathrm{emp}^\delta_n$ for uncertain and empty predictions:

$\mathrm{Err}^\delta_0 := 0$, $\mathrm{Unc}^\delta_0 := 0$, $\mathrm{Emp}^\delta_0 := 0$ for all $\delta \in (0,1)$;
FOR $n = 1, 2, \ldots$:
  Reality outputs $x_n \in X$;
  Predictor outputs $\Gamma^\delta_n \subseteq Y$ for all $\delta \in (0,1)$;
  Reality outputs $y_n \in Y$;
  $\mathrm{err}^\delta_n := 1$ if $y_n \notin \Gamma^\delta_n$ and $0$ otherwise; $\mathrm{Err}^\delta_n := \mathrm{Err}^\delta_{n-1} + \mathrm{err}^\delta_n$, for all $\delta \in (0,1)$;
  $\mathrm{unc}^\delta_n := 1$ if $|\Gamma^\delta_n| > 1$ and $0$ otherwise; $\mathrm{Unc}^\delta_n := \mathrm{Unc}^\delta_{n-1} + \mathrm{unc}^\delta_n$, for all $\delta \in (0,1)$;
  $\mathrm{emp}^\delta_n := 1$ if $|\Gamma^\delta_n| = 0$ and $0$ otherwise; $\mathrm{Emp}^\delta_n := \mathrm{Emp}^\delta_{n-1} + \mathrm{emp}^\delta_n$, for all $\delta \in (0,1)$
END FOR.

We will use the notation $Z := X \times Y$ for the example space; $\Gamma^\delta_n$ will be called the prediction region (or just prediction). We will assume that each example $z_n = (x_n, y_n)$, $n = 1, 2, \ldots$, is output according to a probability distribution $P$ in $Z$ and the examples are independent of each other (so the sequence $z_1 z_2 \ldots$ is output by the power distribution $P^\infty$). This is Reality's randomized strategy.

A region predictor is a measurable function

  $\Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x_n, \tau_n)$,  (1)

where $\delta \in (0,1)$, $n = 1, 2, \ldots$, the $(x_i, y_i) \in Z$, $i = 1, \ldots, n-1$, are examples, $x_n \in X$ is an object, and $\tau_i \in [0,1]$ ($i = 1, \ldots, n$), which satisfies

  $\Gamma^{\delta_1}(x_1, \tau_1, y_1, \ldots, x_n, \tau_n) \subseteq \Gamma^{\delta_2}(x_1, \tau_1, y_1, \ldots, x_n, \tau_n)$

whenever $\delta_1 \ge \delta_2$. The measurability of (1) means that for each $n$ the set

  $\{(\delta, x_1, \tau_1, y_1, \ldots, x_n, \tau_n, y_n) : y_n \in \Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_n, \tau_n)\} \subseteq (0,1) \times (X \times [0,1] \times Y)^n$

is measurable. Since we are interested in prediction with confidence, the region predictor (1) is given an extra input $\delta \in (0,1)$, which we call the significance level (typically it is close to 0, standard values being 1% and 5%); the complementary value $1 - \delta$ is called the confidence level. We will always assume that the $\tau_n$ are independent random variables uniformly distributed in $[0,1]$. This makes a region predictor a family (indexed by $\delta \in (0,1)$) of Predictor's randomized strategies.
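To make the bookkeeping of the protocol concrete, here is a minimal sketch (ours, not from the paper); `examples` and `predictor` are hypothetical stand-ins for the two players:

```python
import random

def run_protocol(examples, predictor, deltas, seed=0):
    """The protocol of Section 2: track Err, Unc and Emp per significance level.
    `examples` is an iterable of (x, y) pairs drawn i.i.d. from P;
    `predictor(history, x, tau, delta)` returns a prediction region
    (a set of labels, nested as delta decreases)."""
    rng = random.Random(seed)
    Err, Unc, Emp = ({d: 0 for d in deltas} for _ in range(3))
    history = []
    for x, y in examples:              # Reality outputs x_n, later y_n
        tau = rng.random()             # Predictor's random number tau_n
        for d in deltas:
            gamma = predictor(history, x, tau, d)
            Err[d] += y not in gamma   # err: y_n not in Gamma
            Unc[d] += len(gamma) > 1   # unc: |Gamma| > 1
            Emp[d] += len(gamma) == 0  # emp: |Gamma| = 0
        history.append((x, tau, y))
    return Err, Unc, Emp
```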
We will often use the notation $\mathrm{err}^\delta_n$, $\mathrm{unc}^\delta_n$, etc., in the case where Reality and Predictor are using given randomized strategies. For example, $\mathrm{err}^\delta_n(P^\infty, \Gamma)$ is the random variable equal to 0 if Predictor is right at trial $n$ and at significance level $\delta$ and equal to 1 otherwise. It is always assumed that the random numbers $\tau_n$ used by $\Gamma$ and the random examples $z_n$ chosen by Reality are independent.

We say that a region predictor $\Gamma$ is (conservatively) well-calibrated for a probability distribution $P$ in $Z$ and a significance level $\delta \in (0,1)$ if

  $\limsup_{n\to\infty} \mathrm{Err}^\delta_n(P^\infty, \Gamma)/n \le \delta$ a.s.

We say (as in Vovk, 2002b) that $\Gamma$ is optimal for $P$ and $\delta$ if, for any region predictor $\Gamma^\dagger$ which is well-calibrated for $P$ and $\delta$,

  $\limsup_{n\to\infty} \mathrm{Unc}^\delta_n(P^\infty, \Gamma)/n \le \liminf_{n\to\infty} \mathrm{Unc}^\delta_n(P^\infty, \Gamma^\dagger)/n$ a.s.  (2)

(It is natural to assume in this and other similar definitions that the random numbers used by $\Gamma$ and $\Gamma^\dagger$ are independent, but this assumption is not needed for our mathematical results and we do not make it.) Of course, the definition of optimality is natural only for well-calibrated $\Gamma$. A region predictor $\Gamma$ is universal well-calibrated if:

- it is well-calibrated for any $P$ and $\delta$;
- it is optimal for any $P$ and $\delta$;
- for any $P$, any $\delta$, and any region predictor $\Gamma^\dagger$ which is well-calibrated and optimal for $P$ and $\delta$,

  $\liminf_{n\to\infty} \mathrm{Emp}^\delta_n(P^\infty, \Gamma)/n \ge \limsup_{n\to\infty} \mathrm{Emp}^\delta_n(P^\infty, \Gamma^\dagger)/n$ a.s.

Recall that a measurable space $X$ is Borel if it is isomorphic to a measurable subset of the interval $[0,1]$. The class of Borel spaces is very rich; for example, all Polish spaces (such as the finite-dimensional Euclidean spaces $\mathbb{R}^n$, $\mathbb{R}^\infty$, and the functional spaces $C$ and $D$) are Borel.

Theorem 1 Suppose the object space $X$ is Borel. There exists a universal well-calibrated region predictor.

This is the main result of the paper. In §3 we construct a universal well-calibrated region predictor (processing example $n$ in time $O(\log n)$) and in §4 outline the idea of the proof that it indeed satisfies the required properties. Technical details will be given in §5.

Remark The protocol of Rivest and Sloan (1988) and Freund et al. (2004) is in fact a restriction of our protocol, in which Predictor is only allowed to output a one-element set or the whole of $Y$; the latter is interpreted as abstention. (And in the situation where the numbers of errors and uncertain predictions are of primary interest, as in this paper, the difference between these two protocols is not significant.) The universal well-calibrated region predictor can be adapted to the restricted protocol by replacing an uncertain prediction with $Y$ and replacing an empty prediction with a randomly chosen label. In this way we obtain a prediction algorithm in the restricted protocol which is well-calibrated and has an optimal frequency of abstentions, in the sense of (2), among the well-calibrated algorithms.
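The adaptation described in the Remark is mechanical; a minimal sketch (the function name is ours, not the paper's):

```python
import random

def restrict(gamma, labels, rng=random.Random(1)):
    """Map a region prediction into the restricted protocol of
    Rivest and Sloan (1988) / Freund et al. (2004)."""
    if len(gamma) > 1:
        return set(labels)                   # uncertain -> abstain (all of Y)
    if len(gamma) == 0:
        return {rng.choice(sorted(labels))}  # empty -> random single label
    return gamma                             # certain predictions pass through
```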
3. Construction of a Universal Well-Calibrated Region Predictor

In this section we first define the general notion of Transductive Confidence Machine, and then we specialize it using a nearest neighbours procedure to obtain a universal well-calibrated region predictor.

3.1 Preliminaries

If $\tau$ is a number in $[0,1]$, we split it into two numbers $\tau', \tau'' \in [0,1]$ as follows: if the binary expansion of $\tau$ is $0.a_1 a_2 \ldots$ (redefine the binary expansion of 1 to be $0.11\ldots$), set $\tau' := 0.a_1 a_3 a_5 \ldots$ and $\tau'' := 0.a_2 a_4 a_6 \ldots$. If $\tau$ is distributed uniformly in $[0,1]$, then both $\tau'$ and $\tau''$ are, and they are independent of each other.

We will often apply our procedures (e.g., the "individual strangeness measure" in §3.2, the Nearest Neighbours rule in §3.3) not to the original objects $x \in X$ but to extended objects $(x, \sigma) \in \tilde{X} := X \times [0,1]$, where $x$ is complemented by a random number $\sigma$ (to be extracted from one of the $\tau_n$). In other words, along with examples $(x, y)$ we will also consider extended examples $(x, \sigma, y) \in \tilde{Z} := X \times [0,1] \times Y$.

Let us set $X := [0,1]$; we can do this without loss of generality since $X$ is Borel. This makes the extended object space $\tilde{X} = [0,1]^2$ a linearly ordered set with the lexicographic order: $(x_1, \sigma_1) \le (x_2, \sigma_2)$ means that either $x_1 = x_2$ and $\sigma_1 \le \sigma_2$ or $x_1 < x_2$. We say that $(x_1, \sigma_1)$ is nearer to $(x_3, \sigma_3)$ than $(x_2, \sigma_2)$ is if

  $|x_1 - x_3, \sigma_1 - \sigma_3| < |x_2 - x_3, \sigma_2 - \sigma_3|$,  (3)

where

  $|x, \sigma| := (x, \sigma)$ if $(x, \sigma) \ge (0,0)$ and $|x, \sigma| := -(x, \sigma)$ otherwise.  (4)

The value $|x_1 - x_2, \sigma_1 - \sigma_2|$ plays the role of the distance between the extended objects $(x_1, \sigma_1)$ and $(x_2, \sigma_2)$. Despite such distances being two-dimensional, they are still always comparable using the lexicographic order. Our construction will be based on the Nearest Neighbours algorithm, which is known to be strongly universally consistent in the traditional theory of pattern recognition (see, e.g., Devroye et al., 1996, Chapter 11); the random components $\sigma$ are needed for tie-breaking.

3.2 Transductive Confidence Machines

The Transductive Confidence Machine, or TCM, is a way of transition from what we call an "individual strangeness measure" to a region predictor. A family of measurable functions $\{A_n : n = 1, 2, \ldots\}$, where $A_n : \tilde{Z}^n \to \mathbb{R}^n$ for all $n$, is called an individual strangeness measure if, for any $n = 1, 2, \ldots$, each $\alpha_i$ in

  $A_n : (w_1, \ldots, w_n) \mapsto (\alpha_1, \ldots, \alpha_n)$  (5)

is determined by $w_i$ and the multiset ⦃$w_1, \ldots, w_n$⦄. (The difference between a multiset ⦃$w_1, \ldots, w_n$⦄ and a set $\{w_1, \ldots, w_n\}$ is that the former can contain several copies of the same element.)
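The bit-splitting of §3.1 is easy to realize in finite precision; a minimal sketch (ours, assuming around 52 usable bits of the input):

```python
def split(tau, bits=52):
    """Split tau in [0,1] into tau' (odd-indexed bits) and tau''
    (even-indexed bits); both are uniform and independent when tau is."""
    t1 = t2 = 0.0
    w1 = w2 = 0.5
    for i in range(1, bits + 1):
        tau *= 2
        bit, tau = int(tau), tau - int(tau)   # next binary digit a_i
        if i % 2:                             # a_1, a_3, a_5, ... -> tau'
            t1 += bit * w1; w1 /= 2
        else:                                 # a_2, a_4, a_6, ... -> tau''
            t2 += bit * w2; w2 /= 2
    return t1, t2

print(split(0.625))   # 0.625 = 0.101 in binary -> tau' = 0.75, tau'' = 0.0
```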
The TCM associated with an individual strangeness measure $A_n$ is the following region predictor $\Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x_n, \tau_n)$: at any trial $n$ and for any label $y \in Y$, define

  $(\alpha_1, \ldots, \alpha_n) := A_n((x_1, \tau'_1, y_1), \ldots, (x_{n-1}, \tau'_{n-1}, y_{n-1}), (x_n, \tau'_n, y))$

and include $y$ in $\Gamma^\delta_n$ if and only if

  $\tau''_n < \dfrac{\#\{i = 1, \ldots, n : \alpha_i \ge \alpha_n\} - n\delta}{\#\{i = 1, \ldots, n : \alpha_i = \alpha_n\}}$  (6)

(in particular, include $y$ in $\Gamma^\delta_n$ if $\#\{i = 1, \ldots, n : \alpha_i > \alpha_n\} > n\delta$ and do not include $y$ in $\Gamma^\delta_n$ if $\#\{i = 1, \ldots, n : \alpha_i \ge \alpha_n\} \le n\delta$). A TCM is the TCM associated with some individual strangeness measure. It was shown in Vovk (2002a) that:

Proposition 2 Every TCM is well-calibrated for every $P$ and $\delta$.

The definition of TCM can be illustrated by the following simple example of an individual strangeness measure, the one used in producing Figures 1-3: mapping (5) can be defined, in the spirit of the 1-Nearest Neighbour Algorithm, as (assuming the objects are vectors in a Euclidean space)

  $\alpha_i := \dfrac{\min_{j \ne i : y_j = y_i} d(x_i, x_j)}{\min_{j \ne i : y_j \ne y_i} d(x_i, x_j)}$,

where $d$ is the Euclidean distance (i.e., an object is considered strange if it is in the middle of objects labelled in a different way and is far from the objects labelled in the same way).

3.3 Universal TCM

Fix a monotonically non-decreasing sequence of integer numbers $K_n$, $n = 1, 2, \ldots$, such that

  $K_n \to \infty$, $K_n = o\left(\sqrt{n/\ln n}\right)$  (7)

as $n \to \infty$. The Nearest Neighbours TCM is defined as follows. Let $w_1, \ldots, w_n$ be a sequence of extended examples, $w_i = (x_i, \sigma_i, y_i)$. To define the corresponding $\alpha$'s, as in (5), we first define Nearest Neighbours approximations $P^{\ne}_n(y \mid x_i, \sigma_i)$ to the true (but unknown) conditional probabilities $P(y \mid x_i)$: for every extended example $(x_i, \sigma_i, y_i)$ in the sequence,

  $P^{\ne}_n(y \mid x_i, \sigma_i) := N^{\ne}(x_i, \sigma_i, y) / K_n$,  (8)

where $N^{\ne}(x_i, \sigma_i, y)$ is the number of $j = 1, \ldots, n$ such that $y_j = y$ and $(x_j, \sigma_j)$ is one of the $K_n$ nearest neighbours, in the sense of (3), of $(x_i, \sigma_i)$ in the sequence

  $((x_1, \sigma_1), \ldots, (x_{i-1}, \sigma_{i-1}), (x_{i+1}, \sigma_{i+1}), \ldots, (x_n, \sigma_n))$.

(The upper index $\ne$ reminds us of the fact that $(x_i, \sigma_i)$ is not counted as one of its own nearest neighbours in this definition.) If $K_n \ge n$ or $K_n \le 0$, this definition does not work, so set, e.g., $P^{\ne}_n(y \mid x_i, \sigma_i) := 1/|Y|$ for all $y$ and $i$ (this particular convention is not essential since, by (7), $0 < K_n < n$ from some $n$ on). If the expression "$K_n$ nearest neighbours" is not defined because of distance ties, we again set $P^{\ne}_n(y \mid x_i, \sigma_i) := 1/|Y|$ for all $y$ and $i$ (this convention is not essential since distance ties happen with probability zero). Define the "empirical predictability function" $f^{\ne}_n$ by

  $f^{\ne}_n(x_i, \sigma_i) := \max_{y \in Y} P^{\ne}_n(y \mid x_i, \sigma_i)$.  (9)
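Before the construction is completed with (10) and (11) below, here is a minimal brute-force sketch (ours; the function names are hypothetical and it ignores the $O(\log n)$ data structures of §5.1) of the estimates (8)-(9) together with the inclusion criterion (6), reading (6) as stated above:

```python
def lex_abs(d):
    """|x, sigma| of (4): negate the pair if it is lexicographically negative."""
    return d if d >= (0.0, 0.0) else (-d[0], -d[1])

def nn_estimates(ext_objects, labels, i, K, Y):
    """P_n(y | x_i, sigma_i) of (8) and f_n(x_i, sigma_i) of (9): label
    frequencies among the K nearest neighbours of extended object i, with
    object i itself excluded and distances compared lexicographically (3)."""
    xi = ext_objects[i]
    others = sorted(
        (lex_abs((p[0] - xi[0], p[1] - xi[1])), labels[j])
        for j, p in enumerate(ext_objects) if j != i)
    probs = {y: sum(1 for _, l in others[:K] if l == y) / K for y in Y}
    return probs, max(probs.values())

def include(alphas, tau2, delta):
    """Criterion (6): include the candidate label iff
    tau'' < (#{alpha_i >= alpha_n} - n*delta) / #{alpha_i = alpha_n}."""
    a_n, n = alphas[-1], len(alphas)
    ge = sum(a >= a_n for a in alphas)
    eq = sum(a == a_n for a in alphas)
    return tau2 < (ge - n * delta) / eq
```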
For each $(x_i, \sigma_i)$ fix some

  $\hat{y}_n(x_i, \sigma_i) \in \arg\max_{y} P^{\ne}_n(y \mid x_i, \sigma_i)$  (10)

(e.g., take the first element of $\arg\max_y P^{\ne}_n(y \mid x_i, \sigma_i)$ in a fixed ordering of $Y$) and define the mapping (5) (where $w_i = (x_i, \sigma_i, y_i)$, $i = 1, \ldots, n$) setting

  $\alpha_i := -f^{\ne}_n(x_i, \sigma_i)$ if $y_i = \hat{y}_n(x_i, \sigma_i)$, and $\alpha_i := f^{\ne}_n(x_i, \sigma_i)$ otherwise.  (11)

This completes the definition of the Nearest Neighbours TCM, which will later be shown to be universal.

Proposition 3 Let $D \subseteq (0,1)$ be finite. If $X = [0,1]$ and $K_n \to \infty$ sufficiently slowly, the Nearest Neighbours TCM can be implemented for significance levels $\delta \in D$ so that the computations at trial $n$ are performed in time $O(\log n)$.

Proposition 3 assumes a computational model that allows operations (such as comparison) with real numbers. If $X$ is an arbitrary Borel space, for this proposition to be applicable $X$ should be embedded in $[0,1]$ first; e.g., if $X \subseteq [0,1]^n$, an $x = (x_1, \ldots, x_n) \in X$ can be represented by interleaving binary digits as

  $(0.x_{1,1} x_{2,1} \ldots x_{n,1} x_{1,2} x_{2,2} \ldots x_{n,2} \ldots) \in [0,1]$,

where $0.x_{i,1} x_{i,2} \ldots$ is the binary expansion of $x_i$. We use the expression "can be implemented" in a wide sense, only requiring that the implementation should give the correct results almost surely.

4. Fine Details of Region Prediction

In this section we make first steps towards the proof of Theorem 1. Let $P$ be the true distribution in $Z$ generating the examples. We denote by $P_X$ the marginal distribution of $P$ in $X$ (i.e., $P_X(E) := P(E \times Y)$) and by $P_{Y|X}(y \mid x)$ the conditional probability that, for a random example $(X, Y)$ chosen from $P$, $Y = y$ provided $X = x$ (we fix arbitrarily a regular version of this conditional probability). We will often omit the lower indices $X$ and $Y|X$, and $P$ itself, from our notation. The predictability of an object $x \in X$ is

  $f(x) := \max_{y \in Y} P(y \mid x)$

and the predictability distribution function is the function $F : [0,1] \to [0,1]$ defined by

  $F(\beta) := P\{x : f(x) \le \beta\}$.

An example of such a function $F$ is given in Figure 4 (left), where the graph of $F$ is the thick line. The success curve $S_P$ of $P$ is defined by the equality

  $S_P(\delta) = \inf\left\{B \in [0,1] : \int_0^1 (F(\beta) - B)^+ \, d\beta \le \delta\right\}$,  (12)

where $t^+$ stands for $\max(t, 0)$; the function $S_P$ is also of the type $[0,1] \to [0,1]$. Geometrically, $S_P(\delta)$ is defined from the graph of $F$ as follows (see Figure 4, left; we often drop the lower index $P$): move the point B from A to Z until the area of the curvilinear triangle ABC becomes $\delta$ or B reaches Z; the ordinate of B is then $S(\delta)$.

[Figure 4: The predictability distribution function F and the success curve S(δ) (left); the complementary success curve C(δ) (right)]

The complementary success curve $C_P$ of $P$ is defined by

  $C_P(\delta) = \sup\left\{B \in [0,1] : B + \int_0^1 (F(\beta) - B)^+ \, d\beta \le \delta\right\}$,  (13)

where $\sup \emptyset$ is interpreted as 0. Similarly to the case of $S(\delta)$, $C(\delta)$ is defined as the value such that the area of the part of the box AZOD below the thick line in Figure 4 (right) is $\delta$ ($C(\delta) = 0$ if such a value does not exist). Define the critical significance level $\delta_0$ as

  $\delta_0 := \int_0^1 F(\beta) \, d\beta$.  (14)

It is clear that

  $\delta \le \delta_0 \implies \int_0^1 (F(\beta) - S(\delta))^+ \, d\beta = \delta$ and $C(\delta) = 0$;
  $\delta \ge \delta_0 \implies S(\delta) = 0$ and $C(\delta) + \int_0^1 (F(\beta) - C(\delta))^+ \, d\beta = \delta$.

The following result is proved in Vovk (2002b).

Proposition 4 Let $P$ be a probability distribution in $Z$ and $\delta \in (0,1)$ be a significance level. If a region predictor $\Gamma$ is well-calibrated for $P$ and $\delta$, then

  $\liminf_{n\to\infty} \mathrm{Unc}^\delta_n(P^\infty, \Gamma)/n \ge S_P(\delta)$ a.s.  (15)
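To build intuition for (12)-(14), a small numeric sketch (ours, with an arbitrary made-up $F$) recovers $S(\delta)$, $C(\delta)$ and $\delta_0$ on a grid:

```python
def curves(F, delta, grid=2000):
    """Numerically recover S(delta), C(delta) and delta_0 of (12)-(14)."""
    xs = [(i + 0.5) / grid for i in range(grid)]
    A = lambda B: sum(max(F(x) - B, 0.0) for x in xs) / grid  # integral of (F-B)^+
    Bs = [i / 500 for i in range(501)]
    S = min((B for B in Bs if A(B) <= delta), default=1.0)    # (12)
    feas = [B for B in Bs if B + A(B) <= delta]
    C = max(feas) if feas else 0.0    # (13); sup of the empty set read as 0
    return S, C, A(0.0)               # A(0) is delta_0 of (14)

F = lambda b: max(0.0, 2 * b - 1)     # toy predictability distribution function
print(curves(F, 0.05))                # approximately (0.55, 0.0, 0.25)
```

For this toy $F$ one can check by hand that $\delta_0 = 1/4$ and, for $\delta \le \delta_0$, $S(\delta) = 1 - 2\sqrt{\delta}$, which the grid computation reproduces.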
In this paper we complement Proposition 4 with:

Proposition 5 Let $P$ be a probability distribution in $Z$ and $\delta \in (0,1)$ be a significance level. If a region predictor $\Gamma$ is well-calibrated for $P$ and $\delta$ and satisfies

  $\limsup_{n\to\infty} \mathrm{Unc}^\delta_n(P^\infty, \Gamma)/n \le S_P(\delta)$ a.s.,  (16)

then

  $\limsup_{n\to\infty} \mathrm{Emp}^\delta_n(P^\infty, \Gamma)/n \le C_P(\delta)$ a.s.

Theorem 1 immediately follows from Propositions 2, 4, 5 and the following proposition.

Proposition 6 Suppose $X$ is Borel. The Nearest Neighbours TCM constructed in §3.3 satisfies, for any $P$ and any significance level $\delta$,

  $\limsup_{n\to\infty} \mathrm{Unc}^\delta_n(P^\infty, \Gamma)/n \le S_P(\delta)$ a.s.  (17)

and

  $\liminf_{n\to\infty} \mathrm{Emp}^\delta_n(P^\infty, \Gamma)/n \ge C_P(\delta)$ a.s.  (18)

5. Proofs

In this section we will assume that all extended objects $(x_i, \tau'_i) \in [0,1]^2$, where the $x_i$ are output by Reality and the $\tau_i$ are the random numbers used, are different and that all pairwise distances between them are also different (this is true with probability one, since the $\tau'_i$ are independent random numbers uniformly distributed in $[0,1]$).

5.1 Proof Sketch of Proposition 3

Without loss of generality we assume that $D$ contains only one significance level $\delta$, which will be omitted from our notation. Our computational model has an operation of splitting $\tau \in [0,1]$ into $\tau'$ and $\tau''$ (or is allowed to generate both $\tau'_n$ and $\tau''_n$ at every trial $n$). We will use two main data structures in our implementation of the Nearest Neighbours TCM: a red-black binary search tree,[2] and a growing array of nonnegative integers indexed by $k \in \{-K_n, -K_n + 1, \ldots, K_n\}$ (where $n$ is the ordinal number of the example being processed).

[2] See, e.g., Cormen et al. (2001), Chapters 12-14. The only two operations on red-black trees we need in this paper are the query SEARCH and the modifying operation INSERT.

Immediately after processing the $n$th extended example $(x_n, \tau_n, y_n)$ the contents of these data structures are as follows:

- The search tree contains $n$ vertices, corresponding to the extended examples $(x_i, \tau_i, y_i)$ seen so far. The key of vertex $i$ is the extended object $(x_i, \tau'_i) \in [0,1]^2$; the linear order on the keys is the lexicographic order. The other information contained in vertex $i$ is the random number $\tau''_i$, the label $y_i$, the set $\{P^{\ne}_n(y \mid x_i, \tau'_i) : y \in Y\}$ of conditional probability estimates (8), the pointer to the following vertex (i.e., the vertex that has the smallest key greater than $(x_i, \tau'_i)$; if there is no greater key, the pointer is NIL), and the pointer to the previous vertex (i.e., the vertex that has the greatest key smaller than $(x_i, \tau'_i)$; if $(x_i, \tau'_i)$ is the smallest key, the pointer is NIL).
- The array contains the numbers $N(k) := \#\{i = 1, \ldots, n : \alpha_i = k/K_n\}$ (the $\alpha_i$ are defined by (11) with $\sigma_i := \tau'_i$).

Notice that the information contained in vertex $i$ of the search tree is sufficient to find $\hat{y}_n(x_i, \tau'_i)$ and $\alpha_i$ in time $O(1)$. We will say that an extended object $(x_j, \tau'_j)$ is in the vicinity of an extended object $(x_i, \tau'_i)$, $i \ne j$, if there are less than $K_n$ extended objects $(x_k, \tau'_k)$ (strictly) between $(x_i, \tau'_i)$ and $(x_j, \tau'_j)$.

When a new object $x_n$ becomes known, the algorithm does the following:

- Generates $\tau'_n$ and $\tau''_n$.
- Locates the successor and predecessor of $(x_n, \tau'_n)$ in the search tree (using the query SEARCH and the pointers to the following and previous vertices); this requires time $O(\log n)$.
- Computes the estimated conditional probabilities $\{P^{\ne}_n(y \mid x_n, \tau'_n) : y \in Y\}$; this also gives $\hat{y}_n(x_n, \tau'_n)$. This involves scanning the vicinity of $(x_n, \tau'_n)$ for the $K_n$ nearest neighbours of $(x_n, \tau'_n)$, which can be done in time $O(K_n)$: the $K_n$ nearest neighbours can be extracted from the vicinity of $(x_n, \tau'_n)$ sorted in the order of increasing distances from $(x_n, \tau'_n)$; since initially the vicinity consists of two sorted lists (to the left and to the right of $(x_n, \tau'_n)$), the procedure MERGE used in the merge-sort algorithm (see, e.g., Cormen et al., 2001, §2.3.1) will sort the whole vicinity in time $O(K_n)$. Therefore, the required time is $O(K_n) = O(\log n)$. (A small sketch of this scan is given below.)
- For each $y \in Y$ looks at what happens if the $n$th example is $(x_n, \tau_n, y)$: computes $\alpha_n$ and updates (if necessary) the $\alpha_i$ for the $(x_i, \tau'_i)$ in the vicinity of $(x_n, \tau'_n)$; using the array and $\tau''_n$, finds whether $y \in \Gamma_n$. This requires time $O(K_n^2) = O(\log n)$, since there are $O(K_n)$ $\alpha_i$'s in the vicinity of $(x_n, \tau'_n)$ and each of them can be computed in time $O(K_n)$.
- Outputs the prediction region $\Gamma_n$ (time $O(1)$).
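A minimal self-contained sketch (ours) of the vicinity scan in the third step, with a plain sorted list and `bisect` standing in for the red-black tree's SEARCH (so the logarithmic guarantees of the tree are not reproduced here):

```python
from bisect import bisect_left
from heapq import merge

def lex_dist(p, q):
    """|p - q| of (4): negate the componentwise difference if it is
    lexicographically negative."""
    d = (p[0] - q[0], p[1] - q[1])
    return d if d >= (0.0, 0.0) else (-d[0], -d[1])

def k_nearest(keys, p, K):
    """Scan the vicinity of p in a lexicographically sorted list of extended
    objects: the K candidates on each side are already sorted by distance
    from p, so merging them (as MERGE does in merge-sort) yields the K
    nearest neighbours."""
    i = bisect_left(keys, p)
    left  = [(lex_dist(q, p), q) for q in reversed(keys[max(0, i - K):i])]
    right = [(lex_dist(q, p), q) for q in keys[i:i + K]]
    return [q for _, q in list(merge(left, right))[:K]]

keys = sorted([(0.1, 0.3), (0.2, 0.9), (0.4, 0.5), (0.7, 0.2), (0.9, 0.8)])
print(k_nearest(keys, (0.35, 0.5), 2))   # the two nearest extended objects
```

The key fact exploited here, as in the paper's implementation, is that walking away from $p$ through a lexicographically sorted list can only increase the two-dimensional distance, so each side of the vicinity is a sorted list.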
When the label $y_n$ arrives, the algorithm:

- Inserts the new vertex $(x_n, \tau'_n, \tau''_n, y_n, \{P^{\ne}_n(y \mid x_n, \tau'_n) : y \in Y\})$ in the search tree, repairs the pointers to the following and previous elements for $(x_n, \tau'_n)$'s left and right neighbours, initializes the pointers to the following and previous elements for $(x_n, \tau'_n)$ itself, and rebalances the tree (time $O(\log n)$).
- Updates (if necessary) the conditional probabilities $\{P^{\ne}_{n-1}(y \mid x_i, \tau'_i) : y \in Y\} \mapsto \{P^{\ne}_n(y \mid x_i, \tau'_i) : y \in Y\}$ for the $2K_n$ existing vertices $(x_i, \tau'_i)$ in the vicinity of $(x_n, \tau'_n)$; this requires time $O(K_n^2) = O(\log n)$. The conditional probabilities for the other $(x_i, \tau'_i)$, $i = 1, \ldots, n-1$, do not change.
- Updates the array, changing $N(K_n \alpha_i)$ for the $(x_i, \tau'_i) \ne (x_n, \tau'_n)$ in the vicinity of $(x_n, \tau'_n)$, for both old and new values of $\alpha_i$, and changing $N(K_n \alpha_n)$ (time $O(K_n) = O(\log n)$).

In conclusion we discuss how to do the updates required when $K_n$ changes. At the critical trials $n$ when $K_n$ changes, the array and the estimated conditional probabilities $P^{\ne}_n(y \mid x_i, \tau'_i)$ have to be recomputed, which, if done naively, would require time $\Theta(n K_n)$. The assumption we have made about $K_n$ so far is that $K_n = O(\sqrt{\log n})$. We now also assume that $K_n$ is monotonic non-decreasing and

  $\#\{n : K_n \le c\} = O(\#\{n : K_n = c\})$  (19)

as $c \to \infty$. This is the full explication of the "$K_n \to \infty$ sufficiently slowly" in the statement of the proposition, as used in this proof.

An epoch is defined to be a maximal sequence of $n$'s with the same $K_n$. Since the changes that need to be done when a new epoch starts are substantial, they will be spread over the whole preceding epoch; we will only discuss updating the estimated conditional probabilities $P^{\ne}_n(y \mid x_i, \tau'_i)$: the array is treated similarly. An epoch is odd if the corresponding $K_n$ is odd and even if $K_n$ is even. At every step in an epoch we prepare the ground for the next epoch. By the end of epoch $n = A+1, A+2, \ldots, B$ we need to change $B$ sets $\{P^{\ne}_n(y \mid x_i, \tau'_i) : y \in Y\}$ in $B - A$ steps (the duration of the epoch). Therefore, each vertex of the search tree should contain not only $\{P^{\ne}_n(y \mid x_i, \tau'_i)\}$ for the current epoch but also $\{P^{\ne}_n(y \mid x_i, \tau'_i)\}$ for the next epoch (two structures for holding $\{P^{\ne}_n(y \mid x_i, \tau'_i)\}$ will suffice, one for even epochs and one for odd epochs). Our assumption of the slow growth of $K_n$, as in (19), implies that $B = O(B - A)$. This means that at each step $O(1)$ sets $\{P^{\ne}_n(y \mid x_i, \tau'_i)\}$ for the next epoch should be added. This will take time $O(K_n) = O(\log n)$. As soon as a set $\{P^{\ne}_n(y \mid x_i, \tau'_i) : y \in Y\}$ for the next epoch is added at some trial, both sets (for the current and next epochs) will have to be updated for each new example.

5.2 Proof Sketch of Proposition 5

The proof of Proposition 5 is similar to (but more complicated than) the proof of Theorems 1 and 1r in Vovk (2002b); this proof sketch can be made rigorous using the Neyman-Pearson lemma, as in Vovk (2002b). We will use the notations $g'_{\mathrm{left}}$ and $g'_{\mathrm{right}}$ for the left and right derivatives, respectively, of a function $g$. The following lemma parallels Lemma 2 in Vovk (2002b), which deals with $S(\delta)$.

Lemma 7 The complementary success curve $C : [0,1] \to [0,1]$ always satisfies these properties:
1. There is a point $\delta_0 \in [0,1]$ (namely, the critical significance level) such that $C(\delta) = 0$ for $\delta \le \delta_0$ and $C(\delta)$ is concave for $\delta \ge \delta_0$.
2. $C'_{\mathrm{right}}(\delta_0) \le \infty$ and $C'_{\mathrm{left}}(1) \ge 1$; therefore, for $\delta \in (\delta_0, 1)$, $1 \le C'_{\mathrm{right}}(\delta) \le C'_{\mathrm{left}}(\delta) \le \infty$ and the function $C(\delta)$ is increasing.
3. $C(\delta)$ is continuous at $\delta = \delta_0$; therefore, it is continuous everywhere in $[0,1]$.

If a function $C : [0,1] \to [0,1]$ satisfies these properties, there exist a measurable space $X$, a finite set $Y$, and a probability distribution $P$ in $X \times Y$ for which $C$ is the complementary success curve.

Proof sketch The statement of the lemma follows from the fact that the complementary success curve $C$ can be obtained from the predictability distribution function $F$ using these steps (labelling the horizontal and vertical axes as $x$ and $y$ respectively):
1. Invert $F$: $F_1 := F^{-1}$.
2. Integrate $F_1$: $F_2(x) := \int_0^x F_1(t) \, dt$.
3. Increase $F_2$: $F_3(x) := F_2(x) + \delta_0$, where $\delta_0 := \int_0^1 F(x) \, dx$.
4. Invert $F_3$: $F_4 := F_3^{-1}$.

It can be shown that $C = F_4$, if we define $g^{-1}(y) := \sup\{x : g(x) \le y\}$ for non-decreasing $g$ (so that $g^{-1}$ is continuous on the right).
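A small numeric sketch (ours) of the four steps in this proof, tabulating everything on a grid and using the right-continuous inverse $g^{-1}(y) = \sup\{x : g(x) \le y\}$:

```python
def complementary_curve(F, grid=2000):
    """C = F4, obtained from F by: invert, integrate, add delta_0, invert
    (steps 1-4 in the proof of Lemma 7)."""
    xs = [i / grid for i in range(grid + 1)]
    Fv = [F(x) for x in xs]
    delta0 = sum(Fv) / len(Fv)
    # step 1: F1 = F^{-1}, tabulated on the same grid
    F1 = [max((x for x, f in zip(xs, Fv) if f <= y), default=0.0) for y in xs]
    # steps 2 and 3: running integral of F1, shifted up by delta_0
    F3, acc = [], delta0
    for v in F1:
        F3.append(acc)
        acc += v / grid
    # step 4: invert F3 to obtain C
    return lambda d: max((x for x, f in zip(xs, F3) if f <= d), default=0.0)

C = complementary_curve(lambda b: max(0.0, 2 * b - 1))
print(round(C(0.5), 2))   # about 0.41 for the toy F used earlier
```

For the toy $F$ this agrees with solving $B + \int_0^1 (F(\beta) - B)^+ \, d\beta = \delta$ in (13) directly, which gives $C(0.5) = \sqrt{2} - 1 \approx 0.414$.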
Complement the protocol of §2, in which Reality plays $P^\infty$ and Predictor plays $\Gamma$, with the following variables:

  $\overline{\mathrm{err}}_n := (P \times U)\{(x, y, \tau) : y \notin \Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x, \tau)\}$,
  $\overline{\mathrm{unc}}_n := (P_X \times U)\{(x, \tau) : |\Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x, \tau)| > 1\}$,
  $\overline{\mathrm{emp}}_n := (P_X \times U)\{(x, \tau) : |\Gamma^\delta(x_1, \tau_1, y_1, \ldots, x_{n-1}, \tau_{n-1}, y_{n-1}, x, \tau)| = 0\}$,

$\delta$ being fixed and $U$ standing for the uniform distribution in $[0,1]$, and

  $\overline{\mathrm{Err}}_n := \sum_{i=1}^n \overline{\mathrm{err}}_i$, $\overline{\mathrm{Unc}}_n := \sum_{i=1}^n \overline{\mathrm{unc}}_i$, $\overline{\mathrm{Emp}}_n := \sum_{i=1}^n \overline{\mathrm{emp}}_i$.

By the martingale strong law of large numbers, to prove the proposition it suffices to consider only these "predictable" versions of $\mathrm{Err}_n$, $\mathrm{Unc}_n$, and $\mathrm{Emp}_n$: indeed, since $\mathrm{Err}_n - \overline{\mathrm{Err}}_n$, $\mathrm{Unc}_n - \overline{\mathrm{Unc}}_n$, and $\mathrm{Emp}_n - \overline{\mathrm{Emp}}_n$ are martingales (with increments bounded by 1 in absolute value) with respect to the filtration $\mathcal{F}_n$, $n = 0, 1, \ldots$, where each $\mathcal{F}_n$ is generated by $(x_1, \tau_1, y_1), \ldots, (x_n, \tau_n, y_n)$, we have

  $\lim_{n\to\infty} (\mathrm{Err}_n - \overline{\mathrm{Err}}_n)/n = 0$ a.s., $\lim_{n\to\infty} (\mathrm{Unc}_n - \overline{\mathrm{Unc}}_n)/n = 0$ a.s., $\lim_{n\to\infty} (\mathrm{Emp}_n - \overline{\mathrm{Emp}}_n)/n = 0$ a.s.

(See, e.g., Shiryaev, 1996, Theorem VII.5.4.)

Without loss of generality we can assume that Predictor's move $\Gamma_n$ at trial $n$ is $\{\hat{y}(x_n)\}$ (where $x \mapsto \hat{y}(x) \in \arg\max_y P(y \mid x)$ is a fixed "choice function") or the empty set $\emptyset$ or the whole label space $Y$. Furthermore, we can assume that, at every trial, the predictions are certain for the new objects above the straight line BC in Figure 5,[3] and that the predictions are empty for the objects below the straight line DG in Figure 5.[4]

[Figure 5: An admissible region predictor. The thick line is the predictability distribution function F; the area of the curvilinear triangle ABC is $\overline{\mathrm{err}}_n - \overline{\mathrm{emp}}_n$; the area of the rectangle DZOG is $\overline{\mathrm{emp}}_n$; the (non-negative) area of the curvilinear quadrangle BDEC is denoted $e_n$]

[3] More formally, predictions are certain for new extended objects $(x, \tau)$ satisfying $F(x, \tau) := F(f(x)-) + \tau(F(f(x)+) - F(f(x)-)) \ge S(\overline{\mathrm{err}}_n - \overline{\mathrm{emp}}_n)$. Intuitively, considering extended objects makes the vertical axis "infinitely divisible".
[4] Indeed, predictions of this kind are admissible in the sense that we cannot improve $\overline{\mathrm{unc}}_n$ and $\overline{\mathrm{emp}}_n$ simultaneously, and all admissible predictions are equivalent to predictions of this kind. A formal argument for the case where the $\overline{\mathrm{emp}}_n$ are omitted is given in Vovk (2002b).

It is clear that for the region predictor to satisfy (16) it must hold that

  $\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n (e_i \wedge \overline{\mathrm{emp}}_i) = 0$

(otherwise $\overline{\mathrm{Unc}}_n$ can be decreased substantially, which contradicts (15); the $e_i$ are defined in the caption of Figure 5), and so we can assume, without loss of generality, that either $e_n = 0$ or $\overline{\mathrm{emp}}_n = 0$ at every trial $n$, i.e., that

  $\overline{\mathrm{unc}}_n = S(\overline{\mathrm{err}}_n)$, $\overline{\mathrm{emp}}_n = C(\overline{\mathrm{err}}_n)$

at every trial. Let us check that to achieve (16) the region predictor must satisfy

  $\delta \le \delta_0 \implies \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^n (\overline{\mathrm{err}}_i - \delta_0)^+ = 0$,  (20)
  $\delta \ge \delta_0 \implies \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^n (\delta_0 - \overline{\mathrm{err}}_i)^+ = 0$,  (21)

where the convergence is, as usual, almost certain. It was shown in Vovk (2002b, Lemma 2) that the success curve $S$ is convex, non-increasing, continuous, and has slope at most $-1$ before it hits the $x$ axis at $\delta = \delta_0$. The second implication, (21), now immediately follows from the fact that, under $\delta \ge \delta_0$ and (16),

  $0 = \limsup_{n\to\infty} \frac{\overline{\mathrm{Unc}}_n}{n} = \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^n S(\overline{\mathrm{err}}_i) \ge \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^n (\delta_0 - \overline{\mathrm{err}}_i)^+$.

The first implication, (20), can be extracted from the chain

  $\frac{\overline{\mathrm{Unc}}_n}{n} = \frac{1}{n} \sum_{i=1}^n \overline{\mathrm{unc}}_i = \frac{1}{n} \sum_{i=1}^n S(\overline{\mathrm{err}}_i) \ge S\left(\frac{1}{n} \sum_{i=1}^n \overline{\mathrm{err}}_i\right) = S\left(\frac{\overline{\mathrm{Err}}_n}{n}\right) \ge S(\delta) - \epsilon$  (22)

(with the last inequality holding almost surely for an arbitrary $\epsilon > 0$ from some $n$ on) used by Vovk (2002b, in the proof of Theorems 1 and 1r). Indeed, it can be seen from (22) that, assuming the predictor is well-calibrated and optimal and $\delta \le \delta_0$, $\overline{\mathrm{Err}}_n / n \to \delta$ a.s. and, therefore,

  $S(\delta) \ge \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^n S(\overline{\mathrm{err}}_i) = \limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^n S(\overline{\mathrm{err}}_i \wedge \delta_0) \ge \limsup_{n\to\infty} S\left(\frac{1}{n} \sum_{i=1}^n (\overline{\mathrm{err}}_i \wedge \delta_0)\right)$
  $= \limsup_{n\to\infty} S\left(\frac{\overline{\mathrm{Err}}_n}{n} - \frac{1}{n}\sum_{i=1}^n (\overline{\mathrm{err}}_i - \delta_0)^+\right) = \limsup_{n\to\infty} S\left(\delta - \frac{1}{n}\sum_{i=1}^n (\overline{\mathrm{err}}_i - \delta_0)^+\right) = S\left(\delta - \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n (\overline{\mathrm{err}}_i - \delta_0)^+\right)$

almost surely. This proves (20).

Using (20), (21), and the fact that the complementary success curve $C$ is concave, increasing, and (uniformly) continuous for $\delta \ge \delta_0$ (see Lemma 7), we obtain: if $\delta \le \delta_0$,

  $\frac{\overline{\mathrm{Emp}}_n}{n} = \frac{1}{n}\sum_{i=1}^n \overline{\mathrm{emp}}_i = \frac{1}{n}\sum_{i=1}^n C(\overline{\mathrm{err}}_i) \le C'_{\mathrm{right}}(\delta_0) \cdot \frac{1}{n}\sum_{i=1}^n (\overline{\mathrm{err}}_i - \delta_0)^+ \to 0 \quad (n \to \infty)$;

if $\delta \ge \delta_0$,

  $\frac{\overline{\mathrm{Emp}}_n}{n} = \frac{1}{n}\sum_{i=1}^n C(\overline{\mathrm{err}}_i) = \frac{1}{n}\sum_{i=1}^n C(\overline{\mathrm{err}}_i \vee \delta_0) \le C\left(\frac{1}{n}\sum_{i=1}^n (\overline{\mathrm{err}}_i \vee \delta_0)\right) = C\left(\frac{1}{n}\sum_{i=1}^n \overline{\mathrm{err}}_i + \frac{1}{n}\sum_{i=1}^n (\delta_0 - \overline{\mathrm{err}}_i)^+\right) \le C\left(\frac{1}{n}\sum_{i=1}^n \overline{\mathrm{err}}_i\right) + o(1) \le C(\delta) + \epsilon$,

the last inequality holding almost surely for an arbitrary $\epsilon > 0$ from some $n$ on, $\delta$ being the significance level used.
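The chain (22) and the two displays above lean on Jensen's inequality for the convex $S$ and concave $C$; a tiny numeric sanity check (ours, with the toy $F$ used earlier, for which $S(\delta) = \max(0, 1 - 2\sqrt{\delta})$):

```python
import random

S = lambda d: max(0.0, 1 - 2 * d ** 0.5)   # convex and non-increasing

rng = random.Random(0)
errs = [rng.uniform(0, 0.1) for _ in range(10000)]  # hypothetical err-bar values
mean = sum(errs) / len(errs)
# Jensen's inequality for the convex S, the step used in chain (22):
print(sum(S(e) for e in errs) / len(errs) >= S(mean))   # True
```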
5.3 Proof Sketch of Proposition 6

Let us first modify and extend the notation $P^{\ne}_n(y \mid x_i, \sigma_i)$ introduced in (8). Consider the sequence of extended examples $w_i = (x_i, \tau'_i, y_i)$, $i = 1, \ldots, n$ (the $(x_i, y_i)$ are the first $n$ examples chosen by Reality and the $\tau_i$ are the random numbers used by Predictor). We define the Nearest Neighbours approximations $P_n(y \mid x, \sigma)$ to the conditional probabilities $P(y \mid x)$ as follows: for every $(x, \sigma, y) \in \tilde{Z}$,

  $P_n(y \mid x, \sigma) := N(x, \sigma, y) / K_n$,  (23)

where $N(x, \sigma, y)$ is the number of $i = 1, \ldots, n$ such that $(x_i, \tau'_i)$ is among the $K_n$ nearest neighbours of $(x, \sigma)$ and $y_i = y$ (this time $(x_i, \tau'_i)$ is not prevented from being counted as one of the $K_n$ nearest neighbours of $(x, \sigma)$ if $(x_i, \tau'_i) = (x, \sigma)$). We define the empirical predictability function $f_n$ by

  $f_n(x, \sigma) := \max_{y \in Y} P_n(y \mid x, \sigma)$.  (24)

The proof will be based on the following version of a well-known fundamental result.

Lemma 8 Suppose $K_n \to \infty$, $K_n = o(n)$, and $Y = \{0, 1\}$. For any $\epsilon > 0$ and large enough $n$,

  $\mathbb{P}\left\{\int |P(1 \mid x) - P_n(1 \mid x, \sigma)| \, P_X(dx) \, U(d\sigma) > \epsilon\right\} \le e^{-n\epsilon^2/40}$,

where the outermost probability distribution $\mathbb{P}$ (essentially $(P \times U)^\infty$) generates the extended examples $(x_i, \tau_i, y_i)$, which determine the empirical distributions $P_n$.

Proof This is almost a special case of Devroye et al.'s (1994) Theorem 1. There is, however, an important difference between the way we break distance ties and the way Devroye et al. (1994) do this. In that work, instead of our (3),

  $(|x_1 - x_3|, |\sigma_1 - \sigma_3|) < (|x_2 - x_3|, |\sigma_2 - \sigma_3|)$

is used. (Our way of breaking ties better agrees with the lexicographic order on $[0,1]^2$, which is useful in the proof of Proposition 3 and, less importantly, in the proof of Lemma 10.) It is easy to check that the proof given by Devroye et al. (1994) also works (and becomes simpler) for our way of breaking distance ties.

Lemma 9 Suppose $K_n \to \infty$ and $K_n = o(n)$. For any $\epsilon > 0$ there exists an $\bar\epsilon > 0$ such that, for large enough $n$,

  $\mathbb{P}\left\{(P_X \times U)\left\{(x, \sigma) : \max_{y \in Y} |P_n(y \mid x, \sigma) - P(y \mid x)| > \epsilon\right\} > \epsilon\right\} \le e^{-\bar\epsilon n}$;

in particular,

  $\mathbb{P}\left\{(P_X \times U)\{(x, \sigma) : |f_n(x, \sigma) - f(x)| > \epsilon\} > \epsilon\right\} \le e^{-\bar\epsilon n}$.

Proof We apply Lemma 8 to the binary classification problem obtained from our classification problem by replacing label $y \in Y$ with 1 and replacing all other labels with 0:

  $\mathbb{P}\left\{\int |P(y \mid x) - P_n(y \mid x, \sigma)| \, P_X(dx) \, U(d\sigma) > \epsilon\right\} \le e^{-n\epsilon^2/40}$.

By Markov's inequality this implies

  $\mathbb{P}\left\{(P_X \times U)\{|P(y \mid x) - P_n(y \mid x, \sigma)| > \sqrt{\epsilon}\} > \sqrt{\epsilon}\right\} \le e^{-n\epsilon^2/40}$,

which, in turn, implies

  $\mathbb{P}\left\{(P_X \times U)\left\{\max_{y \in Y} |P(y \mid x) - P_n(y \mid x, \sigma)| > \sqrt{\epsilon}\right\} > |Y|\sqrt{\epsilon}\right\} \le |Y| e^{-n\epsilon^2/40}$.

This completes the proof, since we can take the $\epsilon$ in the last equation arbitrarily small as compared to the $\epsilon$ in the statement of the lemma.

We will use the shorthand "$\forall^\infty n$" for "from some $n$ on".

Lemma 10 Suppose $K_n \to \infty$ and $K_n = o(n)$. For any $\epsilon > 0$ there exists an $\bar\epsilon > 0$ such that, for large enough $n$,

  $\mathbb{P}\left\{\frac{\#\{i : \max_y |P(y \mid x_i) - P^{\ne}_n(y \mid x_i, \tau'_i)| > \epsilon\}}{n} > \epsilon\right\} \le e^{-\bar\epsilon n}$.

In particular,

  $\forall^\infty n : \mathbb{P}\left\{\frac{\#\{i : |f(x_i) - f^{\ne}_n(x_i, \tau'_i)| > \epsilon\}}{n} > \epsilon\right\} \le e^{-\bar\epsilon n}$.

Proof Since

  $|P^{\ne}_n(y \mid x_i, \tau'_i) - P_n(y \mid x_i, \tau'_i)| \le 1/K_n = o(1)$,

we can, and will, ignore the upper indices $\ne$ in the statement of the lemma. Define

  $I_n(x, \sigma) := 0$ if $\max_y |P(y \mid x) - P_n(y \mid x, \sigma)| \le \epsilon$; $I_n(x, \sigma) := 1$ if $\max_y |P(y \mid x) - P_n(y \mid x, \sigma)| \ge 2\epsilon$; and $I_n(x, \sigma) := (\max_y |P(y \mid x) - P_n(y \mid x, \sigma)| - \epsilon)/\epsilon$ otherwise

(intuitively, $I_n(x, \sigma)$ is a "soft version" of the indicator $I\{\max_y |P(y \mid x) - P_n(y \mid x, \sigma)| > \epsilon\}$). The main tool in this proof (and several other proofs in this section) will be McDiarmid's theorem (see, e.g., Devroye et al., 1996, Theorem 9.2). First we check the possibility of its application. If we replace an extended object $(x_j, \tau'_j)$ by another extended object $(\bar{x}_j, \bar{\tau}_j)$, the expression $\sum_{i=1}^n I_n(x_i, \tau'_i)$ will change as follows:

- the addend $I_n(x_i, \tau'_i)$ for $i = j$ changes by 1 at most;
- the addends $I_n(x_i, \tau'_i)$ for $i \ne j$ such that neither $(x_j, \tau'_j)$ nor $(\bar{x}_j, \bar{\tau}_j)$ are among the $K_n$ nearest neighbours of $(x_i, \tau'_i)$ do not change at all;
- the sum over the at most $4K_n$ (see below) addends $I_n(x_i, \tau'_i)$ for $i \ne j$ such that either $(x_j, \tau'_j)$ or $(\bar{x}_j, \bar{\tau}_j)$ (or both) are among the $K_n$ nearest neighbours of $(x_i, \tau'_i)$ can change by at most

  $4 K_n \cdot \frac{1}{\epsilon} \cdot \frac{1}{K_n} = \frac{4}{\epsilon}$.  (25)

The left-hand side of (25) reflects the following facts: the change in $P_n(y \mid x_i, \tau'_i)$ for $i \ne j$ is at most $1/K_n$; the number of $i \ne j$ such that $(x_j, \tau'_j)$ is among the $K_n$ nearest neighbours of $(x_i, \tau'_i)$ does not exceed $2K_n$ (since the extended objects are linearly ordered and (3) is used for breaking distance ties); analogously, the number of $i \ne j$ such that $(\bar{x}_j, \bar{\tau}_j)$ is among the $K_n$ nearest neighbours of $(x_i, \tau'_i)$ does not exceed $2K_n$. Therefore, by McDiarmid's theorem,

  $\mathbb{P}\left\{\frac{1}{n}\sum_{i=1}^n I_n(x_i, \tau'_i) - \mathbb{E}\left(\frac{1}{n}\sum_{i=1}^n I_n(x_i, \tau'_i)\right) > \epsilon\right\} \le \exp\left(\frac{-2\epsilon^2 n}{(1 + 4/\epsilon)^2}\right) = \exp\left(\frac{-2\epsilon^4}{(4+\epsilon)^2}\, n\right)$.  (26)

Next we find:

  $\mathbb{E}\left(\frac{1}{n}\sum_{i=1}^n I_n(x_i, \tau'_i)\right) = \mathbb{E}\, I_n(x_n, \tau'_n) \le \mathbb{E}\, I_{n-1}(x_n, \tau'_n) + o(1) \le \mathbb{E}\,(P_X \times U)\{(x, \sigma) : \max_y |P(y \mid x) - P_{n-1}(y \mid x, \sigma)| > \epsilon\} + o(1) \le e^{-\bar\epsilon n} + \epsilon + o(1) \le 2\epsilon$

(the penultimate inequality follows from Lemma 9) from some $n$ on. In combination with (26) this implies

  $\forall^\infty n : \mathbb{P}\left\{\frac{1}{n}\sum_{i=1}^n I_n(x_i, \tau'_i) > 3\epsilon\right\} \le \exp\left(\frac{-2\epsilon^4}{(4+\epsilon)^2}\, n\right)$;

in particular,

  $\mathbb{P}\left\{\frac{\#\{i : \max_y |P(y \mid x_i) - P_n(y \mid x_i, \tau'_i)| \ge 2\epsilon\}}{n} > 3\epsilon\right\} \le \exp\left(\frac{-2\epsilon^4}{(4+\epsilon)^2}\, n\right)$.

Replacing $3\epsilon$ by $\epsilon$, we obtain that, from some $n$ on,

  $\mathbb{P}\left\{\frac{\#\{i : \max_y |P(y \mid x_i) - P_n(y \mid x_i, \tau'_i)| > \epsilon\}}{n} > \epsilon\right\} \le \exp\left(\frac{-2(\epsilon/3)^4}{(4+\epsilon/3)^2}\, n\right)$,

which completes the proof.

We say that an extended example $(x_i, \tau_i, y_i)$, $i = 1, \ldots, n$, is n-strange if $y_i \ne \hat{y}_n(x_i, \tau'_i)$; otherwise, $(x_i, \tau_i, y_i)$ will be called n-ordinary. We will assume that the pairs $(f^{\ne}_n(x_i, \tau'_i), \tau''_i)$, $i = 1, \ldots, n$, are all different for all $n$; even more than that, we will assume that the $\tau''_i$ are all different (we can do so since the probability of this event is one).

[Figure 6: Cases F(c) = S(δ) (left) and F(c) > S(δ) (right). The vertical bands of width ε determine the division of the first n extended examples into five classes]

Lemma 11 Suppose (7) is satisfied and $\delta \le \delta_0$. With probability one, the $\lfloor(1 - S(\delta))n\rfloor$ extended examples with the largest (in the sense of the lexicographic order) $(f^{\ne}_n(x_i, \tau'_i), \tau''_i)$ among $(x_1, \tau_1, y_1), \ldots, (x_n, \tau_n, y_n)$ contain at most $n\delta + o(n)$ n-strange extended examples as $n \to \infty$.

Proof Define

  $c := \sup\{\beta : F(\beta) \le S(\delta)\}$.

It is clear that $0 \le c \le 1$. Our proof will work both in the case where $F(c) = S(\delta)$ and in the case where $F(c) > S(\delta)$, as illustrated in Figure 6. Let $\epsilon > 0$ be a small constant (we will let $\epsilon \to 0$ eventually). Define a "threshold" $(c'_n, c''_n) \in [0,1]^2$ requiring that

  $\mathbb{P}\left\{f(x_n) = c, (f_{n-1}(x_n, \tau'_n), \tau''_n) > (c'_n, c''_n)\right\} = F(c) - S(\delta) - \epsilon$  (27)

if $F(c) > S(\delta)$; we assume that $\epsilon$ is small enough for

  $2\epsilon < F(c) - S(\delta)$  (28)

to hold. Among other things this will ensure the validity of the definition (27). If $F(c) = S(\delta)$, we set $(c'_n, c''_n) := (c + \epsilon, 0)$; in any case, we will have

  $\mathbb{P}\left\{f(x_n) = c, (f_{n-1}(x_n, \tau'_n), \tau''_n) > (c'_n, c''_n)\right\} \le F(c) - S(\delta) + \epsilon$.  (29)

Let us say that an extended example $(x_i, \tau_i, y_i)$ is above the threshold if $(f^{\ne}_n(x_i, \tau'_i), \tau''_i) > (c'_n, c''_n)$; otherwise, we say it is below the threshold. Divide the first $n$ extended examples $(x_i, \tau_i, y_i)$, $i = 1, \ldots, n$, into five classes:

Class I: Those satisfying $f(x_i) \le c - 2\epsilon$.
Class II: Those that satisfy $f(x_i) = c$ and are below the threshold.
Class III: Those satisfying $c - 2\epsilon < f(x_i) < c + 2\epsilon$ but not $f(x_i) = c$.
Class IV: Those that satisfy $f(x_i) = c$ and are above the threshold.
Class V: Those satisfying $f(x_i) \ge c + 2\epsilon$.

First we explain the general idea of the proof. The threshold $(c'_n, c''_n)$ was chosen so that approximately $\lfloor(1 - S(\delta))n\rfloor$ of the available extended examples will be above the threshold. Because of this, the extended examples above the threshold will essentially be the $\lfloor(1 - S(\delta))n\rfloor$ extended examples with the largest $(f^{\ne}_n(x_i, \tau'_i), \tau''_i)$ referred to in the statement of the lemma. For each of the five classes we will be interested in the following questions: How many extended examples are there in the class? How many of those are above the threshold? How many of those above the threshold are n-strange? If the sum of the answers to the last question does not exceed $n\delta$ by too much, we are done. With this plan in mind, we start the formal proof. (Of course, we will not be following the plan literally: for example, if a class is very small, we do not need to answer the second and third questions.)

The first step is to show that

  $c - \epsilon \le c'_n \le c + \epsilon$  (30)

from some $n$ on; this will ensure that the classes are conveniently separated from each other. We only need to consider the case $F(c) > S(\delta)$. The inequality $c'_n \le c + \epsilon$ follows from

  $\forall^\infty n : \mathbb{P}\{f(x_n) = c, f_{n-1}(x_n, \tau'_n) > c + \epsilon\} \le \epsilon < F(c) - S(\delta) - \epsilon$;

simply combine Lemma 9 with (28). The inequality $c - \epsilon \le c'_n$ follows in a similar way from

  $\forall^\infty n : \mathbb{P}\{f(x_n) = c, f_{n-1}(x_n, \tau'_n) > c - \epsilon\} = \mathbb{P}\{f(x_n) = c\} - \mathbb{P}\{f(x_n) = c, f_{n-1}(x_n, \tau'_n) \le c - \epsilon\} \ge F(c) - F(c-) - \epsilon \ge F(c) - S(\delta) - \epsilon$.

Now we are ready to analyze the composition of our five classes.

Among the Class I extended examples at most

  $\epsilon n$  (31)

will be above the threshold from some $n$ on almost surely (by Lemma 10 and the Borel-Cantelli lemma). None of the Class II extended examples will be above the threshold, by definition. The fraction of Class III extended examples among the first $n$ extended examples will tend to

  $F(c + 2\epsilon) - F(c) + F(c-) - F(c - 2\epsilon)$  (32)

as $n \to \infty$ almost surely.

To estimate the number $N^{\mathrm{IV}}_n$ of Class IV extended examples among the first $n$ extended examples, we use McDiarmid's theorem. If one extended example is replaced by another, $N^{\mathrm{IV}}_n$ will change by at most $2K_n + 1$ (since this extended example can affect $f^{\ne}_n(x_i, \tau'_i)$ for at most $2K_n$ other extended examples $(x_i, \tau_i, y_i)$). Therefore,

  $\mathbb{P}\left\{\left|\frac{1}{n} N^{\mathrm{IV}}_n - \frac{1}{n}\mathbb{E} N^{\mathrm{IV}}_n\right| \ge \epsilon\right\} \le 2 e^{-2\epsilon^2 n / (2K_n+1)^2}$;

the assumption $K_n = o(\sqrt{n/\ln n})$ and the Borel-Cantelli lemma imply that

  $\left|\frac{1}{n} N^{\mathrm{IV}}_n - \frac{1}{n}\mathbb{E} N^{\mathrm{IV}}_n\right| \le \epsilon$

from some $n$ on almost surely. Since

  $\frac{1}{n}\mathbb{E} N^{\mathrm{IV}}_n = \mathbb{P}\left\{f(x_n) = c, (f_{n-1}(x_n, \tau'_n), \tau''_n) > (c'_n, c''_n)\right\} \ge F(c) - S(\delta) - \epsilon$,

as in (27), we have

  $N^{\mathrm{IV}}_n \ge (F(c) - S(\delta) - 2\epsilon)\, n$  (33)

from some $n$ on almost surely. Of course, all these examples are above the threshold.

Now we estimate the number $N^{\mathrm{IV,str}}_n$ of n-strange extended examples of Class IV. Again McDiarmid's theorem implies that

  $\left|\frac{1}{n} N^{\mathrm{IV,str}}_n - \frac{1}{n}\mathbb{E} N^{\mathrm{IV,str}}_n\right| \le \epsilon$

from some $n$ on almost surely. Now, from some $n$ on,

  $\frac{1}{n}\mathbb{E} N^{\mathrm{IV,str}}_n = \mathbb{P}\left\{f(x_n) = c, (f_{n-1}(x_n, \tau'_n), \tau''_n) > (c'_n, c''_n), \hat{y}_n(x_n, \tau'_n) \ne y_n\right\}$
  $= \mathbb{E}\left[\left(1 - P_{Y|X}(\hat{y}_n(x_n, \tau'_n) \mid x_n)\right) I\{f(x_n) = c, (f_{n-1}(x_n, \tau'_n), \tau''_n) > (c'_n, c''_n)\}\right]$
  $\le e^{-\bar\epsilon n} + \epsilon + (1 - c + 2\epsilon)\, \mathbb{P}\{f(x_n) = c, (f_{n-1}(x_n, \tau'_n), \tau''_n) > (c'_n, c''_n)\}$
  $= e^{-\bar\epsilon n} + \epsilon + (1 - c + 2\epsilon)(F(c) - S(\delta) - \epsilon)$  (34)
  $\le (F(c) - S(\delta))(1 - c) + 4\epsilon$  (35)

in the case $F(c) > S(\delta)$; the first inequality in this chain follows from Lemma 9: indeed, this lemma implies that, unless an event of the small probability $e^{-\bar\epsilon n} + \epsilon$ happens,

  $P(\hat{y}_n(x_n, \tau'_n) \mid x_n) \ge P_{n-1}(\hat{y}_n(x_n, \tau'_n) \mid x_n, \tau'_n) - \epsilon = f_{n-1}(x_n, \tau'_n) - \epsilon \ge f(x_n) - 2\epsilon$.  (36)

If $F(c) = S(\delta)$, the lines (34) and (35) of that chain have to be changed to

  $e^{-\bar\epsilon n} + \epsilon + (1 - c + 2\epsilon)\, \mathbb{P}\{f(x_n) = c, f_{n-1}(x_n, \tau'_n) > c + \epsilon\} \le e^{-\bar\epsilon n} + \epsilon + (1 - c + 2\epsilon)(e^{-\bar\epsilon n} + \epsilon) \le 4\epsilon$
(where the obvious modification of Lemma 9 with all "$> \epsilon$" changed to "$\ge \epsilon$" is used), but the inequality between the extreme terms of the chain still holds. Therefore, the number of n-strange Class IV extended examples does not exceed

  $\left((F(c) - S(\delta))(1 - c) + 5\epsilon\right) n$  (37)

from some $n$ on almost surely.

By the Borel strong law of large numbers, the fraction of Class V extended examples among the first $n$ extended examples will tend to

  $1 - F(c + 2\epsilon)$  (38)

as $n \to \infty$ almost surely. By Lemma 10, the Borel-Cantelli lemma, and (30), almost surely from some $n$ on at least

  $(1 - F(c + 2\epsilon) - 2\epsilon)\, n$  (39)

extended examples in Class V will be above the threshold.

Finally, we estimate the number $N^{\mathrm{V,str}}_n$ of n-strange extended examples of Class V among the first $n$ extended examples. By McDiarmid's theorem,

  $\left|\frac{1}{n} N^{\mathrm{V,str}}_n - \frac{1}{n}\mathbb{E} N^{\mathrm{V,str}}_n\right| \le \epsilon$

from some $n$ on almost surely. Now

  $\frac{1}{n}\mathbb{E} N^{\mathrm{V,str}}_n = \mathbb{P}\{f(x_n) \ge c + 2\epsilon, \hat{y}_n(x_n, \tau'_n) \ne y_n\} = \mathbb{E}\left[\left(1 - P_{Y|X}(\hat{y}_n(x_n, \tau'_n) \mid x_n)\right) I\{f(x_n) \ge c + 2\epsilon\}\right]$
  $\le e^{-\bar\epsilon n} + \epsilon + \mathbb{E}\left[(1 - f(x_n) + 2\epsilon)\, I\{f(x_n) \ge c + 2\epsilon\}\right] \le e^{-\bar\epsilon n} + 3\epsilon + \mathbb{E}\left[(1 - f(x_n))\, I\{f(x_n) \ge c + 2\epsilon\}\right]$
  $= e^{-\bar\epsilon n} + 3\epsilon + \int_0^1 (F(\beta) - F(c + 2\epsilon))^+ \, d\beta \le \int_0^1 (F(\beta) - F(c))^+ \, d\beta + 4\epsilon$

from some $n$ on. The first inequality follows from Lemma 9, as in (36). Therefore,

  $\frac{1}{n} N^{\mathrm{V,str}}_n \le \int_0^1 (F(\beta) - F(c))^+ \, d\beta + 5\epsilon$  (40)

from some $n$ on almost surely.

Summarizing, we can see that the total number of extended examples above the threshold among the first $n$ extended examples will be at least

  $(F(c) - S(\delta) - 2\epsilon + 1 - F(c+2\epsilon) - 2\epsilon)\, n = (1 - S(\delta) + F(c) - F(c+2\epsilon) - 4\epsilon)\, n$  (41)

(see (33) and (39)) from some $n$ on almost surely. The number of n-strange extended examples among them will not exceed

  $\left(\epsilon + F(c+2\epsilon) - F(c) + F(c-) - F(c-2\epsilon) + \epsilon + (F(c) - S(\delta))(1-c) + 5\epsilon + \int_0^1 (F(\beta) - F(c))^+ \, d\beta + 5\epsilon\right) n$
  $= \left(F(c+2\epsilon) - F(c) + F(c-) - F(c-2\epsilon) + (F(c) - S(\delta))(1-c) + \int_0^1 (F(\beta) - F(c))^+ \, d\beta + 12\epsilon\right) n$  (42)

(see (31), (32), (37), and (40)) from some $n$ on almost surely. Combining (41) and (42), we can see that the number of n-strange extended examples among the $\lfloor(1 - S(\delta))n\rfloor$ extended examples with the largest $(f^{\ne}_n(x_i, \tau'_i), \tau''_i)$ does not exceed

  $\left(F(c+2\epsilon) - F(c) + F(c-) - F(c-2\epsilon) + (F(c) - S(\delta))(1-c) + \int_0^1 (F(\beta) - F(c))^+ \, d\beta + 12\epsilon\right) n + \left(F(c+2\epsilon) - F(c) + 4\epsilon\right) n$
  $= \left(2(F(c+2\epsilon) - F(c)) + F(c-) - F(c-2\epsilon) + (F(c) - S(\delta))(1-c) + \int_0^1 (F(\beta) - F(c))^+ \, d\beta + 16\epsilon\right) n$

from some $n$ on almost surely. Since $\epsilon$ can be arbitrarily small, the coefficient in front of $n$ in the last expression can be made arbitrarily close to

  $(F(c) - S(\delta))(1-c) + \int_0^1 (F(\beta) - F(c))^+ \, d\beta = \int_0^1 (F(\beta) - S(\delta))^+ \, d\beta = \delta$,

which completes the proof.

Lemma 12 Suppose (7) is satisfied. The fraction of n-strange extended examples among the first $n$ extended examples $(x_i, \tau_i, y_i)$ approaches $\delta_0$ asymptotically with probability one.

Proof sketch The lemma is not difficult to prove using McDiarmid's theorem and the fact that, by Lemma 10, $P(\hat{y}_n(x_i, \tau'_i) \mid x_i)$ will typically differ little from $f(x_i)$. Notice, however, that the part that we really need in this paper (that the fraction of n-strange extended examples does not exceed $\delta_0 + o(1)$ as $n \to \infty$ with probability one) is just a special case of Lemma 11, corresponding to $\delta = \delta_0$.

Lemma 13 Suppose (7) is satisfied and $\delta > \delta_0$. The fraction of n-ordinary extended examples among the $\lfloor C(\delta) n \rfloor$ extended examples $(x_i, \tau_i, y_i)$, $i = 1, \ldots, n$, with the lowest $(f^{\ne}_n(x_i, \tau'_i), \tau''_i)$ does not exceed $\delta - \delta_0 + o(1)$ as $n \to \infty$ with probability one.

Lemma 13 can be proved analogously to Lemma 11.

Lemma 14 Let $\mathcal{F}_1 \supseteq \mathcal{F}_2 \supseteq \cdots$ be a decreasing sequence of $\sigma$-algebras and $\xi_1, \xi_2, \ldots$ be a bounded adapted (in the sense that $\xi_n$ is $\mathcal{F}_n$-measurable for all $n$) sequence of random variables such that

  $\limsup_{n\to\infty} \mathbb{E}(\xi_n \mid \mathcal{F}_{n+1}) \le 0$ a.s.

Then

  $\limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \xi_i \le 0$ a.s.

Proof Replacing, if necessary, $\xi_n$ by $\xi_n - \mathbb{E}(\xi_n \mid \mathcal{F}_{n+1})$, we reduce our task to the following special case (a reverse Borel strong law of large numbers): if $\xi_1, \xi_2, \ldots$ is a bounded reverse martingale difference, in the sense of being adapted and satisfying $\forall n : \mathbb{E}(\xi_n \mid \mathcal{F}_{n+1}) = 0$, then

  $\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \xi_i = 0$ a.s.  (43)

Fix a bounded reverse martingale difference $\xi_1, \xi_2, \ldots$; our goal is to prove (43). By the martingale version of Hoeffding's inequality (Devroye et al., 1996, Theorem 9.1) applied to the martingale difference $(\xi_i, \mathcal{F}_i)$, $i = n, \ldots, 1$,
  $\mathbb{P}\left\{\left|\frac{1}{n}\sum_{i=1}^n \xi_i\right| \ge \epsilon\right\} \le 2 e^{-2\epsilon^2 n / (2C)^2}$,  (44)

where $C$ is an upper bound on $\sup_n |\xi_n|$. Combined with the Borel-Cantelli-Lévy lemma, (44) implies (43).

Now we can sketch the proof of Proposition 6. Define $\mathcal{F}_n$, $n = 1, 2, \ldots$, to be the $\sigma$-algebra on $\tilde{Z}^\infty$ generated by the multiset of the first $n-1$ extended examples $(x_i, \tau_i, y_i)$, $i = 1, \ldots, n-1$, and the sequence of extended examples $(x_i, \tau_i, y_i)$, $i = n, n+1, \ldots$ (starting from the $n$th extended example).

Suppose first that $\delta \le \delta_0$. Consider the $\lfloor(1 - S(\delta - \epsilon))n\rfloor$ extended examples with the largest $(f^{\ne}_n(x_i, \tau'_i), \tau''_i)$ among $(x_1, \tau_1, y_1), \ldots, (x_n, \tau_n, y_n)$, where $\epsilon \in (0, \delta)$ is a small constant. Let us show that each of these examples will be predicted with certainty from the other extended examples in the sequence $(x_1, \tau_1, y_1), \ldots, (x_n, \tau_n, y_n)$, from some $n$ on. We will be assuming $n$ large enough.

Let $(x_k, \tau_k, y_k)$ be the extended example with the $(\lfloor(\delta - \epsilon/2)n\rfloor + 1)$th largest (in the sense of the lexicographic order) $(f^{\ne}_n(x_i, \tau'_i), \tau''_i)$ among all n-strange extended examples $(x_i, \tau_i, y_i)$, $i = 1, \ldots, n$. (Remember that all $\tau''_i$ are assumed to be different.) Let $(x_j, \tau_j, y_j)$ be one of the $\lfloor(1 - S(\delta - \epsilon))n\rfloor$ extended examples with the largest $(f^{\ne}_n(x_i, \tau'_i), \tau''_i)$ and let $y \in Y$ be a label different from $\hat{y}_n(x_j, \tau'_j)$. It suffices to prove that

  $\tau''_j \ge \dfrac{\#\{i = 1, \ldots, n : \alpha^y_i \ge \alpha^y_j\} - n\delta}{\#\{i = 1, \ldots, n : \alpha^y_i = \alpha^y_j\}}$  (45)

(cf. (6) in §3.2), where all $\alpha^y_i$ are computed as the $\alpha_i$ in (11) from the sequence $(x_1, \tau_1, y_1), \ldots, (x_n, \tau_n, y_n)$ with $y_j$ replaced by $y$. It will be more convenient to write (45) in the form

  $\#\{i : \alpha^y_i > \alpha^y_j\} + (1 - \tau''_j)\, \#\{i : \alpha^y_i = \alpha^y_j\} \le n\delta$.

Since $\alpha^y_j = f^{\ne}_n(x_j, \tau'_j)$ and $\alpha^y_i \ne \alpha_i$ for at most $2K_n + 1$ values of $i$ (indeed, changing $y_j$ will affect at most $2K_n + 1$ $\alpha$'s), it suffices to prove

  $\#\{i : \alpha_i > f^{\ne}_n(x_j, \tau'_j)\} + (1 - \tau''_j)\, \#\{i : \alpha_i = f^{\ne}_n(x_j, \tau'_j)\} \le n(\delta - \bar\epsilon)$,  (46)

where $\bar\epsilon < \epsilon$ is a positive constant. Since $(f^{\ne}_n(x_j, \tau'_j), \tau''_j) \ge (\alpha_k, \tau''_k)$ (indeed, by Lemma 11, there are less than $(\delta - \epsilon/2)n$ n-strange extended examples among the $\lfloor(1 - S(\delta - \epsilon))n\rfloor$ extended examples with the largest $(f^{\ne}_n(x_i, \tau'_i), \tau''_i)$), (46) will follow from

  $\#\{i : \alpha_i > \alpha_k\} + (1 - \tau''_k)\, \#\{i : \alpha_i = \alpha_k\} \le n(\delta - \bar\epsilon)$.  (47)

If $\#\{i : \alpha_i = \alpha_k\} \le \epsilon^3 n$, the left-hand side of (47) does not exceed $(\delta - \epsilon/2)n + \epsilon^3 n \le n(\delta - \bar\epsilon)$; so we can, and will, assume without loss of generality that

  $\#\{i : \alpha_i = \alpha_k\} > \epsilon^3 n$.  (48)

Since the $\tau''_i$ for the extended examples satisfying $\alpha_i = \alpha_k$ are output according to the uniform distribution $U$, the expected value of $1 - \tau''_k$ is about

  $\dfrac{(\delta - \epsilon/2)n - \#\{i : \alpha_i > \alpha_k\}}{\#\{i : \alpha_i = \alpha_k\}}$,

and so by Hoeffding's inequality and the Borel-Cantelli lemma we will have (from some $n$ on)

  $1 - \tau''_k \le \dfrac{(\delta - \epsilon/2)n - \#\{i : \alpha_i > \alpha_k\}}{\#\{i : \alpha_i = \alpha_k\}} + \epsilon^2$,  (49)

remembering (48). Inequality (47) will hold because its left-hand side can be bounded using (49) as

  $\#\{i : \alpha_i > \alpha_k\} + (1 - \tau''_k)\, \#\{i : \alpha_i = \alpha_k\} \le (\delta - \epsilon/2)n + \epsilon^2\, \#\{i : \alpha_i = \alpha_k\} \le (\delta - \epsilon/2 + \epsilon^2)\, n \le (\delta - \bar\epsilon)\, n$.

The assertion we have just proved means that, almost surely from some $n$ on,

  $\mathbb{P}(\{\mathrm{unc}_n = 0\} \mid \mathcal{F}_{n+1}) \ge \frac{\lfloor(1 - S(\delta - \epsilon))n\rfloor}{n} \ge 1 - S(\delta - \epsilon) - \frac{1}{n}$.

Since $\epsilon$ can be arbitrarily small and $S$ is continuous (Vovk, 2002b, Lemma 2), this implies

  $\limsup_{n\to\infty} \mathbb{E}(\mathrm{unc}_n \mid \mathcal{F}_{n+1}) \le S(\delta)$ a.s.
By Lemma 14 this implies, in turn,

  $\limsup_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \mathrm{unc}_i \le S(\delta)$ a.s.,

which coincides with (17). If $\delta \ge \delta_0$, Lemma 12 implies that

  $\lim_{n\to\infty} \mathbb{E}(\mathrm{unc}_n \mid \mathcal{F}_{n+1}) = 0$ a.s.

(and actually $\mathbb{E}(\mathrm{unc}_n \mid \mathcal{F}_{n+1}) = 0$ from some $n$ on if $\delta > \delta_0$); in combination with Lemma 14 this again implies (17).

Inequality (18) is treated in a similar way to (17). Lemmas 12 and 13 imply that

  $\liminf_{n\to\infty} \mathbb{E}(\mathrm{emp}_n \mid \mathcal{F}_{n+1}) \ge C(\delta)$ a.s.  (50)

(this inequality is vacuously true when $\delta \le \delta_0$). Another application of Lemma 14 gives

  $\liminf_{n\to\infty} \frac{1}{n} \sum_{i=1}^n \mathrm{emp}_i \ge C(\delta)$ a.s.,

i.e., (18).

Remark The derivation of Proposition 6 from Lemmas 11-14 would be very simple if we defined the individual strangeness measure by, say,

  $\alpha_i := (-f^{\ne}_n(x_i, \sigma_i), \sigma_i)$ if $y_i = \hat{y}_n(x_i, \sigma_i)$, and $\alpha_i := (f^{\ne}_n(x_i, \sigma_i), \sigma_i)$ otherwise

(with the lexicographic order on the $\alpha$'s) instead of (11) (in which case the denominator of (6) would be 1 almost surely). Our definition (11), however, is simpler and, most importantly, facilitates the proof of Proposition 3. Another simplification would be to use Lemma 11 (applied to $\delta := \delta - C(\delta)$) instead of Lemma 13 in the derivation of (50); we preferred a more symmetric picture.

6. Conclusion

We have shown that there exist universal well-calibrated region predictors, thus satisfying, to some degree, the desiderata mentioned in §1: well-calibratedness and optimal performance. Notice, however, that the ways in which these two desiderata are satisfied are very different: the well-calibratedness holds in a very specific finitary sense, since the errors have probability $\delta$ and are independent, whereas the optimal performance is achieved only asymptotically. An important direction of further research is to obtain non-asymptotic results about TCM's optimality. A natural setting is where we have a Bayesian model for Reality's strategy, $\{P_\theta : \theta \in \Theta\}$ with a prior $\mu(d\theta)$ on $\Theta$, and our goal is to minimize $\mathrm{Unc}^\delta_n$ under this model. The intuition behind this setting is that we do not really believe that the data is generated from our model and so prefer a predictor that is well-calibrated regardless of the correctness of the model; but if the model is correct, we would like to have an optimal performance. A special case of this setting, with $\mu(d\theta)$ concentrated at one point, was considered in Vovk (2002b); however, all results in that paper are asymptotic.

Acknowledgments

This is a full version of the conference paper (Vovk, 2003); I am grateful to both sets of anonymous referees (for the conference and journal versions), whose comments and suggestions helped to improve the quality of presentation and correct several mistakes. This work was partially supported by EPSRC (grant GR/R46670/01), BBSRC (grant 111/BIO14428), and EU (grant IST-1999-10226).

Appendix A. Notation

The following table contains, strictly speaking, not only the notation used in this paper but also the preferred use of symbols.

X: object space
Y: label space
Z: example space (Z = X × Y)
P: the probability distribution in Z generating individual examples $z_1 = (x_1, y_1), z_2 = (x_2, y_2), \ldots$
$\delta$: significance level
$\Gamma^\delta_n$: prediction region
$\mathrm{err}^\delta_n$: indicator of error at trial n
$\mathrm{unc}^\delta_n$: indicator of uncertain prediction at trial n
$\mathrm{emp}^\delta_n$: indicator of empty prediction at trial n
$\mathrm{Err}^\delta_n$: cumulative number of errors up to trial n
$\mathrm{Unc}^\delta_n$: cumulative number of uncertain predictions up to trial n
$\mathrm{Emp}^\delta_n$: cumulative number of empty predictions up to trial n
$\tau_n$: the nth random number used by a region predictor
$\tau'_n$, $\tau''_n$: two components of $\tau_n$, as defined in §3.1
$\tilde{X}$: the extended object space X × [0,1]
$\tilde{Z}$: the extended example space X × [0,1] × Y
$\le$, $<$: may refer to the lexicographic order on $[0,1]^2$, as defined in §3.1
$|x, \sigma|$: the absolute value of $(x, \sigma) \in [0,1]^2$, as defined in (4)
$A_n$: individual strangeness measure
$\alpha_i$: values taken by an individual strangeness measure
$\#E$: the size of set E
$K_n$: the number of nearest neighbours taken into account at trial n
$P^{\ne}_n(y \mid x_i, \sigma_i)$: empirical estimate of $P(y \mid x_i)$ without taking $y_i$ into account, as defined in (8)
$f^{\ne}_n(x_i, \sigma_i)$: corresponding empirical predictability function, (9)
$\hat{y}_n(x_i, \sigma_i)$: "choice function", as defined in (10)
D: finite set of significance levels
$P_X$: the marginal distribution of P in X
$P_{Y|X}$: the regular conditional distribution of $y \in Y$ given $x \in X$, where (x, y) is distributed as P
f(x): predictability of object x
F(β): predictability distribution function
S(δ): success curve, defined in (12)
C(δ): complementary success curve, defined in (13)
$\delta_0$: critical significance level, defined in (14)
$\overline{\mathrm{err}}$, $\overline{\mathrm{unc}}$, $\overline{\mathrm{emp}}$: "predictable" versions of err, unc, emp, as defined in §5.2
$\overline{\mathrm{Err}}$, $\overline{\mathrm{Unc}}$, $\overline{\mathrm{Emp}}$: "predictable" versions of Err, Unc, Emp
F(t−): the limit of F(u) as u approaches t from below
F(t+): the limit of F(u) as u approaches t from above
u ∨ v: the maximum of u and v, also denoted max(u, v)
u ∧ v: the minimum of u and v, also denoted min(u, v)
$t^+$: t ∨ 0
$t^-$: (−t) ∨ 0
U: the uniform probability distribution in [0,1]
$P_n(y \mid x, \sigma)$: empirical estimate of $P(y \mid x)$, defined by (23)
$f_n(x, \sigma)$: corresponding empirical predictability function, defined by (24)
$\mathbb{P}$: probability
$\mathbb{E}$: expectation
$\forall^\infty n$: from some n on
$I_E$: the indicator function of set E
References

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, second edition, 2001.

David R. Cox and David V. Hinkley. Theoretical Statistics. Chapman and Hall, London, 1974.

Luc Devroye, László Györfi, Adam Krzyżak, and Gábor Lugosi. On the strong universal consistency of nearest neighbor regression function estimates. Annals of Statistics, 22:1371-1385, 1994.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.

Yoav Freund, Yishay Mansour, and Robert E. Schapire. Generalization bounds for averaged classifiers. Annals of Statistics, 32(4), 2004.

Ronald L. Rivest and Robert H. Sloan. Learning complicated concepts reliably and usefully. In Proceedings of the First Annual Conference on Computational Learning Theory, pages 69-79, San Mateo, CA, 1988. Morgan Kaufmann.

Craig Saunders, Alex Gammerman, and Vladimir Vovk. Transduction with confidence and credibility. In Thomas Dean, editor, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, volume 2, pages 722-726. Morgan Kaufmann, 1999.

Albert N. Shiryaev. Probability. Springer, New York, second edition, 1996.

Vladimir Vovk. On-line Confidence Machines are well-calibrated. In Proceedings of the Forty Third Annual Symposium on Foundations of Computer Science, pages 187-196, Los Alamitos, CA, 2002a. IEEE Computer Society.

Vladimir Vovk. Asymptotic optimality of Transductive Confidence Machine. In Proceedings of the Thirteenth International Conference on Algorithmic Learning Theory, volume 2533 of Lecture Notes in Artificial Intelligence, pages 336-350, Berlin, 2002b. Springer.

Vladimir Vovk. Universal well-calibrated algorithm for on-line classification. In Bernhard Schölkopf and Manfred K. Warmuth, editors, Learning Theory and Kernel Machines: Sixteenth Annual Conference on Learning Theory and Seventh Kernel Workshop, volume 2777 of Lecture Notes in Artificial Intelligence, pages 358-372, Berlin, 2003. Springer.

Vladimir Vovk, Alex Gammerman, and Craig Saunders. Machine-learning applications of algorithmic randomness. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 444-453, San Francisco, CA, 1999. Morgan Kaufmann.