
1 Learning, Regularity, and Compression

Overview
The task of inductive inference is to find laws or regularities underlying some given set of data. These laws are then used to gain insight into the data or to classify or predict future data. The minimum description length (MDL) principle is a general method for inductive inference, based on the idea that the more we are able to compress (describe in a compact manner) a set of data, the more regularities we have found in it and therefore, the more we have learned from the data. In this chapter we give a first, preliminary and informal introduction to this principle.

Contents
In Sections 1.1 and 1.2 we discuss some of the fundamental ideas relating description length and regularity. In Section 1.3 we describe what was historically the first attempt to formalize these ideas. In Section 1.4 we explain the problems with using the original formalization in practice, and indicate what must be done to make the ideas practicable. Section 1.5 introduces the practical forms of MDL we deal with in this book, as well as the crucial concept of "universal coding." Section 1.6 deals with some issues concerning model selection, which is one of the main MDL applications. The philosophy underlying MDL is discussed in Section 1.7. Section 1.8 shows how the ideas behind MDL are related to "Occam's razor." We end in Section 1.9 with a brief historical overview of the field and its literature.

Fast Track
This chapter discusses, in an informal manner, several of the complicated issues we will deal with in this book. It is therefore essential for readers without prior exposure to MDL. Readers who are familiar with the basic ideas behind MDL may just want to look at the boxes.

1.1 Regularity and Learning

We are interested in developing a method for learning the laws and regularities in data. The following example will illustrate what we mean by this and give a first idea of how it can be related to descriptions of data.

Example 1.1 We start by considering binary data. Consider the following three sequences. We assume that each sequence is 10000 bits long, and we just list the beginning and the end of each sequence.

    00010001000100010001 ... 0001000100010001000100010001    (1.1)
    01110100110100100110 ... 1010111010111011000101100010    (1.2)
    00011000001010100000 ... 0010001000010000001000110000    (1.3)

The first of these three sequences is a 2500-fold repetition of 0001. Intuitively, the sequence looks regular; there seems to be a simple "law" underlying it; it might make sense to conjecture that future data will also be subject to this law, and to predict that future data will behave according to this law. The second sequence has been generated by tosses of a fair coin. It is, intuitively speaking, as "random as possible," and in this sense there is no regularity underlying it.¹ Indeed, we cannot seem to find such a regularity either when we look at the data. The third sequence contains exactly four times as many 0s as 1s. It looks less regular, more random than the first; but it looks less random than the second. There is still some discernible regularity in these data, but of a statistical rather than of a deterministic kind. Again, noticing that such a regularity is there and predicting that future data will behave according to the same regularity seems sensible.

1. Unless we call "generated by a fair coin toss" a "regularity" too. There is nothing wrong with that view; the point is that, the more we can compress a sequence, the more regularity we have found. One can avoid all terminological confusion about the concept of "regularity" by making it relative to something called a "base measure," but that is beyond the scope of this book (Li and Vitányi 1997).
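As a rough illustration of this ordering of the three sequence types (this little experiment is not from the book: zlib is an off-the-shelf general-purpose compressor applied to the character strings, so the absolute byte counts mean little, but the ranking regular < biased < random is the point):

    # Illustrative only: compress a repetitive, a biased-coin, and a fair-coin string.
    import random
    import zlib

    random.seed(1)
    n = 10000
    seq_regular = "0001" * (n // 4)                                 # like sequence (1.1)
    seq_biased = "".join(random.choice("00001") for _ in range(n))  # like (1.3): about 4 times as many 0s as 1s
    seq_random = "".join(random.choice("01") for _ in range(n))     # like (1.2): fair coin

    for name, s in [("regular", seq_regular), ("biased", seq_biased), ("random", seq_random)]:
        print(name, len(zlib.compress(s.encode(), 9)), "compressed bytes")

On a typical run the repetitive string shrinks to a few dozen bytes, while the fair-coin string stays close to 10000 bits (1250 bytes), the cost of spelling out its 10000 binary symbols literally; the biased string lands in between.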
1.2 Regularity and Compression

What do we mean by a "regularity"? The fundamental idea behind the MDL principle is the following insight: every regularity in the data can be used to compress the data, i.e., to describe it using fewer symbols than the number of symbols needed to describe the data literally. Such a description should always uniquely specify the data it describes; hence, given a description or encoding D′ of a particular sequence of data D, we should always be able to fully reconstruct D using D′.

For example, sequence (1.1) above can be described using only a few words; we have actually done so already: we have not given the complete sequence (which would have taken about the whole page) but rather just a one-sentence description of it that nevertheless allows you to reproduce the complete sequence if necessary. Of course, the description was done using natural language and we may want to do it in some more formal manner. If we want to identify regularity with compressibility, then it should also be the case that nonregular sequences cannot be compressed. Since sequence (1.2) has been generated by fair coin tosses, it should not be compressible. As we will show below, we can indeed prove that whatever description method C one uses, the length of the description of a sequence like (1.2) will, with overwhelming probability, be not much shorter than sequence (1.2) itself.

Note that the description of sequence (1.3) that we gave above does not uniquely define sequence (1.3). Therefore, it does not count as a "real" description: one cannot regenerate the whole sequence if one has the description. A unique description that still takes only a few words may look like this: "Sequence (1.3) is one of those sequences of 10000 bits in which there are four times as many 0s as there are 1s. In the lexicographical ordering of those sequences, it is number i." Here i is some large number that is explicitly spelled out in the description. In general, there are 2^n binary sequences of length n, while there are only C(n, νn) sequences of length n with a fraction of ν 1s. For every rational number ν except ν = 1/2, the ratio of C(n, νn) to 2^n goes to 0 exponentially fast as n increases (this is shown formally in Chapter 4; see Equation (4.36) on page 129 and the text thereunder; by the method used there one can also show that for ν = 1/2, it goes to 0 as O(1/√n)). It follows that compared to the total number of binary sequences of length 10000, the number of sequences of length 10000 with four times as many 0s as 1s is vanishingly small. Direct computation shows it is smaller than 2^7213, so that the ratio between the number of sequences with four times as many 0s as 1s and the total number of sequences is smaller than 2^(-2787). Thus, i ≤ 2^7213 ≪ 2^10000, and to write down i in binary we need approximately log₂ i ≈ 7213 ≪ 10000 bits.
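These numbers are easy to verify (a quick sanity check, not code from the book): the number of length-10000 binary strings containing exactly 2000 ones is the binomial coefficient C(10000, 2000), whose binary logarithm is just under 7213.

    # Check that C(10000, 2000) < 2^7213, so the index i fits in about 7213 bits,
    # and that such strings make up a fraction smaller than 2^-2787 of all 2^10000 strings.
    from math import comb

    n, ones = 10000, 2000
    count = comb(n, ones)
    bits = count.bit_length()   # smallest m with count < 2^m
    print(bits)                 # 7213
    print(n - bits)             # 2787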
Example 1.2 [Compressing Various Regular Sequences] The regularities underlying sequences (1.1) and (1.3) were of a very particular kind. To illustrate that any type of regularity in a sequence may be exploited to compress that sequence, we give a few more examples:

The Number π. Evidently, there exists a computer program for generating the first n digits of π – such a program could be based, for example, on an infinite series expansion of π. This computer program has constant size, except for the specification of n, which takes no more than O(log n) bits. Thus, when n is very large, the size of the program generating the first n digits of π will be very small compared to n: the sequence of digits of π is deterministic, and therefore extremely regular.

Physics Data. Consider a two-column table where the first column contains numbers representing various heights from which an object was dropped. The second column contains the corresponding times it took for the object to reach the ground. Assume both heights and times are recorded to some finite precision. In Section 1.5 we illustrate that such a table can be substantially compressed by first describing the coefficients of the second-degree polynomial that expresses Newton's law; then describing the heights; and then describing the deviation of the time points from the numbers predicted by that polynomial.

Natural Language. Most sequences of words are not valid sentences according to the English language. This fact can be exploited to substantially compress English text, as long as it is syntactically mostly correct: by first describing a grammar for English, and then describing an English text D with the help of that grammar (Grünwald 1996), D can be described using many fewer bits than are needed without the assumption that word order is constrained.

Description Methods
In order to formalize our idea, we have to replace the part of the descriptions above that made use of natural language by some formal language. For this, we need to fix a description method that maps sequences of data to their descriptions. Each such sequence will be encoded as another sequence of symbols coming from some finite or countably infinite coding alphabet. An alphabet is simply a countable set of distinct symbols. An example of an alphabet is the binary alphabet {0, 1}; the three data sequences above are sequences over the binary alphabet. A sequence over a binary alphabet will also be called a binary string. Sometimes our data will consist of real numbers rather than binary strings. In practice, however, such numbers are always truncated to some finite precision. We can then again model them as symbols coming from a finite data alphabet. More precisely, we are given a sample or, equivalently, data sequence D = (x1, ..., xn), where each xi is a member of some set X, called the space of observations or the sample space for one observation. The set of all potential samples of length n is denoted X^n and is called the sample space. We call xi a single observation or, equivalently, a data item. For a general note about how our terminology relates to the usual terminology in statistics, machine learning and pattern recognition, we refer to the box on page 72.

Without any loss of generality we may describe our data sequences as binary strings (this is explained in Chapter 3, Section 3.2.2). Hence all the description methods we consider map data sequences to sequences of bits. All description methods considered in MDL satisfy the unique decodability property: given a description D′, there is at most one ("unique") D that is encoded as D′. Therefore, given any description D′, one should be able to fully reconstruct the original sequence D. Semiformally:

Description Methods
Definition 1.1 A description method is a one-many relation from the sample space to the set of binary strings of arbitrary length.
A truly formal definition will be given in Chapter 3, Section 3.1. There we also explain how our notion of "description method" relates to the more common and closely related notion of a "code." Until then, the distinction between codes and description methods is not that important, and we use the symbol C to denote both concepts.

Compression and Small Subsets
We are now in a position to show that strings which are "intuitively" random cannot be substantially compressed. We equate intuitively random with "having been generated by independent tosses of a fair coin." We therefore have to prove that it is virtually impossible to substantially compress sequences that have been generated by fair coin tosses. By "it is virtually impossible" we mean "it happens with vanishing probability." Let us take some arbitrary but fixed description method C over the data alphabet consisting of the set of all binary sequences of length ≥ 1. Such a code maps binary strings to binary strings. Suppose we are given a data sequence of length n (in Example 1.1, n = 10000). Clearly, there are 2^n possible data sequences of length n. We see that only two of these can be mapped to a description of length 1 (since there are only two binary strings of length 1: 0 and 1). Similarly, only a subset of at most 2^m sequences can have a description of length m. This means that at most Σ_{i=1..m} 2^i < 2^(m+1) data sequences can have a description of length ≤ m. The fraction of data sequences of length n that can be compressed by more than k bits is therefore at most 2^(-k), and as such decreases exponentially in k. If data are generated by n tosses of a fair coin, then all 2^n possibilities for the data are equally probable, so the probability that we can compress the data by more than k bits is smaller than 2^(-k). For example, the probability that we can compress the data by more than 20 bits is smaller than one in a million.

Most Data Sets Are Incompressible
Suppose our goal is to encode a binary sequence of length n. Then:
• No matter what description method we use, only a fraction of at most 2^(-k) sequences can be compressed by more than k bits.
• Thus, if data are generated by fair coin tosses, then no matter what code we use, the probability that we can compress a sequence by more than k bits is at most 2^(-k).
• This observation will be generalized to data generated by an arbitrary distribution in Chapter 3. We then call it the no-hypercompression inequality. It can be found in the box on page 103.
Seen in this light, having a short description length for the data is equivalent to identifying the data as belonging to a tiny, very special subset out of all a priori possible data sequences; see also the box on page 31.
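Written out (a spelled-out version of the counting argument above, with L_C(x) denoting the length of the description that method C assigns to x, and "compressed by more than k bits" meaning L_C(x) ≤ n − k − 1):

    \[
    \#\{x \in \{0,1\}^n : L_C(x) \le n-k-1\} \;\le\; \sum_{m=1}^{n-k-1} 2^m \;<\; 2^{\,n-k},
    \]
    \[
    P\bigl(L_C(X_1,\dots,X_n) \le n-k-1\bigr) \;<\; 2^{\,n-k} \cdot 2^{-n} \;=\; 2^{-k},
    \]

where the probability is taken over n independent fair coin tosses, so that each of the 2^n outcomes has probability 2^(-n).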
1.3 Solomonoff's Breakthrough – Kolmogorov Complexity

It seems that what data are compressible and what are not is extremely dependent on the specific description method used. In 1964 – in a pioneering paper that may be regarded as the starting point of all MDL-related research (Solomonoff 1964) – Ray Solomonoff suggested the use of a universal computer language as a description method. By a universal language we mean a computer language in which a universal Turing machine can be implemented. All commonly used computer languages, like Pascal, LISP, C, are "universal." Every data sequence D can be encoded by a computer program P that prints D and then halts. We can define a description method that maps each data sequence D to the shortest program that prints D and then halts.² Clearly, this is a description method in our sense of the word in that it defines a 1-many (even 1-1) mapping from sequences over the data alphabet to a subset of the binary sequences. The shortest program for a sequence D is then interpreted as the optimal hypothesis for D.

Let us see how this works for sequence (1.1) above. Using a language similar to C, we can write a program

    for i = 1 to 2500; do { print "0001" }; halt

which prints sequence (1.1) but is clearly a lot shorter than it. If we want to make a fair comparison, we should rewrite this program in a binary alphabet; the resulting number of bits is still much smaller than 10000. The shortest program printing sequence (1.1) is at least as short as the program above, which means that sequence (1.1) is indeed highly compressible using Solomonoff's code. By the arguments of the previous section we see that, given an arbitrary description method C, sequences like (1.2) that have been generated by tosses of a fair coin are very likely not substantially compressible using C. In other words, the shortest program for sequence (1.2) is, with extremely high probability, not much shorter than the following:

    print "001110100110100001010 ... 101110110001011000100"; halt

This program has size about equal to the length of the sequence. Clearly, it is nothing more than a repetition of the sequence.

2. If there exists more than one shortest program, we pick the one that comes first in enumeration order.
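The same trick in an actual universal language (Python here, standing in for the "universal language"; only the additive constant given by the invariance theorem below depends on this choice) makes the contrast concrete: the program text for sequence (1.1) occupies a few hundred bits, against the 10000 bits it prints.

    program = 'print("0001" * 2500)'   # a program that prints sequence (1.1)
    print(len(program) * 8)            # about 160 bits of program text, versus 10000 bits of output
    exec(program)                      # reproduces the full 10000-bit sequence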
Kolmogorov Complexity
We define the Kolmogorov complexity of a sequence as the length of the shortest program that prints the sequence and then halts. Kolmogorov complexity has become a large subject in its own right; see (Li and Vitányi 1997) for a comprehensive introduction. The lower the Kolmogorov complexity of a sequence, the more regular or, equivalently, the less random, or, yet equivalently, the simpler it is. Measuring regularity in this way confronts us with a problem, since it depends on the particular programming language used. However, in his 1964 paper, Ray Solomonoff (Solomonoff 1964) showed that asymptotically it does not matter what programming language one uses, as long as it is universal: for every sequence of data D = (x1, ..., xn), let us denote by L_UL(D) the length of the shortest program for D using universal language UL. We can show that for every two universal languages UL1 and UL2, the difference between the two lengths, L_UL1(D) - L_UL2(D), is bounded by a constant that depends on UL1 and UL2 but not on the length n of the data sequence D. This implies that if we have a lot of data (n is large), then the difference in the two description lengths is negligible compared to the size of the data sequence. This result is known as the invariance theorem and was proved independently in (Solomonoff 1964), (Kolmogorov 1965) (hence the name Kolmogorov complexity), and (Chaitin 1969). The proof is based on the fact that one can write a compiler for every universal language UL1 in every other universal language UL2. Such a compiler is a computer program with length L_{1→2}. For example, we can write a program in Pascal that translates every C program into an equivalent Pascal program. The length (in bits) of this program would then be L_{C→Pascal}. We can simulate each program P1 written in language UL1 by a program P2 written in UL2 as follows: P2 consists of the compiler from UL1 to UL2, followed by P1. The length of program P2 is bounded by the length of P1 plus L_{1→2}. Hence for all data D, the maximal difference between L_UL1(D) and L_UL2(D) is bounded by max{L_{1→2}, L_{2→1}}, a constant which only depends on UL1 and UL2 but not on D.

1.4 Making the Idea Applicable

Problems
There are two major problems with applying Kolmogorov complexity to practical learning problems:

1. Uncomputability. The Kolmogorov complexity cannot be computed in general.
2. Large constants. The description length of any sequence of data involves a constant depending on the description method used.

By "Kolmogorov complexity cannot be computed" we mean the following: there is no computer program that, for every sequence of data D, when given D as input, returns the shortest program that prints D and halts. Neither can there be a program that for every data D returns only the length of the shortest program that prints D and then halts. Assuming such a program exists leads to a contradiction (Li and Vitányi 1997). The second problem relates to the fact that in many realistic settings, we are confronted with very small data sequences for which the invariance theorem is not very relevant, since the length of D is small compared to the constant L_{1→2}.
"Idealized" or "Algorithmic" MDL
If we ignore these problems, we may use Kolmogorov complexity as our fundamental concept and build a theory of idealized inductive inference on top of it. This road has been taken by Solomonoff (1964, 1978), starting with the 1964 paper in which he introduced Kolmogorov complexity, and by Kolmogorov, when he introduced the Kolmogorov minimum sufficient statistic (Li and Vitányi 1997; Cover and Thomas 1991). Both Solomonoff's and Kolmogorov's ideas have been substantially refined by several authors. We mention here P. Vitányi (Li and Vitányi 1997; Gács, Tromp, and Vitányi 2001; Vereshchagin and Vitányi 2002; Vereshchagin and Vitányi 2004; Vitányi 2005), who concentrated on Kolmogorov's ideas, and M. Hutter (2004), who concentrated on Solomonoff's ideas. Different authors have used different names for this area of research: "ideal MDL," "idealized MDL," or "algorithmic statistics." It is closely related to the celebrated theory of random sequences due to P. Martin-Löf and Kolmogorov (Li and Vitányi 1997). We briefly return to idealized MDL in Chapter 17, Section 17.8.

Practical MDL
Like most authors in the field, we concentrate here on non-idealized, practical versions of MDL that explicitly deal with the two problems mentioned above. The basic idea is to scale down Solomonoff's approach so that it does become applicable. This is achieved by using description methods that are less expressive than general-purpose computer languages. Such description methods C should be restrictive enough so that for any data sequence D, we can always compute the length of the shortest description of D that is attainable using method C; but they should be general enough to allow us to compress many of the intuitively "regular" sequences. The price we pay is that, using the "practical" MDL principle, there will always be some regular sequences which we will not be able to compress. But we already know that there can be no method for inductive inference at all which will always give us all the regularity there is, simply because there can be no automated method which for any sequence D finds the shortest computer program that prints D and then halts. Moreover, it will often be possible to guide a suitable choice of C by a priori knowledge we have about our problem domain. For example, below we consider a description method C that is based on the class of all polynomials, such that with the help of C we can compress all data sets which can meaningfully be seen as points on some polynomial.

1.5 Crude MDL, Refined MDL and Universal Coding

Let us recapitulate our main insights so far:

MDL: The Basic Idea
The goal of statistical inference may be cast as trying to find regularity in the data. "Regularity" may be identified with "ability to compress." MDL combines these two insights by viewing learning as data compression: it tells us that, for a given set of hypotheses H and data set D, we should try to find the hypothesis or combination of hypotheses in H that compresses D most.
This idea can be applied to all sorts of inductive inference problems, but it turns out to be most fruitful in (and its development has mostly concentrated on) problems of model selection and, more generally, dealing with overfitting. Here is a standard example (we explain the difference between "model" and "hypothesis" after the example).

Example 1.3 [Model Selection and Overfitting] Consider the points in Figure 1.1. We would like to learn how the y-values depend on the x-values. To this end, we may want to fit a polynomial to the points. Straightforward linear regression will give us the leftmost polynomial – a straight line that seems overly simple: it does not capture the regularities in the data well. Since for any set of n points there exists a polynomial of the (n-1)st degree that goes exactly through all these points, simply looking for the polynomial with the least error will give us a polynomial like the one in the second picture. This polynomial seems overly complex: it reflects the random fluctuations in the data rather than the general pattern underlying it. Instead of picking the overly simple or the overly complex polynomial, it seems more reasonable to prefer a relatively simple polynomial with small but nonzero error, as in the rightmost picture. This intuition is confirmed by numerous experiments on real-world data from a broad variety of sources (Rissanen 1989; Vapnik 1998; Ripley 1996): if one naively fits a high-degree polynomial to a small sample (set of data points), then one obtains a very good fit to the data. Yet if one tests the inferred polynomial on a second set of data coming from the same source, it typically fits this test data very badly in the sense that there is a large distance between the polynomial and the new data points. We say that the polynomial overfits the data. Indeed, all model selection methods that are used in practice either implicitly or explicitly choose a tradeoff between goodness-of-fit and complexity of the models involved. In practice, such tradeoffs lead to much better predictions of test data than one would get by adopting the "simplest" (one degree) or most "complex"³ ((n-1)-degree) polynomial. MDL provides one particular means of achieving such a tradeoff.

Figure 1.1 A simple, a complex and a tradeoff (third-degree) polynomial.

It will be useful to distinguish between "model", "model class" and "(point) hypothesis." This terminology is explained in the box on page 15, and will be discussed in more detail in Section 2.4, page 69. In our terminology, the problem described in Example 1.3 is a "point hypothesis selection problem" if we are interested in selecting both the degree of a polynomial and the corresponding parameters; it is a "model selection problem" if we are mainly interested in selecting the degree. To apply MDL to polynomial or other types of hypothesis and model selection, we have to make precise the somewhat vague insight "learning may be viewed as data compression." This can be done in various ways. We first explain the earliest and simplest implementation of the idea. This is the so-called two-part code version of MDL:

3. Strictly speaking, in our context it is not very accurate to speak of "simple" or "complex" polynomials; instead we should call the set of first degree polynomials "simple," and the set of 100th-degree polynomials "complex."

Crude Two-Part Version of MDL Principle (Informally Stated)
Let H1, H2, ... be a list of candidate models (e.g., H_k is the set of kth degree polynomials), each containing a set of point hypotheses (e.g., individual polynomials). The best point hypothesis H ∈ H1 ∪ H2 ∪ ... to explain the data D is the one which minimizes the sum L(H) + L(D|H), where
• L(H) is the length, in bits, of the description of the hypothesis; and
• L(D|H) is the length, in bits, of the description of the data when encoded with the help of the hypothesis.
The best model to explain D is the smallest model containing the selected H.
The terminology "crude MDL" is explained in the next subsection. It is not standard, and it is introduced here for pedagogical reasons.

Example 1.4 [Polynomials, cont.] In our previous example, the candidate hypotheses were polynomials. We can describe a polynomial by describing its coefficients at a certain precision (number of bits per parameter). Thus, the higher the degree of a polynomial or the precision, the more bits we need to describe it and the more "complex" it becomes. A description of the data "with the help of" a hypothesis means that the better the hypothesis fits the data, the shorter the description will be. A hypothesis that fits the data well gives us a lot of information about the data. Such information can always be used to compress the data. Intuitively, this is because we only have to code the errors the hypothesis makes on the data rather than the full data. In our polynomial example, the better a polynomial H fits D, the fewer bits we need to encode the discrepancies between the actual y-values yi and the predicted y-values H(xi). We can typically find a very complex point hypothesis (large L(H)) with a very good fit (small L(D|H)). We can also typically find a very simple point hypothesis (small L(H)) with a rather bad fit (large L(D|H)). The sum of the two description lengths will be minimized at a hypothesis that is quite (but not too) "simple," with a good (but not perfect) fit.
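Purely as an illustration of this tradeoff, here is a minimal sketch of two-part code selection over polynomial models. The coding choices are mine, not the book's: every coefficient is charged a flat 16 bits for L(H) (the quantization a real two-part code would perform is ignored), the noise standard deviation is treated as known, and L(D|H) is computed as the Shannon-Fano code length -log₂ P(D|H) that is introduced in the next subsection.

    # Crude two-part MDL sketch: pick the polynomial degree minimizing L(H) + L(D|H).
    import numpy as np

    def two_part_codelength(x, y, degree, sigma=1.0, bits_per_coef=16):
        coefs = np.polyfit(x, y, degree)              # best-fitting point hypothesis in this model
        residuals = y - np.polyval(coefs, x)
        L_H = bits_per_coef * (degree + 1)            # L(H): bits to describe the coefficients
        # L(D|H) = -log2 P(D|H) for Gaussian noise with known standard deviation sigma
        n = len(y)
        L_D_given_H = (np.sum(residuals ** 2) / (2 * sigma ** 2)) * np.log2(np.e) \
                      + 0.5 * n * np.log2(2 * np.pi * sigma ** 2)
        return L_H + L_D_given_H

    rng = np.random.default_rng(0)
    x = np.linspace(0, 5, 40)
    y = x ** 3 - 8 * x ** 2 + 19 * x + 9 + rng.normal(0, 1, size=x.size)   # synthetic cubic-plus-noise data
    print(min(range(10), key=lambda d: two_part_codelength(x, y, d)))      # typically selects degree 3

A very low degree gives small L(H) but large L(D|H); a very high degree gives the reverse; the sum is typically minimized at the true degree once there are enough points.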
1.5.1 From Crude to Refined MDL

Crude MDL picks the H minimizing the sum L(H) + L(D|H). To make this procedure well defined, we need to agree on precise definitions for the codes (description methods) giving rise to the lengths L(D|H) and L(H).

Models and Model Classes; (Point) Hypotheses
We use the word model to refer to a set of probability distributions or functions of the same functional form. E.g., the "first-order Markov model" is the set of all probability distributions that are first-order Markov chains. The "model of kth degree polynomials" is the set of all kth degree polynomials for some fixed k. We use the word model class to refer to a family (set) of models, e.g. "the model class of all polynomials" or "the model class of all Markov chains of each order." The definitions of "model" and "model class" are chosen so that they agree with how these words are used in statistical practice. Therefore they are intentionally left somewhat imprecise. We use the word hypothesis to refer to an arbitrary set of probability distributions or functions. We use the word point hypothesis to refer to a single probability distribution (e.g. a Markov chain with all parameter values specified) or function (e.g. a particular polynomial). In parametric inference (Chapter 2), a point hypothesis corresponds to a particular parameter value. A point hypothesis may also be viewed as an instantiation of a model. What we call "point hypothesis" is called "simple hypothesis" in the statistics literature; our use of the word "model (selection)" coincides with its use in much of the statistics literature; see Section 2.3, page 62, where we give several examples to clarify our terminology.

Figure 1.2 Models and Model Classes; (Point) Hypotheses.

We now discuss these codes in more detail. We will see that the definition of L(H) is problematic, indicating that we somehow need to "refine" our crude MDL principle.

Definition of L(D|H)
Consider a two-part code as described above, and assume for the time being that all H under consideration define probability distributions. If H is a polynomial, we can turn it into a distribution by making the additional assumption that the Y-values are given by Y = H(X) + Z, where Z is a normally distributed noise term with mean 0. For each H we need to define a code with lengths L(·|H) such that L(D|H) can be interpreted as "the code length of D when encoded with the help of H." It turns out that for probabilistic hypotheses, there is only one reasonable choice for this code; this is explained at length in Chapter 5. It is the so-called Shannon-Fano code, satisfying, for all data sequences D, L(D|H) = -log P(D|H), where P(D|H) is the probability mass or density of D according to H. Such a code always exists, as we explain in Chapter 3, in the box on page 96.

Definition of L(H): A Problem for Crude MDL
It is more problematic to find a good code for hypotheses H. Some authors have simply used "intuitively reasonable" codes in the past, but this is not satisfactory: since the description length L(H) of any fixed point hypothesis H can be very large under one code, but quite short under another, our procedure is in danger of becoming arbitrary. Instead, we need some additional principle for designing a code for H. In the first publications on MDL (Rissanen 1978; Rissanen 1983), it was implicitly advocated to choose some sort of minimax code for H, achieving the shortest worst-case total description length L(H) + L(D|H), where the worst case is over all possible data sequences. Thus, the MDL principle is employed at a "meta-level" to choose a code for H. This idea, already implicit in Rissanen's early work, but perhaps for the first time stated and formalized in a completely precise way by Barron and Cover (1991), is the first step towards "refined" MDL.

More Problems for Crude MDL
We can use crude MDL to code any sequence of data D with a total description length L(D) := min_H {L(D|H) + L(H)}. But it turns out that this code is incomplete: one can show that there exist other codes L′ which for some D achieve strictly smaller code length (L′(D) < L(D)), and for no D achieve larger code length (Chapter 6, Example 6.4). It seems strange that our "minimum description length" principle should be based on codes which are incomplete (inefficient) in this sense. Another, less fundamental problem with two-part codes is that, if designed in a minimax way as indicated above, they require a cumbersome discretization of the model space H, which is not always feasible in practice. The final problem we mention is that, while it is clear how to use crude two-part codes for point hypothesis and model selection, it is not immediately clear how they can be used for prediction. Later, Rissanen (1984) realized that these problems could be side-stepped by using one-part rather than two-part codes. As we explain below, it depends on the situation at hand whether a one-part or a two-part code should be used. Combining the idea of designing codes so as to achieve essentially minimax optimal code lengths with the combined use of one-part and two-part codes (whichever is appropriate for the situation at hand) has culminated in a theory of inductive inference that we call refined MDL. We discuss it in more detail in the next subsection.
Crude Two-Part MDL (Part I, Chapter 5 of this book)
In this book, we use the term "crude MDL" to refer to applications of MDL for model and hypothesis selection of the type described in the box on page 14, as long as the hypotheses H ∈ H are encoded in "intuitively reasonable" but ad-hoc ways. Refined MDL is sometimes based on one-part codes, sometimes on two-part codes, and sometimes on a combination of these, but, in contrast to crude MDL, the codes are invariably designed according to some minimax principles. If there is a choice, one should always prefer refined MDL, but in some exotic modeling situations, the use of crude MDL is inevitable. Part I of this book first discusses all probabilistic, statistical and information-theoretic preliminaries (Chapters 2–4) and culminates in a description of crude two-part MDL (Chapter 5). Refined MDL is described only in Part III.

1.5.2 Universal Coding and Refined MDL

In refined MDL, we associate a code for encoding D not with a single H ∈ H, but with the full model H. Thus, given model H, we encode data not in two parts but we design a single one-part code with lengths L(D|H). This code is designed such that whenever there is a member of (parameter in) the model that fits the data well, in the sense that the code length of D based on that single member is small, then the one-part code length L(D|H) will also be small. Codes with this property are called universal codes in the information-theoretic literature (Barron, Rissanen, and Yu 1998):

Universal Coding (Part II of This Book)
There exist at least four types of universal codes:
1. The normalized maximum likelihood (NML) code and its variations.
2. The Bayesian mixture code and its variations.
3. The prequential plug-in code.
4. The two-part code.
These codes are all based on entirely different coding schemes, but in practice lead to very similar code lengths L(D|H). Part II of this book is entirely devoted to universal coding. The four types of codes are introduced in Chapter 6. This is followed by a separate chapter for each code.

For each model H, there are many different universal codes we can associate with H. When applying MDL, we have a preference for the one that is minimax optimal in a sense made precise in Chapter 6. For example, the set H3 of third-degree polynomials is associated with a code with lengths L(·|H3) such that, the better the data D are fit by the best-fitting third-degree polynomial, the shorter the code length L(D|H3). L(D|H) is called the stochastic complexity of the data given the model. Refined MDL is a general theory of inductive inference based on universal codes that are designed to be minimax, or close to minimax optimal. It has mostly been developed for model selection, estimation and prediction. To give a first flavor, we initially discuss model selection, where, arguably, it has the most new insights to offer.

1.5.3 Refined MDL for Model Selection

Parametric Complexity
A fundamental concept of refined MDL for model selection is the parametric complexity of a parametric model H, which we denote by COMP(H). This is a measure of the "richness" of model H, indicating its ability to fit random data. This complexity is related to the number of degrees-of-freedom (parameters) in H, but also to the geometrical structure of H; see Example 1.5. To see how it relates to stochastic complexity, let, for given data D, Ĥ denote the distribution in H which maximizes the probability, and hence minimizes the code length L(D|Ĥ) of D.
It turns out that

    L(D|H) = L(D|Ĥ) + COMP(H),

where, as above, L(D|H) is the stochastic complexity of D given H. Refined MDL model selection between two parametric models H1 and H2 (such as the models of first and second degree polynomials) now proceeds as follows. We encode data D in two stages. In the first stage, we encode a number j ∈ {1, 2}. In the second stage, we encode the data using the universal code with lengths L(D|H_j). As in the two-part code principle, we then select the H_j achieving the minimum total two-part code length,

    min_{j ∈ {1,2}} { L(j) + L(D|H_j) } = min_{j ∈ {1,2}} { L(j) + L(D|Ĥ_j) + COMP(H_j) }.    (1.4)

Since the worst-case optimal code to encode j needs only 1 bit to encode either j = 1 or j = 2, we use a code for the first part such that L(1) = L(2) = 1. But this means that L(j) plays no role in the minimization, and we are effectively selecting the model such that the stochastic complexity of the given data D is smallest.⁴ Thus, in the end we select the model minimizing the one-part code length of the data. Nevertheless, refined MDL model selection involves a tradeoff between two terms: a goodness-of-fit term L(D|Ĥ_j) and a complexity term COMP(H_j). However, because we do not explicitly encode hypotheses H anymore, there is no potential for arbitrary code lengths anymore.

4. The reason we include L(j) at all in (1.4) is to maintain consistency with the case where we need to select between an infinite number of models. In that case, it is necessary to include L(j).
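As a rough guide to the size of the complexity term (this approximation is not part of the discussion above; it is the standard large-sample behavior of the parametric complexity of a smooth k-parameter model, made precise in the book's later chapters on universal coding), one may keep in mind that

    \[
    \mathrm{COMP}(H_j) \;\approx\; \frac{k_j}{2}\,\log_2 n \;+\; O(1),
    \]

where k_j is the number of free parameters of H_j and n is the sample size; the goodness-of-fit term L(D|Ĥ_j) typically grows linearly in n, so the complexity penalty matters most at small and moderate sample sizes.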
The resulting procedure can be interpreted in several different ways, some of which provide us with rationales for MDL model selection beyond the pure coding interpretation (Chapter 14):

1. Counting / differential geometric interpretation. The parametric complexity of a model is the logarithm of the number of essentially different, distinguishable point hypotheses within the model.
2. Two-part code interpretation. For large samples, the stochastic complexity can be interpreted as a two-part code length of the data after all, where hypotheses H are encoded with a special code that works by first discretizing the model space H into a set of "maximally distinguishable hypotheses," and then assigning equal code length to each of these.
3. Bayesian interpretation. In many cases, refined MDL model selection coincides with Bayes factor model selection based on a noninformative prior such as Jeffreys' prior (Bernardo and Smith 1994).
4. Prequential interpretation. MDL model selection can be interpreted as selecting the model with the best predictive performance when sequentially predicting unseen test data, in the sense described in Chapter 6, Section 6.4 and Chapter 9. This makes it an instance of Dawid's (1984) prequential model validation and also relates it to cross-validation methods; see Chapter 17, Sections 17.5 and 17.6.

In Section 1.6.1 we show that refined MDL allows us to compare models of different functional form. It even accounts for the phenomenon that different models with the same number of parameters may not be equally "complex."

1.5.4 General Refined MDL: Prediction and Hypothesis Selection

Model selection is just one application of refined MDL. The two other main applications are point hypothesis selection and prediction. These applications can also be interpreted as methods for parametric and nonparametric estimation. In fact, it turns out that large parts of MDL theory can be reinterpreted as a theory about sequential prediction of future data given previously seen data. This "prequential" interpretation of MDL (Chapter 15) is at least as important as the coding interpretation. It is based on the fundamental correspondence between probability distributions and codes via the Shannon-Fano code that we alluded to before, when explaining the code with lengths L(D|H); see the box on page 96. This correspondence allows us to view any universal code L(·|H) as a strategy for sequentially predicting data, such that the better H is suited as a model for the data, the better the predictions will be. MDL prediction and hypothesis selection are mathematically cleaner than MDL model selection: in Chapter 15, we provide theorems (Theorem 15.1 and Theorem 15.3) which, in the respective contexts of prediction and hypothesis selection, express that, in full generality, good data compression implies fast learning, where "learning" is defined as "finding a hypothesis that is in some sense close to an imagined 'true state of the world.'" There are similar theorems for model selection, but these lack some of the simplicity and elegance of Theorem 15.1 and Theorem 15.3.

Probabilistic vs. Nonprobabilistic MDL
Like most other authors on MDL, in this book we confine ourselves to probabilistic hypotheses, also known as probabilistic sources. These are hypotheses that take the form of probability distributions over the space of possible data sequences. The examples we give in this chapter (Examples 1.3 and 1.5) involve hypotheses H that are functions from some space X to another space Y; at first sight, these are not "probabilistic." We will usually assume that for any given x, we have y = H(x) + Z, where Z is a noise term with a known distribution. Typically, the noise Z will be assumed to be Gaussian (normally) distributed. With such an additional assumption, we may view "functional" hypotheses H: X → Y as "probabilistic" after all. Such a technique of turning functions into probability distributions is customary in statistics, and we will use it throughout large parts of this book. Whenever we refer to MDL, we implicitly assume that we deal with probabilistic models. We should note, though, that there exist variations of MDL that directly work with universal codes relative to functional hypotheses such as polynomials (see Section 1.9.1, and Chapter 17, Section 17.10).

Fixing Notation
We use the symbol H for general point hypotheses, that may either represent a probabilistic source or a deterministic function. We use the calligraphic symbol H for sets of such general point hypotheses. We reserve the symbol M for probabilistic models and model classes. We denote probabilistic point hypotheses by P, and point hypotheses that are deterministic functions by h.
Individual-Sequence vs. Expectation-based MDL
Refined MDL is based on minimax optimal universal codes. Broadly speaking, there are two different ways to define what we mean by minimax optimality. One is to look at the worst-case code length over all possible sequences. We call this individual-sequence MDL. An alternative is to look at expected code length, where the expectation is taken over some probability distribution, usually but not always assumed to be a member of the model class M under consideration. We call this expectation-based MDL. We discuss the distinction in detail in Part III of the book; see also the box on page 407. The individual-sequence approach is the one taken by Rissanen, the main originator of MDL, and we will mostly follow it throughout this book.

The Luckiness Principle
In the individual-sequence approach, the minimax optimal universal code is given by the normalized maximum likelihood (NML) code that we mentioned above. A problem is that for many (in fact, most) practically interesting models, the NML code is not well defined. In such cases, a minimax optimal code does not exist. As we explain in Chapter 11, in some cases one can get around this problem using so-called "conditional NML" codes, but in general, one needs to use codes based on a modified minimax principle, which we call the luckiness principle. Although it has been implicitly used in MDL since its inception, I am the first to use the term "luckiness principle" in an MDL context; see the box on page 92, Chapter 3; the developments in Chapter 11, Section 11.3, where we introduce the concept of a luckiness function; and the discussion in Chapter 17, Section 17.2.1. The luckiness principle reintroduces some subjectivity in MDL code design. This seems to bring us back to the ad-hoc codes used in crude two-part MDL. The difference, however, is that with luckiness functions, we can precisely quantify the effects of this subjectivity: for each possible data sample D that we may observe, we can indicate how "lucky" we are on the sample, i.e., how many extra bits we need to encode D compared to the best hypothesis that we have available for D. This idea significantly extends the applicability of refined MDL methods.

MDL Is a Principle
Contrary to what is often thought, MDL, and even "modern, refined MDL," is not a unique, single method of inductive inference. Rather, it represents a general principle for doing inductive inference. The principle may (and will) be formulated precisely enough to allow us to establish, for many given methods (procedures, learning algorithms), "this method is an instance of MDL" or "this is not an instance of MDL." But nevertheless:

MDL Is a Principle, Not a Unique Method
Being a principle, MDL gives rise to several methods of inductive inference. There is no single "uniquely optimal MDL method/procedure/algorithm." Nevertheless, in some special situations (e.g. simple parametric statistical models), one can clearly distinguish between good and not so good versions of MDL, and something close to "an optimal MDL method" exists.

Summary: Refined MDL (Part III of This Book)
Refined MDL is a method of inductive inference based on universal codes which are designed to have some minimax optimality properties. Each model H under consideration is associated with a corresponding universal code. In this book we restrict ourselves to probabilistic H. Refined MDL has mainly been developed for model selection, point hypothesis selection and prediction. Refined MDL comes in two versions: individual-sequence and expectation-based refined MDL, depending on whether the universal codes are designed to be optimal in an individual-sequence or in an expected sense. If the minimax optimal code relative to a model M is not defined, some element of subjectivity is introduced into the coding using a luckiness function. A more precise overview is given in the box on page 406.
In the remainder of this chapter we will mostly concentrate on MDL for model selection.

1.6 Some Remarks on Model Selection

Model selection is a controversial topic in statistics. Although most people agree that it is important, many say it can only be done on external grounds, and never by merely looking at the data. Still, a plethora of automatic model selection methods has been suggested in the literature. These can give wildly different results on the same data, one of the main reasons being that they have often been designed with different goals in mind. This section starts with a further example that motivates the need for model selection, and it then discusses several goals that one may have in mind when doing model selection. These issues are discussed in a lot more detail in Chapter 14. See also Chapter 17, especially Section 17.3, where we compare MDL model selection to the standard model selection methods AIC and BIC.

1.6.1 Model Selection among Non-Nested Models

Model selection is often used in the following context: two researchers or research groups A and B propose entirely different models M_A and M_B as an explanation for the same data D. This situation occurs all the time in applied sciences like econometrics, biology, experimental psychology, etc. For example, group A may have some general theory about the phenomenon at hand which prescribes that the trend in data D is given by some polynomial. Group B may think that the trend is better described by some neural network; a concrete case will be given in Example 1.5 below. A and B would like to have some way of deciding which of their two models is better suited for the data at hand. If they simply decide on the model containing the hypothesis (parameter instantiation) that best fits the data, they once again run the risk of overfitting: if model M_A has more degrees of freedom (parameters) than model M_B, it will typically be able to better fit random noise in the data. It may then be selected even if M_B actually better captures the underlying trend (regularity) in the data. Therefore, just as in the hypothesis selection example, deciding whether M_A or M_B is a better explanation for the data should somehow depend on how well M_A and M_B fit the data and on the respective "complexities" of M_A and M_B. In the polynomial case discussed before, there was a countably infinite number of "nested" models (each model being contained in the next). In contrast, we now deal with a finite number of entirely unrelated models. But there is nothing that stops us from using MDL model selection as "defined" above.

Example 1.5 [Selecting Between Models of Different Functional Form] Consider two models from psychophysics describing the relationship between physical dimensions (e.g., light intensity) and their psychological counterparts (e.g., brightness) (Myung, Balasubramanian, and Pitt 2000): y = ax^b + Z (Stevens's model) and y = a ln(x + b) + Z (Fechner's model), where Z is a normally distributed noise term. Both models have two free parameters; nevertheless, according to the refined version of MDL model selection to be introduced in Part III, Chapter 14 of this book, Stevens's model is in a sense "more complex" than Fechner's (see page 417). Roughly speaking, this means there are a lot more data patterns that can be explained by Stevens's model than can be explained by Fechner's model. Somewhat more precisely, the number of data patterns (sequences of data) of a given length that can be fit well by Stevens's model is much larger than the number of data patterns of the same length that can be fit well by Fechner's model. Therefore, using Stevens's model we run a larger risk of "overfitting."

In the example above, the goal was to select between a power law and a logarithmic relationship. In general, we may of course come across model selection problems involving neural networks, polynomials, Fourier or wavelet expansions, exponential functions – anything may be proposed and tested.
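A quick way to get a feel for the two functional forms is to fit both by least squares (a sketch of mine, not the book's; scipy's curve_fit and the synthetic Fechner-style data are illustrative assumptions). Such a fit measures goodness of fit only; the refined MDL comparison described above additionally charges each model its parametric complexity COMP(H), which is what makes Stevens's model the "larger" one despite the equal parameter count.

    # Fit Stevens's and Fechner's two-parameter models to the same synthetic data.
    import numpy as np
    from scipy.optimize import curve_fit

    def stevens(x, a, b):            # y = a * x**b
        return a * np.power(x, b)

    def fechner(x, a, b):            # y = a * ln(x + b)
        return a * np.log(x + b)

    rng = np.random.default_rng(0)
    x = np.linspace(1.0, 10.0, 50)
    y = 2.0 * np.log(x + 1.0) + rng.normal(0, 0.1, x.size)    # data of Fechner's form, for illustration

    for name, f in [("Stevens", stevens), ("Fechner", fechner)]:
        params, _ = curve_fit(f, x, y, p0=[1.0, 1.0])
        rss = float(np.sum((y - f(x, *params)) ** 2))
        print(name, params.round(3), round(rss, 3))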
1.6.2 Goals of Model vs. Point Hypothesis Selection

The goal of point hypothesis selection is usually just to infer a hypothesis from the data and use that to make predictions of, or decisions about, future data coming from the same source. Model selection may be done for several reasons:

1. Deciding between "general" theories. This is the application that was illustrated in the example above. Often, the research groups A and B are only interested in the models M_A and M_B, and not in particular hypotheses (corresponding to parameter settings) within those models. The reason is that the models M_A and M_B are proposed as general theories for the phenomenon at hand. The claim is that they work not only under the exact circumstances under which the experiment giving rise to data D took place but in many other situations as well. In our case, the research group proposing model M_A may claim that the functional relationship underlying model M_A provides a good description of the relationship between light intensity and brightness under a variety of circumstances; however, the specific parameter settings may vary from situation to situation. For example, it should be an appropriate model both in daylight (for parameter setting (a0, b0)) and in artificial light (for parameter setting (a1, b1)).

2. Gaining insight. Sometimes, the goal is not to make specific predictions but just to get a first idea of the process underlying the data. Such a rough, first impression may then be used to guide further experimentation about the phenomenon under investigation.

3. Determining relevant variables. In Example 1.3 the instances xi were all real numbers. In practice, the yi may often depend on several quantities, which may be modeled by taking the xi to be real vectors xi = (xi1, ..., xik). We say that there are k regressor variables. In such a setting, an important model selection problem is to determine which variables are relevant and which are not. This is sometimes called the selection-of-variables problem. Often, for each j, there is a cost associated with measuring xij. We would therefore like to learn, from some given set of empirical data, which of the regressor variables are truly relevant for predicting the values of y. If there are k regressor variables, this involves model selection between 2^k different models. Each model corresponds to the set of all linear relationships between a particular subset of the regressor variables and y.
4. Prediction by weighted averaging. Even if our sole goal is prediction of future data, model selection may be useful. In this context, we first infer a model (set of hypotheses) for the data at hand. We then predict future data by combining all the point hypotheses within the model to arrive at a prediction. Usually this is done by taking a weighted average of the predictions that would be optimal according to the different hypotheses within the model. Here the weights of these predictions are determined by the performance of the corresponding hypotheses on past data. There are abundant examples in the literature on Bayesian statistics (Lee 1997; Berger 1985; Bernardo and Smith 1994) which show that, both in theory and in "the real world," prediction by weighted averaging usually works substantially better than prediction by a single hypothesis. In Chapter 15, we discuss model-based MDL prediction, which is quite similar to Bayesian prediction.

1.7 The MDL Philosophy

The first central MDL idea is that every regularity in data may be used to compress that data; the second central idea is that learning can be equated with finding regularities in data. Whereas the first part is relatively straightforward, the second part of the idea implies that methods for learning from data must have a clear interpretation independent of whether any of the models under consideration is "true" or not. Quoting J. Rissanen (1989), the main originator of MDL:

"We never want to make the false assumption that the observed data actually were generated by a distribution of some kind, say Gaussian, and then go on to analyze the consequences and make further deductions. Our deductions may be entertaining but quite irrelevant to the task at hand, namely, to learn useful properties from the data." – Jorma Rissanen [1989]

Based on such ideas, Rissanen has developed a radical philosophy of learning and statistical inference that is considerably different from the ideas underlying mainstream statistics, both frequentist and Bayesian. We now describe this philosophy in more detail; see also Chapter 17, where we compare the MDL philosophy to the ideas underlying other statistical paradigms.
1. Regularity as Compression. According to Rissanen, the goal of inductive inference should be to "squeeze out as much regularity as possible" from the given data. The main task is to distill the meaningful information present in the data, i.e., to separate structure (interpreted as the regularity, the "meaningful information") from noise (interpreted as the "accidental information"). For the three sequences of Example 1.1, this would amount to the following: the first sequence would be considered as entirely regular and "noiseless." The second sequence would be considered as entirely random; all information in the sequence is accidental, there is no structure present. In the third sequence, the structural part would (roughly) be the pattern that 4 times as many 0s as 1s occur; given this regularity, the description of exactly which one among all sequences with four times as many 0s as 1s actually occurs is the accidental information.

2. Models as Languages. Rissanen interprets models (sets of hypotheses) as nothing more than languages for describing useful properties of the data – a model H is identified with its corresponding universal code L(·|H). Different individual hypotheses within the models express different regularities in the data, and may simply be regarded as statistics, that is, summaries of certain regularities in the data. These regularities are present and meaningful independently of whether some H ∈ H is the "true state of nature" or not. Suppose that the model H = M under consideration is probabilistic. In traditional theories, one typically assumes that some P ∈ M generates the data, and then "noise" is defined as a random quantity relative to this P. In the MDL view, "noise" is defined relative to the model M as the residual number of bits needed to encode the data once the model M is given. Thus, noise is not a random variable: it is a function only of the chosen model and the actually observed data. Indeed, there is no place for a "true distribution" or a "true state of nature" in this view – there are only models and data. To bring out the difference to the ordinary statistical viewpoint, consider the phrase "these experimental data are quite noisy." According to a traditional interpretation, such a statement means that the data were generated by a distribution with high variance. According to the MDL philosophy, such a phrase means only that the data are not compressible with the currently hypothesized model – as a matter of principle, it can never be ruled out that there exists a different model under which the data are very compressible (not noisy) after all!
3. We Have Only the Data. Many (but not all⁵) other methods of inductive inference are based on the idea that there exists some "true state of nature," typically a distribution assumed to lie in some model M. The methods are then designed as a means to identify or approximate this state of nature based on as little data as possible. According to Rissanen,⁶ such methods are fundamentally flawed. The main reason is that the methods are designed under the assumption that the true state of nature is in the assumed model M, which is often not the case. Therefore, such methods only admit a clear interpretation under assumptions that are typically violated in practice. Many cherished statistical methods have been designed in this way; we mention hypothesis testing, minimum-variance unbiased estimation, several nonparametric methods, and even some forms of Bayesian inference – see Chapter 17, Section 17.2.1. In contrast, MDL has a clear interpretation which depends only on the data, and not on the assumption of any underlying "state of nature."

Example 1.6 [Models That Are Wrong, Yet Useful] Even though the models under consideration are often wrong, they can nevertheless be very useful. Examples are the successful "Naive Bayes" model for spam filtering, hidden Markov models for speech recognition (is speech a stationary ergodic process? probably not), and the use of linear models in econometrics and psychology. Since these models are evidently wrong, it seems strange to base inferences on them using methods that are designed under the assumption that they contain the true distribution. To be fair, we should add that domains such as spam filtering and speech recognition are not what the fathers of modern statistics had in mind when they designed their procedures; they were usually thinking about much simpler domains, where the assumption that some distribution P ∈ M is "true" may not be so unreasonable; see also Chapter 17, Section 17.1.1.

4. MDL and Consistency. Let M be a probabilistic model, such that each P ∈ M is a probability distribution. Roughly, a statistical procedure is called consistent relative to M if, for all P ∈ M, the following holds: suppose data are distributed according to P. Then given enough data, the learning method will learn a good approximation of P with high probability. Many traditional statistical methods have been designed with consistency in mind (Chapter 2, Section 2.5).

5. For example, cross-validation cannot easily be interpreted in such terms of "a method hunting for the true distribution." The same holds for some – not all – Bayesian methods; see Chapter 17.
6. The present author's own views are somewhat milder in this respect, but this is not the place to discuss them.
The fact that in MDL we do not assume a true distribution may suggest that we do not care about statistical consistency. But this is not the case: we would still like our statistical method to be such that in the idealized case where one of the distributions in one of the models under consideration actually generates the data, our method is able to identify this distribution, given enough data. If even in the idealized special case where a "truth" exists within our models, the method fails to learn it, then we certainly cannot trust it to do something reasonable in the more general case, where there may not be a "true distribution" underlying the data at all. So: consistency is important in the MDL philosophy, but it is used as a sanity check (for a method that has been developed without making distributional assumptions) rather than as a design principle; see also Chapter 17, Section 17.1.1. In fact, mere consistency is not sufficient. We would like our method to converge to the imagined true P fast, based on as small a sample as possible. Theorems 15.1 and 15.3 of Chapter 15 show that this indeed happens for MDL prediction and hypothesis selection; as explained in Chapter 16, MDL convergence rates for estimation and prediction are typically either minimax optimal or within a factor log n of minimax optimal.

Summarizing this section, the MDL philosophy is agnostic about whether any of the models under consideration is "true," or whether something like a "true distribution" even exists. Nevertheless, it has been suggested (Webb 1996; Domingos 1999) that MDL embodies a naive belief that "simple models" are "a priori more likely to be true" than complex models. Below we explain why such claims are mistaken.

1.8 Does It Make Any Sense? MDL, Occam's Razor, and the "True Model"

When two models fit the data equally well, MDL will choose the one that is the "simplest" in the sense that it allows for a shorter description of the data. As such, it implements a precise form of Occam's razor – even though as more and more data become available, the model selected by MDL may become more and more complex! Throughout the ages, Occam's razor has received a lot of praise as well as criticism. Some of these criticisms (Webb 1996; Domingos 1999) seem applicable to MDL as well. The following two are probably heard most often:
"1. Occam's razor (and MDL) is arbitrary." Because "description length" is a syntactic notion, it may seem that MDL selects an arbitrary model: different codes would have led to different description lengths, and therefore, to different models. By changing the encoding method, we can make "complex" things "simple" and vice versa.

"2. Occam's razor is false." It is sometimes claimed that Occam's razor is false: we often try to model real-world situations that are arbitrarily complex, so why should we favor simple models? In the words of G. Webb:⁷ "What good are simple models of a complex world?"

The short answer to 1 is that this argument overlooks the fact that we are not allowed to use just any code we like! "Refined MDL" severely restricts the set of codes one is allowed to use. As we explain below, and in more detail in Chapter 7, this leads to a notion of complexity that can also be interpreted as a kind of "model volume," without any reference to "description lengths." The short answer to 2 is that even if the true data-generating machinery is very complex, it may often be a good strategy to prefer simple models for small sample sizes. Below we give more elaborate answers to both criticisms.

1.8.1 Answer to Criticism No. 1: Refined MDL's Notion of "Complexity" Is Not Arbitrary

In "algorithmic" or "idealized" MDL (Section 1.4), it is possible to define the Kolmogorov complexity of a point hypothesis H as the length of the shortest program that computes the function value or probability H(x) up to r bits precision when given input (x, r). In our practical version of MDL, there is no single "universal" description method used to encode point hypotheses. A hypothesis with a very short description under one description method may have a very long description under another method. Therefore it is usually meaningless to say that a particular point hypothesis is "simple" or "complex." However, for many types of models, it is possible to define the complexity of a model (interrelated set of point hypotheses) in an unambiguous manner that does not depend on the way we parameterize the model. This is the "parametric complexity" that we mentioned in Section 1.5.3.⁸ It will be defined for finite models in Chapter 6. In Chapter 7 we extend the definition to general models that contain uncountably many hypotheses. There exists a close connection between the algorithmic complexity of a hypothesis and the parametric complexity of any large model that contains the hypothesis. Broadly speaking, for most hypotheses that are contained in any given model, the Kolmogorov complexity of the hypothesis will be approximately equal to the parametric complexity of the model.

In Example 1.3 we did speak of a "complex" point hypothesis. This is really sloppy terminology: since only the complexity of models rather than hypotheses can be given an unambiguous meaning, we should instead have spoken of "a point hypothesis that, relative to the set of models under consideration, is a member only of a complex model and not of a simple model." Such sloppy terminology is commonly used in papers on MDL. Unfortunately, it has caused a lot of confusion in the past. Specifically, it has led people to think that MDL model selection is a mostly arbitrary procedure leading to completely different results according to how the details in the procedure are filled in (Shaffer 1993). At least for the refined versions of MDL we discuss in Part III of this book, this is just plain false.

7. Quoted with permission from KDD Nuggets 96:28, 1996.
8. The parametric complexity of a probabilistic model M = {P} that consists of only one hypothesis is always 0, no matter how large the Kolmogorov complexity of P; see Chapter 17, Example 17.5.
Complexity of Models vs. Complexity of Hypotheses

In algorithmic MDL, we may define the complexity of an individual (i.e., point) hypothesis (function or probability distribution). In practical MDL, as studied here, this is not possible: complexity becomes a property of models (sets of point hypotheses) rather than of individual point hypotheses (instantiations of models). MDL-based, or parametric, complexity is a property of a model that does not depend on any particular description method used, or on any parameterization of the hypotheses within the model. It is related to (but not quite the same as) the number of substantially different hypotheses in a model (Part II of this book, Chapter 6, Chapter 7). A “simple model” then roughly corresponds to “a small set of hypotheses.”

In practice, we often use models for which the parametric complexity is undefined. We then use an extended notion of complexity, based on a “luckiness function.” While such a complexity measure does have a subjective component, it is still far from arbitrary, and it cannot be used to make a complex model “simple”; see Chapter 17, Section 17.2.1.

1.8.2 Answer to Criticism No. 2

In light of the previous discussions in this chapter, preferring “simpler” over more “complex” models seems to make a lot of sense: one should try to avoid overfitting (i.e., one should try to avoid modeling the noise rather than the “pattern” or “regularity” in the data). It seems plausible that this may be achieved by somehow taking into account the “complexity,” “richness,” or “(non-)smoothness” of the models under consideration. But from another viewpoint, the whole enterprise may seem misguided: as the example below shows, it seems to imply that, when we apply MDL, we are implicitly assuming that “simpler” models are somehow a priori more likely to be “true.” Yet in many cases of interest, the phenomena we try to model are very complex, so preferring simpler models for the data at hand would not seem to make a lot of sense. How can these two conflicting intuitions be reconciled?

Authors criticizing Occam's razor (Domingos 1999; Webb 1996) usually do think that in some cases “simpler” models should be preferred over more “complex” ones, since the former are more understandable and in that sense more useful. But they argue that the “simpler” model will usually not lead to better predictions of future data coming from the same source. We claim that, on the contrary, for the MDL-based definitions of “simple” and “complex” we will introduce in this book, selecting the simpler model in many cases does lead to better predictions, even in a complex environment.

Example 1.7 [MDL Hypothesis/Model Selection for Polynomials, cont.] Let us focus on point hypothesis selection of polynomials. Let H_γ be the set of γth-degree polynomials. Suppose a “truth” exists in the following sense: there exists some distribution P_X such that the x_i are all independently distributed according to P_X. We assume that all x_i must fall within some interval [a, b], i.e. P_X([a, b]) = 1. There also exists a function h* such that for all x_i generated by P_X, we have y_i = h*(x_i) + Z_i. Here the Z_i are noise or “error” terms. We assume the Z_i to be identical, independent, normally (Gaussian) distributed random variables, with mean 0 and some variance σ². For concreteness we will assume h*(x) = x³ − 8x² + 19x + 9 and σ² = 1. This is actually the polynomial/error combination that was used to generate the points in Figure 1.1 on page 13. However, the x in that graph were not drawn according to some distribution such as in the present scenario. Instead, they were preset to a fixed list of values. This is similar to the practical case where the experimental design is controlled by the experimenter: the experimenter determines the x values for which corresponding y-values will be
measured. In this setup, similar analyses as those below may still be made (see, e.g., (Wei 1992)).

In such a scenario with a “true” h* and P_X, as more and more data pairs (x_i, y_i) are made available, with high probability something like the following will happen (Chapter 16): for very small n, MDL model selection will select a 0th-degree polynomial, i.e. y = c for some constant c. Then as n grows larger, MDL will start selecting models of higher degree. At some sample size n it will select a third-order polynomial for the first time. It will then for a while (as n increases further) fluctuate between second-, third- and fourth-order polynomials. But for all n larger than some critical n_0, MDL will select the correct third-order H_3. It turns out that for all n, the point hypothesis h_n selected by two-part MDL hypothesis selection is (approximately) the polynomial within the selected H_γ that best fits data D = ((x_1, y_1), ..., (x_n, y_n)). As n goes to infinity, h_n will with high probability converge to h* in the sense that all coefficients converge to the corresponding coefficients of h*.

Two-part code MDL behaves like this not just when applied to the model class of polynomials, but for most other potentially interesting model classes as well. The upshot is that, for small sample sizes, MDL has a built-in preference for “simple” models. This preference may seem unjustified, since “reality” may be more complex. We claim that, on the contrary, such a preference (if implemented carefully) has ample justification. To back our claim, we first note that MDL (and the corresponding form of Occam's razor) is just a strategy for inferring models from data (“choose simple models at small sample sizes”), not a statement about how the world works (“simple models are more likely to be true”) – indeed, a strategy cannot be true or false, it is “clever” or “stupid.” And the strategy of preferring simpler models is clever even if the data-generating process is highly complex, as illustrated by the following example:

Example 1.8 [“Infinitely” Complex Sources] Suppose that data are subject to the law Y = f*(X) + Z, where f* is some continuous function and Z is some noise term with mean 0. If f* is not a polynomial, but X only takes values in a finite interval, say [−1, 1], we may still approximate f* arbitrarily well by taking higher and higher degree polynomials. For example, let f*(x) = exp(x). Then, if we use MDL to learn a polynomial for data D = ((x_1, y_1), ..., (x_n, y_n)), the degree of the polynomial h_n selected by MDL at sample size n will increase with n, and with high probability, h_n converges to f*(x) = exp(x) in the sense that max_{x∈[−1,1]} |h_n(x) − f*(x)| → 0 (Chapter 16). Of course, if we had better prior knowledge about the problem we
could have tried to learn f* using a model class H containing the function y = exp(x). But in general, both our imagination and our computational resources are limited, and we may be forced to use imperfect models.

If, based on a small sample, we choose the best-fitting polynomial ĥ within the set of all polynomials, then, even though ĥ will fit the data very well, it is likely to be quite unrelated to the “true” f*, and ĥ may lead to disastrous predictions of future data. The reason is that, for small samples, the set of all polynomials is very large compared to the set of possible data patterns that we might have observed. Therefore, any particular data pattern can only give us very limited information about which high-degree polynomial best approximates f*. On the other hand, if we choose the best-fitting polynomial ĥ2 in some much smaller set such as the set of second-degree polynomials, then it is highly probable that the prediction quality (mean squared error) of ĥ2 on future data is about the same as its mean squared error on the data we observed: the size (complexity) of the contemplated model is relatively small compared to the set of possible data patterns that we might have observed. Therefore, the particular pattern that we do observe gives us a lot of information on what second-degree polynomial best approximates f*.

Thus, (a) ĥ2 typically leads to better predictions of future data than ĥ; and (b) unlike ĥ, ĥ2 is reliable in that it gives a correct impression of how well it will predict future data, even if the “true” f* is “infinitely” complex. This idea does not just appear in MDL, but is also the basis of the structural risk minimization approach (Vapnik 1998) and many standard statistical methods for nonparametric inference; see Chapter 17, Section 17.10. In such approaches one acknowledges that the data-generating machinery can be infinitely complex (e.g., not describable by a finite-degree polynomial). Nevertheless, it is still a good strategy to approximate it by simple hypotheses (low-degree polynomials) as long as the sample size is small. Summarizing:

The Inherent Difference between Under- and Overfitting

If we choose an overly simple model for our data, then the best-fitting point hypothesis within the model is likely to be almost the best predictor, within the simple model, of future data coming from the same source. If we overfit (choose a very complex model) and there is noise in our data, then, even if the complex model contains the “true” point hypothesis, the best-fitting point hypothesis within the model may lead to very bad predictions of future data coming from the same source.

This statement is very imprecise and is meant more to convey the general idea than to be completely true. The fundamental consistency theorems for MDL prediction and hypothesis selection (Chapter 15, Theorem 15.1 and Theorem 15.3), as well as their extension to model selection (Chapter 16), are essentially just variations of this statement that are provably true.

The Future and The Past  Our analysis depends on the data items (x, y) being probabilistically independent. While this assumption may be substantially weakened, we can justify the use of MDL and other forms of Occam's razor only if we are willing to adopt some (possibly very weak) assumption of the sort “training data and future data are from the same source”: future data should (at least with high probability) be subject to some of the same regularities as training data. Otherwise, D and D′ may be completely unrelated and no method of inductive inference can be expected to work well. This is indirectly related to the grue paradox (Goodman 1955).
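To make the polynomial story of Examples 1.7 and 1.8 a bit more tangible, here is a minimal, self-contained sketch of the kind of experiment described there. It does not use the refined MDL criteria developed in Part III; instead it scores each candidate degree with a crude two-part-style code length (fit cost plus a (k/2) log n parameter cost, essentially a BIC-like stand-in), and the sampling design (x_i uniform on [0, 4]) and all names below are my own illustrative assumptions; only h* and σ² = 1 come from Example 1.7. The qualitative behavior is the one claimed above: at small sample sizes a low degree typically wins, and as n grows the selected degree tends to settle at the true value 3.

# A minimal, self-contained sketch of the experiment described in Examples 1.7
# and 1.8. It is NOT the refined MDL procedure of Part III: each degree is
# scored with a crude two-part-style code length, (n/2) log(RSS/n) + (k/2) log n,
# i.e. a BIC-like stand-in. The sampling design (x_i uniform on [0, 4]) and all
# names are illustrative assumptions; h* and sigma^2 = 1 are from Example 1.7.
import numpy as np

rng = np.random.default_rng(0)

def true_h(x):
    # h*(x) = x^3 - 8x^2 + 19x + 9, the "true" polynomial of Example 1.7
    return x**3 - 8*x**2 + 19*x + 9

def crude_two_part_score(x, y, degree):
    # Fit cost (first part of the code: the data given the polynomial)
    # plus parameter cost (second part: the polynomial itself), in nats.
    n = len(x)
    coeffs = np.polyfit(x, y, degree)             # least squares = ML fit under Gaussian noise
    rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    k = degree + 1                                # number of coefficients
    return 0.5 * n * np.log(rss / n) + 0.5 * k * np.log(n)

def mdl_select_degree(n, max_degree=6):
    x = rng.uniform(0.0, 4.0, size=n)             # x_i i.i.d. from some P_X on [a, b]
    y = true_h(x) + rng.normal(0.0, 1.0, size=n)  # y_i = h*(x_i) + Z_i, Z_i ~ N(0, 1)
    scores = [crude_two_part_score(x, y, d) for d in range(max_degree + 1)]
    return int(np.argmin(scores))

for n in (10, 30, 100, 1000):
    print(n, mdl_select_degree(n))
# Typically the selected degree is low for small n and settles at 3 as n grows,
# which is the qualitative behavior described in Example 1.7.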
MDL and Occam's Razor

While MDL does have a built-in preference for selecting “simple” models (with small “parametric complexity”), this does not at all mean that applying MDL only makes sense in situations where simpler models are more likely to be true. MDL is a methodology for inferring models from data, not a statement about how the world works! For small sample sizes, it prefers simple models. It does so not because these are “more likely to be true” (they often are not), but because this tends to select the model that leads to the best predictions of future data from the same source. For small sample sizes this may be a model much simpler than the model containing the “truth” (assuming for the time being that such a model containing the “truth” exists in the first place). In fact, some of MDL's most useful and successful applications are in nonparametric statistics, where the “truth” underlying the data is typically assumed to be “infinitely” complex (see Chapter 13 and Chapter 15).

1.9 History and Forms of MDL

The practical MDL principle that we discuss in this book has mainly been developed by J. Rissanen in a series of papers starting with (Rissanen 1978). It has its roots in the theory of Kolmogorov complexity (Li and Vitányi 1997), developed in the 1960s by Solomonoff (1964), Kolmogorov (1965) and Chaitin (1966, 1969). Among these authors, Solomonoff (a former student of the famous philosopher of science, Rudolf Carnap) was explicitly interested in inductive inference. The 1964 paper contains explicit suggestions on how the underlying ideas could be made practical, thereby foreshadowing some of the later work on two-part MDL. While Rissanen was not aware of Solomonoff's work at the time, Kolmogorov's (1965) paper did serve as an inspiration for Rissanen's (1978) development of MDL. Still, Rissanen's practical MDL is quite different from the idealized forms of MDL that have been directly based on Kolmogorov complexity, which we discussed in Section 1.4. Another important inspiration for Rissanen was Akaike's AIC method for model selection (Chapter 17, Section 17.3), essentially the first model selection method based on information-theoretic ideas (Akaike 1973). Even though Rissanen was inspired by AIC, both the actual method and the underlying philosophy are substantially different from MDL.

Minimum Message Length  MDL is much more closely related to the Minimum Message Length (MML) Principle (Wallace 2005), developed by Wallace and his coworkers in a series of papers starting with the groundbreaking (Wallace and Boulton 1968); other milestones are (Wallace and Boulton 1975) and (Wallace and Freeman 1987). Remarkably, Wallace developed his ideas without being aware of the notion of Kolmogorov complexity. Although Rissanen became aware of Wallace's work before the publication of (Rissanen 1978), he developed his ideas mostly independently, being influenced rather by Akaike and Kolmogorov. Indeed, despite the close resemblance of both methods in practice, the underlying philosophy is very different – see Chapter 17, Section 17.4.

Refined MDL  The first publications on MDL only mention two-part codes. Important progress was made by Rissanen (1984), in which prequential codes are employed for the first time, and by Rissanen (1987), who introduced the Bayesian mixture codes into MDL. This led to the development of the notion of stochastic complexity as the shortest code length of the data given a model
(Rissanen 1986c; Rissanen 1987). However, the connection to Shtarkov's normalized maximum likelihood code was not made until 1996, and this prevented the full development of the notion of “parametric complexity.” In the meantime, in his impressive Ph.D. thesis, Barron (1985) showed how a specific version of the two-part code criterion has excellent frequentist statistical consistency properties. This was extended by Barron and Cover (1991), who achieved a breakthrough for two-part codes: they gave clear prescriptions on how to design codes for hypotheses, relating codes with good minimax codelength properties to rates of convergence in statistical consistency theorems. Some of the ideas of Rissanen (1987) and Barron and Cover (1991) were, as it were, unified when Rissanen (1996) introduced the normalized maximum likelihood code. The resulting theory was summarized for the first time by Barron, Rissanen, and Yu (1998), and is the subject of this book. Whenever we need to distinguish it from other forms of MDL, we call it “refined MDL.”

1.9.1 What Is MDL?

“MDL” is used by different authors in somewhat different meanings, and it may be useful to review these. Some authors use MDL as a broad umbrella term for all types of inductive inference based on finding a short code length for the data. This would, for example, include the “idealized” versions of MDL based on Kolmogorov complexity (page 11) and Wallace's MML principle (see above). Some authors take an even broader view and include all inductive inference that is based on data compression, even if it cannot be directly interpreted in terms of code length minimization. This includes, for example, the work on similarity analysis and clustering based on the normalized compression distance (Cilibrasi and Vitányi 2005). On the other extreme, for historical reasons, some authors use the term “MDL criterion” to describe a very specific (and often not very successful) model selection criterion equivalent to BIC (see Chapter 17, Section 17.3).

As already indicated, we adopt the meaning of the term that is embraced in the survey (Barron, Rissanen, and Yu 1998), written by arguably the three most important contributors to the field: we use MDL for general inference based on universal models. Although we concentrate on hypothesis selection, model selection and prediction, this idea can be further extended to many other types of inductive inference. These include denoising (Rissanen 2000; Hansen and Yu 2000; Roos, Myllymäki, and Tirri 2005), similarity analysis and clustering (Kontkanen, Myllymäki, Buntine, Rissanen, and Tirri 2005), outlier detection and transduction (as defined in (Vapnik 1998)), and many others. In
such areas there has been less research and a “definitive” universal-model based MDL approach has not yet been formulated. We do expect, however, that such research will take place in the future: one of the main strengths of “MDL” in this broad sense is that it can be applied to ever more exotic modeling situations, in which the models do not resemble anything that is usually encountered in statistical practice. An example is the model of context-free grammars, already considered by Solomonoff (1964).

Another application of universal-model based MDL is the type of problem usually studied in statistical learning theory (Vapnik 1998); see also Chapter 17, Section 17.10. Here the goal is to directly learn functions (such as polynomials) to predict Y given X, without making any specific probabilistic assumptions about the noise. MDL has been developed in some detail for such problems, most notably classification problems, where Y takes its values in a finite set – spam filtering is a prototypical example; here X stands for an email message, and Y encodes whether or not it is spam. An example is the application of MDL to decision tree learning (Quinlan and Rivest 1989; Wallace and Patrick 1993; Mehta, Rissanen, and Agrawal 1995). Some MDL theory for such cases has been developed (Meir and Merhav 1995; Yamanishi 1998; Grünwald 1998), but the existing MDL methods in this area can behave suboptimally. This is explained in Chapter 17, Section 17.10.2. Although we certainly consider it a part of “refined” MDL, we do not consider this “nonprobabilistic” MDL further in this book, except in Section 17.10.2.

1.9.2 MDL Literature

Theoretical Contributions  There have been numerous contributors to refined MDL theory, but there are three researchers that I should mention explicitly: J. Rissanen, B. Yu and A. Barron, who jointly wrote (Barron, Rissanen, and Yu 1998). For example, most of the results that connect MDL to traditional statistics (including Theorem 15.1 and Theorem 15.3 in Chapter 15) are due to A. Barron. This book contains numerous references to their work. There is a close connection between MDL theory and work in universal coding ((Merhav and Feder 1998); see also Chapter 6) and universal prediction ((Cesa-Bianchi and Lugosi 2006); see also Chapter 17, Section 17.9).

Practical Contributions  There have been numerous practical applications of MDL. The only three applications we describe in detail are a crude MDL method for learning Markov chains (Chapter 5); a refined MDL method for
learning densities based on histograms (Chapter 13 and Chapter 15); and MDL regression (Chapter 12 and Chapter 14). Below we give a few representative examples of other applications and experimental results that have appeared in the literature. We warn the reader that this list is by no means complete! Hansen and Yu (2001) apply MDL to a variety of practical problems involving regression, clustering analysis, and time series analysis. In (Tabus, Rissanen, and Astola 2002; Tabus, Rissanen, and Astola 2003), MDL is used for classification problems arising in genomics. Lee (2002a,b) describes additive clustering with MDL. Other authors use MDL for image denoising, apply MDL to decision tree learning, and use MDL for sequential prediction. In (Myung, Pitt, Zhang, and Balasubramanian 2000; Myung, Balasubramanian, and Pitt 2000), MDL is applied to a variety of model selection problems arising in cognitive psychology. All these authors apply modern, “refined” versions of MDL. Some references to older work, in which “crude” (but often quite sensible) ad hoc codes are used, are (Friedman, Geiger, and Goldszmidt 1997; Allen and Greiner 2000; Allen, Madani, and Greiner 2003; Rissanen and Ristad 1994; Quinlan and Rivest 1989; Nowak and Figueiredo 2000; Liu and Moulin 1998; Ndili, Nowak, and Figueiredo 2001; Figueiredo, J. Leitão, and A. K. Jain 2000; Gao and Li 1989). In these papers, MDL is applied to learning Bayesian networks, grammar inference and language acquisition, learning decision trees, analysis of Poisson point processes (for biomedical imaging applications), image denoising, image segmentation, contour estimation, and Chinese handwritten character recognition, respectively. MDL has also been extensively studied in time-series analysis, both in theory (Hannan and Rissanen 1982; Gerencsér 1987; Wax 1988; Hannan, McDougall, and Poskitt 1989; Hemerly and Davis 1989b; Hemerly and Davis 1989a; Gerencsér 1994) and in practice (Wei 1992; Wagenmakers, Grünwald, and Steyvers 2006). Finally, we should note that there have been a number of applications, especially in natural language learning, which, although practically viable, have been primarily inspired by “idealized MDL” and Kolmogorov complexity, rather than by the Rissanen-Barron-Yu style of MDL that we consider here. These include (Adriaans and Jacobs 2006; Osborne 1999; Starkie 2001) and my own (Grünwald 1996).

Other Tutorials, Introductions and Overviews  The reader who prefers a shorter introduction to MDL than the present one may want to have a look at (Barron, Rissanen, and Yu 1998) (very theoretical and very comprehensive; presumes knowledge of information theory), (Hansen and Yu 2001) (presumes knowledge of statistics; describes several practical applications), (Lanterman 2001) (about comparing MDL, MML and asymptotic Bayesian approaches to model selection), or perhaps my own (Grünwald 2005), which is part of (Grünwald, Myung, and Pitt 2005), a “sourcebook” for MDL theory and applications that contains chapters by most of the main contributors to the field. Rissanen (1989, 2007) has written two books on MDL. While outdated as an introduction to MDL, the “little green book” (Rissanen 1989) is still very much worth reading for its clear exposition of the philosophy underlying MDL. (Rissanen 2007) contains a brief general introduction and then focuses on some recent research of Rissanen's, applying the renormalized maximum likelihood (RNML) distribution (Chapter 11) in regression and denoising, and formalizing the connection between MDL and Kolmogorov's structure function. In contrast to myself, Rissanen writes in accord with his own principle: while containing a lot of information, both texts are quite short.

1.10 Summary and Outlook

We have discussed the relationship between compression, regularity, and learning. We have given a first idea of what the MDL principle is all about, and of the kind of problems we can apply it to. In the next chapters, we present the mathematical background needed to describe such applications in detail.