

CORRECTED VERSION OF: IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO 4, APRIL 2005, 1523–1545

Clustering by Compression

Rudi Cilibrasi and Paul M.B. Vitányi

Abstract: We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the non-computable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics we present new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.

Index Terms: universal dissimilarity distance, normalized compression distance, hierarchical unsupervised clustering, quartet tree method, parameter-free data-mining, heterogeneous data analysis, Kolmogorov complexity

With respect to the version published in the IEEE Trans. Inform. Th., 51:4 (2005), 1523–1545, we have changed Definition 2.1 of "admissible distance," making it more general, and Definitions 2.4 and 2.5 of "normalized admissible distance," and have consequently adapted Lemma 2.6 (II.2) and, in its proof, (II.3) and the displayed inequalities. This left Theorem 6.3 unchanged except for changing "such that $d(x,y) \le e$" to "such that $d(x,y) \le e$ and $C(v) \le C(x)$."

Manuscript received xxx, 2003; revised xxx 2004. The experimental results in this paper were presented in part at the IEEE International Symposium on Information Theory, Yokohama, Japan, June 29-July 4, 2003. Rudi Cilibrasi is with the Centre for Mathematics and Computer Science (CWI). Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Email: Rudi.Cilibrasi@cwi.nl. Part of his work was supported by the Netherlands BSIK/BRICKS project, and by NWO project 612.55.002. Paul Vitányi is with the Centre for Mathematics and Computer Science (CWI), the University of Amsterdam, and National ICT of Australia. Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Email: Paul.Vitanyi@cwi.nl. Part of his work was done while the author was on Sabbatical leave at the National ICT of Australia, Sydney Laboratory at UNSW. He was supported in part by the EU project RESQ, IST–2001–37559, the NoE QUIPROCONE IST–1999–29064, the ESF QiT Programme, and the EU NoE PASCAL, the Netherlands BSIK/BRICKS project, and the KRR and SML&KA Programs of National ICT of Australia.

I. INTRODUCTION

All data are created equal but some data are more alike than others. We propose a method expressing this alikeness, using a new similarity metric based on compression. It is parameter-free in that it doesn't use any features or background knowledge about the data, and can without changes be applied to different areas and across area boundaries. It is universal in that it approximates the parameter expressing similarity of the dominant feature in all pairwise comparisons. It is robust in the sense that its success appears independent of the type of compressor used. The clustering we use is hierarchical clustering in dendrograms based on a new fast heuristic for the quartet method. The method is available as an open-source software tool. Below we explain the method and the theory underpinning it, and present evidence for its universality and robustness by experiments and results in a plethora of different areas using different types of compressors.

Feature-Based Similarities: We are presented with unknown data and the question is to determine the similarities among them and group like with like together. Commonly, the data are of a certain type: music files, transaction records of ATM machines, credit card applications, genomic data. In these data there are hidden relations that we would like to get out in the open. For example, from genomic data one can extract letter- or block frequencies (the blocks are over the four-letter alphabet); from music files one can extract various specific numerical features, related to pitch, rhythm, harmony etc. One can extract such features using for instance Fourier transforms [43] or wavelet transforms [17], to quantify parameters expressing similarity. The resulting vectors corresponding to the various files are then classified or clustered using existing classification software, based on various standard statistical pattern recognition classifiers [43], Bayesian classifiers [15], hidden Markov models [13], ensembles of nearest-neighbor classifiers [17] or neural networks [15], [39]. For example, in music one feature would be to look for rhythm in the sense of beats per minute. One can make a histogram where each histogram bin corresponds to a particular tempo in beats-per-minute and the associated peak shows how frequent and strong that particular periodicity was over the entire piece. In [43] we see a gradual change from a few high peaks to many low and spread-out ones going from hip-hop, rock, jazz, to classical. One can use this similarity type to try to cluster pieces in these categories. However, such a method requires specific and detailed knowledge of the problem area, since one needs to know what features to look for.

Non-Feature Similarities: Our aim is to capture, in a single similarity metric, every effective distance: effective versions of Hamming distance, Euclidean distance, edit distances, alignment distance, Lempel-Ziv distance [11], and so on. This metric should be so general that it works in every domain: music, text, literature, programs, genomes, executables, natural language determination, equally and simultaneously. It would be able to simultaneously detect all similarities between pieces
that other effective distances can detect separately.

Compression-based Similarity: Such a "universal" metric was co-developed by us in [29], [30], [31], as a normalized version of the "information metric" of [32], [4]. Roughly speaking, two objects are deemed close if we can significantly "compress" one given the information in the other, the idea being that if two pieces are more similar, then we can more succinctly describe one given the other. The mathematics used is based on Kolmogorov complexity theory [32]. In [31] we defined a new class of (possibly non-metric) distances, taking values in $[0,1]$ and appropriate for measuring effective similarity relations between sequences, say one type of similarity per distance, and vice versa. It was shown that an appropriately "normalized" information distance minorizes every distance in the class. It discovers all effective similarities in the sense that if two objects are close according to some effective similarity, then they are also close according to the normalized information distance. Put differently, the normalized information distance represents similarity according to the dominating shared feature between the two objects being compared. In comparisons of more than two objects, different pairs may have different dominating features. The normalized information distance is a metric and takes values in $[0,1]$; hence it may be called "the" similarity metric. To apply this ideal precise mathematical theory in real life, we have to replace the use of the noncomputable Kolmogorov complexity by an approximation using a standard real-world compressor. Earlier approaches resulted in the first completely automatic construction of the phylogeny tree based on whole mitochondrial genomes [29], [30], [31], a completely automatic construction of a language tree for over 50 Euro-Asian languages [31], the detection of plagiarism in student programming assignments [8], a phylogeny of chain letters [5], and the clustering of music [10]. Moreover, the method turns out to be robust under change of the underlying compressor types: statistical (PPMZ), Lempel-Ziv based dictionary (gzip), block based (bzip2), or special purpose (Gencompress).

Related Work: In view of the simplicity and naturalness of our proposal, it is perhaps surprising that compression-based clustering and classification approaches did not arise before. But recently there have been several partially independent proposals in that direction: [1], [2], on author attribution and building language trees, cite the earlier work [32], [4] but do not develop a theory based on information distance; they proceed instead by more ad hoc arguments related to the compressibility of a target file after first compressing a reference file. The better the target file compresses, the more we feel it is similar to the reference file in question. See also the explanation in Appendix I of [31]. This approach is used also to cluster music MIDI files by Kohonen maps in [33]. Another recent offshoot based on our work is hierarchical clustering based on mutual information [23]. In a related, but considerably simpler, feature-based approach, one can compare the word frequencies in text files to assess similarity. In [42] the word frequencies of words common to a pair of text files are used as entries in two vectors, and the similarity of the two files is based on the distance between those vectors. The authors attribute authorship to Shakespeare plays, the Federalist Papers, and the Chinese classic "The Dream of the Red Chamber." This approach based on block occurrence statistics is standard in genomics, but in an experiment reported in [31] it gives inferior phylogeny trees compared to our compression method (and wrong ones according to current biological wisdom). A related, opposite, approach was taken in [22], where literary texts are clustered by author gender or fact versus fiction, essentially by first identifying distinguishing features, like gender-dependent word usage, and then classifying according to those features. Apart from the experiments reported here, the clustering by compression method reported in this paper has recently been used to analyze network traffic and cluster computer worms and viruses [44]. Finally, recent work [20] reports experiments with our method on all time sequence data used in all the major data-mining conferences in the last decade. Comparing the compression method with all major methods used in those conferences, they established clear superiority of the compression method for clustering heterogeneous data, and for anomaly detection. See also the explanation in Appendix II of [31].

Outline: Here we propose a first comprehensive theory of real-world compressor-based normalized compression distance, a novel hierarchical clustering heuristic, together with several applications. First, we propose mathematical notions of "admissible distance" (using the term for a wider class than we did in [31]), "normalized admissible distance" or "similarity distance," "normal compressor," and "normalized compression distance." We then prove the normalized compression distance based on a normal compressor to be a similarity distance satisfying the metric (in)equalities. The normalized compression distance is shown to be quasi-universal in the sense that it minorizes every computable similarity distance up to an error that depends on the quality of the compressor's approximation of the true Kolmogorov complexities of the files concerned. This means that the NCD captures the dominant similarity over all possible features for every pair of objects compared, up to the stated precision. Note that different pairs of objects may have different dominant shared features. Next, we present a method of hierarchical clustering based on a novel fast randomized hill-climbing heuristic of a new quartet tree optimization criterion. Given a matrix of the pairwise similarity distances between the objects, we score how well the resulting tree represents the information in the distance matrix on a scale of 0 to 1. Then, as proof of principle, we run the program on three data sets, where we know what the final answer should be: (i) reconstruct a tree from a distance matrix obtained from a randomly generated tree; (ii) reconstruct a tree from files containing artificial similarities; and (iii) reconstruct a tree from natural files of heterogeneous data of vastly different types. To substantiate our claim of parameter-freeness and universality, we apply the method to different areas, not using any feature analysis at all. We first give an example in whole-genome phylogeny using the whole mitochondrial DNA of the species concerned. We compare the hierarchical clustering of our method with a more standard method of two-dimensional clustering (to show that our dendrogram method of depicting the clusters is more informative). We give a whole-genome phylogeny of fungi and compare this to results
using alignment of selected proteins (alignment being often too costly to perform on the whole-mitochondrial genome, but the disadvantage of protein selection being that different selections usually result in different phylogenies, so which is right?). We identify the viruses that are closest to the sequenced SARS virus; we give an example of clustering of language families; Russian authors in the original Russian, the same pieces in English translation (clustering partially follows the translators); clustering of music in MIDI format; clustering of handwritten digits used for optical character recognition; and clustering of radio observations of a mysterious astronomical object, a microquasar of extremely complex variability. In all these cases the method performs very well in the following sense: The method yields the phylogeny of 24 species agreeing with biological wisdom insofar as it is uncontroversial. The probability that it randomly would hit this outcome, or anything reasonably close, is very small. In clustering 36 music pieces taken in equal numbers from pop, jazz, and classical, so that 12-12-12 is the grouping we understand to be correct, we can identify convex clusters so that only six errors are made. (That is, if three items get dislodged without two of them being interchanged, then six items get misplaced.) The probability that this happens by chance is extremely small. The reason why we think the method does something remarkable is concisely put by Laplace [28]:

"If we seek a cause wherever we perceive symmetry, it is not that we regard the symmetrical event as less possible than the others, but, since this event ought to be the effect of a regular cause or that of chance, the first of these suppositions is more probable than the second. On a table we see letters arranged in this order Constantinople, and we judge that this arrangement is not the result of chance, not because it is less possible than others, for if this word were not employed in any language we would not suspect it came from any particular cause, but this word being in use among us, it is incomparably more probable that some person has thus arranged the aforesaid letters than that this arrangement is due to chance."

Materials and Methods: The data samples we used were obtained from standard databases accessible on the world-wide web, generated by ourselves, or obtained from research groups in the field of investigation. We supply the details with each experiment. The method of processing the data was the same in all experiments. First, we preprocessed the data samples to bring them into an appropriate format: the genomic material over the four-letter alphabet $\{A, T, G, C\}$ is recoded in a four-letter alphabet; the music MIDI files are stripped of identifying information such as composer and name of the music piece. Then, in all cases the data samples were completely automatically processed by our CompLearn Toolkit, rather than, as is usual in phylogeny, by using an eclectic set of software tools per experiment. Oblivious to the problem area concerned, simply using the distances according to the NCD below, the method described in this paper fully automatically classifies the objects concerned. The method has been released in the public domain as open-source software: The CompLearn Toolkit [9] is a suite of simple utilities that one can use to apply compression techniques to the process of discovering and learning patterns in completely different domains. In fact, this method is so general that it requires no background knowledge about any particular subject area. There are no domain-specific parameters to set, and only a handful of general settings.

The CompLearn Toolkit, using NCD and not, say, alignment, can cope with full genomes and other large data files and thus comes up with a single distance matrix. The clustering heuristic generates a tree with a certain confidence, called standardized benefit score or $S(T)$ value in the sequel. Generating trees from the same distance matrix many times resulted in the same tree in case of high $S(T)$ value, or a similar tree in case of moderately high $S(T)$ value, for all distance matrices we used, even though the heuristic is randomized. That is, there is only one way to be right, but increasingly many ways to be increasingly wrong which can all be realized by different runs of the randomized algorithm. This is a great difference with previous phylogeny methods, where because of computational limitations one uses only parts of the genome, or certain proteins that are viewed as significant [21]. These are run through a tree reconstruction method like neighbor joining [38], maximum likelihood, minimum evolution, maximum parsimony as in [21], or quartet hypercleaning [6], many times. The percentage-wise agreements on certain branches arising are called "bootstrap values." Trees are depicted with the best bootstrap values on the branches that are viewed as supporting the theory tested. Different choices of proteins result in different best trees. One way to avoid this ambiguity is to use the full genome [36], [31], leading to whole-genome phylogeny. With our method we do whole-genome phylogeny, and end up with a single overall best tree, not optimizing selected parts of it.

The quality of the results depends on (a) the NCD distance matrix, and (b) how well the hierarchical tree represents the information in the matrix. The quality of (b) is measured by the $S(T)$ value, and is given with each experiment. In general, the $S(T)$ value deteriorates for large sets. We believe this to be partially an artifact of a low-resolution NCD matrix due to limited compression power and limited file size. The main reason, however, is the fact that with increasing size of a natural data set the projection of the information in the NCD matrix into a binary tree necessarily gets increasingly distorted. Another aspect limiting the quality of the NCD matrix is more subtle. Recall that the method knows nothing about any of the areas we apply it to. It determines the dominant feature as seen through the NCD filter. The dominant feature of alikeness between two files may not correspond to our a priori conception but may have an unexpected cause. The results of our experiments suggest that this is not often the case: In the natural data sets where we have preconceptions of the outcome, for example that works by the same authors should cluster together, or music pieces by the same composers, musical genres, or genomes, the outcomes conform largely to our expectations. For example, in the music genre experiment the method would fail dramatically if genres were evenly
mixed, or mixed with little bias. However, to the contrary, the separation in clusters is almost perfect. The few misplacements that are discernible are due either to errors (the method was not powerful enough to discern the dominant feature), to the distortion caused by mapping multidimensional distances into tree distances, or to the dominant feature between a pair of music pieces being not the genre but some other aspect. The surprising news is that we can generally confirm expectations with few misplacements, indeed, that the data don't contain unknown rogue features that dominate to cause spurious (in our preconceived idea) clustering. This gives evidence that where the preconception is in doubt, like with phylogeny hypotheses, the clustering can give true support of one hypothesis against another one.

Figures: We use two styles to display the hierarchical clusters. In the case of genomics of Eutherian orders and fungi, and language trees, it is convenient to follow the dendrograms that are customary in that area (suggesting temporal evolution) for easy comparison with the literature. Although there is no temporal relation intended, the dendrogram representation looked also appropriate for the Russian writers, and translations of Russian writers. In the other experiments (even the genomic SARS experiment) it is more informative to display an unrooted ternary tree (or binary tree if we think about incoming and outgoing edges) with explicit internal nodes. This facilitates identification of clusters in terms of subtrees rooted at internal nodes or contiguous sets of subtrees rooted at branches of internal nodes.

II. SIMILARITY DISTANCE

We give a precise formal meaning to the loose distance notion of "degree of similarity" used in the pattern recognition literature.

A. Distance and Metric

Let $\Omega$ be a nonempty set and $\mathcal{R}^+$ be the set of nonnegative real numbers. A distance function on $\Omega$ is a function $D: \Omega \times \Omega \to \mathcal{R}^+$. It is a metric if it satisfies the metric (in)equalities: $D(x,y) = 0$ iff $x = y$, $D(x,y) = D(y,x)$ (symmetry), and $D(x,y) \le D(x,z) + D(z,y)$ (triangle inequality). The value $D(x,y)$ is called the distance between $x, y \in \Omega$. A familiar example of a distance that is also a metric is the Euclidean metric, the everyday distance $e(a,b)$ between two geographical objects $a, b$ expressed in, say, meters. Clearly, this distance satisfies the properties $e(a,a) = 0$, $e(a,b) = e(b,a)$, and $e(a,b) \le e(a,c) + e(c,b)$ (for instance, $a$ = Amsterdam, $b$ = Brussels, and $c$ = Chicago). We are interested in a particular type of distance, the "similarity distance", which we formally define in Definition 2.5. For example, if the objects are classical music pieces then the function $D$ defined by $D(a,b) = 0$ if $a$ and $b$ are by the same composer and $D(a,b) = 1$ otherwise, is a similarity distance that is also a metric. This metric captures only one similarity aspect (feature) of music pieces, presumably an important one that subsumes a conglomerate of more elementary features.

B. Admissible Distance

In defining a class of admissible distances (not necessarily metric distances) we want to exclude unrealistic ones like $f(x,y) = \frac{1}{2}$ for every pair $x \ne y$. We do this by restricting the number of objects within a given distance of an object. As in [4] we do this by only considering effective distances, as follows. Fix a suitable programming language, which remains fixed for the remainder of the paper. This is the reference programming language.

Definition 2.1: Let $\Omega = \Sigma^*$, with $\Sigma$ a finite nonempty alphabet and $\Sigma^*$ the set of finite strings over that alphabet. Since every finite alphabet can be recoded in binary, we choose $\Sigma = \{0,1\}$. In particular, "files" in computer memory are finite binary strings. A function $D: \Omega \times \Omega \to \mathcal{R}^+$ is an admissible distance if for every pair of objects $x, y \in \Omega$ the distance $D(x,y)$ satisfies the density condition

$$\sum_{y} 2^{-D(x,y)} \le 1, \tag{II.1}$$

is computable, and is symmetric, $D(x,y) = D(y,x)$.

If $D$ is an admissible distance, then for every $x$ the set $\{D(x,y) : y \in \{0,1\}^*\}$ is the length set of a prefix code, since it satisfies (II.1), the Kraft inequality. Conversely, if a distance is the length set of a prefix code, then it satisfies (II.1), see [12].

Example 2.2: In representing the Hamming distance $d$ between two strings of equal length $n$ differing in positions $i_1, \ldots, i_d$, we can use a simple prefix-free encoding of $(n, d, i_1, \ldots, i_d)$ in $2\log n + 4\log\log n + 2 + d\log n$ bits. We encode $n$ and $d$ prefix-free in $\log n + 2\log\log n + 1$ bits each, see e.g. [32], and then the literal indexes of the actual flipped-bit positions. Adding an $O(1)$-bit program to interpret these data, with the strings concerned being $x$ and $y$, we have defined $H_n(x,y) = 2\log n + 4\log\log n + d\log n + O(1)$ as the length of a prefix code word (prefix program) to compute $x$ from $y$ and vice versa. Then, by the Kraft inequality, $\sum_{y} 2^{-H_n(x,y)} \le 1$. It is easy to verify that $H_n$ is a metric in the sense that it satisfies the metric (in)equalities up to $O(\log n)$ additive precision. ♦

C. Normalized Admissible Distance

Large objects (in the sense of long strings) that differ by a tiny part are intuitively closer than tiny objects that differ by the same amount. For example, two whole mitochondrial genomes of 18,000 bases that differ by 9,000 are very different, while two whole nuclear genomes of $3 \times 10^9$ bases that differ by only 9,000 bases are very similar. Thus, absolute difference between two objects doesn't govern similarity, but relative difference appears to do so.

Definition 2.3: A compressor is a lossless encoder mapping $\Omega$ into $\{0,1\}^*$ such that the resulting code is a prefix code. "Lossless" means that there is a decompressor that reconstructs the source message from the code message. For convenience of notation we identify "compressor" with a "code word length function" $C: \Omega \to \mathcal{N}$, where $\mathcal{N}$ is the set of nonnegative integers. That is, the compressed version of a file $x$ has length $C(x)$. We only consider compressors such that $C(x) \le |x| + O(\log |x|)$. (The additive logarithmic term is due to our requirement that the compressed file be a prefix code word.) We fix a compressor $C$, and call the fixed compressor the reference compressor.

Definition 2.4: Let $D$ be an admissible distance. Then $D^+(x)$ is defined by $D^+(x) = \max\{D(x,z) : C(z) \le C(x)\}$, and $D^+(x,y)$ is defined by $D^+(x,y) = \max\{D^+(x), D^+(y)\}$. Note that since $D(x,y) = D(y,x)$, also $D^+(x,y) = D^+(y,x)$.

Definition 2.5: Let $D$ be an admissible distance. The normalized admissible distance, also called a similarity distance, $d(x,y)$, based on $D$ relative to a reference compressor $C$, is defined by

$$d(x,y) = \frac{D(x,y)}{D^+(x,y)}.$$

It follows from the definitions that a normalized admissible distance is a function $d: \Omega \times \Omega \to [0,1]$ that is symmetric: $d(x,y) = d(y,x)$.

Lemma 2.6: For every $x \in \Omega$, and constant $e \in [0,1]$, a normalized admissible distance satisfies the density constraint

$$|\{y : d(x,y) \le e,\ C(y) \le C(x)\}| < 2^{eD^+(x)+1}. \tag{II.2}$$

Proof: Assume to the contrary that $d$ does not satisfy (II.2). Then, there is an $e \in [0,1]$ and an $x \in \Omega$, such that (II.2) is false. We first note that, since $D(x,y)$ is an admissible distance that satisfies (II.1), $d(x,y)$ satisfies a "normalized" version of the Kraft inequality:

$$\sum_{y : C(y) \le C(x)} 2^{-d(x,y)D^+(x)} \le \sum_{y} 2^{-d(x,y)D^+(x,y)} \le 1. \tag{II.3}$$

Starting from (II.3) we obtain the required contradiction:

$$1 \ge \sum_{y : C(y) \le C(x)} 2^{-d(x,y)D^+(x)} \ge \sum_{y : d(x,y) \le e,\ C(y) \le C(x)} 2^{-eD^+(x)} \ge 2^{eD^+(x)+1}\, 2^{-eD^+(x)} > 1.$$

If $d(x,y)$ is the normalized version of an admissible distance $D(x,y)$ then (II.3) is equivalent to (II.1). We call a normalized distance a "similarity" distance, because it gives a relative similarity according to the distance (with distance 0 when objects are maximally similar and distance 1 when they are maximally dissimilar) and, conversely, for every well-defined computable notion of similarity we can express it as a metric distance according to our definition. In the literature a distance that expresses lack of similarity (like ours) is often called a "dissimilarity" distance or a "disparity" distance.

Remark 2.7: As far as the authors know, the idea of a normalized metric is, surprisingly, not well-studied. An exception is [41], which investigates normalized metrics to account for relative distances rather than absolute ones, and it does so for much the same reasons as in the present work. An example there is the normalized Euclidean metric $|x - y| / (|x| + |y|)$, where $x, y \in \mathcal{R}^n$ ($\mathcal{R}$ denotes the real numbers) and $|\cdot|$ is the Euclidean metric (the $L_2$ norm). Another example is a normalized symmetric-set-difference metric. But these normalized metrics are not necessarily effective in that the distance between two objects gives the length of an effective description to go from either object to the other one. ♦

Remark 2.8: Our definition of normalized admissible distance is more direct than in [31], and the density constraints (II.2) and (II.3) follow from the definition. In [31] we put a stricter density condition in the definition of "admissible" normalized distance, which is, however, harder to satisfy and maybe too strict to be realistic. The purpose of this stricter density condition was to obtain a stronger "universality" property than the present Theorem 6.3, namely one with $\alpha = 1$ and $\epsilon = O(1/\max\{C(x), C(y)\})$. Nonetheless, both definitions coincide if we set the length of the compressed version $C(x)$ of $x$ to the ultimate compressed length $K(x)$, the Kolmogorov complexity of $x$. ♦

Example 2.9: To obtain a normalized version of the Hamming distance of Example 2.2, we define $h_n(x,y) = H_n(x,y) / H_n^+(x,y)$. We can set $H_n^+(x,y) = H_n^+(x) = (n+2)\lceil \log n \rceil + 4\lceil \log\log n \rceil + O(1)$ since every contemplated compressor $C$ will satisfy $C(x) = C(\bar{x})$, where $\bar{x}$ is $x$ with all bits flipped (so $H_n^+(x,y) \ge H_n^+(z, \bar{z})$ for either $z = x$ or $z = y$). By (II.2), for every $x$, the number of $y$ with $C(y) \le C(x)$ in the Hamming ball $h_n(x,y) \le e$ is less than $2^{eH_n^+(x)+1}$. This upper bound is an obvious overestimate for $e \ge 1/\log n$. For lower values of $e$, the upper bound is correct by the observation that the number of $y$'s equals $\sum_{i=0}^{en} \binom{n}{i} \le 2^{nH(e)}$, where $H(e) = -e\log e - (1-e)\log(1-e)$ is Shannon's entropy function. Then, $eH_n^+(x) > en\log n > nH(e)$ since $e\log n > H(e)$. ♦

III. NORMAL COMPRESSOR

We give axioms determining a large family of compressors that both include most (if not all) real-world compressors and ensure the desired properties of the NCD to be defined later.

Definition 3.1: A compressor $C$ is normal if it satisfies, up to an additive $O(\log n)$ term, with $n$ the maximal binary length of an element of $\Omega$ involved in the (in)equality concerned, the following:
1) Idempotency: $C(xx) = C(x)$, and $C(\lambda) = 0$, where $\lambda$ is the empty string.
2) Monotonicity: $C(xy) \ge C(x)$.
3) Symmetry: $C(xy) = C(yx)$.
4) Distributivity: $C(xy) + C(z) \le C(xz) + C(yz)$.

Idempotency: A reasonable compressor will see exact repetitions and obey idempotency up to the required precision. It will also compress the empty string to the empty string.

Monotonicity: A real compressor must have the monotonicity property, at least up to the required precision. The property is evident for stream-based compressors, and only slightly less evident for block-coding compressors.

Symmetry: Stream-based compressors of the Lempel-Ziv family, like gzip and pkzip, and the predictive PPM family, like PPMZ, are possibly not precisely symmetric. This is related to the stream-based property: the initial file $x$ may have regularities to which the compressor adapts; after crossing the border to $y$ it must unlearn those regularities and adapt to the ones of $y$. This process may cause some imprecision in symmetry that vanishes asymptotically with the length of
$x, y$. A compressor must be poor indeed (and will certainly not be used to any extent) if it doesn't satisfy symmetry up to the required precision. Apart from stream-based, the other major family of compressors is block-coding based, like bzip2. These essentially analyze the full input block by considering all rotations in obtaining the compressed version. This is to a great extent symmetrical, and real experiments show no departure from symmetry.

Distributivity: The distributivity property is not immediately intuitive. In Kolmogorov complexity theory the stronger distributivity property

$$C(xyz) + C(z) \le C(xz) + C(yz) \tag{III.1}$$

holds (with $K = C$). However, to prove the desired properties of NCD below, only the weaker distributivity property

$$C(xy) + C(z) \le C(xz) + C(yz) \tag{III.2}$$

above is required, also for the boundary case where $C = K$. In practice, real-world compressors appear to satisfy this weaker distributivity property up to the required precision.

Definition 3.2: Define

$$C(y|x) = C(xy) - C(x). \tag{III.3}$$

This number $C(y|x)$ of bits of information in $y$, relative to $x$, can be viewed as the excess number of bits in the compressed version of $xy$ compared to the compressed version of $x$, and is called the amount of conditional compressed information. In the definition of compressor the decompression algorithm is not included (unlike the case of Kolmogorov complexity, where the decompressing algorithm is given by definition), but it is easy to construct one: Given the compressed version of $x$ in $C(x)$ bits, we can run the compressor on all candidate strings $z$, for example in length-increasing lexicographical order, until we find the string $z_0$ whose compressed version equals the given compressed version of $x$. Since the latter decompresses to $x$, we have found $x = z_0$. Given the compressed version of $xy$ in $C(xy)$ bits, we repeat this process using strings $xz$ until we find the string $xz_1$ of which the compressed version equals the compressed version of $xy$. Since the former compressed version decompresses to $xy$, we have found $y = z_1$. By the unique decompression property we find that $C(y|x)$ is the extra number of bits we require to describe $y$ apart from describing $x$. It is intuitively acceptable that the conditional compressed information $C(x|y)$ satisfies the triangle inequality

$$C(x|y) \le C(x|z) + C(z|y). \tag{III.4}$$

Lemma 3.3: Both (III.1) and (III.4) imply (III.2).

Proof: ((III.1) implies (III.2):) By monotonicity.
((III.4) implies (III.2):) Rewrite the terms in (III.4) according to (III.3), cancel $C(y)$ in the left- and right-hand sides, use symmetry, and rearrange.

Lemma 3.4: A normal compressor satisfies additionally subadditivity: $C(xy) \le C(x) + C(y)$.

Proof: Consider the special case of distributivity with $z$ the empty word so that $xz = x$, $yz = y$, and $C(z) = 0$.

Subadditivity: The subadditivity property is clearly also required for every viable compressor, since a compressor may use information acquired from $x$ to compress $y$. Minor imprecision may arise from the unlearning effect of crossing the border between $x$ and $y$, mentioned in relation to symmetry, but again this must vanish asymptotically with increasing length of $x, y$.

IV. BACKGROUND IN KOLMOGOROV COMPLEXITY

Technically, the Kolmogorov complexity of $x$ given $y$ is the length of the shortest binary program, for the reference universal prefix Turing machine, that on input $y$ outputs $x$; it is denoted as $K(x|y)$. For precise definitions, theory and applications, see [32]. The Kolmogorov complexity of $x$ is the length of the shortest binary program with no input that outputs $x$; it is denoted as $K(x) = K(x|\lambda)$ where $\lambda$ denotes the empty input. Essentially, the Kolmogorov complexity of a file is the length of the ultimate compressed version of the file. In [4] the information distance $E(x,y)$ was introduced, defined as the length of the shortest binary program for the reference universal prefix Turing machine that, with input $x$ computes $y$, and with input $y$ computes $x$. It was shown there that, up to an additive logarithmic term, $E(x,y) = \max\{K(x|y), K(y|x)\}$. It was shown also that $E(x,y)$ is a metric, up to negligible violations of the metric inequalities. Moreover, it is universal in the sense that for every admissible distance $D(x,y)$ as in Definition 2.1, $E(x,y) \le D(x,y)$ up to an additive constant depending on $D$ but not on $x$ and $y$. In [31], the normalized version of $E(x,y)$, called the normalized information distance, is defined as

$$\mathrm{NID}(x,y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}}. \tag{IV.1}$$

It too is a metric, and it is universal in the sense that this single metric minorizes, up to a negligible additive error term, all normalized admissible distances in the class considered in [31]. Thus, if two files (of whatever type) are similar (that is, close) according to the particular feature described by a particular normalized admissible distance (not necessarily metric), then they are also similar (that is, close) in the sense of the normalized information metric. This justifies calling the latter the similarity metric. We stress once more that different pairs of objects may have different dominating features. Yet every such dominant similarity is detected by the NID. However, this metric is based on the notion of Kolmogorov complexity. Unfortunately, the Kolmogorov complexity is non-computable in the Turing sense. Approximation of the denominator of (IV.1) by a given compressor $C$ is straightforward: it is $\max\{C(x), C(y)\}$. The numerator is more tricky. It can be rewritten as

$$\max\{K(x,y) - K(x),\ K(x,y) - K(y)\}, \tag{IV.2}$$

within logarithmic additive precision, by the additive property of Kolmogorov complexity [32]. The term $K(x,y)$ represents the length of the shortest program for the pair $(x,y)$. In compression practice it is easier to deal with the concatenation $xy$ or $yx$. Again, within logarithmic precision $K(x,y) = K(xy) = K(yx)$. Following a suggestion by Steven de Rooij, one can approximate (IV.2) best by $\min\{C(xy), C(yx)\} - \min\{C(x), C(y)\}$. Here, and in the later experiments using
the CompLearn Toolkit [9], we simply use $C(xy)$ rather than $\min\{C(xy), C(yx)\}$. This is justified by the observation that block-coding based compressors are symmetric almost by definition, and experiments with various stream-based compressors (gzip, PPMZ) show only small deviations from symmetry.

The result of approximating the NID using a real compressor $C$ is called the normalized compression distance (NCD), formally defined in (VI.1). The theory as developed for the Kolmogorov-complexity based NID in [31] may not hold for the (possibly poorly) approximating NCD. It is nonetheless the case that experiments show that the NCD apparently has (some) properties that make the NID so appealing. To fill this gap between theory and practice, we develop the theory of NCD from first principles, based on the axiomatics of Section III. We show that the NCD is a quasi-universal similarity metric relative to a normal reference compressor $C$. The theory developed in [31] is the boundary case $C = K$, where the "quasi-universality" below has become full "universality".

V. COMPRESSION DISTANCE

We define a compression distance based on a normal compressor and show it is an admissible distance. In applying the approach, we have to make do with an approximation based on a far less powerful real-world reference compressor $C$. A compressor $C$ approximates the information distance $E(x,y)$, based on Kolmogorov complexity, by the compression distance $E_C(x,y)$ defined as

$$E_C(x,y) = C(xy) - \min\{C(x), C(y)\}. \qquad \text{(V.1)}$$

Here, $C(xy)$ denotes the compressed size of the concatenation of $x$ and $y$, $C(x)$ denotes the compressed size of $x$, and $C(y)$ denotes the compressed size of $y$.

Lemma 5.1: If $C$ is a normal compressor, then $E_C(x,y) + O(1)$ is an admissible distance.

Proof: Case 1: Assume $C(x) \le C(y)$. Then $E_C(x,y) = C(xy) - C(x)$. Then, given $x$ and a prefix-program of length $E_C(x,y)$ consisting of the suffix of the $C$-compressed version of $xy$, and the compressor $C$ in $O(1)$ bits, we can run the compressor $C$ on all $xz$'s, the candidate strings $z$ in length-increasing lexicographical order. When we find a $z$ so that the suffix of the compressed version of $xz$ matches the given suffix, then $z = y$ by the unique decompression property. Case 2: Assume $C(y) \le C(x)$. By symmetry $C(xy) = C(yx)$. Now follow the proof of Case 1.

Lemma 5.2: If $C$ is a normal compressor, then $E_C(x,y)$ satisfies the metric (in)equalities up to logarithmic additive precision.

Proof: Only the triangular inequality is non-obvious. By (III.2) $C(xy) + C(z) \le C(xz) + C(yz)$ up to logarithmic additive precision. There are six possibilities, and we verify the correctness of the triangular inequality in turn for each of them.

Assume $C(x) \le C(y) \le C(z)$: Then $C(xy) - C(x) \le C(xz) - C(x) + C(yz) - C(y)$.
Assume $C(y) \le C(x) \le C(z)$: Then $C(xy) - C(y) \le C(xz) - C(y) + C(yz) - C(x)$.
Assume $C(x) \le C(z) \le C(y)$: Then $C(xy) - C(x) \le C(xz) - C(x) + C(yz) - C(z)$.
Assume $C(y) \le C(z) \le C(x)$: Then $C(xy) - C(y) \le C(xz) - C(z) + C(yz) - C(y)$.
Assume $C(z) \le C(x) \le C(y)$: Then $C(xy) - C(x) \le C(xz) - C(z) + C(yz) - C(z)$.
Assume $C(z) \le C(y) \le C(x)$: Then $C(xy) - C(y) \le C(xz) - C(z) + C(yz) - C(z)$.

Lemma 5.3: If $C$ is a normal compressor, then $E_C^+(x,y) = \max\{C(x), C(y)\}$.

Proof: Consider a pair $(x,y)$. The $\max\{C(xz) - C(z) : C(z) \le C(y)\}$ is $C(x)$, which is achieved for $z = \lambda$, the empty word, with $C(\lambda) = 0$. Similarly, the $\max\{C(yz) - C(z) : C(z) \le C(x)\}$ is $C(y)$. Hence the lemma.

VI. NORMALIZED COMPRESSION DISTANCE

The normalized version of the admissible distance $E_C(x,y)$, the compressor $C$ based approximation of the normalized information distance (IV.1), is called the normalized compression distance or NCD:

$$\mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}. \qquad \text{(VI.1)}$$

This NCD is the main concept of this work. It is the real-world version of the ideal notion of normalized information distance NID in (IV.1).

Remark 6.1: In practice, the NCD is a non-negative number $0 \le r \le 1 + \epsilon$ representing how different the two files are. Smaller numbers represent more similar files. The $\epsilon$ in the upper bound is due to imperfections in our compression techniques, but for most standard compression algorithms one is unlikely to see an $\epsilon$ above 0.1 (in our experiments gzip and bzip2 achieved NCD's above 1, but PPMZ always had NCD at most 1). ♦

There is a natural interpretation to $\mathrm{NCD}(x,y)$: If, say, $C(y) \ge C(x)$ then we can rewrite

$$\mathrm{NCD}(x,y) = \frac{C(xy) - C(x)}{C(y)}.$$

That is, the distance $\mathrm{NCD}(x,y)$ between $x$ and $y$ is the improvement due to compressing $y$ using $x$ as previously compressed "database," and compressing $y$ from scratch, expressed as the ratio between the bit-wise length of the two compressed versions. Relative to the reference compressor we can define the information in $x$ about $y$ as $C(y) - C(y|x)$. Then, using (III.3),

$$\mathrm{NCD}(x,y) = 1 - \frac{C(y) - C(y|x)}{C(y)}.$$

That is, the NCD between $x$ and $y$ is 1 minus the ratio of the information $x$ about $y$ and the information in $y$.

Theorem 6.2: If the compressor is normal, then the NCD is a normalized admissible distance satisfying the metric (in)equalities, that is, a similarity metric.

Proof: If the compressor is normal, then by Lemma 5.1 and Lemma 5.3, the NCD is a normalized admissible distance. It remains to show it satisfies the three metric (in)equalities.

1) By idempotency we have $\mathrm{NCD}(x,x) = 0$. By monotonicity we have $\mathrm{NCD}(x,y) \ge 0$ for every $x, y$, with inequality for $y \ne x$.

2) $\mathrm{NCD}(x,y) = \mathrm{NCD}(y,x)$. The NCD is unchanged by interchanging $x$ and $y$ in (VI.1).

3) The difficult property is the triangle inequality. Without loss of generality we assume $C(x) \le C(y) \le C(z)$. Since the NCD is symmetrical, there are only three triangle inequalities that can be expressed by $\mathrm{NCD}(x,y)$, $\mathrm{NCD}(x,z)$, $\mathrm{NCD}(y,z)$. We verify them in turn:

a) $\mathrm{NCD}(x,y) \le \mathrm{NCD}(x,z) + \mathrm{NCD}(z,y)$: By distributivity, the compressor itself satisfies $C(xy) + C(z) \le C(xz) + C(zy)$. Subtracting $C(x)$ from both sides and rewriting, $C(xy) - C(x) \le C(xz) - C(x) + C(zy) - C(z)$. Dividing by $C(y)$ on both sides we find

$$\frac{C(xy) - C(x)}{C(y)} \le \frac{C(xz) - C(x) + C(zy) - C(z)}{C(y)}.$$

The left-hand side is $\le 1$.

i) Assume the right-hand side is $\le 1$. Setting $C(z) = C(y) + \Delta$, and adding $\Delta$ to both the numerator and denominator of the right-hand side, it can only increase and draw closer to 1. Therefore,

$$\frac{C(xy) - C(x)}{C(y)} \le \frac{C(xz) - C(x) + C(zy) - C(z) + \Delta}{C(y) + \Delta} = \frac{C(zx) - C(x)}{C(z)} + \frac{C(zy) - C(y)}{C(z)},$$

which was what we had to prove.

ii) Assume the right-hand side is $> 1$. We proceed as in the previous case, and add $\Delta$ to both numerator and denominator. Although now the right-hand side decreases, it must still be greater than 1, and therefore the right-hand side remains at least as large as the left-hand side.

b) $\mathrm{NCD}(x,z) \le \mathrm{NCD}(x,y) + \mathrm{NCD}(y,z)$: By distributivity we have $C(xz) + C(y) \le C(xy) + C(yz)$. Subtracting $C(x)$ from both sides, rearranging, and dividing both sides by $C(z)$ we obtain

$$\frac{C(xz) - C(x)}{C(z)} \le \frac{C(xy) - C(x)}{C(z)} + \frac{C(yz) - C(y)}{C(z)}.$$

The right-hand side doesn't decrease when we substitute $C(y)$ for the denominator $C(z)$ of the first term, since $C(y) \le C(z)$. Therefore, the inequality stays valid under this substitution, which was what we had to prove.

c) $\mathrm{NCD}(y,z) \le \mathrm{NCD}(y,x) + \mathrm{NCD}(x,z)$: By distributivity we have $C(yz) + C(x) \le C(yx) + C(xz)$. Subtracting $C(y)$ from both sides, using symmetry, rearranging, and dividing both sides by $C(z)$ we obtain

$$\frac{C(yz) - C(y)}{C(z)} \le \frac{C(xy) - C(x)}{C(z)} + \frac{C(xz) - C(y)}{C(z)}.$$

The right-hand side doesn't decrease when we substitute $C(y)$ for the denominator $C(z)$ of the first term, since $C(y) \le C(z)$, and the second term is at most $\mathrm{NCD}(x,z)$ because $C(x) \le C(y)$. Therefore, the inequality stays valid under this substitution, which was what we had to prove.

Quasi-Universality: We now digress to the theory developed in [31], which formed the motivation for developing the NCD. If, instead of the result of some real compressor, we substitute the Kolmogorov complexity for the lengths of the compressed files in the NCD formula, the result is the NID as in (IV.1). It is universal in the following sense: Every admissible distance expressing similarity according to some feature, that can be computed from the objects concerned, is comprised (in the sense of minorized) by the NID. Note that every feature of the data gives rise to a similarity, and, conversely, every similarity can be thought of as expressing some feature: being similar in that sense. Our actual practice in using the NCD falls short of this ideal theory in at least three respects:

(i) The claimed universality of the NID holds only for indefinitely long sequences $x, y$. Once we consider strings $x, y$ of definite length $n$, it is only universal with respect to "simple" computable normalized admissible distances, where "simple" means that they are computable by programs of length, say, logarithmic in $n$. This reflects the fact that, technically speaking, the universality is achieved by summing the weighted contribution of all similarity distances in the class considered with respect to the objects considered. Only similarity distances of which the complexity is small (which means that the weight is large), with respect to the size of the data concerned, kick in.

(ii) The Kolmogorov complexity is not computable, and it is in principle impossible to compute how far off the NCD is from the NID. So we cannot in general know how well we are doing using the NCD.

(iii) To approximate the NCD we use standard compression programs like gzip, PPMZ, and bzip2. While better compression of a string will always approximate the Kolmogorov complexity better, this may not be true for the NCD. Due to its arithmetic form, subtraction and division, it is theoretically possible that while all items in the formula get better compressed, the improvement is not the same for all items, and the NCD value moves away from the NID value. In our experiments we have not observed this behavior in a noticeable fashion. Formally, we can state the following:

Theorem 6.3: Let $d$ be a computable normalized admissible distance and $C$ be a normal compressor. Then, $\mathrm{NCD}(x,y) \le \alpha\, d(x,y) + \epsilon$, where for $C(x) \ge C(y)$, we have $\alpha = D^+(x)/C(x)$ and $\epsilon = (C(x|y) - K(x|y))/C(x)$, with $C(x|y)$ according to (III.3).

Proof: Fix $d, C, x, y$ in the statement of the theorem. Since the NCD is symmetrical, we can, without loss of generality, let $C(x) \ge C(y)$. By (III.3) and the symmetry property $C(xy) = C(yx)$ we have $C(x|y) \ge C(y|x)$. Therefore, $\mathrm{NCD}(x,y) = C(x|y)/C(x)$. Let $d(x,y)$ be the normalized version of the admissible distance $D(x,y)$; that is, $d(x,y) = D(x,y)/D^+(x,y)$. Let $d(x,y) = e$. By (II.2), there are at most $2^{eD^+(x)+1}$ many $(x,v)$ pairs, such that $d(x,v) \le e$ and $C(y) \le C(x)$. Since $d$ is computable, we can compute and enumerate all these pairs. The initially fixed pair $(x,y)$ is an element in the list and its index takes $\le eD^+(x) + 1$ bits. Therefore, given $x$, the $y$ can be described by at most $eD^+(x) + O(1)$ bits: its index in the list and an $O(1)$ term accounting for the lengths of the programs involved in reconstructing $y$ given its index in the list, and algorithms to compute functions $d$ and $C$. Since the Kolmogorov complexity gives the length of the shortest effective description, we have $K(y|x) \le eD^+(x) + O(1)$. Substitution, rewriting, and using $K(x|y) \le E(x,y) \le D(x,y)$ up to ignorable additive terms (Section IV), yields $\mathrm{NCD}(x,y) = C(x|y)/C(x) \le$
$\alpha e + \epsilon$, which was what we had to prove.

Remark 6.4: Clustering according to NCD will group sequences together that are similar according to features that are not explicitly known to us. Analysis of what the compressor actually does still may not tell us which features that make sense to us can be expressed by conglomerates of features analyzed by the compressor. This can be exploited to track down unknown features implicitly in classification: forming automatically clusters of data and seeing in which cluster (if any) a new candidate is placed.

Another aspect that can be exploited is exploratory: Given that the NCD is small for a pair $x, y$ of specific sequences, what does this really say about the sense in which these two sequences are similar? The above analysis suggests that close similarity will be due to a dominating feature (that perhaps expresses a conglomerate of subfeatures). Looking into these deeper causes may give feedback about the appropriateness of the realized NCD distances and may help extract more intrinsic information about the objects than the oblivious division into clusters, by looking for the common features in the data clusters. ♦

VII. CLUSTERING

Given a set of objects, the pairwise NCD's form the entries of a distance matrix. This distance matrix contains the pairwise relations in raw form. But in this format that information is not easily usable. Just as the distance matrix is a reduced form of information representing the original data set, we now need to reduce the information even further in order to achieve a cognitively acceptable format like data clusters. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) that agrees with the distance matrix according to a cost measure. This allows us to extract more information from the data than just flat clustering (determining disjoint clusters in dimensional representation).

Clusters are groups of objects that are similar according to our metric. There are various ways to cluster. Our aim is to analyze data sets for which the number of clusters is not known a priori, and the data are not labeled. As stated in [16], conceptually simple, hierarchical clustering is among the best known unsupervised methods in this setting, and the most natural way is to represent the relations in the form of a dendrogram, which is customarily a directed binary tree or undirected ternary tree. To construct the tree from a distance matrix with entries consisting of the pairwise distances between objects, we use a quartet method. This is a matter of choice only; other methods may work equally well. The distances we compute in our experiments are often within the range 0.85 to 1.1. That is, the distinguishing features are small, and we need a sensitive method to extract as much information contained in the distance matrix as is possible. For example, our experiments showed that reconstructing a minimum spanning tree is not sensitive enough and gives poor results. With increasing number of data items, the projection of the NCD matrix information into the tree representation format gets increasingly distorted. A similar situation arises in using alignment cost in genomic comparisons. Experience shows that in both cases the hierarchical clustering methods seem to work best for small sets of data, up to 25 items, and to deteriorate for larger sets, say 40 items or more. A standard solution to hierarchically cluster larger sets of data is to first cluster nonhierarchically, by say multidimensional scaling or k-means, available in standard packages, for instance Matlab, and then apply hierarchical clustering on the emerging clusters.

The quartet method: We consider every group of four elements from our set of $n$ elements; there are $\binom{n}{4}$ such groups. From each group $u, v, w, x$ we construct a tree of arity 3, which implies that the tree consists of two subtrees of two leaves each. Let us call such a tree a quartet topology. There are three possibilities denoted (i) $uv|wx$, (ii) $uw|vx$, and (iii) $ux|vw$, where a vertical bar divides the two pairs of leaf nodes into two disjoint subtrees (Figure 1).

Fig. 1. The three possible quartet topologies for the set of leaf labels u, v, w, x.

For any given tree $T$ and any group of four leaf labels $u, v, w, x$, we say $T$ is consistent with $uv|wx$ if and only if the path from $u$ to $v$ does not cross the path from $w$ to $x$. Note that exactly one of the three possible quartet topologies for any set of 4 labels must be consistent for any given tree. We may think of a large tree having many smaller quartet topologies embedded within its structure. Commonly the goal in the quartet method is to find (or approximate as closely as possible) the tree that embeds the maximal number of
consistent (possibly weighted) quartet topologies from a given set $Q$ of quartet topologies [19] (Figure 2). This is called the (weighted) Maximum Quartet Consistency (MQC) problem.

Fig. 2. An example tree consistent with quartet topology uv|wx.

We propose a new optimization problem: the Minimum Quartet Tree Cost (MQTC), as follows: The cost of a quartet topology is defined as the sum of the distances between each pair of neighbors; that is, $C_{uv|wx} = d(u,v) + d(w,x)$. The total cost $C_T$ of a tree $T$ with a set $N$ of leaves (external nodes of degree 1) is defined as

$$C_T = \sum_{\{u,v,w,x\} \subseteq N} \{C_{uv|wx} : T \text{ is consistent with } uv|wx\},$$

the sum of the costs of all its consistent quartet topologies. First, we generate a list of all possible quartet topologies for all four-tuples of labels under consideration. For each group of three possible quartet topologies for a given set of four labels $u, v, w, x$, calculate a best (minimal) cost $m(u,v,w,x) = \min\{C_{uv|wx}, C_{uw|vx}, C_{ux|vw}\}$, and a worst (maximal) cost $M(u,v,w,x) = \max\{C_{uv|wx}, C_{uw|vx}, C_{ux|vw}\}$. Summing all best quartet topologies yields the best (minimal) cost $m = \sum_{\{u,v,w,x\} \subseteq N} m(u,v,w,x)$. Conversely, summing all worst quartet topologies yields the worst (maximal) cost $M = \sum_{\{u,v,w,x\} \subseteq N} M(u,v,w,x)$. For some distance matrices, these minimal and maximal values cannot be attained by actual trees; however, the score $C_T$ of every tree $T$ will lie between these two values. In order to be able to compare tree scores in a more uniform way, we now rescale the score linearly such that the worst score maps to 0, and the best score maps to 1, and term this the normalized tree benefit score $S(T) = (M - C_T)/(M - m)$. Our goal is to find a full tree with a maximum value of $S(T)$, which is to say, the lowest total cost.

To express the notion of computational difficulty one uses the notion of "nondeterministic polynomial time (NP)". If a problem concerning $n$ objects is NP-hard this means that the best known algorithm for this (and a wide class of significant problems) requires computation time exponential in $n$. That is, it is infeasible in practice. The MQC decision problem is the following: Given $n$ objects, let $T$ be a tree of which the $n$ leaves are labeled by the objects, and let $Q_T$ be the set of quartet topologies embedded in $T$. Given a set of quartet topologies $Q$, and an integer $k$, the problem is to decide whether there is a binary tree $T$ such that $|Q \cap Q_T| > k$. In [19] it is shown that the MQC decision problem is NP-hard. For every MQC decision problem one can define an MQTC problem that has the same solution: give the quartet topologies in $Q$ cost 0 and the other ones cost 1. This way the MQC decision problem can be reduced to the MQTC decision problem, which shows also the latter to be NP-hard. Hence, it is infeasible in practice, but we can sometimes solve it, and always approximate it. (The reduction also shows that the quartet problems reviewed in [19] are subsumed by our problem.) Adapting current methods in [6] to our MQTC optimization problem results in far too computationally intensive calculations; they run many months or years on moderate-sized problems of 30 objects. Therefore, we have designed a simple, feasible, heuristic method for our problem based on randomization and hill-climbing.

First, a random tree with $2n - 2$ nodes is created, consisting of $n$ leaf nodes (with 1 connecting edge) labeled with the names of the data items, and $n - 2$ non-leaf or internal nodes labeled with the lowercase letter "n" followed by a unique integer identifier. Each internal node has exactly three connecting edges. For this tree $T$, we calculate the total cost of all embedded quartet topologies, and invert and scale this value to find $S(T)$. A tree is consistent with precisely $\frac{1}{3}$ of all quartet topologies, one for every quartet. A random tree may be consistent with about $\frac{1}{3}$ of the best quartet topologies, but because of dependencies this figure is not precise. The initial random tree is chosen as the currently best known tree, and is used as the basis for further searching. We define a simple mutation on a tree as one of the three possible transformations:

1) A leaf swap, which consists of randomly choosing two leaf nodes and swapping them.
2) A subtree swap, which consists of randomly choosing two internal nodes and swapping the subtrees rooted at those nodes.
3) A subtree transfer, whereby a randomly chosen subtree (possibly a leaf) is detached and reattached in another place, maintaining arity invariants.

Each of these simple mutations keeps the number of leaf nodes and internal nodes in the tree invariant; only the structure and placements change. Define a full mutation as a sequence of at least one but potentially many simple mutations, picked according to the following distribution. First we pick the number $k$ of simple mutations that we will perform with probability $2^{-k}$. For each such simple mutation, we choose one of the three types listed above with equal probability. Finally, for each of these simple mutations, we uniformly at random select leaves or internal nodes, as appropriate. Notice that trees which are close to the original tree (in terms of number of simple mutation steps in between) are examined often, while trees that are far away from the original tree will eventually be examined, but not very frequently. In order to search for a better tree, we simply apply a full mutation on $T$ to arrive at $T'$, and then calculate $S(T')$. If $S(T') > S(T)$, then keep $T'$ as the new best tree. Otherwise, repeat the procedure. If $S(T')$ ever reaches 1, then halt, outputting the best tree. Otherwise, run until it seems no better trees are being found in a reasonable amount of time, in which case the approximation is complete.

Fig. 3. Progress of a 60-item data set experiment over time (vertical axis: S(T); horizontal axis: total trees examined).

Note that if a tree is ever found such that $S(T) = 1$, then we can stop because we can be certain that this tree is optimal, as no tree could have a lower cost. In fact, this perfect tree result is achieved in our artificial tree reconstruction experiment (Section VII-A) reliably in a few minutes. For real-world data, $S(T)$ reaches a maximum somewhat less than 1, presumably reflecting distortion of the information in the distance matrix data by the best possible tree representation, as noted above, or indicating getting stuck in a local optimum or a search space too large to find the global optimum. On many typical problems of up to 40 objects this tree-search gives a tree with $S(T) \ge 0.9$ within half an hour. For large numbers of objects, tree scoring itself can be slow (as this takes order $n^4$ computation steps), and the space of trees is also large, so the algorithm may slow down substantially. For larger experiments, we use a C++/Ruby implementation with MPI (Message Passing Interface, a common standard used on massively parallel computers) on a cluster of workstations in parallel to find trees more rapidly. We can consider the graph mapping the achieved $S(T)$ score as a function of the number of trees examined. Progress occurs typically in a sigmoidal fashion towards a maximal value $\le 1$, Figure 3.

A. Three controlled experiments

With the natural data sets we use, one may have the preconception (or prejudice) that, say, music by Bach should be clustered together, music by Chopin should be clustered together, and so should music by rock stars. However, the preprocessed music files of a piece by Bach and a piece by Chopin, or the Beatles, may resemble one another more than two different pieces by Bach, by accident or indeed by design and copying. Thus, natural data sets may have ambiguous, conflicting, or counterintuitive outcomes. In other words, the experiments on natural data sets have the drawback of not having an objective clear "correct" answer that can function as a benchmark for assessing our experimental outcomes, but only intuitive or traditional preconceptions. We discuss three experiments that show that our program indeed does what it is supposed to do, at least in artificial situations where we know in advance what the correct answer is. Recall that the "similarity machine" we have described consists of two parts: (i) extracting a distance matrix from the data, and (ii) constructing a tree from the distance matrix using our novel quartet-based heuristic.

Testing the quartet-based tree construction: We first test whether the quartet-based tree construction heuristic is trustworthy: We generated a ternary tree $T$ with 18 leaves, using the pseudo-random number generator "rand" of the Ruby programming language, and derived a metric from it by defining the distance between two nodes as follows: Given the length of the path from $a$ to $b$, in an integer number of edges, as $L(a,b)$, let

$$d(a,b) = \frac{L(a,b) + 1}{18},$$

except when $a = b$, in which case $d(a,b) = 0$. It is easy to verify that this simple formula always gives a number between 0 and 1, and is monotonic with path length. Given only the $18 \times 18$ matrix of these normalized distances, our quartet method exactly reconstructed the original tree $T$ represented in Figure 4, with $S(T) = 1$.

Fig. 4. The randomly generated tree that our algorithm reconstructed. S(T) = 1.

Testing on artificial data: Given that the tree reconstruction method is accurate on clean consistent data, we tried whether the
full procedure works in an acceptable manner when we know what the outcome should be like. We used the "rand" pseudo-random number generator from the C programming language standard library under Linux. We randomly generated 11 separate 1-kilobyte blocks of data where each byte was equally probable and called these tags. Each tag was associated with a different lowercase letter of the alphabet. Next, we generated 22 files of 80 kilobytes each, by starting with a block of purely random bytes and applying one, two, three, or four different tags on it. Applying a tag consists of ten repetitions of picking a random location in the 80-kilobyte file, and overwriting that location with the globally consistent tag that is indicated. So, for instance, to create the file referred to in the diagram by "a," we start with 80 kilobytes of random data, then pick ten places to copy over this random data with the arbitrary 1-kilobyte sequence identified as tag a. Similarly, to create file "ab," we start with 80 kilobytes of random data, then pick ten places to put copies of tag a, then pick ten more places to put copies of tag b (perhaps overwriting some of the a tags). Because we never use more than four different tags, and therefore never place more than 40 copies of tags, we can expect that at least half of the data in each file is random and uncorrelated with the rest of the files. The rest of the file is correlated with other files that also contain tags in common; the more tags in common, the more related the files are. The compressor used to compute the NCD matrix was bzip2. The resulting tree is given in Figure 5; it can be seen that the clustering has occurred exactly as we would expect. The $S(T)$ score is 0.905.

Fig. 5. Classification of artificial files with repeated 1-kilobyte tags. Not all possibilities are included; for example, file "b" is missing. S(T) = 0.905.

Testing on heterogenous natural data: We test gross classification of files based on heterogenous data of markedly different file types: (i) Four mitochondrial gene sequences, from a black bear, polar bear, fox, and rat obtained from the GenBank Database on the world-wide web; (ii) Four excerpts from the novel The Zeppelin's Passenger by E. Phillips Oppenheim, obtained from the Project Gutenberg Edition on the world-wide web; (iii) Four MIDI files without further processing; two from Jimi Hendrix and two movements from Debussy's Suite Bergamasque, downloaded from various repositories on the world-wide web; (iv) Two Linux x86 ELF executables (the cp and rm commands), copied directly from the RedHat 9.0 Linux distribution; and (v) Two compiled Java class files, generated by ourselves. The compressor used to compute the NCD matrix was bzip2. As expected, the program correctly classifies each of the different types of files together with like near like. The result is reported in Figure 6 with $S(T)$ equal to the very high confidence value 0.984. This experiment shows the power and universality of the method: no features of any specific domain of application are used. We believe that there is no other method known that can cluster data that is so heterogenous this reliably. This is borne out by the massive experiments with the method in [20].

Fig. 6. Classification of different file types. Tree agrees exceptionally well with NCD distance matrix: S(T) = 0.984.

VIII. EXPERIMENTAL VALIDATION

We developed the CompLearn Toolkit, Section I, and performed experiments in vastly different application fields to test the quality and universality of the method. The success of the method as reported below depends strongly on the judicious use of encoding of the objects compared. Here one should use common sense on what a real world compressor can do. There are situations where our approach fails if applied in a straightforward way. For example: comparing text files by the same authors in different encodings (say, Unicode and 8-bit version) is bound to fail. For the ideal similarity metric based on Kolmogorov complexity as defined in [31] this does not matter at all, but for practical compressors used in the experiments it will be fatal. Similarly, in the music experiments
below we use symbolic MIDI music file format rather than wave format music files. The reason is that the strings resulting from straightforward discretizing the wave form files may be too sensitive to how we discretize. Further research may overcome this problem.

Fig. 7. The evolutionary tree built from complete mammalian mtDNA sequences of 24 species, using the NCD matrix of Figure 9. We have redrawn the tree from our output to agree better with the customary phylogeny tree format. The tree agrees exceptionally well with the NCD distance matrix: S(T) = 0.996.

Fig. 8. Multidimensional clustering of same NCD matrix (Figure 9) as used for Figure 7. Kruskal's stress-1 = 0.389.

A. Genomics and Phylogeny

In recent years, as the complete genomes of various species become available, it has become possible to do whole genome phylogeny (this overcomes the problem that using different targeted parts of the genome, or proteins, may give different trees [36]). Traditional phylogenetic methods on individual genes depended on multiple alignment of the related proteins and on the model of evolution of individual amino acids. Neither of these is practically applicable to the genome level. In absence of such models, a method which can compute the shared information between two sequences is useful because biological sequences encode information, and the occurrence of evolutionary events (such as insertions, deletions, point mutations, rearrangements, and inversions) separating two sequences sharing a common ancestor will result in the loss of their shared information. Our method (in the form of the CompLearn Toolkit) is a fully automated software tool based on such a distance to compare two genomes.

a) Mammalian Evolution: In evolutionary biology the timing and origin of the major extant placental clades (groups of organisms that have evolved from a common ancestor) continues to fuel debate and research. Here, we provide evidence by whole mitochondrial genome phylogeny for competing hypotheses in two main questions: the grouping of the Eutherian orders, and the Therian hypothesis versus the Marsupionta hypothesis.

Eutherian Orders: We demonstrate (already in [31]) that a whole mitochondrial genome phylogeny of the Eutherians (placental mammals) can be reconstructed automatically from unaligned complete mitochondrial genomes by use of an early form of our compression method, using standard software packages. As more genomic material has become available, the debate in biology has intensified concerning which two of the three main groups of placental mammals are more closely related: Primates, Ferungulates, and Rodents. In [7], the maximum likelihood method of phylogeny tree reconstruction gave evidence for the (Ferungulates, (Primates, Rodents)) grouping for half of the proteins in the mitochondrial genomes investigated, and (Rodents, (Ferungulates, Primates)) for the other halves of the mt genomes. In that experiment they aligned 12 concatenated mitochondrial proteins, taken from 20 species: rat (Rattus norvegicus), house mouse (Mus musculus), grey seal (Halichoerus grypus), harbor seal (Phoca vitulina), cat (Felis catus), white rhino (Ceratotherium simum), horse (Equus caballus), finback whale (Balaenoptera physalus), blue whale (Balaenoptera musculus), cow (Bos taurus), gibbon (Hylobates lar), gorilla (Gorilla gorilla), human (Homo sapiens), chimpanzee (Pan troglodytes), pygmy chimpanzee (Pan paniscus), orangutan (Pongo pygmaeus), Sumatran orangutan (Pongo pygmaeus abelii), using opossum (Didelphis virginiana), wallaroo (Macropus robustus), and the platypus (Ornithorhynchus anatinus) as outgroup. In [30], [31] we used the whole mitochondrial genome of the same 20 species, computing the NCD distances (or a closely related distance in [30]), using the GenCompress compressor, followed by tree reconstruction using the neighbor joining program in the MOLPHY package [38] to confirm the commonly believed morphology-supported hypothesis (Rodents, (Primates, Ferungulates)). Repeating the experiment using the hypercleaning method [6] of phylogeny tree reconstruction gave the same result. Here, we repeated this experiment several times using the CompLearn Toolkit using our new quartet method for reconstructing trees, and computing the NCD with various compressors (gzip
, bzip2, PPMZ), again always with the same result. These experiments are not reported since they are subsumed by the larger experiment of Figure 7. This is a far larger experiment than the one in [30], [31], and aimed at testing two distinct hypotheses simultaneously: the one in the latter references about the Eutherian orders, and the far more general one about the orders of the placental mammals (Eutheria, Metatheria, and Prototheria). Note also that adding the extra species from 20 to 24 is an addition that biologists are loath to do: both for computational reasons and fear of destabilizing a realistic phylogeny by adding even one more species to the computation. Furthermore, in the last mentioned references we used the special-purpose genome compressor GenCompress to determine the distance matrix, and the standard biological software MOLPHY package to reconstruct the phylogeny tree from the distance matrix. In contrast, in this paper we conduct a larger experiment than before, using just the general-purpose compressor bzip2 to obtain the distance matrix, and our new quartet tree reconstruction method to obtain the phylogeny tree; that is, our own CompLearn package [9], used without any change in all the other experiments.

Marsupionta and Theria: The extant monophyletic divisions of the class Mammalia are the Prototheria (monotremes: mammals that procreate using eggs), Metatheria (marsupials: mammals that procreate using pouches), and Eutheria (placental mammals: mammals that procreate using placentas). The sister relationship between these groups is viewed as the most fundamental question in mammalian evolution [21]. Phylogenetic comparison by either anatomy or mitochondrial genome has resulted in two conflicting hypotheses: the gene-isolation-supported Marsupionta hypothesis: ((Prototheria, Metatheria), Eutheria) versus the morphology-supported Theria hypothesis: (Prototheria, (Metatheria, Eutheria)), the third possibility apparently not being held seriously by anyone. There has been a lot of support for either hypothesis; recent support for the Theria hypothesis was given in [21] by analyzing a large nuclear gene (M6P/IG2R), viewed as important across the species concerned, and even more recent support for the Marsupionta hypothesis was given in [18] by phylogenetic analysis of another sequence from the nuclear gene (18S rRNA) and by the whole mitochondrial genome.

Experimental Evidence: To test the Eutherian orders simultaneously with the Marsupionta- versus Theria hypothesis, we added four animals to the above twenty: Australian echidna (Tachyglossus aculeatus), brown bear (Ursus arctos), polar bear (Ursus maritimus), using the common carp (Cyprinus carpio) as the outgroup. Interestingly, while there are many species of Eutheria and Metatheria, there are only three species of now living Prototheria known: platypus, and two types of echidna (or spiny anteater). So our sample of the Prototheria is large. The addition of the new species might be risky in that the addition of new relations is known to distort the previous phylogeny in traditional computational genomics practice. With our method, using the full genome and obtaining a single tree with a very high confidence $S(T)$ value, that risk is not as great as in traditional methods obtaining ambiguous trees with bootstrap (statistic support) values on the edges. The mitochondrial genomes of the total of 24 species we used were downloaded from the GenBank Database on the world-wide web. Each is around 17,000 bases. The NCD distance matrix was computed using the compressor PPMZ. The resulting phylogeny, with an almost maximal $S(T)$ score of 0.996, supports anew the currently accepted grouping (Rodents, (Primates, Ferungulates)) of the Eutherian orders, and additionally the Marsupionta hypothesis ((Prototheria, Metatheria), Eutheria), see Figure 7. Overall, our whole-mitochondrial NCD analysis supports the following hypothesis:

$$\overbrace{\underbrace{((\text{primates}, \text{ferungulates})(\text{rodents}}_{\text{Eutheria}}, (\text{Metatheria}, \text{Prototheria})))}^{\text{Mammalia}},$$

which indicates that the rodents, and the branch leading to the Metatheria and Prototheria, split off early from the branch that led to the primates and ferungulates. Inspection of the distance matrix shows that the primates are very close together, as are the rodents, the Metatheria, and the Prototheria. These are tightly-knit groups with relatively close NCD's. The ferungulates are a much looser group with generally distant NCD's. The intergroup distances show that the Prototheria are furthest away from the other groups, followe
dbytheMetatheriaandtherodents.Alsothene-structureofthetreeisconsistentwithbiologicalwisdom.HierarchicalversusFlatClustering:ThisisagoodplacetocontrasttheinformativenessofhierarchicalclusteringwithmultidimensionalclusteringusingthesameNCDmatrix,exhibitedinFigure9.TheentriesgiveagoodexampleoftypicalNCDvalues;wetruncatedthenumberofdecimalsfrom15to3signicantdigitstosavespace.Notethatthemajorityofdistancesbunchesintherange[0:9;1].Thisisduetotheregularitiesthecompressorcanperceive.Thediagonalelementsgivetheself-distance,which,forPPMZ,isnotactually0,butisofffrom0onlyinthethirddecimal.InFigure8weclusteredthe24animalsusingtheNCDmatrixbymultidimenionalscalingaspointsin2-dimensionalEuclideanspace.Inthismethod,theNCDmatrixof24animalscanbeviewedasasetofdistancesbetweenpointsinn-dimensionalEuclideanspace(n24),whichwewanttoprojectintoa2-dimensionalEuclideanspace,tryingtodistortthedistancesbetweenthepairsaslittleaspossible.Thisisakintotheproblemofprojectingthesurfaceoftheearthglobeonatwo-dimensionalmapwithminimaldistancedistortion.Themainfeatureisthechoiceofthemeasureofdistortiontobeminimized,[16].Lettheoriginalsetofdistancesbed1;:::;dkandtheprojecteddistancesbed0;:::;d0.InFigure8weusedthedistortionmeasureKruskall'sstress-1,[24],whichminimizesq(Pik(did0)2)=Pikd2i.Kruskall'sstress-1equal0meansnodistortion,andtheworstvalueisatmost1(unlessyouhaveareallybadprojection).IntheprojectionoftheNCDmatrixaccordingtoourquartetmethodoneminimizesthemoresubtledistortionS(T)measure,where1meansperfectrepresentationoftherelativerelationsbetweenevery4-tuple,and0meansminimalrepresentation.Therefore,weshouldcomparedistortionKruskallstress-1with1S(T).Figure7hasaverygood1S(T)=0:04andFigure8hasapoorKruskalstress0:389.Assumingthatthecomparisonissignicantforsmallvalues(closetoperfectprojection),wendthatthemultidimensionalscalingofthisexperiment'sNCDmatrixisformallyinferiortothatofthequartettree.Thisconclusionformallyjustiestheimpressionconveyedbytheguresonvisualinspection. 
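The stress-1 comparison can be made concrete with a short sketch. The following Python fragment is only an illustration of the formula $\sqrt{\sum_i (d_i - d'_i)^2 / \sum_i d_i^2}$ as stated in the text; the function names `kruskal_stress1` and `upper_triangle` are hypothetical helpers, not part of any package used in the paper, and the projected distances in the usage part are made-up values chosen for illustration.

```python
import math
from typing import List, Sequence


def kruskal_stress1(original: Sequence[float], projected: Sequence[float]) -> float:
    """Return sqrt(sum((d - d')^2) / sum(d^2)) over paired distances."""
    if len(original) != len(projected):
        raise ValueError("distance lists must have the same length")
    numerator = sum((d - p) ** 2 for d, p in zip(original, projected))
    denominator = sum(d * d for d in original)
    if denominator == 0:
        raise ValueError("original distances must not all be zero")
    return math.sqrt(numerator / denominator)


def upper_triangle(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Flatten the strictly upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


if __name__ == "__main__":
    # Three NCD entries taken from Figure 9 (BlueWhale, FinbackWhale, Carp).
    ncd = [
        [0.005, 0.616, 0.943],
        [0.616, 0.005, 0.952],
        [0.943, 0.952, 0.006],
    ]
    # Hypothetical distances after projection to the plane (assumed values).
    projected = [
        [0.0, 0.60, 0.90],
        [0.60, 0.0, 0.99],
        [0.90, 0.99, 0.0],
    ]
    print(kruskal_stress1(upper_triangle(ncd), upper_triangle(projected)))
```

A value near 0 indicates that the planar layout preserves the original distances well; this is the quantity the text sets against 1 − S(T) for the quartet tree.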
BlueWhale BrownBear Carp Cat Chimpanzee Cow Echidna FinWhale Gibbon Gorilla GreySeal HarborSeal Horse HouseMouse Human Opossum Orangutan Platypus PolarBear PygmyChimp Rat SumOrang Wallaroo WhiteRhino
BlueWhale 0.005 0.906 0.943 0.897 0.925 0.883 0.936 0.616 0.928 0.931 0.901 0.898 0.896 0.926 0.920 0.936 0.928 0.929 0.907 0.930 0.927 0.929 0.925 0.902
BrownBear 0.906 0.002 0.943 0.887 0.935 0.906 0.944 0.915 0.939 0.940 0.875 0.872 0.910 0.934 0.930 0.936 0.938 0.937 0.269 0.940 0.935 0.936 0.923 0.915
Carp 0.943 0.943 0.006 0.946 0.954 0.947 0.955 0.952 0.951 0.957 0.949 0.950 0.952 0.956 0.946 0.956 0.953 0.954 0.945 0.960 0.950 0.953 0.942 0.960
Cat 0.897 0.887 0.946 0.003 0.926 0.897 0.942 0.905 0.928 0.931 0.870 0.872 0.885 0.919 0.922 0.933 0.932 0.931 0.885 0.929 0.920 0.934 0.919 0.897
Chimpanzee 0.925 0.935 0.954 0.926 0.006 0.926 0.948 0.926 0.849 0.731 0.925 0.922 0.921 0.943 0.667 0.943 0.841 0.946 0.931 0.441 0.933 0.835 0.934 0.930
Cow 0.883 0.906 0.947 0.897 0.926 0.006 0.936 0.885 0.931 0.927 0.890 0.888 0.893 0.925 0.920 0.931 0.930 0.929 0.905 0.931 0.921 0.930 0.923 0.899
Echidna 0.936 0.944 0.955 0.942 0.948 0.936 0.005 0.936 0.947 0.947 0.940 0.937 0.942 0.941 0.939 0.936 0.947 0.855 0.935 0.949 0.941 0.947 0.929 0.948
FinbackWhale 0.616 0.915 0.952 0.905 0.926 0.885 0.936 0.005 0.930 0.931 0.911 0.908 0.901 0.933 0.922 0.936 0.933 0.934 0.910 0.932 0.928 0.932 0.927 0.902
Gibbon 0.928 0.939 0.951 0.928 0.849 0.931 0.947 0.930 0.005 0.859 0.932 0.930 0.927 0.948 0.844 0.951 0.872 0.952 0.936 0.854 0.939 0.868 0.933 0.929
Gorilla 0.931 0.940 0.957 0.931 0.731 0.927 0.947 0.931 0.859 0.006 0.927 0.929 0.924 0.944 0.737 0.944 0.835 0.943 0.928 0.732 0.938 0.836 0.934 0.929
GreySeal 0.901 0.875 0.949 0.870 0.925 0.890 0.940 0.911 0.932 0.927 0.003 0.399 0.888 0.924 0.922 0.933 0.931 0.936 0.863 0.929 0.922 0.930 0.920 0.898
HarborSeal 0.898 0.872 0.950 0.872 0.922 0.888 0.937 0.908 0.930 0.929 0.399 0.004 0.888 0.922 0.922 0.933 0.932 0.937 0.860 0.930 0.922 0.928 0.919 0.900
Horse 0.896 0.910 0.952 0.885 0.921 0.893 0.942 0.901 0.927 0.924 0.888 0.888 0.003 0.928 0.913 0.937 0.923 0.936 0.903 0.923 0.912 0.924 0.924 0.848
HouseMouse 0.926 0.934 0.956 0.919 0.943 0.925 0.941 0.933 0.948 0.944 0.924 0.922 0.928 0.006 0.932 0.923 0.944 0.930 0.924 0.942 0.860 0.945 0.921 0.928
Human 0.920 0.930 0.946 0.922 0.667 0.920 0.939 0.922 0.844 0.737 0.922 0.922 0.913 0.932 0.005 0.949 0.834 0.949 0.931 0.681 0.938 0.826 0.934 0.929
Opossum 0.936 0.936 0.956 0.933 0.943 0.931 0.936 0.936 0.951 0.944 0.933 0.933 0.937 0.923 0.949 0.006 0.960 0.938 0.939 0.954 0.941 0.960 0.891 0.952
Orangutan 0.928 0.938 0.953 0.932 0.841 0.930 0.947 0.933 0.872 0.835 0.931 0.932 0.923 0.944 0.834 0.960 0.006 0.954 0.933 0.843 0.943 0.585 0.945 0.934
Platypus 0.929 0.937 0.954 0.931 0.946 0.929 0.855 0.934 0.952 0.943 0.936 0.937 0.936 0.930 0.949 0.938 0.954 0.003 0.932 0.948 0.937 0.949 0.920 0.948
PolarBear 0.907 0.269 0.945 0.885 0.931 0.905 0.935 0.910 0.936 0.928 0.863 0.860 0.903 0.924 0.931 0.939 0.933 0.932 0.002 0.942 0.940 0.936 0.927 0.917
PygmyChimp 0.930 0.940 0.960 0.929 0.441 0.931 0.949 0.932 0.854 0.732 0.929 0.930 0.923 0.942 0.681 0.954 0.843 0.948 0.942 0.007 0.935 0.838 0.931 0.929
Rat 0.927 0.935 0.950 0.920 0.933 0.921 0.941 0.928 0.939 0.938 0.922 0.922 0.912 0.860 0.938 0.941 0.943 0.937 0.940 0.935 0.006 0.939 0.922 0.922
SumOrangutan 0.929 0.936 0.953 0.934 0.835 0.930 0.947 0.932 0.868 0.836 0.930 0.928 0.924 0.945 0.826 0.960 0.585 0.949 0.936 0.838 0.939 0.007 0.942 0.937
Wallaroo 0.925 0.923 0.942 0.919 0.934 0.923 0.929 0.927 0.933 0.934 0.920 0.919 0.924 0.921 0.934 0.891 0.945 0.920 0.927 0.931 0.922 0.942 0.005 0.935
WhiteRhino 0.902 0.915 0.960 0.897 0.930 0.899 0.948 0.902 0.929 0.929 0.898 0.900 0.848 0.928 0.929 0.952 0.934 0.948 0.917 0.929 0.922 0.937 0.935 0.002

Fig. 9. Distance matrix of pairwise NCD. For display purposes, we have truncated the original entries from 15 decimals to 3 decimals precision.

b) SARS Virus: In another experiment we clustered the SARS virus after its sequenced genome was made publicly available, in relation to potentially similar viruses. The 15 virus genomes were downloaded from The Universal Virus Database of the International Committee on Taxonomy of Viruses, available on the world-wide web. The SARS virus was downloaded from Canada's Michael Smith Genome Sciences Centre which had the first public SARS Coronavirus draft whole genome assembly available for download (SARS TOR2 draft genome assembly 120403). The NCD distance matrix was computed using the compressor bzip2. The relations in Figure 10 are very similar to the definitive tree based on medical-macrobio-genomics analysis, appearing later in the New England Journal of Medicine, [25]. We depicted the figure in the ternary tree style, rather than the genomics-dendrogram style, since the former is more precise for visual inspection of proximity relations.

Fig. 10. SARS virus among other viruses. Legend: AvianAdeno1CELO.inp: Fowl adenovirus 1; AvianIB1.inp: Avian infectious bronchitis virus (strain Beaudette US); AvianIB2.inp: Avian infectious bronchitis virus (strain Beaudette CK); BovineAdeno3.inp: Bovine adenovirus 3; DuckAdeno1.inp: Duck adenovirus 1; HumanAdeno40.inp: Human adenovirus type 40; HumanCorona1.inp: Human coronavirus 229E; MeaslesMora.inp: Measles virus strain Moraten; MeaslesSch.inp: Measles virus strain Schwarz; MurineHep11.inp: Murine hepatitis virus strain ML-11; MurineHep2.inp: Murine hepatitis virus strain 2; PRD1.inp: Enterobacteria phage PRD1; RatSialCorona.inp: Rat sialodacryoadenitis coronavirus; SARS.inp: SARS TOR2v120403; SIRV1.inp: Sulfolobus virus SIRV-1; SIRV2.inp: Sulfolobus virus SIRV-2. S(T) = 0.988.

c) Analysis of Mitochondrial Genomes of Fungi: As a pilot for applications of the CompLearn Toolkit in fungi genomics research, the group of T. Boekhout, E. Kuramae, V. Robert, of the Fungal Biodiversity Center, Royal Netherlands Academy of Sciences, supplied us with the mitochondrial genomes of Candida glabrata, Pichia canadensis, Saccharomyces cerevisiae, S. castellii, S. servazzii, Yarrowia lipolytica (all yeasts), and two filamentous ascomycetes Hypocrea jecorina and Verticillium lecanii. The NCD distance matrix was computed using the compressor PPMZ. The resulting tree is depicted in Figure 11. The interpretation of the fungi researchers is “the tree clearly clustered the ascomycetous yeasts versus the two filamentous Ascomycetes, thus supporting the current hypothesis on their classification (for example, see [26]). Interestingly, in a recent treatment of the Saccharomycetaceae, S. servazzii, S. castellii and C. glabrata were all proposed to belong to genera different from Saccharomyces, and this is supported by the topology of our tree as well ([27]).”

To compare the veracity of the NCD clustering with a more feature-based clustering, we also calculated the pairwise distances as follows: Each file is converted to a 4096-dimensional vector by considering the frequency of all (overlapping) 6-byte contiguous blocks. The $\ell_2$-distance (Euclidean distance) is calculated between each pair of files by taking the square root of the sum of the squares of the component-wise differences. These distances are arranged into a distance matrix and linearly scaled to fit the range [0, 1.0]. Finally, we ran the clustering routine on this distance matrix. The results are in Figure 12. As seen by comparing with the NCD-based Figure 11, there are apparent misplacements when using the Euclidean distance in this way. Thus, in this simple experiment, the NCD performed better, that is, agreed more precisely with accepted biological knowledge.

Fig. 11. Dendrogram of mitochondrial genomes of fungi using NCD. This represents the distance matrix precisely with S(T) = 0.999.

Fig. 12. Dendrogram of mitochondrial genomes of fungi using block frequencies. This represents the distance matrix precisely with S(T) = 0.999.

B. Language Trees

Our method improves the results of [1], using a linguistic corpus of “The Universal Declaration of Human Rights (UDoHR)” [35] in 52 languages. Previously, [1] used an asymmetric measure based on relative entropy, and the full matrix of the pairwise distances between all 52 languages, to build a language classification tree. This experiment was repeated (resulting in a somewhat better tree) using the compression method in [31] using standard biological software packages to construct the phylogeny. We have redone this experiment, and done new experiments, using the CompLearn Toolkit. Here, we report on an experiment to separate radically different language families. We downloaded the language versions of the UDoHR text in English, Spanish, Dutch, German (Native-European), Bemba, Dendi, Ndebele, Kicongo, Somali, Rundi, Ditammari, Dagaare (Native-African), Chickasaw, Purhepecha, Mazahua, Zapoteco (Native-American), and didn't preprocess them except for removing initial identifying information. We used a Lempel-Ziv-type compressor gzip to compress text sequences of sizes not exceeding the length of the sliding window gzip uses (32 kilobytes), and computed the NCD for each pair of language sequences. Subsequently we clustered the result. We show the outcome of one of the experiments in Figure 13. Note that three groups are correctly clustered, and that even the subclusters of the European languages are correct (English is grouped with the Romance languages because it contains up to 40% admixture of words of Latin origin).

Fig. 13. Clustering of Native-American, Native-African, and Native-European languages. S(T) = 0.928.

C. Literature

The texts used in this experiment were downloaded from the world-wide web in original Cyrillic-lettered Russian and in Latin-lettered English by L. Afanasiev (Moldavian MSc student at the University of Amsterdam). The compressor used to compute the NCD matrix was bzip2. We clustered Russian literature in the original (Cyrillic) by Gogol, Dostoyevsky, Tolstoy, Bulgakov, Turgenev, with three or four different texts per author. Our purpose was to see whether the clustering is sensitive enough, and the authors distinctive enough, to result in clustering by author. In Figure 14 we see a perfect clustering. Considering the English translations of the same texts, in Figure 15, we see errors in the clustering. Inspection shows that the clustering is now partially based on the translator. It appears that the translator superimposes his characteristics on the texts, partially suppressing the characteristics of the original authors. In other experiments, not reported here, we separated authors by gender and by period.

Fig. 14. Clustering of Russian writers. Legend: I.S. Turgenev, 1818–1883 [Fathers and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky 1821–1881 [Crime and Punishment, The Gambler, The Idiot; Poor Folk]; L.N. Tolstoy 1828–1910 [Anna Karenina, The Cossacks, Youth, War and Peace]; N.V. Gogol 1809–1852 [Dead Souls, Taras Bulba, The Mysterious Portrait, How the Two Ivans Quarrelled]; M. Bulgakov 1891–1940 [The Master and Margarita, The Fateful Eggs, The Heart of a Dog]. S(T) = 0.949.

Fig. 15. Clustering of Russian writers translated in English. The translator is given in brackets after the titles of the texts. Legend: I.S. Turgenev, 1818–1883 [Fathers and Sons (R. Hare), Rudin (Garnett, C. Black), On the Eve (Garnett, C. Black), A House of Gentlefolk (Garnett, C. Black)]; F. Dostoyevsky 1821–1881 [Crime and Punishment (Garnett, C. Black), The Gambler (C.J. Hogarth), The Idiot (E. Martin); Poor Folk (C.J. Hogarth)]; L.N. Tolstoy 1828–1910 [Anna Karenina (Garnett, C. Black), The Cossacks (L. and M. Aylmer), Youth (C.J. Hogarth), War and Peace (L. and M. Aylmer)]; N.V. Gogol 1809–1852 [Dead Souls (C.J. Hogarth), Taras Bulba (G. Tolstoy, 1860, B.C. Baskerville), The Mysterious Portrait + How the Two Ivans Quarrelled (I.F. Hapgood)]; M. Bulgakov 1891–1940 [The Master and Margarita (R. Pevear, L. Volokhonsky), The Fateful Eggs (K. Gook-Horujy), The Heart of a Dog (M. Glenny)]. S(T) = 0.953.

D. Music

The amount of digitized music available on the internet has grown dramatically in recent years, both in the public domain and on commercial sites. Napster and its clones are prime examples. Websites offering musical content in some form or other (MP3, MIDI, ...) need a way to organize their wealth of material; they need to somehow classify their files according to musical genres and subgenres, putting similar pieces together. The purpose of such organization is to enable users to navigate to pieces of music they already know and like, but also to give them advice and recommendations (“If you like this, you might also like...”). Currently, such organization is mostly done manually by humans, but some recent research has been looking into the possibilities of automating music classification.

Initially, we downloaded 36 separate MIDI (Musical Instrument Digital Interface, a versatile digital music format available on the world-wide web) files selected from a range of classical composers, as well as some popular music. The files were downloaded from several different MIDI Databases on the world-wide web. The identifying information, composer, title, and so on, was stripped from the files (otherwise this may give a marginal advantage to identify composers to the compressor). Each of these files was run through a preprocessor to extract just MIDI Note-On and Note-Off events. These events were then converted to a player-piano style representation, with time quantized in 0.05 second intervals. All instrument indicators, MIDI control signals, and tempo variations were ignored. For each track in the MIDI file, we calculate two quantities: an average volume and a modal note. The average volume is calculated by averaging the volume (MIDI note velocity) of all notes in the track. The modal note is defined to be the note pitch that sounds most often in that track. If this is not unique, then the lowest such note is chosen. The modal note is used as a key-invariant reference point from which to represent all notes. It is denoted by 0, higher notes are denoted by positive numbers, and lower notes are denoted by negative numbers. A value of 1 indicates a half-step above the modal note, and a value of −2 indicates a whole-step below the modal note. The tracks are sorted according to decreasing average volume, and then output in succession. For each track, we iterate through each time sample in order, outputting a single signed 8-bit value for each currently sounding note. Two special values are reserved to represent the end of a time step and the end of a track. This file is then used as input to the compression stage for distance matrix calculation and subsequent tree search. To check whether any important feature of the music was lost during preprocessing, we played it back from the preprocessed files to compare it with the original. To the authors the pieces sounded almost unchanged. The compressor used to compute the NCD matrix of the genres tree, Figure 16, and that of the 12-piece music set, Figure 17, is bzip2. For the full range of the music experiments see [10].

Fig. 16. Output for the 36 pieces from 3 music-genres. Legend: 12 Jazz: John Coltrane [Blue Train, Giant Steps, Lazy Bird, Impressions]; Miles Davis [Milestones, Seven Steps to Heaven, Solar, So What]; George Gershwin [Summertime]; Dizzy Gillespie [Night in Tunisia]; Thelonious Monk [Round Midnight]; Charlie Parker [Yardbird Suite]; 12 Rock & Pop: The Beatles [Eleanor Rigby, Michelle]; Eric Clapton [Cocaine, Layla]; Dire Straits [Money for Nothing]; Led Zeppelin [Stairway to Heaven]; Metallica [One]; Jimi Hendrix [Hey Joe, Voodoo Chile]; The Police [Every Breath You Take, Message in a Bottle]; Rush [Yyz]; 12 Classic: see Legend Figure 17. S(T) = 0.858.

Before testing whether our program can see the distinctions
between various classical composers, we first show that it can distinguish between three broader musical genres: classical music, rock, and jazz. This may be easier than making distinctions “within” classical music. All musical pieces we used are listed in the tables in the full paper [10]. For the genre-experiment we used 12 classical pieces consisting of Bach, Chopin, and Debussy, 12 jazz pieces, and 12 rock pieces. The tree (Figure 16) that our program came up with has S(T) = 0.858. The discrimination between the 3 genres is reasonable but not perfect. Since S(T) = 0.858, a fairly low value, the resulting tree doesn't represent the NCD distance matrix very well. Presumably, the information in the NCD distance matrix cannot be represented by a dendrogram of high S(T) score. This appears to be a common problem with large (> 25 or so) natural data sets. Another reason may be that the program terminated while trapped in a local optimum. We repeated the experiment many times with almost the same results, so that doesn't appear to be the case. The 11-item subtree rooted at n4 contains 10 of the 12 jazz pieces, together with a piece of Bach's “Wohltemperierte Klavier (WTK)”. The other two jazz pieces, Miles Davis' “So What,” and John Coltrane's “Giant Steps” are placed elsewhere in the tree, perhaps according to some kinship that now escapes us (but may be identified by closer study of the objects concerned). Of the 12 rock pieces, 10 are placed in the 12-item subtree rooted at n29, together with a piece of Bach's “WTK,” and Coltrane's “Giant Steps,” while Hendrix's “Voodoo Chile” and Rush's “Yyz” are further away. Of the 12 classical pieces, 10 are in the 13-item subtrees rooted at the branch n8, n13, n6, n7, together with Hendrix's “Voodoo Chile,” Rush's “Yyz,” and Miles Davis' “So What.” Surprisingly, 2 of the 4 Bach “WTK” pieces are placed elsewhere. Yet we perceive the 4 Bach pieces to be very close, both structurally and melodically (as they all come from the mono-thematic “Wohltemperierte Klavier”). But the program finds a reason that at this point is hidden from us. In fact, running this experiment with different compressors and termination conditions consistently displayed this anomaly.

The small set encompasses the 4 movements from Debussy's “Suite Bergamasque,” 4 movements of book 2 of Bach's “Wohltemperierte Klavier,” and 4 preludes from Chopin's “Opus 28.” As one can see in Figure 17, our program does a pretty good job at clustering these pieces. The S(T) score is also high: 0.968. The 4 Debussy movements form one cluster, as do the 4 Bach pieces. The only imperfection in the tree, judged by what one would intuitively expect, is that Chopin's Prélude no. 15 lies a bit closer to Bach than to the other 3 Chopin pieces. This Prélude no. 15, in fact, consistently forms an odd-one-out in our other experiments as well. This is an example of pure data mining, since there is some musical truth to this, as no. 15 is perceived as by far the most eccentric among the 24 Préludes of Chopin's opus 28.

Fig. 17. Output for the 12-piece set. Legend: J.S. Bach [Wohltemperierte Klavier II: Preludes and Fugues 1, 2 – BachWTK2{F,P}{1,2}]; Chopin [Préludes op. 28: 1, 15, 22, 24 – ChopPrel{1,15,22,24}]; Debussy [Suite Bergamasque, 4 movements – DebusBerg{1,2,3,4}]. S(T) = 0.968.

E. Optical Character Recognition

Can we also cluster two-dimensional images? Because our method appears focussed on strings this is not straightforward. It turns out that scanning a picture in raster row-major order retains enough regularity in both dimensions for the compressor to grasp. A simple task along these lines is to cluster handwritten characters. The handwritten characters in Figure 18 were downloaded from the NIST Special Data Base 19 (optical character recognition database) on the world-wide web. Each file in the data directory contains 1 digit image, either a four, five, or six. Each pixel is a single character; '#' for a black pixel, '.' for white. Newlines are added at the end of each line. Each character is 128x128 pixels. The NCD matrix was computed using the compressor PPMZ. Figure 19 shows the clusters obtained. There are 10 of each digit “4,” “5,” “6,” making a total of 30 items in this experiment. All but one of the 4's are put in the subtree rooted at n1, all but one of the 5's are put in the subtree rooted at n4, and all 6's are put in the subtree rooted at n3. The remaining 4 and 5 are in the branch n23, n13 joining n6 and n3. So 28 items out of 30 are clustered correctly, that is, 93%.

In this experiment we used only 3 digits. Using the full set of decimal digits means that too many objects are involved, resulting in a lower clustering accuracy. However, we can use the NCD as an oblivious feature-extraction technique to convert generic objects into finite-dimensional vectors. We have used this technique to train a support vector machine (SVM) based OCR system to classify handwritten digits by extracting 80 distinct, ordered NCD features from each input image. In this initial stage of ongoing research, by our oblivious method of computing the NCDs to use in the SVM classifier, we achieved a handwritten single decimal digit recognition accuracy of 87%. The current state-of-the-art for this problem, after half a century of interactive feature-driven classification research, is in the upper ninety percent range [34], [40]. All experiments are benchmarked on the standard NIST Special Data Base 19. Using the NCD for general classification by compression is the subject of a future paper.

Fig. 18. Images of handwritten digits used for OCR.

Fig. 19. Clustering of the OCR images. S(T) = 0.901.

F. Astronomy

As a proof of principle we clustered data from unknown objects, for example objects that are far away. In [3] observations of the microquasar GRS 1915+105 made with the Rossi X-ray Timing Explorer were analyzed. The interest in this microquasar stems from the fact that it was the first Galactic object to show a certain behavior (superluminal expansion in radio observations). Photometric observation data from X-ray telescopes were divided into short time segments (usually in the order of one minute), and these segments have been classified into a bewildering array of fifteen different modes after considerable effort. Briefly, spectrum hardness ratios (roughly, “color”) and photon count sequences were used to classify a given interval into categories of variability modes. From this analysis, the extremely complex variability of this source was reduced to transitions between three basic states, which, interpreted in astronomical terms, gives rise to an explanation of this peculiar source in standard black-hole theory. The data we used in this experiment were made available to us by M. Klein Wolt (co-author of the above paper) and T. Maccarone, both researchers at the Astronomical Institute “Anton Pannekoek” of the University of Amsterdam. The observations are essentially time series, and our aim was experimenting with our method as a pilot to more extensive joint research. Here the task was to see whether the clustering would agree with the classification above. The NCD matrix was computed using the compressor PPMZ. The results are in Figure 20. We clustered 16 objects, consisting of four intervals from each of four different categories denoted as δ, γ, φ, θ in Table 1 of [3]. In Figure 20 we denote the categories by the corresponding Roman letters D, G, P, and T, respectively. The resulting tree groups these different modes together in a way that is consistent with the classification by experts for these observations. The oblivious compression clustering corresponds precisely with the laborious feature-driven classification in [3]. Further work on clustering of (possibly heterogeneous) time series and anomaly detection, using the new compression method, was recently done on a massive scale in [20].

Fig. 20. 16 observation intervals of GRS 1915+105 from four classes. The initial capital letter indicates the class corresponding to Greek lower case letters in [3]. The remaining letters and digits identify the particular observation interval in terms of finer features and identity. The T-cluster is top left, the P-cluster is bottom left, the G-cluster is to the right, and the D-cluster in the middle. This tree almost exactly represents the underlying NCD distance matrix: S(T) = 0.994.
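Each experiment in this section starts from a matrix of pairwise NCD values obtained from compressed file lengths, singly and in pairwise concatenation. The following Python fragment is a minimal sketch of that step, assuming the usual form NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}, where C denotes compressed length. It uses Python's standard `bz2` module as a stand-in for the bzip2 program; the function names `compressed_len`, `ncd`, and `ncd_matrix` are hypothetical and are not the CompLearn Toolkit interface.

```python
import bz2
from typing import List, Sequence


def compressed_len(data: bytes) -> int:
    """Length in bytes of the bzip2-compressed version of data."""
    return len(bz2.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(items: Sequence[bytes]) -> List[List[float]]:
    """Full pairwise NCD matrix; the diagonal holds the self-distances."""
    n = len(items)
    return [[ncd(items[i], items[j]) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Toy inputs (assumed for illustration): two related strings, one unrelated.
    a = b"ACGTACGTTTGACCA" * 200
    b = b"ACGTACGTTTGACCG" * 200
    c = bytes(range(256)) * 12
    for row in ncd_matrix([a, b, c]):
        print(" ".join(f"{value:.3f}" for value in row))
```

As with the PPMZ values in Figure 9, the self-distances produced by a real compressor are small but generally not exactly 0.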
IX. CONCLUSION

To interpret what the NCD is doing, and to explain its remarkable accuracy and robustness across application fields and compressors, the intuition is that the NCD minorizes all similarity metrics based on features that are captured by the reference compressor involved. Such features must be relatively simple in the sense that they are expressed by an aspect that the compressor analyzes (for example frequencies, matches, repeats). Certain sophisticated features may well be expressible as combinations of such simple features, and are therefore themselves simple features in this sense. The extensive experimenting above shows that even elusive features are captured.

A potential application of our non-feature (or rather, many-unknown-feature) approach is exploratory. Presented with data for which the features are as yet unknown, certain dominant features governing similarity are automatically discovered by the NCD. Examining the data underlying the clusters may yield this hitherto unknown dominant feature.

Our experiments indicate that the NCD has application in two new areas of support vector machine (SVM) based learning. Firstly, we find that the inverted NCD (1 − NCD) is useful as a kernel for generic objects in SVM learning. Secondly, we can use the normal NCD as a feature-extraction technique to convert generic objects into finite-dimensional vectors, see the last paragraph of Section VIII-E. In effect our similarity engine aims at the ideal of a perfect data mining process, discovering unknown features in which the data can be similar. This is the subject of ongoing joint research in genomics of fungi, clinical molecular genetics, and radio-astronomy.

ACKNOWLEDGEMENT

We thank Loredana Afanasiev, Graduate School of Logic, University of Amsterdam; Teun Boekhout, Eiko Kuramae, Vincent Robert, Fungal Biodiversity Center, Royal Netherlands Academy of Sciences; Marc Klein Wolt, Thomas Maccarone, Astronomical Institute “Anton Pannekoek”, University of Amsterdam; Evgeny Verbitskiy, Philips Research; Steven de Rooij, Ronald de Wolf, CWI; the referees and the editors, for suggestions, comments, help with experiments, and data; Jorma Rissanen and Boris Ryabko for discussions, John Langford for suggestions, Tzu-Kuo Huang for pointing out some typos and simplifications, and Teemu Roos and Henri Tirry for implementing a visualization of the clustering process.

REFERENCES

[1] D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping, Physical Review Letters, 88:4(2002) 048702.
[2] Ph. Ball. Algorithm makes tongue tree, Nature, 22 January, 2002.
[3] T. Belloni, M. Klein-Wolt, M. Méndez, M. van der Klis, J. van Paradijs, A model-independent analysis of the variability of GRS 1915+105, Astronomy and Astrophysics, 355(2000), 271–290.
[4] C.H. Bennett, P. Gács, M. Li, P.M.B. Vitányi, and W. Zurek. Information Distance, IEEE Transactions on Information Theory, 44:4(1998), 1407–1423.
[5] C.H. Bennett, M. Li, B. Ma, Chain letters and evolutionary histories, Scientific American, June 2003, 76–81.
[6] D. Bryant, V. Berry, P. Kearney, M. Li, T. Jiang, T. Wareham and H. Zhang. A practical algorithm for recovering the best supported edges of an evolutionary tree. Proc. 11th ACM-SIAM Symposium on Discrete Algorithms, January 9–11, 2000, San Francisco, California, USA, 287–296, 2000.
[7] Y. Cao, A. Janke, P.J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Pääbo, M. Hasegawa, Conflict among individual mitochondrial proteins in resolving the phylogeny of Eutherian orders, J. Mol. Evol., 47(1998), 307–322.
[8] X. Chen, B. Francia, M. Li, B. McKinnon, A. Seker, Shared information and program plagiarism detection, IEEE Trans. Inform. Th., 50:7(2004), 1545–1551.
[9] R. Cilibrasi, The CompLearn Toolkit, 2003, http://complearn.sourceforge.net/.
[10] R. Cilibrasi, P.M.B. Vitányi, R. de Wolf, Algorithmic clustering of music, Computer Music Journal, To appear. http://xxx.lanl.gov/abs/cs.SD/0303025
[11] G. Cormode, M. Paterson, S. Sahinalp, and U. Vishkin. Communication complexity of document exchange. In Proc. 11th ACM–SIAM Symp. on Discrete Algorithms, 2000, 197–206.
[12] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley & Sons, 1991.
[13] W. Chai and B. Vercoe. Folk music classification using hidden Markov models. Proc. of International Conference on Artificial Intelligence, 2001.
[14] M. Cooper and J. Foote. Automatic music summarization via similarity analysis, Proc. IRCAM, 2002.
[15] R. Dannenberg, B. Thom, and D. Watson. A machine learning approach to musical style recognition, Proc. International Computer Music Conference, pp. 344–347, 1997.
[16] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd Edition, Wiley Interscience, 2001.
[17] M. Grimaldi, A. Kokaram, and P. Cunningham. Classifying music by genre using the wavelet packet transform and a round-robin ensemble. Technical report TCD-CS-2002-64, Trinity College Dublin, 2002. http://www.cs.tcd.ie/publications/tech-reports/reports.02/TCD-CS-2002-64.pdf
[18] A. Janke, O. Magnell, G. Wieczorek, M. Westerman, U. Arnason, Phylogenetic analysis of 18S rRNA and the mitochondrial genomes of wombat, Vombatus ursinus, and the spiny anteater, Tachyglossus aculeatus: increased support for the Marsupionta hypothesis, J. Mol. Evol., 1:54(2002), 71–80.
[19] T. Jiang, P. Kearney, and M. Li. A Polynomial Time Approximation Scheme for Inferring Evolutionary Trees from Quartet Topologies and its Application. SIAM J. Computing, 30:6(2001), 1942–1961.
[20] E. Keogh, S. Lonardi, and C.A. Ratanamahatana, Toward parameter-free data mining, In: Proc. 10th ACM SIGKDD Intn'l Conf. Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22–25, 2004, 206–215.
[21] J.K. Killian, T.R. Buckley, N. Steward, B.L. Munday, R.L. Jirtle, Marsupials and Eutherians reunited: genetic evidence for the Theria hypothesis of mammalian evolution, Mammalian Genome, 12(2001), 513–517.
[22] M. Koppel, S. Argamon, A.R. Shimoni, Automatically categorizing written texts by author gender, Literary and Linguistic Computing, To appear.
[23] A. Kraskov, H. Stögbauer, R.G. Andrzejak, P. Grassberger, Hierarchical clustering based on mutual information, 2003, http://arxiv.org/abs/q-bio/0311039
[24] J.B. Kruskal, Nonmetric multidimensional scaling: a numerical method, Psychometrika, 29(1964), 115–129.
[25] T.G. Ksiazek, et al., A Novel Coronavirus Associated with Severe Acute Respiratory Syndrome, New England J. Medicine, Published at www.nejm.org April 10, 2003 (10.1056/NEJMoa030781).
[26] C.P. Kurtzman, J. Sugiyama, Ascomycetous yeasts and yeast-like taxa. In: The mycota VII, Systematics and evolution, part A, pp. 179–200, Springer-Verlag, Berlin, 2001.
[27] C.P. Kurtzman, Phylogenetic circumscription of Saccharomyces, Kluyveromyces and other members of the Saccharomycetaceae, and the proposal of the new genera Lachancea, Nakaseomyces, Naumovia, Vanderwaltozyma and Zygotorulaspora, FEMS Yeast Res., 4(2003), 233–245.
[28] P.S. Laplace, A philosophical essay on probabilities, 1819. English translation, Dover, 1951.
[29] M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149–154.
[30] M. Li and P.M.B. Vitányi. Algorithmic Complexity, pp. 376–382 in: International Encyclopedia of the Social & Behavioral Sciences, N.J. Smelser and P.B. Baltes, Eds., Pergamon, Oxford, 2001/2002.
[31] M. Li, X. Chen, X. Li, B. Ma, P.M.B. Vitányi. The similarity metric, IEEE Trans. Inform. Th., 50:12(2004), 3250–3264.
[32] M. Li and P.M.B. Vitányi. An Introduction to Kolmogorov Complexity and its Applications, Springer-Verlag, New York, 2nd Edition, 1997.
[33] A. Londei, V. Loreto, M.O. Belardinelli, Music style and authorship categorization by informative compressors, Proc. 5th Triannual Conference of the European Society for the Cognitive Sciences of Music (ESCOM), September 8–13, 2003, Hannover, Germany, pp. 200–203.
[34] L.S. Oliveira, R. Sabourin, F. Bortolozzi, C.Y. Suen, Automatic recognition of handwritten numerical strings: A recognition and verification strategy, IEEE Trans. Pattern Analysis and Machine Intelligence, 24:11(2002), 1438–1454.
[35] United Nations General Assembly resolution 217 A (III) of 10 December 1948: Universal Declaration of Human Rights, http://www.un.org/Overview/rights.html
[36] A. Rokas, B.L. Williams, N. King, S.B. Carroll, Genome-scale approaches to resolving incongruence in molecular phylogenies, Nature, 425(2003), 798–804 (25 October 2003).
[37] D. Salomon, Data Compression, Springer-Verlag, New York, 1997.
[38] N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol. Biol. Evol., 4(1987), 406–425.
[39] P. Scott. Music classification using neural networks, 2001. http://www.stanford.edu/class/ee373a/musicclassification.pdf
[40] Ø.D. Trier, A.K. Jain, T. Taxt, Feature extraction methods for character recognition – A survey, Pattern Recognition, 29:4(1996), 641–662.
[41] P.N. Yianilos, Normalized forms f
ortwocommonmetrics,NECResearchInstitute,Report91-082-9027-1,1991,Revision7/7/2002.http://www.pnylab.com/pny/[42]A.C.-C.Yang,C.-K.Peng,H.-W.Yien,A.L.Goldberger,Informationcategorizationapproachtoliteraryauthorshipdisputes,PhysicaA,329(2003),473-483.[43]G.TzanetakisandP.Cook,Musicgenreclassi®cationofaudiosignals,IEEETransactionsonSpeechandAudioProcessing,10(5):293–302,2002.[44]S.Wehner,Analyzingnetworktraf®candwormsusingcompression,Manuscript,CWI,2004.Partiallyavailableathttp://homepages.cwi.nl/wehner/worms/X.BIOGRAPHIESOFTHEAUTHORSRUDICILIBRASIreceivedhisB.S.withhonorsfromtheCali-forniaInstituteofTechnologyin1996.Hehasprogrammedcom-putersforovertwodecades,bothinacademia,andindustrywithvariouscompaniesinSiliconValley,includingMicrosoft,indi-verseareassuchasmachinelearning,datacompression,processcontrol,VLSIdesign,computergraphics,computersecurity,andnetworkingprotocols,andisnowaPhDstudentattheCentreforMathematicsandComputerScience(CWI)andtheUniversityofAmsterdamintheNetherlands.Hehelpedcreatethe®rstpub-liclydownloadableNormalizedCompressionDistancesoftware,andismaintaininghttp://complearn.sourceforge.net/now.Homepage:http://www.cwi.nl/cilibrar/PAULM.B.VITÂANYIisaFellowoftheCentreforMathematicsandComputerScience(CWI)inAmsterdamandisProfessorofComputerScienceattheUniversityofAmsterdam.HeservesontheeditorialboardsofDistributedComputing(until2003),Infor-mationProcessingLetters,TheoryofComputingSystems,ParallelProcessingLetters,InternationaljournalofFoundationsofComputerScience,JournalofComputerandSystemsSciences(guesteditor),andelsewhere.Hehasworkedoncellularautomata,computationalcomplexity,distributedandparallelcomputing,machinelearningandprediction,physicsofcomputation,Kolmogorovcomplexity,quantumcomputing.TogetherwithMingLitheypioneeredappli-cationsofKolmogorovcomplexityandco-authoredªAnIntroduc-tiontoKolmogorovComplexityanditsApplications,ºSpringer-Verlag,NewYork,1993(2ndEdition1997),partsofwhichhavebeentranslatedintoChinese,RussianandJapanese.Homepage:http://www.cwi.
nl/paulv/