CORRECTED VERSION OF: IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO. 4, APRIL 2005, 1523-1545

Clustering by Compression

Rudi Cilibrasi and Paul M.B. Vitanyi

Abstract—We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a parameter-free, universal, similarity distance, the normalized compression distance or NCD, computed from the lengths of compressed data files (singly and in pairwise concatenation). Second, we apply a hierarchical clustering method. The NCD is not restricted to a specific application area, and works across application area boundaries. A theoretical precursor, the normalized information distance, co-developed by one of the authors, is provably optimal. However, the optimality comes at the price of using the non-computable notion of Kolmogorov complexity. We propose axioms to capture the real-world setting, and show that the NCD approximates optimality. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) by a new quartet method and a fast heuristic to implement it. The method is implemented and available as public software, and is robust under choice of different compressors. To substantiate our claims of universality and robustness, we report evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors. In genomics we presented new evidence for major questions in Mammalian evolution, based on whole-mitochondrial genomic analysis: the Eutherian orders and the Marsupionta hypothesis against the Theria hypothesis.

Index Terms—universal dissimilarity distance, normalized compression distance, hierarchical unsupervised clustering, quartet tree method, parameter-free data-mining, heterogenous data analysis, Kolmogorov complexity.

With respect to the version published in the IEEE Trans. Inform. Th., 51:4(2005), 1523-1545, we have changed Definition 2.1 of "admissible distance," making it more general, and Definitions 2.4 and 2.5 of "normalized admissible distance," and consequently adapted Lemma 2.6, (II.2), and in its proof (II.3) and the displayed inequalities. This left Theorem 6.3 unchanged except for changing "such that d(x,y) <= e" to "such that d(x,y) <= e and C(y) <= C(x)."

Manuscript received xxx, 2003; revised xxx, 2004. The experimental results in this paper were presented in part at the IEEE International Symposium on Information Theory, Yokohama, Japan, June 29 - July 4, 2003. Rudi Cilibrasi is with the Centre for Mathematics and Computer Science (CWI). Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Email: Rudi.Cilibrasi@cwi.nl. Part of his work was supported by the Netherlands BSIK/BRICKS project, and by NWO project 612.55.002. Paul Vitanyi is with the Centre for Mathematics and Computer Science (CWI), the University of Amsterdam, and National ICT of Australia. Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Email: Paul.Vitanyi@cwi.nl. Part of his work was done while the author was on sabbatical leave at the National ICT of Australia, Sydney Laboratory at UNSW. He was supported in part by the EU project RESQ, IST-2001-37559, the NoE QUIPROCONE IST-1999-29064, the ESF QiT Programme, the EU NoE PASCAL, the Netherlands BSIK/BRICKS project, and the KRR and SML&KA Programs of National ICT of Australia.

All data are created equal but some data are more alike than others. We propose a method expressing this alikeness, using a new similarity metric based on compression. It is parameter-free in that it doesn't use any features or background knowledge about the data, and can without changes be applied to different areas and across area boundaries. It is universal in that it approximates the parameter expressing similarity of the dominant feature in all pairwise comparisons. It is robust in the sense that its success appears independent from the type of compressor used. The clustering we use is hierarchical clustering in dendrograms, based on a new fast heuristic for the quartet method.

The method is available as an open-source software tool. Below we explain the method, the theory underpinning it, and present evidence for its universality and robustness by experiments and results in a plethora of different areas using different types of compressors.

Feature-Based Similarities: We are presented with unknown data and the question is to determine the similarities among them and group like with like together. Commonly, the data are of a certain type: music files, transaction records of ATM machines, credit card applications, genomic data. In these data there are hidden relations that we would like to get out in the open. For example, from genomic data one can extract letter or block frequencies (the blocks are over the four-letter alphabet); from music files one can extract various specific numerical features, related to pitch, rhythm, harmony, etc. One can extract such features using for instance Fourier transforms [43] or wavelet transforms [17], to quantify parameters expressing similarity. The resulting vectors corresponding to the various files are then classified or clustered using existing classification software, based on various standard statistical pattern recognition classifiers [43], Bayesian classifiers [15], hidden Markov models [13], ensembles of nearest-neighbor classifiers [17], or neural networks [15], [39]. For example, in music one feature would be to look for rhythm in the sense of beats per minute. One can make a histogram where each histogram bin corresponds to a particular tempo in beats-per-minute and the associated peak shows how frequent and strong that particular periodicity was over the entire piece. In [43] we see a gradual change from a few high peaks to many low and spread-out ones going from hip-hop, rock, jazz, to classical. One can use this similarity type to try to cluster pieces in these categories. However, such a method requires specific and detailed knowledge of the problem area, since one needs to know what features to look for.

Non-Feature Similarities: Our aim is to capture, in a single similarity metric, every effective distance: effective versions of Hamming distance, Euclidean distance, edit distances, alignment distance, Lempel-Ziv distance [11], and so on. This metric should be so general that it works in every domain: music, text, literature, programs, genomes, executables, natural language determination, equally and simultaneously. It would be able to simultaneously detect all similarities between pieces that other effective distances can detect separately.
Compression-based Similarity: Such a "universal" metric was co-developed by us in [29], [30], [31], as a normalized version of the "information metric" of [32], [4]. Roughly speaking, two objects are deemed close if we can significantly "compress" one given the information in the other, the idea being that if two pieces are more similar, then we can more succinctly describe one given the other. The mathematics used is based on Kolmogorov complexity theory [32]. In [31] we defined a new class of (possibly non-metric) distances, taking values in [0,1] and appropriate for measuring effective similarity relations between sequences, say one type of similarity per distance, and vice versa. It was shown that an appropriately "normalized" information distance minorizes every distance in the class. It discovers all effective similarities in the sense that if two objects are close according to some effective similarity, then they are also close according to the normalized information distance. Put differently, the normalized information distance represents similarity according to the dominating shared feature between the two objects being compared. In comparisons of more than two objects, different pairs may have different dominating features. The normalized information distance is a metric and takes values in [0,1]; hence it may be called "the" similarity metric. To apply this ideal precise mathematical theory in real life, we have to replace the use of the noncomputable Kolmogorov complexity by an approximation using a standard real-world compressor. Earlier approaches resulted in the first completely automatic construction of the phylogeny tree based on whole mitochondrial genomes [29], [30], [31], a completely automatic construction of a language tree for over 50 Euro-Asian languages [31], detection of plagiarism in student programming assignments [8], phylogeny of chain letters [5], and clustering of music [10]. Moreover, the method turns out to be robust under change of the underlying compressor-types: statistical (PPMZ), Lempel-Ziv based dictionary (gzip), block based (bzip2), or special purpose (Gencompress).

Related Work: In view of the simplicity and naturalness of our proposal, it is perhaps surprising that compression-based clustering and classification approaches did not arise before. But recently there have been several partially independent proposals in that direction: [1], [2] for author attribution and building language trees—while citing the earlier work [32], [4]—do not develop a theory based on information distance but proceed by more ad hoc arguments related to the compressibility of a target file after first compressing a reference file.

The better the target file compresses, the more we feel it is similar to the reference file in question. See also the explanation in the Appendix of [31]. This approach is used also to cluster music MIDI files by Kohonen maps in [33]. Another recent offshoot based on our work is hierarchical clustering based on mutual information [23]. In a related, but considerably simpler, feature-based approach, one can compare the word frequencies in text files to assess similarity. In [42] the word frequencies of words common to a pair of text files are used as entries in two vectors, and the similarity of the two files is based on the distance between those vectors. The authors attribute authorship to Shakespeare plays, the Federalist Papers, and the Chinese classic "The Dream of the Red Chamber." This approach based on block occurrence statistics is standard in genomics, but in an experiment reported in [31] it gives inferior phylogeny trees compared to our compression method (and wrong ones according to current biological wisdom).
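To make the vector comparison concrete, the sketch below computes a word-frequency dissimilarity between two texts; the restriction to shared words follows the description above, but the specific vector distance (cosine) and the toy inputs are assumptions of this illustration, not the procedure of [42].

    # Minimal sketch: dissimilarity from frequency vectors of shared words,
    # assuming a cosine distance; [42] may use a different vector distance.
    from collections import Counter
    from math import sqrt

    def word_freq_distance(text_a: str, text_b: str) -> float:
        """1 - cosine similarity of word-frequency vectors over shared words."""
        freq_a = Counter(text_a.lower().split())
        freq_b = Counter(text_b.lower().split())
        common = set(freq_a) & set(freq_b)
        if not common:
            return 1.0
        dot = sum(freq_a[w] * freq_b[w] for w in common)
        norm_a = sqrt(sum(freq_a[w] ** 2 for w in common))
        norm_b = sqrt(sum(freq_b[w] ** 2 for w in common))
        return 1.0 - dot / (norm_a * norm_b)

    print(word_freq_distance("the cat sat on the mat", "the dog sat on the log"))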

A related, opposite, approach was taken in [22], where literary texts are clustered by author gender or fact versus fiction, essentially by first identifying distinguishing features, like gender-dependent word usage, and then classifying according to those features. Apart from the experiments reported here, the clustering by compression method reported in this paper has recently been used to analyze network traffic and cluster computer worms and viruses [44]. Finally, recent work [20] reports experiments with our method on all time sequence data used in all the major data-mining conferences in the last decade. Comparing the compression method with all major methods used in those conferences, they established clear superiority of the compression method for clustering heterogenous data, and for anomaly detection. See also the explanation in Appendix II of [31].

Outline: Here we propose a first comprehensive theory of real-world compressor-based normalized compression distance, a novel hierarchical clustering heuristic, together with several applications. First, we propose mathematical notions of "admissible distance" (using the term for a wider class than we did in [31]), "normalized admissible distance" or "similarity distance," "normal compressor," and "normalized compression distance."

We then prove the normalized compression distance (NCD) based on a normal compressor to be a similarity distance satisfying the metric (in)equalities. The NCD is shown to be quasi-universal in the sense that it minorizes every computable similarity distance up to an error that depends on the quality of the compressor's approximation of the true Kolmogorov complexities of the files concerned. This means that the NCD captures the dominant similarity over all possible features for every pair of objects compared, up to the stated precision. Note that different pairs of objects may have different dominant shared features. Next, we present a method of hierarchical clustering based on a novel fast randomized hill-climbing heuristic for a new quartet tree optimization criterion. Given a matrix of the pairwise similarity distances between the objects, we score how well the resulting tree represents the information in the distance matrix on a scale of 0 to 1. Then, as proof of principle, we run the program on three data sets, where we know what the final answer should be: (i) reconstruct a tree from a distance matrix obtained from a randomly generated tree; (ii) reconstruct a tree from files containing artificial similarities; and (iii) reconstruct a tree from natural files of heterogenous data of vastly different types. To substantiate our claim of parameter-freeness and universality, we apply the method to different areas, not using any feature analysis at all. We first give an example in whole-genome phylogeny using the whole mitochondrial DNA of the species concerned. We compare the hierarchical clustering of our method with a more standard method of two-dimensional clustering (to show that our dendrogram method of depicting the clusters is more informative).
We give a whole-genome phylogeny of fungi and compare this to results using alignment of selected proteins (alignment being often too costly to perform on the whole-mitochondrial genome, but the disadvantage of protein selection being that different selections usually result in different phylogenies—so which is right?). We identify the virii that are closest to the sequenced SARS virus; we give an example of clustering of language families; Russian authors in the original Russian, the same pieces in English translation (clustering partially follows the translators); clustering of music in MIDI format; clustering of handwritten digits used for optical character recognition; and clustering of radio observations of a mysterious astronomical object, a microquasar of extremely complex variability. In all these cases the method performs very well in the following sense: The method yields the phylogeny of 24 species agreeing with biological wisdom insofar as it is uncontroversial. The probability that it randomly would hit this outcome, or anything reasonably close, is very small. In clustering 36 music pieces taken equally many from pop, jazz, classic, so that 12-12-12 is the grouping we understand is correct, we can identify convex clusters so that only six errors are made. (That is, if three items get dislodged, without two of them being interchanged, then six items get misplaced.) The probability that this happens by chance is extremely small. The reason why we think the method does something remarkable is concisely put by Laplace [28]: "If we seek a cause wherever we perceive symmetry, it is not that we regard the symmetrical event as less possible than the others, but, since this event ought to be the effect of a regular cause or that of chance, the first of these suppositions is more probable than the second. On a table we see letters arranged in this order C o n s t a n t i n o p l e, and we judge that this arrangement is not the result of chance, not because it is less possible than others, for if this word were not employed in any language we would not suspect it came from any particular cause, but this word being in use among us, it is incomparably more probable that some person has thus arranged the aforesaid letters than that this arrangement is due to chance."

Materials and Methods: The data samples we used were obtained from standard databases accessible on the world-wide web, generated by ourselves, or obtained from research groups in the field of investigation. We supply the details with each experiment. The method of processing the data was the same in all experiments. First, we preprocessed the data samples to bring them in appropriate format: the genomic material over the four-letter alphabet {A, C, G, T} is recoded in a four-letter alphabet; the music MIDI files are stripped of identifying information such as composer and name of the music piece. Then, in all cases the data samples were completely automatically processed by our CompLearn Toolkit, rather than, as is usual in phylogeny, by using an eclectic set of software tools per experiment. Oblivious to the problem area concerned, simply using the distances according to the NCD below, the method described in this paper fully automatically classifies the objects concerned. The method has been released in the public domain as open-source software: The CompLearn Toolkit [9] is a suite of simple utilities that one can use to apply compression techniques to the process of discovering and learning patterns in completely different domains. In fact, this method is so general that it requires no background knowledge about any particular subject area. There are no domain-specific parameters to set, and only a handful of general settings. The CompLearn Toolkit, using NCD and not, say, alignment, can cope with full genomes and other large data files and thus comes up with a single distance matrix. The clustering heuristic generates a tree with a certain confidence, called the standardized benefit score or S(T) value, in the sequel. Generating trees from the same distance matrix many times resulted in the same tree in case of high S(T) value, or a similar tree in case of moderately high S(T) value, for all distance matrices we used, even though the heuristic is randomized. That is, there is only one way to be right, but increasingly many ways to be increasingly wrong, which can all be realized by different runs of the randomized algorithm. This is a great difference with previous phylogeny methods, where because of computational limitations one uses only parts of the genome, or certain proteins that are viewed as significant [21]. These are run through a tree reconstruction method like neighbor joining [38], maximum likelihood, maximum evolution, maximum parsimony as in [21], or quartet hypercleaning [6], many times.

The percentage-wise agreements on certain branches arising are called "bootstrap values." Trees are depicted with the best bootstrap values on the branches that are viewed as supporting the theory tested. Different choices of proteins result in different best trees. One way to avoid this ambiguity is to use the full genome, [36], [31], leading to whole-genome phylogeny. With our method we do whole-genome phylogeny, and end up with a single overall best tree, not optimizing selected parts of it. The quality of the results depends on (a) the NCD distance matrix, and (b) how well the hierarchical tree represents the information in the matrix. The quality of (b) is measured by the S(T) value, and is given with each experiment. In general, the S(T) value deteriorates for large sets. We believe this to be partially an artifact of a low-resolution NCD matrix due to limited compression power and limited file size. The main reason, however, is the fact that with increasing size of a natural data set, the projection of the information in the matrix into a binary tree gets necessarily increasingly distorted. Another aspect limiting the quality of the NCD matrix is more subtle. Recall that the method knows nothing about any of the areas we apply it to. It determines the dominant feature as seen through the NCD filter. The dominant feature of alikeness between two files may not correspond to our a priori conception but may have an unexpected cause. The results of our experiments suggest that this is not often the case: In the natural data sets where we have preconceptions of the outcome, for example that works by the same authors should cluster together, or music pieces by the same composers, musical genres, or genomes, the outcomes conform largely to our expectations. For example, in the music genre experiment the method would fail dramatically if genres were evenly mixed, or mixed with little bias.
However, to the contrary, the separation in clusters is almost perfect. The few misplacements that are discernible are either errors (the method was not powerful enough to discern the dominant feature), the distortion due to mapping multidimensional distances into tree distances, or the dominant feature between a pair of music pieces is not the genre but some other aspect. The surprising news is that we can generally confirm expectations with few misplacements; indeed, the data don't contain unknown rogue features that dominate to cause spurious (in our preconceived idea) clustering. This gives evidence that where the preconception is in doubt, like with phylogeny hypotheses, the clustering can give true support of one hypothesis against another one.

Figures: We use two styles to display the hierarchical clusters. In the case of genomics of Eutherian orders and fungi, and language trees, it is convenient to follow the dendrograms that are customary in that area (suggesting temporal evolution) for easy comparison with the literature. Although there is no temporal relation intended, the dendrogram representation looked also appropriate for the Russian writers, and translations of Russian writers. In the other experiments (even the genomic SARS experiment) it is more informative to display an unrooted ternary tree (or binary tree if we think about incoming and outgoing edges) with explicit internal nodes. This facilitates identification of clusters in terms of subtrees rooted at internal nodes or contiguous sets of subtrees rooted at branches of internal nodes.

We give a precise formal meaning to the loose distance notion of "degree of similarity" used in the pattern recognition literature.

A. Distance and Metric

Let Omega be a nonempty set and R+ be the set of nonnegative real numbers. A distance function on Omega is a function D: Omega x Omega -> R+. It is a metric if it satisfies the metric (in)equalities: D(x,y) = 0 only if x = y, D(x,y) = D(y,x) (symmetry), and D(x,y) <= D(x,z) + D(z,y) (triangle inequality). The value D(x,y) is called the distance between x, y in Omega. A familiar example of a distance that is also a metric is the Euclidean metric, the everyday distance e(a,b) between two geographical objects a and b expressed in, say, meters.

Clearly, this distance satisfies the properties e(a,a) = 0, e(a,b) = e(b,a), and e(a,b) <= e(a,c) + e(c,b) (for instance, a = Amsterdam, b = Brussels, and c = Chicago). We are interested in a particular type of distance, the "similarity distance," which we formally define in Definition 2.5. For example, if the objects are classical music pieces, then the function D defined by D(a,b) = 0 if a and b are by the same composer and D(a,b) = 1 otherwise, is a similarity distance that is also a metric. This metric captures only one similarity aspect (feature) of music pieces, presumably an important one that subsumes a conglomerate of more elementary features.

B. Admissible Distance

In defining a class of admissible distances (not necessarily metric distances) we want to exclude unrealistic ones like f(x,y) = 1/2 for every pair x != y. We do this by restricting the number of objects within a given distance of an object. As in [4] we do this by only considering effective distances, as follows. Fix a suitable, and for the remainder of the paper fixed, programming language. This is the reference programming language.

Definition 2.1: Let Omega = Sigma*, with Sigma a finite nonempty alphabet and Sigma* the set of finite strings over that alphabet. Since every finite alphabet can be recoded in binary, we choose Sigma = {0,1}. In particular, "files" in computer memory are finite binary strings. A function D: Omega x Omega -> R+ is an admissible distance if for every pair of objects x, y in Omega the distance D(x,y) satisfies the density condition

  Sum_y 2^{-D(x,y)} <= 1,   (II.1)

is computable, and is symmetric, D(x,y) = D(y,x).

If D is an admissible distance, then for every x the set {D(x,y): y in {0,1}*} is the length set of a prefix code, since it satisfies (II.1), the Kraft inequality. Conversely, if a distance is the length set of a prefix code, then it satisfies (II.1), see [12].

Example 2.2: In representing the Hamming distance d between two strings of equal length n differing in positions i_1, ..., i_d, we can use a simple prefix-free encoding of (n, d, i_1, ..., i_d) in 2 log n + 4 log log n + 2 + d log n bits. We encode n and d prefix-free in log n + 2 log log n + 1 bits each, see e.g. [32], and then the literal indexes of the actual flipped-bit positions. Adding an O(1)-bit program to interpret these data, with the strings concerned being x and y, we have defined H_n(x,y) = 2 log n + 4 log log n + d log n + O(1) as the length of a prefix code word (prefix program) to compute x from y, and vice versa. Then, by the Kraft inequality, Sum_y 2^{-H_n(x,y)} <= 1. It is easy to verify that H_n is a metric in the sense that it satisfies the metric (in)equalities up to O(log n) additive precision.
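The following sketch only tallies the code length H_n(x,y) of Example 2.2 from n and the number d of differing positions; the use of ceilings and the omission of the O(1) interpreter cost are choices of this illustration, not part of the definition above.

    # Minimal sketch: length in bits of the prefix-free Hamming encoding
    # of Example 2.2, up to the O(1) term for the interpreting program.
    from math import ceil, log2

    def hamming_code_length(x: str, y: str) -> int:
        """H_n(x,y) ~ 2 log n + 4 log log n + d log n bits, with d the
        number of positions where the equal-length strings differ."""
        assert len(x) == len(y)
        n = len(x)
        d = sum(a != b for a, b in zip(x, y))
        log_n = ceil(log2(n))
        log_log_n = ceil(log2(log_n)) if log_n > 1 else 1
        return 2 * log_n + 4 * log_log_n + d * log_n  # + O(1)

    print(hamming_code_length("10110100", "10010110"))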

C. Normalized Admissible Distance

Large objects (in the sense of long strings) that differ by a tiny part are intuitively closer than tiny objects that differ by the same amount. For example, two whole mitochondrial genomes of 18,000 bases that differ by 9,000 are very different, while two whole nuclear genomes of 3 x 10^9 bases that differ by only 9,000 bases are very similar. Thus, absolute difference between two objects doesn't govern similarity, but relative difference appears to do so.

Definition 2.3: A compressor is a lossless encoder mapping Omega into {0,1}* such that the resulting code is a prefix code. "Lossless" means that there is a decompressor that reconstructs the source message from the code message. For convenience of notation we identify "compressor" with a "code word length function" C: Omega -> N, where N is the set of nonnegative integers. That is, the compressed version of a file x has length C(x). We only consider compressors such that C(x) <= |x| + O(log |x|). (The additive logarithmic term is due to our requirement that the compressed file be a prefix code word.)
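As a concrete stand-in for such a code-word length function C, one can take the length in bits of the output of an off-the-shelf compressor, as in the sketch below using Python's bz2 and zlib modules; real compressors add headers and are not literally prefix codes, so this is only an approximation of Definition 2.3.

    # Minimal sketch of a "compressor as code-word length function" C(x),
    # with standard-library compressors as approximate reference compressors.
    import bz2
    import zlib

    def C_bz2(x: bytes) -> int:
        """Compressed length of x in bits under bzip2 (block-sorting)."""
        return 8 * len(bz2.compress(x))

    def C_zlib(x: bytes) -> int:
        """Compressed length of x in bits under zlib (Lempel-Ziv family)."""
        return 8 * len(zlib.compress(x, 9))

    x = b"abracadabra" * 100
    print(C_bz2(x), C_zlib(x), 8 * len(x))  # compressed sizes vs. literal size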
We fix a compressor C, and call the fixed compressor the reference compressor.

Definition 2.4: Let D be an admissible distance. Then D+(x) is defined by D+(x) = max{D(x,z): C(z) <= C(x)}, and D+(x,y) is defined by D+(x,y) = max{D+(x), D+(y)}. Note that since D(x,y) = D(y,x), also D+(x,y) = D+(y,x).

Definition 2.5: Let D be an admissible distance. The normalized admissible distance, also called a similarity distance, d(x,y), based on D relative to a reference compressor C, is defined by

  d(x,y) = D(x,y) / D+(x,y).

It follows from the definitions that a normalized admissible distance is a function d: Omega x Omega -> [0,1] that is symmetric: d(x,y) = d(y,x).

Lemma 2.6: For every x in Omega, and constant e in [0,1], a normalized admissible distance satisfies the density constraint

  |{y: d(x,y) <= e, C(y) <= C(x)}| < 2^{eD+(x)+1}.   (II.2)

Proof: Assume to the contrary that d does not satisfy (II.2). Then, there is an e in [0,1] and an x in Omega such that (II.2) is false. We first note that, since D(x,y) is an admissible distance that satisfies (II.1), d(x,y) satisfies a "normalized" version of the Kraft inequality:

  Sum_{y: C(y) <= C(x)} 2^{-d(x,y)D+(x)} <= Sum_y 2^{-d(x,y)D+(x,y)} <= 1.   (II.3)

Starting from (II.3) we obtain the required contradiction:

  1 >= Sum_{y: C(y) <= C(x)} 2^{-d(x,y)D+(x)} >= Sum_{y: d(x,y) <= e, C(y) <= C(x)} 2^{-eD+(x)} >= 2^{eD+(x)+1} * 2^{-eD+(x)} > 1.

If d(x,y) is the normalized version of an admissible distance D(x,y), then (II.3) is equivalent to (II.1).

We call a normalized distance a "similarity" distance, because it gives a relative similarity according to the distance (with distance 0 when objects are maximally similar and distance 1 when they are maximally dissimilar) and, conversely, for every well-defined computable notion of similarity we can express it as a metric distance according to our definition. In the literature a distance that expresses lack of similarity (like ours) is often called a "dissimilarity" distance or a "disparity" distance.

Remark 2.7: As far as the authors know, the idea of a normalized metric is, surprisingly, not well-studied. An exception is [41], which investigates normalized metrics to account for relative distances rather than absolute ones, and it does so for much the same reasons as in the present work. An example there is the normalized Euclidean metric |x - y|/(|x| + |y|), where x, y are in R^n (R denotes the real numbers) and |.| is the Euclidean metric, the L2 norm. Another example is a normalized symmetric-set-difference metric. But these normalized metrics are not necessarily effective in that the distance between two objects gives the length of an effective description to go from either object to the other one.

Remark 2.8: Our definition of normalized admissible distance is more direct than in [31], and the density constraints (II.2) and (II.3) follow from the definition. In [31] we put a stricter density condition in the definition of "admissible" normalized distance, which is, however, harder to satisfy and maybe too strict to be realistic. The purpose of this stricter density condition was to obtain a stronger "universality" property than the present Theorem 6.3, namely one with alpha = 1 and epsilon = O(1/max{C(x), C(y)}). Nonetheless, both definitions coincide if we set the length of the compressed version C(x) of x to the ultimate compressed length K(x), the Kolmogorov complexity of x.

Example 2.9: To obtain a normalized version of the Hamming distance of Example 2.2, we define h_n(x,y) = H_n(x,y)/H+_n(x,y). We can set H+_n(x,y) = H+_n(x) = (n+2) log n + 4 log log n + O(1), since every contemplated compressor C will satisfy C(x) = C(x'), where x' is x with all bits flipped (so H+_n(x,y) >= H_n(z,z') for either z = x or z = y). By (II.2), for every x, the number of y with C(y) <= C(x) in the Hamming ball h_n(x,y) <= e is less than 2^{eH+_n(x)+1}. This upper bound is an obvious overestimate for e >= 1/log n. For lower values of e, the upper bound is correct by the observation that the number of y's equals Sum_{i=0}^{en} (n choose i) <= 2^{nH(e)}, where H(e) = e log(1/e) + (1-e) log(1/(1-e)) is Shannon's entropy function. Then, eH+_n(x) >= en log n >= nH(e), since e log n >= H(e).

We give axioms determining a large family of compressors that both include most (if not all) real-world compressors and ensure the desired properties of the NCD to be defined later.

Definition 3.1: A compressor C is normal if it satisfies, up to an additive O(log n) term, with n the maximal binary length of an element of Omega involved in the (in)equality concerned, the following:
1) Idempotency: C(xx) = C(x), and C(lambda) = 0, where lambda is the empty string.
2) Monotonicity: C(xy) >= C(x).
3) Symmetry: C(xy) = C(yx).
4) Distributivity: C(xy) + C(z) <= C(xz) + C(yz).

Idempotency: A reasonable compressor will see exact repetitions and obey idempotency up to the required precision. It will also compress the empty string to the empty string.

Monotonicity: A real compressor must have the monotonicity property, at least up to the required precision. The property is evident for stream-based compressors, and only slightly less evident for block-coding compressors.

Symmetry: Stream-based compressors of the Lempel-Ziv family, like gzip and pkzip, and the predictive PPM family, like PPMZ, are possibly not precisely symmetric. This is related to the stream-based property: the initial file x may have regularities to which the compressor adapts; after crossing the border to y it must unlearn those regularities and adapt to the ones of y. This process may cause some imprecision in symmetry that vanishes asymptotically with the length of x and y.
A compressor must be poor indeed (and will certainly not be used to any extent) if it doesn't satisfy symmetry up to the required precision. Apart from stream-based, the other major family of compressors is block-coding based, like bzip2. They essentially analyze the full input block by considering all rotations in obtaining the compressed version. It is to a great extent symmetrical, and real experiments show no departure from symmetry.

Distributivity: The distributivity property is not immediately intuitive. In Kolmogorov complexity theory the stronger distributivity property

  C(xyz) + C(z) <= C(xz) + C(yz)   (III.1)

holds (with C = K). However, to prove the desired properties of the NCD below, only the weaker distributivity property

  C(xy) + C(z) <= C(xz) + C(yz)   (III.2)

above is required, also for the boundary case where z = lambda. In practice, real-world compressors appear to satisfy this weaker distributivity property up to the required precision.
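The sketch below empirically probes idempotency, monotonicity, symmetry, weak distributivity, and subadditivity for one real compressor (bzip2) on a few repetitive sample strings; the differences reported are only expected to be small relative to the string lengths, in line with the additive O(log n) slack above, and the sample strings are arbitrary.

    # Minimal empirical probe of the normal-compressor axioms for bzip2;
    # the axioms only need to hold up to an additive O(log n) term.
    import bz2

    def C(s: bytes) -> int:
        return len(bz2.compress(s))  # compressed length in bytes

    x = b"the quick brown fox jumps over the lazy dog " * 200
    y = b"colorless green ideas sleep furiously " * 200
    z = b"lorem ipsum dolor sit amet " * 200

    print("idempotency   :", C(x + x) - C(x))                  # should be small
    print("monotonicity  :", C(x + y) >= C(x))                 # should be True
    print("symmetry      :", C(x + y) - C(y + x))              # should be small
    print("distributivity:", C(x + y) + C(z) <= C(x + z) + C(y + z))
    print("subadditivity :", C(x + y) <= C(x) + C(y))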

Definition 3.2: Define

  C(y|x) = C(xy) - C(x).   (III.3)

This number C(y|x) of bits of information in y, relative to x, can be viewed as the excess number of bits in the compressed version of xy compared to the compressed version of x, and is called the amount of conditional compressed information. In the definition of compressor the decompression algorithm is not included (unlike the case of Kolmogorov complexity, where the decompressing algorithm is given by definition), but it is easy to construct one: Given the compressed version of x in C(x) bits, we can run the compressor on all candidate strings z, for example in length-increasing lexicographical order, until we find a string whose compressed version equals the given one. Since this string decompresses to x, we have found x. Given the compressed version of xy in C(xy) bits, we repeat this process using strings xz until we find the string xz whose compressed version equals the compressed version of xy. Since the former compressed version decompresses to xy, we have found y. By the unique decompression property we find that C(y|x) is the extra number of bits we require to describe y apart from describing x. It is intuitively acceptable that the conditional compressed information C(x|y) satisfies the triangle inequality

  C(x|y) <= C(x|z) + C(z|y).   (III.4)

Lemma 3.3: Both (III.1) and (III.4) imply (III.2).

Proof: ((III.1) implies (III.2):) By monotonicity. ((III.4) implies (III.2):) Rewrite the terms in (III.4) according to (III.3), cancel C(y) in the left- and right-hand sides, use symmetry, and rearrange.

Lemma 3.4: A normal compressor satisfies additionally subadditivity: C(xy) <= C(x) + C(y).

Proof: Consider the special case of distributivity with z the empty word, so that xz = x, yz = y, and C(z) = 0.

Subadditivity: The subadditivity property is clearly also required for every viable compressor, since a compressor may use information acquired from x to compress y. Minor imprecision may arise from the unlearning effect of crossing the border between x and y, mentioned in relation to symmetry, but again this must vanish asymptotically with increasing length of x and y.

Technically, the Kolmogorov complexity of x given y is the length of the shortest binary program, for the reference universal prefix Turing machine, that on input y outputs x; it is denoted as K(x|y). For precise definitions, theory, and applications, see [32]. The Kolmogorov complexity of x is the length of the shortest binary program with no input that outputs x; it is denoted as K(x) = K(x|lambda), where lambda denotes the empty input. Essentially, the Kolmogorov complexity of a file is the length of the ultimate compressed version of the file. In [4] the information distance E(x,y) was introduced, defined as the length of the shortest binary program for the reference universal prefix Turing machine that, with input x computes y, and with input y computes x. It was shown there that, up to an additive logarithmic term, E(x,y) = max{K(x|y), K(y|x)}. It was shown also that E(x,y) is a metric, up to negligible violations of the metric inequalities. Moreover, it is universal in the sense that for every admissible distance D(x,y) as in Definition 2.1, E(x,y) <= D(x,y) up to an additive constant depending on D but not on x and y. In [31], the normalized version of E(x,y), called the normalized information distance (NID), is defined as

  NID(x,y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}.   (IV.1)

It too is a metric, and it is universal in the sense that this single metric minorizes, up to a negligible additive error term, all normalized admissible distances in the class considered in [31]. Thus, if two files (of whatever type) are similar (that is, close) according to the particular feature described by a particular normalized admissible distance (not necessarily a metric), then they are also similar (that is, close) in the sense of the normalized information metric. This justifies calling the latter the similarity metric. We stress once more that different pairs of objects may have different dominating features. Yet every such dominant similarity is detected by the NID. However, this metric is based on the notion of Kolmogorov complexity. Unfortunately, the Kolmogorov complexity is non-computable in the Turing sense. Approximation of the denominator of (IV.1) by a given compressor C is straightforward: it is max{C(x), C(y)}. The numerator is more tricky. It can be rewritten as

  max{K(x,y) - K(x), K(x,y) - K(y)},   (IV.2)

within logarithmic additive precision, by the additive property of Kolmogorov complexity [32]. The term K(x,y) represents the length of the shortest program for the pair (x,y). In compression practice it is easier to deal with the concatenation xy or yx. Again, within logarithmic precision, K(x,y) = K(xy) = K(yx). Following a suggestion by Steven de Rooij, one can approximate (IV.2) best by min{C(xy), C(yx)} - min{C(x), C(y)}.
Here, and in the later experiments using the CompLearn Toolkit [9], we simply use C(xy) rather than min{C(xy), C(yx)}. This is justified by the observation that block-coding based compressors are symmetric almost by definition, and experiments with various stream-based compressors (gzip, PPMZ) show only small deviations from symmetry. The result of approximating the NID using a real compressor C is called the normalized compression distance (NCD), formally defined in (VI.1). The theory as developed for the Kolmogorov-complexity based NID in [31] may not hold for the (possibly poorly) approximating NCD. It is nonetheless the case that experiments show that the NCD apparently has (some) properties that make the NID so appealing. To fill this gap between theory and practice, we develop the theory of the NCD from first principles, based on the axiomatics of Section III. We show that the NCD is a quasi-universal similarity metric relative to a normal reference compressor C. The theory developed in [31] is the boundary case C = K, where the "quasi-universality" below has become full "universality."

We define a compression distance based on a normal compressor and show it is an admissible distance. In applying the approach, we have to make do with an approximation based on a far less powerful real-world reference compressor C. A compressor C approximates the information distance E(x,y), based on Kolmogorov complexity, by the compression distance E_C(x,y) defined as

  E_C(x,y) = C(xy) - min{C(x), C(y)}.   (V.1)

Here, C(xy) denotes the compressed size of the concatenation of x and y, C(x) denotes the compressed size of x, and C(y) denotes the compressed size of y.

Lemma 5.1: If C is a normal compressor, then E_C(x,y) + O(1) is an admissible distance.

Proof: Case 1: Assume C(x) <= C(y). Then E_C(x,y) = C(xy) - C(x). Then, given x and a prefix-program of length E_C(x,y), consisting of the suffix of the C-compressed version of xy, and the compressor C in O(1) bits, we can run the compressor C on all xz's, the candidate strings z in length-increasing lexicographical order. When we find a z so that the suffix of the compressed version of xz matches the given suffix, then z = y by the unique decompression property.

Case 2: Assume C(y) < C(x). By symmetry C(xy) = C(yx); now follow the proof of Case 1 with the roles of x and y interchanged.

Lemma 5.2: If C is a normal compressor, then E_C(x,y) satisfies the metric (in)equalities up to logarithmic additive precision.

Proof: Only the triangular inequality is non-obvious. By (III.2), C(xy) + C(z) <= C(xz) + C(zy) up to logarithmic additive precision. There are six possibilities, and we verify the correctness of the triangular inequality in turn for each of them. Assume C(x) <= C(y) <= C(z): then C(xy) - C(x) <= (C(xz) - C(x)) + (C(zy) - C(y)), since C(xy) + C(y) <= C(xy) + C(z) <= C(xz) + C(zy). Assume C(y) <= C(x) <= C(z): then C(xy) - C(y) <= (C(xz) - C(x)) + (C(zy) - C(y)), since C(xy) + C(x) <= C(xy) + C(z) <= C(xz) + C(zy). Assume C(x) <= C(z) <= C(y): then C(xy) - C(x) <= (C(xz) - C(x)) + (C(zy) - C(z)), directly from (III.2). Assume C(y) <= C(z) <= C(x): then C(xy) - C(y) <= (C(xz) - C(z)) + (C(zy) - C(y)), directly from (III.2). Assume C(z) <= C(x) <= C(y): then C(xy) - C(x) <= (C(xz) - C(z)) + (C(zy) - C(z)), since C(xy) + 2C(z) <= C(xy) + C(z) + C(x) <= C(xz) + C(zy) + C(x). Assume C(z) <= C(y) <= C(x): then C(xy) - C(y) <= (C(xz) - C(z)) + (C(zy) - C(z)), since C(xy) + 2C(z) <= C(xy) + C(z) + C(y) <= C(xz) + C(zy) + C(y).

Lemma 5.3: If C is a normal compressor, then E+_C(x,y) = max{C(x), C(y)}.

Proof: Consider a pair (x,y). The maximum max{E_C(x,z): C(z) <= C(x)} is C(x), which is achieved for z = lambda, the empty word, with C(lambda) = 0. Similarly, the maximum max{E_C(y,z): C(z) <= C(y)} is C(y). Hence the lemma.

The normalized version of the admissible distance E_C(x,y), the compressor-based approximation of the normalized information distance (IV.1), is called the normalized compression distance or NCD:

  NCD(x,y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}.   (VI.1)

This is the main concept of this work. It is the real-world version of the ideal notion of normalized information distance NID in (IV.1).

Remark 6.1: In practice, the NCD is a non-negative number 0 <= r <= 1 + epsilon representing how different the two files are. Smaller numbers represent more similar files. The epsilon in the upper bound is due to imperfections in our compression techniques, but for most standard compression algorithms one is unlikely to see an epsilon above 0.1 (in our experiments gzip and bzip2 achieved NCDs above 1, but PPMZ always had an NCD of at most 1). There is a natural interpretation of NCD(x,y): if, say, C(y) >= C(x), then we can rewrite

  NCD(x,y) = (C(xy) - C(x)) / C(y).

That is, the distance NCD(x,y) between x and y is the improvement due to compressing y using x as a previously compressed "data base," and compressing y from scratch, expressed as the ratio between the bit-wise lengths of the two compressed versions. Relative to the reference compressor we can define the information in x about y as I(x; y) = C(y) - C(y|x). Then, using (III.3),

  NCD(x,y) = 1 - I(x; y)/C(y).

That is, the NCD between x and y is 1 minus the ratio of the information x about y and the information in y.

Theorem 6.2: If the compressor is normal, then the NCD is a normalized admissible distance satisfying the metric (in)equalities, that is, a similarity metric.

Proof: If the compressor is normal, then by Lemma 5.1 and Lemma 5.3, the NCD is a normalized admissible distance. It remains to show that it satisfies the three metric (in)equalities.

1) By idempotency we have NCD(x,x) = 0. By monotonicity we have NCD(x,y) >= 0 for every x, y, with inequality for y != x.

2) NCD(x,y) = NCD(y,x). The NCD is unchanged by interchanging x and y in (VI.1).
3) The difficult property is the triangle inequality. Without loss of generality we assume C(x) <= C(y) <= C(z). Since the NCD is symmetrical, there are only three triangle inequalities that can be expressed by NCD(x,y), NCD(x,z), NCD(y,z). We verify them in turn:

a) NCD(x,y) <= NCD(x,z) + NCD(z,y): By distributivity, the compressor itself satisfies C(xy) + C(z) <= C(xz) + C(zy). Subtracting C(x) from both sides and rewriting, C(xy) - C(x) <= C(xz) - C(x) + C(zy) - C(z). Dividing by C(y) on both sides we find

  (C(xy) - C(x))/C(y) <= (C(xz) - C(x) + C(zy) - C(z))/C(y).

The left-hand side is at most 1.

i) Assume the right-hand side is at most 1. Setting C(z) = C(y) + Delta, and adding Delta to both the numerator and denominator of the right-hand side, it can only increase and draw closer to 1. Therefore,

  (C(xy) - C(x))/C(y) <= (C(xz) - C(x) + C(zy) - C(z) + Delta)/(C(y) + Delta)
                       = (C(xz) - C(x))/C(z) + (C(zy) - C(y))/C(z),

which was what we had to prove.

ii) Assume the right-hand side is greater than 1. We proceed as in the previous case, and add Delta to both numerator and denominator. Although now the right-hand side decreases, it must still be greater than 1, and therefore the right-hand side remains at least as large as the left-hand side.

b) NCD(x,z) <= NCD(x,y) + NCD(y,z): By distributivity we have C(xz) + C(y) <= C(xy) + C(yz). Subtracting C(x) from both sides, rearranging, and dividing both sides by C(z) we obtain

  (C(xz) - C(x))/C(z) <= (C(xy) - C(x))/C(z) + (C(yz) - C(y))/C(z).

The right-hand side doesn't decrease when we substitute C(y) for the denominator C(z) of the first term, since C(y) <= C(z). Therefore, the inequality stays valid under this substitution, which was what we had to prove.

c) NCD(y,z) <= NCD(y,x) + NCD(x,z): By distributivity we have C(yz) + C(x) <= C(yx) + C(xz). Subtracting C(y) from both sides, using symmetry, rearranging, and dividing both sides by C(z), we obtain

  (C(yz) - C(y))/C(z) <= (C(xy) - C(x))/C(z) + (C(xz) - C(y))/C(z).

The right-hand side doesn't decrease when we substitute C(y) for the denominator C(z) of the first term (since C(y) <= C(z)), nor when we substitute C(x) for C(y) in the numerator of the second term (since C(x) <= C(y)); after these substitutions the right-hand side equals NCD(y,x) + NCD(x,z). Therefore, the inequality stays valid, which was what we had to prove.

Quasi-Universality: We now digress to the theory developed in [31], which formed the motivation for developing the NCD. If, instead of the result of some real compressor, we substitute the Kolmogorov complexity for the lengths of the compressed files in the NCD formula, the result is the NID as in (IV.1). It is universal in the following sense: Every admissible distance expressing similarity according to some feature, that can be computed from the objects concerned, is comprised (in the sense of minorized) by the NID. Note that every feature of the data gives rise to a similarity, and, conversely, every similarity can be thought of as expressing some feature: being similar in that sense. Our actual practice in using the NCD falls short of this ideal theory in at least three respects:

(i) The claimed universality of the NID holds only for indefinitely long sequences x, y. Once we consider strings x, y of definite length n, it is only universal with respect to "simple" computable normalized admissible distances, where "simple" means that they are computable by programs of length, say, logarithmic in n. This reflects the fact that, technically speaking, the universality is achieved by summing the weighted contribution of all similarity distances in the class considered with respect to the objects considered. Only similarity distances of which the complexity is small (which means that the weight is large), with respect to the size of the data concerned, kick in.

(ii) The Kolmogorov complexity is not computable, and it is in principle impossible to compute how far off the NCD is from the NID. So we cannot in general know how well we are doing using the NCD.

(iii) To approximate the NCD we use standard compression programs like gzip, PPMZ, and bzip2. While better compression of a string will always approximate the Kolmogorov complexity better, this may not be true for the NCD. Due to its arithmetic form, subtraction and division, it is theoretically possible that while all items in the formula get better compressed, the improvement is not the same for all items, and the NCD value moves away from the NID value. In our experiments we have not observed this behavior in a noticeable fashion.

Formally, we can state the following:

Theorem 6.3: Let d be a computable normalized admissible distance and C be a normal compressor. Then, NCD(x,y) <= alpha d(x,y) + epsilon, where for C(x) >= C(y) we have alpha = D+(x)/C(x) and epsilon = (C(x|y) - K(x|y))/C(x), with C(x|y) according to (III.3).

Proof: Fix d, C, x, y in the statement of the theorem. Since the NCD is symmetrical, we can, without loss of generality, let C(x) >= C(y). By (III.3) and the symmetry property C(xy) = C(yx) we have C(x|y) >= C(y|x). Therefore, NCD(x,y) = C(x|y)/C(x). Let d(x,y) be the normalized version of the admissible distance D(x,y); that is, d(x,y) = D(x,y)/D+(x,y). Let d(x,y) = e. By (II.2), there are fewer than 2^{eD+(x)+1} many pairs (x,v) such that d(x,v) <= e and C(v) <= C(x). Since d is computable, we can compute and enumerate all these pairs.
The initially fixed pair (x,y) is an element in the list, and its index takes at most eD+(x) + 1 bits. Therefore, given x, the y can be described by at most eD+(x) + O(1) bits—its index in the list and an O(1) term accounting for the lengths of the programs involved in reconstructing y given its index in the list, and algorithms to compute the functions d and C. Since the Kolmogorov complexity gives the length of the shortest effective description, we have K(y|x) <= eD+(x) + O(1). Substitution and rewriting, using the relation between K(x|y) and K(y|x) up to ignorable additive terms (Section IV), yields NCD(x,y) = C(x|y)/C(x) <= alpha e + epsilon, which was what we had to prove.

Remark 6.4: Clustering according to NCD will group sequences together that are similar according to features that are not explicitly known to us. Analysis of what the compressor actually does still may not tell us which features that make sense to us can be expressed by conglomerates of features analyzed by the compressor. This can be exploited to track down unknown features implicitly in classification: forming automatically clusters of data and seeing in which cluster (if any) a new candidate is placed. Another aspect that can be exploited is exploratory: Given that the NCD is small for a pair x, y of specific sequences, what does this really say about the sense in which these two sequences are similar? The above analysis suggests that close similarity will be due to a dominating feature (that perhaps expresses a conglomerate of subfeatures). Looking into these deeper causes may give feedback about the appropriateness of the realized NCD distances and may help extract more intrinsic information about the objects than the oblivious division into clusters, by looking for the common features in the data clusters.

Given a set of objects, the pairwise NCDs form the entries of a distance matrix. This distance matrix contains the pairwise relations in raw form. But in this format that information is not easily usable. Just as the distance matrix is a reduced form of information representing the original data set, we now need to reduce the information even further in order to achieve a cognitively acceptable format like data clusters. To extract a hierarchy of clusters from the distance matrix, we determine a dendrogram (binary tree) that agrees with the distance matrix according to a cost measure.
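The sketch below assembles such a pairwise NCD distance matrix for a small list of byte strings, again with bzip2 as an arbitrary stand-in for the reference compressor; it is an illustration of the matrix construction, not the CompLearn toolkit.

    # Minimal sketch: pairwise NCD distance matrix for a list of byte strings.
    import bz2

    def C(s: bytes) -> int:
        return len(bz2.compress(s))

    def ncd(x: bytes, y: bytes) -> float:
        cx, cy = C(x), C(y)
        return (C(x + y) - min(cx, cy)) / max(cx, cy)

    def ncd_matrix(objects):
        n = len(objects)
        m = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                m[i][j] = m[j][i] = ncd(objects[i], objects[j])  # symmetric entry
        return m

    files = [b"aaaa" * 500, b"aaab" * 500, b"zzzz" * 500]
    for row in ncd_matrix(files):
        print([round(v, 3) for v in row])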

This allows us to extract more information from the data than just flat clustering (determining disjoint clusters in a dimensional representation). Clusters are groups of objects that are similar according to our metric. There are various ways to cluster. Our aim is to analyze data sets for which the number of clusters is not known a priori, and the data are not labeled. As stated in [16], conceptually simple, hierarchical clustering is among the best known unsupervised methods in this setting, and the most natural way is to represent the relations in the form of a dendrogram, which is customarily a directed binary tree or an undirected ternary tree. To construct the tree from a distance matrix with entries consisting of the pairwise distances between objects, we use a quartet method. This is a matter of choice only; other methods may work equally well. The distances we compute in our experiments are often within the range 0.85 to 1.1. That is, the distinguishing features are small, and we need a sensitive method to extract as much information contained in the distance matrix as is possible. For example, our experiments showed that reconstructing a minimum spanning tree is not sensitive enough and gives poor results. With increasing number of data items, the projection of the matrix information into the tree representation format gets increasingly distorted. A similar situation arises in using alignment cost in genomic comparisons. Experience shows that in both cases the hierarchical clustering methods seem to work best for small sets of data, up to 25 items, and to deteriorate for larger sets, say 40 items or more.

A standard solution to hierarchically cluster larger sets of data is to first cluster nonhierarchically, by say multidimensional scaling or k-means, available in standard packages, for instance Matlab, and then apply hierarchical clustering on the emerging clusters.

The quartet method: We consider every group of four elements from our set of n elements; there are n-choose-4 such groups. From each group u, v, w, x we construct a tree of arity 3, which implies that the tree consists of two subtrees of two leaves each. Let us call such a tree a quartet topology. There are three possibilities, denoted (i) uv|wx, (ii) uw|vx, and (iii) ux|vw, where a vertical bar divides the two pairs of leaf nodes into two disjoint subtrees (Figure 1).

Fig. 1. The three possible quartet topologies for the set of leaf labels u, v, w, x.

For any given tree T and any group of four leaf labels u, v, w, x, we say T is consistent with uv|wx if and only if the path from u to v does not cross the path from w to x. Note that exactly one of the three possible quartet topologies for any set of four labels must be consistent for any given tree. We may think of a large tree having many smaller quartet topologies embedded within its structure. Commonly, the goal in the quartet method is to find (or approximate as closely as possible) the tree that embeds the maximal number of consistent (possibly weighted) quartet topologies from a given set of quartet topologies [19] (Figure 2).
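The consistency test just described can be phrased directly in terms of paths: a tree is consistent with uv|wx exactly when the u-v path and the w-x path share no vertex. The sketch below checks this for a small hypothetical tree given as an adjacency map.

    # Minimal sketch: which of the three quartet topologies of {u,v,w,x}
    # is consistent with a given unrooted ternary tree (adjacency map)?
    def path(tree, a, b):
        """Vertices on the unique a-b path (DFS with parent tracking)."""
        stack, parent = [a], {a: None}
        while stack:
            node = stack.pop()
            if node == b:
                break
            for nxt in tree[node]:
                if nxt not in parent:
                    parent[nxt] = node
                    stack.append(nxt)
        p, out = b, []
        while p is not None:
            out.append(p)
            p = parent[p]
        return set(out)

    def consistent(tree, u, v, w, x):
        """True iff the tree is consistent with the quartet topology uv|wx."""
        return not (path(tree, u, v) & path(tree, w, x))

    # Hypothetical example tree: leaves u,v hang off n0; leaves w,x off n1.
    tree = {"u": ["n0"], "v": ["n0"], "w": ["n1"], "x": ["n1"],
            "n0": ["u", "v", "n1"], "n1": ["w", "x", "n0"]}
    for a, b, c, d in [("u", "v", "w", "x"), ("u", "w", "v", "x"), ("u", "x", "v", "w")]:
        print(f"{a}{b}|{c}{d} consistent:", consistent(tree, a, b, c, d))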

Fig. 2. An example tree consistent with quartet topology uv|wx.

This is called the (weighted) Maximum Quartet Consistency (MQC) problem. We propose a new optimization problem: the Minimum Quartet Tree Cost (MQTC), as follows: The cost of a quartet topology is defined as the sum of the distances between each pair of neighbors; that is, C_{uv|wx} = d(u,v) + d(w,x). The total cost C_T of a tree T with a set N of leaves (external nodes of degree 1) is defined as

  C_T = Sum { C_{uv|wx} : {u,v,w,x} is a subset of N and T is consistent with uv|wx },

the sum of the costs of all its consistent quartet topologies. First, we generate a list of all possible quartet topologies for all four-tuples of labels under consideration. For each group of three possible quartet topologies for a given set of four labels u, v, w, x, calculate a best (minimal) cost m(u,v,w,x) = min{C_{uv|wx}, C_{uw|vx}, C_{ux|vw}} and a worst (maximal) cost M(u,v,w,x) = max{C_{uv|wx}, C_{uw|vx}, C_{ux|vw}}. Summing all best quartet topologies yields the best (minimal) cost m = Sum_{{u,v,w,x} subset of N} m(u,v,w,x). Conversely, summing all worst quartet topologies yields the worst (maximal) cost M = Sum_{{u,v,w,x} subset of N} M(u,v,w,x). For some distance matrices, these minimal and maximal values can not be attained by actual trees; however, the score C_T of every tree T will lie between these two values. In order to be able to compare tree scores in a more uniform way, we now rescale the score linearly such that the worst score maps to 0 and the best score maps to 1, and term this the normalized tree benefit score S(T) = (M - C_T)/(M - m). Our goal is to find a full tree with a maximum value of S(T), which is to say, the lowest total cost.
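A minimal sketch of this bookkeeping: given a distance matrix, it computes the three topology costs per quartet, the global best and worst totals m and M, and the score S(T) = (M - C_T)/(M - m) for a tree cost C_T supplied as a number (obtaining C_T itself requires the consistency test sketched earlier, applied to every quartet); the toy matrix is hypothetical.

    # Minimal sketch of the MQTC quantities: per-quartet topology costs,
    # global best (m) and worst (M) sums, and S(T) for a given tree cost.
    from itertools import combinations

    def quartet_costs(d, u, v, w, x):
        """Costs of the three topologies uv|wx, uw|vx, ux|vw."""
        return (d[u][v] + d[w][x], d[u][w] + d[v][x], d[u][x] + d[v][w])

    def best_worst(d):
        n = len(d)
        m = M = 0.0
        for u, v, w, x in combinations(range(n), 4):
            costs = quartet_costs(d, u, v, w, x)
            m += min(costs)
            M += max(costs)
        return m, M

    def score(d, tree_cost):
        m, M = best_worst(d)
        return (M - tree_cost) / (M - m)

    # Toy 5x5 distance matrix (hypothetical values in the typical NCD range).
    d = [[0.0, 0.9, 0.95, 1.0, 1.0],
         [0.9, 0.0, 0.92, 1.0, 1.0],
         [0.95, 0.92, 0.0, 0.88, 0.9],
         [1.0, 1.0, 0.88, 0.0, 0.85],
         [1.0, 1.0, 0.9, 0.85, 0.0]]
    m, M = best_worst(d)
    print(m, M, score(d, tree_cost=m))  # a tree attaining cost m would score 1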

To express the notion of computational difficulty one uses the notion of "nondeterministic polynomial time (NP)." If a problem concerning n objects is NP-hard, this means that the best known algorithm for this (and a wide class of significant problems) requires computation time exponential in n. That is, it is infeasible in practice. The MQC decision problem is the following: Given n objects, let T be a tree of which the n leaves are labeled by the objects, and let Q_T be the set of quartet topologies embedded in T. Given a set of quartet topologies Q and an integer k, the problem is to decide whether there is a binary tree T such that the number of quartet topologies in Q that are also in Q_T is at least k. In [19] it is shown that the MQC decision problem is NP-hard. For every MQC decision problem one can define an MQTC problem that has the same solution: give the quartet topologies in Q cost 0 and the other ones cost 1. This way the MQC decision problem can be reduced to the MQTC decision problem, which shows also the latter to be NP-hard. Hence, it is infeasible in practice, but we can sometimes solve it, and always approximate it. (The reduction also shows that the quartet problems reviewed in [19] are subsumed by our problem.) Adapting current methods in [6] to our MQTC optimization problem results in far too computationally intensive calculations; they run many months or years on moderate-sized problems of 30 objects. Therefore, we have designed a simple, feasible, heuristic method for our problem based on randomization and hill-climbing. First, a random tree with 2n - 2 nodes is created, consisting of n leaf nodes (with one connecting edge) labeled with the names of the data items, and n - 2 non-leaf or internal nodes labeled with the lowercase letter "n" followed by a unique integer identifier. Each internal node has exactly three connecting edges. For this tree T we calculate the total cost of all embedded quartet topologies, and invert and scale this value to find S(T).

A tree is consistent with precisely 1/3 of all quartet topologies, one for every quartet. A random tree may be consistent with about 1/3 of the best quartet topologies—but because of dependencies this figure is not precise. The initial random tree is chosen as the currently best known tree, and is used as the basis for further searching. We define a simple mutation on a tree as one of the three possible transformations:
1) A leaf swap, which consists of randomly choosing two leaf nodes and swapping them.
2) A subtree swap, which consists of randomly choosing two internal nodes and swapping the subtrees rooted at those nodes.
3) A subtree transfer, whereby a randomly chosen subtree (possibly a leaf) is detached and reattached in another place, maintaining arity invariants.
Each of these simple mutations keeps the number of leaf nodes and internal nodes in the tree invariant; only the structure and placements change. Define a full mutation as a sequence of at least one but potentially many simple mutations, picked according to the following distribution. First we pick the number k of simple mutations that we will perform with probability 2^{-k}. For each such simple mutation, we choose one of the three types listed above with equal probability. Finally, for each of these simple mutations, we uniformly at random select leaves or internal nodes, as appropriate. Notice that trees which are close to the original tree (in terms of number of simple mutation steps in between) are examined often, while trees that are far away from the original tree will eventually be examined, but not very frequently. In order to search for a better tree, we simply apply a full mutation on the currently best known tree T to arrive at T', and then calculate S(T').
If S(T') > S(T), then we keep T' as the new best tree. Otherwise, we repeat the procedure. If S(T') ever reaches 1, then we halt, outputting the best tree. Otherwise, we run until it seems no better trees are being found in a reasonable amount of time, in which case the approximation is complete.

Fig. 3. Progress of a 60-item data set experiment over time (achieved S(T) against the total number of trees examined).

Note that if a tree is ever found such that S(T) = 1, then we can stop, because we can be certain that this tree is optimal, as no tree could have a lower cost. In fact, this perfect tree result is achieved in our artificial tree reconstruction experiment (Section VII-A) reliably in a few minutes. For real-world data, S(T) reaches a maximum somewhat less than 1, presumably reflecting distortion of the information in the distance matrix data by the best possible tree representation, as noted above, or indicating that the search got stuck in a local optimum or that the search space is too large to find the global optimum. On many typical problems of up to 40 objects this tree search gives a tree with a high S(T) within half an hour. For large numbers of objects, the tree scoring itself can be slow (as this takes on the order of n^4 computation steps), and the space of trees is also large, so the algorithm may slow down substantially. For larger experiments, we use a C++/Ruby implementation with MPI (Message Passing Interface, a common standard used on massively parallel computers) on a cluster of workstations in parallel to find trees more rapidly. We can consider the graph mapping the achieved S(T) score as a function of the number of trees examined. Progress typically occurs in a sigmoidal fashion towards a maximal value of at most 1 (Figure 3).
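A compact sketch of this search loop, with the simple mutations and the scoring function passed in as parameters (names are ours; the CompLearn implementation differs in detail):

```python
import random

def geometric_k(rng=random):
    """Number of simple mutations in one full mutation: Pr[k] = 2**-k."""
    k = 1
    while rng.random() < 0.5:
        k += 1
    return k

def full_mutation(tree, simple_mutations, rng=random):
    """Apply a geometric number of simple mutations, each kind chosen
    uniformly from 'simple_mutations' (leaf swap, subtree swap, subtree
    transfer); each mutation is assumed to return a new tree."""
    for _ in range(geometric_k(rng)):
        tree = rng.choice(simple_mutations)(tree)
    return tree

def hill_climb(tree, score, simple_mutations, max_trees=100000, rng=random):
    """Randomized hill-climbing over trees: keep a candidate only if its
    score S(T') exceeds the best score seen so far; stop early at S = 1."""
    best, best_score = tree, score(tree)
    for _ in range(max_trees):
        cand = full_mutation(best, simple_mutations, rng)
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
            if best_score >= 1.0:      # provably optimal tree found
                break
    return best, best_score
```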

A. Three controlled experiments

With the natural data sets we use, one may have the preconception (or prejudice) that, say, music by Bach should be clustered together, music by Chopin should be clustered together, and so should music by rock stars. However, the preprocessed music files of a piece by Bach and a piece by Chopin, or the Beatles, may resemble one another more than two different pieces by Bach, by accident or indeed by design and copying. Thus, natural data sets may have ambiguous, conflicting, or counterintuitive outcomes. In other words, the experiments on natural data sets have the drawback of not having an objective, clear "correct" answer that can function as a benchmark for assessing our experimental outcomes, but only intuitive or traditional preconceptions. We discuss three experiments that show that our program indeed does what it is supposed to do, at least in artificial situations where we know in advance what the correct answer is. Recall that the "similarity machine" we have described consists of two parts: (i) extracting a distance matrix from the data, and (ii) constructing a tree from the distance matrix using our novel quartet-based heuristic.

Fig. 4. The randomly generated tree that our algorithm reconstructed.

Testing the quartet-based tree construction: We first test whether the quartet-based tree construction heuristic is trustworthy. We generated a ternary tree T with 18 leaves, using the pseudo-random number generator "rand" of the Ruby programming language, and derived a metric from it by defining the distance between two nodes as follows: Given the length of the path from a to b, in an integer number of edges, as L(a,b), let d(a,b) = (L(a,b) + 1)/18, except when a = b, in which case d(a,b) = 0. It is easy to verify that this simple formula always gives a number between 0 and 1, and is monotonic with path length. Given only the 18 x 18 matrix of these normalized distances, our quartet method exactly reconstructed the original tree T represented in Figure 4, with S(T) = 1.
Fig. 5. Classification of artificial files with repeated 1-kilobyte tags. Not all possibilities are included; for example, one of the single-tag files is missing. S(T) = 0.905.

Testing on artificial data: Given that the tree reconstruction method is accurate on clean, consistent data, we tried whether the full procedure works in an acceptable manner when we know what the outcome should be like. We used the "rand" pseudo-random number generator from the C programming language standard library under Linux. We randomly generated 11 separate 1-kilobyte blocks of data where each byte was equally probable, and called these tags. Each tag was associated with a different lowercase letter of the alphabet. Next, we generated 22 files of 80 kilobytes each, by starting with a block of purely random bytes and applying one, two, three, or four different tags to it. Applying a tag consists of ten repetitions of picking a random location in the 80-kilobyte file and overwriting that location with the globally consistent tag that is indicated. So, for instance, to create the file referred to in the diagram by "a", we start with 80 kilobytes of random data, then pick ten places to copy over this random data with the arbitrary 1-kilobyte sequence identified as tag a. Similarly, to create file "ab", we start with 80 kilobytes of random data, then pick ten places to put copies of tag a, then pick ten more places to put copies of tag b (perhaps overwriting some of the a tags). Because we never use more than four different tags, and therefore never place more than 40 copies of tags, we can expect that at least half of the data in each file is random and uncorrelated with the rest of the files. The rest of the file is correlated with other files that also contain tags in common; the more tags in common, the more related the files are.
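Such test files are straightforward to generate; a minimal sketch (helper names and file handling are ours):

```python
import os, random

TAG_SIZE = 1024          # 1-kilobyte tags
FILE_SIZE = 80 * 1024    # 80-kilobyte files
REPEATS = 10             # each tag is written at 10 random locations

# one random 1 KB block per lowercase letter a..k
tags = {chr(ord('a') + i): os.urandom(TAG_SIZE) for i in range(11)}

def make_file(tag_names):
    """Return 80 KB of random data overwritten with the named tags,
    e.g. make_file('ab') applies tag 'a' and then tag 'b'."""
    data = bytearray(os.urandom(FILE_SIZE))
    for name in tag_names:
        for _ in range(REPEATS):
            pos = random.randrange(FILE_SIZE - TAG_SIZE)
            data[pos:pos + TAG_SIZE] = tags[name]
    return bytes(data)
```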

The compressor used to compute the NCD matrix was bzip2. The resulting tree is given in Figure 5; it can be seen that the clustering has occurred exactly as we would expect. The S(T) score is 0.905.
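For concreteness, here is a minimal sketch of how such an NCD matrix can be computed with an off-the-shelf compressor (bzip2 via Python's bz2 module), using the NCD formula defined earlier in the paper; the function names are ours.

```python
import bz2

def clen(data: bytes) -> int:
    """Length of the bzip2-compressed version of data."""
    return len(bz2.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def ncd_matrix(files):
    """Pairwise NCD matrix for a list of byte strings."""
    return [[ncd(a, b) for b in files] for a in files]
```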

Testing on heterogeneous natural data: We test gross classification of files based on heterogeneous data of markedly different file types: (i) four mitochondrial gene sequences, from a black bear, polar bear, fox, and rat, obtained from the GenBank Database on the world-wide web; (ii) four excerpts from the novel The Zeppelin's Passenger by E. Phillips Oppenheim, obtained from the Project Gutenberg Edition on the world-wide web; (iii) four MIDI files without further processing, two from Jimi Hendrix and two movements from Debussy's "Suite Bergamasque", downloaded from various repositories on the world-wide web; (iv) two Linux x86 ELF executables (the cp and rm commands), copied directly from the RedHat 9.0 Linux distribution; and (v) two compiled Java class files, generated by ourselves.

Fig. 6. Classification of different file types. The tree agrees exceptionally well with the NCD distance matrix: S(T) = 0.984.

The compressor used to compute the NCD matrix was bzip2. As expected, the program correctly classifies each of the different types of files together, with like near like. The result is reported in Figure 6 with S(T) equal to the very high confidence value 0.984. This experiment shows the power and universality of the method: no features of any specific domain of application are used. We believe that there is no other method known that can cluster data that is so heterogeneous this reliably. This is borne out by the massive experiments with the method in [20].

We developed the CompLearn Toolkit (Section I) and performed experiments in vastly different application fields to test the quality and universality of the method. The success of the method as reported below depends strongly on the judicious use of encoding of the objects compared. Here one should use common sense on what a real-world compressor can do. There are situations where our approach fails if applied in a straightforward way. For example, comparing text files by the same authors in different encodings (say, Unicode and an 8-bit version) is bound to fail. For the ideal similarity metric based on Kolmogorov complexity as defined in [31] this does not matter at all, but for the practical compressors used in the experiments it will be fatal. Similarly, in the music experiments below we use the symbolic MIDI music file format rather than wave format music files. The reason is that the strings resulting from straightforwardly discretizing the wave form files may be too sensitive to how we discretize. Further research may overcome this problem.
Fig. 7. The evolutionary tree built from complete mammalian mtDNA sequences of 24 species, using the NCD matrix of Figure 9. We have redrawn the tree from our output to agree better with the customary phylogeny tree format. The tree agrees exceptionally well with the NCD distance matrix: S(T) = 0.996.

Fig. 8. Multidimensional clustering of the same NCD matrix (Figure 9) as used for Figure 7. Kruskal's stress-1 = 0.389.

A. Genomics and Phylogeny

In recent years, as the complete genomes of various species have become available, it has become possible to do whole genome phylogeny (this overcomes the problem that using different targeted parts of the genome, or proteins, may give different trees [36]). Traditional phylogenetic methods on individual genes depended on multiple alignment of the related proteins and on the model of evolution of individual amino acids. Neither of these is practically applicable to the genome level. In the absence of such models, a method which can compute the shared information between two sequences is useful, because biological sequences encode information, and the occurrence of evolutionary events (such as insertions, deletions, point mutations, rearrangements, and inversions) separating two sequences sharing a common ancestor will result in the loss of their shared information. Our method (in the form of the CompLearn Toolkit) is a fully automated software tool based on such a distance to compare two genomes.

a) Mammalian Evolution: In evolutionary biology the timing and origin of the major extant placental clades (groups of organisms that have evolved from a common ancestor) continues to fuel debate and research. Here, we provide evidence by whole mitochondrial genome phylogeny for competing hypotheses in two main questions: the grouping of the Eutherian orders, and the Theria hypothesis versus the Marsupionta hypothesis.

Eutherian Orders: We demonstrated (already in [31]) that a whole mitochondrial genome phylogeny of the Eutherians (placental mammals) can be reconstructed automatically from unaligned complete mitochondrial genomes by use of an early form of our compression method, using standard software packages. As more genomic material has become available, the debate in biology has intensified concerning which two of the three main groups of placental mammals are more closely related: Primates, Ferungulates, and Rodents. In [7], the maximum likelihood method of phylogeny tree reconstruction gave evidence for the (Ferungulates, (Primates, Rodents)) grouping for half of the proteins in the mitochondrial genomes investigated, and (Rodents, (Ferungulates, Primates)) for the other halves of the mt genomes. In that experiment they aligned 12 concatenated mitochondrial proteins, taken from 20 species: rat (Rattus norvegicus), house mouse (Mus musculus), grey seal (Halichoerus grypus), harbor seal (Phoca vitulina), cat (Felis catus), white rhino (Ceratotherium simum), horse (Equus caballus), finback whale (Balaenoptera physalus), blue whale (Balaenoptera musculus), cow (Bos taurus), gibbon (Hylobates lar), gorilla (Gorilla gorilla), human (Homo sapiens), chimpanzee (Pan troglodytes), pygmy chimpanzee (Pan paniscus), orangutan (Pongo pygmaeus), Sumatran orangutan (Pongo pygmaeus abelii), using opossum (Didelphis virginiana), wallaroo (Macropus robustus), and the platypus (Ornithorhynchus anatinus) as outgroup. In [30], [31] we used the whole mitochondrial genome of the same 20 species, computing the NCD distances (or a closely related distance in [30]), using the GenCompress compressor, followed by tree reconstruction using the neighbor joining program in the MOLPHY package [38], to confirm the commonly believed morphology-supported hypothesis (Rodents, (Primates, Ferungulates)). Repeating the experiment using the hypercleaning method [6] of phylogeny tree reconstruction gave the same result. Here, we repeated this experiment several times using the CompLearn Toolkit, using our new quartet method for reconstructing trees and computing the NCD with various compressors (gzip, bzip2, PPMZ), again always with the same result. These experiments are not reported since they are subsumed by the larger experiment of Figure 7. This is a far larger experiment than the one in [30], [31], aimed at testing two distinct hypotheses simultaneously: the one in the latter references about the Eutherian orders, and the far more general one about the orders of the placental mammals (Eutheria, Metatheria, and Prototheria).
Note also that adding the extra species, from 20 to 24, is an addition that biologists are loath to do, both for computational reasons and for fear of destabilizing a realistic phylogeny by adding even one more species to the computation. Furthermore, in the last mentioned references we used the special-purpose genome compressor GenCompress to determine the distance matrix, and the standard biological software MOLPHY package to reconstruct the phylogeny tree from the distance matrix. In contrast, in this paper we conduct a larger experiment than before, using just the general-purpose compressor bzip2 to obtain the distance matrix, and our new quartet tree reconstruction method to obtain the phylogeny tree, that is, our own CompLearn package [9], used without any change in all the other experiments.

Marsupionta and Theria: The extant monophyletic divisions of the class Mammalia are the Prototheria (monotremes: mammals that procreate using eggs), Metatheria (marsupials: mammals that procreate using pouches), and Eutheria (placental mammals: mammals that procreate using placentas). The sister relationship between these groups is viewed as the most fundamental question in mammalian evolution [21]. Phylogenetic comparison by either anatomy or mitochondrial genome has resulted in two conflicting hypotheses: the gene-isolation-supported Marsupionta hypothesis, ((Prototheria, Metatheria), Eutheria), versus the morphology-supported Theria hypothesis, (Prototheria, (Metatheria, Eutheria)), the third possibility apparently not being held seriously by anyone. There has been a lot of support for either hypothesis; recent support for the Theria hypothesis was given in [21] by analyzing a large nuclear gene (M6P/IG2R), viewed as important across the species concerned, and even more recent support for the Marsupionta hypothesis was given in [18] by phylogenetic analysis of another sequence from the nuclear gene (18S rRNA) and by the whole mitochondrial genome.

Experimental Evidence: To test the Eutherian orders simultaneously with the Marsupionta versus Theria hypotheses, we added four animals to the above twenty: Australian echidna (Tachyglossus aculeatus), brown bear (Ursus arctos), polar bear (Ursus maritimus), using the common carp (Cyprinus carpio) as the outgroup. Interestingly, while there are many species of Eutheria and Metatheria, there are only three species of now-living Prototheria known: the platypus and two types of echidna (or spiny anteater). So our sample of the Prototheria is large. The addition of the new species might be risky, in that the addition of new relations is known to distort the previous phylogeny in traditional computational genomics practice. With our method, using the full genome and obtaining a single tree with a very high confidence S(T) value, that risk is not as great as in traditional methods obtaining ambiguous trees with bootstrap (statistical support) values on the edges. The mitochondrial genomes of the total of 24 species we used were downloaded from the GenBank Database on the world-wide web. Each is around 17,000 bases. The NCD distance matrix was computed using the compressor PPMZ. The resulting phylogeny, with an almost maximal S(T) score of 0.996, supports anew the currently accepted grouping (Rodents, (Primates, Ferungulates)) of the Eutherian orders, and additionally the Marsupionta hypothesis ((Prototheria, Metatheria), Eutheria); see Figure 7. Overall, our whole-mitochondrial NCD analysis supports the hypothesis

    Mammalia = (((primates, ferungulates), rodents), (Metatheria, Prototheria)),

where the subtree ((primates, ferungulates), rodents) constitutes the Eutheria. This indicates that the rodents, and the branch leading to the Metatheria and Prototheria, split off early from the branch that led to the primates and ferungulates. Inspection of the distance matrix shows that the primates are very close together, as are the rodents, the Metatheria, and the Prototheria. These are tightly-knit groups with relatively close NCDs. The ferungulates are a much looser group with generally distant NCDs. The intergroup distances show that the Prototheria are furthest away from the other groups, followed by the Metatheria and the rodents. Also the fine structure of the tree is consistent with biological wisdom.

Hierarchical versus Flat Clustering: This is a good place to contrast the informativeness of hierarchical clustering with multidimensional clustering using the same NCD matrix, exhibited in Figure 9. The entries give a good example of typical NCD values; we truncated the number of decimals from 15 to 3 significant digits to save space. Note that the majority of distances bunch in the range [0.9, 1]. This is due to the regularities the compressor can perceive. The diagonal elements give the self-distance, which, for PPMZ, is not actually 0, but is off from 0 only in the third decimal. In Figure 8 we clustered the 24 animals using the NCD matrix by multidimensional scaling, as points in 2-dimensional Euclidean space. In this method, the NCD matrix of the 24 animals can be viewed as a set of distances between points in n-dimensional Euclidean space (n at most 24), which we want to project into 2-dimensional Euclidean space, trying to distort the distances between the pairs as little as possible. This is akin to the problem of projecting the surface of the earth globe on a two-dimensional map with minimal distance distortion. The main feature is the choice of the measure of distortion to be minimized [16]. Let the original set of distances be d_1, ..., d_k and the projected distances be d'_1, ..., d'_k. In Figure 8 we used the distortion measure Kruskal's stress-1 [24]; the projection minimizes

    stress-1 = sqrt( (sum_i (d_i - d'_i)^2) / (sum_i d_i^2) ).

A Kruskal stress-1 equal to 0 means no distortion, and the worst value is at most 1 (unless you have a really bad projection). In the projection of the NCD matrix according to our quartet method, one minimizes the more subtle distortion measure 1 - S(T), where S(T) = 1 means a perfect representation of the relative relations between every 4-tuple, and S(T) = 0 means a minimal representation. Therefore, we should compare the Kruskal stress-1 with 1 - S(T). Figure 7 has a very good 1 - S(T) = 0.004, and Figure 8 has a poor Kruskal stress-1 of 0.389. Assuming that the comparison is significant for small values (close to perfect projection), we find that the multidimensional scaling of this experiment's NCD matrix is formally inferior to that of the quartet tree. This conclusion formally justifies the impression conveyed by the figures on visual inspection.
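Both distortion measures are easy to compute once the original and projected pairwise distances are available; a minimal sketch, with names of our choosing:

```python
import math

def stress1(original, projected):
    """Kruskal's stress-1 between matched lists of original and
    projected pairwise distances."""
    num = sum((d - p) ** 2 for d, p in zip(original, projected))
    den = sum(d ** 2 for d in original)
    return math.sqrt(num / den)

def tree_distortion(s_t):
    """Distortion of a quartet tree with score S(T), for comparison
    against the stress-1 of a flat (MDS) projection."""
    return 1.0 - s_t
```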
Fig. 9. Distance matrix of the pairwise NCD values between the 24 mammalian mitochondrial genomes. For display purposes, we have truncated the original entries from 15 decimals to 3 decimals precision.

b) SARS Virus: In another experiment we clustered the SARS virus, after its sequenced genome was made publicly available, in relation to potentially similar viruses. The 15 virus genomes were downloaded from The Universal Virus Database of the International Committee on Taxonomy of Viruses, available on the world-wide web. The SARS virus was downloaded from Canada's Michael Smith Genome Sciences Centre, which had the first public

SARS Coronavirus draft whole genome assembly available for download (SARS TOR2 draft genome assembly 120403). The distance matrix was computed using the compressor bzip2. The relations in Figure 10 are very similar to the definitive tree based on medical-macrobio-genomics analysis, appearing later in the New England Journal of Medicine [25]. We depicted the figure in the ternary tree style, rather than the genomics-dendrogram style, since the former is more precise for visual inspection of proximity relations.

Fig. 10. SARS virus among other viruses. Legend: AvianAdeno1CELO.inp: Fowl adenovirus 1; AvianIB1.inp: Avian infectious bronchitis virus (strain Beaudette US); AvianIB2.inp: Avian infectious bronchitis virus (strain Beaudette CK); BovineAdeno3.inp: Bovine adenovirus 3; DuckAdeno1.inp: Duck adenovirus 1; HumanAdeno40.inp: Human adenovirus type 40; HumanCorona1.inp: Human coronavirus 229E; MeaslesMora.inp: Measles virus strain Moraten; MeaslesSch.inp: Measles virus strain Schwarz; MurineHep11.inp: Murine hepatitis virus strain ML-11; MurineHep2.inp: Murine hepatitis virus strain 2; PRD1.inp: Enterobacteria phage PRD1; RatSialCorona.inp: Rat sialodacryoadenitis coronavirus; SARS.inp: SARS TOR2v120403; SIRV1.inp: Sulfolobus virus SIRV-1; SIRV2.inp: Sulfolobus virus SIRV-2. S(T) = 0.988.

c) Analysis of Mitochondrial Genomes of Fungi: As a pilot for applications of the CompLearn Toolkit in fungi genomics research, the group of T. Boekhout, E. Kuramae, and V. Robert, of the Fungal Biodiversity Center, Royal Netherlands Academy of Sciences, supplied us with the mitochondrial genomes of Candida glabrata, Pichia canadensis, Saccharomyces cerevisiae, S. castellii, S. servazzii, Yarrowia lipolytica (all yeasts), and two filamentous ascomycetes, Hypocrea jecorina and Verticillium lecanii. The distance matrix was computed using the compressor PPMZ. The resulting tree is depicted in Figure 11. The interpretation of the fungi researchers is that "the tree clearly clustered the ascomycetous yeasts versus the two filamentous Ascomycetes, thus supporting the current hypothesis on their classification (for example, see [26]). Interestingly, in a recent treatment of the Saccharomycetaceae, S. servazzii, S. castellii and C. glabrata were all proposed to belong to genera different from Saccharomyces, and this is supported by the topology of our tree as well ([27])."

Fig. 11. Dendrogram of the mitochondrial genomes of fungi using NCD. This represents the NCD distance matrix precisely, with S(T) = 0.999.

To compare the veracity of the NCD clustering with a more feature-based clustering, we also calculated the pairwise distances as follows:
Each file is converted to a 4096-dimensional vector by considering the frequency of all (overlapping) 6-byte contiguous blocks. The l2-distance (Euclidean distance) is calculated between each pair of files by taking the square root of the sum of the squares of the componentwise differences. These distances are arranged into a distance matrix and linearly scaled to fit the range [0, 1.0]. Finally, we ran the clustering routine on this distance matrix. The results are in Figure 12.

Fig. 12. Dendrogram of the mitochondrial genomes of fungi using block frequencies. This represents the distance matrix precisely, with S(T) = 0.999.

As seen by comparing with the NCD-based Figure 11, there are apparent misplacements when using the Euclidean distance in this way. Thus, in this simple experiment, the NCD performed better, that is, agreed more precisely with accepted biological knowledge.
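A minimal sketch of this feature-based baseline, under the assumption that the 4096 dimensions correspond to the 4^6 possible 6-letter blocks over the DNA alphabet (the exact encoding and scaling we used may differ in detail):

```python
import itertools, math

ALPHABET = "ACGT"
KMERS = ["".join(p) for p in itertools.product(ALPHABET, repeat=6)]  # 4096
INDEX = {k: i for i, k in enumerate(KMERS)}

def block_vector(seq: str):
    """Frequencies of all overlapping 6-letter blocks of an uppercase
    DNA sequence; blocks with other letters are skipped."""
    v = [0.0] * len(KMERS)
    for i in range(len(seq) - 5):
        idx = INDEX.get(seq[i:i + 6])
        if idx is not None:
            v[idx] += 1.0
    total = max(1.0, len(seq) - 5)
    return [c / total for c in v]

def l2(u, v):
    """Euclidean distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def scaled_l2_matrix(seqs):
    """Pairwise l2 distances, linearly scaled into [0, 1]."""
    vecs = [block_vector(s) for s in seqs]
    m = [[l2(a, b) for b in vecs] for a in vecs]
    top = max(max(row) for row in m) or 1.0
    return [[x / top for x in row] for row in m]
```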

as repeated (resulting in some what better tree) using the compression method in [31 using standard biological softw are packages to construct the ph ylogen ha redone this xperiment, and done ne xperiments, using the CompLearn oolkit. Here, we report on an xperiment to separate radically dif ferent language amilies. do wnloaded the language ersions of the UDoHR te xt in English, Spanish, Dutch, German (Nati e- European), Pemba, Dendi, Ndbele, Kicongo, Somali, Rundi, Ditammari, Dag aare (Nati African), Chikasa Perhupecha, Mazahua, Zapoteco (Nati e-American), and didn preprocess them xcept for

remo ving initial identifying information. used an Lempel-Zi v-type compressor gzip to compress te xt sequences of sizes not xceeding the length of the sliding windo gzip uses (32 kilobytes), and compute the for each pair of language sequences. Subsequently we clustered the result. sho the outcome of one of the xperiments in DostoevskiiCrime DostoevskiiPoorfolk DostoevskiiGmbl DostoevskiiIdiot TurgenevRudin TurgenevGentlefolks TurgenevEve TurgenevOtcydeti TolstoyIunosti TolstoyAnnak TolstoyWar1 GogolPortrDvaiv GogolMertvye GogolDik GogolTaras TolstoyKasak BulgakovMaster BulgakovEggs
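Because gzip forgets everything outside its 32-kilobyte window, an NCD computed this way is only meaningful when each input fits within the window; a small guard of our own design makes this explicit (zlib implements the same DEFLATE scheme as gzip):

```python
import zlib

GZIP_WINDOW = 32 * 1024   # gzip/DEFLATE sliding-window size

def clen_gzip(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def ncd_gzip(x: bytes, y: bytes) -> float:
    """NCD with a gzip-style compressor; refuse inputs that do not fit
    in the 32 KB window, where the measure loses its meaning."""
    if len(x) > GZIP_WINDOW or len(y) > GZIP_WINDOW:
        raise ValueError("input exceeds the gzip sliding window")
    cx, cy, cxy = clen_gzip(x), clen_gzip(y), clen_gzip(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```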

We show the outcome of one of the experiments in Figure 13.

Fig. 13. Clustering of Native-American, Native-African, and Native-European languages. S(T) = 0.928.

Note that the three groups are correctly clustered, and that even the subclusters of the European languages are correct (English is grouped with the Romance languages because it contains up to 40% admixture of words of Latin origin).

Fig. 14. Clustering of Russian writers. Legend: I.S. Turgenev, 1818–1883 [Father and Sons, Rudin, On the Eve, A House of Gentlefolk]; F. Dostoyevsky, 1821–1881 [Crime and Punishment, The Gambler, The Idiot, Poor Folk]; L.N. Tolstoy, 1828–1910 [Anna Karenina, The Cossacks, Youth, War and Peace]; N.V. Gogol, 1809–1852 [Dead Souls, Taras Bulba, The Mysterious Portrait, How the Two Ivans Quarrelled]; M. Bulgakov, 1891–1940 [The Master and Margarita, The Fateful Eggs, The Heart of a Dog]. S(T) = 0.949.

C. Literature

The texts used in this experiment were downloaded from the world-wide web in original Cyrillic-lettered Russian and in Latin-lettered English by L. Afanasiev (a Moldavian MSc student at the University of Amsterdam). The compressor used to compute the NCD matrix was bzip2. We clustered Russian literature in the original (Cyrillic) by Gogol, Dostojevski, Tolstoy, Bulgakov, Tsjechov, with three or four different texts per author. Our purpose was to see whether the clustering is sensitive enough, and the authors distinctive enough, to result in clustering by author. In Figure 14 we see a perfect clustering. Considering the English translations of the same texts, in Figure 15, we see errors in the clustering. Inspection shows that the clustering is now partially based on the translator. It appears that the translator superimposes his characteristics on the texts, partially suppressing the characteristics of the original authors. In other experiments, not reported here, we separated authors by gender and by period.

D. Music

The amount of digitized music available on the internet has grown dramatically in recent years, both in the public domain and on commercial sites. Napster and its clones are prime examples. Websites offering musical content in some form or other (MP3, MIDI, ...) need a way to organize their wealth of material; they need to somehow classify their files according to musical genres and subgenres, putting similar pieces together. The purpose of such organization is to enable users to navigate to pieces of music they already know and like, but also to give them advice and recommendations ("If you like this, you might also like...").
Fig. 15. Clustering of Russian writers translated in English. The translator is given in brackets after the titles of the texts. Legend: I.S. Turgenev, 1818–1883 [Father and Sons (R. Hare), Rudin (Garnett, C. Black), On the Eve (Garnett, C. Black), A House of Gentlefolk (Garnett, C. Black)]; F. Dostoyevsky, 1821–1881 [Crime and Punishment (Garnett, C. Black), The Gambler (C.J. Hogarth), The Idiot (E. Martin), Poor Folk (C.J. Hogarth)]; L.N. Tolstoy, 1828–1910 [Anna Karenina (Garnett, C. Black), The Cossacks (L. and M. Aylmer), Youth (C.J. Hogarth), War and Peace (L. and M. Aylmer)]; N.V. Gogol, 1809–1852 [Dead Souls (C.J. Hogarth), Taras Bulba (G. Tolstoy, 1860, B.C. Baskerville), The Mysterious Portrait and How the Two Ivans Quarrelled (I.F. Hapgood)]; M. Bulgakov, 1891–1940 [The Master and Margarita (R. Pevear, L. Volokhonsky), The Fateful Eggs (K. Gook-Horujy), The Heart of a Dog (M. Glenny)]. S(T) = 0.953.

Currently, such organization is mostly done manually by humans, but some recent research has been looking into the possibilities of automating music classification. Initially, we downloaded 36 separate MIDI (Musical Instrument Digital Interface, a versatile digital music format available on the world-wide web) files selected from a range of classical composers, as well as some popular music. The files were downloaded from several different MIDI databases on the world-wide web. The identifying information (composer, title, and so on) was stripped from the files (otherwise this may give the compressor a marginal advantage in identifying composers). Each of these files was run through a preprocessor to extract just the MIDI Note-On and Note-Off events. These events were then converted to a player-piano style representation, with time quantized in 0.05-second intervals. All instrument indicators, MIDI control signals, and tempo variations were ignored. For each track in the MIDI file, we calculate two quantities: an average volume and a modal note. The average volume is calculated by averaging the volume (MIDI note velocity) of all notes in the track. The modal note is defined to be the note pitch that sounds most often in that track. If this is not unique, then the lowest such note is chosen. The modal note is used as a key-invariant reference point from which to represent all notes. It is denoted by 0; higher notes are denoted by positive numbers, and lower notes are denoted by negative numbers. A value of 1 indicates a half-step above the modal note, and a value of -2 indicates a whole-step below the modal note.

Fig. 16. Output for the 36 pieces from 3 music genres. Legend: 12 Jazz: John Coltrane [Blue Trane, Giant Steps, Lazy Bird, Impressions]; Miles Davis [Milestones, Seven Steps to Heaven, Solar, So What]; George Gershwin [Summertime]; Dizzy Gillespie [Night in Tunisia]; Thelonious Monk [Round Midnight]; Charlie Parker [Yardbird Suite]; 12 Rock and Pop: The Beatles [Eleanor Rigby, Michelle]; Eric Clapton [Cocaine, Layla]; Dire Straits [Money for Nothing]; Led Zeppelin [Stairway to Heaven]; Metallica [One]; Jimi Hendrix [Hey Joe, Voodoo Chile]; The Police [Every Breath You Take, Message in a Bottle]; Rush [Yyz]; 12 Classic: see the Legend of Figure 17. S(T) = 0.858.

The tracks are sorted according to decreasing average volume, and then output in succession. For each track, we iterate through each time sample in order, outputting a single signed 8-bit value for each currently sounding note. Two special values are reserved to represent the end of a time step and the end of a track. This file is then used as input to the compression stage for NCD distance matrix calculation and subsequent tree search. To check whether any important feature of the music was lost during preprocessing, we played it back from the preprocessed files to compare it with the original. To the authors the pieces sounded almost unchanged.
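A rough sketch of this preprocessing, assuming the Note-On/Note-Off events have already been reduced to (start, duration, pitch, velocity) tuples per track by some MIDI parser (the sentinel byte values and other details below are our own choices, not necessarily those of our implementation):

```python
from collections import Counter

QUANTUM = 0.05                    # 0.05-second time steps
END_STEP, END_TRACK = 127, -128   # reserved sentinel values (our choice)

def preprocess_track(notes):
    """notes: (start_sec, dur_sec, pitch, velocity) tuples for one track.
    Returns (average_volume, list of signed byte values)."""
    avg_volume = sum(v for _, _, _, v in notes) / len(notes)
    counts = Counter(p for _, _, p, _ in notes)
    modal = min(counts, key=lambda p: (-counts[p], p))  # lowest pitch wins ties
    end = max(s + d for s, d, _, _ in notes)
    out = []
    for step in range(int(end / QUANTUM) + 1):
        t = step * QUANTUM
        for start, dur, pitch, _ in notes:
            if start <= t < start + dur:                # note sounding now
                out.append(max(-127, min(126, pitch - modal)))
        out.append(END_STEP)
    out.append(END_TRACK)
    return avg_volume, out

def preprocess_file(tracks):
    """Concatenate tracks in order of decreasing average volume and pack
    the signed values into bytes for the compressor."""
    streams = sorted((preprocess_track(t) for t in tracks),
                     key=lambda pair: -pair[0])
    values = [v for _, stream in streams for v in stream]
    return bytes(v & 0xFF for v in values)              # signed to unsigned bytes
```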

The compressor used to compute the NCD matrix of the genres tree (Figure 16) and that of the 12-piece music set (Figure 17) is bzip2. For the full range of the music experiments see [10].
Fig. 17. Output for the 12-piece set. Legend: J.S. Bach [Wohltemperierte Klavier II: Preludes and Fugues 1 and 2 (BachWTK2F1,2 and BachWTK2P1,2)]; Chopin [Préludes op. 28: nos. 1, 15, 22, 24 (ChopPrel1,15,22,24)]; Debussy [Suite Bergamasque, 4 movements (DebusBerg1,2,3,4)]. S(T) = 0.968.

Before testing whether our program can see the distinctions between various classical composers, we first show that it can distinguish between three broader musical genres: classical music, rock, and jazz. This may be easier than making distinctions "within" classical music. All musical pieces we used are listed in the tables in the full paper [10]. For the genre experiment we used 12 classical pieces consisting of Bach, Chopin, and Debussy, 12 jazz pieces, and 12 rock pieces. The tree (Figure 16) that our program came up with has S(T) = 0.858. The discrimination between the three genres is reasonable but not perfect. Since 0.858 is a fairly low S(T) value, the resulting tree doesn't represent the NCD distance matrix very well. Presumably, the information in the NCD distance matrix cannot be represented by a dendrogram of high S(T) score. This appears to be a common problem with large (over 25 or so) natural data sets. Another reason may be that the program terminated while trapped in a local optimum. We repeated the experiment many times with almost the same results, so that doesn't appear to be the case. One 11-item subtree contains 10 of the 12 jazz pieces, together with a piece of Bach's "Wohltemperierte Klavier (WTK)". The other two jazz pieces, Miles Davis' "So What" and John Coltrane's "Giant Steps", are placed elsewhere in the tree, perhaps according to some kinship that now escapes us (but may be identified by closer study of the objects concerned). Of the 12 rock pieces, 10 are placed in the 12-item subtree rooted at n29, together with a piece of Bach's "WTK" and Coltrane's "Giant Steps", while Hendrix's "Voodoo Chile" and Rush's "Yyz" are further away. Of the 12 classical pieces, 10 are in the 13-item subtree rooted at the branch n13, together with Hendrix's "Voodoo Chile", Rush's "Yyz", and Miles Davis' "So What". Surprisingly, 2 of the 4 Bach "WTK" pieces are placed elsewhere. Yet we perceive the 4 Bach pieces to be very close, both structurally and melodically (as they all come from the mono-thematic "Wohltemperierte Klavier"). But the program finds a reason that at this point is hidden from us. In fact, running this experiment with different compressors and termination conditions consistently displayed this anomaly.
The small set encompasses the 4 movements from Debussy's "Suite Bergamasque", 4 movements of book 2 of Bach's "Wohltemperierte Klavier", and 4 preludes from Chopin's "Opus 28". As one can see in Figure 17, our program does a pretty good job at clustering these pieces. The S(T) score is also high: 0.968. The Debussy movements form one cluster, as do the Bach pieces. The only imperfection in the tree, judged by what one would intuitively expect, is that Chopin's Prélude no. 15 lies a bit closer to Bach than to the other Chopin pieces. This Prélude no. 15, in fact, consistently forms an odd-one-out in our other experiments as well. This is an example of pure data mining, since there is some musical truth to this, as no. 15 is perceived as by far the most eccentric among the 24 Préludes of Chopin's opus 28.

E. Optical Character Recognition

Can we also cluster two-dimensional images? Because our method appears focused on strings, this is not straightforward. It turns out that scanning a picture in raster row-major order retains enough regularity in both dimensions for the compressor to grasp. A simple task along these lines is to cluster handwritten characters. The handwritten characters in Figure 18 were downloaded from the NIST Special Data Base 19 (optical character recognition database) on the world-wide web. Each file in the data directory contains one digit image, either a four, a five, or a six. Each pixel is a single character: '#' for a black pixel, '.' for white. Newlines are added at the end of each line. Each character is 128 x 128 pixels. The NCD matrix was computed using the compressor PPMZ.

Fig. 18. Images of the handwritten digits used.

Fig. 19. Clustering of the digit images. S(T) = 0.901.

Figure 19 shows the clusters obtained. There are 10 of each digit "4", "5", "6", making a total of 30 items in this experiment. All but one of the 4s are put in one subtree, all but one of the 5s are put in another subtree, and all 6s are put in a third subtree; the remaining 4 and 5 are on the branch (n23, n13) joining two of these subtrees. So 28 items out of 30 are clustered correctly, that is, 93%. In this experiment we used only 3 digits. Using the full set of decimal digits means that too many objects are involved, resulting in a lower clustering accuracy. However, we can use the NCD as an oblivious feature-extraction technique to convert generic objects into finite-dimensional vectors. We have used this technique to train a support vector machine (SVM) based OCR system to classify handwritten digits by extracting 80 distinct, ordered NCD features from each input image. In this initial stage of ongoing research, by our oblivious method of computing the NCDs to use in the SVM classifier, we achieved a handwritten single decimal digit recognition accuracy of 87%. The current state-of-the-art for this problem, after half a century of interactive feature-driven classification research, is in the upper ninety percent range [34], [40]. All experiments are benchmarked on the standard NIST Special Data Base 19. Using the NCD for general classification by compression is the subject of a future paper.
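One way to realize this oblivious feature extraction is to measure every object's NCD against a fixed, ordered list of anchor objects and feed the resulting vectors to a standard SVM. The sketch below shows the idea; the anchor choice, kernel, and parameters are illustrative and not necessarily those used in our experiment.

```python
import bz2
from sklearn import svm

def ncd(x: bytes, y: bytes) -> float:
    # same bzip2-based NCD as in the earlier sketch
    cx, cy = len(bz2.compress(x)), len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def ncd_features(obj: bytes, anchors):
    """Represent an object by its NCD to a fixed, ordered list of anchor
    objects (80 anchors in our experiment)."""
    return [ncd(obj, a) for a in anchors]

def train_digit_classifier(train_objs, train_labels, anchors):
    """Train an SVM on the NCD feature vectors; kernel and parameters are
    illustrative defaults."""
    X = [ncd_features(o, anchors) for o in train_objs]
    clf = svm.SVC()
    clf.fit(X, train_labels)
    return clf
```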

F. Astronomy

As a proof of principle we clustered data from unknown objects, for example objects that are far away. In [3], observations of the microquasar GRS 1915+105 made with the Rossi X-ray Timing Explorer were analyzed. The interest in this microquasar stems from the fact that it was the first Galactic object to show a certain behavior (superluminal expansion in radio observations). Photonometric observation data from X-ray telescopes were divided into short time segments (usually on the order of one minute), and these segments have been classified into a bewildering array of fifteen different modes after considerable effort. Briefly, spectrum hardness ratios (roughly, "color") and photon count sequences were used to classify a given interval into categories of variability modes. From this analysis, the extremely complex variability of this source was reduced to transitions between three basic states, which, interpreted in astronomical terms, gives rise to an explanation of this peculiar source in standard black-hole theory.

Fig. 20. 16 observation intervals of GRS 1915+105 from four classes. The initial capital letter indicates the class, corresponding to the Greek lower-case letters in [3]. The remaining letters and digits identify the particular observation interval in terms of finer features and identity. Each of the four classes forms its own cluster in a distinct region of the tree. This tree almost exactly represents the underlying NCD distance matrix: S(T) = 0.994.

The data we used in this experiment were made available to us by M. Klein Wolt (co-author of the above paper) and T. Maccarone, both researchers at the Astronomical Institute "Anton Pannekoek" of the University of Amsterdam. The observations are essentially time series, and our aim was to experiment with our method as a pilot for more extensive joint research. Here the task was to see whether the clustering would agree with the classification above. The NCD matrix was computed using the compressor PPMZ. The results are in Figure 20. We clustered 12 objects, consisting of three intervals from four different categories, denoted δ, γ, φ, θ in [3]. In Figure 20 we denote the categories by the corresponding Roman letters D, G, P, and T, respectively. The resulting tree groups these different modes together in a way that is consistent with the classification by experts for these observations. The oblivious compression clustering corresponds precisely with the laborious feature-driven classification in [3]. Further work on clustering of (possibly heterogeneous) time series and anomaly detection, using the new compression method, was recently done on a massive scale in [20].
To interpret what the NCD is doing, and to explain its remarkable accuracy and robustness across application fields and compressors, the intuition is that the NCD minorizes all similarity metrics based on features that are captured by the reference compressor involved. Such features must be relatively simple, in the sense that they are expressed by an aspect that the compressor analyzes (for example frequencies, matches, repeats). Certain sophisticated features may well be expressible as combinations of such simple features, and are therefore themselves simple features in this sense. The extensive experimenting above shows that even elusive features are captured. A potential application of our non-feature (or rather, many-unknown-feature) approach is exploratory: presented with data for which the features are as yet unknown, certain dominant features governing similarity are automatically discovered by the NCD. Examining the data underlying the clusters may yield this hitherto unknown dominant feature.

Our experiments indicate that the NCD has application in two new areas of support vector machine (SVM) based learning. Firstly, we find that the inverted NCD (1 - NCD) is useful as a kernel for generic objects in SVM learning. Secondly, we can use the normal NCD as a feature-extraction technique to convert generic objects into finite-dimensional vectors, see the last paragraph of Section VIII-E. In effect, our similarity engine aims at the ideal of a perfect data mining process, discovering unknown features in which the data can be similar. This is the subject of ongoing joint research in genomics of fungi, clinical molecular genetics, and radio-astronomy.
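A minimal sketch of the first use, with 1 - NCD supplied to an SVM as a precomputed similarity kernel; note that positive semidefiniteness is not guaranteed, so this is a heuristic use of the SVM machinery.

```python
from sklearn import svm

def ncd_kernel_matrix(objects, ncd):
    """Gram matrix with entries 1 - NCD(x, y); 'ncd' is any NCD
    implementation, such as the bzip2-based sketch above."""
    return [[1.0 - ncd(a, b) for b in objects] for a in objects]

# illustrative use with scikit-learn's precomputed-kernel SVM:
# clf = svm.SVC(kernel="precomputed")
# clf.fit(ncd_kernel_matrix(train_objects, ncd), train_labels)
```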

ACKNOWLEDGMENT

We thank Loredana Afanasiev, Graduate School of Logic, University of Amsterdam; Teun Boekhout, Eiko Kuramae, and Vincent Robert, Fungal Biodiversity Center, Royal Netherlands Academy of Sciences; Marc Klein Wolt and Thomas Maccarone, Astronomical Institute "Anton Pannekoek", University of Amsterdam; Evgeny Verbitskiy, Philips Research; Steven de Rooij and Ronald de Wolf, CWI; and the referees and the editors, for suggestions, comments, help with experiments, and data; Jorma Rissanen and Boris Ryabko for discussions; John Langford for suggestions; Tzu-Kuo Huang for pointing out some typos and simplifications; and Teemu Roos and Henri Tirry for implementing a visualization of the clustering process.

REFERENCES

[1] D. Benedetto, E. Caglioti, and V. Loreto, "Language trees and zipping," Physical Review Letters, 88:4(2002), 048702.
[2] Ph. Ball, "Algorithm makes tongue tree," Nature, 22 January 2002.
[3] T. Belloni, M. Klein-Wolt, M. Mendez, M. van der Klis, and J. van Paradijs, "A model-independent analysis of the variability of GRS 1915+105," Astronomy and Astrophysics, 355(2000), 271–290.
[4] C.H. Bennett, P. Gacs, M. Li, P.M.B. Vitanyi, and W. Zurek, "Information distance," IEEE Transactions on Information Theory, 44:4(1998), 1407–1423.

[5] C.H. Bennett, M. Li, and B. Ma, "Chain letters and evolutionary histories," Scientific American, June 2003, 76–81.
[6] D. Bryant, V. Berry, P. Kearney, M. Li, T. Jiang, T. Wareham, and H. Zhang, "A practical algorithm for recovering the best supported edges of an evolutionary tree," Proc. 11th ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, January 9–11, 2000, 287–296.
[7] Y. Cao, A. Janke, P.J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Pääbo, and M. Hasegawa, "Conflict among individual mitochondrial proteins in resolving the phylogeny of Eutherian orders," J. Mol. Evol., 47(1998), 307–322.
[8] X. Chen, B. Francia, M. Li, B. McKinnon, and A. Seker, "Shared information and program plagiarism detection," IEEE Trans. Inform. Th., 50:7(2004), 1545–1551.
[9] R. Cilibrasi, The CompLearn Toolkit, 2003, http://complearn.sourceforge.net/
[10] R. Cilibrasi, P.M.B. Vitanyi, and R. de Wolf, "Algorithmic clustering of music," Computer Music Journal, to appear. http://xxx.lanl.gov/abs/cs.SD/0303025
[11] G. Cormode, M. Paterson, S. Sahinalp, and U. Vishkin, "Communication complexity of document exchange," Proc. 11th ACM-SIAM Symp. on Discrete Algorithms, 2000, 197–206.
[12] T.M. Cover and J.A. Thomas, Elements of Information Theory, Wiley & Sons, 1991.
[13] W. Chai and B. Vercoe, "Folk music classification using hidden Markov models," Proc. International Conference on Artificial Intelligence, 2001.
[14] M. Cooper and J. Foote, "Automatic music summarization via similarity analysis," Proc. IRCAM, 2002.
[15] R. Dannenberg, B. Thom, and D. Watson, "A machine learning approach to musical style recognition," Proc. International Computer Music Conference, pp. 344–347, 1997.
[16] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, 2nd Edition, Wiley Interscience, 2001.
[17] M. Grimaldi, A. Kokaram, and P. Cunningham, "Classifying music by genre using the wavelet packet transform and a round-robin ensemble," Technical Report TCD-CS-2002-64, Trinity College Dublin, 2002. http://www.cs.tcd.ie/publications/tech-reports/reports.02/TCD-CS-2002-64.pdf
[18] A. Janke, O. Magnell, G. Wieczorek, M. Westerman, and U. Arnason, "Phylogenetic analysis of 18S rRNA and the mitochondrial genomes of the wombat, Vombatus ursinus, and the spiny anteater, Tachyglossus aculeatus: increased support for the Marsupionta hypothesis," J. Mol. Evol., 1:54(2002), 71–80.
[19] T. Jiang, P. Kearney, and M. Li, "A polynomial time approximation scheme for inferring evolutionary trees from quartet topologies and its application," SIAM J. Computing, 30:6(2001), 1942–1961.
[20] E. Keogh, S. Lonardi, and C.A. Ratanamahatana, "Towards parameter-free data mining," Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, Seattle, WA, USA, August 22–25, 2004, 206–215.
[21] J.K. Killian, T.R. Buckley, N. Steward, B.L. Munday, and R.L. Jirtle, "Marsupials and Eutherians reunited: genetic evidence for the Theria hypothesis of mammalian evolution," Mammalian Genome, 12(2001), 513–517.
[22] M. Koppel, S. Argamon, and A.R. Shimoni, "Automatically categorizing written texts by author gender," Literary and Linguistic Computing, to appear.
[23] A. Kraskov, H. Stögbauer, R.G. Andrzejak, and P. Grassberger, "Hierarchical clustering based on mutual information," 2003, http://arxiv.org/abs/q-bio/0311039
[24] J.B. Kruskal, "Nonmetric multidimensional scaling: a numerical method," Psychometrika, 29(1964), 115–129.
[25] T.G. Ksiazek, et al., "A novel coronavirus associated with severe acute respiratory syndrome," New England Journal of Medicine, published at www.nejm.org, April 10, 2003 (10.1056/NEJMoa030781).
[26] C.P. Kurtzman and J. Sugiyama, "Ascomycetous yeasts and yeast-like taxa," in: The Mycota VII, Systematics and Evolution, Part A, pp. 179–200, Springer-Verlag, Berlin, 2001.
[27] C.P. Kurtzman, "Phylogenetic circumscription of Saccharomyces, Kluyveromyces and other members of the Saccharomycetaceae, and the proposal of the new genera Lachancea, Nakaseomyces, Naumovia, Vanderwaltozyma and Zygotorulaspora," FEMS Yeast Research, 4(2003), 233–245.
[28] P.S. Laplace, A Philosophical Essay on Probabilities, 1819. English translation, Dover, 1951.
[29] M. Li, J.H. Badger, X. Chen, S. Kwong, P. Kearney, and H. Zhang, "An information-based sequence distance and its application to whole mitochondrial genome phylogeny," Bioinformatics, 17:2(2001), 149–154.
[30] M. Li and P.M.B. Vitanyi, "Algorithmic complexity," pp. 376–382 in: International Encyclopedia of the Social and Behavioral Sciences, N.J. Smelser and P.B. Baltes, Eds., Pergamon, Oxford, 2001/2002.
[31] M. Li, X. Chen, X. Li, B. Ma, and P.M.B. Vitanyi, "The similarity metric," IEEE Trans. Inform. Th., 50:12(2004), 3250–3264.
[32] M. Li and P.M.B. Vitanyi, An Introduction to Kolmogorov Complexity and its Applications, Springer-Verlag, New York, 2nd Edition, 1997.
[33] A. Londei, V. Loreto, and M.O. Belardinelli, "Music style and authorship categorization by informative compressors," Proc. 5th Triannual Conference of the European Society for the Cognitive Sciences of Music (ESCOM), Hannover, Germany, September 8–13, 2003, pp. 200–203.
[34] L.S. Oliveira, R. Sabourin, F. Bortolozzi, and C.Y. Suen, "Automatic recognition of handwritten numerical strings: a recognition and verification strategy," IEEE Trans. Pattern Analysis and Machine Intelligence, 24:11(2002), 1438–1454.
[35] United Nations General Assembly resolution 217 A (III) of 10 December 1948: Universal Declaration of Human Rights, http://www.un.org/Overview/rights.html
[36] A. Rokas, B.L. Williams, N. King, and S.B. Carroll, "Genome-scale approaches to resolving incongruence in molecular phylogenies," Nature, 425(2003), 798–804 (25 October 2003).
[37] D. Salomon, Data Compression, Springer-Verlag, New York, 1997.
[38] N. Saitou and M. Nei, "The neighbor-joining method: a new method for reconstructing phylogenetic trees," Mol. Biol. Evol., 4(1987), 406–425.
[39] P. Scott, "Music classification using neural networks," 2001. http://www.stanford.edu/class/ee373a/musicclassification.pdf
[40] Ø.D. Trier, A.K. Jain, and T. Taxt, "Feature extraction methods for character recognition: a survey," Pattern Recognition, 29:4(1996), 641–662.
[41] P.N. Yianilos, "Normalized forms for two common metrics," NEC Research Institute, Report 91-082-9027-1, 1991, Revision 7/7/2002. http://www.pnylab.com/pny/
[42] A.C.-C. Yang, C.-K. Peng, H.-W. Yien, and A.L. Goldberger, "Information categorization approach to literary authorship disputes," Physica A, 329(2003), 473–483.
[43] G. Tzanetakis and P. Cook, "Music genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, 10:5(2002), 293–302.
[44] S. Wehner, "Analyzing network traffic and worms using compression," Manuscript, CWI, 2004. Partially available at http://homepages.cwi.nl/~wehner/worms/

Rudi Cilibrasi received his B.S. with honors from the California Institute of Technology in 1996. He has programmed computers for over two decades, both in academia and in industry, with various companies in Silicon Valley including Microsoft, in diverse areas such as machine learning, data compression, process control, VLSI design, computer graphics, computer security, and networking protocols, and is now a PhD student at the Centre for Mathematics and Computer Science (CWI) and the University of Amsterdam in the Netherlands. He helped create the first publicly downloadable Normalized Compression Distance software, and now maintains http://complearn.sourceforge.net/. Home page: http://www.cwi.nl/~cilibrar/

Paul M.B. Vitanyi is a Fellow of the Centre for Mathematics and Computer Science (CWI) in Amsterdam and is Professor of Computer Science at the University of Amsterdam. He serves on the editorial boards of Distributed Computing (until 2003), Information Processing Letters, Theory of Computing Systems, Parallel Processing Letters, International Journal of Foundations of Computer Science, Journal of Computer and Systems Sciences (guest editor), and elsewhere. He has worked on cellular automata, computational complexity, distributed and parallel computing, machine learning and prediction, physics of computation, Kolmogorov complexity, and quantum computing. Together with Ming Li he pioneered applications of Kolmogorov complexity and co-authored An Introduction to Kolmogorov Complexity and its Applications, Springer-Verlag, New York, 1993 (2nd Edition 1997), parts of which have been translated into Chinese, Russian, and Japanese. Home page: http://www.cwi.nl/~paulv/