LIMBO: Scalable Clustering of Categorical Data

Periklis Andritsos, Panayiotis Tsaparas, Renée J. Miller, and Kenneth C. Sevcik
{periklis, tsap, miller, kcs}@cs.toronto.edu
University of Toronto, Department of Computer Science

Abstract. Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck

(IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO has the advantage that it can produce clusterings of different sizes in a single execution. We use the IB framework to define a distance measure for categorical tuples, and we also present a novel distance measure for categorical attribute values. We show how the LIMBO algorithm can be used to cluster both tuples and values. LIMBO handles large data sets by producing a memory-bounded summary model for the data. We present an experimental evaluation of LIMBO, and we study how clustering

quality compares to other categorical clustering algorithms. LIMBO supports a trade-off between efficiency (in terms of space and time) and quality. We quantify this trade-off and demonstrate that LIMBO allows for substantial improvements in efficiency with negligible decrease in quality.

1 Introduction

Clustering is a problem of great practical importance that has been the focus of substantial research in several domains for decades. It is defined as the problem of partitioning data objects into groups, such that objects in the same group are similar while objects in different

groups are dissimilar. This definition assumes that there is some well-defined notion of similarity, or distance, between data objects. When the objects are defined by a set of numerical attributes, there are natural definitions of distance based on geometric analogies. These definitions rely on the semantics of the data values themselves (for example, the values $100K and $110K are more similar than $100K and $1). The definition of distance allows us to define a quality measure for a clustering (e.g., the mean square distance between each point and the centroid of

its cluster). Clustering then becomes the problem of grouping together points such that the quality measure is optimized. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. This is often the case in many domains, where data is described by a set of descriptive attributes, many of which are neither numerical nor inherently ordered in any way. As a concrete example, consider a relation that stores information about movies. For the purpose of exposition, a movie is a tuple characterized by the attributes

"director", "actor/actress", and "genre". An instance of this relation is shown in Table 1. In this setting it is not immediately obvious what the distance, or similarity, is between the values "Coppola" and "Scorsese", or the tuples "Vertigo" and "Harvey". Without a measure of distance between data values, it is unclear how to define a quality measure for categorical clustering. To do this, we employ mutual information, a measure from information theory. A good clustering is one where the clusters are informative about the data objects they contain. Since data objects are expressed in terms of attribute

values, we require that the clusters convey information about the attribute values of the objects in the cluster. That is, given a cluster, we wish to predict the attribute values associated with objects of the cluster accurately. The quality measure of the clustering is then the mutual information of the clusters and the attribute values. Since a clustering is a summary of the data, some information
is generally lost. Our objective will be to minimize this loss, or equivalently to minimize the increase in uncertainty as the objects are grouped into fewer and larger clusters.

                     director    actor    genre
t1 (Godfather II)    Scorsese    De Niro  Crime
t2 (Good Fellas)     Coppola     De Niro  Crime
t3 (Vertigo)         Hitchcock   Stewart  Thriller
t4 (N by NW)         Hitchcock   Grant    Thriller
t5 (Bishop's Wife)   Koster      Grant    Comedy
t6 (Harvey)          Koster      Stewart  Comedy

Table 1. An instance of the movie database

Consider partitioning the tuples in Table 1 into two clusters. Clustering $C$ groups the first two movies together into one cluster, $c_1$, and the remaining four into another, $c_2$. Note that cluster $c_1$ preserves all information about the actor and the genre of the movies it holds. For objects in $c_1$, we know with certainty that the genre is "Crime", the

actor is "De Niro", and there are only two possible values for the director. Cluster $c_2$ involves only two different values for each attribute. Any other clustering will result in greater information loss. For example, consider a clustering $C'$ in which one cluster, $c_1'$, is equally informative as $c_1$, but the other, $c_2'$, includes three different actors and three different directors. So, while in $c_2$ there are two equally likely values for each attribute, in $c_2'$ the director is any of "Scorsese", "Coppola", or "Hitchcock" (with respective probabilities 25%, 25%, and 50%), and similarly for the actor. This intuitive idea was formalized by Tishby, Pereira and Bialek [20]. They recast

clustering as the compression of one random variable into a compact representation that preserves as much information as possible about another random variable. Their approach was named the Information Bottleneck (IB) method, and it has been applied to a variety of different areas. In this paper, we consider the application of the IB method to the problem of clustering large data sets of categorical data. We formulate the problem of clustering relations with categorical attributes within the Information Bottleneck framework, and define dissimilarity between categorical data objects based on

the IB method. Our contributions are the following.

- We propose LIMBO, the first scalable hierarchical algorithm for clustering categorical data based on the IB method. As a result of its hierarchical approach, LIMBO allows us, in a single execution, to consider clusterings of various sizes. LIMBO can also control the size of the model it builds to summarize the data.
- We use LIMBO to cluster both tuples (in relational and market-basket data sets) and attribute values. We define a novel distance between attribute values that allows us to quantify the degree of interchangeability of attribute

values within a single attribute.
- We empirically evaluate the quality of clusterings produced by LIMBO relative to other categorical clustering algorithms, including the tuple clustering algorithms AIB, ROCK [13], and COOLCAT [4], as well as the attribute value clustering algorithm STIRR [12]. We compare the clusterings based on a comprehensive set of quality metrics.

The rest of the paper is structured as follows. In Section 2, we present the IB method, and we describe how to formulate the problem of clustering categorical data within the IB framework. In Section 3, we introduce LIMBO and show how it can be

used to cluster tuples. In Section 4, we present a novel distance measure for categorical attribute values and discuss how it can be used within LIMBO to cluster attribute values. Section 5 presents the experimental evaluation of LIMBO and other algorithms for clustering categorical tuples and values. Section 6 describes related work on categorical clustering, and Section 7 discusses additional applications of the LIMBO framework.
2 The Information Bottleneck Method

In this section, we review some of the concepts from information theory that will be used in the rest of the paper. We also introduce the Information Bottleneck method, and we formulate the problem of clustering categorical data within this framework.

2.1 Information Theory basics

The following definitions can be found in any information theory textbook, e.g., [7]. Let $X$ denote a discrete random variable that takes values over the set $\mathbf{X}$, and let $p(x)$ denote the probability mass function of $X$. The entropy $H(X)$ of variable $X$ is defined by

$$H(X) = -\sum_{x \in \mathbf{X}} p(x) \log p(x).$$

Intuitively, entropy captures the "uncertainty" of variable $X$; the higher the entropy, the lower the certainty with which we can predict its value. Now let $X$ and $Y$ be two random variables that range over sets $\mathbf{X}$ and $\mathbf{Y}$ respectively. The conditional entropy of $Y$ given $X$ is defined as follows:

$$H(Y|X) = -\sum_{x \in \mathbf{X}} p(x) \sum_{y \in \mathbf{Y}} p(y|x) \log p(y|x).$$

Conditional entropy captures the uncertainty of predicting the values of variable $Y$ given the values of variable $X$. The mutual information $I(X;Y)$ quantifies the amount of information that the variables convey about each other. Mutual information is symmetric, and non-negative, and it is related to entropy via the equation $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$. Relative entropy, or Kullback-Leibler (KL) divergence, is an information-theoretic measure of the difference between two probability distributions. Given two distributions $p$ and $q$ over a set $\mathbf{X}$, the relative entropy is defined as follows:

$$D_{KL}[p \| q] = \sum_{x \in \mathbf{X}} p(x) \log \frac{p(x)}{q(x)}.$$
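To make these definitions concrete, the following minimal Python sketch (ours, not part of the paper) computes the four quantities above; distributions are plain dictionaries mapping outcomes to probabilities, and all names are illustrative.

```python
import math

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x); zero-probability outcomes contribute 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def conditional_entropy(p_x, p_y_given_x):
    """H(Y|X) = sum_x p(x) H(Y | X = x)."""
    return sum(p_x[x] * entropy(p_y_given_x[x]) for x in p_x)

def mutual_information(p_x, p_y_given_x):
    """I(X;Y) = H(Y) - H(Y|X), with p(y) = sum_x p(x) p(y|x)."""
    p_y = {}
    for x, px in p_x.items():
        for y, pyx in p_y_given_x[x].items():
            p_y[y] = p_y.get(y, 0.0) + px * pyx
    return entropy(p_y) - conditional_entropy(p_x, p_y_given_x)

def kl_divergence(p, q):
    """D_KL[p || q]; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Example: Y determines X completely, so I(X;Y) = H(X) = 1 bit.
p_x = {"a": 0.5, "b": 0.5}
p_y_given_x = {"a": {"0": 1.0}, "b": {"1": 1.0}}
print(mutual_information(p_x, p_y_given_x))  # 1.0
```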

2.2 Clustering using the IB Method

In categorical data clustering, the input to our problem is a set $\mathbf{T}$ of $n$ tuples on $m$ attributes $A_1, A_2, \ldots, A_m$. The domain of attribute $A_i$ is the set $\mathbf{A}_i = \{A_i.v_1, A_i.v_2, \ldots, A_i.v_{d_i}\}$, so that identical values from different attributes are treated as distinct values. A tuple $t \in \mathbf{T}$ takes exactly one value from the set $\mathbf{A}_i$ for the $i$th attribute. Let $\mathbf{A} = \mathbf{A}_1 \cup \cdots \cup \mathbf{A}_m$ denote the set of all possible attribute values, and let $d$ denote the size of $\mathbf{A}$. The data can then be conceptualized as an $n \times d$ matrix $M$, where each tuple $t \in \mathbf{T}$ is a $d$-dimensional row vector in $M$. Matrix entry $M[t,a]$ is 1 if tuple $t$ contains attribute value $a$, and zero otherwise. Each

tuple contains one value for each attribute, so each tuple vector contains exactly $m$ 1's. Now let $T$ and $A$ be random variables that range over the sets $\mathbf{T}$ (the set of tuples) and $\mathbf{A}$ (the set of attribute values), respectively. We normalize matrix $M$ so that the entries of each row sum up to 1. For some tuple $t$, the corresponding row of the normalized matrix holds the conditional probability distribution $p(A|t)$. Since each tuple contains exactly $m$ attribute values, for some $a \in \mathbf{A}$, $p(a|t) = 1/m$ if $a$ appears in tuple $t$, and zero otherwise. Table 2 shows the normalized matrix for the movie database example. A similar formulation can be applied in the case of market-

basket data, where each tuple contains a set of values from a single attribute [1]. A $k$-clustering $C_k$ of the tuples in $\mathbf{T}$ partitions them into $k$ clusters $\mathbf{C}_k = \{c_1, c_2, \ldots, c_k\}$, where each cluster $c_i$ is a non-empty subset of $\mathbf{T}$ such that $c_i \cap c_j = \emptyset$ for all $i \neq j$, and $\bigcup_{i=1}^{k} c_i = \mathbf{T}$. Let $C_k$ denote a random variable that ranges over the clusters in $\mathbf{C}_k$. We define $k$ to be the size of the clustering. When $k$ is fixed, or when it is immaterial to the discussion, we will use $\mathbf{C}$ and $C$ to denote the clustering and the corresponding random variable. For the remainder of the paper, we use italic capital letters (e.g., $T$) to denote random variables, and boldface capital letters (e.g., $\mathbf{T}$) to

denote the set from which the random variable takes values. We use abbreviations for the attribute values. For example, d.H stands for director.Hitchcock.
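As an illustration of this formulation, the following sketch (ours; tuple ids and value abbreviations follow the text) builds $p(t)$ and the normalized rows $p(A|t)$ of Table 2 below from the movie relation of Table 1.

```python
# Movie relation of Table 1, with the value abbreviations used in the text.
movies = {
    "t1": ["d.S", "a.DN", "g.Cr"],   # (Godfather II)
    "t2": ["d.C", "a.DN", "g.Cr"],   # (Good Fellas)
    "t3": ["d.H", "a.S",  "g.T"],    # (Vertigo)
    "t4": ["d.H", "a.G",  "g.T"],    # (N by NW)
    "t5": ["d.K", "a.G",  "g.C"],    # (Bishop's Wife)
    "t6": ["d.K", "a.S",  "g.C"],    # (Harvey)
}

n = len(movies)
values = sorted({a for vals in movies.values() for a in vals})

p_t = {t: 1.0 / n for t in movies}   # equal weight p(t) = 1/n per tuple

# Each tuple holds one value per attribute, so p(a|t) = 1/m where a appears.
p_a_given_t = {
    t: {a: (1.0 / len(vals) if a in vals else 0.0) for a in values}
    for t, vals in movies.items()
}
```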
     d.S  d.C  d.H  d.K  a.DN  a.S  a.G  g.Cr  g.T  g.C  p(t)
t1   1/3  0    0    0    1/3   0    0    1/3   0    0    1/6
t2   0    1/3  0    0    1/3   0    0    1/3   0    0    1/6
t3   0    0    1/3  0    0     1/3  0    0     1/3  0    1/6
t4   0    0    1/3  0    0     0    1/3  0     1/3  0    1/6
t5   0    0    0    1/3  0     0    1/3  0     0    1/3  1/6
t6   0    0    0    1/3  0     1/3  0    0     0    1/3  1/6

Table 2. The normalized movie table

Now let $C$ be a specific clustering. Giving equal weight to each tuple $t$, we define $p(t) = 1/n$. Then, for $c \in \mathbf{C}$, the elements of $\mathbf{T}$, $\mathbf{A}$, and $\mathbf{C}$ are related as follows:

$$p(c) = \sum_{t \in c} p(t) \quad \text{and} \quad p(a|c) = \frac{1}{p(c)} \sum_{t \in c} p(t)\, p(a|t).$$

We seek clusterings of the elements of $\mathbf{T}$ such that, for $t \in c$, knowledge of the cluster identity $c$ provides essentially

the same prediction of, or information about, the values in $\mathbf{A}$ as does the specific knowledge of $t$. The mutual information $I(A;C)$ measures the information about the values in $\mathbf{A}$ provided by the identity of a cluster in $\mathbf{C}$. The higher $I(A;C)$, the more informative the cluster identity is about the values in $\mathbf{A}$ contained in the cluster. Tishby, Pereira and Bialek [20] define clustering as an optimization problem where, for a given number $k$ of clusters, we wish to identify the $k$-clustering that maximizes $I(A;C_k)$. Intuitively, in this procedure, the information contained in $T$ about $A$ is "squeezed" through a compact "bottleneck" clustering $C_k$,

which is forced to represent the "relevant" part of $T$ with respect to $A$. Tishby et al. [20] prove that, for a fixed number $k$ of clusters, the optimal clustering $C_k$ partitions the objects in $\mathbf{T}$ so that the average relative entropy $E_{t,c}\big[D_{KL}[p(A|t) \| p(A|c)]\big]$ is minimized. Finding the optimal clustering is an NP-complete problem [11]. Slonim and Tishby [18] propose a greedy agglomerative approach, the Agglomerative Information Bottleneck (AIB) algorithm, for finding an informative clustering. The algorithm starts with the clustering $C_n$, in which each object $t \in \mathbf{T}$ is assigned to its own cluster. Due to the one-to-one mapping between $\mathbf{C}_n$ and $\mathbf{T}$, $I(A;C_n) = I(A;T)$. The

algorithm then proceeds iteratively for $n - k$ steps, reducing the number of clusters in the current clustering by one in each iteration. At each step of the AIB algorithm, two clusters $c_i, c_j$ in the $\ell$-clustering $C_\ell$ are merged into a single component $c^* = c_i \cup c_j$ to produce a new $(\ell-1)$-clustering $C_{\ell-1}$. As the algorithm forms clusterings of smaller size, the information that the clustering contains about the values in $\mathbf{A}$ decreases; that is, $I(A;C_{\ell-1}) \leq I(A;C_\ell)$. The clusters $c_i$ and $c_j$ to be merged are chosen to minimize the information loss in moving from clustering $C_\ell$ to clustering $C_{\ell-1}$. This information loss is given by $\delta I(c_i,c_j) = I(A;C_\ell) - I(A;C_{\ell-1})$. We can also view the information loss as the increase in the

uncertainty. Recall that $I(A;C) = H(A) - H(A|C)$. Since $H(A)$ is independent of the clustering $C$, maximizing the mutual information $I(A;C)$ is the same as minimizing the entropy of the clustering, $H(A|C)$. For the merged cluster $c^* = c_i \cup c_j$, we have the following:

$$p(c^*) = p(c_i) + p(c_j) \quad (1)$$

$$p(A|c^*) = \frac{p(c_i)}{p(c^*)}\, p(A|c_i) + \frac{p(c_j)}{p(c^*)}\, p(A|c_j) \quad (2)$$

Tishby et al. [20] show that

$$\delta I(c_i,c_j) = [p(c_i) + p(c_j)] \cdot D_{JS}[p(A|c_i), p(A|c_j)] \quad (3)$$

where $D_{JS}$ is the Jensen-Shannon (JS) divergence, defined as follows. Let $p_i = p(A|c_i)$ and $p_j = p(A|c_j)$, and let $\bar{p} = \frac{p(c_i)}{p(c^*)} p_i + \frac{p(c_j)}{p(c^*)} p_j$. Then, the $D_{JS}$ distance is defined as

$$D_{JS}[p_i, p_j] = \frac{p(c_i)}{p(c^*)} D_{KL}[p_i \| \bar{p}] + \frac{p(c_j)}{p(c^*)} D_{KL}[p_j \| \bar{p}].$$
The $D_{JS}$ distance defines a metric and is bounded above by one. We note that the information loss for merging clusters $c_i$ and $c_j$ depends only on the clusters $c_i$ and $c_j$, and not on other parts of the clustering $C_\ell$.
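As a concrete reading of Equations 1-3, the sketch below (our illustration, with distributions as sparse dictionaries) computes the information loss $\delta I(c_i, c_j)$ of a candidate merge.

```python
import math

def kl(p, q):
    """D_KL[p || q]; outcomes with p(x) = 0 contribute nothing."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

def info_loss(p_ci, pa_ci, p_cj, pa_cj):
    """delta-I(ci, cj) = [p(ci) + p(cj)] * D_JS[p(A|ci), p(A|cj)] (Equation 3)."""
    p_star = p_ci + p_cj                                   # Equation 1
    wi, wj = p_ci / p_star, p_cj / p_star
    support = set(pa_ci) | set(pa_cj)
    pi = {a: pa_ci.get(a, 0.0) for a in support}
    pj = {a: pa_cj.get(a, 0.0) for a in support}
    p_bar = {a: wi * pi[a] + wj * pj[a] for a in support}  # Equation 2
    d_js = wi * kl(pi, p_bar) + wj * kl(pj, p_bar)         # JS divergence
    return p_star * d_js
```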

This approach considers all attribute values as a single variable, without taking into account the fact that the values come from different attributes. Alternatively, we could define a random variable for every attribute. We can show that, in applying the Information Bottleneck method to relational data, considering all attributes as a single random variable is equivalent to considering each attribute independently [1]. In the model of the data described so far, every tuple contains one value for each attribute. However, this is not the case when we consider market-basket data, which describes a database

of transactions for a store, where every tuple consists of the items purchased by a single customer. It is also used as a term that collectively describes a data set where the tuples are sets of values of a single attribute, and each tuple may contain a different number of values. In the case of market-basket data, a tuple $t_i$ contains $d_i$ values. Setting $p(t_i) = 1/n$ and $p(a|t_i) = 1/d_i$ if $a$ appears in $t_i$, we can define the mutual information $I(T;A)$ and proceed with the Information Bottleneck method to cluster the tuples.

3 LIMBO Clustering

The Agglomerative Information Bottleneck algorithm suffers from high computational complexity,

namely $O(n^2 d \log n)$, which is prohibitive for large data sets. We now introduce the scaLable InforMation BOttleneck (LIMBO) algorithm, which uses distributional summaries in order to deal with large data sets. LIMBO is based on the idea that we do not need to keep whole tuples, or whole clusters, in main memory, but instead just sufficient statistics to describe them. LIMBO produces a compact summary model of the data, and then performs clustering on the summarized data. In our algorithm, we bound the sufficient statistics, that is, the size of our summary model. This, together with an IB-inspired

notion of distance and a novel definition of summaries to produce the solution, makes our approach different from the one employed in the BIRCH clustering algorithm for clustering numerical data [21]. In BIRCH, a heuristic threshold is used to control the accuracy of the summary created. In the experimental section of this paper, we study the effect of such a threshold in LIMBO.

3.1 Distributional Cluster Features

We summarize a cluster of tuples in a Distributional Cluster Feature (DCF). We will use the information in the relevant DCFs to compute the distance between two clusters, or between a cluster and a

tuple. Let $\mathbf{T}$ denote a set of tuples over a set $\mathbf{A}$ of attributes, and let $T$ and $A$ be the corresponding random variables, as described earlier. Also let $\mathbf{C}$ denote a clustering of the tuples in $\mathbf{T}$, and let $C$ be the corresponding random variable. For some cluster $c \in \mathbf{C}$, the Distributional Cluster Feature (DCF) of cluster $c$ is defined by the pair

$$DCF(c) = \big( p(c),\ p(A|c) \big),$$

where $p(c)$ is the probability of cluster $c$, and $p(A|c)$ is the conditional probability distribution of the attribute values given the cluster $c$. We will often use $DCF(c)$ and $c$ interchangeably. If $c$ consists of a single tuple $t$, then $p(t) = 1/n$ and $p(A|t)$ is computed as described in Section 2. For example, in the movie database, for tuple $t_i$,

$DCF(t_i)$ corresponds to the $i$th row of the normalized matrix in Table 2. For larger clusters, the DCF is computed recursively as follows: let $c^*$ denote the cluster we obtain by merging two clusters $c_1$ and $c_2$. The DCF of the cluster $c^*$ is $DCF(c^*) = \big( p(c^*),\ p(A|c^*) \big)$,
where $p(c^*)$ and $p(A|c^*)$ are computed using Equations 1 and 2, respectively. We define the distance $d(c_1, c_2)$ between $DCF(c_1)$ and $DCF(c_2)$ as the information loss $\delta I(c_1, c_2)$ incurred for merging the corresponding clusters $c_1$ and $c_2$. The distance $d(c_1, c_2)$ is computed using Equation 3. The information loss depends only on the clusters $c_1$ and $c_2$, and not on the clustering in which they belong. Therefore, $d(c_1, c_2)$ is a well-defined distance measure. DCFs can be stored and updated incrementally. The probability vectors are stored as sparse vectors, reducing the amount of space considerably. Each DCF provides a summary of the corresponding cluster which is sufficient for computing the distance between two clusters.
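A minimal sketch (ours, not the authors' implementation) of a DCF with the merge of Equations 1-2 and the distance of Equation 3 might look as follows; $p(A|c)$ is kept as a sparse dictionary, as the text suggests.

```python
import math

class DCF:
    """Distributional Cluster Feature: DCF(c) = (p(c), p(A|c))."""

    def __init__(self, p_c, p_a_given_c):
        self.p = p_c                       # p(c)
        self.dist = dict(p_a_given_c)      # sparse p(A|c): value -> probability

    def merge(self, other):
        """DCF of c* = c1 U c2, via Equations 1 and 2."""
        p_star = self.p + other.p
        wi, wj = self.p / p_star, other.p / p_star
        support = set(self.dist) | set(other.dist)
        merged = {a: wi * self.dist.get(a, 0.0) + wj * other.dist.get(a, 0.0)
                  for a in support}
        return DCF(p_star, merged)

    def distance(self, other):
        """d(c1, c2) = delta-I(c1, c2) of Equation 3 (information loss)."""
        p_bar = self.merge(other).dist
        kl = lambda p: sum(v * math.log2(v / p_bar[a])
                           for a, v in p.items() if v > 0)
        return self.p * kl(self.dist) + other.p * kl(other.dist)
```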

3.2 The DCF tree

The DCF tree is a height-balanced tree, as depicted in Figure 1.

[Figure 1. A DCF tree with branching factor 6.]

Each node in the tree contains at most $B$ entries, where $B$ is the branching factor of the tree. All node entries store DCFs. At any point in the construction of the tree, the DCFs at the leaves define a clustering of the tuples seen so far. Each non-leaf node stores DCFs that are produced by merging the DCFs of its children. The DCF tree is built in a B-tree-like dynamic fashion. The insertion algorithm is described in detail below. After all tuples are inserted in the tree, the DCF tree embodies a compact representation in which the data is

summarized by the DCFs of the leaves.

3.3 The LIMBO clustering algorithm

The LIMBO algorithm proceeds in three phases. In the first phase, the DCF tree is constructed to summarize the data. In the second phase, the DCFs of the tree leaves are merged to produce a chosen number of clusters. In the third phase, we associate each tuple with the DCF to which the tuple is closest.

Phase 1: Insertion into the DCF tree. Tuples are read and inserted one by one. Tuple $t$ is converted into $DCF(t)$, as described in Section 3.1. Then, starting from the root, we trace a path downward in the DCF tree. When at a

non-leaf node, we compute the distance between $DCF(t)$ and each DCF entry of the node, finding the closest DCF entry to $DCF(t)$. We follow the child pointer of this entry to the next level of the tree. When at a leaf node, let $DCF(c)$ denote the DCF entry in the leaf node that is closest to $DCF(t)$; $DCF(c)$ is the summary of some cluster $c$. At this point, we need to decide whether $t$ will be absorbed in the cluster $c$ or not. In our space-bounded algorithm, an input parameter $S$ indicates the maximum space bound. Let $E$ be the maximum size of a DCF entry (note that a sparse DCF may be smaller than $E$). We compute the maximum number of

nodes $N = S/(EB)$, and keep a counter of the number of used nodes as we build the tree. If there is an empty entry in the leaf node that contains $DCF(c)$, then $DCF(t)$ is placed in that entry. If there is no empty leaf entry and there is sufficient free space, then the leaf node is split into two leaves. We find the two DCFs in the leaf node that are farthest apart, and we use them as seeds for the new leaves. The remaining DCFs, and $DCF(t)$, are placed in the leaf that contains the seed DCF to which
they are closest. Finally, if the space bound has been reached, then we compare $d(c,t)$ with the minimum

distance of any two DCF entries in the leaf. If $d(c,t)$ is smaller than this minimum, we merge $DCF(t)$ with $DCF(c)$; otherwise, the two closest entries are merged and $DCF(t)$ occupies the freed entry. When a leaf node is split, resulting in the creation of a new leaf node, the leaf's parent is updated, and a new entry is created at the parent node that describes the newly created leaf. If there is space in the non-leaf node, we add a new DCF entry; otherwise, the non-leaf node must also be split. This process continues upward in the tree until the root is either updated or split itself. In the latter case, the height of the tree increases by one.
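The following simplified sketch of the Phase 1 leaf decision is ours: it assumes the DCF class sketched in Section 3.1, replaces the tree with a flat list of leaf entries, and ignores node splits, which is enough to show the absorb-or-merge logic under the space bound.

```python
def insert_tuple(leaves, dcf_t, max_entries):
    """Insert DCF(t) into a flat list of leaf DCFs under a space bound."""
    if len(leaves) < max_entries:
        leaves.append(dcf_t)               # a free entry is available
        return
    if len(leaves) == 1:
        leaves[0] = leaves[0].merge(dcf_t)
        return
    # Space bound reached: compare d(c, t) against the minimum pairwise
    # distance among the existing leaf entries.
    closest = min(leaves, key=lambda c: c.distance(dcf_t))
    d_min, i, j = min((leaves[a].distance(leaves[b]), a, b)
                      for a in range(len(leaves))
                      for b in range(a + 1, len(leaves)))
    if closest.distance(dcf_t) <= d_min:
        # Absorb t into its closest cluster c.
        leaves[leaves.index(closest)] = closest.merge(dcf_t)
    else:
        # Merge the two closest entries; DCF(t) occupies the freed slot.
        leaves[i] = leaves[i].merge(leaves[j])
        del leaves[j]
        leaves.append(dcf_t)
```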

Phase 2: Clustering. After the construction of the DCF tree, the leaf nodes hold the DCFs of a clustering $C$ of the tuples in $\mathbf{T}$. Each $DCF(c)$ corresponds to a cluster $c \in \mathbf{C}$, and contains sufficient statistics for computing $p(A|c)$ and the probability $p(c)$. We employ the Agglomerative Information Bottleneck (AIB) algorithm to cluster the DCFs in the leaves and produce clusterings of the DCFs. We note that any clustering algorithm is applicable at this phase of the algorithm.

Phase 3: Associating tuples with clusters. For a chosen value of $k$, Phase 2 produces $k$ DCFs that serve as representatives of $k$ clusters. In the final phase, we perform a scan over the data set and assign each tuple to the cluster whose representative is closest to the tuple.
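A compact sketch of the Phase 2 agglomeration (ours; a naive quadratic scan per merge rather than the heap-based variant behind the bound quoted in Section 3.4) over the leaf DCFs:

```python
def aib_cluster(dcfs, k):
    """Greedy Agglomerative IB: merge the cheapest pair until k DCFs remain."""
    clusters = list(dcfs)
    while len(clusters) > k:
        # Choose the pair (ci, cj) whose merge loses the least information.
        _, i, j = min((clusters[a].distance(clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters))))
        merged = clusters[i].merge(clusters[j])
        del clusters[j]                    # j > i, so remove j first
        clusters[i] = merged
    return clusters
```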

3.4 Analysis of LIMBO

We now present an analysis of the I/O and CPU costs for each phase of the LIMBO algorithm. In what follows, $n$ is the number of tuples in the data set, $d$ is the total number of attribute values, $B$ is the branching factor of the DCF tree, and $k$ is the chosen number of clusters.

Phase 1: The I/O cost of this stage is a scan that involves reading the data set from the disk. For the CPU cost, when a new tuple is inserted, the algorithm considers a path of nodes in

the tree, and for each node in the path, it performs at most $B$ operations (distance computations, or updates), each taking time $O(d)$. Thus, if $h$ is the height of the DCF tree produced in Phase 1, locating the correct leaf node for a tuple takes time $O(hdB)$. The time for a split is $O(dB^2)$. If $U$ is the number of non-leaf nodes, then all splits are performed in time $O(dUB^2)$ in total. Hence, the CPU cost of creating the DCF tree is $O(nhdB + dUB^2)$. We observed experimentally that LIMBO produces compact trees of small height (both $h$ and $U$ are bounded).

Phase 2: For values of $S$ that produce clusterings of high quality, the DCF tree is compact

enough to fit in main memory. Hence, there is no I/O cost involved in this phase, since it involves only the clustering of the leaf node entries of the DCF tree. If $L$ is the number of DCF entries at the leaves of the tree, then the AIB algorithm takes time $O(L^2 d \log L)$. In our experiments, $L \ll n$, so the CPU cost is low.

Phase 3: The I/O cost of this phase is the reading of the data set from the disk again. The CPU complexity is $O(kdn)$, since each tuple is compared against the $k$ DCFs that represent the clusters.

4 Intra-Attribute Value Distance

In this section, we propose a novel application that can be used within

LIMBO to quantify the distance between attribute values of the same attribute. Categorical data is characterized by the fact that there is no inherent distance between attribute values. For example, in the movie database instance, given the values "Scorsese" and "Coppola", it is not apparent how to assess their similarity. Comparing the sets of tuples in which they appear is not useful, since every movie has a single director. In order to compare attribute values, we need to place them within a context. Then, two attribute values are similar if the contexts in which they appear are
similar.

We define the context as the distribution these attribute values induce on the remaining attributes. For example, for the attribute "director", two directors are considered similar if they induce a "similar" distribution over the attributes "actor" and "genre". Formally, let $A'$ be the attribute of interest, and let $\mathbf{A'}$ denote the set of values of attribute $A'$. Also let $\mathbf{\tilde{A}}$ denote the set of attribute values for the remaining attributes. For the example of the movie database, if $A'$ is the director attribute, with $\mathbf{A'} = \{d.S, d.C, d.H, d.K\}$, then $\mathbf{\tilde{A}} = \{a.DN, a.S, a.G, g.Cr, g.T, g.C\}$. Let $A'$ and $\tilde{A}$ be random variables that range over $\mathbf{A'}$ and $\mathbf{\tilde{A}}$ respectively, and let $p(\tilde{A}|v)$ denote the distribution that value $v \in \mathbf{A'}$ induces on the values in $\mathbf{\tilde{A}}$. For some $a \in \mathbf{\tilde{A}}$, $p(a|v)$ is the fraction of the tuples in $\mathbf{T}$ that contain $v$ and also contain value $a$. Also, for some $v \in \mathbf{A'}$, $p(v)$ is the fraction of tuples in $\mathbf{T}$ that contain the value $v$. Table 3 shows an example of such a table when $A'$ is the director attribute.

director     a.DN  a.S  a.G  g.Cr  g.T  g.C  p(d)
Scorsese     1/2   0    0    1/2   0    0    1/6
Coppola      1/2   0    0    1/2   0    0    1/6
Hitchcock    0     1/3  1/3  0     2/3  0    2/6
Koster       0     1/3  1/3  0     0    2/3  2/6

Table 3. The "director" attribute
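The following sketch is ours and makes one concrete choice of normalization for the context distribution (each tuple containing $v$ spreads its weight uniformly over its remaining attribute values; the exact scaling behind Table 3 may differ), which is enough to reproduce the ordering of the distances defined below:

```python
import math

movies = [
    ("d.S", "a.DN", "g.Cr"), ("d.C", "a.DN", "g.Cr"),
    ("d.H", "a.S",  "g.T"),  ("d.H", "a.G",  "g.T"),
    ("d.K", "a.G",  "g.C"),  ("d.K", "a.S",  "g.C"),
]

def context(v):
    """p(~A | v): distribution over remaining values in tuples containing v."""
    rows = [t for t in movies if v in t]
    ctx = {}
    for t in rows:
        rest = [a for a in t if a != v]
        for a in rest:
            ctx[a] = ctx.get(a, 0.0) + 1.0 / (len(rows) * len(rest))
    return ctx

def value_distance(v1, v2, p1, p2):
    """delta-I(v1, v2): JS-weighted information loss of merging v1 and v2."""
    c1, c2 = context(v1), context(v2)
    w1, w2 = p1 / (p1 + p2), p2 / (p1 + p2)
    support = set(c1) | set(c2)
    bar = {a: w1 * c1.get(a, 0.0) + w2 * c2.get(a, 0.0) for a in support}
    kl = lambda p: sum(p[a] * math.log2(p[a] / bar[a])
                       for a in support if p.get(a, 0.0) > 0)
    return (p1 + p2) * (w1 * kl(c1) + w2 * kl(c2))

# Scorsese and Coppola induce identical contexts, so their distance is zero.
print(value_distance("d.S", "d.C", 1/6, 1/6))  # 0.0
print(value_distance("d.S", "d.H", 1/6, 2/6))  # > 0
```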

For two values $v_1, v_2 \in \mathbf{A'}$, we define the distance between $v_1$ and $v_2$ to be the information loss $\delta I(v_1, v_2)$ incurred about the variable $\tilde{A}$ if we merge values $v_1$ and $v_2$. This is equal to the increase in the uncertainty of predicting the values of variable $\tilde{A}$ when we replace values $v_1$ and $v_2$ with $v_1 \cup v_2$. In the movie example, Scorsese and Coppola are the most similar directors (a conclusion that agrees with well-informed cinematic opinion). The definition of a distance measure for categorical attribute values is a contribution in itself, since it imposes some structure on an inherently unstructured problem. We can define a distance measure between tuples as the sum of the distances of the individual attributes. Another possible application is to cluster intra-attribute values. For example, in a movie database, we may be interested in discovering clusters of directors

or actors, which in turn could help in improving the classification of movie tuples. Given the joint distribution of random variables $A'$ and $\tilde{A}$, we can apply the LIMBO algorithm for clustering the values of attribute $A'$. Merging two values $v_1$ and $v_2$ produces a new value $v_1 \cup v_2$ with $p(v_1 \cup v_2) = p(v_1) + p(v_2)$, since $v_1$ and $v_2$ never appear together. Also, $p(a|v_1 \cup v_2) = \frac{p(v_1)}{p(v_1 \cup v_2)} p(a|v_1) + \frac{p(v_2)}{p(v_1 \cup v_2)} p(a|v_2)$. The problem of defining a context-sensitive distance measure between attribute values is also considered by Das and Mannila [9]. They define an iterative algorithm for computing the interchangeability of two values. We believe that our approach gives a natural quantification of the concept of

interchangeability. Furthermore, our approach has the advantage that it allows for the definition of distance between clusters of values, which can be used to perform intra-attribute value clustering. Gibson et al. [12] proposed STIRR, an algorithm that clusters attribute values. STIRR does not define a distance measure between attribute values and, furthermore, produces just two clusters of values.

5 Experimental Evaluation

In this section, we perform a comparative evaluation of the LIMBO algorithm on both real and synthetic data sets, and against other categorical clustering algorithms,

including what we believe to be the only other scalable information-theoretic clustering algorithm, COOLCAT [3, 4].
5.1 Algorithms

We compare the clustering quality of LIMBO with the following algorithms.

ROCK Algorithm. ROCK [13] assumes a similarity measure between tuples, and defines a link between two tuples whose similarity exceeds a threshold $\theta$. The aggregate interconnectivity between two clusters is defined as the sum of links between their tuples. ROCK is an agglomerative algorithm, so it is not applicable to large

data sets. We use the Jaccard coefficient for the similarity measure, as suggested in the original paper. For data sets that appear in the original ROCK paper, we set the threshold $\theta$ to the value suggested there; otherwise, we set $\theta$ to the value that gave us the best results in terms of quality. In our experiments, we use the implementation of Guha et al. [13].

COOLCAT Algorithm. The approach most similar to ours is the COOLCAT algorithm [3, 4], by Barbará, Couto and Li. COOLCAT is a scalable algorithm that optimizes the same objective function as our approach, namely the entropy of the clustering.

It differs from our approach in that it relies on sampling, and it is non-hierarchical. COOLCAT starts with a sample of points and identifies a set of $k$ initial tuples such that the minimum pairwise distance among them is maximized. These serve as representatives of the $k$ clusters. All remaining tuples of the data set are placed in one of the clusters such that, at each step, the increase in the entropy of the resulting clustering is minimized. For the experiments, we implement COOLCAT based on the CIKM paper by Barbará et al. [4].

STIRR Algorithm. STIRR [12] applies a linear dynamical system over

multiple copies of a hypergraph of weighted attribute values, until a fixed point is reached. Each copy of the hypergraph contains two groups of attribute values, one with positive and another with negative weights, which define the two clusters. We compare this algorithm with our intra-attribute value clustering algorithm. In our experiments, we use our own implementation and report results for ten iterations.

LIMBO Algorithm. In addition to the space-bounded version of LIMBO described in Section 3, we implemented LIMBO so that the accuracy of the summary model is controlled instead. If we

wish to control the accuracy of the model, we use a threshold on the distance $d(c,t)$ to determine whether to merge $DCF(t)$ with $DCF(c)$, thus directly controlling the information loss for merging tuple $t$ with cluster $c$. The selection of an appropriate threshold value will necessarily be data dependent, and we require an intuitive way of allowing a user to set this threshold. Within a data set, every tuple contributes, on "average", $I(A;T)/n$ to the mutual information $I(A;T)$. We define the clustering threshold to be a multiple $\phi$ of this average, and we denote the threshold by $\tau(\phi)$; that is, $\tau(\phi) = \phi \frac{I(A;T)}{n}$. We can make a pass over the data, or use a sample of the data,

to estimate $I(A;T)$. Given a value for $\phi$, if a merge incurs information loss more than $\phi$ times the "average" mutual information, then the new tuple is placed in a cluster by itself. In the extreme case $\phi = 0.0$, we prohibit any information loss in our summary (this is equivalent to setting $S = \infty$ in the space-bounded version of LIMBO). We discuss the effect of $\phi$ in Section 5.4. To distinguish between the two versions of LIMBO, we shall refer to the space-bounded version as LIMBO_S and the accuracy-bounded one as LIMBO_φ. Note that, algorithmically, only the merging decision in Phase 1 differs in the two versions, while all other phases remain the same for both LIMBO_S and LIMBO_φ.
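Concretely, the accuracy-bounded merging test reduces to a comparison against $\tau(\phi)$; a small sketch (ours, reusing the DCF distance from Section 3.1, with an externally supplied estimate of $I(A;T)$):

```python
def should_absorb(dcf_c, dcf_t, phi, mi_at, n):
    """Merge DCF(t) into DCF(c) only if the information loss is at most
    tau(phi) = phi * I(A;T) / n, the phi-multiple of the average per-tuple
    contribution to the mutual information; mi_at estimates I(A;T)."""
    tau = phi * mi_at / n
    return dcf_c.distance(dcf_t) <= tau
```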

5.2 Data Sets

We experimented with the following data sets. The first three have been previously used for the evaluation of the aforementioned algorithms [4, 12, 13]. The synthetic data sets are used both for quality comparison and for our scalability evaluation.

Congressional Votes. This relational data set was taken from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). It contains 435 tuples of votes from the U.S. Congressional Voting Record of
1984. Each tuple is a congress-person's vote on 16 issues, and each vote is

boolean, either YES or NO. Each congress-person is classified as either Republican or Democrat. There are a total of 168 Republicans and 267 Democrats. There are 288 missing values that we treat as separate values.

Mushroom. The Mushroom relational data set also comes from the UCI Repository. It contains 8,124 tuples, each representing a mushroom characterized by 22 attributes, such as color, shape, odor, etc. The total number of distinct attribute values is 117. Each mushroom is classified as either poisonous or edible. There are 4,208 edible and 3,916 poisonous mushrooms in total.

There are 2,480 missing values.

Database and Theory Bibliography. This relational data set contains 8,000 tuples that represent research papers. About 3,000 of the tuples represent papers from database research, and 5,000 tuples represent papers from theoretical computer science. Each tuple contains four attributes with values for the first author, the second author, the conference/journal, and the year of publication. (Following the approach of Gibson et al. [12], if the second author does not exist, then the name of the first author is copied instead. We also filter the data so that each conference/journal appears a minimum number of times.) We use this data to test our intra-attribute clustering algorithm.

Synthetic Data Sets. We produce synthetic data sets using a data generator available on the Web (http://www.datgen.com/). This generator offers a wide

variety of options in terms of the number of tuples, attributes, and attribute domain sizes. We specify the number of classes in the data set by the use of conjunctive rules of the form (Attr1 = a1 ∧ Attr2 = a2 ⇒ Class = c1). The rules may involve an arbitrary number of attributes and attribute values. We name these synthetic data sets by the prefix DS followed by the number of classes in the data set, e.g., DS5 or DS10. The data sets contain 5,000 tuples and 10 attributes, with domain sizes between 20 and 40 for each attribute. Three attributes participate in the rules the data generator uses to produce the

class labels. Finally, these data sets have up to 10% erroneously entered values. Additional, larger synthetic data sets are described in Section 5.6.

Web Data. This is a market-basket data set that consists of a collection of web pages. The pages were collected as described by Kleinberg [14]. A query is made to a search engine, and an initial set of web pages is retrieved. This set is augmented by including pages that point to, or are pointed to by, pages in the set. Then, the links between the pages are discovered, and the underlying graph is constructed. Following the terminology of Kleinberg [14], we

define a hub to be a page with non-zero out-degree, and an authority to be a page with non-zero in-degree. Our goal is to cluster the authorities in the graph. The set of tuples is the set of authorities in the graph, while the set of attribute values is the set of hubs. Each authority is expressed as a vector over the hubs that point to it. For our experiments, we use the data set used by Borodin et al. [5] for the "abortion" query. We applied a filtering step to ensure that each hub points to more than 10 authorities and each authority is pointed to by more than 10 hubs. The data set

contains 93 authorities related to 102 hubs. We have also applied LIMBO to Software Reverse Engineering data sets, with considerable benefits compared to other algorithms [2].

5.3 Quality Measures for Clustering

Clustering quality lies in the eye of the beholder; determining the best clustering usually depends on subjective criteria. Consequently, we will use several quantitative measures of clustering performance.

Information Loss (IL). We use the information loss, $I(A;T) - I(A;C)$, to compare clusterings. The lower the information loss, the better the clustering. For a clustering with low information loss, given

a cluster, we can predict the attribute values of the tuples in the cluster with relatively high accuracy. We present IL as the percentage of the initial mutual information lost after producing the desired number of clusters using each algorithm.
Category Utility (CU). Category utility [15] is defined as the difference between the expected number of attribute values that can be correctly guessed given a clustering, and the expected number of correct guesses with no such knowledge. CU depends only on the partitioning of the attribute values by the corresponding clustering algorithm and, thus, is a more objective measure. Let $C$ be a clustering. If $A_i$ is an attribute with values $v_{ij}$, then CU is given by the following expression:

$$CU = \sum_{c \in C} \frac{|c|}{n} \sum_{i} \sum_{j} \left[ p(A_i = v_{ij} \mid c)^2 - p(A_i = v_{ij})^2 \right]$$

We present CU as an absolute value that should be compared to the values given by other algorithms, for the same number of clusters, in order to assess the quality of a specific algorithm.
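A sketch of this computation (ours; note that some formulations of category utility also divide by the number of clusters, which the expression above does not):

```python
def category_utility(clustering):
    """CU for a clustering given as a list of clusters of value-list tuples."""
    n = sum(len(c) for c in clustering)
    all_tuples = [t for c in clustering for t in c]
    m = len(all_tuples[0])

    def freqs(tuples):
        """Per-attribute relative frequencies p(A_i = v) within `tuples`."""
        out = [{} for _ in range(m)]
        for t in tuples:
            for i, v in enumerate(t):
                out[i][v] = out[i].get(v, 0.0) + 1.0 / len(tuples)
        return out

    base = sum(p * p for col in freqs(all_tuples) for p in col.values())
    cu = 0.0
    for c in clustering:
        cond = sum(p * p for col in freqs(c) for p in col.values())
        cu += (len(c) / n) * (cond - base)
    return cu
```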

Many data sets commonly used in testing clustering algorithms include a variable that is hidden from the algorithm, and specifies the class with which each tuple is associated. All data sets we consider include such a variable. This variable is not used by the clustering algorithms. While there is no guarantee that any given classification corresponds to an optimal clustering, it is nonetheless enlightening to compare clusterings with pre-specified classifications of tuples. To do this, we use the following quality measures.

Min Classification Error (E_min). Assume that the tuples in $\mathbf{T}$ are already

classified into $k$ classes $\{g_1, \ldots, g_k\}$, and let $C$ denote a clustering of the tuples in $\mathbf{T}$ into $k$ clusters $\{c_1, \ldots, c_k\}$ produced by a clustering algorithm. Consider a one-to-one mapping, $f$, from classes to clusters, such that each class $g_i$ is mapped to the cluster $f(g_i)$. The classification error of the mapping is defined as

$$E = \sum_{i=1}^{k} \left| g_i \setminus f(g_i) \right|,$$

where $|g_i \setminus f(g_i)|$ measures the number of tuples in class $g_i$ that received the wrong label. The optimal mapping between clusters and classes is the one that minimizes the classification error; we use $E_{min}$ to denote the classification error of the optimal mapping.

Precision (P), Recall (R). Without loss of

generality, assume that the optimal mapping assigns class $g_i$ to cluster $c_i$. We define precision, $P_i$, and recall, $R_i$, for a cluster $c_i$, $1 \le i \le k$, as follows:

$$P_i = \frac{|c_i \cap g_i|}{|c_i|} \quad \text{and} \quad R_i = \frac{|c_i \cap g_i|}{|g_i|}.$$

$P_i$ and $R_i$ take values between 0 and 1 and, intuitively, $P_i$ measures the accuracy with which cluster $c_i$ reproduces class $g_i$, while $R_i$ measures the completeness with which $c_i$ reproduces class $g_i$. We define the precision and recall of the clustering as the weighted average of the precision and recall of each cluster. More precisely,

$$P = \sum_{i=1}^{k} \frac{|g_i|}{n} P_i \quad \text{and} \quad R = \sum_{i=1}^{k} \frac{|g_i|}{n} R_i.$$

We think of precision, recall, and classification error as indicative values (percentages) of the ability of the algorithm to reconstruct the existing classes in the data set. In our experiments, we report values for all of the above measures. For LIMBO and COOLCAT, the numbers reported are averages over 100 runs with different (random) orderings of the tuples.
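The sketch below (ours) computes $E_{min}$, $P$, and $R$ by brute-forcing the optimal one-to-one mapping, which is adequate for the small $k$ used here; a Hungarian-style assignment would scale better. Classes and clusters are given as lists of non-empty sets of tuple identifiers, and $E_{min}$ is returned as a fraction of $n$.

```python
from itertools import permutations

def evaluate(classes, clusters):
    """Return (E_min / n, P, R) for k classes and k non-empty clusters."""
    n = sum(len(g) for g in classes)
    k = len(classes)
    # Optimal mapping f: class i -> cluster best[i], minimizing the error.
    best = min(permutations(range(k)),
               key=lambda perm: sum(len(classes[i] - clusters[perm[i]])
                                    for i in range(k)))
    e_min = sum(len(classes[i] - clusters[best[i]]) for i in range(k)) / n
    precision = sum((len(classes[i]) / n)
                    * len(classes[i] & clusters[best[i]]) / len(clusters[best[i]])
                    for i in range(k))
    recall = sum((len(classes[i]) / n)
                 * len(classes[i] & clusters[best[i]]) / len(classes[i])
                 for i in range(k))
    return e_min, precision, recall
```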

5.4 Quality-Efficiency trade-offs for LIMBO

In LIMBO, we can control the size of the model (using $S$) or the accuracy of the model (using $\phi$). Both $S$ and $\phi$ permit a trade-off between the expressiveness (information preservation) of the summarization and the compactness of the model (the number of leaf entries in the tree) it produces. For large values of $S$ and small values of $\phi$, we obtain a fine-grained representation
of the data set at the end of Phase 1. However, this results in a tree with a large number of leaf entries, which leads to a higher computational cost for both Phase 1 and Phase 2 of the algorithm. For small values of $S$ and large values of $\phi$, we obtain a compact representation of the data set (a small number of leaf entries), which results in faster execution time, at the expense of increased information loss. We now investigate this trade-off for a range of values for $S$ and $\phi$. We observed experimentally that the branching factor $B$ does not significantly affect the quality of the clustering. We set $B$ to a small constant value, which results in

a manageable execution time for Phase 1. Figure 2 presents the execution times for LIMBO_S and LIMBO_φ on the DS5 data set, as a function of $S$ and $\phi$, respectively. For $\phi = 0.25$, the Phase 2 time is 210 seconds (beyond the edge of the graph).

[Figure 2. LIMBO_S and LIMBO_φ execution times per phase on DS5, as a function of the buffer size $S$ (in KB) and of $\phi$, annotated with the resulting tree sizes.]

The figures also include the size of the tree in KBytes. In this figure, we observe that for large $S$ and small $\phi$,

the computational bottleneck of the algorithm is Phase 2. As $S$ decreases and $\phi$ increases, the time for Phase 2 decreases in a quadratic fashion. This agrees with the plot in Figure 3, where we observe that the number of leaves also decreases in a quadratic fashion. Due to the decrease in the size (and height) of the tree, the time for Phase 1 also decreases, however, at a much slower rate. Phase 3, as expected, remains unaffected, and it is equal to a few seconds for all values of $S$ and $\phi$. For $S \leq 256$ KB and $\phi \geq 1.0$, the number of leaf entries becomes sufficiently small that the computational bottleneck of the algorithm

becomes Phase 1. For these values, the execution time is dominated by the linear scan of the data in Phase 1. We now study the change in the quality measures for the same range of values for $S$ and $\phi$. In the extreme cases of $S = \infty$ and $\phi = 0.0$, we only merge identical tuples, and no information is lost in Phase 1. LIMBO then reduces to the AIB algorithm, and we obtain the same quality as AIB. Figures 4 and 5 show the quality measures for the different values of $\phi$ and $S$. The CU value (not plotted) is 2.56 for $S \geq 256$ KB, and slightly lower for smaller buffers. We observe that for $S \geq 256$ KB and $\phi \leq 1.0$, we obtain clusterings of exactly the same quality as for $S = \infty$ and $\phi = 0.0$, that is,

the AIB algorithm. At the same time, for $S = 256$ KB and $\phi = 1.0$, the execution time of the algorithm is only a small fraction of that of the AIB algorithm, which was a few minutes.

[Figure 3. Number of LIMBO_φ leaf entries on DS5 as a function of $\phi$, annotated with CU values. Figure 4. LIMBO_φ quality measures (IL, P, R, E_min) on DS5 as a function of $\phi$. Figure 5. LIMBO_S quality measures (IL, P, R, E_min) on DS5 as a function of the buffer size.]
Similar trends were observed for all other data sets. There is a range of values for $S$ and $\phi$ where the execution time of LIMBO is dominated by Phase 1, while at the same time we observe essentially no change (up to the third decimal digit) in the quality of the clustering. Table 4 shows the reduction in the number of leaf entries for each data set for LIMBO_S and LIMBO_φ. The parameters $S$ and $\phi$ are set so that the cluster quality is almost identical to that of AIB (as demonstrated in Table 6).

           Votes    Mushroom   DS5      DS10
LIMBO_S    85.94%   99.34%     95.36%   95.28%
LIMBO_φ    94.01%   99.77%     98.68%   98.82%

Table 4. Reduction in Leaf Entries

These experiments demonstrate that in Phase 1 we can obtain significant compression of the data sets at no expense in the final quality. The consistency of LIMBO can be attributed in part to the effect of Phase 3, which assigns the tuples to cluster representatives, and hides some of the information loss incurred in the previous phases. Thus, it is sufficient for Phase 2 to discover $k$ well-separated representatives. As a result, even for large values of $\phi$ and small values of $S$, LIMBO obtains essentially the same

clustering quality as AIB, but in linear time.

Votes (2 clusters)
Algorithm           size   IL(%)  P     R     E_min  CU
LIMBO (φ = 0.0)     384    72.52  0.89  0.87  0.13   2.89
LIMBO (S = 128KB)   54     72.54  0.89  0.87  0.13   2.89
LIMBO (φ = 1.0)     23     72.55  0.89  0.87  0.13   2.89
COOLCAT (s = 435)   435    73.55  0.87  0.85  0.15   2.78
ROCK                -      74.00  0.87  0.86  0.16   2.63

Mushroom (2 clusters)
Algorithm           size   IL(%)  P     R     E_min  CU
LIMBO (φ = 0.0)     8124   81.45  0.91  0.89  0.11   1.71
LIMBO (S = 128KB)   54     81.46  0.91  0.89  0.11   1.71
LIMBO (φ = 1.0)     18     81.45  0.91  0.89  0.11   1.71
COOLCAT (s = 1000)  1,000  84.57  0.76  0.73  0.27   1.46
ROCK                -      86.00  0.77  0.57  0.43   0.59

Table 5. Results for real data sets

DS5 (n = 5000, 10 attributes, 5 clusters)
Algorithm           size   IL(%)  P      R      E_min  CU
LIMBO (φ = 0.0)     5000   77.56  0.998  0.998  0.002  2.56
LIMBO (S = 1024KB)  232    77.57  0.998  0.998  0.002  2.56
LIMBO (φ = 1.0)     66     77.56  0.998  0.998  0.002  2.56
COOLCAT (s = 125)   125    78.02  0.995  0.995  0.05   2.54
ROCK                -      85.00  0.839  0.724  0.28   0.44

DS10 (n = 5000, 10 attributes, 10 clusters)
Algorithm           size   IL(%)  P      R      E_min  CU
LIMBO (φ = 0.0)     5000   73.50  0.997  0.997  0.003  2.82
LIMBO (S = 1024KB)  236    73.52  0.996  0.996  0.004  2.82
LIMBO (φ = 1.0)     59     73.51  0.994  0.996  0.004  2.82
COOLCAT (s = 125)   125    74.32  0.979  0.973  0.026  2.74
ROCK                -      78.00  0.830  0.818  0.182  2.13

Table 6. Results for synthetic data sets

5.5 Comparative Evaluations

In this section, we demonstrate that LIMBO produces clusterings of

high quality, and we compare it against other categorical clustering algorithms.

Tuple Clustering. Table 5 shows the results for all algorithms on all quality measures for the Votes and Mushroom data sets. For LIMBO_S we present results for $S = 128$ KB, while for LIMBO_φ we present results for $\phi = 1.0$. We can see that both versions of LIMBO have results almost identical to the quality measures for $\phi = 0.0$ and $S = \infty$, i.e., the AIB algorithm. The size entry in the table holds the number of leaf entries for LIMBO, and the sample size for COOLCAT. For the Votes data set, we use the whole data set as a sample, while for Mushroom we use 1,000 tuples. As

Table 5 indicates, LIMBO's quality is superior to that of ROCK and COOLCAT in both data sets. In terms of IL, LIMBO created clusters which retained most of the initial information about the attribute values. With respect to the other measures, LIMBO outperforms all other algorithms, exhibiting the highest P and R in all data sets tested, as well as the lowest E_min. We also evaluate LIMBO's performance on two synthetic data sets, namely DS5 and DS10. These data sets allow us to evaluate our algorithm on data sets with more than two classes. The
VOTES
Algorithm          IL(%): Min   Max    Avg    Var    CU: Min  Max   Avg   Var
LIMBO_S (128KB)    71.98        73.68  72.54  0.08       2.80  2.93  2.89  0.0007
LIMBO_φ            71.98        73.29  72.55  0.083      2.83  2.94  2.89  0.0006
COOLCAT (s = 435)  71.99        95.31  73.55  12.25      0.19  2.94  2.78  0.15

MUSHROOM
Algorithm          IL(%): Min   Max    Avg    Var    CU: Min  Max   Avg   Var
LIMBO_S (1024KB)   81.46        81.46  81.46  0.00       1.71  1.71  1.71  0.00
LIMBO_φ            81.45        81.45  81.45  0.00       1.71  1.71  1.71  0.00
COOLCAT (s = 1000) 81.60        87.07  84.57  3.50       0.80  1.73  1.46  0.05

Table 7. Statistics for IL(%) and CU

The results are shown in Table 6. We observe again that LIMBO has the lowest information loss and produces nearly optimal results with respect to precision and recall. For the ROCK algorithm, we observed that it is very sensitive to the threshold value $\theta$, and in many cases the

algorithm produces one giant cluster that includes tuples from most classes. This results in poor precision and recall.

Comparison with COOLCAT. COOLCAT exhibits average clustering quality that is close to that of LIMBO. It is interesting to examine how COOLCAT behaves when we consider other statistics. In Table 7, we present statistics for 100 runs of COOLCAT and LIMBO on different orderings of the Votes and Mushroom data sets. We present LIMBO results for $S = 128$ KB and $\phi = 1.0$, which are very similar to those for $\phi = 0.0$. For the Votes data set, COOLCAT exhibits information loss as high as 95.31%, with a variance of 12.25%. For all

runs, we use the whole data set as the sample for COOLCAT. For the Mushroom data set, the situation is better, but the variance is still as high as 3.5%. The sample size was 1,000 for all runs. Table 7 indicates that LIMBO behaves in a more stable fashion over different runs (that is, different input orders). Notably, for the Mushroom data set, LIMBO's performance is exactly the same in all runs, while for Votes it exhibits very low variance. This indicates that LIMBO is not particularly sensitive to the input order of data. The performance of COOLCAT appears to be sensitive to the following factors: the choice of

representatives, the sample size, and the ordering of the tuples. After a detailed examination, we found that the runs with maximum information loss for the Votes data set correspond to cases where an outlier was selected as an initial representative. The Votes data set contains three such tuples, which are far from all other tuples, and they are naturally picked as representatives. Reducing the sample size decreases the probability of selecting outliers as representatives; however, it increases the probability of missing one of the clusters. In this case, high information loss may occur if

COOLCAT picks as representatives two tuples that are not maximally far apart. Finally, there are cases where the same representatives may produce different results. As tuples are inserted into the clusters, the representatives "move" closer to the inserted tuples, thus making the algorithm sensitive to the ordering of the data set. In terms of computational complexity, both LIMBO and COOLCAT include a stage that requires quadratic complexity: for LIMBO this is Phase 2, while for COOLCAT it is the step where all pairwise entropies between the tuples in the sample are computed. We experimented with both algorithms

having the same input size for this phase, i.e., we made the sample size of COOLCAT equal to the number of leaves for LIMBO. Results for the Votes and Mushroom data sets are shown in Tables 8 and 9. LIMBO outperforms COOLCAT in all runs, for all quality measures, even though execution time is essentially the same for both algorithms. The two algorithms are closest in quality for the Votes data set with input size 27, and farthest apart for the Mushroom data set with input size 275. COOLCAT appears to perform better with a smaller sample size, while LIMBO remains essentially unaffected.

Web Data. Since this data

set has no predetermined cluster labels, we use a different evaluation approach. We applied LIMBO with $\phi = 0.0$ and clustered the authorities into three clusters. (Due to lack of space, the choice of $k$ is discussed in detail in [1].) The total information loss was 61%. Figure 6 shows the authority-to-hub table, after permuting the rows so that we group together authorities in the same cluster, and the columns so that each hub is assigned to the cluster to which it has the most links.
Sample Size = Leaf Entries = 384
Algorithm  IL(%)  P     R     E_min  CU
LIMBO      72.52  0.89  0.87  0.13   2.89
COOLCAT    74.15  0.86  0.84  0.15   2.63

Sample Size = Leaf Entries = 27
Algorithm  IL(%)  P     R     E_min  CU
LIMBO      72.55  0.89  0.87  0.13   2.89
COOLCAT    73.50  0.88  0.86  0.13   2.87

Table 8. LIMBO vs COOLCAT on Votes

Sample Size = Leaf Entries = 275
Algorithm  IL(%)  P     R     E_min  CU
LIMBO      81.45  0.91  0.89  0.11   1.71
COOLCAT    83.50  0.76  0.73  0.27   1.46

Sample Size = Leaf Entries = 18
Algorithm  IL(%)  P     R     E_min  CU
LIMBO      81.45  0.91  0.89  0.11   1.71
COOLCAT    82.10  0.82  0.81  0.19   1.60

Table 9. LIMBO vs COOLCAT on Mushroom

LIMBO accurately characterizes the structure of the web graph. Authorities are clustered in three distinct clusters. Authorities in the same cluster share many hubs, while those in different

clusters have very few hubs in common. The three different clusters correspond to different viewpoints on the issue of abortion. The first cluster consists of "pro-choice" pages. The second cluster consists of "pro-life" pages. The third cluster contains a set of pages from cincinnati.com that were included in the data set by the algorithm that collects the web pages [5], despite having no apparent relation to the abortion query. A complete list of the results can be found in [1].

Intra-Attribute Value Clustering. We now present results for the application of LIMBO to the problem of intra-attribute

value clustering. For this experiment, we use the Bibliographic data set. (The data set is pre-classified, so class labels are known.) We are interested in clustering the conferences and journals, as well as the first authors of the papers. We compare LIMBO with STIRR, an algorithm for clustering attribute values. Following the description of Section 4, for the first experiment we set the random variable $A'$ to range over the conferences/journals, while variable $\tilde{A}$ ranges over the first and second authors and the year of publication. There are 1,211 distinct venues in the data set; 815 are database venues, and 396 are theory venues. Results for LIMBO_S and LIMBO_φ are shown

in Table 10. LIMBO's results are superior to those of STIRR with respect to all quality measures. The difference is especially pronounced in the P and R measures.

Algorithm  Leaves  IL(%)  P     R     E_min
LIMBO_S    16      94.02  0.90  0.89  0.12
LIMBO_φ    47      94.01  0.90  0.90  0.11
STIRR      -       98.01  0.56  0.55  0.45

Table 10. Bib clustering using LIMBO and STIRR

We now turn to the problem of clustering the first authors. Variable $A'$ ranges over the set of 1,416 distinct first authors in the data set, and variable $\tilde{A}$ ranges over the rest of the attributes. We produce two clusters, and we evaluate the results of LIMBO and STIRR based on the distribution

ution of the papers that were written by first authors in each cluster Figures and illustrate the clusters produced by LIMBO and STIRR, respecti ely The -axis in both figures represents publishing enues while the -axis represents first authors. If an author has published paper in particular enue, this is represented by point in each figure. The thick horizontal line separates the clusters of authors, and the thick ertical line distinguishes between theory and database enues. Database enues lie on the left of the line, while theory ones on the right of the line. From

these figures, it is apparent that LIMBO yields a better partition of the authors than STIRR. The upper half corresponds to a set of theory researchers with almost no publications in database venues. The bottom half corresponds to a set of database researchers with very few publications in theory venues. Our clustering is slightly smudged by the authors between index 400 and 450 that appear to have a number of publications in theory. These are drawn into the database cluster due to their co-authors. STIRR, on the other hand, creates a well-separated theory cluster (upper half), but the second cluster contains authors with publications almost equally distributed between theory and database venues.

[Figure 6. Web data clusters. Figure 7. LIMBO clusters of first authors. Figure 8. STIRR clusters of first authors.]

5.6 Scalability Evaluation

In this section, we study the scalability of LIMBO,

and we investigate how the parameters affect its execution time. We study the execution time of both LIMBO_S and LIMBO_φ. We consider four data sets of size 500K, 1M, 5M, and 10M tuples, each containing 10 clusters and 10 attributes with 20 to 40 values each. The first three data sets are samples of the 10M data set. For LIMBO_S, the size and the number of leaf entries of the DCF tree at the end of Phase 1 are controlled by the parameter $S$. For LIMBO_φ, we study Phase 1 in detail. As we vary $\phi$, Figure 9 demonstrates that the execution time for Phase 1 decreases at a steady rate for values of $\phi$ up to 1.0; beyond that, the execution time drops significantly. This

decrease is due to the reduced number of splits and the decrease in the DCF tree size. In the same plot, we show some indicative sizes of the tree, demonstrating that the vectors that we maintain remain relatively sparse. The average density of the DCF tree vectors, i.e., the average fraction of non-zero entries, remains between 41% and 87%. Figure 10 plots the number of leaves as a function of $\phi$ (the y-axis of Figure 10 has a logarithmic scale). We observe that, for the same range of values for $\phi$, LIMBO produces a manageable DCF tree, with a small number of leaves, leading to a fast execution time in Phase 2. Furthermore, in all our experiments, the height of the tree was never

more than 11, and the occupancy of the tree, i.e., the number of occupied entries over the total possible number of entries, was always above 85.7%, indicating that the memory space was well used.

[Figure 9. Phase 1 execution times for the 500K, 1M, 5M, and 10M data sets as a function of $\phi$, annotated with tree sizes. Figure 10. Number of Phase 1 leaf entries as a function of $\phi$ (logarithmic y-axis).]

Thus, for $\phi > 1.0$, we have a DCF tree with a manageable size and a fast execution time for Phases 1 and 2. For our experiments, we set $\phi = 1.2$ and $\phi = 1.3$. For LIMBO_S, we

use buffer sizes of 1 MB and 5 MB. We now study the total execution time of the algorithm for these parameter values. The graph in Figure 11 shows the execution time for LIMBO_S and LIMBO_φ on the data sets we consider. In this figure, we observe that execution
time scales in a linear fashion with respect to the size of the data set for both versions of LIMBO. We also observed that the clustering quality remained unaffected for all values of $S$ and $\phi$, and it was the same across the data sets (except for IL in one data set, which differed by 0.01%). Precision and Recall were 0.999, and the classification error $E_{min}$ was 0.0013, indicating that LIMBO can produce clusterings of high quality even for large data sets.

[Figure 11. Total execution time (m = 10) for LIMBO_S (S = 1 MB, 5 MB) and LIMBO_φ (φ = 1.2, 1.3) as a function of the number of tuples. Figure 12. Execution time as a function of the number of attributes m, for the 5M and 10M data sets with S = 5 MB and φ = 1.2.]

In our next experiment, we varied the number of attributes, $m$, in the 5M and 10M data sets and ran both LIMBO_S with a buffer size of 5 MB and LIMBO_φ with $\phi = 1.2$. Figure 12 shows the execution time as a function of the number of attributes, for different data set sizes. In all cases, execution time increased linearly. Table 11 presents the quality results for all values of $m$.

m   IL(%)  P      R      E_min   CU
5   49.12  0.991  0.991  0.0013  2.52
10  60.79  0.999  0.999  0.0013  3.87
20  52.01  0.997  0.994  0.0015  4.56

Table 11. LIMBO_S and LIMBO_φ quality

Finally, we varied the number of clusters $k$ from 10 up to 50 in the 10M data set, for $\phi = 1.2$ and $S = 5$ MB. As expected from the analysis of LIMBO in Section 3.4, the number of clusters

affected only Phase 3. Recall from Figure 2 in Section 5.4 that Phase 3 is a small fraction of the total execution time. Indeed, as we increase $k$ from 10 to 50, we observed just a 5% increase in the execution time for LIMBO_S, and just 1% for LIMBO_φ.

6 Other Related Work

CACTUS [10], by Ganti, Gehrke and Ramakrishnan, uses summaries of information constructed from the data set that are sufficient for discovering clusters. The algorithm defines attribute value clusters with overlapping cluster-projections on any attribute. This makes the assignment of tuples to clusters unclear. Our approach is based

on the Information Bottleneck (IB) method, introduced by Tishby, Pereira and Bialek [20]. The Information Bottleneck method has been used in an agglomerative hierarchical clustering algorithm [18], and it has been applied to the clustering of documents [19]. Recently, Slonim et al. [17] introduced the sequential Information Bottleneck (sIB) algorithm, which reduces the running time relative to the agglomerative approach. However, it depends on an initial random partition and requires multiple passes over the data for different initial partitions. In the future, we plan to experiment with sIB in Phase 2 of LIMBO.

Finally, an algorithm that uses an extension to BIRCH [21] is given by Chiu, Fang, Chen, Wang, and Jeris [6]. Their approach assumes that the data follows a multivariate normal distribution. The performance of the algorithm has not been tested on categorical data sets.
7 Conclusions and Future Directions

We have evaluated the effectiveness of LIMBO in trading off either quality for time, or quality for space, to achieve compact, yet accurate, models for small and large categorical data sets. We have shown LIMBO to have advantages over other information-theoretic clustering algorithms, including AIB (in terms of scalability) and COOLCAT (in terms of clustering quality and parameter stability). We have also shown advantages in quality over other scalable and non-scalable algorithms designed to cluster either categorical tuples or values. With our space-bounded version of LIMBO (LIMBO_S), we can build a model in one pass over the data in a fixed amount of memory, while still effectively controlling information loss in the model. These properties make LIMBO amenable for use in clustering streaming categorical data [8]. In addition, to the best of our knowledge, LIMBO is the only scalable categorical

gorical algorithm that is hierarchical. Using its compact summary model, LIMBO ef ficiently uilds clusterings for not just single alue of ut for lar ge range of alues (typically hundreds). Furthermore, we are also able to produce statistics that let us directly compare clusterings. are currently formalizing the use of such statistics in determining good alues for Finally we plan to apply LIMBO as data mining technique to schema disco ery [16]. Refer ences [1] Andritsos, Tsaparas, R. J. Miller and K. C. Se vcik. Limbo: linear algorithm to cluster cate gorical data. echnical report, UofT

[2] P. Andritsos and V. Tzerpos. Software Clustering based on Information Loss Minimization. In WCRE, Victoria, BC, Canada, 2003.
[3] D. Barbará, J. Couto, and Y. Li. An Information Theory Approach to Categorical Clustering. Submitted for publication.
[4] D. Barbará, J. Couto, and Y. Li. COOLCAT: An entropy-based algorithm for categorical clustering. In CIKM, McLean, VA, 2002.
[5] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas. Finding authorities and hubs from link structures on the World Wide Web. In WWW-10, Hong Kong, 2001.
[6] T. Chiu, D. Fang, J. Chen, Y. Wang, and C. Jeris. A Robust and Scalable Clustering Algorithm for Mixed Type Attributes in a Large Database Environment. In KDD, San Francisco, CA, 2001.
[7] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley & Sons, 1991.
[8] D. Barbará. Requirements for Clustering Data Streams. SIGKDD Explorations, 3(2), Jan. 2002.
[9] G. Das and H. Mannila. Context-Based Similarity Measures for Categorical Databases. In PKDD, Lyon, France, 2000.
[10] V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering Categorical Data Using Summaries. In KDD, San Diego, CA, 1999.
[11] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
[12] D. Gibson, J. M. Kleinberg, and P. Raghavan. Clustering Categorical Data: An Approach Based on Dynamical Systems. In VLDB, New York, NY, 1998.
[13] S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust Clustering Algorithm for Categorical Attributes. In ICDE, Sydney, Australia, 1999.
[14] J. M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. In SODA, San Francisco, CA, 1998.
[15] M. A. Gluck and J. E. Corter. Information, Uncertainty and the Utility of Categories. In COGSCI, Irvine, CA, USA, 1985.
[16] R. J. Miller and P. Andritsos. On Schema Discovery. IEEE Data Engineering Bulletin, 26(3):39-44, 2003.
[17] N. Slonim, N. Friedman, and N. Tishby. Unsupervised Document Classification using Sequential Information Maximization. In SIGIR, Tampere, Finland, 2002.
[18] N. Slonim and N. Tishby. Agglomerative Information Bottleneck. In NIPS, Breckenridge, CO, 1999.
[19] N. Slonim and N. Tishby. Document Clustering Using Word Clusters via the Information Bottleneck Method. In SIGIR, Athens, Greece, 2000.
[20] N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In 37th Annual Allerton Conference on Communication, Control and Computing, Urbana-Champaign, IL, 1999.
[21] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In SIGMOD, Montreal, QC, 1996.