Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape, and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on [...] cover clusters of arbitrary shape. DBSCAN requires only one [...]

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu
Institute for Computer Science, University of Munich
Oettingenstr. 67, D-80538 München, Germany
{ester | kriegel | sander | xwxu}@informatik.uni-muenchen.de

Published in Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)

[...] clusters found by a partitioning algorithm is convex, which is very restrictive.

Ng & Han (1994) explore partitioning algorithms for [...] (Clustering Large Applications based on RANdomized Search) is introduced, which is an improved k-medoid method [...] is more effective and more efficient. An experimental evaluation indicates that CLARANS runs efficiently on databases [...] of clusters in a database. They propose to run CLARANS once for each k from 2 to n. For each of the discovered clusterings the silhouette coefficient (Kaufman & Rousseeuw 1990) is calculated, and finally, the clustering with the maximum silhouette coefficient is chosen as the "natural" clustering. Unfortunately, the run time of this approach is prohibitive for large n, because it implies O(n) calls of CLARANS.

[...] for large databases. Furthermore, the run time of CLARANS is prohibitive on large databases. Therefore, Ester, Kriegel & Xu (1995) present several focusing techniques which address [...] process on the relevant parts of the database.
First, the focus is [...]

Hierarchical algorithms create a hierarchical decomposition [...] dendrogram, a tree that iteratively splits [...] into smaller [...] a hierarchy, each node of the tree represents a cluster of [...] The dendrogram can either be created from the leaves up to [...] (agglomerative approach) or from the root down to the leaves (divisive approach) by merging or dividing clusters [...] as an input. However, a [...] has to be defined indicating when the merge or division process should be terminated. One example of a termination condition in the agglomerative approach [...]

So far, the main problem with hierarchical clustering algorithms has been the difficulty of deriving appropriate parameters for the termination condition, e.g. a value of D [...] at the same time large enough such that no cluster is split into two parts. Recently, in the area of signal processing the hierarchical [...] (García, Fdez-Valdivia, Cortijo & Molina 1994) automatically deriving a termination condition. Its key idea is that two points belong to the same cluster if you can walk from the first point to the second one by a "sufficiently small" step. Ejcluster follows the divisive approach. It does not require any input of domain knowledge. Furthermore, experiments show that it is very effective in discovering non-convex clusters. However, the computational cost of Ejcluster is O(n[...] moderate values for n, but it is prohibitive for applications on large databases.

Jain (1988) explores a density based approach to identify [...] partitioned into a number of nonoverlapping cells and histograms are constructed. Cells with relatively high frequency counts [...] between clusters fall in the "valleys" of the histogram. This method has the capability of identifying clusters of any shape. However, the space and run-time requirements for [...] enormous.
Even if the space and run-time requirements are [...]

[...] figure 1, we can easily and unambiguously detect clusters of points and noise points not belonging to any of those clusters.

figure 1: Sample databases (database 1, database 2)

The main reason why we recognize the clusters is that within each cluster we have a typical density of points which is considerably higher than outside of the cluster. Furthermore, the density within the areas of noise is lower than the density in any of the clusters.

In the following, we try to formalize this intuitive notion of [...] points of some [...] The key idea is that for each point of a cluster the neighborhood of a given radius has to contain at least a minimum [...] exceed some threshold. The shape of a neighborhood is determined [...] For instance, when using the [...] the neighborhood is rectangular. Note that our approach works with any [...] chosen for some given application. For the purpose of proper visualization, all examples will be in 2D space using the Euclidean [...]

[...] of a point p, denoted by N_Eps(p), [...] N_Eps(p) = {q ∈ D | dist(p,q) ≤ Eps}.

A naive approach could require for each point in a cluster [...] in an Eps-neighborhood of that point. However, this approach fails because there are two kinds of points in a cluster, points inside of the cluster (core points) [...] (border points) [...] we would have to set the minimum number of points to a relatively low value in order to include all points belonging to the same cluster. This value, however, will not be characteristic for the respective cluster, particularly in the presence of noise. Therefore, we require that for every point p in a cluster [...] points. This definition is elaborated in the following.

[...]: (directly density-reachable) A point p is directly density-reachable from a point q wrt. Eps, MinPts if p ∈ N_Eps(q) and |N_Eps(q)| ≥ MinPts (core point condition).

Obviously, directly density-reachable is symmetric for pairs of core points. In general, however, it is not symmetric if one core point and one border point are involved. Figure 2 shows [...]

[...] reachable from a point q wrt.
Eps and MinPts if there is a chain of points p_1, ..., p_n, p_1 = q, p_n = p such that p_(i+1) is directly density-reachable from p_i.

Density-reachability is a canonical extension of direct density-reachability. This relation is transitive, but it is not symmetric. Figure 3 depicts the relations of some sample points and, in particular, the asymmetric case. Although not symmetric in general, it is obvious that density-reachability [...]

Two border points of the same cluster C are possibly not [...] condition might not hold for both of them. However, there [...] density-connectivity which covers this relation of border [...]

[...] to a point q wrt. Eps and MinPts if there is a point [...]

Density-connectivity is a symmetric relation. For density reachable points, the relation of density-connectivity is also reflexive (c.f. figure 3).

Now, we are able to define our density-based notion of a cluster. Intuitively, a cluster is defined to be a set of density-[...]ty. Noise will be defined relative to a given set of clusters. [...] not belonging to any of [...]

Let D be a database of points. A [...] satisfying the following conditions:
1) ∀ p, q: if p ∈ C and q is density-reachable from p wrt. Eps and MinPts, then q ∈ C. (Maximality)
2) ∀ p, q ∈ C: p is density-connected to q wrt. Eps and MinPts. (Connectivity)

[...] and MinPts as the set of points in the database not belonging to any cluster [...]

[...] MinPts points because of the following reasons. Since C [...] Consequently, the Eps-Neighborhood of o contains at least MinPts [...]

The following lemmata are important for validating the correctness of our clustering algorithm. Intuitively, they state the following. Given the parameters Eps and MinPts, we can discover a cluster in a two-step approach. First, [...] core point condition as a seed. Second, retrieve all points [...]

[...] MinPts. [...] = {o | o [...]

It is not obvious that a cluster wrt. Eps and MinPts is [...] of its core points. However, [...] is density-reachable from any of the core [...] and, therefore, a cluster contains exactly the [...]

[...] be a cluster wrt. Eps and MinPts and let p be any point in [...] with |N [...] equals [...] = {o | o is density-reachable from p wrt.
Eps and [...]

[...] is designed to discover the clusters and the noise in a spatial database according to definitions 5 and 6. Ideally, we would have to know the appropriate parameters Eps and MinPts of each cluster and at least one point from the respective cluster. Then, we could retrieve all points that are density-reachable from the given point using the correct parameters. But there is no easy way to get this information in advance for all clusters of the database. However, there is a simple and effective heuristic (presented in section 4.2) to determine [...] uses global values for Eps and MinPts, i.e. the same values [...] cluster are good candidates for these global parameter values specifying the lowest density which is not considered to be [...]

figure 2: core points and border points

figure 3: density-reachability and density-connectivity

4.1 The Algorithm

To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable from p wrt. Eps [...] the next point of the database.

Since we use global values for Eps and MinPts, DBSCAN may merge two clusters according to definition 5 into one cluster, if two clusters of different density are "close" to each other. Let the [...] and S [...] Then, two sets of points having at least the density of the [...] distance between the two sets is larger than Eps. Consequently, a recursive call of DBSCAN may be necessary for the detected clusters with a higher value for MinPts. This is, however, no disadvantage because the recursive application of DBSCAN yields an elegant and very efficient basic algorithm. Furthermore, the recursive clustering of the points of [...]

In the following, we present a basic version of DBSCAN [...] is either the whole database or a discovered cluster from a previous run. [...] and [...] are [...] returns the i-th element [...]ed below: [...] SetOfPoints [...] as a list of points. Region queries can be supported efficiently by spatial access methods [...] to be available in a SDBS for efficient processing of several types of spatial queries (Brinkhoff et al. 1994).
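The procedure described above (start at an arbitrary point, test the core point condition, then collect everything density-reachable from it) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it is written in Python rather than C++, it replaces the spatial access method with a linear scan (so each region query costs O(n) instead of O(log n)), and the names `region_query`, `expand_cluster`, `dbscan`, `UNCLASSIFIED` and `NOISE` are chosen here for illustration. It assumes Euclidean distance and that the Eps-neighborhood of a point includes the point itself.

```python
import math
from collections import deque

UNCLASSIFIED = None  # point has not been visited yet (illustrative constant)
NOISE = -1           # point is currently labelled as noise (illustrative constant)


def region_query(points, i, eps):
    """Indices of all points in the Eps-neighborhood of points[i], including i.

    Assumption: a linear scan stands in for an index-supported region query.
    """
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def expand_cluster(points, labels, i, cluster_id, eps, min_pts):
    """Grow a cluster from points[i]; return False if points[i] is not a core point."""
    seeds = region_query(points, i, eps)
    if len(seeds) < min_pts:
        labels[i] = NOISE  # may be relabelled later if it turns out to be a border point
        return False
    for j in seeds:
        labels[j] = cluster_id
    queue = deque(j for j in seeds if j != i)
    while queue:
        current = queue.popleft()
        neighbours = region_query(points, current, eps)
        if len(neighbours) >= min_pts:  # core point condition
            for j in neighbours:
                if labels[j] is UNCLASSIFIED:
                    labels[j] = cluster_id
                    queue.append(j)
                elif labels[j] == NOISE:
                    # Border point: known not to be a core point, so not queued.
                    labels[j] = cluster_id
    return True


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is UNCLASSIFIED:
            if expand_cluster(points, labels, i, cluster_id, eps, min_pts):
                cluster_id += 1
    return labels
```

For example, `dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (50, 50)], eps=1.5, min_pts=3)` groups the first four points into one cluster and labels the last point as noise. Swapping `region_query` for an index-backed lookup would change only the cost of each query, not the labels.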
The height of an R*-tree is O(log n) for a database of n points in the worst case and a query with a "small" query region has to traverse [...] Neighborhoods are expected to be small compared to the size of the whole data space, the average run time complexity of a single region query is O(log n). For each of the n points of the database, we have at most one region query. Thus, the average run time complexity of DBSCAN is O(n * log n).

[...] (clusterId) of points which have been marked [...] may be changed later, if they are density-reachable [...] border points of a cluster. Those points are not added to the [...]-list because we already know that a point with a [...] is not a core point. Adding those points to [...] would only result in additional region queries which would yield no new answers.

If two clusters C [...] and C [...] are very close to each other, it [...] and C [...] would be equal to C [...] since we use global parameters [...] covered first. Except from these rare situations, the result of [...]

4.2 Determining the Parameters Eps and MinPts

In this section, we develop a simple but effective heuristic to [...] cluster in the database. This heuristic is based on the following observation. Let d be the distance of a point p to its k-th nearest neighbor, then the d-neighborhood of p contains exactly [...] of p contains more than k+1 points only if several points have exactly the same distance d from p which is quite unlikely. Furthermore, changing k for a point in a cluster does not result in large changes of d. This only happens if the k-th [...] point in a cluster.

For a given k we define a function k-dist from the database to the real numbers, mapping each point to the distance from its k-th nearest neighbor. When sorting the points of the database in descending order of their k-dist values, the graph of this function gives some hints concerning the density distribution in the database. We call this graph the sorted k-dist graph. [...] with an equal or smaller k-dist value will be core points. If [...] threshold point with the maximal k-dist value [...] we would have the desired parameter values.
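The k-dist function and its sorted graph, as described above, can be sketched in a few lines. This is an illustrative brute-force version under stated assumptions: Euclidean distance, the k-th nearest neighbor of a point excludes the point itself (consistent with the d-neighborhood containing k+1 points), and the function name `sorted_k_dist` is hypothetical. Plotting the returned list against its index gives the sorted k-dist graph.

```python
import math


def sorted_k_dist(points, k=4):
    """k-dist value of every point, sorted in descending order.

    Assumption: brute force over all pairs, which is quadratic in the
    number of points; an index-backed nearest-neighbor query would be
    used on a large database.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    k_dists = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists, reverse=True)
```

Once a threshold point has been picked on this graph, its k-dist value serves as Eps and k serves as MinPts; points to the left of the threshold (larger k-dist) would be treated as noise.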
The threshold point is the first point in the first "valley" of the sorted k-dist graph (see figure 4). All points with a higher k-dist value (left of the threshold) are [...] threshold) are assigned to some cluster.

In general, it is very difficult to detect the first "valley" automatically, but it is relatively simple for a user to see this valley in a graphical representation. Therefore, we propose to follow an interactive approach for determining the threshold [...]

DBSCAN needs two parameters, Eps and MinPts. However, our experiments indicate that the k-dist graphs for k > 4 do not significantly differ from the 4-dist graph and, furthermore, they need considerably more computation. Therefore, [...] databases (for 2-dimensional data). We propose the following interactive approach for determining the parameter Eps [...] percentage is entered and the system derives a proposal for [...] another point as the threshold point. The 4-dist value of the threshold point is used as the Eps value for DBSCAN.

In this section, we evaluate the performance of DBSCAN. We compare it with the performance of CLARANS because [...] algorithms. We have implemented DBSCAN in C++ based on an [...] experiments have been run on HP 735 / 100 workstations. We have used both synthetic sample databases and the database of the SEQUOIA 2000 benchmark.

To compare DBSCAN with CLARANS in terms of effectivity (accuracy), we use the three synthetic sample databases which are depicted in figure 1. Since DBSCAN and CLARANS are clustering algorithms of different types, they have no common quantitative measure of the classification accuracy. Therefore, we evaluate the accuracy of both algorithms [...] four ball-shaped clusters of significantly differing sizes. Sample database 2 contains four clusters of nonconvex shape. In sample database 3, there are four clusters of different shape and size with additional noise. To show the results [...] different color (see www availability after section 6). To give CLARANS some advantage, we set the parameter k to 4 for these sample databases.
The clusterings discovered by CLARANS are depicted in figure 5.

For DBSCAN, we set the noise percentage to 0% for sample [...] respectively. The clusterings discovered by DBSCAN are depicted in figure 6.

DBSCAN discovers all clusters (according to definition [...]) [...] from all sample databases. CLARANS, however, splits clusters if they are relatively large or if they are close to some other cluster. Furthermore, CLARANS has no explicit notion [...]

[figure: sorted 4-dist graph; labels: 4-dist, points, threshold point, noise, clusters]

[figure panels: database 1, database 2]

To test the efficiency of DBSCAN and CLARANS, we use the SEQUOIA 2000 benchmark data. The SEQUOIA 2000 benchmark database (Stonebraker et al. 1993) uses real data sets that are representative of Earth Science tasks. There [...] contains 62,584 Californian names of landmarks, extracted from the US Geological Survey's Geographic Names Information [...] CLARANS on the whole data set is very high, we have extracted a series of subsets of the SEQUOIA 2000 point data set containing from 2% to 20% representatives of the whole set. [...] these databases is shown in table 1.

The results of our experiments show that the run time of [...] points. The run time of CLARANS, however, is close to quadratic in the number of points. The results show that DBSCAN outperforms CLARANS by a factor of between 250 and 1900 which grows with increasing size of the database.

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the well-known algorithms suffer from severe drawbacks when applied to large spatial databases. In this paper, we presented the clustering [...] supports the user in determining an appropriate value for it. We performed a performance evaluation on synthetic data and on real data of the SEQUOIA 2000 benchmark. The results of these experiments demonstrate that DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS.
Furthermore, the experiments have shown that DBSCAN outperforms CLARANS by a factor of at least 100 in terms of efficiency.

Future research will have to consider the following issues. First, we have only considered point objects. Spatial databases, however, may also contain extended objects such as polygons. We have to develop a definition of the density in [...] dimensional feature spaces should be investigated. In particular, the shape of the k-dist graph in such applications has to be explored.

A version of this paper in larger font, with large figures and clusterings in color is available under the following URL: [...]

Beckmann N., Kriegel H.-P., Schneider R., and Seeger B. 1990. The R*-tree: An Efficient and Robust Access Method for Points and [...] Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ, 1990, pp. 322-331.

Brinkhoff T., Kriegel H.-P., Schneider R., and Seeger B. 1994. Efficient Multi-Step Processing of Spatial Joins, Proc. ACM [...] 1994, pp. 197-208.

Ester M., Kriegel H.-P., and Xu X. 1995. A Database Interface for Clustering in Large Spatial Databases, Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining, Montreal, Canada, 1995, [...]

García J.A., Fdez-Valdivia J., Cortijo F. J., and Molina R. 1994. A [...] Signal Processing, Vol. 44, [...]

[...] The VLDB Journal 3(4): 357-399.

Kaufman L., and Rousseeuw P.J. 1990. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons.

Matheus C.J.; Chan P.K.; and Piatetsky-Shapiro G. 1993. Systems for Knowledge Discovery in Databases, IEEE Transactions on Knowledge and Data Engineering 5(6): 903-913.

Ng R.T., and Han J. 1994. Efficient and Effective Clustering Methods for Spatial Data Mining, Proc. 20th Int. Conf. on Very Large Data Bases, 144-155. Santiago, Chile.

Stonebraker M., Frew J., Gardels K., and Meredith J. 1993. The SEQUOIA 2000 Storage Benchmark, Proc. ACM SIGMOD Int. Conf.
on Management of Data, Washington, DC, 1993, pp. 2-11.

Table 1: run time in seconds

number of points    1252    2503    3910    5213    6256
DBSCAN               3.1     6.7    11.3    16.0    17.8
CLARANS              758    3026    6845   11745   18029

number of points    7820    8937   10426   12512
DBSCAN              24.5    28.2    32.7    41.7
CLARANS            29826   39265   60540   80638

figure 6: Clusterings discovered by DBSCAN (database 1, database 2, database 3)
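As a quick check of the speed-up claim in the performance evaluation, the ratio of CLARANS to DBSCAN run time can be recomputed from Table 1. The sketch below assumes the digit runs in the extracted table split into the values shown, and that the rows without a label are the CLARANS run times; both are readings of the garbled extraction rather than something the text states explicitly. The variable names are illustrative.

```python
# Values as read from Table 1 (run time in seconds); the split of the
# fused digit runs and the CLARANS row label are assumptions.
n_points = [1252, 2503, 3910, 5213, 6256, 7820, 8937, 10426, 12512]
dbscan_s = [3.1, 6.7, 11.3, 16.0, 17.8, 24.5, 28.2, 32.7, 41.7]
clarans_s = [758, 3026, 6845, 11745, 18029, 29826, 39265, 60540, 80638]

# Speed-up factor of DBSCAN over CLARANS for each database size.
speedup = [c / d for c, d in zip(clarans_s, dbscan_s)]

for n, s in zip(n_points, speedup):
    print(f"{n:>6} points: CLARANS / DBSCAN = {s:7.1f}")
```

Under this reading the factor rises steadily with database size, from a little under 250 for the smallest subset to a little over 1900 for the largest, which matches the range quoted in the text.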
