Presentation Transcript

1. Clustering

2. Question: what if we don't have (or don't know) labels for the data?
- Function approximation does not work: F(x): x -> y, where x is a feature vector and y is a label, but we don't know y yet.
- Patterns may still exist (depending on the relationships between records).

3. What is clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects.
  - Documents within a cluster should be similar.
  - Documents from different clusters should be dissimilar.
- The most common form of unsupervised learning: no labeled training data (we have only feature vectors).
- Clustering patterns imply something for specific applications.
- Can be combined with supervised learning (semi-supervised learning) to address the scarcity of labeled data.

4. A data set with clear cluster structure
- A 2-dimensional data set: 2 features, say X1 and X2.
- Consider the following questions:
  - What are the points?
  - Why can you intuitively identify these "clusters"?

5. (figure only)

6. Classification vs. Clustering
- Classification: classes are often human-defined and part of the input to the learning algorithm.
- Clustering: clusters are inferred from the data (mostly without human input).
- However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, domain knowledge, etc.

7. Example: text data (document) clustering
- After preprocessing, each document is represented as a document vector.
- We can analyze the similarity of document vectors and group similar documents.

8. Search result clustering for better navigation (acquired by IBM)

9. Many applications of clustering
- Image processing
  - Cluster images based on their visual content.
  - Segment similar pixels.
- Web
  - Cluster groups of users based on their access patterns on webpages.
  - Cluster webpages based on their content.
- Bioinformatics
  - Cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.).
  - Sequence clustering.

10. Issues for clustering
- Record/entity representation for clustering
  - Document representation: feature vectors.
- Need to define a notion of similarity/distance (the most critical issue).
  - Similarity/distance is tied to the representation (we will see later).
- How many clusters?
  - Fixed a priori?
  - Completely data driven?

11. Types of similarity/distance
- Often application-specific.
- We will discuss a few common methods: if you don't find an application-specific similarity, start with the common ones that make the most sense for your application.

12. Similarity for documents
- Ideal: semantic similarity (difficult to achieve).
- Practical: statistical similarity.
  - Treat docs as vectors.
  - For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
  - We can use cosine similarity (alternatively, Euclidean distance on length-normalized vectors).
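As an illustration of the last bullet, here is a minimal sketch of cosine similarity on document vectors, and of its relationship to Euclidean distance on length-normalized vectors (the function name and the toy term-count vectors are just for illustration):

    import numpy as np

    def cosine_similarity(x, y):
        """Cosine of the angle between two document vectors."""
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    # Toy term-count vectors for two documents (illustrative only).
    d1 = np.array([3.0, 0.0, 2.0, 1.0])
    d2 = np.array([1.0, 1.0, 2.0, 0.0])

    sim = cosine_similarity(d1, d2)

    # On length-normalized vectors, Euclidean distance is a monotone function of
    # cosine similarity: ||u - v||^2 = 2 * (1 - cos(u, v)).
    u, v = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    print(sim, 1 - np.linalg.norm(u - v) ** 2 / 2)   # the two values agree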

13. Similarity for binary vectors (or categorical data -> one-hot representation)
- Jaccard similarity between binary vectors X and Y: JSim(X, Y) = |X ∩ Y| / |X ∪ Y|, i.e., the number of positions where both are 1, divided by the number of positions where at least one is 1.
- Jaccard distance between binary vectors X and Y: Jdist(X, Y) = 1 - JSim(X, Y).
- Example:

       Q1  Q2  Q3  Q4  Q5  Q6
    X   1   0   0   1   1   1
    Y   0   1   1   0   1   0

  JSim = 1/6, Jdist = 5/6
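A small sketch of the same computation in Python (the function name is chosen here for illustration):

    import numpy as np

    def jaccard_similarity(x, y):
        """Jaccard similarity of two binary vectors: |intersection| / |union|."""
        x, y = np.asarray(x, bool), np.asarray(y, bool)
        union = np.logical_or(x, y).sum()
        return np.logical_and(x, y).sum() / union if union else 1.0

    X = [1, 0, 0, 1, 1, 1]
    Y = [0, 1, 1, 0, 1, 0]
    print(jaccard_similarity(X, Y))       # 1/6, about 0.167
    print(1 - jaccard_similarity(X, Y))   # Jaccard distance = 5/6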

14. Similarity/distance functions for vectors
- Lp norms, or Minkowski distance, between vectors x and y:
  L_p(x, y) = ( sum_i |x_i - y_i|^p )^(1/p)
  where p is a positive integer.
- If p = 1, L1 is the Manhattan (or city-block) distance:
  L_1(x, y) = sum_i |x_i - y_i|

15. Distance functions for real-valued vectors
- If p = 2, L2 is the Euclidean distance:
  L_2(x, y) = sqrt( sum_i (x_i - y_i)^2 )
- One can also use a weighted distance:
  L_2^w(x, y) = sqrt( sum_i w_i (x_i - y_i)^2 )
- Very often L_p^p is used instead of L_p (why?)
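A quick sketch of these distance functions (the function and parameter names are illustrative):

    import numpy as np

    def minkowski_distance(x, y, p=2):
        """L_p (Minkowski) distance; p=1 gives Manhattan, p=2 gives Euclidean."""
        return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

    def weighted_euclidean(x, y, w):
        """Euclidean distance with per-feature weights w_i."""
        d = np.asarray(x) - np.asarray(y)
        return np.sqrt(np.sum(np.asarray(w) * d ** 2))

    x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
    print(minkowski_distance(x, y, p=1))   # Manhattan: 5.0
    print(minkowski_distance(x, y, p=2))   # Euclidean: about 3.61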

16. Critical for vector-based distance/similarity
- Any feature you include will affect the distance!
- The best practice (in my opinion):
  - Manually identify features that are meaningful for the problem domain.
  - Identify the subspace that can give some clustering structure (subspace clustering).
- Normalization is necessary!
  - For the set of selected features, normalize their ranges so that each feature is weighted equally in the distance/similarity computation (a small sketch follows this slide).
  - Overweight some features based on your domain knowledge (if necessary).
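A minimal sketch of range (min-max) normalization, assuming a feature matrix X with one row per record; z-score standardization is a common alternative, and the names here are illustrative:

    import numpy as np

    def min_max_normalize(X):
        """Rescale each feature (column) of X to the range [0, 1]."""
        X = np.asarray(X, dtype=float)
        lo, hi = X.min(axis=0), X.max(axis=0)
        span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero for constant features
        return (X - lo) / span

    X = np.array([[170.0, 60000.0],    # e.g., height in cm, income in dollars
                  [160.0, 45000.0],
                  [180.0, 90000.0]])
    print(min_max_normalize(X))        # both features now lie in [0, 1]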

17. Similarity/distance for string (sequence) data
- Hamming distance (for equal-length strings): count the number of positions with different values.
- Edit distance: the minimum number of insert/delete/replace operations needed to convert one string into the other (a small dynamic-programming sketch follows).
- Alignment-based similarity (may not be a proper distance).
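A compact dynamic-programming sketch of edit (Levenshtein) distance; the function name is illustrative:

    def edit_distance(s, t):
        """Minimum number of insertions, deletions, and substitutions to turn s into t."""
        m, n = len(s), len(t)
        prev = list(range(n + 1))                 # distances from the empty prefix of s
        for i in range(1, m + 1):
            cur = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                cur[j] = min(prev[j] + 1,         # delete s[i-1]
                             cur[j - 1] + 1,      # insert t[j-1]
                             prev[j - 1] + cost)  # match or replace
            prev = cur
        return prev[n]

    print(edit_distance("kitten", "sitting"))     # 3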

18. Conversion between distances and similarity measures
- Inverse relationship: high similarity corresponds to low distance.
- Common methods (d is some distance):
  - s = 1 / (1 + d)
  - s = exp(-d)
  - Gaussian kernel: s = exp(-d^2 / sigma^2)
- For documents: using cosine similarity corresponds to using Euclidean distance on length-normalized vectors.

19. Clustering algorithms
- Partitional (flat) algorithms
  - Usually start with a random (partial) partition and refine it iteratively.
  - K-means clustering
  - (Model-based clustering)
- Hierarchical (tree) algorithms
  - Bottom-up, agglomerative
  - (Top-down, divisive)

20. Partitioning algorithms (representative method: K-means)
- Partitioning method: construct a partition of n documents into a set of K clusters.
- Given: a set of documents and the number K.
- Find: a partition into K clusters that optimizes the chosen partitioning criterion.
  - Globally optimal: exhaustively enumerate all partitions.
  - Effective heuristic methods: K-means and K-medoids algorithms.

21. K-means
- Assumes documents are real-valued vectors.
- Clusters are based on centroids (the center of gravity, or mean, of the points in a cluster c).
- Instances are reassigned to clusters based on their distance to the current cluster centroids.
  - (Or one can equivalently phrase it in terms of similarities.)

22. K-means algorithm
Select K random docs {s1, s2, ..., sK} as seeds.
Until clustering converges (or another stopping criterion is met):
    For each doc di:
        Assign di to the cluster cj such that dist(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = centroid(cj)
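A minimal NumPy sketch of this loop (for real projects one would typically use a library implementation such as scikit-learn's KMeans; all names here are illustrative):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        """Plain K-means: returns (centroids, labels) for data X of shape (n, m)."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # K random docs as seeds
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iters):
            # Assignment step: each point goes to its closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: recompute each centroid as the mean of its assigned points.
            new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):   # converged: centroids stop moving
                break
            centroids = new_centroids
        return centroids, labels

    X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5], [1.2, 0.8], [8.5, 9.0]])
    centroids, labels = kmeans(X, k=2)
    print(labels)   # two well-separated groups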

23. Worked Example: Set of points to be clustered

24. Worked Example: Random selection of initial centroids
Exercise: (i) guess what the optimal clustering into two clusters is in this case; (ii) compute the centroids of the clusters.

25. Worked Example: Assign points to closest centroid

26. Worked Example: Assignment

27. Worked Example: Recompute cluster centroids

28. Worked Example: Assign points to closest centroid

29. Worked Example: Assignment

30. Worked Example: Recompute cluster centroids

31. Worked Example: Assign points to closest centroid

32. Worked Example: Assignment

33. Worked Example: Recompute cluster centroids

34. Worked Example: Assign points to closest centroid

35. Worked Example: Assignment

36. Worked Example: Recompute cluster centroids

37. Worked Example: Assign points to closest centroid

38. Worked Example: Assignment

39. Worked Example: Recompute cluster centroids

40. Worked Example: Assign points to closest centroid

41. Worked Example: Assignment

42. Worked Example: Recompute cluster centroids

43. Worked Example: Assign points to closest centroid

44. Worked Example: Assignment

45. Worked Example: Recompute cluster centroids

46. Worked Example: Centroids and assignments after convergence

47. Two different K-means clusterings
(figure: the original points, a sub-optimal clustering, and the optimal clustering of the same data)

48. Termination conditions
Several possibilities, e.g.:
- A fixed number of iterations.
- The partition of docs is unchanged.
- Centroid positions change by less than a threshold.

49. Convergence
- Why should the K-means algorithm ever reach a fixed point, i.e., a state in which the clusters don't change?
- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm, and EM is known to converge.
- The number of iterations could still be large.

50. Time complexity
- Computing the distance between two docs is O(m), where m is the dimensionality of the vectors.
- Reassigning clusters: O(Kn) distance computations, i.e., O(Knm).
- Computing centroids: each doc gets added once to some centroid, O(nm).
- Assume these two steps are each done once per iteration, for I iterations: O(IKnm).

51. Optimality of K-means
- The underlying optimization problem is NP-hard.
- Convergence does not mean that we converge to the optimal clustering! This is the great weakness of K-means.
- If we start with a bad set of seeds, the resulting clustering can be horrible.

52. Seed choice
- Results can vary based on the random seed selection; some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
- Remedies:
  - Select good seeds using a heuristic (e.g., a doc least similar to any existing mean).
  - Try out multiple starting points.
  - Initialize with the results of another method.
  - The K-means++ algorithm (a sketch follows below).
- Example showing sensitivity to seeds: in the figure, if you start with B and E as centroids you converge to {A, B, C} and {D, E, F}; if you start with D and F you converge to {A, B, D, E} and {C, F}.
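The slide mentions K-means++ by name only; a hedged sketch of its seeding idea (each new seed is drawn with probability proportional to the squared distance to the nearest seed chosen so far; the function name is illustrative):

    import numpy as np

    def kmeans_pp_seeds(X, k, seed=0):
        """Pick k initial centroids using the K-means++ weighting scheme."""
        rng = np.random.default_rng(seed)
        seeds = [X[rng.integers(len(X))]]              # first seed: uniformly at random
        for _ in range(k - 1):
            # Squared distance from every point to its nearest already-chosen seed.
            d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
            probs = d2 / d2.sum()                      # far-away points are more likely
            seeds.append(X[rng.choice(len(X), p=probs)])
        return np.array(seeds)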

53. K-means issues, variations, etc.
- Recomputing the centroid after every assignment (rather than after all points are reassigned) can improve the speed of convergence of K-means.
- Assumes clusters are spherical in vector space.
- Sensitive to coordinate changes, weighting, etc.
- Assumes cluster sizes are approximately the same.

54. How many clusters?
- K-means assumes the number of clusters K is given.
- Use some measure to determine the best K:
  - Silhouette coefficient
  - Gap statistic
  - Elbow method
- Method: run K-means with different K in a range, e.g., [2, 20], compute the corresponding measure for each K, and pick the best one (see the sketch below).
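A hedged sketch of that sweep using scikit-learn, on toy data generated with make_blobs; in practice one would also inspect the elbow in the inertia (SSE) curve rather than rely on a single number:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data, 4 true clusters

    scores = {}
    for k in range(2, 11):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        scores[k] = (km.inertia_,                      # SSE, for the elbow method
                     silhouette_score(X, km.labels_))  # silhouette coefficient

    best_k = max(scores, key=lambda k: scores[k][1])   # K with the highest silhouette
    print(best_k)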

55. Can we apply K-means to non-Euclidean distance data?
- In general, you cannot.
- Special case: cosine similarity for document data.

56. Medoid as cluster representative
- Useful in cases where the centroid is difficult to define.
- Medoid: a cluster representative that is one of the actual records, closest to the centroid.
- How to find medoids? One approximation method: pick the record with the minimum sum of distances to all other records in the cluster.

57. "The Curse of Dimensionality"
- Why high-dimensional clustering is difficult: while clustering looks intuitive in 2 dimensions, many of our applications involve 10,000 or more dimensions (e.g., document clustering).
- High-dimensional spaces look different:
  - The probability of random points being close drops quickly as the dimensionality grows.
  - Furthermore, random pairs of vectors are almost all nearly perpendicular.
- Remedies: dimensionality reduction, feature selection, subspace clustering.

58. Summary of K-means
- Advantages:
  - Fast
  - Simple
  - Always converges (though not necessarily to the best result)
- Disadvantages:
  - Does not guarantee the best result (the number of possible clusterings grows exponentially, on the order of 2^n for n data points)
  - Seed selection can be tricky (use K-means++)
  - K selection can be tricky (use a best-K method such as the elbow or silhouette method)

59. Hierarchical Clustering

60. Hierarchical Clustering
- Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
- (figure: example taxonomy with root "animal", children "vertebrate" (fish, reptile, amphib., mammal) and "invertebrate" (worm, insect, crustacean))

61. Another example: in biological studies

62. Dendrogram: Hierarchical Clustering
- A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

63. A dendrogram for document clustering
- The history of mergers can be read off from bottom to top.
- The horizontal line of each merger tells us what the similarity of the merger was.
- We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering.

64. Hierarchical clustering algorithms
- Agglomerative (bottom-up):
  - Precondition: start with each record as a separate cluster.
  - Postcondition: eventually all records belong to the same cluster.
- Divisive (top-down):
  - Precondition: start with all records belonging to the same cluster.
  - Postcondition: eventually each record forms a cluster of its own.
- Does not require the number of clusters K in advance.
- Needs a termination/readout condition.

65. Hierarchical agglomerative clustering (HAC)
- HAC creates a hierarchy in the form of a binary tree.
- Assumes a similarity measure for determining the similarity of two clusters.
- Up to now, our similarity measures were for individual records; we will look at four different cluster similarity measures.

66. Hierarchical Agglomerative Clustering (HAC) algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
    Among the current clusters, determine the two clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj.
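A hedged sketch of HAC in practice using SciPy (assuming a data matrix X; the linkage method names correspond to the cluster-similarity options discussed in the following slides):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    X = np.random.default_rng(0).normal(size=(20, 2))      # toy 2-D data

    # Build the merge history; method can be 'single', 'complete', 'average', or 'centroid'.
    Z = linkage(X, method='average', metric='euclidean')

    labels = fcluster(Z, t=3, criterion='maxclust')         # cut the tree into 3 flat clusters
    print(labels)
    # scipy.cluster.hierarchy.dendrogram(Z) would plot the merge tree (requires matplotlib).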

67. Dendrogram: document example
- As clusters agglomerate, docs are likely to fall into a hierarchy of "topics" or concepts.
- (figure: dendrogram over d1..d5, first merging d1 with d2 and d4 with d5, then d3 with {d4, d5})

68. Key notion: cluster representative
- We want a notion of a representative point in a cluster, to represent the location of each cluster.
- The representative should be some sort of "typical" or central point in the cluster, e.g.:
  - the point inducing the smallest radius to the docs in the cluster,
  - the point with the smallest sum of squared distances,
  - the point that is the "average" of all docs in the cluster: the centroid, or center of gravity.
- Measure inter-cluster distances by the distances between centroids.

69. Example: n = 6, k = 3, closest pair of centroids
(figure: points d1..d6, showing the centroid after the first step and the centroid after the second step)

70. Pairwise cluster distance/similarity
Many variants of defining the closest pair of clusters; cluster similarity can be defined via:
- Single-link: similarity of the most similar (single-link) pair of records between the clusters.
- Complete-link: similarity of the "furthest" points, i.e., the least similar pair.
- Centroid: similarity between the cluster centroids.
- Average-link: average similarity between all pairs of elements (of the merged clusters).

71. Side note: methods for converting distance to similarity
- sim = 1 - d, if d is in [0, 1] (e.g., Jaccard similarity/distance).
- sim = 1 / (1 + d), for d in [0, infinity).
- sim = exp(-d).
- Radial basis function: sim = exp(-d^2 / (2 sigma^2)), commonly used for converting Euclidean distances.

72. Cluster similarity: Example

73. Single-link: Maximum similarity

74. Complete-link: Minimum similarity

75. Centroid: Average inter-similarity
(inter-similarity = similarity of two documents in different clusters)

76. Group average: Average intra-similarity
(intra-similarity = similarity of any pair, including cases where the two documents are in the same cluster)

77. Cluster similarity: Larger example

78. Single-link: Maximum similarity

79. Complete-link: Minimum similarity

80. Centroid: Average inter-similarity

81. Group average: Average intra-similarity

82. Incremental updates of cluster similarity for single-link clustering
- Use the maximum similarity of pairs.
- After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:
  sim(ci ∪ cj, ck) = max( sim(ci, ck), sim(cj, ck) )

83. Single-link potential problem: chaining
- Single-link clustering often produces long, straggly clusters. For some applications, these are undesirable.

84. Complete-link agglomerative clustering
- Use the minimum similarity of pairs; this makes "tighter," more spherical clusters that are typically preferable.
- After merging ci and cj, the similarity of the resulting cluster to another cluster ck is:
  sim(ci ∪ cj, ck) = min( sim(ci, ck), sim(cj, ck) )

85. Complete-link potential problem: sensitivity to outliers
- The complete-link clustering of this example splits d2 from its right neighbors, which is clearly undesirable.
- The reason is the outlier d1: a single outlier can negatively affect the outcome of complete-link clustering.
- Single-link clustering does better in this case.

86. Group-average agglomerative clustering
- Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters.
- A compromise between single-link and complete-link.
- Two options:
  - average across all ordered pairs in the merged cluster;
  - average over all pairs between the two original clusters.
- No clear difference in efficacy.

87. Computing group-average similarity: a fast method for cosine similarity
- Assume cosine similarity and vectors normalized to unit length.
- Always maintain the sum of vectors in each cluster.
- Then the similarity of a merged cluster can be computed in constant time; with s(ω) the sum of the vectors in cluster ω and N the cluster size:
  sim(ωi ∪ ωj) = [ (s(ωi) + s(ωj)) · (s(ωi) + s(ωj)) - (Ni + Nj) ] / [ (Ni + Nj)(Ni + Nj - 1) ]
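A small sketch of that bookkeeping (assuming unit-length vectors; the names are illustrative). It checks the constant-time formula against a brute-force average over all pairs of distinct documents in the merged cluster:

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(4, 5)); A /= np.linalg.norm(A, axis=1, keepdims=True)  # cluster i
    B = rng.normal(size=(3, 5)); B /= np.linalg.norm(B, axis=1, keepdims=True)  # cluster j

    # Constant-time version: only the per-cluster vector sums and sizes are needed.
    s, n = A.sum(axis=0) + B.sum(axis=0), len(A) + len(B)
    fast = (s @ s - n) / (n * (n - 1))

    # Brute force: average cosine similarity over all pairs of distinct docs in the merged cluster.
    docs = np.vstack([A, B])
    pairs = itertools.combinations(range(n), 2)
    slow = np.mean([docs[a] @ docs[b] for a, b in pairs])
    print(np.isclose(fast, slow))   # True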

88. Computational complexity
- In the first iteration, all HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n^2): slow and memory-inefficient for large n.
- In each of the subsequent n - 2 merging iterations, we must compute the distance between the most recently created cluster and all other existing clusters.
- To maintain good overall performance, computing the similarity to each other cluster must be done in constant time.
- Overall: often O(n^3) if done naively, or O(n^2 log n) if done cleverly.

89.

    method          combination similarity                time complexity   optimal?   comment
    single-link     max inter-similarity of any 2 docs    Θ(N^2)            yes        chaining effect
    complete-link   min inter-similarity of any 2 docs    Θ(N^2 log N)      no         sensitive to outliers
    group-average   average of all similarities           Θ(N^2 log N)      no         best choice for most applications
    centroid        average inter-similarity              Θ(N^2 log N)      no         inversions can occur

90. Clustering quality evaluation
- Internal criteria: a good clustering will produce high-quality clusters in which
  - the intra-class (that is, intra-cluster) similarity is high, and
  - the inter-class similarity is low.
  - The measured quality of a clustering depends on both the document representation and the similarity measure used.
- External criteria: assume there are ground-truth cluster labels.
  - Often use class labels (be careful about their validity).

91. Internal criteria: examples
- A well-known measure is the sum of squared errors (SSE), which is intra-cluster only:
  SSE of a cluster = sum of (distance from xi to the centroid)^2, over all xi in the cluster.
- Intuition: as the number of clusters increases, the SSE should decrease.
- Used in the elbow method to determine the best K.
- Alternatives:
  - the ratio between the inter-cluster SSE and the overall SSE;
  - the ratio between the inter-cluster SSE and the intra-cluster SSE.

92. Internal criteria: examples (continued)
- The silhouette method for finding the best K:
  - a: the mean intra-cluster distance for a sample (its mean distance to the other members of its cluster C).
  - b: the mean nearest-cluster distance for the sample (its mean distance to the members of the nearest other cluster).
  - The silhouette coefficient (SC) of a sample is (b - a) / max(a, b).
  - The overall silhouette coefficient aggregates the per-sample SC values over the sample set.

93. External criteria for clustering quality
- Quality is measured by the ability to discover some or all of the hidden patterns or latent classes in gold-standard data.
- Assessing a clustering with respect to ground truth requires labeled data.
- Assume the documents have C gold-standard classes, while our clustering algorithm produces K clusters ω1, ω2, ..., ωK with ni members each.
- Then we can use all the measures used in classification: confusion matrix, precision, recall, accuracy, error rate, etc.

94. External evaluation of cluster quality
- A simple measure: purity, the ratio between the size of the dominant class in cluster ωi, denoted πi, and the size of cluster ωi.

95. Purity example
Three clusters over documents from three classes (counts per class shown below):
    Cluster I:   5, 1, 0  ->  Purity = (1/6) * max(5, 1, 0) = 5/6
    Cluster II:  1, 4, 1  ->  Purity = (1/6) * max(1, 4, 1) = 4/6
    Cluster III: 2, 0, 3  ->  Purity = (1/5) * max(2, 0, 3) = 3/5
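A small sketch of computing purity from cluster and class labels (the function name and the class labels 'x' and 'o' are illustrative):

    from collections import Counter

    def purity_per_cluster(cluster_labels, class_labels):
        """For each cluster, the fraction of its members belonging to its dominant class."""
        clusters = {}
        for c, y in zip(cluster_labels, class_labels):
            clusters.setdefault(c, []).append(y)
        return {c: Counter(ys).most_common(1)[0][1] / len(ys) for c, ys in clusters.items()}

    # Cluster I from the slide: 5 docs of the dominant class, 1 of another class.
    clusters = ['I'] * 6
    classes  = ['x'] * 5 + ['o']
    print(purity_per_cluster(clusters, classes))   # {'I': 0.833...}, i.e., 5/6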

96. Normalized mutual information (NMI)
- Measures the mutual information between the cluster assignment and the gold-standard classes, normalized by their entropies.
- Because it is normalized, it is good for comparing the quality of clusterings with different numbers of clusters, or of different clustering results.
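A hedged usage sketch with scikit-learn's implementation (toy labels only):

    from sklearn.metrics import normalized_mutual_info_score

    gold     = [0, 0, 0, 1, 1, 1, 2, 2]   # ground-truth classes
    clusters = [1, 1, 1, 0, 0, 2, 2, 2]   # cluster IDs from some clustering

    # NMI lies in [0, 1]; label permutations do not matter, only the grouping does.
    print(normalized_mutual_info_score(gold, clusters))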

97. Summary
- In clustering, clusters are inferred from the data without human input (unsupervised learning).
- However, in practice it is a bit less clear-cut: there are many ways of influencing the outcome of clustering, including
  - the representation of objects (features),
  - the number of clusters,
  - the similarity measure,
  - the algorithm, and
  - incorporating domain knowledge.