
1. Cluster Analysis
CSCI N317 Computation for Scientific Applications
Unit 3-4: Weka

2. What is Cluster Analysis?
The process of grouping a set of physical or abstract objects into classes of similar objects.
A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
Clustering is also called data segmentation in some applications, because it partitions large data sets into groups according to their similarity.
Clustering can be used for outlier detection, where outliers may be more interesting than common cases.
Clustering is a form of unsupervised learning: learning by observation.

3. Examples of Clustering Applications
Marketing: help marketers discover distinct groups in their customer base, and then use this knowledge to develop targeted marketing programs.
Land use: identification of areas of similar land use in an earth observation database.
Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
City planning: identifying groups of houses according to their house type, value, and geographical location.

4. Types of Data in Cluster Analysis
Clustering algorithms typically operate on one of the following two data structures:
Data matrix: represents n objects, such as persons, by p variables (measurements or attributes) such as age, height, and weight. The structure is an n-by-p matrix.

5. Types of Data in Cluster Analysis
Dissimilarity matrix: stores a collection of proximities that are available for all pairs of the n objects. The structure is an n-by-n matrix, where d(i,j) is the measured difference or dissimilarity between objects i and j. In general, d(i,j) is a nonnegative number that is close to 0 when objects i and j are highly similar or "near" each other, and becomes larger the more they differ.
d(i,j) = d(j,i)
d(i,i) = 0
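As a small, self-contained illustration of how the two structures relate (not part of the original slides; the data values and names are made up), base R's dist() function turns an n-by-p data matrix into an n-by-n dissimilarity matrix with exactly the properties listed above:

# Hypothetical 4-by-2 data matrix: 4 persons measured on age and height (cm)
x <- matrix(c(25, 170,
              27, 172,
              60, 160,
              61, 158),
            ncol = 2, byrow = TRUE,
            dimnames = list(paste0("person", 1:4), c("age", "height")))

# Euclidean dissimilarity matrix: d(i,j) >= 0, d(i,i) = 0, d(i,j) = d(j,i)
d <- dist(x, method = "euclidean")
round(as.matrix(d), 2)   # full symmetric n-by-n form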

6. Types of Data in Cluster Analysis
Many clustering algorithms operate on a dissimilarity matrix. If the data are presented as a data matrix, they can first be transformed into a dissimilarity matrix before the clustering algorithm is applied.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio-scaled, and vector variables.
Different algorithms have been developed for computing distances between different types of variables and for objects with mixed-type variables.

7. Cautions for Interval-Scaled Variables
Interval-scaled variables are continuous measurements on a roughly linear scale, e.g. weight, height, latitude and longitude coordinates, and temperature.
The measurement units used can affect the clustering analysis: for example, changing height measurements from meters to inches may lead to a very different clustering structure. Therefore the data need to be standardized.

8. Cautions for Interval-Scaled Variables
Standardize the data: convert the original measurements of a variable f to unitless values.
Calculate the mean absolute deviation s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|), where x_1f, ..., x_nf are the n measurements of f and m_f is the mean of f.
Calculate the standardized measurement, or z-score: z_if = (x_if - m_f) / s_f.
Whether and how to perform standardization is the choice of the user.
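A minimal R sketch of this standardization (not from the course materials; the measurement values are made up). Base R's scale() divides by the standard deviation, so the mean absolute deviation version is written out explicitly here:

heights_m <- c(1.60, 1.75, 1.82, 1.68)        # n measurements of one variable f (meters)
m_f <- mean(heights_m)                        # mean of f
s_f <- mean(abs(heights_m - m_f))             # mean absolute deviation of f
z   <- (heights_m - m_f) / s_f                # standardized measurements (z-scores)

# The z-scores are unitless: converting the raw data to inches first
# gives the same standardized values.
heights_in <- heights_m * 39.3701
z_in <- (heights_in - mean(heights_in)) / mean(abs(heights_in - mean(heights_in)))
all.equal(z, z_in)                            # TRUE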

9. Variables of Mixed Types
Example: calculate the dissimilarity between objects using all of the tests (variables) together. After distance computation, the resulting dissimilarity matrix (not reproduced here) shows that objects 1 and 4 are the most similar.
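The slides do not show how such a mixed-type dissimilarity is computed in practice. One common choice in R (an assumption here, not something the course materials prescribe) is Gower dissimilarity via daisy() from the cluster package, which handles numeric, ordinal, and categorical variables together; the data below are invented for illustration:

library(cluster)   # provides daisy()

# Hypothetical objects described by one numeric, one ordinal, and one categorical "test"
tests <- data.frame(
  test1 = c(45, 22, 64, 28),                                 # interval-scaled
  test2 = factor(c("fair", "good", "excellent", "excellent"),
                 levels = c("fair", "good", "excellent"),
                 ordered = TRUE),                            # ordinal
  test3 = factor(c("code-A", "code-B", "code-C", "code-A"))  # categorical
)

# Gower dissimilarity combines the mixed variable types into one d(i,j)
d_mixed <- daisy(tests, metric = "gower")
round(as.matrix(d_mixed), 2)   # inspect which pair of objects is most similar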

10. Clustering Methods
Partitioning methods
Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. Each group must contain at least one object, and each object must belong to exactly one group.
A partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.

11. Clustering Methods
Hierarchical methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects.
The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object forms its own cluster or until a termination condition holds.

12. Clustering Methods
The choice of clustering algorithm depends both on the type of data available and on the particular purpose of the application. If cluster analysis is used as a descriptive or exploratory tool, it is possible to try several algorithms on the same data to see what the data may disclose.

13. Partitioning Methods
Given D, a data set of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster.
K-means method
Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.

14. Partitioning Methods
Example (figure not reproduced; the steps are described on the next slide)

15. Partitioning Methods
Example: let k = 3.
Arbitrarily choose three objects as the three initial cluster centers (centers are marked by a "+" in the figure).
Each object is assigned to the cluster whose center it is nearest to.
The cluster centers are updated by calculating the new mean of the current objects in each cluster.
Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is nearest.
Eventually no redistribution of the objects occurs, and the process terminates.
A minimal run of this procedure in R is sketched below.
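The sketch assumes the iris.csv mentioned on the Videos slide contains the usual four numeric measurement columns; the course's clusterAnalysis.R script may differ.

# k = 3 clusters on the four numeric iris measurements; the built-in iris
# data set stands in for the course's iris.csv
set.seed(42)                                   # the initial centers are chosen at random
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

km$centers                                     # final cluster centers (means)
km$size                                        # number of objects in each cluster
table(km$cluster, iris$Species)                # compare clusters to the known species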

16. Partitioning Methods
The k-means method can only be applied when the mean of a cluster is defined, so it cannot be used on data with categorical attributes.
The k-modes method extends the k-means paradigm to cluster categorical data by replacing the means of clusters with modes, using new dissimilarity measures to deal with categorical objects and a frequency-based method to update the modes of clusters.
The k-means and k-modes methods can be integrated to cluster data with mixed numeric and categorical values.
The k-means method is sensitive to noise and outlier data points, because a small number of such points can substantially influence the mean value.
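The slides do not name an implementation for k-modes; one option in R (an assumption here, not a course requirement) is kmodes() from the klaR package, shown on invented categorical data:

library(klaR)   # provides kmodes() for categorical data

# Hypothetical categorical attributes with no meaningful mean
customers <- data.frame(
  plan    = c("basic", "basic", "premium", "premium", "basic", "premium"),
  region  = c("north", "north", "south",   "south",   "south", "north"),
  channel = c("web",   "store", "web",     "web",     "store", "store")
)

set.seed(1)
km_cat <- kmodes(customers, modes = 2)   # cluster into 2 groups using modes
km_cat$modes                             # the modal (most frequent) value per cluster
km_cat$cluster                           # cluster assignment of each object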

17. Hierarchical Methods
Hierarchical methods group data objects into a tree of clusters. There are two types:
Agglomerative: a bottom-up strategy that places each object in its own cluster and merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.
Divisive: a top-down strategy that starts with all objects in one cluster and subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained.

18. Hierarchical Methods (illustrative figure not reproduced)
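A minimal agglomerative (bottom-up) run in base R, on made-up points, builds the tree from a dissimilarity matrix and cuts it at a chosen number of clusters:

# Agglomerative hierarchical clustering on a small, made-up data matrix
set.seed(7)
x <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
           matrix(rnorm(10, mean = 4), ncol = 2))   # two loose groups of 5 points each

d  <- dist(x)                        # n-by-n dissimilarity matrix (Euclidean)
hc <- hclust(d, method = "average")  # repeatedly merge the closest groups

plot(hc)                             # dendrogram: the tree of clusters
cutree(hc, k = 2)                    # termination condition: stop at 2 clusters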

19. Outlier Discovery
What are outliers? Objects that are considerably dissimilar from the remainder of the data.
Example: in sports, Michael Jordan, Wayne Gretzky, ...
Problem: define and find outliers in large data sets.
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis
One simple way to flag such objects, sketched below, is to measure how far each object lies from its cluster center.
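This heuristic is an illustration only, not the course's prescribed approach; it reuses the built-in iris measurements and an arbitrary cutoff:

# Flag objects whose distance to their own cluster center is unusually large
x <- as.matrix(iris[, 1:4])
set.seed(3)
km <- kmeans(x, centers = 3, nstart = 10)

# Distance of each object to the center of the cluster it was assigned to
center_of <- km$centers[km$cluster, ]
dist_to_center <- sqrt(rowSums((x - center_of)^2))

# Treat the top 1% of distances as candidate outliers (the threshold is arbitrary)
outliers <- which(dist_to_center > quantile(dist_to_center, 0.99))
x[outliers, ]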

20. Videos
Weka
https://www.youtube.com/watch?v=HCA0Z9kL7Hg
https://www.youtube.com/watch?v=9aODdNSAauIR
https://www.youtube.com/watch?v=sAtnX3UJyN0
Data site used in video: http://archive.ics.uci.edu/ml/
Data file on Canvas: iris.csv
Sample code on Canvas: clusterAnalysis.R