Data Mining: Data Lecture Notes for Chapter 2 - PowerPoint Presentation

Uploaded On 2023-07-28



Presentation Transcript

1. Data Mining: Data. Lecture Notes for Chapter 2, Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Kumar

2. Outline: Attributes and Objects; Types of Data; Data Quality; Similarity and Distance; Data Preprocessing

3. What is Data? A collection of data objects and their attributes. An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc. An attribute is also known as a variable, field, characteristic, dimension, or feature. A collection of attributes describes an object. An object is also known as a record, point, case, sample, entity, or instance. Figure labels: Attributes, Objects.

4. Attribute Values: Attribute values are numbers or symbols assigned to an attribute for a particular object. Distinction between attributes and attribute values: the same attribute can be mapped to different attribute values (example: height can be measured in feet or meters), and different attributes can be mapped to the same set of values (example: attribute values for ID and age are integers). However, the properties of an attribute can differ from the properties of the values used to represent it.

5. Measurement of Length: The way you measure an attribute may not match the attribute's properties. Figure: two measurement scales for length. One scale preserves the ordering and additivity properties of length; the other preserves only the ordering property of length.

6. Types of Attributes: There are different types of attributes. Nominal (examples: ID numbers, eye color, zip codes). Ordinal (examples: rankings, e.g., taste of potato chips on a scale from 1-10; grades; height {tall, medium, short}). Interval (examples: calendar dates, temperatures in Celsius or Fahrenheit). Ratio (examples: temperature in Kelvin, length, counts, elapsed time, e.g., time to run a race).

7. Properties of Attribute Values: The type of an attribute depends on which of the following properties/operations it possesses. Distinctness: = and ≠. Order: < and >. Meaningful differences: + and -. Meaningful ratios: * and /. Nominal attribute: distinctness. Ordinal attribute: distinctness and order. Interval attribute: distinctness, order, and meaningful differences. Ratio attribute: all four properties/operations.

8. Difference Between Ratio and Interval: Is it physically meaningful to say that a temperature of 10° is twice that of 5° on the Celsius scale? The Fahrenheit scale? The Kelvin scale? Consider measuring the height above average. If Bill’s height is three inches above average and Bob’s height is six inches above average, then would we say that Bob is twice as tall as Bill? Is this situation analogous to that of temperature?

9. This categorization of attributes is due to S. S. Stevens

10. This categorization of attributes is due to S. S. Stevens

11. Discrete and Continuous Attributes. Discrete attribute: has only a finite or countably infinite set of values. Examples: zip codes, counts, or the set of words in a collection of documents. Often represented as integer variables. Note: binary attributes are a special case of discrete attributes. Continuous attribute: has real numbers as attribute values. Examples: temperature, height, or weight. Practically, real values can only be measured and represented using a finite number of digits. Continuous attributes are typically represented as floating-point variables.

12. Asymmetric Attributes: Only presence (a non-zero attribute value) is regarded as important. Examples: words present in documents, items present in customer transactions. If we met a friend in the grocery store, would we ever say the following? “I see our purchases are very similar since we didn’t buy most of the same things.”

13. Critiques of the Attribute Categorization. Incomplete: asymmetric binary, cyclical, multivariate, partially ordered, partial membership, relationships between the data. Real data is approximate and noisy. This can complicate recognition of the proper attribute type. Treating one attribute type as another may be approximately correct.

14. Key Messages for Attribute Types: The types of operations you choose should be “meaningful” for the type of data you have. Distinctness, order, meaningful intervals, and meaningful ratios are only four (among many possible) properties of data. The data type you see, often numbers or strings, may not capture all the properties or may suggest properties that are not present. Analysis may depend on these other properties of the data. Many statistical analyses depend only on the distribution. In the end, what is meaningful can be specific to the domain.

15. Important Characteristics of Data. Dimensionality (number of attributes): high-dimensional data brings a number of challenges. Sparsity: only presence counts. Resolution: patterns depend on the scale. Size: the type of analysis may depend on the size of the data.

16. Types of Data Sets. Record: data matrix, document data, transaction data. Graph: World Wide Web, molecular structures. Ordered: spatial data, temporal data, sequential data, genetic sequence data.

17. Record Data: Data that consists of a collection of records, each of which consists of a fixed set of attributes.

18. Data Matrix: If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute. Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

19. Document Data: Each document becomes a “term” vector. Each term is a component (attribute) of the vector. The value of each component is the number of times the corresponding term occurs in the document.
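As a minimal sketch of the term-vector representation, the following Python builds term-count vectors for a few toy documents. The documents, the whitespace tokenizer, and the function name `term_vector` are illustrative assumptions and do not come from the slides.

```python
from collections import Counter

def term_vector(text, vocabulary):
    """Return how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer (assumption)
    return [counts[term] for term in vocabulary]

# Hypothetical toy documents, for illustration only
docs = [
    "team coach play ball",
    "ball score game win game",
    "team team lost timeout",
]

# One vector component (attribute) per distinct term in the collection
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})
vectors = [term_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Stacking the vectors row by row gives a document-term matrix, which is a data matrix in the sense of the previous slide.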

20. Transaction Data: A special type of data, where each transaction involves a set of items. For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items. Transaction data can be represented as record data.

21. Graph Data: Examples: generic graph, a molecule, and webpages. Figure: benzene molecule, C6H6.

22. Ordered Data: Sequences of transactions. Figure labels: an element of the sequence; items/events.

23. Ordered Data: Genomic sequence data.

24. Ordered Data: Spatio-temporal data. Figure: average monthly temperature of land and ocean.

25. Data Quality: Poor data quality negatively affects many data processing efforts. Data mining example: a classification model for detecting people who are loan risks is built using poor data. Some credit-worthy candidates are denied loans. More loans are given to individuals who default.

26. Data Quality (continued): What kinds of data quality problems occur? How can we detect problems with the data? What can we do about these problems? Examples of data quality problems: noise and outliers, wrong data, fake data, missing values, duplicate data.

27. Noise: For objects, noise is an extraneous object. For attributes, noise refers to modification of original values. Examples: distortion of a person’s voice when talking on a poor phone connection and “snow” on a television screen. The figures show two sine waves of the same magnitude and different frequencies, the waves combined, and the two sine waves with random noise. The magnitude and shape of the original signal are distorted.

28. Outliers: Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set. Case 1: outliers are noise that interferes with data analysis. Case 2: outliers are the goal of our analysis (credit card fraud, intrusion detection). Causes?

29. Missing Values. Reasons for missing values: information is not collected (e.g., people decline to give their age and weight); attributes may not be applicable to all cases (e.g., annual income is not applicable to children). Handling missing values: eliminate data objects or variables; estimate missing values (examples: time series of temperature, census results); ignore the missing value during analysis.

30. Duplicate Data: A data set may include data objects that are duplicates, or almost duplicates, of one another. This is a major issue when merging data from heterogeneous sources. Example: the same person with multiple email addresses. Data cleaning: the process of dealing with duplicate data issues. When should duplicate data not be removed?

31. Similarity and Dissimilarity Measures. Similarity measure: a numerical measure of how alike two data objects are. It is higher when objects are more alike and often falls in the range [0, 1]. Dissimilarity measure: a numerical measure of how different two data objects are. It is lower when objects are more alike. The minimum dissimilarity is often 0; the upper limit varies. Proximity refers to either a similarity or a dissimilarity.

32. Similarity/Dissimilarity for Simple Attributes: The following table shows the similarity and dissimilarity between two objects, x and y, with respect to a single, simple attribute.

33. Euclidean Distance: $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$, where n is the number of dimensions (attributes) and $x_k$ and $y_k$ are, respectively, the kth attributes (components) of data objects x and y. Standardization is necessary if scales differ.

34. Euclidean Distance: Distance Matrix (figure).

35. Minkowski Distance: The Minkowski distance is a generalization of the Euclidean distance: $d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}$, where r is a parameter, n is the number of dimensions (attributes), and $x_k$ and $y_k$ are, respectively, the kth attributes (components) of data objects x and y.

36. Minkowski Distance: Examples. r = 1: city block (Manhattan, taxicab, L1 norm) distance. A common example of this for binary vectors is the Hamming distance, which is just the number of bits that are different between two binary vectors. r = 2: Euclidean distance. r → ∞: “supremum” (Lmax norm, L∞ norm) distance. This is the maximum difference between any component of the vectors. Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
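The three cases above can be sketched in a few lines of Python. The function name `minkowski` and the sample points are assumptions made for illustration; passing `math.inf` for r selects the supremum distance.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance between equal-length vectors x and y."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        # Limit r -> infinity: the largest component-wise difference
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

# Hypothetical 2-D points
p, q = (0, 2), (3, 1)
print("L1  :", minkowski(p, q, 1))
print("L2  :", minkowski(p, q, 2))
print("Lmax:", minkowski(p, q, math.inf))

# For binary vectors, r = 1 counts the differing bits (Hamming distance)
print("Hamming:", minkowski((1, 0, 1, 1), (0, 0, 1, 0), 1))
```

The parameter r changes how component differences are combined; the number of dimensions n is simply the length of the vectors.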

37. Minkowski Distance: Distance Matrix (figure).

38. Mahalanobis Distance: The formula appears as an image on the slide; in it, Σ is the covariance matrix. For the red points in the figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.

39. Mahalanobis Distance: Covariance matrix Σ (shown on the slide). Points: A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5). Mahal(A, B) = 5, Mahal(A, C) = 4.
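The values for A, B, and C can be checked with a short sketch. The covariance matrix did not survive in this transcript, so the code assumes Σ = [[0.3, 0.2], [0.2, 0.3]]; with that assumption the quadratic form (x − y)^T Σ^{-1} (x − y), taken without a square root, gives 5 and 4. Some definitions take the square root of this quantity. The helper names are hypothetical.

```python
def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis_quadratic_form(x, y, cov):
    """(x - y)^T cov^{-1} (x - y) for 2-D points, without a square root."""
    inv = inverse_2x2(cov)
    dx = [x[0] - y[0], x[1] - y[1]]
    tmp = [
        inv[0][0] * dx[0] + inv[0][1] * dx[1],
        inv[1][0] * dx[0] + inv[1][1] * dx[1],
    ]
    return dx[0] * tmp[0] + dx[1] * tmp[1]

cov = [[0.3, 0.2], [0.2, 0.3]]  # assumed; not present in the transcript
A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)

print("Mahal(A, B):", round(mahalanobis_quadratic_form(A, B, cov), 6))
print("Mahal(A, C):", round(mahalanobis_quadratic_form(A, C, cov), 6))
```

C is farther from A than B in Euclidean terms, yet it lies along the direction in which the assumed covariance says the data varies most, so its Mahalanobis value is smaller.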

40. Common Properties of a Distance: Distances, such as the Euclidean distance, have some well-known properties. (1) d(x, y) ≥ 0 for all x and y, and d(x, y) = 0 if and only if x = y. (2) d(x, y) = d(y, x) for all x and y (symmetry). (3) d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z (triangle inequality). Here d(x, y) is the distance (dissimilarity) between points (data objects) x and y. A distance that satisfies these properties is a metric.

41. Common Properties of a Similarity: Similarities also have some well-known properties. (1) s(x, y) = 1 (or maximum similarity) only if x = y (does not always hold, e.g., cosine). (2) s(x, y) = s(y, x) for all x and y (symmetry). Here s(x, y) is the similarity between points (data objects) x and y.

42. Similarity Between Binary Vectors: A common situation is that objects x and y have only binary attributes. Compute similarities using the following quantities: f01 = the number of attributes where x was 0 and y was 1; f10 = the number of attributes where x was 1 and y was 0; f00 = the number of attributes where x was 0 and y was 0; f11 = the number of attributes where x was 1 and y was 1. Simple Matching and Jaccard coefficients: SMC = number of matches / number of attributes = (f11 + f00) / (f01 + f10 + f11 + f00). J = number of 11 matches / number of non-zero attributes = (f11) / (f01 + f10 + f11).

43. SMC versus Jaccard: Example. x = 1 0 0 0 0 0 0 0 0 0, y = 0 0 0 0 0 0 1 0 0 1. f01 = 2 (the number of attributes where x was 0 and y was 1). f10 = 1 (the number of attributes where x was 1 and y was 0). f00 = 7 (the number of attributes where x was 0 and y was 0). f11 = 0 (the number of attributes where x was 1 and y was 1). SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7. J = (f11) / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0.
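The same counts can be computed programmatically. This is a small sketch using the example vectors above; the function name and the choice to return 0 for Jaccard when both vectors are all zeros are assumptions.

```python
def smc_and_jaccard(x, y):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (f11 + f00) / (f01 + f10 + f11 + f00)
    non_zero = f01 + f10 + f11
    # Convention chosen here (assumption): Jaccard is 0 when nothing is non-zero
    jaccard = f11 / non_zero if non_zero else 0.0
    return smc, jaccard

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
smc, jaccard = smc_and_jaccard(x, y)
print("SMC =", smc, "Jaccard =", jaccard)
```

Because Jaccard ignores the 0-0 matches, it is the natural choice for asymmetric binary attributes, where shared absences carry little information.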

44. Cosine Similarity: If d1 and d2 are two document vectors, then cos(d1, d2) = <d1, d2> / (||d1|| ||d2||), where <d1, d2> indicates the inner product or vector dot product of vectors d1 and d2, and ||d|| is the length of vector d. Example: d1 = 3 2 0 5 0 0 0 2 0 0, d2 = 1 0 0 0 0 0 0 1 0 2. <d1, d2> = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5. ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481. ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449. cos(d1, d2) = 0.3150.

45. Correlation measures the linear relationship between objects

46. Visually Evaluating Correlation: Scatter plots showing the similarity from -1 to 1.

47. Drawback of Correlation: x = (-3, -2, -1, 0, 1, 2, 3), y = (9, 4, 1, 0, 1, 4, 9), so y_i = x_i^2. mean(x) = 0, mean(y) = 4, std(x) = 2.16, std(y) = 3.74. corr = [(-3)(5) + (-2)(0) + (-1)(-3) + (0)(-4) + (1)(-3) + (2)(0) + (3)(5)] / (6 * 2.16 * 3.74) = 0. The correlation is 0 even though y is completely determined by x, because the relationship is not linear.
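A quick check of this example, as a sketch: the helper `pearson` below is written out by hand (an assumption, to avoid depending on any particular library version) and uses the usual Pearson formula, in which the n − 1 factors cancel.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/(n-1) factors of covariance and variances cancel in the ratio
    return cov / math.sqrt(var_x * var_y)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]  # perfectly dependent on x, but not linearly
print(pearson(x, y))
```

A correlation near zero therefore rules out a linear relationship only; it says nothing about non-linear dependence.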

48. Correlation vs Cosine vs Euclidean Distance: Compare the three proximity measures according to their behavior under variable transformation. Scaling: multiplication by a value. Translation: adding a constant. Consider the example x = (1, 2, 4, 3, 0, 0, 0, 0), y = (1, 2, 3, 4, 0, 0, 0, 0), ys = y * 2 (scaled version of y), yt = y + 5 (translated version of y). The tabulated values correspond to these eight-component vectors (four trailing zeros).

| Property | Cosine | Correlation | Euclidean Distance |
|---|---|---|---|
| Invariant to scaling (multiplication) | Yes | Yes | No |
| Invariant to translation (addition) | No | Yes | No |

| Measure | (x, y) | (x, ys) | (x, yt) |
|---|---|---|---|
| Cosine | 0.9667 | 0.9667 | 0.7940 |
| Correlation | 0.9429 | 0.9429 | 0.9429 |
| Euclidean Distance | 1.4142 | 5.8310 | 14.2127 |
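The comparison can be reproduced with a short sketch. One assumption is made explicit here: x and y are taken with eight components (four trailing zeros), since that is the length for which the tabulated correlation, translated-cosine, and translated-distance values come out as listed. The function names are illustrative.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def correlation(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Eight-component vectors (assumption: four trailing zeros)
x = [1, 2, 4, 3, 0, 0, 0, 0]
y = [1, 2, 3, 4, 0, 0, 0, 0]
ys = [2 * v for v in y]   # scaled version of y
yt = [v + 5 for v in y]   # translated version of y

for name, measure in (("Cosine", cosine), ("Correlation", correlation), ("Euclidean", euclidean)):
    print(name, [round(measure(x, other), 4) for other in (y, ys, yt)])
```

Cosine normalizes by vector length, so scaling cancels; correlation additionally subtracts the mean, so translation cancels too; Euclidean distance does neither.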

49. Correlation vs Cosine vs Euclidean Distance: The choice of the right proximity measure depends on the domain. What is the correct choice of proximity measure for the following situations? (1) Comparing documents using the frequencies of words: documents are considered similar if the word frequencies are similar. (2) Comparing the temperature in Celsius of two locations: two locations are considered similar if the temperatures are similar in magnitude. (3) Comparing two time series of temperature measured in Celsius: two time series are considered similar if their “shape” is similar, i.e., they vary in the same way over time, achieving minimums and maximums at similar times, etc.

50. Comparison of Proximity Measures. Domain of application: similarity measures tend to be specific to the type of attribute and data. Record data, images, graphs, sequences, 3D-protein structure, etc. tend to have different measures. However, one can talk about various properties that you would like a proximity measure to have. Symmetry is a common one. Tolerance to noise and outliers is another. Ability to find more types of patterns? Many others are possible. The measure must be applicable to the data and produce results that agree with domain knowledge.

51. Information-Based Measures: Information theory is a well-developed and fundamental discipline with broad applications. Some similarity measures are based on information theory: mutual information in various versions, and the Maximal Information Coefficient (MIC) and related measures. These are general and can handle non-linear relationships, but can be complicated and time-intensive to compute.

52. Information and Probability: Information relates to possible outcomes of an event: transmission of a message, flip of a coin, or measurement of a piece of data. The more certain an outcome, the less information it contains, and vice versa. For example, if a coin has two heads, then an outcome of heads provides no information. More quantitatively, the information is related to the probability of an outcome. The smaller the probability of an outcome, the more information it provides, and vice versa. Entropy is the commonly used measure.

53. Entropy: For a variable (event) X with n possible values (outcomes) x1, x2, …, xn, each outcome having probability p1, p2, …, pn, the entropy of X, H(X), is given by $H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$. Entropy is between 0 and log2 n and is measured in bits. Thus, entropy is a measure of how many bits it takes to represent an observation of X on average.

54. Entropy Examples: For a coin with probability p of heads and probability q = 1 - p of tails, $H = -p \log_2 p - q \log_2 q$. For p = 0.5, q = 0.5 (fair coin), H = 1. For p = 1 or q = 1, H = 0. What is the entropy of a fair four-sided die?

55. Entropy for Sample Data: Example. Maximum entropy is log2 5 = 2.3219.

| Hair Color | Count | p | -p log2 p |
|---|---|---|---|
| Black | 75 | 0.75 | 0.3113 |
| Brown | 15 | 0.15 | 0.4105 |
| Blond | 5 | 0.05 | 0.2161 |
| Red | 0 | 0.00 | 0 |
| Other | 5 | 0.05 | 0.2161 |
| Total | 100 | 1.0 | 1.1540 |

56. Entropy for Sample Data: Suppose we have a number of observations (m) of some attribute X, e.g., the hair color of students in the class, where there are n different possible values, and the number of observations in the ith category is m_i. Then, for this sample, $H(X) = -\sum_{i=1}^{n} \frac{m_i}{m} \log_2 \frac{m_i}{m}$. For continuous data, the calculation is harder.
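The sample entropy can be computed directly from category counts. The sketch below uses the hair-color counts from the example; the function name is an assumption, and empty categories are skipped, following the usual convention that 0 log 0 is treated as 0.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of a sample described by per-category counts."""
    counts = list(counts)
    m = sum(counts)
    # Categories with zero observations contribute nothing (0 * log 0 := 0)
    return -sum((c / m) * math.log2(c / m) for c in counts if c > 0)

hair_counts = {"Black": 75, "Brown": 15, "Blond": 5, "Red": 0, "Other": 5}
h = entropy_from_counts(hair_counts.values())
h_max = math.log2(len(hair_counts))

print("entropy     =", round(h, 4))
print("max entropy =", round(h_max, 4))
```

The sample entropy is well below the maximum because the distribution is dominated by one category.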

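57. Mutual Information: The information one variable provides about another. Formally, $I(X, Y) = H(X) + H(Y) - H(X, Y)$, where H(X, Y) is the joint entropy of X and Y, $H(X, Y) = -\sum_{i} \sum_{j} p_{ij} \log_2 p_{ij}$, and p_ij is the probability that the ith value of X and the jth value of Y occur together. For discrete variables, this is easy to compute. The maximum mutual information for discrete variables is log2(min(nX, nY)), where nX (nY) is the number of values of X (Y).

58. Mutual Information Example.

| Student Status | Count | p | -p log2 p |
|---|---|---|---|
| Undergrad | 45 | 0.45 | 0.5184 |
| Grad | 55 | 0.55 | 0.4744 |
| Total | 100 | 1.00 | 0.9928 |

| Grade | Count | p | -p log2 p |
|---|---|---|---|
| A | 35 | 0.35 | 0.5301 |
| B | 50 | 0.50 | 0.5000 |
| C | 15 | 0.15 | 0.4105 |
| Total | 100 | 1.00 | 1.4406 |

| Student Status | Grade | Count | p | -p log2 p |
|---|---|---|---|---|
| Undergrad | A | 5 | 0.05 | 0.2161 |
| Undergrad | B | 30 | 0.30 | 0.5211 |
| Undergrad | C | 10 | 0.10 | 0.3322 |
| Grad | A | 30 | 0.30 | 0.5211 |
| Grad | B | 20 | 0.20 | 0.4644 |
| Grad | C | 5 | 0.05 | 0.2161 |
| Total | | 100 | 1.00 | 2.2710 |

Mutual information of Student Status and Grade = 0.9928 + 1.4406 - 2.2710 = 0.1624.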
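As a sketch of the same calculation, the code below derives the marginal and joint entropies from the joint counts of the example and combines them as H(X) + H(Y) − H(X, Y). Function and variable names are assumptions. Because the code does not round the intermediate entropies, its result agrees with the hand calculation to about three decimal places rather than digit for digit.

```python
import math
from collections import Counter

def entropy(probabilities):
    """Entropy in bits; zero probabilities are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def mutual_information(joint_counts):
    """I(X, Y) = H(X) + H(Y) - H(X, Y) from a dict {(x, y): count}."""
    total = sum(joint_counts.values())
    x_counts, y_counts = Counter(), Counter()
    for (x_value, y_value), count in joint_counts.items():
        x_counts[x_value] += count
        y_counts[y_value] += count
    h_x = entropy(c / total for c in x_counts.values())
    h_y = entropy(c / total for c in y_counts.values())
    h_xy = entropy(c / total for c in joint_counts.values())
    return h_x + h_y - h_xy

joint = {
    ("Undergrad", "A"): 5, ("Undergrad", "B"): 30, ("Undergrad", "C"): 10,
    ("Grad", "A"): 30, ("Grad", "B"): 20, ("Grad", "C"): 5,
}
mi = mutual_information(joint)
print("mutual information ~", round(mi, 3))
```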

59. Maximal Information Coefficient: Reshef, David N., Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. "Detecting novel associations in large data sets." Science 334, no. 6062 (2011): 1518-1524. Applies mutual information to two continuous variables. Consider the possible binnings of the variables into discrete categories, subject to nX × nY ≤ N^0.6, where nX is the number of values of X, nY is the number of values of Y, and N is the number of samples (observations, data objects). Compute the mutual information, normalized by log2(min(nX, nY)). Take the highest value.

60. General Approach for Combining Similarities: Sometimes attributes are of many different types, but an overall similarity is needed. 1: For the kth attribute, compute a similarity, s_k(x, y), in the range [0, 1]. 2: Define an indicator variable, δ_k, for the kth attribute as follows: δ_k = 0 if the kth attribute is an asymmetric attribute and both objects have a value of 0, or if one of the objects has a missing value for the kth attribute; δ_k = 1 otherwise. 3: Compute $\text{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\sum_{k=1}^{n} \delta_k s_k(\mathbf{x}, \mathbf{y})}{\sum_{k=1}^{n} \delta_k}$.

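61. Using Weights to Combine Similarities: We may not want to treat all attributes the same. Use non-negative weights w_k: $\text{similarity}(\mathbf{x}, \mathbf{y}) = \frac{\sum_{k=1}^{n} w_k \delta_k s_k(\mathbf{x}, \mathbf{y})}{\sum_{k=1}^{n} w_k \delta_k}$. One can also define a weighted form of distance: $d(\mathbf{x}, \mathbf{y}) = \left( \sum_{k=1}^{n} w_k |x_k - y_k|^r \right)^{1/r}$.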
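A minimal sketch of the combination step follows. It assumes that the per-attribute similarities and the 0/1 indicators from step 2 (called `deltas` here) have already been computed, and that the weighted version normalizes by the sum of the weights of the attributes that count. That normalization, the function name, and the sample values are assumptions.

```python
def combined_similarity(similarities, deltas, weights=None):
    """Combine per-attribute similarities in [0, 1] into one overall value.

    deltas[k] is 0 when attribute k should be ignored (asymmetric 0-0 match
    or a missing value) and 1 otherwise; weights are non-negative.
    """
    if weights is None:
        weights = [1.0] * len(similarities)  # unweighted case
    numerator = sum(w * d * s for w, d, s in zip(weights, deltas, similarities))
    denominator = sum(w * d for w, d in zip(weights, deltas))
    if denominator == 0:
        raise ValueError("no attribute contributes to the similarity")
    return numerator / denominator

# Hypothetical example: the third attribute is ignored (delta = 0)
sims = [1.0, 0.5, 0.0]
deltas = [1, 1, 0]
print(combined_similarity(sims, deltas))
print(combined_similarity(sims, deltas, weights=[2, 1, 1]))
```

Dividing by the sum over contributing attributes keeps the result in [0, 1] regardless of how many attributes are skipped.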

62. Data Preprocessing: Aggregation; Sampling; Discretization and Binarization; Attribute Transformation; Dimensionality Reduction; Feature Subset Selection; Feature Creation

63. Aggregation: Combining two or more attributes (or objects) into a single attribute (or object). Purpose: (1) Data reduction: reduce the number of attributes or objects. (2) Change of scale: cities aggregated into regions, states, countries, etc.; days aggregated into weeks, months, or years. (3) More “stable” data: aggregated data tends to have less variability.

64. Example: Precipitation in Australia. This example is based on precipitation in Australia from the period 1982 to 1993. The next slide shows a histogram for the standard deviation of average monthly precipitation for 3,030 0.5° by 0.5° grid cells in Australia, and a histogram for the standard deviation of the average yearly precipitation for the same locations. The average yearly precipitation has less variability than the average monthly precipitation. All precipitation measurements (and their standard deviations) are in centimeters.

65. Example: Precipitation in Australia (continued). Figure: Variation of Precipitation in Australia. Panels: Standard Deviation of Average Monthly Precipitation; Standard Deviation of Average Yearly Precipitation.

66. Sampling: Sampling is the main technique employed for data reduction. It is often used for both the preliminary investigation of the data and the final data analysis. Statisticians often sample because obtaining the entire set of data of interest is too expensive or time consuming. Sampling is typically used in data mining because processing the entire set of data of interest is too expensive or time consuming.

67. Sampling (continued): The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set, if the sample is representative. A sample is representative if it has approximately the same properties (of interest) as the original set of data.

68. Sample Size: Figure panels: 8000 points, 2000 points, 500 points.

69. Types of Sampling. Simple random sampling: there is an equal probability of selecting any particular item. Sampling without replacement: as each item is selected, it is removed from the population. Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once. Stratified sampling: split the data into several partitions; then draw random samples from each partition.
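These sampling schemes map onto Python's standard `random` module, as the sketch below shows. The toy population, its 80/20 split into groups "A" and "B", the fixed seed, and the helper `stratified_sample` are all assumptions made for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 80 objects in group "A", 20 in group "B"
population = [(i, "A" if i < 80 else "B") for i in range(100)]

# Simple random sampling without replacement: no object appears twice
without_replacement = rng.sample(population, k=10)

# Simple random sampling with replacement: the same object may repeat
with_replacement = rng.choices(population, k=10)

def stratified_sample(items, key, per_stratum, rng):
    """Partition items by key, then draw the same number from each partition."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], k=per_stratum))
    return sample

stratified = stratified_sample(population, key=lambda item: item[1], per_stratum=5, rng=rng)
print(sorted(group for _, group in stratified))
```

Stratified sampling guarantees that the small group "B" is represented, which simple random sampling of ten objects does not.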

70. Sample Size: What sample size is necessary to get at least one object from each of 10 equal-sized groups?
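One way to explore this question is to compute the probability that a sample of a given size contains every group. The sketch below assumes each draw falls into any of the groups with equal probability and independently of the other draws (a reasonable approximation when the groups are large); under that assumption the probability follows from inclusion-exclusion. The function name is hypothetical.

```python
import math

def prob_all_groups(sample_size, groups=10):
    """Probability that every group appears at least once in the sample.

    Inclusion-exclusion over the set of groups that are missed entirely,
    assuming independent, equally likely draws (assumption).
    """
    return sum(
        (-1) ** j * math.comb(groups, j) * (1 - j / groups) ** sample_size
        for j in range(groups + 1)
    )

for size in (10, 20, 40, 60):
    print(size, round(prob_all_groups(size), 4))
```

The probability is tiny for a sample of 10 and approaches 1 only for samples several times larger than the number of groups.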

71. Discretization: Discretization is the process of converting a continuous attribute into an ordinal attribute. A potentially infinite number of values are mapped into a small number of categories. Discretization is used in both unsupervised and supervised settings.

72. Unsupervised Discretization: Data consists of four groups of points and two outliers. Data is one-dimensional, but a random y component is added to reduce overlap.

73. Unsupervised Discretization: Equal interval width approach used to obtain 4 values.

74. Unsupervised Discretization: Equal frequency approach used to obtain 4 values.
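The equal-width and equal-frequency approaches can be sketched as follows. The function names and the toy values (including one outlier) are assumptions; ties in the equal-frequency version are broken by position, which is one of several possible choices.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign values to k bins holding (nearly) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * k // len(values)
    return bins

# Hypothetical one-dimensional data with an outlier
values = [1, 2, 3, 4, 5, 6, 7, 100]
print("equal width    :", equal_width_bins(values, 4))
print("equal frequency:", equal_frequency_bins(values, 4))
```

With an outlier present, equal-width binning puts almost everything in one interval, while equal-frequency binning still spreads the values across all four.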

75. Unsupervised Discretization: K-means approach to obtain 4 values.

76. Discretization in Supervised Settings: Many classification algorithms work best if both the independent and dependent variables have only a few values. We give an illustration of the usefulness of discretization using the following example.

77. Binarization: Binarization maps a continuous or categorical attribute into one or more binary variables.

78. Attribute Transformation: An attribute transform is a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values. Simple functions: x^k, log(x), e^x, |x|. Normalization: refers to various techniques to adjust to differences among attributes in terms of frequency of occurrence, mean, variance, range. Take out unwanted, common signal, e.g., seasonality. In statistics, standardization refers to subtracting off the means and dividing by the standard deviation.
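Standardization in the statistical sense can be sketched with the standard library. The function name and sample values are assumptions, as is the use of the sample standard deviation (dividing by n − 1).

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / std for v in values]

# Hypothetical attribute values
values = [2, 4, 4, 4, 5, 5, 7, 9]
z = standardize(values)
print([round(v, 3) for v in z])
```

After the transform the attribute has mean 0 and standard deviation 1, so attributes measured on different scales become comparable.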

79. Example: Sample Time Series of Plant Growth. Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists. Figure labels: Minneapolis; Correlations between time series.

80. Seasonality Accounts for Much Correlation: Normalized using monthly Z score: subtract off the monthly mean and divide by the monthly standard deviation. Figure labels: Minneapolis; Correlations between time series.

81. Curse of Dimensionality: When dimensionality increases, data becomes increasingly sparse in the space that it occupies. Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful. Experiment shown in the figure: randomly generate 500 points and compute the difference between the max and min distance between any pair of points.
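The experiment can be approximated with a short simulation. Several details are assumptions: points are drawn uniformly from the unit hypercube, the difference is divided by the minimum distance to make dimensions comparable, and only 100 points are used in the demonstration loop to keep the pure-Python run time short (the slide uses 500).

```python
import itertools
import math
import random

def relative_contrast(n_points, n_dims, rng):
    """(max - min) / min over all pairwise Euclidean distances."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    distances = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min

rng = random.Random(0)  # fixed seed for repeatability
for dims in (2, 10, 50):
    contrast = relative_contrast(100, dims, rng)
    print(dims, "dimensions: log10 contrast =", round(math.log10(contrast), 2))
```

As the number of dimensions grows, the nearest and farthest pairs become almost equally distant, which is why distance-based notions such as density lose discriminating power.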

82. Dimensionality Reduction. Purpose: avoid the curse of dimensionality; reduce the amount of time and memory required by data mining algorithms; allow data to be more easily visualized; may help to eliminate irrelevant features or reduce noise. Techniques: Principal Components Analysis (PCA); Singular Value Decomposition; others: supervised and non-linear techniques.

83. Dimensionality Reduction: PCA. The goal is to find a projection that captures the largest amount of variation in the data. Figure labels: x1, x2, e.
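For two-dimensional data, the direction of largest variation can be found in closed form from the covariance matrix, which gives a compact sketch of the idea. The function name and sample points are assumptions; real applications would typically use a linear algebra library and handle any number of dimensions.

```python
import math

def first_principal_direction(points):
    """Unit vector along the direction of largest variance for 2-D points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Major-axis angle of the symmetric 2x2 covariance matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

# Hypothetical points lying close to the line x2 = x1
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]
e = first_principal_direction(points)

# Projecting each point onto e reduces the data to one dimension
projected = [p[0] * e[0] + p[1] * e[1] for p in points]
print([round(c, 3) for c in e])
print([round(v, 3) for v in projected])
```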

84. Dimensionality Reduction: PCA

85. Feature Subset Selection: Another way to reduce the dimensionality of data. Redundant features: duplicate much or all of the information contained in one or more other attributes. Example: purchase price of a product and the amount of sales tax paid. Irrelevant features: contain no information that is useful for the data mining task at hand. Example: students' ID is often irrelevant to the task of predicting students' GPA. Many techniques have been developed, especially for classification.

86. Feature Creation: Create new attributes that can capture the important information in a data set much more efficiently than the original attributes. Three general methodologies: feature extraction (example: extracting edges from images); feature construction (example: dividing mass by volume to get density); mapping data to a new space (example: Fourier and wavelet analysis).

87. Mapping Data to a New Space: Fourier and wavelet transform. Figure panels: Two Sine Waves + Noise; Frequency.