Slide1
DATA PREPARATION: Preprocessing
Data cleaning
Data discretization
Binning
Clustering
Binarization
Data integration
Aggregation
Smoothing
Slide2
DATA PREPARATION: Preprocessing
Data reduction
Sampling
Dimensionality reduction
Feature subset selection
Feature creation
Data transformation
Variable transformation
Scaling
Sorting
Slide3
DATA PREPARATION: Preprocessing
Mathematical computation
Normalization
Stationarity
Statistical computation
Mean
Median
Mode
Midrange
Variance, standard deviation, range
Weighted mean
Slide4
DATA PREPARATION: Data cleaning
Missing data
Improper data
Detection and handling of outliers
Handling noise
Slide5
DATA PREPARATION: Data cleaning
Need to determine how to handle missing values
Why are the values missing?
Is there significance in the fact that particular values are missing?
Need to determine how to handle inaccurate values
How to identify and handle outliers
Typographic errors (e.g. transposition errors)
Measurement errors
Duplicate values
Noise
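One common way to flag candidate outliers is the interquartile-range (IQR) rule. The sketch below is an illustration, not a method the slides prescribe: the function name `iqr_outliers`, the multiplier `k=1.5`, and the sample readings are all assumptions.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sensor readings with one suspicious value
readings = [10, 12, 11, 13, 12, 11, 95, 10, 12]
print(iqr_outliers(readings))
```

A flagged value is only a candidate: whether it is a typographic error, a measurement error, or a genuine extreme still has to be decided from domain knowledge.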
Slide6
DATA PREPARATION: Data cleaning
Need to determine how to handle irrelevant data
Why is the data deemed irrelevant? Is the data truly irrelevant?
Need to determine how to handle data timeliness
How important is the age of the data?
Slide7
DATA PREPARATION: Data cleaning
Possible handling methods
Eliminate data instances
Eliminate data attributes
Estimate missing values – interpolation
Ignore missing values during analysis
Identifying inconsistent values during collection
Check digits
Smoothing data
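Two of the handling methods above can be sketched briefly. The first function estimates missing values by linear interpolation between known neighbors. The second validates a check digit using the Luhn scheme, chosen here only as a well-known example; the slides do not name a specific scheme. Function names and sample inputs are assumptions.

```python
def interpolate_missing(values):
    """Fill interior None gaps by linear interpolation between known neighbors.

    Leading and trailing gaps stay None: there is no second neighbor to
    interpolate toward.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled

def luhn_valid(number):
    """Check a digit string whose last digit is a Luhn check digit."""
    total = 0
    for pos, ch in enumerate(reversed(number)):
        d = int(ch)
        if pos % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(interpolate_missing([1.0, None, None, 4.0, None]))
print(luhn_valid("79927398713"), luhn_valid("79927398731"))
```

The second `luhn_valid` call swaps the last two digits of the first number, the kind of transposition error a check digit is meant to catch at collection time.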
Slide8
DATA PREPARATION: Data cleaning
Possible handling methods
Remove duplicate data
Careful: are duplicate instances errors, or are they separate instances with identical values? Many machine learning tools give different results when instances are repeated, because each copy adds weight to that instance.
Remove irrelevant data
Remove dated data
Weight data by data age
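A minimal sketch of two of these methods, assuming rows are hashable tuples. `duplicate_counts` reports repeats so they can be inspected before anything is removed, and `age_weight` uses exponential decay with a hypothetical one-year half-life; the slides do not specify a weighting function.

```python
from collections import Counter

def duplicate_counts(rows):
    """Map each row that occurs more than once to its number of occurrences."""
    return {row: n for row, n in Counter(rows).items() if n > 1}

def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def age_weight(age_days, half_life_days=365.0):
    """Exponential decay: a record half_life_days old gets weight 0.5."""
    return 0.5 ** (age_days / half_life_days)

# Hypothetical records: (customer_id, product)
rows = [(1, "a"), (2, "b"), (1, "a")]
print(duplicate_counts(rows))
print(drop_exact_duplicates(rows))
print(age_weight(0), age_weight(365))
```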
Slide9
DATA PREPARATION: Discretization
Binning
Equal-frequency interval binning
Equal-width interval binning
Clustering
K-means clustering
Hierarchical methods
Binarization
Entropy-based discretization
Discretization of multiple variables
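The two binning strategies differ in what they hold constant: equal-width binning fixes the interval size, while equal-frequency binning fixes the number of values per bin. The sketch below illustrates both; the function names and the sample `ages` list are assumptions.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # min(..., k - 1) places the maximum value in the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign bin indices so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

ages = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(ages, 3))
print(equal_frequency_bins(ages, 3))
```

With a single extreme value, equal-width binning crowds almost everything into the first bin, while equal-frequency binning still spreads the values evenly. Note that this simple rank-based version can split tied values across neighboring bins.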
Slide10
DATA PREPARATION: Data integration
Aggregation
Smoothing data
Averaging data
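As an illustration, aggregation can be sketched as a per-group mean, and smoothing by averaging as a centered moving average. Both function names, the `(key, value)` record layout, and the window size are assumptions made for this example.

```python
from collections import defaultdict

def aggregate_mean(records):
    """records: iterable of (group_key, value); returns {group_key: mean value}."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

def moving_average(values, window=3):
    """Smooth a sequence with a centered moving average.

    Near the edges the window shrinks to the values that exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

print(aggregate_mean([("a", 1), ("a", 3), ("b", 10)]))
print(moving_average([2, 4, 6, 8]))
```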
Slide11
DATA PREPARATION: Data reduction
Sampling
Why sample
Sampling techniques
Simple random sampling
With replacement
Without replacement
Stratified sampling
Sample size
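The sampling techniques listed above map directly onto Python's standard `random` module. In this sketch the helper names, the proportional allocation in `stratified_sample`, and the rule of taking at least one item per stratum are assumptions.

```python
import random
from collections import defaultdict

def sample_without_replacement(items, n, rng):
    """Each item can be drawn at most once."""
    return rng.sample(items, n)

def sample_with_replacement(items, n, rng):
    """Each draw is independent, so items can repeat."""
    return rng.choices(items, k=n)

def stratified_sample(items, key, fraction, rng):
    """Draw the same fraction from every stratum defined by key(item)."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, n))
    return sample

rng = random.Random(0)  # seeded so the run is repeatable
population = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
picked = stratified_sample(population, lambda item: item[0], 0.1, rng)
print(len(picked), sum(1 for label, _ in picked if label == "b"))
```

Stratifying guarantees that the rare group "b" is represented; a simple random sample of the same size could miss it entirely.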
Slide12
DATA PREPARATION: Data reduction
Dimensionality reduction
Curse of dimensionality
Projection into lower dimensions
Principal components analysis
Singular value decomposition
Feature subset selection
Remove redundant features
Remove irrelevant features
Feature creation
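One simple way to remove redundant features is to drop any feature that is almost perfectly correlated with one already kept. The sketch below illustrates only that idea (it does not cover principal components analysis or singular value decomposition); the `0.95` threshold, the greedy keep-first order, and the feature names are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / (sx * sy)

def drop_redundant(features, threshold=0.95):
    """features: dict of name -> column. Returns the names that are kept.

    A feature is kept only if its absolute correlation with every
    previously kept feature is below the threshold.
    """
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: height recorded twice in different units
features = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "shoe_size": [7, 10, 6, 9],
}
print(drop_redundant(features))
```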
Slide13
DATA PREPARATION: Data transformation
Variable transformation
Scaling
Sorting
Normalization
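Two widely used forms of scaling and normalization are min-max scaling to the range [0, 1] and z-score standardization. The slides do not say which form is intended, so the sketch below shows both; function names are assumptions.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

print(min_max_scale([10, 20, 30]))
print(z_score([2, 4, 6]))
```

Min-max scaling is sensitive to a single extreme value, since that value defines the range; z-scores are less so, but still shift when outliers move the mean and standard deviation.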
Slide14
DATA PREPARATION: Mathematical computation
Normalization
Stationarity
Time series – mean and variance are constant
Statistical computation
Mean
Median
Mode
Midrange
Variance, standard deviation, range
Weighted mean
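The statistics above can be computed with Python's standard `statistics` module. `midrange`, `weighted_mean`, and `half_stats` are helper names introduced for this sketch, and the data is invented. `half_stats` is only a crude stationarity check: it compares the mean and variance of the two halves of a series.

```python
import statistics

def midrange(values):
    """Average of the smallest and largest value."""
    return (min(values) + max(values)) / 2

def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def half_stats(series):
    """Return (mean, variance) for the first and second half of a series."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return ((statistics.fmean(first), statistics.pvariance(first)),
            (statistics.fmean(second), statistics.pvariance(second)))

data = [2, 4, 4, 5, 7, 9, 11]
summary = {
    "mean": statistics.fmean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "midrange": midrange(data),
    "range": max(data) - min(data),
    "variance": statistics.pvariance(data),  # population variance
    "std_dev": statistics.pstdev(data),      # population standard deviation
}
print(summary)
print(weighted_mean([80, 90], [1, 3]))
print(half_stats([1, 2, 3, 4, 5, 6, 7, 8]))
```

For the trending series in the last call, the two halves have clearly different means, which suggests the series is not stationary.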
Slide15
DATA PREPARATION: Measures of Similarity and Dissimilarity
Euclidean distance
Direction cosines
Simple matching and Jaccard coefficients
Tanimoto measure (set similarity)
Hamming distance
Edit distance
Probability based distances
Mahalanobis distance
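Several of these measures are short enough to sketch directly. The code below covers Euclidean distance, cosine similarity (taken here as the measure behind "direction cosines", which is an interpretation), the simple matching and Jaccard coefficients for binary vectors, Hamming distance, and edit (Levenshtein) distance. Probability-based and Mahalanobis distances are left out because they need a fitted distribution or covariance matrix. For binary or set data, the Tanimoto measure coincides with the Jaccard coefficient.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def simple_matching(a, b):
    """Fraction of positions where two binary vectors agree (0-0 and 1-1)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def jaccard(a, b):
    """1-1 matches divided by positions where at least one vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def edit_distance(s, t):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution or match
        prev = cur
    return prev[-1]

u, v = [1, 0, 0, 0], [1, 0, 0, 1]
print(simple_matching(u, v), jaccard(u, v))
print(hamming("karolin", "kathrin"), edit_distance("kitten", "sitting"))
```

The first print shows why the choice matters for sparse binary data: simple matching counts shared zeros as agreement, while Jaccard ignores them.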
Slide16
DATA PREPARATION: Choosing Similarity Measures
Problem specific
Data dependent
Domain knowledge
Purpose
Metric properties?
Positivity
Symmetry
Triangle Inequality
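The three metric properties can be checked empirically for a candidate distance function on a finite sample of points. This is a sketch under stated assumptions: `check_metric` is a hypothetical helper, and passing on a sample is only evidence, not proof, whereas a single failure is a genuine counterexample.

```python
from itertools import product

def check_metric(dist, points, tol=1e-9):
    """Test positivity, symmetry and the triangle inequality on sample points."""
    results = {"positivity": True, "symmetry": True, "triangle": True}
    for p, q in product(points, repeat=2):
        d = dist(p, q)
        # positivity: d >= 0, and d == 0 exactly when p == q
        if d < -tol or (p == q) != (abs(d) <= tol):
            results["positivity"] = False
        # symmetry: d(p, q) == d(q, p)
        if abs(d - dist(q, p)) > tol:
            results["symmetry"] = False
    for p, q, r in product(points, repeat=3):
        # triangle inequality: d(p, r) <= d(p, q) + d(q, r)
        if dist(p, r) > dist(p, q) + dist(q, r) + tol:
            results["triangle"] = False
    return results

points = [0.0, 1.0, 2.0]
print(check_metric(lambda a, b: abs(a - b), points))
print(check_metric(lambda a, b: (a - b) ** 2, points))
```

Squared difference fails the triangle inequality on these points (4 > 1 + 1), so it is a dissimilarity measure but not a metric.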