DATA PREPARATION: Preprocessing - PowerPoint Presentation

festivehippo
Uploaded On 2020-07-03

Presentation Transcript

Slide1

DATA PREPARATION: Preprocessing

Data cleaning

Data discretization

Binning

Clustering

Binarization

Data integration

Aggregation

Smoothing

Slide2

DATA PREPARATION: Preprocessing

Data reduction

Sampling

Dimensionality reduction

Feature subset selection

Feature creation

Data transformation

Variable transformation

Scaling

Sorting

Slide3

DATA PREPARATION: Preprocessing

Mathematical computation

Normalization

Stationarity

Statistical computation

Mean

Median

Mode

Midrange

Variance, standard deviation, range

Weighted mean

Slide4

DATA PREPARATION: Data cleaning

Missing data

Improper data

Detection and handling of outliers

Handling noise

Slide5

DATA PREPARATION: Data cleaning

Need to determine how to handle missing values

Why are the values missing?

Is there significance in the fact that particular values are missing?

Need to determine how to handle inaccurate values

How to identify and handle outliers

Typographic errors (e.g., transposition errors)

Measurement errors

Duplicate values

Noise

Slide6

DATA PREPARATION: Data cleaning

Need to determine how to handle irrelevant data

Why is the data deemed irrelevant? Is the data truly irrelevant?

Need to determine how to handle data timeliness

How important is the age of the data?

Slide7

DATA PREPARATION: Data Cleaning

Possible handling methods

Eliminate data instances

Eliminate data attributes

Estimate missing values – interpolation

Ignore missing values during analysis

Identifying inconsistent values during collection

Check digits

Smoothing data
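
The interpolation and check-digit items above can be illustrated with a minimal Python sketch. The slides do not name a specific interpolation rule or check-digit scheme, so linear interpolation and the Luhn algorithm are assumptions chosen for illustration, and the function names `interpolate_missing` and `luhn_is_valid` are hypothetical.

```python
from typing import List, Optional


def interpolate_missing(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior None gaps by linear interpolation between known neighbours.

    Leading and trailing gaps have only one known neighbour, so they stay None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / gap
            filled[i] = values[left] + fraction * (values[right] - values[left])
    return filled


def luhn_is_valid(number: str) -> bool:
    """Return True if a string of digits passes the Luhn check-digit test."""
    total = 0
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:   # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

For example, `luhn_is_valid("79927398713")` is true, while swapping the last two digits (`"79927398731"`) fails the check, which is how a check digit can catch a transposition error at collection time.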

Slide8

DATA PREPARATION: Data Cleaning

Possible handling methods

Remove duplicate data

Caution: are duplicate instances errors, or are they separate instances that happen to have identical values? Machine learning tools can give different results when data is repeated.

Remove irrelevant data

Remove dated data

Weight data by data age
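
Two of the handling methods above, removing duplicates and weighting data by age, might look like the following sketch. The exponential half-life weighting is an assumption (the slides do not say how age should be turned into a weight), and `drop_exact_duplicates`, `age_weight`, and `half_life_days` are hypothetical names.

```python
from typing import Hashable, Iterable, List


def drop_exact_duplicates(rows: Iterable[Hashable]) -> List[Hashable]:
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def age_weight(age_days: float, half_life_days: float = 30.0) -> float:
    """Assumed weighting scheme: the weight halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)
```

Whether to call `drop_exact_duplicates` at all depends on the question raised on the slide: if identical rows are genuinely separate observations, dropping them discards real information.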

Slide9

DATA PREPARATION: Discretization

Binning

Equal-frequency interval binning

Equal-width interval binning

Clustering

K-means clustering

Hierarchical methods

Binarization

Entropy-based discretization

Discretization of multiple variables
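
The two binning schemes and binarization listed above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the tie handling in `equal_frequency_bins` (equal values may land in different bins) is a simplifying assumption.

```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:                       # all values identical: use a single bin
        return [0] * len(values)
    # The maximum sits on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Assign values to n_bins bins that hold (nearly) equal numbers of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        labels[index] = rank * n_bins // len(values)
    return labels


def binarize(values: Sequence[float], threshold: float) -> List[int]:
    """Map each value to 1 if it exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]
```

On the skewed sample `[1, 2, 3, 4, 5, 100]` with three bins, equal-width binning puts the first five values in bin 0 and leaves the middle bin empty, while equal-frequency binning places two values in each bin.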

Slide10

DATA PREPARATION: Data integration

Aggregation

Smoothing data

Averaging data
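
As a rough illustration of aggregation and smoothing by averaging, the sketch below computes a per-group mean and a sliding-window moving average. Both are assumed examples of what the slide has in mind; `mean_by_group` and `moving_average` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def mean_by_group(pairs: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Aggregate (key, value) records into one mean value per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Smooth a sequence with a sliding window of the given size.

    Returns len(values) - window + 1 averages, one per full window.
    """
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```

A larger `window` smooths more aggressively but also shortens the output and blurs genuine short-term changes.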

Slide11

DATA PREPARATION: Data reduction

Sampling

Why sample?

Sampling techniques

Simple random sampling

With replacement

Without replacement

Stratified sampling

Sample size
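
The sampling techniques above map onto Python's standard `random` module roughly as follows. The population, the seed, and the `stratified_sample` helper (with its proportional allocation and minimum of one item per stratum) are assumptions made for this sketch.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence, TypeVar

T = TypeVar("T")

rng = random.Random(0)                  # fixed seed so the sketch is repeatable
population = list(range(100))

with_replacement = rng.choices(population, k=10)    # may contain repeats
without_replacement = rng.sample(population, k=10)  # all items distinct


def stratified_sample(items: Sequence[T], key: Callable[[T], Hashable],
                      fraction: float, rng: random.Random) -> List[T]:
    """Draw the same fraction from every stratum, without replacement."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))   # keep small strata represented
        sample.extend(rng.sample(members, k))
    return sample
```

With 90 items labelled `"a"` and 10 labelled `"b"`, a 10% stratified sample contains nine `"a"` items and one `"b"` item, whereas a simple random sample of the same size could miss the minority group entirely.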

Slide12

DATA PREPARATION: Data reduction

Dimensionality reduction

Curse of dimensionality

Projection into lower dimensions

Principal components analysis

Singular value decomposition

Feature subset selection

Remove redundant features

Remove irrelevant features

Feature creation
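
To make "projection into lower dimensions" concrete, the sketch below estimates the first principal component with power iteration on the sample covariance matrix and projects the data onto it. This is a simplified stand-in for a full principal components analysis, not a production routine; the function names and the fixed starting vector are assumptions.

```python
import math
from typing import List, Sequence


def first_principal_component(rows: Sequence[Sequence[float]],
                              iterations: int = 200) -> List[float]:
    """Estimate the leading principal direction by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Caveat: a start vector exactly orthogonal to the leading direction
    # would not converge to it; a random start avoids this in practice.
    vector = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[a][b] * vector[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:                   # degenerate case: direction is undefined
            return vector
        vector = [x / norm for x in product]
    return vector


def project_onto(rows: Sequence[Sequence[float]],
                 direction: Sequence[float]) -> List[float]:
    """Center the rows, then return each row's coordinate along direction."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [sum((row[j] - means[j]) * direction[j] for j in range(d))
            for row in rows]
```

For points lying on the line y = 2x, the estimated direction is proportional to (1, 2), and `project_onto` reduces each two-dimensional point to a single coordinate along that line.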

Slide13

DATA PREPARATION: Data transformation

Variable transformation

Scaling

Sorting

Normalization
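
Scaling and normalization are often implemented as min-max scaling and z-score standardization. The slides do not say which variants are meant, so the two functions below are assumed examples with hypothetical names.

```python
import statistics
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                        # constant column: no spread to rescale
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values: Sequence[float]) -> List[float]:
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range; z-scores are less so, but they do not bound the result to a fixed interval.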

Slide14

DATA PREPARATION: Mathematical Computation

Normalization

Stationarity

Time series – mean and variance are constant

Statistical computation

Mean

Median

Mode

Midrange

Variance, standard deviation, range

Weighted mean
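
The statistics listed above are available in, or easily built from, Python's standard `statistics` module. The sample data is made up for illustration, and `weighted_mean` and `split_half_stats` are hypothetical helpers. The split-half comparison is only a crude heuristic for the "mean and variance are constant" idea, not a formal stationarity test.

```python
import statistics
from typing import List, Sequence, Tuple

data = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "midrange": (min(data) + max(data)) / 2,
    "range": max(data) - min(data),
    "variance": statistics.pvariance(data),   # population variance
    "std_dev": statistics.pstdev(data),       # population standard deviation
}


def weighted_mean(values: Sequence[float], weights: Sequence[float]) -> float:
    """Mean in which each value counts in proportion to its weight."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


def split_half_stats(series: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (mean, variance) for the first and second half of a series."""
    middle = len(series) // 2
    halves = (series[:middle], series[middle:])
    return [(statistics.fmean(h), statistics.pvariance(h)) for h in halves]
```

If the two halves returned by `split_half_stats` have clearly different means or variances, as they do for a steadily rising series, the series is unlikely to be stationary.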

Slide15

DATA PREPARATION: Measures of Similarity and Dissimilarity

Euclidean distance

Direction cosines

Simple matching and Jaccard coefficients

Tanimoto measure (set similarity)

Hamming distance

Edit distance

Probability-based distances

Mahalanobis distance
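
Several of the measures above have short, direct implementations. The sketch below covers Euclidean distance, cosine similarity (assumed here to be the angle-based measure that "direction cosines" refers to), the simple matching and Jaccard coefficients for binary vectors, Hamming distance, and edit (Levenshtein) distance. Function names are hypothetical, and Mahalanobis distance is omitted because it needs a covariance matrix inverse.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def simple_matching(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions where two binary vectors agree (0-0 or 1-1)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard(a: Sequence[int], b: Sequence[int]) -> float:
    """Like simple matching, but 0-0 agreements are ignored."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 1.0   # convention assumed when no 1s occur


def hamming(a: Sequence, b: Sequence) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

For the binary vectors `[1, 0, 0, 1, 1]` and `[1, 1, 0, 0, 1]`, simple matching gives 0.6 and Jaccard gives 0.5; the gap comes from the shared zero, which only simple matching counts as agreement.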

Slide16

DATA PREPARATION: Choosing Similarity Measures

Problem specific

Data dependent

Domain knowledge

Purpose

Metric properties?

Positivity

Symmetry

Triangle Inequality
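
The three metric properties can be spot-checked numerically on a handful of sample points. The sketch below is a hypothetical helper: passing on a sample does not prove that a measure is a metric, but a single failure shows that it is not. Squared Euclidean distance is used as an assumed counterexample.

```python
import math
from itertools import product


def check_metric_properties(points, distance, tol=1e-9):
    """Test positivity, symmetry and the triangle inequality on sample points."""
    report = {"positivity": True, "symmetry": True, "triangle_inequality": True}
    for a, b in product(points, repeat=2):
        d = distance(a, b)
        # Positivity: d >= 0, and d == 0 exactly when the two points coincide.
        if d < -tol or (abs(d) <= tol) != (a == b):
            report["positivity"] = False
        if abs(d - distance(b, a)) > tol:
            report["symmetry"] = False
    for a, b, c in product(points, repeat=3):
        if distance(a, c) > distance(a, b) + distance(b, c) + tol:
            report["triangle_inequality"] = False
    return report


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


points = [(0.0,), (1.0,), (2.0,)]
euclidean_report = check_metric_properties(points, math.dist)
squared_report = check_metric_properties(points, squared_euclidean)
```

On these three collinear points, Euclidean distance passes all three checks, while squared Euclidean distance fails the triangle inequality: the squared distance from 0 to 2 is 4, which exceeds 1 + 1.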