
Data Mining Concepts - PowerPoint Presentation

natalia-silvester
Uploaded On 2016-08-31

Presentation Transcript

Slide1

Data Mining Concepts

Emre Eftelioglu

Slide2

What is Knowledge Discovery in Databases?

Data mining is actually one step of a larger process known as knowledge discovery in databases (KDD).

The KDD process model consists of six phases:

1. Data selection
2. Data cleansing
3. Enrichment
4. Data transformation
5. Data mining
6. Reporting and displaying discovered knowledge

Figure: KDD process diagram with the stages Data Selection, Data Cleansing & Enrichment, Data Transformation, Data Mining, and Reporting, drawing on a Database and Data Warehouses.

Slide3

Data Warehouse

A subject-oriented, integrated, non-volatile, time-variant collection of data in support of management’s decisions.

- Used to understand and improve the performance of an organization.
- Designed for query and data retrieval, not for transaction processing.
- Contains historical data derived from transaction data, but can include data from other sources.
- Data is consolidated, aggregated and summarized.
- Underlying engine used by business intelligence environments, which generate reports from it.

Slide4

Why Is Data Mining Needed?

Data is large scale, high dimensional, heterogeneous, complex and distributed, and useful information has to be found in this data.

Commercial Perspective

- Sales can be increased.
- Lots of data is in hand but not interpreted.
- Computers are cheap and powerful.
- There is competition between companies.
- Better service, more sales.
- Easy access for customers.
- Customized user experience.
- Etc.

Scientific Perspective

- Data is collected and stored at enormous speeds.
- Traditional techniques are infeasible.
- Data mining improves and eases the work of scientists.

Overall

- There is often information “hidden” in the data that is not readily evident.
- Even for small datasets, human analysts may take weeks to discover useful information.

Slide5

What is the Goal of Data Mining?

Prediction: Determine how certain attributes will behave in the future.
Example: Credit request by a bank customer. By analyzing credit card usage, a customer’s buying capacity can be predicted.

Identification: Identify the existence of an item, event, or activity.
Example: Scientists are trying to identify life on Mars by analyzing different soil samples.

Classification: Partition data into classes or categories.
Example: A company can grade its employees by classifying them by their skills.

Optimization: Optimize the use of limited resources.
Example: UPS avoids left turns in order to save the gas burned while idling.

Slide6

What is Data Mining?

Definition: Non-trivial extraction of implicit, previously unknown and potentially useful information from data.

Another definition: Discovering new information in terms of patterns or rules from vast amounts of data.

Data Mining: "Torturing data until it confesses ... and if you torture it enough, it will confess to anything" (Jeff Jonas, IBM)

Data Mining - an interdisciplinary field

Which disciplines does Data Mining use?

- Databases
- Statistics
- High Performance Computing
- Machine Learning
- Visualization
- Mathematics

Slide7

What is Not Data Mining?

- Using “Google” to check the price of an item.
- Getting the statistical details about the historical price change of an item (e.g. max price, min price, average price).
- Checking the sales of an item by color. Example: green item sales: 500, blue item sales: 1000, etc.

So what would be a data mining question?

- How many items can we sell if we produce the same item in Red color?

- How will the sales change if we make a discount on the price?

- Is there any association between the sales of item X and item Y?

Why Is Data Mining Not Statistical Analysis?

- Interpretation of results is difficult and daunting
- Requires expert user guidance
- Ill-suited for Nominal and Structured Data Types
- Completely data driven - incorporation of domain knowledge not possible

Slide8

What is the Difference between DBMS, OLAP and Data Mining?

Task

- DBMS: Extraction of detailed data
- OLAP – Data Warehouse: Summaries, Trends, Reports
- Data Mining: Knowledge Discovery of Hidden Patterns, Insights

Result

- DBMS: Information
- OLAP – Data Warehouse: Analysis
- Data Mining: Insight and Future Prediction

Method

- DBMS: Deduction (ask the question, verify the data)
- OLAP – Data Warehouse: Model the data, aggregate, use statistics
- Data Mining: Induction (build the model, apply to new data, get the result)

Example

- DBMS: Who purchased the Apple iPhone 6 so far?
- OLAP – Data Warehouse: What is the average income of iPhone 6 buyers by region and month?
- Data Mining: Who will buy the new Apple Watch when it is in the market?

Slide9

What types of Knowledge can be revealed?

- Association Rule Discovery (descriptive)
- Clustering (descriptive)
- Classification (predictive)
- Sequential Pattern Discovery (descriptive)
- Patterns Within Time Series (predictive)

Slide10

Association Rule Mining

Association rules are frequently used to generate rules from market-basket data.

- A market basket corresponds to the set of items a consumer purchases during one visit to a supermarket.
- Create dependency rules to predict the occurrence of an item based on occurrences of other items.
- The set of items purchased by customers is known as an item set.
- An association rule is of the form X ⇒ Y, where X = {x1, x2, …, xn} and Y = {y1, y2, …, ym} are sets of items, with xi and yj being distinct items for all i and all j.
- For an association rule to be of interest, it must satisfy a minimum support and confidence.

Beer & Diapers as an urban legend:

- A father goes to the grocery store after work to get a big pack of diapers.
- When he buys diapers, he decides to get a six-pack as well.

Slide11

Association Rules - Confidence and Support

Support: The minimum percentage of instances in the database that contain all items listed in a given association rule.

- Support is the percentage of transactions that contain all of the items in the item set, LHS ∪ RHS.
- The rule X ⇒ Y holds with support s if s% of the transactions in the database contain X ∪ Y.

Confidence: The rule X ⇒ Y holds with confidence c if c% of the transactions that contain X also contain Y.

- Confidence can be computed as support(LHS ∪ RHS) / support(LHS).
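The two definitions above can be checked with a few lines of Python. This is a minimal sketch, assuming the slide's five example transactions (1: AB, 2: ABD, 3: ACD, 4: ABC, 5: BC); the function names `count`, `support` and `confidence` are illustrative and not part of the presentation.

```python
# The five example transactions from this slide (TID -> items).
transactions = {
    1: {"A", "B"},
    2: {"A", "B", "D"},
    3: {"A", "C", "D"},
    4: {"A", "B", "C"},
    5: {"B", "C"},
}

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for items in transactions.values() if wanted <= items)

def support(itemset):
    """Occurrence / total transactions."""
    return count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """Occurrence{LHS ∪ RHS} / Occurrence{LHS} for the rule LHS ⇒ RHS."""
    return count(set(lhs) | set(rhs)) / count(lhs)

print(support("AB"))         # {A, B} appears in transactions 1, 2 and 4
print(confidence("A", "B"))  # 3 of the 4 transactions with A also contain B
```

Working with raw counts for confidence avoids dividing two rounded percentages.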

Support = Occurrence / Total Transactions

TID   Items
1     AB
2     ABD
3     ACD
4     ABC
5     BC

Total Transactions = 5
Support {AB} = 3/5 = 60%
Support {BC} = 2/5 = 40%
Support {CD} = 1/5 = 20%
Support {ABC} = 1/5 = 20%

Given X ⇒ Y, Confidence = Occurrence{X ∪ Y} / Occurrence{X}

Using the same five transactions:

Confidence {A ⇒ B} = 3/4 = 75%
Confidence {B ⇒ C} = 2/4 = 50%
Confidence {C ⇒ D} = 1/3 = 33%
Confidence {AB ⇒ C} = 1/3 = 33%

Slide12

Association Rules - Apriori algorithm

A general algorithm for generating association rules is a two-step process.

1. Generate all item sets that have a support exceeding the given threshold. Item sets with this property are called large or frequent item sets.
2. Generate rules for each item set as follows: for item set X and Y (a subset of X), let Z = X – Y (set difference). If Support(X)/Support(Z) > minimum confidence, the rule Z ⇒ Y is a valid rule.

The Apriori algorithm was among the first algorithms used to generate association rules.

- The Apriori algorithm uses the general algorithm for creating association rules together with downward closure and anti-monotonicity.
- For a k-item set to be frequent, each and every one of its items must also be frequent.
- To generate a k-item set: use a frequent (k-1)-item set and extend it with a frequent 1-item set.

Downward Closure: A subset of a large item set must also be large.

Anti-monotonicity: A superset of a small item set is also small. This implies that the item set does not have sufficient support to be considered for rule generation.
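The two-step process can be sketched in Python as follows. This is an illustrative sketch rather than the presenter's implementation: the names `apriori` and `association_rules` are hypothetical, thresholds are treated as inclusive (`>=`), and the five transactions from the support and confidence slide are reused as input.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Step 1: return {item set: support} for every frequent item set."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    all_items = {item for t in transactions for item in t}
    singles = {frozenset([i]) for i in all_items
               if support(frozenset([i])) >= min_support}
    frequent, current, k = {}, singles, 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Extend each frequent (k-1)-item set with a frequent 1-item set.
        candidates = {s | one for s in current for one in singles if not one <= s}
        # Downward closure: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

def association_rules(frequent, min_confidence):
    """Step 2: for each frequent X and subset Y, emit Z => Y with Z = X - Y."""
    found = []
    for x, support_x in frequent.items():
        for size in range(1, len(x)):
            for y_items in combinations(x, size):
                y = frozenset(y_items)
                z = x - y
                conf = support_x / frequent[z]  # Support(X) / Support(Z)
                if conf >= min_confidence:
                    found.append((z, y, conf))
    return found

baskets = [{"A", "B"}, {"A", "B", "D"}, {"A", "C", "D"}, {"A", "B", "C"}, {"B", "C"}]
frequent = apriori(baskets, min_support=0.4)
found = association_rules(frequent, min_confidence=0.7)
for z, y, conf in found:
    print(sorted(z), "=>", sorted(y), round(conf, 2))
```

The lookup `frequent[z]` is safe because of downward closure: every subset of a frequent item set was itself recorded as frequent in an earlier pass.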

Slide13

Complications seen with Association Rules

- The cardinality of item sets in most situations is extremely large.
- Association rule mining is more difficult when transactions show variability in factors such as geographic location and seasons.
- Item classifications exist along multiple dimensions.
- Data quality is variable; data may be missing, erroneous, conflicting, as well as redundant.

Slide14

What types of Knowledge can be revealed?

- Association Rule Discovery (descriptive)
- Clustering (descriptive)
- Classification (predictive)
- Sequential Pattern Discovery (descriptive)
- Patterns Within Time Series (predictive)

Slide15

Clustering (1/2)

Motivation

- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Land use: Identification of areas of similar land use in an earth observation database.
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.
- City-planning: Identifying groups of houses according to their house type, value, and geographical location.
- Many more…

Slide16

Clustering (2/2)

Given a set of data points, find clusters such that:

Records in one cluster are highly similar to each other and dissimilar from the records in other clusters.

It is an unsupervised learning technique which does not require any prior knowledge of clusters.

Slide17

k-Means Algorithm (1/2)

The k-Means algorithm is a simple yet effective clustering technique.

- The algorithm clusters observations into k groups, where k is provided as an input parameter.
- It then assigns each observation to a cluster based upon the observation’s proximity to the mean center of the cluster.
- The cluster’s mean center is then recomputed and the process begins again.
- The algorithm stops when the mean centers do not change.

The objective is to minimize the error (distance).

Slide18

k-Means Algorithm (2/2)

1. Select initial k cluster centers.
2. Assignment Step: Assign each point in the dataset to the closest cluster, based upon the Euclidean distance between the point and each cluster center.
3. Update Step: Once all points are clustered, re-compute the cluster centers using the arithmetic mean of the coordinates of the points which belong to them.
4. Repeat steps 2-3 until the centers do not change (convergence).
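The four steps above can be sketched in plain Python. This is a minimal illustration under some assumptions: the function name `kmeans`, the `max_iter` safety cap, and the rule that an empty cluster keeps its previous center are not from the slides, and `math.dist` requires Python 3.8 or later.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Cluster `points` (coordinate tuples) around the given initial centers."""
    centers = [tuple(c) for c in centers]          # step 1: initial k centers
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Step 2 (assignment): each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3 (update): move each center to the mean of its members.
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centers[i]             # assumption: keep empty centers
            for i, members in enumerate(clusters)
        ]
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers, clusters = kmeans(points, centers=[(1, 1), (9, 1)])
print(centers)
```

Here the initial centers are supplied by the caller, matching the "user defined" centers in the slide's example; choosing them at random is a common alternative.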

Figure: Example of k-Means execution on a smiley-face dataset, showing the input dataset, the 3 user-defined initial centers, the alternating assignment and update steps, and the final output.

Slide19

k-Means Algorithm – Issues (1/2)

- The number of clusters k is an input parameter: an inappropriate choice of k may yield poor results.
- Initial centroid selection is important for the final result of the algorithm, since it converges to a local minimum.
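The effect of initial centroid selection can be reproduced on a tiny dataset. The sketch below is an assumption-laden illustration, not the slide's own example: four hypothetical points at the corners of a 4 x 1 rectangle are clustered with k = 2 from two different starting positions, and the sum of squared distances (`sse`, a helper name chosen here) is compared.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain k-means: assign to nearest center, recompute means, repeat."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

def sse(centers, clusters):
    """Sum of squared distances from each point to its cluster center."""
    return sum(math.dist(p, c) ** 2
               for c, members in zip(centers, clusters) for p in members)

# Hypothetical data: corners of a rectangle that is 4 wide and 1 tall.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

left_right = kmeans(points, [(0, 0), (4, 0)])   # starts near the short sides
top_bottom = kmeans(points, [(2, 0), (2, 1)])   # starts between the long sides

print(sse(*left_right))   # 1.0
print(sse(*top_bottom))   # 16.0
```

Both runs converge, but the second one stops at a local minimum with a much larger error, because its starting centers already equal the means of the points assigned to them.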

Figure: The same input data set clustered from two different choices of initial centroids (green/red), each producing a different output. Output changes with different initial selections of centers.

Slide20

k-Means Algorithm – Issues (2/2)

Density of the clusters may bias the results since k-Means depends on the Euclidean distance between points and centroids.

Figure: Input dataset.

Slide21

Classification vs. Clustering

- Classification is a supervised learning technique.
- Classification techniques learn a method for predicting the class of a data record from the pre-labeled (classified) records.
- Clustering is an unsupervised learning technique which finds clusters of records without any prior knowledge.
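The contrast can be made concrete with a small Python sketch. Everything here is illustrative and assumed: the records and labels are made up, `classify` is a 1-nearest-neighbour classifier (one simple supervised method among many), and `leader_clusters` is a simple leader-style grouping, used instead of k-Means only to keep the example short.

```python
import math

# Hypothetical pre-labeled records: (features, class label).
labeled = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]

def classify(record, training=labeled):
    """Supervised: predict the label of the closest pre-labeled record."""
    _, label = min(training, key=lambda pair: math.dist(record, pair[0]))
    return label

def leader_clusters(records, radius):
    """Unsupervised: group records near a cluster's first member; no labels used."""
    clusters = []
    for r in records:
        for members in clusters:
            if math.dist(r, members[0]) <= radius:
                members.append(r)
                break
        else:
            clusters.append([r])   # r starts a new cluster
    return clusters

print(classify((2.0, 1.0)))                       # uses the labels
unlabeled = [features for features, _ in labeled]
print(leader_clusters(unlabeled, radius=3.0))     # never sees the labels
```

The classifier returns a class name because it was given labeled examples; the clustering function can only return groups, and naming those groups is left to the analyst.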