Presentation Transcript

Slide1

Explore-by-Example: An Automatic Query Steering Framework for Interactive Data Exploration

By Kyriaki Dimitriadou, Olga Papaemmanouil and Yanlei Diao

Slide2

Agenda

Introduction to AIDE
Data exploration
IDE: interactive data exploration
AIDE: automated interactive data exploration

Machine learning
Supervised learning: decision trees
Unsupervised learning: k-means
Measuring accuracy

AIDE framework
AIDE model: data classification, query formulation, space exploration
Relevant object discovery
Misclassified exploitation
Boundary exploitation
AIDE model summary

Conclusions

Slide3

WHAT IS AIDE?

Automated interactive data exploration

Slide4

Explore data to find an apartment

BUT MOM I DON'T WANT TO MOVE!

Slide5

Explore data to find an apartment

Slide6

Explore data to find an apartment

Slide7

Data Exploration

Data exploration is the first step in data analysis and typically involves summarizing the main characteristics of a dataset. It is commonly conducted using visual analytics tools, but can also be done in more advanced statistical software, such as R.

Slide8

IDE: Interactive data exploration

Slide9

AIDE: Automated interactive data exploration

An Automatic Interactive Data Exploration framework that iteratively steers the user towards interesting areas and "predicts" a query that retrieves his objects of interest.

AIDE integrates machine learning and data management techniques to provide effective data exploration results (matching the user's interest with high accuracy) as well as highly interactive performance.

Slide10

What is machine learning?

Slide11

What is machine learning?

One definition: “Machine learning is the semi-automated extraction of knowledge from the data”

Knowledge from data:

Starts with a question that might be answerable using data

Automated extraction:

A computer provides the insights

Semi-automated: requires many smart decisions by a human

Slide12

Two main categories of machine learning

Supervised learning:

Making predictions using data

Example: is a given email “spam” or “ham”?

There is an outcome we are trying to predict

Unsupervised learning:

Extracting structure from data

Example: segment grocery store shoppers into clusters that exhibit similar behavior

There is no "right answer"

Slide13

Supervised learning

High-level steps of supervised learning:

First, train a machine learning model using labeled data

"Labeled data" has been labeled with the outcome

The "machine learning model" learns the relationship between the attributes of the data and the outcome

Then, make predictions on new data for which the label is unknown

Slide14

Supervised learning

The primary goal of supervised learning is to build a model that "generalizes": it accurately predicts the future rather than the past!

Slide15

Supervised learning

The primary goal of supervised learning is to build a model that "generalizes": it accurately predicts the future rather than the past!

        X1             X2   X3
Mail1   "Hello.."      29   1
Mail2   "Dear…"        17   3
Mail3   "Check out.."  58   1

Slide16

Supervised learning

The primary goal of supervised learning is to build a model that "generalizes": it accurately predicts the future rather than the past!

        Y
Mail1   Ham
Mail2   Spam
Mail3   Ham

Slide17

Supervised learning

The primary goal of supervised learning is to build a model that "generalizes": it accurately predicts the future rather than the past!

Slide18

Decision tree classifier

Labeled data

Main idea:

Form a binary tree

Minimize the error in each leaf

Slide19

Decision tree classifier

[Figure: a decision tree splitting labeled samples (Y/N) on attribute thresholds; the fractions shown (e.g. 0.8) are leaf label proportions.]

Slide20

Decision tree classifier

[Figure: the decision tree diagram, continued.]

Slide21

Decision tree classifier

[Figure: the decision tree diagram, continued.]

Slide22

How does a decision tree really work?

Initial error: 0.2

After the split: 0.5 · 0.4 + 0.5 · 0 = 0.2

Is this a good split?

[Figure: two candidate partitions of the labeled samples (labels 1 and 0).]

Slide23
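The arithmetic on this slide can be checked in a few lines of Python (the specific label counts below are my reconstruction of the slide's example, not taken from the paper):

```python
def classification_error(labels):
    """Fraction of samples misclassified when the node predicts the majority label."""
    q = sum(labels) / len(labels)  # fraction of positive labels
    return min(q, 1 - q)

# Hypothetical node with 8 positives and 2 negatives, split into two
# equal-size children (numbers chosen to match the slide's arithmetic).
parent = [1] * 8 + [0] * 2
left, right = [1] * 3 + [0] * 2, [1] * 5

before = classification_error(parent)  # 0.2
after = 0.5 * classification_error(left) + 0.5 * classification_error(right)
# 0.5 * 0.4 + 0.5 * 0 = 0.2 -> the raw error did not decrease
```

This is exactly why the slide asks whether the split is good: the plain classification error cannot distinguish it from no split at all.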

How does a decision tree really work?

Selecting predicates - splitting criteria

We use a potential function val(·) to guide our selection:

Every change is an improvement. We achieve this by using a strictly concave function.

The potential is symmetric around 0.5, namely val(q) = val(1 − q).

Zero potential means perfect classification; this implies val(0) = val(1) = 0.

We have val(0.5) = 0.5 and val(T) ≥ error(T).

Minimizing val(T) upper-bounds the error!

Slide24

How does a decision tree really work?

Splitting criteria:

Gini index: G(q) = 2q(1 − q)

Before the split we have G(0.8) = 2 · 0.8 · 0.2 = 0.32

After the split we have 0.5 · G(0.6) + 0.5 · G(1) = 0.5 · 2 · 0.6 · 0.4 = 0.24

Slide25
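The Gini computation above can be verified directly (a small sketch; `gini` is just the slide's G(q) = 2q(1 − q)):

```python
def gini(q):
    """Gini index G(q) = 2q(1 - q) for a node whose positive-label fraction is q."""
    return 2 * q * (1 - q)

before = gini(0.8)                         # 2 * 0.8 * 0.2 = 0.32
after = 0.5 * gini(0.6) + 0.5 * gini(1.0)  # 0.5 * 0.48 + 0 = 0.24
# The split reduces the Gini potential from 0.32 to 0.24 even though the
# raw classification error (0.2) is unchanged, so the split is accepted.
```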

Comments on the decision tree method

Strengths:

Easy to use and understand

Produces rules that are easy to interpret and implement

Variable selection and reduction is automatic

Does not require the assumptions of statistical models

Can work without extensive handling of missing data

Weaknesses:

May not perform well where there is structure in the data that is not well captured by horizontal or vertical splits

Since the process deals with one variable at a time, there is no way to capture interactions between variables

Trees must be pruned to avoid over-fitting the training data

Slide25

Unsupervised learning

High-level steps of unsupervised learning:

Also called clustering; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing

Organizing data into classes such that there is:

High intra-class similarity

Low inter-class similarity

Finding the class labels and the number of classes directly from the data (in contrast to classification)

More informally, finding natural groupings among objects.

Slide26

What is a natural grouping among these objects?

Slide27

What is a natural grouping among these objects?

Clustering is subjective: the same objects can be grouped as school employees vs. the Simpson's family, or as males vs. females

Slide29

What is similarity?

"The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)

Similarity is hard to define, but we know it when we see it.

The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.

Slide30

Defining distance measures

Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2)

[Figure: example distances, e.g. between the names Peter and Piotr.]

Slide31

Intuition behind desirable distance properties

D(A,B) = D(B,A)  (Symmetry)

Otherwise you could claim "Alex looks more like Bob than Bob looks like Alex."

D(A,B) = 0 iff A = B  (Positivity / Separation)

Otherwise there are objects in your world that are different, but you cannot tell them apart.

D(A,B) ≤ D(A,C) + D(B,C)  (Triangle Inequality)

Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."

Slide32

The k-means algorithm

1. Decide on a value for k

2. Initialize the k cluster centers (randomly, if necessary)

3. Decide the class memberships of the N objects by assigning them to the nearest cluster center

4. Re-estimate the k cluster centers, assuming the memberships found above are correct

5. If none of the N objects changed membership in the last iteration, exit; otherwise go to step 3

Slide33
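The loop above can be sketched in a few lines of Python (a minimal illustration assuming 2-D points and Euclidean distance; not the implementation used in AIDE):

```python
import random

def kmeans(points, k, max_iters=100):
    """Minimal k-means on 2-D points using squared Euclidean distance."""
    centers = random.sample(points, k)  # step 2: initialize centers at random points
    assignment = [None] * len(points)
    for _ in range(max_iters):
        # Step 3: assign each object to the nearest center
        new_assignment = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Step 5: stop when no membership changed
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: re-estimate each center as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return centers, assignment
```

On two well-separated blobs this converges to the obvious clustering; as the next slides note, it may stop at a local optimum in general.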

K-means clustering: Step 1

Algorithm: k-means; distance metric: Euclidean distance

[Figure: scatter plot with three initial cluster centers k1, k2, k3.]

Slide34

K-means clustering: Step 2

Algorithm: k-means; distance metric: Euclidean distance

[Figure: k-means iteration with centers k1, k2, k3.]

Slide35

K-means clustering: Step 3

Algorithm: k-means; distance metric: Euclidean distance

[Figure: k-means iteration with centers k1, k2, k3.]

Slide36

K-means clustering: Step 4

Algorithm: k-means; distance metric: Euclidean distance

[Figure: k-means iteration with centers k1, k2, k3.]

Slide37

K-means clustering: Step 5

Algorithm: k-means; distance metric: Euclidean distance

[Figure: the converged clustering with centers k1, k2, k3.]

Slide38

Comments on the k-means method

Strengths:

Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n

Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

Weaknesses:

Applicable only when a mean is defined; what about categorical data?

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers

Not suitable for discovering clusters with non-convex shapes

Slide39

Measure accuracy

Precision is the fraction of retrieved instances that are relevant

Recall is the fraction of relevant instances that are retrieved

Slide40

Measure accuracy: F-score

The F score can be interpreted as a weighted average of precision and recall:

F = 2 · (precision · recall) / (precision + recall)

The F score reaches its best value at 1 and worst at 0.

Slide41
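Precision, recall, and the F-score can be computed as follows (a minimal sketch; the sets of retrieved and relevant ids are invented for illustration):

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for sets of object ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# e.g. 3 of the 4 retrieved objects are relevant, and 3 of the 6 relevant
# objects were retrieved:
p, r, f = precision_recall_f1({1, 2, 3, 9}, {1, 2, 3, 4, 5, 6})
# p = 0.75, r = 0.5, f = 0.6
```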

Question about machine learning

How do I choose which attributes of my data to include in the model?

How do I choose which model to use?

How do I optimize this model for best performance?

How do I ensure that I’m building a model that will generalize to unseen data?

Can I estimate how well my model is likely to perform on unseen data?

Slide42

Back to AIDE…

Slide43

How does AIDE work?

A framework that automatically "steers" the user towards data areas relevant to his interest

In AIDE, the user engages in a "conversation" with the system, indicating his interests, while in the background the system automatically formulates and processes queries that collect data matching the user's interest

Slide44

AIDE framework

Label data samples

Train a decision tree classifier

Identify promising sampling areas

Retrieve the next sample set from the DB

Slide45

AIDE challenges

AIDE operates on the unlabeled data space that the user aims to explore

To achieve a desirable interactive experience, AIDE needs not only to provide accurate results, but also to minimize the number of samples presented to the user (which determines the amount of user effort)

Trade-off between quality of results (accuracy) and efficiency (the total exploration time, which includes the total sample-reviewing time and the wait time of the user)

Slide46

Assumptions

Prediction of linear patterns: user interests are captured by range queries

Binary, non-noisy relevance system, where the user indicates whether a data object is relevant or not, and this categorization cannot be modified in subsequent iterations

Categorical and numerical features

Slide47

Data classification

Decision tree classifier to identify linear patterns of user interest

Decision tree advantages:

Easy to interpret

Perform well with large data

Easy mapping to queries that retrieve the relevant data objects

Can handle both numerical and categorical data

Slide48

Query formulation

Let us assume a decision tree classifier that predicts relevant and irrelevant clinical-trial objects based on the attributes age and dosage

Slide49

Query formulation

SELECT *
FROM table
WHERE (age ≤ 20 AND dosage > 10 AND dosage ≤ 15)
   OR (age > 20 AND age ≤ 40 AND dosage ≥ 0 AND dosage ≤ 10)

Slide50
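The mapping from decision-tree paths to a disjunctive query can be sketched as follows (my own illustration of the idea; `paths_to_sql` and the predicate-tuple format are not part of AIDE's actual implementation):

```python
# Each path to a "relevant" leaf becomes one conjunctive clause;
# the clauses are OR-ed into a disjunctive query.
def paths_to_sql(paths):
    """paths: list of paths, each a list of (attribute, op, value) predicates."""
    clause = " OR ".join(
        "(" + " AND ".join(f"{a} {op} {v}" for a, op, v in path) + ")"
        for path in paths
    )
    return f"SELECT * FROM table WHERE {clause}"

# The two relevant leaves from the slide's tree:
query = paths_to_sql([
    [("age", "<=", 20), ("dosage", ">", 10), ("dosage", "<=", 15)],
    [("age", ">", 20), ("age", "<=", 40), ("dosage", ">=", 0), ("dosage", "<=", 10)],
])
```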

Space exploration overview

The focus is on optimizing the effectiveness of the exploration while minimizing the number of samples presented to the user

The goal is to discover relevant areas and formulate user queries that select either a single relevant area (conjunctive queries) or multiple ones (disjunctive queries)

Three exploration phases:

Relevant object discovery
Misclassified exploitation
Boundary exploitation

Slide51

Phase one: relevant object discovery

Focus on collecting samples from yet-unexplored areas and identifying single relevant objects

This phase aims to discover relevant objects by showing the user samples from diverse data areas

To maximize the coverage of the exploration space, it follows a well-structured approach that allows AIDE to:

ensure that the exploration space is explored widely

keep track of the already-explored sub-areas

explore different data areas at different granularities

Slide52

Phase one: relevant object discovery

[Figure: the exploration space over Attribute A and Attribute B divided into grid cells at increasing granularity (LEVEL 1, LEVEL 2, LEVEL 3); with d features and a given granularity, each level defines a set of grid cells.]

Slide53
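The grid decomposition can be sketched as follows (an illustration under my own assumptions: a normalized [0, 1]^d space with g cells per dimension; not AIDE's actual code):

```python
from itertools import product

def cell_centers(d, g):
    """Yield the center of each of the g**d grid cells in [0, 1]^d."""
    step = 1.0 / g
    for idx in product(range(g), repeat=d):
        yield tuple((i + 0.5) * step for i in idx)

centers = list(cell_centers(2, 2))
# 4 cells: (0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)
```

Increasing g moves the exploration to a finer level, which is how different areas can be explored at different granularities.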

Phase one: relevant object discovery

Algorithm: retrieve a single random object within a given distance along each dimension from the center of each grid cell

Slide54

Phase one: relevant object discovery

Optimizations:

Hint-based object discovery: specific attribute ranges on which the user desires to focus

Skewed attributes: use the k-means algorithm to partition the data space into k clusters; database objects are assigned to the cluster with the closest centroid

Slide55

Phase two: misclassified exploitation

The goal is to discover relevant areas, as opposed to single objects

This phase strategically increases the number of relevant objects in the training set such that the predicted queries will select relevant areas

Designed to increase both the precision and recall of the final query

Strives to limit the number of extraction queries, and hence the time overhead of this phase

Slide56

Phase two: misclassified exploitation

Generation of misclassified samples:

Assume a decision tree classifier Ci is generated on the i-th iteration

This phase leverages the misclassified samples to identify the next set of sampling areas in order to discover more relevant areas

It addresses the lack of relevant samples by collecting more objects around false negatives

Slide57

Phase two: misclassified exploitation

"Naïve" algorithm: collect samples around each false negative to obtain more relevant samples

Very successful in identifying relevant areas

High time cost: executes one retrieval query per misclassified object

Often redundantly samples highly overlapping areas, spending resources (i.e., user labeling effort) without increasing AIDE's accuracy much

If k iterations are needed to identify a relevant area, the user might have labeled many samples without improving the F-measure

Slide58

Phase two: misclassified exploitation

Clustering-based exploitation algorithm:

Create clusters using the k-means algorithm and have one sampling area per cluster; sample around each cluster

In each iteration i, the algorithm sets k to be the overall number of relevant objects discovered in the object discovery phase

We run the clustering-based exploitation only if k is less than the number of false negatives

Experimental results showed that f should be set to a small number (10-25 samples)

Slide59

Phase three: boundary exploitation

Slide60

Phase three: boundary exploitation

Given a set of relevant areas identified by the decision tree classifier, this phase aims to refine these areas by incrementally adjusting their boundaries

This gives a better characterization of the user's interests, i.e., higher accuracy of our final results

This phase has the smallest impact on the effectiveness of our model: not discovering a relevant area can reduce our accuracy more than a partially discovered relevant area with imprecise boundaries. Hence, we constrain the number of samples used during this phase and aim to distribute an equal amount of user effort to refining each boundary

Slide61

Phase three: boundary exploitation

Algorithm:

Input: the number of samples, the k d-dimensional relevant areas, and the number of boundaries

Collect random samples within a distance ±x from each boundary

Slide62
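The sampling step can be sketched as follows (a hypothetical illustration: the per-dimension `(lo, hi)` area format and the `boundary_samples` helper are my own, not AIDE's API):

```python
import random

def boundary_samples(area, dim, side, x, n, seed=None):
    """Sample n points near one boundary of a d-dimensional relevant area.

    area: list of (lo, hi) ranges, one per dimension.
    dim, side: which boundary to refine ('lo' or 'hi' end of dimension dim).
    Each sample lies within ±x of that boundary along `dim`, and uniformly
    inside the area in the remaining dimensions.
    """
    rng = random.Random(seed)
    b = area[dim][0] if side == "lo" else area[dim][1]
    samples = []
    for _ in range(n):
        point = [rng.uniform(lo, hi) for lo, hi in area]
        point[dim] = rng.uniform(b - x, b + x)  # stay within ±x of the boundary
        samples.append(tuple(point))
    return samples
```

Labeling such samples tells the classifier whether the boundary should be pushed outward or pulled inward along that dimension.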

Phase three: boundary exploitation

Optimizations:

Adaptive sample size: dynamically adapts the number of samples collected

d - dimensionality of the exploration space

The percentage of change of boundary j between the (i − 1)-th and i-th iterations is calculated as the difference of the boundary's normalized values in the specific dimension

er - an error variable to cover cases where the boundary is not modified but also not accurately predicted

Slide63

Phase three: boundary exploitation

Optimizations:

Non-overlapping sampling areas: when the exploration areas do not evolve significantly between iterations, sampling them again is redundant and increases exploration cost (e.g., user effort) without improving classification accuracy

Slide64

Phase three: boundary exploitation

Optimizations:

Identifying irrelevant attributes: domain sampling around the boundaries. While shrinking/expanding one dimension of a relevant area, collect random samples over the whole domain of the remaining dimensions

Slide65

Phase three: boundary exploitation

Optimizations:

Exploration on sampled datasets: generate a randomly sampled database and extract our samples from the smaller sampled dataset

This optimization can be used for both the misclassified and the boundary exploitation phases

Sampled data sets are generated using a simple random sampling approach that picks each tuple with the same probability

Slide66

AIDE model summary

Initial sample acquisition

The iterative steering process starts when the user provides his feedback:

Data classification (domain experts could restrict the attribute set on which the exploration is performed)

Data extraction query

Space exploration: relevant object discovery, misclassified exploitation, boundary exploitation

Sample extraction

Query formulation

Slide67

Conclusions

AIDE assists users in discovering new interesting data patterns and eliminates expensive ad-hoc exploratory queries

AIDE relies on a seamless integration of classification algorithms and data management optimization techniques that collectively strive to accurately learn the user's interests based on his relevance feedback on strategically collected samples

Our techniques minimize the number of samples presented to the user (which determines the amount of user effort) as well as the cost of sample acquisition (which amounts to the user wait time)

It provides interactive performance, as it limits the user wait time per iteration of exploration to less than a few seconds

Slide68

Any Questions?

Slide69

And now for real..

https://www.youtube.com/watch?v=1BwIw_t_J_4