Explore-by-Example: An Automatic Query Steering Framework for Interactive Data Exploration
By Kyriaki Dimitriadou, Olga Papaemmanouil and Yanlei Diao
Agenda
Introduction to AIDE
Data exploration
IDE: interactive data exploration
AIDE: automated interactive data exploration
Machine learning
Supervised learning: decision tree
Unsupervised learning: k-means
Measure accuracy
AIDE framework
AIDE model:
Data classification
Query formulation
Space exploration
Relevant object discovery
Misclassified exploitation
Boundary exploitation
AIDE model summary
Conclusions
WHAT IS AIDE?
Automated interactive data exploration
Explore data to find an apartment
BUT MOM I DON'T WANT TO MOVE!
[Three image slides: browsing apartment listings as a running example of data exploration]
Data Exploration
Data exploration is the first step in data analysis and typically involves summarizing the main characteristics of a dataset. It is commonly conducted using visual analytics tools, but can also be done in more advanced statistical software, such as R.
IDE: Interactive data exploration
AIDE: Automated interactive data exploration
An Automatic Interactive Data Exploration framework that iteratively steers the user towards interesting areas and "predicts" a query that retrieves his objects of interest.
AIDE integrates machine learning and data management techniques to provide effective data exploration results (matching the user's interest with high accuracy) as well as high interactive performance.
What is machine learning?
One definition: "Machine learning is the semi-automated extraction of knowledge from data."
Knowledge from data: starts with a question that might be answerable using data
Automated extraction: a computer provides the insights
Semi-automated: requires many smart decisions by a human
Two main categories of machine learning
Supervised learning: making predictions using data
Example: is a given email "spam" or "ham"?
There is an outcome we are trying to predict
Unsupervised learning: extracting structure from data
Example: segment grocery store shoppers into clusters that exhibit similar behavior
There is no "right answer"
Supervised learning
High-level steps of supervised learning:
First, train a machine learning model using labeled data
"Labeled data" is data that has been labeled with the outcome
The "machine learning model" learns the relationship between the attributes of the data and its outcome
Then, make predictions on new data for which the label is unknown
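As a minimal illustration of these two steps, here is a scikit-learn sketch using the Mail1-Mail3 example from the slides below; encoding each mail as two numeric attributes is our simplification, not part of the original deck:

```python
from sklearn.tree import DecisionTreeClassifier

# Step 1: train a model on labeled data. X_train holds the numeric attributes
# X2 and X3 of Mail1-Mail3 (the text attribute X1 would need encoding first);
# y_train holds the known outcomes (the labels).
X_train = [[29, 1], [17, 3], [58, 1]]
y_train = ["ham", "spam", "ham"]
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: predict the label of new data for which the label is unknown.
print(model.predict([[40, 2]]))  # -> ['ham'] or ['spam']
```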
Supervised learning
The primary goal of supervised learning is to build a model that "generalizes": it accurately predicts the future rather than the past!
Supervised learning
Example training data, with attributes X1-X3 and the known label Y:

        X1              X2    X3    Y
Mail1   "Hello.."       29    1     Ham
Mail2   "Dear…"         17    3     Spam
Mail3   "Check out.."   58    1     Ham
Decision tree classifier
Given labeled data, the main idea is to:
Form a binary tree
Minimize the error in each leaf
Decision tree classifier
[Figure, repeated over three slides: axis-aligned splits over two attributes partitioning data points labeled Y and N, with numeric split thresholds such as 1, 2, and 0.8]
How does a decision tree really work?
Initial error: 0.2
After the split: 0.5 · 0.4 + 0.5 · 0 = 0.2
Is this a good split? The error is unchanged even though one side of the split is now pure, so raw classification error is a poor splitting criterion; the next slide introduces a better one.
[Table: example objects with binary labels (1/0), before and after the candidate split]
How does a decision tree really work?
Selecting predicates - splitting criteria:
We use a potential function val(·) to guide our selection. It should satisfy:
Every change is an improvement; we achieve this by using a strictly concave function.
The potential is symmetric around 0.5, namely val(q) = val(1 − q).
Zero potential means perfect classification: val(0) = val(1) = 0.
val(0.5) = 0.5.
val(T) ≥ error(T), so minimizing val(T) upper-bounds the error!
How does a decision tree really work?
Splitting criterion - the Gini index: G(q) = 2q(1 − q)
Before the split we have G(0.8) = 2 · 0.8 · 0.2 = 0.32.
After the split we have 0.5 · G(0.6) + 0.5 · G(1) = 0.5 · 2 · 0.6 · 0.4 = 0.24.
The potential drops from 0.32 to 0.24, so by this criterion the split is a good one.
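A minimal Python sketch (the function name is ours) that reproduces the arithmetic above:

```python
def gini(q):
    """Gini impurity G(q) = 2q(1 - q), where q is the fraction of one class."""
    return 2 * q * (1 - q)

before = gini(0.8)                         # 2 * 0.8 * 0.2  = 0.32
after = 0.5 * gini(0.6) + 0.5 * gini(1.0)  # 0.5 * 0.48 + 0 = 0.24

print(before, after)  # ~0.32 ~0.24: the potential drops, so the split helps
```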
Comments on the decision tree method
Strengths:
Easy to use and understand
Produces rules that are easy to interpret and implement
Variable selection and reduction is automatic
Does not require the assumptions of statistical models
Can work without extensive handling of missing data
Weaknesses:
May not perform well where there is structure in the data that is not well captured by horizontal or vertical splits
Since the process deals with one variable at a time, there is no way to capture interactions between variables
Trees must be pruned to avoid over-fitting the training data
Unsupervised learning
High-level view of unsupervised learning:
Also called clustering; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing
Organizing data into classes such that there is high intra-class similarity and low inter-class similarity
Finding the class labels and the number of classes directly from the data (in contrast to classification)
More informally: finding natural groupings among objects
What is a natural grouping among these objects?
Clustering is subjective: the same objects can be grouped in different ways, e.g. School Employees vs. Simpson's Family, or Males vs. Females.
What is similarity?
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)
Similarity is hard to define, but "we know it when we see it."
The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
Defining distance measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).
[Figure: example distance values (0.23, 342.7, 3) between objects such as the names "Peter" and "Piotr"]
Intuition behind desirable distance properties
D(A,B) = D(B,A) (Symmetry)
Otherwise you could claim "Alex looks more like Bob than Bob looks like Alex."
D(A,B) = 0 iff A = B (Positivity / Separation)
Otherwise there are objects in your world that are different, but you cannot tell apart.
D(A,B) ≤ D(A,C) + D(B,C) (Triangle Inequality)
Otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
Algorithm: k-means
Goal:
1. Decide on a value for k.
2. Initialize the k cluster centers (randomly, if necessary).
3. Decide the class memberships of the N objects by assigning them to the nearest cluster center.
4. Re-estimate the k cluster centers, assuming the memberships found above are correct.
5. If none of the N objects changed membership in the last iteration, exit; otherwise go to step 3.
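A compact NumPy sketch of these steps; a simplification under the slide's assumptions (Euclidean distance, with empty clusters handled by keeping the old center):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch: X is an (N, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 2: init centers
    for _ in range(n_iters):
        # Step 3: assign each object to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: re-estimate each center as the mean of its members.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        # Step 5: stop once the centers (and hence memberships) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Example: three well-separated 2-D blobs.
blobs = np.vstack([np.random.default_rng(1).normal(c, 0.3, size=(50, 2))
                   for c in [(0, 0), (4, 4), (0, 4)]])
centers, labels = kmeans(blobs, k=3)
```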
K-means clustering: Steps 1-5
Algorithm: k-means; Distance metric: Euclidean distance
[Figures, one per step: k-means on 2-D data (both axes 0-5) with three cluster centers k1, k2, k3, from initial center placement through assignment and re-estimation to convergence]
Comments on the k-means method
Strengths:
Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
Often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses:
Applicable only when a mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for discovering clusters with non-convex shapes.
Measure accuracy
Precision is the fraction of retrieved instances that are relevant.
Recall is the fraction of relevant instances that are retrieved.
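Written in terms of true positives (TP), false positives (FP), and false negatives (FN), these standard definitions are:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$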
Measure accuracy: F-score
The F-score can be interpreted as a weighted average of precision and recall.
The F-score reaches its best value at 1 and its worst at 0.
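Its commonly used balanced form, F1, is the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$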
Questions about machine learning
How do I choose which attributes of my data to include in the model?
How do I choose which model to use?
How do I optimize this model for best performance?
How do I ensure that I'm building a model that will generalize to unseen data?
Can I estimate how well my model is likely to perform on unseen data?
Back to AIDE…
How does AIDE work?
A framework that automatically "steers" the user towards data areas relevant to his interest.
In AIDE, the user engages in a "conversation" with the system, indicating his interests, while in the background the system automatically formulates and processes queries that collect data matching the user's interest.
AIDE framework
The exploration loop: the user labels data samples → a decision tree classifier is trained → promising sampling areas are identified → the next sample set is retrieved from the DB, and the loop repeats.
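A toy version of this loop, shown only as a sketch: the user's feedback is simulated by an oracle function, and AIDE's three exploration phases (described later) are replaced by simple uncertainty sampling for brevity. All names, parameters, and the relevance predicate here are our inventions, not the paper's code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
db = rng.uniform(0, 100, size=(10_000, 2))            # columns: age, dosage
oracle = lambda X: (X[:, 0] <= 40) & (X[:, 1] <= 15)  # stand-in for user labels

idx = rng.choice(len(db), size=50, replace=False)     # initial sample acquisition
X, y = db[idx], oracle(db[idx])
for _ in range(10):                                   # iterative steering
    clf = DecisionTreeClassifier(max_depth=4).fit(X, y)       # data classification
    cand = db[rng.choice(len(db), size=2_000, replace=False)]
    proba = clf.predict_proba(cand)                   # "promising" = least certain
    uncertainty = 1 - proba.max(axis=1)
    new = cand[np.argsort(-uncertainty)[:25]]         # next sample set to label
    X = np.vstack([X, new])
    y = np.concatenate([y, oracle(new)])              # user feedback on new samples
print("accuracy on full DB:", clf.score(db, oracle(db)))
```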
AIDE challenges
AIDE operates on the unlabeled data space that the user aims to explore.
To achieve a desirable interactive experience for the user, AIDE needs not only to provide accurate results, but also to minimize the number of samples presented to the user (which determines the amount of user effort).
There is a trade-off between the quality of results (accuracy) and efficiency (the total exploration time, which includes the total sample-reviewing time and the wait time of the user).
Assumptions
Prediction of linear patterns: user interests are captured by range queries.
Binary, non-noisy relevance system: the user indicates whether a data object is relevant or not to him, and this categorization cannot be modified in later iterations.
Categorical and numerical features.
Data classification
A decision tree classifier is used to identify linear patterns of user interest.
Decision tree advantages:
Easy to interpret
Performs well with large data
Easy mapping to queries that retrieve the relevant data objects
Can handle both numerical and categorical data
Query formulation
Let us assume a decision tree classifier that predicts relevant and irrelevant clinical-trial objects based on the attributes age and dosage.
Query formulation
SELECT *
FROM table
WHERE (age ≤ 20 AND dosage > 10 AND dosage ≤ 15)
   OR (age > 20 AND age ≤ 40 AND dosage ≥ 0 AND dosage ≤ 10)
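A fitted decision tree can be translated mechanically into such a disjunction of range predicates. Below is a sketch that walks scikit-learn's tree internals; tree_to_where is our hypothetical helper, not AIDE's API, and it assumes the class label 1 means "relevant":

```python
from sklearn.tree import DecisionTreeClassifier, _tree

def tree_to_where(clf, feature_names):
    """Translate the 'relevant' leaves of a fitted decision tree into a SQL
    query (our illustrative helper; assumes class 1 = relevant)."""
    t = clf.tree_
    clauses = []
    def walk(node, conds):
        if t.feature[node] == _tree.TREE_UNDEFINED:           # reached a leaf
            if clf.classes_[t.value[node][0].argmax()] == 1:  # a 'relevant' leaf
                clauses.append("(" + " AND ".join(conds) + ")")
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node],  conds + [f"{name} <= {thr:.1f}"])
        walk(t.children_right[node], conds + [f"{name} > {thr:.1f}"])
    walk(0, [])
    return "SELECT * FROM table WHERE " + " OR ".join(clauses)

# Usage: clf = DecisionTreeClassifier().fit(X, y) with y in {0, 1}, then
# print(tree_to_where(clf, ["age", "dosage"])) yields a disjunctive range query.
```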
Space exploration overview
The focus is on optimizing the effectiveness of the exploration while minimizing the number of samples presented to the user.
The goal is to discover relevant areas and formulate user queries that select either a single relevant area (conjunctive queries) or multiple ones (disjunctive queries).
There are three exploration phases:
Relevant object discovery
Misclassified exploitation
Boundary exploitation
Phase one: relevant object discovery
Focuses on collecting samples from yet unexplored areas and identifying single relevant objects.
This phase aims to discover relevant objects by showing the user samples from diverse data areas.
To maximize the coverage of the exploration space, it follows a well-structured approach that allows AIDE to:
ensure that the exploration space is explored widely
keep track of the already explored sub-areas
explore different data areas at different granularities
Phase one: relevant object discovery
[Figure: the exploration space over Attribute A and Attribute B (values 0-120), divided into grid cells at increasing granularity from Level 1 to Level 3; with d features, the chosen granularity determines the number of grid cells]
Phase one: relevant object discovery
Algorithm: from the center of each grid cell, retrieve a single random object within a given distance along each dimension.
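A sketch of the grid construction under our assumptions (a normalized space with gamma cells per dimension; cell_centers is our hypothetical helper):

```python
import itertools
import numpy as np

def cell_centers(d, gamma):
    """Centers of a d-dimensional grid with gamma cells per dimension,
    over the normalized exploration space [0, 1]^d."""
    step = 1.0 / gamma
    ticks = [step * (i + 0.5) for i in range(gamma)]
    return np.array(list(itertools.product(ticks, repeat=d)))  # gamma**d centers

centers = cell_centers(d=2, gamma=4)  # 16 sampling areas at this granularity
# For each center c one would issue a query of the form (delta = allowed
# distance per dimension):
#   SELECT * FROM table WHERE attr_j BETWEEN c_j - delta AND c_j + delta ... LIMIT 1
```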
Phase one: relevant object discovery
Optimizations:
Hint-based object discovery: specific attribute ranges on which the user desires to focus.
Skewed attributes: use the k-means algorithm to partition the data space into k clusters; database objects are assigned to the cluster with the closest centroid.
Phase two: misclassified exploitation
The goal is to discover relevant areas, as opposed to single objects.
This phase strategically increases the number of relevant objects in the training set, such that the predicted queries will select relevant areas.
It is designed to increase both the precision and the recall of the final query.
It strives to limit the number of extraction queries, and hence the time overhead of this phase.
Phase two: misclassified exploitation
Generation of misclassified samples:
Assume a decision tree classifier Ci is generated in the i-th iteration.
This phase leverages the misclassified samples to identify the next set of sampling areas, in order to discover more relevant areas.
It addresses the lack of relevant samples by collecting more objects around false negatives.
Phase two: misclassified exploitation
"Naïve" algorithm: collect samples around each false negative to obtain more relevant samples.
Very successful in identifying relevant areas.
High time cost: it executes one retrieval query per misclassified object.
It often redundantly samples highly overlapping areas, spending resources (i.e., user labeling effort) without increasing AIDE's accuracy much.
If k iterations are needed to identify a relevant area, the user might have labeled many samples without improving the F-measure.
Phase two: misclassified exploitation
Clustering-based exploitation algorithm (sketched below):
Create clusters using the k-means algorithm and use one sampling area per cluster, sampling around each cluster.
In each iteration i, the algorithm sets k to the overall number of relevant objects discovered in the object discovery phase.
We run the clustering-based exploitation only if k is less than the number of false negatives.
Experimental results showed that f should be set to a small number (10-25 samples).
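A sketch of this idea under our assumptions (exploitation_areas and its radius parameter are our inventions, not the paper's API):

```python
import numpy as np
from sklearn.cluster import KMeans

def exploitation_areas(false_negatives, k, radius=0.05):
    """Cluster the false negatives and create one sampling area per cluster,
    instead of one retrieval query per misclassified object (naive algorithm)."""
    k = min(k, len(false_negatives))
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(false_negatives)
    # Each centroid defines a sampling box of +/- radius in every dimension,
    # from which f (10-25) samples would be drawn for the user to label.
    return [(c - radius, c + radius) for c in km.cluster_centers_]

# Example: 100 false-negative points in a normalized 2-D space -> 5 boxes.
fns = np.random.default_rng(1).uniform(size=(100, 2))
areas = exploitation_areas(fns, k=5)
```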
Phase three: boundary exploitation
Given a set of relevant areas identified by the decision tree classifier, this phase aims to refine these areas by incrementally adjusting their boundaries.
This yields a better characterization of the user's interests, i.e., higher accuracy of the final results.
This phase has the smallest impact on the effectiveness of the model: not discovering a relevant area can reduce accuracy more than a partially discovered relevant area with imprecise boundaries. Hence, we constrain the number of samples used during this phase and aim to distribute an equal amount of user effort to refining each boundary.
Phase three: boundary exploitation
Algorithm:
Input: the number of samples, the k d-dimensional relevant areas, and the number of boundaries.
For each boundary, collect random samples within a distance ±x from the boundary.
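A sketch of this step under our assumptions (boundary_samples and its side and x parameters are our inventions):

```python
import numpy as np

def boundary_samples(area, dim, side, x, n, seed=0):
    """A relevant area is a list of per-dimension (lo, hi) ranges; draw n
    samples within +/- x of one of its boundaries while keeping the other
    dimensions inside the area."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(area).T                        # per-dimension bounds
    pts = rng.uniform(lo, hi, size=(n, len(area)))   # start inside the area
    edge = area[dim][1] if side == "hi" else area[dim][0]
    pts[:, dim] = rng.uniform(edge - x, edge + x, size=n)  # perturb one boundary
    return pts

# Example: refine the upper dosage boundary of the area age 20-40, dosage 0-10.
samples = boundary_samples([(20, 40), (0, 10)], dim=1, side="hi", x=1.0, n=25)
```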
Phase three: boundary exploitation
Optimizations:
Adaptive sample size: dynamically adapts the number of samples collected, based on:
d: the dimensionality of the exploration space
the percentage of change of boundary j between the (i − 1)-th and i-th iterations, calculated as the difference of the boundary's normalized values in the specific dimension
er: an error variable to cover cases where the boundary is not modified but also not accurately predicted
Phase three: boundary exploitation
Optimizations:
Non-overlapping sampling areas: when sampling areas overlap across iterations, the exploration areas do not evolve significantly, resulting in redundant sampling and increased exploration cost (e.g., user effort) without improvements in classification accuracy.
Phase three: boundary exploitation
Optimizations:
Identifying irrelevant attributes: domain sampling around the boundaries. While shrinking/expanding one dimension of a relevant area, collect random samples over the whole domain of the remaining dimensions.
Phase three: boundary exploitation
Optimizations:
Exploration on sampled datasets: generate a random sampled database and extract the samples from the smaller sampled dataset.
This optimization can be used for both the misclassified-exploitation and boundary-exploitation phases.
Sampled datasets are generated using a simple random sampling approach that picks each tuple with the same probability.
AIDE model summary
Initial sample acquisition.
The iterative steering process starts when the user provides his feedback:
Data classification (domain experts could restrict the attribute set on which the exploration is performed)
Data extraction query
Space exploration: relevant object discovery, misclassified exploitation, boundary exploitation
Sample extraction
Query formulation
Conclusions
AIDE assists users in discovering new interesting data patterns and eliminates expensive ad-hoc exploratory queries.
AIDE relies on a seamless integration of classification algorithms and data management optimization techniques that collectively strive to accurately learn the user's interests based on his relevance feedback on strategically collected samples.
Our techniques minimize the number of samples presented to the user (which determines the amount of user effort) as well as the cost of sample acquisition (which amounts to the user wait time).
It provides interactive performance, as it limits the user wait time per exploration iteration to less than a few seconds.
Any Questions?
And now for real..
https://www.youtube.com/watch?v=1BwIw_t_J_4