Slide 1
Crowd Mining
Tova Milo
Slide 2
The engagement of crowds of Web users for data procurement (10/1/2013)
Background - Crowd (Data) Sourcing
Slide 3
Crowdsourcing Challenges (or, shameless self-advertisement)
- What questions to ask? [SIGMOD13, VLDB13]
- How to define & determine the correctness of answers? [WWW12, ICDE11]
- Whom to ask? How many people? How to best use the resources? [ICDE12, ICDT13, ICDE13, VLDB13]
Related areas: Data Mining, Data Cleaning, Probabilistic Data, Optimizations and Incremental Computation
Slide 4
Crowd Mining: Crowdsourcing in an open world
Human knowledge forms an open world. Assume we want to find out what is interesting and important in some domain area: folk medicine, people's habits, ...
What questions to ask?
Slide 5
Back to classic databases...
Significant data patterns are identified using data mining techniques. A useful type of pattern: association rules, e.g., stomach ache -> chamomile. Queries are dynamically constructed in the learning process.
Is it possible to mine the crowd?
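The support and confidence behind such association rules can be sketched in a few lines. This is a minimal illustration, not code from the talk; the transactions and item names are made up:

```python
# Minimal sketch of association-rule support and confidence over a
# transaction database. Each transaction is a set of items.
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Among transactions containing lhs, the fraction also containing rhs."""
    with_lhs = [t for t in transactions if lhs <= t]
    return sum(1 for t in with_lhs if rhs <= t) / len(with_lhs) if with_lhs else 0.0

# Illustrative transaction database:
transactions = [
    {"stomach ache", "chamomile"},
    {"stomach ache", "chamomile", "mint"},
    {"stomach ache", "ginger"},
    {"headache", "coffee"},
]
supp = support({"stomach ache", "chamomile"}, transactions)       # 0.5
conf = confidence({"stomach ache"}, {"chamomile"}, transactions)  # 2/3
```

A classic miner evaluates these measures directly on the stored transactions; the point of the talk is that for crowd data no such stored database exists.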
Slide 6
Turning to the crowd
Let us model the history of every user as a personal database: every case = a transaction consisting of items. This database is not recorded anywhere - it is a hidden DB. It is hard for people to recall many details about many transactions! But... they can often provide summaries, in the form of personal rules: "To treat a sore throat I often use garlic."

Example cases:
- Treated a sore throat with garlic and oregano leaves ...
- Treated a sore throat and low fever with garlic and ginger ...
- Treated a heartburn with water, baking soda and lemon ...
- Treated nausea with ginger; the patient experienced sleepiness ...
Slide 7
Two types of questions
- Open questions: free recollection (mostly simple, prominent patterns). "Tell me how you treat a particular illness" -> "I typically treat nausea with ginger infusion"
- Closed questions: concrete questions (may be more complex). "When a patient has both headaches and fever, how often do you use a willow tree bark infusion?"
We use the two types in an interleaved manner.
Slide 8
Contributions (at a very high level)
- A formal model for crowd mining: the allowed questions, the interpretation of answers, and personal rules and their overall significance
- A framework of the generic components required for mining the crowd
- Significance and error estimations [and how these change if we ask more questions...]
- Crowd-mining algorithms [implementation & benchmark, on both synthetic & real data]
Slide 9
The model: user support and confidence
A set of users U. Each user u in U has a (hidden!) transaction database Du. Each rule X -> Y is associated with:
- user support
- user confidence
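These per-user quantities, and their means across the crowd, can be sketched as follows. The two hidden databases here are hypothetical stand-ins for what real users would only be able to summarize:

```python
from statistics import mean

# Sketch of the per-user model: each user u has a (hidden) transaction
# database Du; a rule X -> Y gets a user support and a user confidence.
def user_support(x, y, db):
    """Frequency of X ∪ Y among the user's transactions."""
    return sum(1 for t in db if (x | y) <= t) / len(db)

def user_confidence(x, y, db):
    """Among the user's transactions containing X, the fraction containing Y."""
    with_x = [t for t in db if x <= t]
    return sum(1 for t in with_x if y <= t) / len(with_x) if with_x else 0.0

# Two hypothetical users' hidden databases (illustrative items):
dbs = [
    [{"sore throat", "garlic"}, {"sore throat", "garlic", "ginger"}, {"nausea", "ginger"}],
    [{"sore throat", "honey"}, {"sore throat", "garlic"}],
]
x, y = {"sore throat"}, {"garlic"}
mean_supp = mean(user_support(x, y, db) for db in dbs)     # (2/3 + 1/2) / 2
mean_conf = mean(user_confidence(x, y, db) for db in dbs)  # (1 + 1/2) / 2
```

Since the databases are hidden, a crowd-mining algorithm can only estimate these means from users' answers.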
Slide 10
Model for closed and open questions
- Closed questions: X =>? Y. Answer: (approximate) user support and confidence.
- Open questions: ? => ?. Answer: an arbitrary rule with its user support and confidence, e.g., "I typically have a headache once a week. In 90% of the times, coffee helps."
Slide 11
Significant rules
Significant rules: rules where the mean user support and confidence are above some specified thresholds Θs, Θc.
Goal: estimating rule significance (and identifying the significant rules) while asking the smallest possible number of questions to the crowd.
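Given per-user estimates gathered from answers, the significance test itself is a simple threshold check. A sketch, with made-up estimates and illustrative threshold values:

```python
from statistics import mean

# A rule is significant when its mean user support and mean user confidence
# across the crowd exceed the thresholds theta_s, theta_c (values illustrative).
def is_significant(supports, confidences, theta_s=0.5, theta_c=0.6):
    return mean(supports) >= theta_s and mean(confidences) >= theta_c

# Per-user estimates collected from crowd answers (made-up numbers):
sig = is_significant([0.7, 0.6, 0.5], [0.9, 0.8, 0.7])      # mean supp 0.6, conf 0.8
not_sig = is_significant([0.2, 0.3, 0.1], [0.9, 0.8, 0.7])  # mean supp 0.2: too low
```

The hard part, addressed by the algorithms in the talk, is deciding which question to ask next so that these means are estimated with as few crowd queries as possible.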
Slide 12
Framework components
- One generic framework for crowd mining
- Particular choices of implementation for the black boxes
- Validated by experiments
Slide 13
What is a good algorithm?
How do we measure the efficiency of crowd mining algorithms? Two distinguished cost factors:
- Crowd complexity: the number of crowd queries used by the algorithm
- Computational complexity: the complexity of computing the crowd queries and processing the answers
[A crowd complexity lower bound is a trivial computational complexity lower bound.]
There exists a tradeoff between the complexity measures: naïve question selection -> more crowd questions.
Slide 14
Semantic knowledge can save work
Given a taxonomy of is-a relationships among items (e.g., espresso is a coffee):
frequent({headache, espresso}) => frequent({headache, coffee})
Advantages:
- Allows inference on itemset frequencies
- Allows avoiding semantically equivalent itemsets: {espresso}, {espresso, coffee}, {espresso, beverage}, ...
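The inference above can be sketched concretely: replacing any item by an ancestor in the taxonomy can only make an itemset more frequent, so all generalizations of a known-frequent itemset come for free, without further crowd questions. A toy sketch with an assumed is-a taxonomy:

```python
from itertools import product

# Toy is-a taxonomy (child -> parent); names are illustrative.
parents = {"espresso": "coffee", "coffee": "beverage"}

def self_and_ancestors(item):
    """The item followed by its chain of ancestors in the taxonomy."""
    chain = [item]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain

def generalizations(itemset):
    """All itemsets obtained by replacing items with any of their ancestors.
    Each is at least as frequent as the original, so if the original is
    frequent, every generalization is frequent too."""
    choices = [self_and_ancestors(i) for i in itemset]
    return {frozenset(c) for c in product(*choices)}

gens = generalizations({"headache", "espresso"})
# knowing frequent({headache, espresso}) yields frequent({headache, coffee})
# and frequent({headache, beverage}) with no extra crowd queries
```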
Slide 15
Complexity measures
Given a taxonomy Ψ, we measure complexity in terms of:
- the input Ψ: its size, shape, width, ...
- the output: the frequent itemsets, represented compactly by the Maximal Frequent Itemsets (MFI) and the Minimal Infrequent Itemsets (MII)
Slide 16
Complexity boundaries
Notations:
- |Ψ| - the taxonomy size
- |I(Ψ)| - the number of itemsets (modulo equivalences)
- |S(Ψ)| - the number of possible solutions
Slide 17
Now, back to the bigger picture...
The user's question in natural language:
"I'm looking for activities to do in a child-friendly attraction in New York, and a good restaurant nearby"
Answers:
"You can go bike riding in Central Park and eat at Maoz Vegetarian. Rent bikes at the boathouse"
"You can go visit the Bronx Zoo and eat at Pine Restaurant. Order antipasti at Pine. Skip dessert and go for ice cream across the street"
Slide 18
ID | Transaction Details
1  | <Football> doAt <Central_Park>
2  | <Biking> doAt <Central_Park>. <BaseBall> doAt <Central_Park>. <Rent_Bikes> doAt <Boathouse>
3  | <Falafel> eatAt <Maoz Veg.>
4  | <Antipasti> eatAt <Pine>
5  | <Visit> doAt <Bronx_Zoo>. <Antipasti> eatAt <Pine>
Slide 19
Solution Ingredients
- Query language (based on SPARQL and DMQL): describing the relevant part of the ontology, and the type of (association) rules we are interested in
- Crowd-based query evaluation algorithms: open and closed questions to the crowd; sampling and answer aggregation; refinement...
Slide 20
Crowd Mining Query Language (based on SPARQL and DMQL)

FIND association rules
RELATED TO x, y+, z, u?, p?, v?
WHERE {$x instanceOf <Attraction>.
       $x inside <NYC>.
       $x hasLabel "Child Friendly".
       $y subClassOf* <Activity>.
       $z instanceOf <Restaurant>.
       $z nearBy $x. ...}
MATCHING
  ( {} => {([] eatAt $z.)}
    WITH support THRESHOLD = 0.007 )
AND
  ( {([] doAt $x)} => {($y doAt $x), ($u $p $v)}
    WITH support THRESHOLD = 0.01
    WITH confidence THRESHOLD = 0.2
    RETURN MFI )

Annotations:
- RELATED TO determines the group size for each variable.
- The WHERE clause is SPARQL-like; $x is a variable.
- subClassOf* denotes a path of length 0 or more.
- The left-hand and right-hand parts of the rule are defined as SPARQL patterns.
- Mining parameters: thresholds, output granularity, etc.
- Several rule patterns can be mined, joined by AND/OR.
Slide 21
Can we trust the crowd?

Slide 22
Can we trust the crowd?
Slide 23
Summary
The crowd is an incredible resource... but must be used carefully!
Many challenges:
- (very) interactive computation
- a huge amount of data
- varying quality and trust
"Computers are useless, they can only give you answers" - Pablo Picasso
But, as it seems, they can also ask us questions!
Slide 24
Thanks
Antoine Amarilli, Yael Amsterdamer, Rubi Boim, Susan Davidson, Ohad Greenshpan, Benoit Gross, Yael Grossman, Ezra Levin, Ilia Lotosh, Slava Novgordov, Neoklis Polyzotis, Sudeepa Roy, Pierre Senellart, Amit Somech, Wang-Chiew Tan...
EU-FP7 ERC MoDaS - Mob Data Sourcing
Slide 25
(Closed) Questions to the crowd
- How often do you do something in Central Park? { ( [] doAt <Central_Park> ) } -> Supp=0.7
- How often do you eat at Maoz Vegetarian? { ( [] eatAt <Maoz_Vegeterian> ) } -> Supp=0.1
- How often do you swim in Central Park? { ( <Swim> doAt <Central_Park> ) } -> Supp=0.3
Slide 26
(Open) Questions to the crowd
- What do you do in Central Park? { ( $y doAt <Central_Park> ) } -> $y = Football, supp = 0.6
- What else do you do when biking in Central Park? { ( <Biking> doAt <Central_Park> ), ( $y doAt <Central_Park> ) } -> $y = Football, supp = 0.6
- Complete: "When I go biking in Central Park..." { ( <Biking> doAt <Central_Park> ), ( $u $p $v ) } -> $u = <Rent_Bikes>, $p = doAt, $v = <BoatHouse>, supp = 0.6
Slide 27
Example of extracted association rules
- <Ball Games, doAt, Central_Park> (Supp:0.4 Conf:1) => <[], eatAt, Maoz_Vegeterian> (Supp:0.2 Conf:0.2)
- <Visit, doAt, Bronx_Zoo> (Supp:0.2 Conf:1) => <[], eatAt, Pine_Restaurant> (Supp:0.2 Conf:0.2)