Slide1
Approximated Provenance for Complex Applications
Susan B. Davidson
University of Pennsylvania
Eleanor Ainy, Daniel Deutch, Tova Milo
Tel Aviv University
Slide2
Crowd Sourcing
The engagement of crowds of Web users for data procurement and knowledge creation.
Slide3
Why now?
We are all connected, all the time!
Slide4
Complexity?
Many of the initial applications were quite simple:
Specify a Human Intelligence Task (HIT), e.g. using Mechanical Turk, collect responses, and aggregate them to form the result.
Newer ideas are multi-phase and complex, e.g. mining frequent fact sets from the crowd (OASSIS).
Model these as workflows with global state.
Slide5
Outline
“State-of-the-art” in crowd data provenance
New challenges
A proposal for modeling crowd data provenance
Slide7
Crowd data provenance?
TripAdvisor: aggregates reviews and presents average ratings
Individual reviews are part of the provenance
Wikipedia: keeps extensive information about how pages are edited
ID of the user who generated the page as well as changes to the page (when, who, summary)
Provides several views of this information, e.g. by page or by editor
Mainly used for presentation and explanation
Slide16
Outline
“State-of-the-art” in crowd data provenance
New challenges
A proposal for modeling crowd data provenance
Slide17
Challenges for crowd data provenance
Complexity of processes and number of user inputs involved
Provenance can be very large, leading to difficulties in viewing and understanding provenance
Need for:
Summarization
Multidimensional views
Provenance mining
Compact representation for maintenance and cleaning
Slide18
Summarization
The large size of provenance leads to a need for abstraction
E.g., in heavily edited Wikipedia pages:
“x1, x2, x3 are formatting changes; y1, y2, y3, y4 add content; z1, z2 represent divergent viewpoints”
“u1, u2, u3 represent edits by robots; v1, v2 represent edits by Wikipedia administrators”
E.g., in a movie-rating application, to summarize the provenance of the average rating for “MatchPoint”:
“Audience crowd members gave higher ratings (8-10) whereas critics gave lower ratings (3-5).”
Slide19
Multidimensional Views
“Perspective” through which provenance can be viewed or mined
E.g. in TripAdvisor, if there is an “outlier” review it would be useful to see other reviews by that person to “calibrate” it.
“Question” perspective could show which questions are bad/unclear
Slide20
Maintenance and Cleaning
May need update propagation to remove certain users, questions and/or answers
E.g. spammers or bad questions
Mining of provenance may lag behind the aggregate calculation
E.g., detecting a spammer may only be possible when they have answered enough questions, or when enough answers have been obtained from other users.
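A minimal sketch of why retained provenance helps with cleaning: if every answer keeps the user that produced it, a spammer detected late can be removed and the aggregate recomputed without collecting new data. The answers, user names and the `average_ratings` helper are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical crowd answers: (user, movie, rating). All names are illustrative.
answers = [
    ("u1", "m1", 8),
    ("u2", "m1", 1),
    ("u3", "m1", 9),
    ("u2", "m2", 1),
    ("u3", "m2", 7),
]

def average_ratings(answers, removed_users=frozenset()):
    """Average rating per movie, skipping answers from removed users."""
    totals = defaultdict(lambda: [0, 0])  # movie -> [sum of ratings, count]
    for user, movie, rating in answers:
        if user in removed_users:
            continue
        totals[movie][0] += rating
        totals[movie][1] += 1
    return {movie: s / n for movie, (s, n) in totals.items()}

# Aggregate as first computed, and again once u2 is flagged as a spammer.
before = average_ratings(answers)
after = average_ratings(answers, removed_users={"u2"})
print(before)
print(after)
```

Because the per-answer provenance is kept, the removal can be applied whenever the spammer is eventually detected, which matches the lag between mining and aggregation noted above.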
Slide21
Outline
“State-of-the-art” in crowd data provenance
New challenges
A proposal for modeling crowd data provenance
Slide22
Crowd Sourcing Workflow
Movie reviews aggregator platform (workflow figure not reproduced)
Slide23
Provenance expression
Slide24
Propagating provenance annotations through joins
JOIN (on B)

R
A  B  C  | annotation
…
a  b  c  | p
…

S
D  B  E  | annotation
…
d  b  e  | r
…

R ⋈ S
A  B  C  D  E  | annotation
…
a  b  c  d  e  | p * r
…

The annotation p * r means joint use of data annotated by p and data annotated by r.
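The join rule can be sketched in a few lines of Python. This is an illustration only: annotations are kept as plain strings and the product is written symbolically, whereas the cited semiring framework treats them as elements of an algebraic structure. The `annotated_join` helper is a hypothetical name.

```python
# Annotated relations: each row is (tuple_as_dict, annotation_string).
# The contents mirror the slide's R and S.
R = [({"A": "a", "B": "b", "C": "c"}, "p")]
S = [({"D": "d", "B": "b", "E": "e"}, "r")]

def annotated_join(left, right, key):
    """Join on `key`; the output annotation is the product of the inputs'."""
    out = []
    for lrow, lann in left:
        for rrow, rann in right:
            if lrow[key] == rrow[key]:
                merged = {**lrow, **rrow}
                # Joint use of both input tuples is recorded as lann * rann.
                out.append((merged, f"{lann} * {rann}"))
    return out

joined = annotated_join(R, S, "B")
print(joined)
```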
[Green, Karvounarakis, Tannen, Provenance Semirings. PODS 2007]
Slide25
Propagating provenance annotations through unions and projections
PROJECT

R
A  B  C   | annotation
…
a  b  c1  | p
…
a  b  c2  | r
…
a  b  c3  | s
…

π_AB R
A  B  | annotation
…
a  b  | p + r + s
…

+ means alternative use of data, which arises in both PROJECT and UNION.
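The projection rule can be sketched the same way. Again this is only an illustration with string annotations: rows that collapse to the same projected tuple have their annotations summed. The `annotated_project` helper and the positional column encoding are assumptions made for the example.

```python
from collections import defaultdict

# Rows of R as (values, annotation); columns are A, B, C as on the slide.
R = [
    (("a", "b", "c1"), "p"),
    (("a", "b", "c2"), "r"),
    (("a", "b", "c3"), "s"),
]

def annotated_project(rows, keep):
    """Project onto the column positions in `keep`; sum annotations of merged rows."""
    groups = defaultdict(list)
    for values, ann in rows:
        projected = tuple(values[i] for i in keep)
        groups[projected].append(ann)
    # Alternative derivations of the same output tuple are joined with +.
    return {row: " + ".join(anns) for row, anns in groups.items()}

projected = annotated_project(R, keep=(0, 1))
print(projected)
```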
[Green, Karvounarakis, Tannen, Provenance Semirings. PODS 2007]
Slide26
Annotated Aggregate Expressions

R
Eid  Dept  Sal  | annotation
1    d1    20   | p1
2    d1    10   | p2
3    d1    15   | p3

Q = select Dept, sum(Sal) from R group by Dept

The sum salary for d1 could be represented by the expression
(20 ⊗ p1 + 10 ⊗ p2 + 15 ⊗ p3)
This provenance-aware value “commutes” with deletion.
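A small sketch of what “commutes with deletion” means in practice. The symbolic sum is stored as (value, annotation) terms, and deleting a tuple is modelled by assigning 0 to its annotation. Restricting valuations to 0 and 1 is an assumption made for illustration; the cited construction is more general. The helper names are hypothetical.

```python
# Annotated rows of R: (Eid, Dept, Sal, annotation), as on the slide.
R = [
    (1, "d1", 20, "p1"),
    (2, "d1", 10, "p2"),
    (3, "d1", 15, "p3"),
]

def annotated_sum(rows, dept):
    """Symbolic sum for one department: a list of (value, annotation) terms."""
    return [(sal, ann) for _, d, sal, ann in rows if d == dept]

def evaluate(terms, valuation):
    """Evaluate under a 0/1 valuation: 1 keeps a tuple, 0 deletes it."""
    return sum(sal * valuation[ann] for sal, ann in terms)

expr = annotated_sum(R, "d1")
all_present = evaluate(expr, {"p1": 1, "p2": 1, "p3": 1})
without_p2 = evaluate(expr, {"p1": 1, "p2": 0, "p3": 1})

# Deleting the p2 tuple from R first and then summing gives the same value.
direct = sum(sal for _, d, sal, ann in R if d == "d1" and ann != "p2")
print(all_present, without_p2, direct)
```

Mapping an annotation to 0 in the stored expression gives the same result as re-running the query on the cleaned input, so the aggregate can be maintained without recomputing from scratch.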
[Amsterdamer, Deutch, Tannen, Provenance for Aggregate Queries. PODS 2011]
Slide27
Provenance expression
Slide28
Provenance expression: Benefits
Can understand how movie ratings were computed.
Can be used for data maintenance and cleaning
E.g. if U2 is discovered to be a spammer, “map” its provenance annotation to 0
Slide29
Summarizing provenance
Map annotations to a corresponding “summary”
h: Ann → Ann’, where |Ann’| << |Ann|
E.g. in our example, let h(Ui) = h(Si) = 1, h(Ai) = A, h(Ci) = C
Reducing the expression to (expression not recovered)
Which simplifies to (expression not recovered)
Slide30
Constructing mappings?
How do we define and find “good” mappings?
Provenance size
Semantic constraints (e.g. two annotations can only be mapped to the same annotation if they come from the same input table)
Distance between original provenance expression and the mapped expression (e.g. grouping all young French people and giving them an average rating for some movie)
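A rough sketch of applying a summary mapping and measuring the resulting provenance size. The terms below are hypothetical and are not the expression from the talk; the mapping `h` follows the spirit of sending user annotations to 1 and role annotations to A or C. Size is measured here as the number of distinct annotations, which is one possible choice, and no distance measure is computed.

```python
from collections import defaultdict

# Hypothetical provenance terms for one movie's rating: each term pairs a
# rating with the annotations it depends on (a user and a role annotation).
terms = [
    (9, ("U1", "A1")),
    (8, ("U2", "A2")),
    (4, ("U3", "C1")),
    (3, ("U4", "C2")),
]

# Summary mapping h: users map to the neutral annotation "1",
# audience annotations to "A", critic annotations to "C".
h = {"U1": "1", "U2": "1", "U3": "1", "U4": "1",
     "A1": "A", "A2": "A", "C1": "C", "C2": "C"}

def summarize(terms, h):
    """Apply h to every annotation, drop neutral ones, group ratings by result."""
    groups = defaultdict(list)
    for rating, anns in terms:
        mapped = tuple(h[a] for a in anns if h[a] != "1")
        groups[mapped].append(rating)
    return dict(groups)

def size(annotation_names):
    """Provenance size measured as the number of distinct annotations."""
    return len(set(annotation_names))

summary = summarize(terms, h)
original_size = size(a for _, anns in terms for a in anns)
summary_size = size(a for key in summary for a in key)
group_means = {key: sum(v) / len(v) for key, v in summary.items()}
print(summary, original_size, summary_size, group_means)
```

The per-group means show the kind of statement a summary supports (audience versus critics), at the cost of no longer distinguishing individual users.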
Slide31
Conclusions
Provenance is needed for crowd-sourcing applications to help understand the results and reason about their quality.
Techniques from database/workflow provenance can be used, but there are special challenges and “opportunities”