Uploaded On 2015-12-02



Presentation Transcript

Slide1

Approximated Provenance for Complex Applications

Susan B. Davidson, University of Pennsylvania

Eleanor Ainy, Daniel Deutch, Tova Milo, Tel Aviv University

Slide2

Crowd Sourcing

The engagement of crowds of Web users for data procurement and knowledge creation.

Slide3

Why now?

We are all connected, all the time!

Slide4

Complexity?

Many of the initial applications were quite simple.

Specify a Human Intelligence Task (HIT) using, e.g., Mechanical Turk; collect responses; aggregate them to form the result.

Newer ideas are multi-phase and complex, e.g. mining frequent fact sets from the crowd (OASSIS).

Model as workflows with global state.
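
The simple "collect responses, aggregate" pattern can be sketched as follows. This is only an illustration: the worker ids, the answers, and the choice of majority voting as the aggregation step are assumptions, not something the slides prescribe.

```python
from collections import Counter

def aggregate_responses(responses):
    """Return the most common answer among worker responses.

    `responses` maps a (hypothetical) worker id to that worker's answer.
    Ties are broken by first occurrence, as in Counter.most_common.
    """
    counts = Counter(responses.values())
    answer, _count = counts.most_common(1)[0]
    return answer

# Hypothetical responses to a single task.
responses = {"w1": "yes", "w2": "yes", "w3": "no"}
result = aggregate_responses(responses)
print(result)
```

A multi-phase application would chain several such steps and share state between them, which is where workflow-style provenance becomes relevant.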

Slide5

Outline

“State-of-the-art” in crowd data provenance

New challenges

A proposal for modeling crowd data provenance

Slide6

Outline

“State-of-the-art” in crowd data provenance

New challenges

A proposal for modeling crowd data provenance

Slide7

Crowd data provenance?

TripAdvisor: aggregates reviews and presents average ratings.

Individual reviews are part of the provenance.

Wikipedia: keeps extensive information about how pages are edited.

ID of the user who generated the page as well as changes to the page (when, who, summary).

Provides several views of this information, e.g. by page or by editor.

Mainly used for presentation and explanation.

Slide16

Outline

“State-of-the-art” in crowd data provenance

New challenges

A proposal for modeling crowd data provenance

Slide17

Challenges for crowd data provenance

Complexity of processes and number of user inputs involved

Provenance can be very large, leading to difficulties in viewing and understanding provenance

Need for:

Summarization

Multidimensional views

Provenance mining

Compact representation for maintenance and cleaning

Slide18

Summarization

Large size of provenance: need for abstraction

E.g., in heavily edited Wikipedia pages:

“x1, x2, x3 are formatting changes; y1, y2, y3, y4 add content; z1, z2 represent divergent viewpoints”

“u1, u2, u3 represent edits by robots; v1, v2 represent edits by Wikipedia administrators”

E.g., in a movie-rating application, to summarize the provenance of the average rating for “MatchPoint”:

“Audience crowd members gave higher ratings (8-10) whereas critics gave lower ratings (3-5).”
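
A summary like the movie-rating one can be produced by grouping individual contributions and describing each group. The sketch below assumes hypothetical rating records tagged with a contributor group ("audience" or "critic"); the records and the choice of a min-max range as the summary are illustrative only.

```python
from collections import defaultdict

def summarize_by_group(ratings):
    """Group (user, group, rating) records and return each group's rating range."""
    by_group = defaultdict(list)
    for _user, group, rating in ratings:
        by_group[group].append(rating)
    return {group: (min(rs), max(rs)) for group, rs in by_group.items()}

# Hypothetical ratings for one movie.
ratings = [
    ("u1", "audience", 8), ("u2", "audience", 10), ("u3", "audience", 9),
    ("c1", "critic", 3), ("c2", "critic", 5),
]
summary = summarize_by_group(ratings)
for group, (low, high) in summary.items():
    print(f"{group}: ratings {low}-{high}")
```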

Slide19

Multidimensional Views

“Perspective” through which provenance can be viewed or mined

E.g. in TripAdvisor, if there is an “outlier” review it would be useful to see other reviews by that person to “calibrate” it.

“Question” perspective could show which questions are bad/unclear.
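
One way to picture a "perspective" is as an index over the same provenance records along a chosen dimension. The sketch below assumes hypothetical review records with `user`, `question` and `rating` fields; these names are not from the slides.

```python
from collections import defaultdict

def view_by(records, dimension):
    """Index provenance records by one dimension, e.g. 'user' or 'question'."""
    view = defaultdict(list)
    for record in records:
        view[record[dimension]].append(record)
    return dict(view)

# Hypothetical review records; u1's rating of hotel_a looks like an outlier.
records = [
    {"user": "u1", "question": "hotel_a", "rating": 1},
    {"user": "u1", "question": "hotel_b", "rating": 1},
    {"user": "u2", "question": "hotel_a", "rating": 5},
    {"user": "u3", "question": "hotel_a", "rating": 4},
]
by_user = view_by(records, "user")          # "user" perspective
by_question = view_by(records, "question")  # "question" perspective

# All ratings by the author of the outlier, used to "calibrate" it.
u1_ratings = [r["rating"] for r in by_user["u1"]]
print(u1_ratings)
```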

Slide20

Maintenance and Cleaning

May need update propagation to remove certain users, questions and/or answers

E.g. spammers or bad questions

Mining of provenance may lag behind the aggregate calculation.

E.g., detecting a spammer may only be possible when they have answered enough questions, or when enough answers have been obtained from other users.
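
The lag between aggregation and detection can be sketched as follows. The spam heuristic here (a user whose answers are all identical once enough have been seen) is a placeholder assumption for illustration, as are all names and data; the slides do not specify a detection method.

```python
from statistics import mean

def average_excluding(answers, removed_users):
    """Average the ratings whose contributing user has not been removed."""
    kept = [rating for user, rating in answers if user not in removed_users]
    return mean(kept) if kept else None

def flag_spammers(history, min_answers=3):
    """Flag users with at least `min_answers` answers that are all identical.

    Placeholder heuristic: it can only fire once enough answers exist.
    """
    return {
        user for user, ratings in history.items()
        if len(ratings) >= min_answers and len(set(ratings)) == 1
    }

answers = [("u1", 8), ("u2", 1), ("u3", 9)]   # answers for one movie
before = average_excluding(answers, set())    # aggregate computed early

# Later, after more answers have accumulated across questions:
history = {"u1": [8, 6, 9], "u2": [1, 1, 1], "u3": [9, 7]}
spammers = flag_spammers(history)
after = average_excluding(answers, spammers)  # aggregate after clean-up
print(before, spammers, after)
```

Because the per-user contributions are kept, the earlier aggregate can be corrected without asking the crowd again.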

Slide21

Outline

“State-of-the-art” in crowd data provenance

New challenges

A proposal for modeling crowd data provenance

Slide22

Crowd Sourcing Workflow

[Figure: workflow with components Movie reviews, Aggregator, Platform]

Slide23

Provenance expression

Slide24

Propagating provenance annotations through joins

R
A B C | annotation
a b c | p

S
D B E | annotation
d b e | r

R JOIN S (on B)
A B C D E | annotation
a b c d e | p * r

The annotation p * r means joint use of data annotated by p and data annotated by r.
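
A minimal sketch of this propagation rule, assuming rows are stored as (values, annotation) pairs and annotations are kept as symbolic strings. The representation is an illustrative choice, not the formalism of the cited paper.

```python
def join_on_b(R, S):
    """Join R(A, B, C) with S(D, B, E) on B, multiplying annotations.

    Each input row is (values, annotation). The output annotation is the
    symbolic product of the two input annotations, e.g. "p*r".
    """
    out = []
    for (a, b, c), p in R:
        for (d, b2, e), r in S:
            if b == b2:
                out.append(((a, b, c, d, e), f"{p}*{r}"))
    return out

R = [(("a", "b", "c"), "p")]
S = [(("d", "b", "e"), "r")]
joined = join_on_b(R, S)
print(joined)
```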

[Green, Karvounarakis, Tannen, Provenance Semirings. PODS 2007]

Slide25

Propagating provenance annotations through unions and projections

R
A B C | annotation
a b c1 | p
a b c2 | r
a b c3 | s

PROJECT: π_AB R
A B | annotation
a b | p + r + s

+ means alternative use of data, which arises in both PROJECT and UNION.
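
The same idea for projection, again as a sketch with symbolic string annotations (an assumed representation): rows that collapse to the same (A, B) pair have their annotations added.

```python
def project_ab(R):
    """Project R(A, B, C) onto (A, B), adding annotations of merged rows."""
    merged = {}
    for (a, b, _c), annotation in R:
        merged.setdefault((a, b), []).append(annotation)
    return [(key, " + ".join(anns)) for key, anns in merged.items()]

R = [
    (("a", "b", "c1"), "p"),
    (("a", "b", "c2"), "r"),
    (("a", "b", "c3"), "s"),
]
projected = project_ab(R)
print(projected)
```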

[Green, Karvounarakis, Tannen, Provenance Semirings. PODS 2007]

Slide26

Annotated Aggregate Expressions

R
Eid Dept Sal | annotation
1 d1 20 | p1
2 d1 10 | p2
3 d1 15 | p3

Q = select Dept, sum(Sal) from R group by Dept

The sum salary for d1 could be represented by the expression (20 p1 + 10 p2 + 15 p3).

This provenance-aware value “commutes” with deletion.
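
To make "commutes with deletion" concrete, the sketch below stores the expression as (value, annotation) terms and evaluates it under a 0/1 valuation, where 0 models a deleted tuple. The list-of-terms representation and the function name are assumptions made for illustration.

```python
def evaluate_sum(expr, valuation):
    """Evaluate a provenance-aware sum under a 0/1 valuation of annotations.

    `expr` is a list of (value, annotation) terms, e.g. 20 p1 + 10 p2 + 15 p3.
    Mapping an annotation to 0 models deleting the tuple it annotates.
    """
    return sum(value * valuation[annotation] for value, annotation in expr)

d1_sum = [(20, "p1"), (10, "p2"), (15, "p3")]

total = evaluate_sum(d1_sum, {"p1": 1, "p2": 1, "p3": 1})
total_after_delete = evaluate_sum(d1_sum, {"p1": 1, "p2": 0, "p3": 1})

# Deleting the tuple annotated p2 from R and re-running the query
# gives the same value as mapping p2 to 0 in the stored expression.
R = [(1, "d1", 20, "p1"), (2, "d1", 10, "p2"), (3, "d1", 15, "p3")]
recomputed = sum(sal for _eid, dept, sal, ann in R if dept == "d1" and ann != "p2")
print(total, total_after_delete, recomputed)
```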

[Amsterdamer, Deutch, Tannen, Provenance for Aggregate Queries. PODS 2011]

Slide27

Provenance expression

Slide28

Provenance expression: Benefits

Can understand how movie ratings were computed.

Can be used for data maintenance and cleaning.

E.g. if U2 is discovered to be a spammer, “map” its provenance annotation to 0.

Slide29

Summarizing provenance

Map annotations to a corresponding “summary”

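h: Ann → Ann’, where |Ann’| << |Ann|

E.g. in our example, let h(Ui) = h(Si) = 1, h(Ai) = A, h(Ci) = C

Reducing the expression to [expression not preserved in this transcript]

Which simplifies to [expression not preserved in this transcript]

Slide30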

Constructing mappings?

How do we define and find “good” mappings?

Provenance size

Semantic constraints (e.g. two annotations can only be mapped to the same annotation if they come from the same input table)

Distance between original provenance expression and the mapped expression (e.g. grouping all young French people and giving them an average rating for some movie)
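
As a rough sketch of how a mapping h shrinks provenance, the code below applies a mapping of the form h(Ui) = 1, h(Ai) = A, h(Ci) = C to a small expression and compares the number of distinct annotations before and after. The expression itself is hypothetical (the slides' own expression is not available here), and counting distinct annotations is just one possible size measure.

```python
from collections import Counter

def apply_mapping(expr, h):
    """Apply an annotation mapping h to a sum of monomials.

    `expr` is a list of monomials, each a tuple of annotation names (a product).
    Annotations mapped to "1" are dropped (multiplicative identity), and
    identical monomials are merged with an integer coefficient.
    """
    summary = Counter()
    for monomial in expr:
        mapped = tuple(sorted(h[a] for a in monomial if h[a] != "1"))
        summary[mapped] += 1
    return dict(summary)

def distinct_annotations(monomials):
    """Set of annotation names that occur in any monomial."""
    return {a for monomial in monomials for a in monomial}

# Hypothetical expression: U1*A1 + U2*A2 + U3*C1.
expr = [("U1", "A1"), ("U2", "A2"), ("U3", "C1")]
h = {"U1": "1", "U2": "1", "U3": "1", "A1": "A", "A2": "A", "C1": "C"}

summary = apply_mapping(expr, h)
size_before = len(distinct_annotations(expr))
size_after = len(distinct_annotations(summary))
print(summary, size_before, size_after)
```

A mapping that respects a semantic constraint would, for instance, refuse to send annotations from different input tables to the same target; that check is omitted here.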

Slide31

Conclusions

Provenance is needed for crowd-sourcing applications to help understand the results and reason about their quality.

Techniques from database/workflow provenance can be used, but there are special challenges and “opportunities”.