The Semiring Framework for Database Provenance

Uploaded On 2020-08-07

Presentation Transcript

Slide1

The Semiring Framework for Database Provenance

5/21/2018

1

UW

Val Tannen

University of Pennsylvania

Slide2

Collaborators

ORCHESTRA
- TJ Green, RelationalAI (was @ LogicBlox)
- Grigoris Karvounarakis, RelationalAI (was @ LogicBlox)
- Zack Ives, University of Pennsylvania

Other core papers
- Nate Foster, Cornell University
- Yael Amsterdamer, Bar-Ilan University
- Daniel Deutch, Tel Aviv University
- Tova Milo, Tel Aviv University
- Sudeepa Roy, Duke University
- Yuval Moskovitch, Tel Aviv University

Slide3

Data provenance

For many, tracking data provenance means maintaining an “activity log”: recording each access to a data item.

Many are satisfied, depending on what is recorded in each log entry!

A new perspective, both declarative (logic-based) and processing-based, arose in databases in the last 15+ years with the work of Peter Buneman and others, including us.

This new perspective allows us to be more ambitious about applications of provenance analysis. (As in algorithmic analysis.)

Slide4

Binary trust

Sue’s notes *
cat    mouse   Yes
cat    rat     Yes

Val’s notes *
mouse  gray    Yes
mouse  red     No
rat    gray    Yes

Zack ** (computation)
cat    gray    Yes
cat    red     No

(Slide labels: food, color.)

* Sue and Val are noted zoologists. ** Zack is a noted computational zoologist.

Slide5

Access control

Pub < Conf < Sec < TSec

Sue’s notes
cat    mouse   Pub
cat    rat     Pub

Val’s notes
mouse  gray    TSec
mouse  red     TSec
rat    gray    Conf

Zack (computation)
cat    gray    Conf
cat    red     TSec

Slide6

Confidence scores (non-binary trust)

Sue’s notes
cat    mouse   0.9
cat    rat     0.9

Val’s notes
mouse  gray    0.6
mouse  red     0.1
rat    gray    0.8

Zack (computation)
cat    gray    0.72
cat    red     0.09

0.72 = max(0.9 × 0.8, 0.9 × 0.6)
0.09 = 0.9 × 0.1
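The two scores above can be reproduced with a short sketch. It joins Sue's notes with Val's notes on the shared animal and projects that animal away, combining joint use with × and alternative use with max. The function name `compose` and the dictionary encoding are illustrative assumptions, as is the assignment of 0.6 and 0.8 to the two gray tuples (swapping them gives the same results).

```python
from collections import defaultdict

def compose(left, right, plus, times, zero):
    """Join left(x, y) with right(y, z) on y and project onto (x, z).

    Annotations of joined tuples are combined with `times` (joint use);
    annotations of tuples that collapse under projection are combined
    with `plus` (alternative use).
    """
    out = defaultdict(lambda: zero)
    for (x, y), a in left.items():
        for (y2, z), b in right.items():
            if y == y2:
                out[(x, z)] = plus(out[(x, z)], times(a, b))
    return dict(out)

# Annotated relations (assumed encoding: tuple -> confidence score).
sue = {("cat", "mouse"): 0.9, ("cat", "rat"): 0.9}
val = {("mouse", "gray"): 0.6, ("mouse", "red"): 0.1, ("rat", "gray"): 0.8}

zack = compose(sue, val, plus=max, times=lambda a, b: a * b, zero=0.0)
for tup, score in sorted(zack.items()):
    print(tup, round(score, 2))
```

Passing `plus=min`, `times=lambda a, b: a + b`, and `zero=float("inf")` to the same function performs the analogous computation with costs instead of scores.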

Slide7

A simple model for data pricing

Sue’s notes
cat    mouse   $10
cat    rat     $10

Val’s notes
mouse  gray    $6
mouse  red     $1
rat    gray    $8

Zack (computation)
cat    gray    $16
cat    red     $11

16 = min(10 + 8, 10 + 6)
11 = 10 + 1

Slide8

Do it once and use it repeatedly: provenance

- Label (annotate) input items abstractly with provenance tokens.
- Provenance tracking: propagate expressions (involving tokens) to annotate intermediate data and, finally, outputs.
- Track two distinct ways of using data items by computation primitives:
  - jointly (this alone is basically like keeping a log)
  - alternatively (doing both is essential; think trust)
- Input-output compositional; modular (in the primitives).

Later, we want to evaluate the provenance expressions to obtain binary trust, access control, confidence scores, data prices, etc.

Slide9

Database Provenance

View V = Query(R)

R
A  B  C
a  b  c
d  b  e
f  g  e

V
A  C
a  c
a  e
d  c
d  e
f  e

[Figure: the tables R and V, connected by question marks.]

Slide10

Provenance by propagating annotation of data items

JOIN (on B)

R
A  B  C
…
a  b  c    p
…

S
D  B  E
…
d  b  e    r
…

R ⋈ S
A  B  C  D  E
…
a  b  c  d  e    p · r
…

The annotation p · r means joint use of data annotated by p and data annotated by r.

Slide11

Another way to propagate annotations

UNION

R
A  B  C
…
a  b  c    p
…

S
A  B  C
…
a  b  c    r
…

R ∪ S
A  B  C
…
a  b  c    p + r
…

The annotation p + r means alternative use of data.

Slide12

Another use of +

PROJECT

R
A  B  C
a  b  c1    p
a  b  c2    r
a  b  c3    s

π_AB R
A  B
a  b    p + r + s

+ means alternative use of data.

Slide13

Example: a positive query
SPJU / UCQ / nonrecursive Datalog

V = σ_{C=e} π_AC( π_AC R ⋈ π_BC R ∪ π_AB R ⋈ π_BC R )

R
A  B  C
a  b  c    p
d  b  e    r
f  g  e    s

V
A  C
a  c    (p · p + p · p) · 0
a  e    (p · r) · 1
d  c    (r · p) · 0
d  e    (r · r + r · s + r · r) · 1
f  e    (s · s + s · r + s · s) · 1

For selection we multiply with two special annotations, 0 and 1.

Slide14

Summary so far

- A space of annotations, K.
- K-relations: every tuple annotated with some element from K.
- Binary operations on K: · corresponds to joint use (join), and + corresponds to alternative use (union and projection).
- We assume K contains special annotations 0 and 1.
- “Absent” tuples are annotated with 0!
- 1 is a “neutral” annotation (no restrictions).

Algebra of annotations? What are the laws of (K, +, ·, 0, 1)?

Slide15

Annotated relational algebra

DBMS query optimizers assume certain equivalences:
- union is associative, commutative
- join is associative, commutative, distributes over union
- projections and selections commute with each other and with union and join (when applicable)
- Etc., but no R ⋈ R = R ∪ R = R (i.e., no idempotence, to allow for bag semantics)

Equivalent queries should produce same annotations!

Proposition. Above identities hold for queries on K-relations iff (K, +, ·, 0, 1) is a commutative semiring.

Hence, for each commutative semiring K we have a K-annotated relational algebra.

Slide16

What is a commutative semiring?

An algebraic structure (K, +, ·, 0, 1) where:
- K is the domain
- + is associative, commutative, with 0 identity
- · is associative, with 1 identity
- · distributes over +
- a · 0 = 0 · a = 0

The laws above define a semiring; for a commutative semiring:
- · is also commutative

Unlike ring, no requirement for inverses to +.
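As a quick sanity check of these laws, the following sketch tests them exhaustively on a finite sample of elements. The function name `is_commutative_semiring` is an illustrative assumption, and passing the check on a sample is evidence, not a proof, for an infinite domain.

```python
from itertools import product

def is_commutative_semiring(elems, plus, times, zero, one):
    """Check the commutative semiring laws on the given sample of elements."""
    elems = list(elems)
    for a, b, c in product(elems, repeat=3):
        if plus(plus(a, b), c) != plus(a, plus(b, c)):        # + associative
            return False
        if times(times(a, b), c) != times(a, times(b, c)):    # * associative
            return False
        if times(a, plus(b, c)) != plus(times(a, b), times(a, c)):  # distributivity
            return False
    for a, b in product(elems, repeat=2):
        if plus(a, b) != plus(b, a) or times(a, b) != times(b, a):  # commutativity
            return False
    for a in elems:
        if plus(a, zero) != a or times(a, one) != a:          # identities
            return False
        if times(a, zero) != zero:                            # 0 annihilates
            return False
    return True

booleans = is_commutative_semiring(
    [False, True], lambda a, b: a or b, lambda a, b: a and b, False, True)
naturals = is_commutative_semiring(
    range(5), lambda a, b: a + b, lambda a, b: a * b, 0, 1)
# (N, max, +, 0, 0) fails: a + 0 = a, so 0 does not annihilate.
broken = is_commutative_semiring(
    range(5), max, lambda a, b: a + b, 0, 0)
print(booleans, naturals, broken)
```

The failing case shows why a (min, +) or (max, +) structure needs an extra infinite element to serve as its zero.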

Slide17

But are there useful commutative semirings?

They capture several semantics:
- (B, ∨, ∧, ⊥, ⊤): set semantics
- (ℕ, +, ∙, 0, 1): bag semantics
- (BoolExp(X), ∨, ∧, ⊥, ⊤): conditional tables (c-tables) [ImielinskiLipski 84]
- (P(Ω), ∪, ∩, ∅, Ω): probabilistic event tables [FuhrRölleke 97, Zimányi 97]

Slide18

Using the commutative semiring axioms (I)

R
A  B  C
a  b  c    p
d  b  e    r
f  g  e    s

V
A  C
a  c    (p · p + p · p) · 0
a  e    p · r · 1
d  c    r · p · 0
d  e    (r · r + r · s + r · r) · 1
f  e    (s · s + s · r + s · s) · 1

Slide19

Using the commutative semiring axioms (II)

R
A  B  C
a  b  c    p
d  b  e    r
f  g  e    s

V
A  C
a  e    p · r
d  e    r · r + r · s + r · r
f  e    s · s + s · r + s · s

Slide20

Using … axioms (III): provenance polynomials

R
A  B  C
a  b  c    p
d  b  e    r
f  g  e    s

V
A  C
a  e    pr
d  e    2r² + rs
f  e    rs + 2s²

Multivariate polynomials with
- annotation tokens as indeterminates p, r, s
- coefficients in N.

Polynomials capture a remarkably general form of provenance.
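To see where these polynomials come from, here is a minimal sketch that evaluates the query V = σ_{C=e} π_AC(π_AC R ⋈ π_BC R ∪ π_AB R ⋈ π_BC R) over relations annotated with elements of N[X]. The encoding is an assumption made for illustration: a polynomial is a `Counter` from monomials (sorted tuples of token names) to coefficients, and the helper names `token`, `mul`, `project`, `join`, and `union` are hypothetical.

```python
from collections import Counter

def token(name):
    """The polynomial consisting of a single provenance token."""
    return Counter({(name,): 1})

def mul(f, g):
    """Polynomial product; a monomial is a sorted tuple of token names."""
    out = Counter()
    for m1, c1 in f.items():
        for m2, c2 in g.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

def project(rel, idxs):
    """Projection (or column reordering): merged tuples add their annotations."""
    out = {}
    for tup, ann in rel.items():
        key = tuple(tup[i] for i in idxs)
        out[key] = out.get(key, Counter()) + ann
    return out

def join(r1, r2, i, j):
    """Join on r1 column i = r2 column j: annotations are multiplied.
    Output columns: all of r1, then r2 without column j."""
    out = {}
    for t1, a1 in r1.items():
        for t2, a2 in r2.items():
            if t1[i] == t2[j]:
                key = t1 + tuple(v for k, v in enumerate(t2) if k != j)
                out[key] = out.get(key, Counter()) + mul(a1, a2)
    return out

def union(r1, r2):
    """Union: tuples present in both relations add their annotations."""
    out = dict(r1)
    for tup, ann in r2.items():
        out[tup] = out.get(tup, Counter()) + ann
    return out

R = {("a", "b", "c"): token("p"),
     ("d", "b", "e"): token("r"),
     ("f", "g", "e"): token("s")}

pi_AC, pi_BC, pi_AB = project(R, [0, 2]), project(R, [1, 2]), project(R, [0, 1])
left = project(join(pi_AC, pi_BC, 1, 1), [0, 2, 1])   # columns (A, C, B) -> (A, B, C)
right = join(pi_AB, pi_BC, 1, 0)                      # columns (A, B, C)
# Selection C = e multiplies by 1 or 0; zero-annotated tuples are dropped.
V = {t: f for t, f in project(union(left, right), [0, 2]).items() if t[1] == "e"}

for tup, poly in sorted(V.items()):
    print(tup, dict(poly))
```

In this encoding the monomial `("r", "r")` with coefficient 2 stands for 2r², so the entry for (d, e) reads 2r² + rs.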

Slide21

Provenance polynomials (I)

V = σ_{C=e} π_AC( π_AC R ⋈ π_BC R ∪ π_AB R ⋈ π_BC R )

R
A  B  C
a  b  c    p
d  b  e    r
f  g  e    s

V
A  C
a  e    pr
d  e    2r² + rs
f  e    rs + 2s²

Provenance reading: three ways to derive (d e):
- two of them use r, twice,
- the third uses r and s, once each.

Slide22

Provenance polynomials (II)

- (N[X], +, ·, 0, 1) is the commutative semiring freely generated by X (universality property involving homomorphisms).
- Provenance polynomials are PTIME-computable (data complexity). (Query complexity depends on language and representation.)
- ORCHESTRA provenance (graph representation): about 30% overhead.
- Monomials correspond to logical derivations (proof trees in non-rec. Datalog).

Slide23

Low-hanging fruit: deletion propagation

R
A  B  C
a  b  c    p
d  b  e    r
f  g  e    s

V = Q(R)
A  C
a  e    pr
d  e    2r² + rs
f  e    rs + 2s²

Delete (d b e) from R? Set r = 0!

A  C
a  e    0
d  e    0
f  e    2s²

So only (f e), annotated 2s², remains.

We used this in Orchestra for update propagation.
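Setting a token to 0 can be sketched in a few lines: every monomial that mentions the token vanishes, and a tuple whose polynomial becomes 0 disappears from the view. The encoding (a `Counter` from monomials, written as tuples of token names, to coefficients) and the name `delete_token` are assumptions for illustration.

```python
from collections import Counter

def delete_token(poly, tok):
    """Substitute tok = 0: drop every monomial that mentions tok."""
    return Counter({m: c for m, c in poly.items() if tok not in m})

V = {
    ("a", "e"): Counter({("p", "r"): 1}),                 # pr
    ("d", "e"): Counter({("r", "r"): 2, ("r", "s"): 1}),  # 2r² + rs
    ("f", "e"): Counter({("r", "s"): 1, ("s", "s"): 2}),  # rs + 2s²
}

after = {t: delete_token(f, "r") for t, f in V.items()}
# An empty Counter is the zero polynomial, i.e. the tuple is absent.
surviving = {t: f for t, f in after.items() if f}
print(surviving)
```

No query re-evaluation is needed: the view is updated from the stored provenance alone.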

Slide24

Wrong answers: diagnostic and repairs

Zack(x,z) :- Sue(x,y), Val(y,z)

Sue’s notes
cat    mouse   p
cat    rat     q

Val’s notes
mouse  gray    r
mouse  red     s
rat    gray    t

Zack
cat    gray    p∙r + q∙t
cat    red     p∙s

On planet Erythro, mice and rats are red! Wrong answer!

Diagnostic: provenance of wrong answer: p∙r + q∙t

Four minimal ways to make p∙r + q∙t = 0:
- p = q = 0
- r = q = 0
- p = t = 0
- r = t = 0

(maybe choose based on confidence or cost)
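The four repairs are exactly the minimal sets of tokens that hit every monomial of p∙r + q∙t. A brute-force sketch follows; the function name `minimal_repairs` is an assumption, and the enumeration is exponential in the number of tokens, so it only illustrates the idea on small inputs.

```python
from itertools import combinations

def minimal_repairs(monomials):
    """Return the minimal token sets whose deletion (token = 0) zeroes
    every monomial, i.e. the minimal hitting sets of the monomials."""
    tokens = sorted(set().union(*monomials))
    found = []
    # Candidates are tried by increasing size, so any candidate that
    # contains an earlier repair is not minimal and is skipped.
    for size in range(1, len(tokens) + 1):
        for cand in combinations(tokens, size):
            s = set(cand)
            hits_all = all(s & m for m in monomials)
            if hits_all and not any(f <= s for f in found):
                found.append(s)
    return found

wrong_answer = [{"p", "r"}, {"q", "t"}]  # the monomials of p∙r + q∙t
repairs = minimal_repairs(wrong_answer)
print([sorted(r) for r in repairs])
```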

Slide25

Specialize provenance for access control

Zack(x,z) :- Sue(x,y), Val(y,z)

(A, min, max, 0, Pub) where A = Pub < Conf < Sec < TSec < 0

X → A: p, q ↦ Pub; r, s ↦ TSec; t ↦ Conf

eval: N[X] → A
eval(pr+qt) = Conf
eval(ps) = TSec

Sue’s notes
cat    mouse   p    Pub
cat    rat     q    Pub

Val’s notes
mouse  gray    r    TSec
mouse  red     s    TSec
rat    gray    t    Conf

Zack
cat    gray    pr+qt    Conf
cat    red     ps       TSec
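The evaluation eval: N[X] → A can be sketched as follows, interpreting + as min and · as max over the ordered clearance levels. The encoding of polynomials as dictionaries from monomials to coefficients and the names `a_plus`, `a_times`, and `evaluate` are assumptions for illustration.

```python
# Clearance levels in increasing order; "0" is the extra top element
# listed on the slide, which serves as the semiring zero.
LEVELS = ["Pub", "Conf", "Sec", "TSec", "0"]
RANK = {name: i for i, name in enumerate(LEVELS)}

def a_plus(a, b):
    """Alternative use: min of the two levels."""
    return a if RANK[a] <= RANK[b] else b

def a_times(a, b):
    """Joint use: max of the two levels."""
    return a if RANK[a] >= RANK[b] else b

def evaluate(poly, valuation):
    """Evaluate {monomial: coefficient} in (A, min, max, 0, Pub).
    A monomial is a tuple of tokens; a coefficient n is an n-fold sum."""
    total = "0"
    for monomial, coeff in poly.items():
        term = "Pub"
        for tok in monomial:
            term = a_times(term, valuation[tok])
        for _ in range(coeff):
            total = a_plus(total, term)
    return total

valuation = {"p": "Pub", "q": "Pub", "r": "TSec", "s": "TSec", "t": "Conf"}
cat_gray = evaluate({("p", "r"): 1, ("q", "t"): 1}, valuation)  # pr + qt
cat_red = evaluate({("p", "s"): 1}, valuation)                  # ps
print(cat_gray, cat_red)
```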

Slide26

Specialize provenance for confidence scores

Zack(x,z) :- Sue(x,y), Val(y,z)

V = ([0,1], max, ∙, 0, 1), the Viterbi semiring

Sue’s notes
cat    mouse   p    0.9
cat    rat     q    0.9

Val’s notes
mouse  gray    r    0.6
mouse  red     s    0.1
rat    gray    t    0.8

Zack
cat    gray    pr+qt    0.72
cat    red     ps       0.09

Slide27

Some application semirings

- (B, ∨, ∧, ⊥, ⊤): binary trust
- (N, +, ·, 0, 1): multiplicity (number of derivations)
- (A, min, max, 0, Pub): access control
- V = ([0,1], max, ·, 0, 1): Viterbi semiring (HMM), confidence scores
- T = ([0, ∞], min, +, ∞, 0): tropical semiring (shortest paths), data pricing
- F = ([0,1], max, min, 0, 1): “fuzzy logic” semiring

Slide28

A menagerie of provenance semirings

- (Which(X), ∪, ∪*, ∅, ∅*): sets of contributing tuples; “Lineage” (1) [CWW00]
- (Why(X), ∪, ⋓, ∅, {∅}): sets of sets of …; witness why-provenance [BKT01]
- (PosBool(X), ∨, ∧, ⊥, ⊤): minimal sets of sets of …; minimal witness why-provenance [BKT01], also “Lineage” (2) used in probabilistic dbs [SORK11]
- (Trio(X), +, ·, 0, 1): bags of sets of …; “Lineage” (3) [BDHTW08, G09]
- (B[X], +, ·, 0, 1): sets of bags of …; Boolean coeff. polynomials [G09]
- (Sorp(X), +, ·, 0, 1): minimal sets of bags of …; absorptive polynomials [DMRT14]
- (N[X], +, ·, 0, 1): bags of bags of …; universal provenance polynomials [GKT07]

Slide29

Two kinds of semirings in this framework

Provenance semirings, e.g.,
- (N[X], +, ·, 0, 1): provenance polynomials [GKT07]
- (Why(X), ∪, ⋓, ∅, {∅}): witness why-provenance [BKT01]

Application semirings, e.g.,
- (A, min, max, 0, Pub): access control [FGT08]
- V = ([0,1], max, ∙, 0, 1): Viterbi semiring (HMM) [GKIT07]

Provenance specialization relies on
- Provenance semirings are freely generated by provenance tokens
- Query commutation with semiring homomorphisms

Slide30

Query commutation with homomorphisms

For a query in QL and a homomorphism h: K1 → K2:

K1-Rel  --query-->  K1-Rel
   |                   |
   h                   h
   v                   v
K2-Rel  --query-->  K2-Rel

QL = RA+, Datalog [GKT07] and extensions [FGT08, GP10, ADT11a, T13, DMT15, GUKFC16, T17]
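A small sketch of this commutation, under assumed names: `two_hop` is a hypothetical positive query evaluated over any semiring, and h(n) = (n > 0) is the homomorphism from bag semantics (N, +, ·, 0, 1) to set semantics (B, ∨, ∧, ⊥, ⊤). Applying h after the query gives the same relation as applying it before.

```python
def two_hop(edges, plus, times):
    """Q(x, z) = sum over y of E(x, y) * E(y, z), over any semiring."""
    out = {}
    for (x, y), a in edges.items():
        for (y2, z), b in edges.items():
            if y == y2:
                term = times(a, b)
                out[(x, z)] = plus(out[(x, z)], term) if (x, z) in out else term
    return out

def apply_hom(h, rel):
    """Apply a semiring homomorphism to every annotation of a relation."""
    return {tup: h(ann) for tup, ann in rel.items()}

def h(n):
    """Homomorphism from (N, +, *, 0, 1) to (B, or, and, False, True)."""
    return n > 0

# Bag-annotated edge relation (illustrative data, all multiplicities positive).
bag_edges = {("a", "b"): 2, ("a", "c"): 1, ("b", "d"): 3, ("c", "d"): 1}

query_then_h = apply_hom(
    h, two_hop(bag_edges, lambda a, b: a + b, lambda a, b: a * b))
h_then_query = two_hop(
    apply_hom(h, bag_edges), lambda a, b: a or b, lambda a, b: a and b)
print(query_then_h, h_then_query)
```

The same pattern is what lets one compute provenance once and specialize it afterwards: h can be any semiring homomorphism, not only this one.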

Slide31

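A Hierarchy of Provenance Semirings [G09, DMRT14]

[Figure: a diagram of surjective semiring homomorphisms (identity on X) leading from the most informative provenance semiring, N[X], to the least informative, Which(X). Edges are labeled with the law they impose: idempotence of +, idempotence of ·, or absorption (ab + a = a). The application semirings A, T, V, N, B appear at the bottom.]

Example: the image of 2x²y + xy + 5y² + xz in each semiring:
- N[X]: 2x²y + xy + 5y² + xz
- B[X]: x²y + xy + y² + xz
- Trio(X): 3xy + 5y + xz
- Sorp(X): xy + y² + xz
- Why(X): xy + y + xz
- PosBool(X): y + xz
- Which(X): xyz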

Slide32


Slide33

From RA+ to Datalog

- Immediate consequence operator F of a Datalog program. Incorporates the edb predicates, maps idb predicates to idb predicates.
- It’s expressible in RA+. E.g., transitive closure: F(T) = E ∪ π_{1,3}(E ⋈ T)
- Generalize to F: (K-Rel)^n → (K-Rel)^n (n = # of idb predicates)
- Solve certain (systems of) least fixed point equations over K-relations: T = F(T)
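As an illustration of solving T = F(T), the sketch below iterates the transitive-closure operator F(T) = E ∪ π_{1,3}(E ⋈ T) over the tropical semiring (min, +), where an absent tuple implicitly carries the annotation ∞. The function names and the edge weights are assumptions for illustration; with non-negative weights the iteration stabilizes even though E has a cycle.

```python
import math

def immediate_consequence(E, T):
    """F(T) = E ∪ π_{1,3}(E ⋈ T) over the tropical semiring (min, +)."""
    out = dict(E)
    for (x, y), a in E.items():
        for (y2, z), b in T.items():
            if y == y2:
                # join multiplies (here: +), union/projection add (here: min)
                out[(x, z)] = min(out.get((x, z), math.inf), a + b)
    return out

def least_fixpoint(E):
    """Iterate F from the empty relation until nothing changes."""
    T = {}
    while True:
        nxt = immediate_consequence(E, T)
        if nxt == T:
            return T
        T = nxt

# Illustrative weighted edges; note the cycle a -> b -> c -> a.
E = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 5, ("c", "a"): 1}
T = least_fixpoint(E)
print(sorted(T.items()))
```

Over this semiring the annotation of (x, z) in the fixpoint is the cost of the cheapest path from x to z.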

Slide34

Find provenances of idb tuples


Slide35

Provenance of idb tuples

- Introduce unknowns Z for the annotations of idb tuples.
- Solve system of fixed point equations over K; right-hand sides are polynomials in K[Z].

Additional structure on K for these to have (unique) solutions?

Slide36

ω-continuous semirings

Semirings K such that the immediate consequence operator of any Datalog program has a least fixpoint on K-relations.

- Naturally ordered: when x ≤ y iff there exists z s.t. x + z = y is an order relation (all semirings seen here are naturally ordered)
- ω-complete: also, chains x0 ≤ x1 ≤ … ≤ xn ≤ … have l.u.b.’s (sup’s)
- ω-continuous: moreover, + and · preserve those l.u.b.’s

Slide37

Among our examples

- Many of the semirings that interest us (B, T, V, A, F) are already ω-continuous.
- (N, +, ·, 0, 1) is not, but its “completion” (N^∞ = N ∪ {∞}, +, ·, 0, 1) is.
- For provenance, the completion of N[X] is not N^∞[X]. Instead of (finite) polynomials we need (possibly infinite) formal power series. They form an ω-continuous semiring N^∞[[X]].
- Monomials still correspond to derivation trees. (Even transitive closure has infinitely many derivation trees if E has loops.)
- The completion of B[X] is B[[X]].

Slide38

Solution


Slide39

Absorptive polynomials

- Most informative provenance semiring for Datalog: (N^∞[[X]], +, ·, 0, 1). (Infinite power series have finite representations as systems of polynomial equations.)
- Absorption: a + ab = a
- Absorptive polynomials Sorp(X): boolean coefficients but only minimal degree monomials. E.g., x²y + xy + y² + xz ↦ xy + y² + xz
- Absorptive power series same as absorptive polynomials! Why? Order monomials by degree of each variable. In this infinite poset all antichains are finite! (Dickson’s Lemma)
- Sorp(X) is already ω-continuous: provides provenance polynomials for Datalog.
- So is PosBool(X), but Sorp(X) provenance also supports tropical and Viterbi semiring applications.
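The absorption step x²y + xy + y² + xz ↦ xy + y² + xz can be sketched as keeping only the monomials that are minimal under divisibility. The encoding (a monomial as a dictionary from variable to exponent, a polynomial with boolean coefficients as a list of monomials) and the names `divides` and `absorb` are assumptions for illustration.

```python
def divides(m1, m2):
    """m1 divides m2: every exponent in m1 is at most the one in m2."""
    return all(m2.get(v, 0) >= e for v, e in m1.items())

def absorb(monomials):
    """Apply a + ab = a exhaustively: keep only the monomials that no
    other (distinct) monomial of the polynomial divides."""
    unique = []
    for m in monomials:
        if m not in unique:   # boolean coefficients: duplicates collapse
            unique.append(m)
    return [m for m in unique
            if not any(other != m and divides(other, m) for other in unique)]

# x²y + xy + y² + xz
poly = [{"x": 2, "y": 1}, {"x": 1, "y": 1}, {"y": 2}, {"x": 1, "z": 1}]
reduced = absorb(poly)
print(reduced)
```

Here x²y is absorbed by xy, while y² survives because xy does not divide it.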

Slide40

Provenance for aggregation

SUM S GROUP BY D

D  S
a  20    x
a  10    y
b  15    q
b  10    r
b  25    s

D  S-agg
a  20+10       ?
b  15+10+25    ?

Desiderata
- Compatibility with set/bag semantics
- Fundamental property (commutation with homomorphisms)
- Poly-size overhead! 1 + 2 + 4 + … + 2^(n-1) => 2^n results

Slide41

Solution inspired by (semi) linear algebra

D  S
a  20    x
a  10    y
b  15    q
b  10    r
b  25    s

D  S-agg
a  x 20 + y 10           ?
b  q 15 + r 10 + s 25    ?

- (K-Rel, ∪, ∅) is a K-semimodule with the singletons as basis.
- Relations are the result of ∪-aggregation!
- What if (ℝ, +, 0) were a Prov(X)-semimodule?
- (ℝ, +, 0) is not a Prov(X)-semimodule, but…

Slide42

Tensor product construction

D  S-agg
a  x⊗20 + y⊗10           x + y
b  q⊗15 + r⊗10 + s⊗25    q + r + s

- Embed a commutative monoid M (for sum, max or min) into a K-semimodule K ⊗ M (new values!)
- Consistency: embedding should be faithful.
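One way to read the tensor values is to specialize them after the fact. The sketch below assumes SUM aggregation and a valuation of tokens into multiplicities in N, and it assumes that a simple tensor n ⊗ m can then be collapsed to the number n · m; the function name `specialize_sum` and the list-of-pairs encoding are hypothetical.

```python
def specialize_sum(tensor_sum, valuation):
    """Map a formal sum of simple tensors k ⊗ m to a number, given a
    valuation of tokens into N, using n ⊗ m -> n * m (assumed for SUM)."""
    return sum(valuation[tok] * value for tok, value in tensor_sum)

agg_a = [("x", 20), ("y", 10)]             # x⊗20 + y⊗10
agg_b = [("q", 15), ("r", 10), ("s", 25)]  # q⊗15 + r⊗10 + s⊗25

all_present = {"x": 1, "y": 1, "q": 1, "r": 1, "s": 1}
without_y = dict(all_present, y=0)   # delete the tuple annotated y
doubled_r = dict(all_present, r=2)   # the tuple annotated r occurs twice

print(specialize_sum(agg_a, all_present), specialize_sum(agg_b, all_present))
print(specialize_sum(agg_a, without_y), specialize_sum(agg_b, doubled_r))
```

The stored expression has one summand per input tuple, which is the poly-size overhead the desiderata ask for, while each valuation yields a different concrete aggregate.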

Slide43

Further aspects of the framework

- Extension to tree data (Nested Relational Calculus, structural recursion on trees, unordered XQuery) [FGT08]
- Study of CQ/UCQ on provenance-annotated relations [G09]
- Extension to aggregates (poly-size overhead) [ADT11a]
- Poly-size provenance for Datalog (circuits; PosBool(X), Sorp(X)…) [DMRT14]
- Extension to data-dependent finite state processes [DMT15]
- Connections to semiring monad [FGT08, T13], to semimodules [ADT11a], to tensor products [ADT11a, DMT15]

Slide44

Negative information; non-monotone operations (difference)

- Boolean expressions [IL84]. Limited.
- Add a binary operation corresponding to difference:
  - m-semirings (common gen. of set and bag difference) [GP10]
  - spm-semirings (OPTIONAL in SPARQL) [GUKFC16]
- Encode difference by aggregation [ADT11a]
- Different equational theories, different algebraic optimizations [ADT11b]
- Still not clear how to track negative information. Useful: non-answers (why not?), insertion propagation.
- Logical model checking (“provenance of … truth?”): negation as duality (NNFs), logical games; ongoing work with Grädel [T16, T17]

Slide45

Current targets

ANALYTICS COMPUTATIONS
- “Fine-grained provenance for linear algebra operators”, Yan, T., Ives, TaPP 16

DISTRIBUTED SYSTEMS/NETWORK PROVENANCE
- “Time-aware provenance for distributed systems”, Zhou, Ding, Haeberlen, Ives, Loo, TaPP 11
- “Diagnosing missing events in distributed systems with negative provenance”, Wu, Zhao, Haeberlen, Zhou, Loo, SIGCOMM 14

STATIC ANALYSIS OF SOFTWARE
- “On abstraction refinement for program analyses in Datalog”, Zhang, Mangal, Grigore, Naik, PLDI 14

Slide46

Framework references (I) *

[GKT07] “Provenance semirings”, Green, Karvounarakis, Tannen, PODS 07.
[GKIT07] “Update exchange with mappings and provenance”, Green, Karvounarakis, Ives, Tannen, VLDB 07.
[FGT08] “Annotated XML: queries and provenance”, Foster, Green, Tannen, PODS 08.
[G09] “Containment of conjunctive queries on annotated relations”, Green, ICDT 09.
[GP10] “On database query languages for K-relations”, Geerts, Poggi, J. Appl. Logic 2010.

* See also companion paper in PODS 2017 proceedings.

Slide47

Framework references (II)

[ADT11a] “Provenance for aggregate queries”, Amsterdamer, Deutch, Tannen, PODS 11.
[ADT11b] “On the limitations of provenance for queries with difference”, Amsterdamer, Deutch, Tannen, TaPP 11.
[T13] “Provenance propagation in complex queries”, Tannen, Buneman Festschrift 2013.
[DMRT14] “Circuits for Datalog provenance”, Deutch, Milo, Roy, T., ICDT 14.
[DMT15] “Provenance-based analysis of data-centric processes”, Deutch, Moskovitch, Tannen, VLDB J. 2015.

Slide48

Framework references (III)

[GUKFC16] “Algebraic structures for capturing the provenance of SPARQL queries”, Geerts, Unger, Karvounarakis, Fundulaki, Christophides, JACM 2016.
[T16] “About the provenance of truth”, Tannen, Simons Inst. Website 16. https://simons.berkeley.edu/talks/val-tannen-2016-12-09
[T17] “Provenance analysis for FOL model checking”, Tannen, SIGLOG News 2017.

Slide49

Other references

[IL84] “Incomplete information in relational databases”, Imieliński, Lipski, JACM 1984.
[FR97] “A probabilistic relational algebra”, Fuhr, Rölleke, TOIS 1997.
[Z97] “Query evaluation in probabilistic relational databases”, Zimányi, DDS 1997.
[CWW00] “Tracing the lineage of view data in a warehousing environment”, Cui, Widom, Wiener, TODS 2000.
[BKT01] “Why and where: a characterization of data provenance”, Buneman, Khanna, Tan, ICDT 2001.
[BDHTW08] “Databases with uncertainty and lineage”, Benjelloun, Das Sarma, Halevy, Theobald, Widom, VLDB J. 2008.
[SORK11] “Probabilistic databases”, Suciu, Olteanu, Ré, Koch, SLDM 2011 [SuciuOlteanuRéKoch 11].

Slide50


Thank you!