Slide 1
The Semiring Framework for Database Provenance
Val Tannen
University of Pennsylvania
UW, 5/21/2018
Slide 2: Collaborators
ORCHESTRA
- TJ Green, RelationalAI (was @ LogicBlox)
- Grigoris Karvounarakis, RelationalAI (was @ LogicBlox)
- Zack Ives, University of Pennsylvania
Other core papers
- Nate Foster, Cornell University
- Yael Amsterdamer, Bar-Ilan University
- Daniel Deutch, Tel Aviv University
- Tova Milo, Tel Aviv University
- Sudeepa Roy, Duke University
- Yuval Moskovitch, Tel Aviv University
Slide 3: Data provenance
- For many, tracking data provenance means maintaining an “activity log”: record accessing of data items.
- Many are satisfied, depending on what is recorded in each log entry!
- A new, both declarative (logic-based) and processing-based, perspective arose in databases in the last 15+ years with the work of Peter Buneman and others, including us.
- This new perspective allows us to be more ambitious about applications of provenance analysis. (As in algorithmic analysis.)
Slide 4: Binary trust
Sue’s notes * (food): (cat, mouse), (cat, rat)
Val’s notes * (color): (mouse, gray), (mouse, red), (rat, gray)
Zack ** (computation): (cat, gray), (cat, red)
[Figure: every tuple carries a Yes/No trust label; the assignment of labels to tuples is not recoverable from the extracted text.]
* Sue and Val are noted zoologists. ** Zack is a noted computational zoologist.
Slide 5: Access control
Clearance levels: Pub < Conf < Sec < TSec
Sue’s notes: (cat, mouse) Pub; (cat, rat) Pub
Val’s notes: (mouse, gray) TSec; (mouse, red) TSec; (rat, gray) Conf
Zack (computation): (cat, gray) Conf; (cat, red) TSec
Slide 6: Confidence scores (non-binary trust)
Sue’s notes: (cat, mouse) 0.9; (cat, rat) 0.9
Val’s notes: (mouse, gray) 0.6; (mouse, red) 0.1; (rat, gray) 0.8
Zack (computation): (cat, gray) 0.72; (cat, red) 0.09
0.72 = max(0.9 × 0.8, 0.9 × 0.6)
0.09 = 0.9 × 0.1
Slide 7: A simple model for data pricing
Sue’s notes: (cat, mouse) $10; (cat, rat) $10
Val’s notes: (mouse, gray) $6; (mouse, red) $1; (rat, gray) $8
Zack (computation): (cat, gray) $16; (cat, red) $11
16 = min(10 + 8, 10 + 6)
11 = 10 + 1
Slide 8: Do it once and use it repeatedly: provenance
- Label (annotate) input items abstractly with provenance tokens.
- Provenance tracking: propagate expressions (involving tokens) to annotate intermediate data and, finally, outputs.
- Track two distinct ways of using data items by computation primitives:
  - jointly (this alone is basically like keeping a log)
  - alternatively (doing both is essential; think trust)
- Input-output compositional; modular (in the primitives).
- Later, we want to evaluate the provenance expressions to obtain binary trust, access control, confidence scores, data prices, etc.
Slide 9: Database Provenance
View V = Query(R)
R (A B C):
a b c
d b e
f g e
V (A C):
a c
a e
d c
d e
f e
[Figure: two “?” marks between V and R.]
Slide 10: Provenance by propagating annotation of data items
JOIN (on B)
R (A B C): … (a, b, c) annotated p …
S (D B E): … (d, b, e) annotated r …
R ⋈ S (A B C D E): … (a, b, c, d, e) annotated p · r …
The annotation p · r means joint use of data annotated by p and data annotated by r.
Slide 11: Another way to propagate annotations
UNION
R (A B C): … (a, b, c) annotated p …
S (A B C): … (a, b, c) annotated r …
R ∪ S (A B C): … (a, b, c) annotated p + r …
The annotation p + r means alternative use of data.
Slide 12: Another use of +
PROJECT
R (A B C): … (a, b, c1) annotated p; (a, b, c2) annotated r; (a, b, c3) annotated s …
π_AB R (A B): … (a, b) annotated p + r + s …
+ means alternative use of data.
Slide 13: Example: a positive query (SPJU / UCQ / nonrecursive Datalog)
V = σ_{C=e} π_AC( π_AC R ⋈ π_BC R ∪ π_AB R ⋈ π_BC R )
R (A B C):
a b c   p
d b e   r
f g e   s
V (A C):
a c   (p · p + p · p) · 0
a e   (p · r) · 1
d c   (r · p) · 0
d e   (r · r + r · s + r · r) · 1
f e   (s · s + s · r + s · s) · 1
For selection we multiply with two special annotations, 0 and 1.
Slide 14: Summary so far
- A space of annotations, K.
- K-relations: every tuple annotated with some element from K.
- Binary operations on K: · corresponds to joint use (join), and + corresponds to alternative use (union and projection).
- We assume K contains special annotations 0 and 1. “Absent” tuples are annotated with 0! 1 is a “neutral” annotation (no restrictions).
Algebra of annotations? What are the laws of (K, +, ·, 0, 1)?
Slide 15: Annotated relational algebra
DBMS query optimizers assume certain equivalences:
- union is associative, commutative
- join is associative, commutative, distributes over union
- projections and selections commute with each other and with union and join (when applicable)
- etc., but no R ⋈ R = R ∪ R = R (i.e., no idempotence, to allow for bag semantics)
Equivalent queries should produce the same annotations!
Proposition. The above identities hold for queries on K-relations iff (K, +, ·, 0, 1) is a commutative semiring.
Hence, for each commutative semiring K we have a K-annotated relational algebra.
Slide 16: What is a commutative semiring?
An algebraic structure (K, +, ·, 0, 1) where:
- K is the domain
- + is associative, commutative, with 0 identity
- · is associative, with 1 identity
- · distributes over +
- a · 0 = 0 · a = 0
(the items above define a semiring)
- · is also commutative
Unlike a ring, there is no requirement for inverses to +.
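The laws listed above can be spot-checked mechanically. The following is a minimal Python sketch, not part of the talk: the names `Semiring`, `satisfies_laws`, `BOOL`, `NAT` and `MAX_PLUS_BAD` are illustrative assumptions. Checking a finite sample can only refute the laws, never prove them.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class Semiring:
    """A candidate structure (K, +, ·, 0, 1); the domain K is left implicit."""
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]
    zero: Any
    one: Any


def satisfies_laws(K: Semiring, samples: Iterable[Any]) -> bool:
    """Check the commutative-semiring laws on every triple drawn from samples."""
    elems = list(samples)
    for a, b, c in product(elems, repeat=3):
        laws = [
            K.add(K.add(a, b), c) == K.add(a, K.add(b, c)),  # + associative
            K.add(a, b) == K.add(b, a),                      # + commutative
            K.add(a, K.zero) == a,                           # 0 identity for +
            K.mul(K.mul(a, b), c) == K.mul(a, K.mul(b, c)),  # · associative
            K.mul(a, b) == K.mul(b, a),                      # · commutative
            K.mul(a, K.one) == a,                            # 1 identity for ·
            K.mul(a, K.add(b, c)) == K.add(K.mul(a, b), K.mul(a, c)),  # distributivity
            K.mul(a, K.zero) == K.zero,                      # 0 annihilates
        ]
        if not all(laws):
            return False
    return True


# Two structures that pass, and one (hypothetical) structure that fails annihilation.
BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
MAX_PLUS_BAD = Semiring(max, lambda a, b: a + b, 0, 0)

print(satisfies_laws(BOOL, [False, True]))
print(satisfies_laws(NAT, range(4)))
print(satisfies_laws(MAX_PLUS_BAD, range(4)))
```

`MAX_PLUS_BAD` uses max as + and ordinary addition as · over the naturals with 0 in both identity roles; it fails because a · 0 = a + 0 = a rather than 0.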
Slide 17: But are there useful commutative semirings?
They capture several semantics:
- (B, ∨, ∧, ⊥, ⊤): set semantics
- (ℕ, +, ∙, 0, 1): bag semantics
- (BoolExp(X), ∨, ∧, ⊥, ⊤): conditional tables (c-tables) [ImielinskiLipski 84]
- (P(Ω), ∪, ∩, ∅, Ω): probabilistic event tables [FuhrRölleke 97, Zimányi 97]
Slide 18: Using the commutative semiring axioms (I)
R (A B C):
a b c   p
d b e   r
f g e   s
V (A C):
a c   (p · p + p · p) · 0
a e   p · r · 1
d c   r · p · 0
d e   (r · r + r · s + r · r) · 1
f e   (s · s + s · r + s · s) · 1
Slide 19: Using the commutative semiring axioms (II)
R (A B C):
a b c   p
d b e   r
f g e   s
V (A C):
a e   p · r
d e   r · r + r · s + r · r
f e   s · s + s · r + s · s
Slide 20: Using … axioms (III): provenance polynomials
R (A B C):
a b c   p
d b e   r
f g e   s
V (A C):
a e   pr
d e   2r² + rs
f e   rs + 2s²
Multivariate polynomials with
- annotation tokens as indeterminates p, r, s
- coefficients in ℕ.
Polynomials capture a remarkably general form of provenance.
Slide 21: Provenance polynomials (I)
V = σ_{C=e} π_AC( π_AC R ⋈ π_BC R ∪ π_AB R ⋈ π_BC R )
R (A B C):
a b c   p
d b e   r
f g e   s
V (A C):
a e   pr
d e   2r² + rs
f e   rs + 2s²
Provenance reading: three ways to derive (d e):
- two of them use r, twice,
- the third uses r and s, once each.
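The polynomials in the table can be recomputed by running the query over relations annotated with elements of ℕ[X]. The sketch below is one possible Python encoding under stated assumptions: the helper names (`token`, `p_add`, `p_mul`, `project`, `join`, `union`, `select`) are illustrative, a polynomial is a `Counter` from monomials to coefficients, and selection drops tuples instead of multiplying by 0, since absent tuples are annotated 0 anyway. The projection on AC is applied to each join before the union, which is one of the equivalences the framework guarantees.

```python
from collections import Counter
from itertools import product

# A polynomial in N[X]: Counter mapping a monomial (sorted tuple of tokens,
# repeated according to their exponent) to its natural-number coefficient.
def token(x):
    return Counter({(x,): 1})

def p_add(a, b):
    return a + b

def p_mul(a, b):
    out = Counter()
    for (m1, c1), (m2, c2) in product(a.items(), b.items()):
        out[tuple(sorted(m1 + m2))] += c1 * c2
    return out

# A K-relation: (schema, {row: annotation}); absent rows are implicitly annotated 0.
def project(rel, attrs):
    schema, rows = rel
    idx = [schema.index(a) for a in attrs]
    out = {}
    for row, ann in rows.items():
        key = tuple(row[i] for i in idx)
        out[key] = p_add(out.get(key, Counter()), ann)  # alternative use: +
    return (tuple(attrs), out)

def join(r, s):
    (rs, rrows), (ss, srows) = r, s
    common = [a for a in rs if a in ss]
    extra = [a for a in ss if a not in rs]
    out = {}
    for (t1, a1), (t2, a2) in product(rrows.items(), srows.items()):
        if all(t1[rs.index(a)] == t2[ss.index(a)] for a in common):
            key = t1 + tuple(t2[ss.index(a)] for a in extra)
            out[key] = p_add(out.get(key, Counter()), p_mul(a1, a2))  # joint use: ·
    return (tuple(rs) + tuple(extra), out)

def union(r, s):
    # Assumes both relations have the same schema.
    out = dict(r[1])
    for row, ann in s[1].items():
        out[row] = p_add(out.get(row, Counter()), ann)
    return (r[0], out)

def select(rel, attr, value):
    schema, rows = rel
    i = schema.index(attr)
    return (schema, {row: ann for row, ann in rows.items() if row[i] == value})

R = (("A", "B", "C"), {
    ("a", "b", "c"): token("p"),
    ("d", "b", "e"): token("r"),
    ("f", "g", "e"): token("s"),
})
AC, BC, AB = project(R, ("A", "C")), project(R, ("B", "C")), project(R, ("A", "B"))
left = project(join(AC, BC), ("A", "C"))
right = project(join(AB, BC), ("A", "C"))
V = select(union(left, right), "C", "e")[1]

for row, poly in sorted(V.items()):
    print(row, dict(poly))
```

In this encoding the monomial `("r", "r")` with coefficient 2 stands for 2r², and `("r", "s")` with coefficient 1 stands for rs.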
Slide 22: Provenance polynomials (II)
- (ℕ[X], +, ·, 0, 1) is the commutative semiring freely generated by X (universality property involving homomorphisms).
- Provenance polynomials are PTIME-computable (data complexity). (Query complexity depends on language and representation.)
- ORCHESTRA provenance (graph representation): about 30% overhead.
- Monomials correspond to logical derivations (proof trees in non-rec. Datalog).
Slide 23: Low-hanging fruit: deletion propagation
R (A B C):
a b c   p
d b e   r
f g e   s
Q(R) (A C):
a e   pr
d e   2r² + rs
f e   rs + 2s²
Delete (d b e) from R? Set r = 0!
a e   0
d e   0
f e   2s²
Result after deletion (A C):
f e   2s²
We used this in Orchestra for update propagation.
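Setting r = 0 can be sketched in a few lines of Python. The encoding is an assumption made for illustration: a polynomial is a `Counter` from monomials (tuples of tokens) to coefficients, and the function names `set_token_to_zero` and `propagate_deletion` are hypothetical.

```python
from collections import Counter

# View annotations from the slide: pr, 2r² + rs, rs + 2s².
V = {
    ("a", "e"): Counter({("p", "r"): 1}),
    ("d", "e"): Counter({("r", "r"): 2, ("r", "s"): 1}),
    ("f", "e"): Counter({("r", "s"): 1, ("s", "s"): 2}),
}

def set_token_to_zero(poly, tok):
    """Substitute tok = 0: every monomial that mentions tok vanishes."""
    return Counter({m: c for m, c in poly.items() if tok not in m})

def propagate_deletion(view, tok):
    """Drop view tuples whose provenance becomes 0 once tok is deleted."""
    updated = {row: set_token_to_zero(poly, tok) for row, poly in view.items()}
    return {row: poly for row, poly in updated.items() if poly}

after = propagate_deletion(V, "r")
print(after)
```

Only (f e) survives, annotated 2s², without re-running the query on the updated R.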
Slide 24: Wrong answers: diagnostic and repairs
Zack(x,z) :- Sue(x,y), Val(y,z)
Sue’s notes: (cat, mouse) p; (cat, rat) q
Val’s notes: (mouse, gray) r; (mouse, red) s; (rat, gray) t
Zack: (cat, gray) p∙r + q∙t; (cat, red) p∙s
On planet Erythro, mice and rats are red! So (cat, gray) is a wrong answer!
Diagnostic: provenance of the wrong answer: p∙r + q∙t
Four minimal ways to make p∙r + q∙t = 0:
- p = q = 0
- r = q = 0
- p = t = 0
- r = t = 0
(maybe choose based on confidence or cost)
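The four minimal repairs are the minimal sets of tokens that hit every monomial of the provenance polynomial. The following brute-force Python sketch illustrates this; `minimal_repairs` is a hypothetical name, monomials are encoded as sets of tokens, and the search is exponential in the number of tokens, so it is only meant for small examples.

```python
from itertools import combinations

def minimal_repairs(monomials):
    """Return the minimal token sets whose deletion zeroes every monomial."""
    tokens = sorted(set().union(*monomials))
    found = []
    # Enumerate candidates by increasing size so that minimal sets are found first.
    for k in range(1, len(tokens) + 1):
        for cand in combinations(tokens, k):
            s = set(cand)
            hits_all = all(s & m for m in monomials)
            if hits_all and not any(f <= s for f in found):
                found.append(s)
    return found

# Provenance of the wrong answer (cat, gray): p∙r + q∙t
wrong_answer = [{"p", "r"}, {"q", "t"}]
repairs = minimal_repairs(wrong_answer)
print(repairs)
```

Choosing among the repairs by confidence or cost, as the slide suggests, would amount to ranking these sets with a separate scoring function.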
Slide 25: Specialize provenance for access control
Zack(x,z) :- Sue(x,y), Val(y,z)
(A, min, max, 0, Pub) where A = Pub < Conf < Sec < TSec < 0
Valuation X → A: p, q ↦ Pub; r, s ↦ TSec; t ↦ Conf
eval: ℕ[X] → A
eval(pr + qt) = Conf
eval(ps) = TSec
Sue’s notes: (cat, mouse) p, Pub; (cat, rat) q, Pub
Val’s notes: (mouse, gray) r, TSec; (mouse, red) s, TSec; (rat, gray) t, Conf
Zack: (cat, gray) pr + qt, Conf; (cat, red) ps, TSec
Slide 26: Specialize provenance for confidence scores
Zack(x,z) :- Sue(x,y), Val(y,z)
V = ([0,1], max, ∙, 0, 1) the Viterbi semiring
Sue’s notes: (cat, mouse) p, 0.9; (cat, rat) q, 0.9
Val’s notes: (mouse, gray) r, 0.6; (mouse, red) s, 0.1; (rat, gray) t, 0.8
Zack: (cat, gray) pr + qt, 0.72; (cat, red) ps, 0.09
Slide 27: Some application semirings
- (B, ∧, ∨, ⊤, ⊥): binary trust
- (ℕ, +, ·, 0, 1): multiplicity (number of derivations)
- (A, min, max, 0, Pub): access control
- V = ([0,1], max, ∙, 0, 1): Viterbi semiring (HMM), confidence scores
- T = ([0, ∞], min, +, ∞, 0): tropical semiring (shortest paths), data pricing
- F = ([0,1], max, min, 0, 1): “fuzzy logic” semiring
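Specialization means evaluating one provenance polynomial in different application semirings. The Python sketch below evaluates pr + qt and ps from the zoology example in the access control, Viterbi and tropical semirings. The names (`Semiring`, `evaluate`, the level constants) are illustrative assumptions, and the assignment of 0.6/0.8 and $6/$8 to r and t is a guess at the figure layout that does not affect the results.

```python
import math
from collections import namedtuple

Semiring = namedtuple("Semiring", "add mul zero one")

def evaluate(poly, K, valuation):
    """Evaluate a polynomial {monomial: coefficient} in semiring K."""
    total = K.zero
    for monomial, coeff in poly.items():
        term = K.one
        for tok in monomial:
            term = K.mul(term, valuation[tok])
        for _ in range(coeff):  # a coefficient c means the term is added c times
            total = K.add(total, term)
    return total

cat_gray = {("p", "r"): 1, ("q", "t"): 1}  # pr + qt
cat_red = {("p", "s"): 1}                  # ps

# Access control: Pub < Conf < Sec < TSec < 0 ("no access"), encoded as 0..4.
PUB, CONF, SEC, TSEC, NO_ACCESS = range(5)
ACCESS = Semiring(min, max, NO_ACCESS, PUB)
VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0)

clearance = dict(p=PUB, q=PUB, r=TSEC, s=TSEC, t=CONF)
confidence = dict(p=0.9, q=0.9, r=0.6, s=0.1, t=0.8)
price = dict(p=10, q=10, r=6, s=1, t=8)

print(evaluate(cat_gray, ACCESS, clearance), evaluate(cat_red, ACCESS, clearance))
print(evaluate(cat_gray, VITERBI, confidence), evaluate(cat_red, VITERBI, confidence))
print(evaluate(cat_gray, TROPICAL, price), evaluate(cat_red, TROPICAL, price))
```

The provenance is computed once; each application only changes the semiring and the valuation of the tokens.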
Slide 28: A menagerie of provenance semirings
- (Which(X), ∪, ∪*, ∅, ∅*): sets of contributing tuples; “Lineage” (1) [CWW00]
- (Why(X), ∪, ⋓, ∅, {∅}): sets of sets of …; witness why-provenance [BKT01]
- (PosBool(X), ∧, ∨, ⊤, ⊥): minimal sets of sets of …; minimal witness why-provenance [BKT01], also “Lineage” (2) used in probabilistic dbs [SORK11]
- (Trio(X), +, ·, 0, 1): bags of sets of …; “Lineage” (3) [BDHTW08, G09]
- (B[X], +, ·, 0, 1): sets of bags of …; Boolean coeff. polynomials [G09]
- (Sorp(X), +, ·, 0, 1): minimal sets of bags of …; absorptive polynomials [DMRT14]
- (ℕ[X], +, ·, 0, 1): bags of bags of …; universal provenance polynomials [GKT07]
Slide 29: Two kinds of semirings in this framework
Provenance semirings, e.g.,
- (ℕ[X], +, ·, 0, 1): provenance polynomials [GKT07]
- (Why(X), ∪, ⋓, ∅, {∅}): witness why-provenance [BKT01]
Application semirings, e.g.,
- (A, min, max, 0, Pub): access control [FGT08]
- V = ([0,1], max, ∙, 0, 1): Viterbi semiring (HMM) [GKIT07]
Provenance specialization relies on
- Provenance semirings are freely generated by provenance tokens
- Query commutation with semiring homomorphisms
Slide 30: Query commutation with homomorphisms
Given a query in QL and a homomorphism h: K1 → K2, the following square commutes:
K1-Rel --query--> K1-Rel
  | h                | h
K2-Rel --query--> K2-Rel
QL = RA+, Datalog [GKT07] and extensions [FGT08, GP10, ADT11a, T13, DMT15, GUKFC16, T17]
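Slide 31: A Hierarchy of Provenance Semirings [G09, DMRT14]
[Figure: diagram of the semirings ℕ[X], B[X], Trio(X), Sorp(X), Why(X), PosBool(X), Which(X), ordered from most informative to least informative. Each arrow is a surjective semiring homomorphism, identity on X, labeled with idempotence of +, idempotence of ·, or absorption (ab + a = a). The application semirings A, T, V, ℕ, B are marked next to the hierarchy; the exact arrows and placements are not recoverable from the extracted text.]
Example:
- ℕ[X]: 2x²y + xy + 5y² + xz
- B[X]: x²y + xy + y² + xz
- Trio(X): 3xy + 5y + xz
- Sorp(X): xy + y² + xz
- Why(X): xy + y + xz
- PosBool(X): y + xz
- Which(X): xyz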
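The example values can be reproduced by mapping the ℕ[X] polynomial 2x²y + xy + 5y² + xz down the hierarchy. This is a hedged Python sketch: the function names and the encoding (a `Counter` from monomials, written as tuples of (variable, exponent) pairs, to coefficients) are assumptions made for illustration, and only nonzero polynomials are handled.

```python
from collections import Counter

# 2x²y + xy + 5y² + xz in N[X]
P = Counter({
    (("x", 2), ("y", 1)): 2,
    (("x", 1), ("y", 1)): 1,
    (("y", 2),): 5,
    (("x", 1), ("z", 1)): 1,
})

def to_bool_poly(p):
    """N[X] -> B[X]: forget coefficients (+ becomes idempotent)."""
    return set(p)

def to_trio(p):
    """N[X] -> Trio(X): forget exponents (· becomes idempotent), keep multiplicities."""
    out = Counter()
    for m, c in p.items():
        out[frozenset(v for v, _ in m)] += c
    return out

def to_why(p):
    """N[X] -> Why(X): forget both coefficients and exponents."""
    return {frozenset(v for v, _ in m) for m in p}

def divides(m1, m2):
    exps = dict(m2)
    return all(exps.get(v, 0) >= e for v, e in m1)

def to_sorp(p):
    """B[X] -> Sorp(X): absorption keeps only monomials minimal under divisibility."""
    monos = set(p)
    return {m for m in monos if not any(n != m and divides(n, m) for n in monos)}

def to_posbool(p):
    """Why(X) -> PosBool(X): absorption keeps only minimal witnesses."""
    w = to_why(p)
    return {s for s in w if not any(t < s for t in w)}

def to_which(p):
    """Lineage: the set of all contributing tokens."""
    return frozenset(v for m in p for v, _ in m)

print(to_trio(P))
print(to_sorp(P))
print(to_posbool(P))
print(to_which(P))
```

Each function forgets strictly more information than the one it builds on, which matches the "most informative" to "least informative" ordering of the slide.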
Slide 33: From RA+ to Datalog
- Immediate consequence operator F of a Datalog program. Incorporates the edb predicates, maps idb predicates to idb predicates.
- It’s expressible in RA+. E.g., transitive closure: F(T) = E ∪ π_{1,3}(E ⋈ T)
- Generalize to F: (K-Rel)^n → (K-Rel)^n (n = # of idb predicates)
- Solve certain (systems of) least fixed point equations over K-relations: T = F(T)
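The fixed point equation T = F(T) for transitive closure can be solved by plain iteration when the semiring allows it. The sketch below is an illustrative Python version, assuming the tropical semiring (min, +) over non-negative integer edge costs, where the iteration stabilizes even on a graph with a loop; `transitive_closure` and the sample edges are hypothetical. Over (ℕ, +, ·, 0, 1) the same loop would not terminate on a cyclic graph.

```python
import math

def transitive_closure(edges, add, mul, zero):
    """Iterate T := E ∪ π_{1,3}(E ⋈ T) from the empty relation until T stops changing."""
    T = {}
    while True:
        new = dict(edges)  # the E part of F(T)
        for (x, y), e in edges.items():
            for (y2, z), t in T.items():
                if y == y2:  # join on the middle column, then project it away
                    new[(x, z)] = add(new.get((x, z), zero), mul(e, t))
        if new == T:
            return T
        T = new

# Edge costs; c -> a closes a loop.
E = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 5, ("c", "a"): 1}
T = transitive_closure(E, add=min, mul=lambda u, v: u + v, zero=math.inf)

for pair in sorted(T):
    print(pair, T[pair])
```

With these costs the annotation of (a, c) is the cheapest of the direct edge and the two-step path through b.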
Slide 34: Find provenances of idb tuples
Slide 35: Provenance of idb tuples
- introduce unknowns Z for the annotations of idb tuples
- solve system of fixed point equations over K; right-hand sides are polynomials in K[Z].
Additional structure on K for these to have (unique) solutions?
Slide 36: ω-continuous semirings
Semirings K such that the immediate consequence operator of any Datalog program has a least fixpoint on K-relations.
- Naturally ordered: when the relation x ≤ y iff there exists z s.t. x + z = y is an order relation (all semirings seen here are naturally ordered)
- ω-complete: also, chains x0 ≤ x1 ≤ … ≤ xn ≤ … have l.u.b.’s (sup’s)
- ω-continuous: moreover, + and · preserve those l.u.b.’s
Slide 37: Among our examples
- Many of the semirings that interest us, B, T, V, A, F, are already ω-continuous.
- (ℕ, +, ·, 0, 1) is not, but its “completion” (ℕ^∞ = ℕ ∪ {∞}, +, ·, 0, 1) is.
- For provenance, the completion of ℕ[X] is not ℕ^∞[X]. Instead of (finite) polynomials we need (possibly infinite) formal power series. They form an ω-continuous semiring ℕ^∞[[X]]. Monomials still correspond to derivation trees. (Even transitive closure has infinitely many derivation trees if E has loops.)
- The completion of B[X] is B[[X]].
Slide 38: Solution
Slide 39: Absorptive polynomials
- Most informative provenance semiring for Datalog: (ℕ^∞[[X]], +, ·, 0, 1). (Infinite power series have finite representations as systems of polynomial equations.)
- Absorption: a + ab = a
- Absorptive polynomials Sorp(X): Boolean coefficients but only minimal degree monomials, e.g., x²y + xy + y² + xz becomes xy + y² + xz
- Absorptive power series are the same as absorptive polynomials! Why? Order monomials by degree of each variable. In this infinite poset all antichains are finite! (Dickson’s Lemma)
- Sorp(X) is already ω-continuous: provides provenance polynomials for Datalog.
- So is PosBool(X), but Sorp(X) provenance also supports tropical and Viterbi semiring applications.
Slide 40: Provenance for aggregation
Input (D S), with annotations:
a 20   x
a 10   y
b 15   q
b 10   r
b 25   s
SUM S GROUP BY D gives (D S-agg):
a 20+10         ?
b 15+10+25      ?
Desiderata
- Compatibility with set/bag semantics
- Fundamental property (commutation with homomorphisms)
- Poly-size overhead! 1 + 2 + 4 + … + 2^(n-1) => 2^n results
Slide 41: Solution inspired by (semi) linear algebra
Input (D S), with annotations:
a 20   x
a 10   y
b 15   q
b 10   r
b 25   s
Output (D S-agg):
a   x 20 + y 10           ?
b   q 15 + r 10 + s 25    ?
- (ℝ, +, 0) is not a Prov(X)-semimodule, but…
- (K-Rel, ∪, ∅) is a K-semimodule with the singletons as basis.
- Relations are the result of ∪-aggregation!
- What if (ℝ, +, 0) were a Prov(X)-semimodule?
Slide 42: Tensor product construction
Output (D S-agg), with annotations:
a   x⊗20 + y⊗10             x + y
b   q⊗15 + r⊗10 + s⊗25      q + r + s
- Embed a commutative monoid M (for sum, max or min) into a K-semimodule K ⊗ M (new values!)
- Consistency: embedding should be faithful.
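One way to read a formal sum such as x⊗20 + y⊗10 is as a symbolic aggregate that becomes an ordinary number once the tokens are mapped to multiplicities. The Python sketch below covers only that special case, under loudly stated assumptions: provenance coefficients are single tokens, the target semiring is ℕ (bag semantics), the monoid is (ℕ, +, 0) for SUM, and a tensor n⊗m then collapses to n·m. The names `specialize_sum`, `agg_a` and `agg_b` are hypothetical.

```python
# A formal sum in K ⊗ M, encoded as a list of (provenance token, value) pairs.
agg_a = [("x", 20), ("y", 10)]             # x⊗20 + y⊗10
agg_b = [("q", 15), ("r", 10), ("s", 25)]  # q⊗15 + r⊗10 + s⊗25

def specialize_sum(tensor, multiplicity):
    """Map each token to a natural number and collapse n⊗m to n*m (SUM over bags)."""
    return sum(multiplicity[tok] * value for tok, value in tensor)

# Every input tuple present once: ordinary SUM.
print(specialize_sum(agg_a, {"x": 1, "y": 1}))
# Delete the tuple annotated y (y = 0).
print(specialize_sum(agg_a, {"x": 1, "y": 0}))
# The tuple annotated x appears twice in a bag.
print(specialize_sum(agg_a, {"x": 2, "y": 1}))
print(specialize_sum(agg_b, {"q": 1, "r": 1, "s": 1}))
```

The symbolic aggregate stays linear in the size of the input, in contrast with enumerating the sums of all possible subsets.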
Slide 43: Further aspects of the framework
- Extension to tree data (Nested Relational Calculus, structural recursion on trees, unordered XQuery) [FGT08]
- Study of CQ/UCQ on provenance-annotated relations [G09]
- Extension to aggregates (poly-size overhead) [ADT11a]
- Poly-size provenance for Datalog (circuits; PosBool(X), Sorp(X)…) [DMRT14]
- Extension to data-dependent finite state processes [DMT15]
- Connections to semiring monad [FGT08, T13], to semimodules [ADT11a], to tensor products [ADT11a, DMT15]
Slide 44: Negative information; non-monotone operations (difference)
- Boolean expressions [IL84]. Limited.
- Add a binary operation corresponding to difference:
  - m-semirings (common gen. of set and bag difference) [GP10]
  - spm-semirings (OPTIONAL in SPARQL) [GUKFC16]
- Encode difference by aggregation [ADT11a]
- Different equational theories, different algebraic optimizations [ADT11b]
- Still not clear how to track negative information. Useful for: non-answers (why not?), insertion propagation.
- Logical model checking (“provenance of … truth?”): negation as duality (NNFs), logical games; ongoing work with Grädel [T16, T17]
Slide 45: Current targets
ANALYTICS COMPUTATIONS
- “Fine-grained provenance for linear algebra operators”, Yan, T., Ives, TaPP 16
DISTRIBUTED SYSTEMS/NETWORK PROVENANCE
- “Time-aware provenance for distributed systems”, Zhou, Ding, Haeberlen, Ives, Loo, TaPP 11
- “Diagnosing missing events in distributed systems with negative provenance”, Wu, Zhao, Haeberlen, Zhou, Loo, SIGCOMM 14
STATIC ANALYSIS OF SOFTWARE
- “On abstraction refinement for program analyses in Datalog”, Zhang, Mangal, Grigore, Naik, PLDI 14
Slide 46: Framework references (I) *
- [GKT07] “Provenance semirings”, Green, Karvounarakis, Tannen, PODS 07.
- [GKIT07] “Update exchange with mappings and provenance”, Green, Karvounarakis, Ives, Tannen, VLDB 07.
- [FGT08] “Annotated XML: queries and provenance”, Foster, Green, Tannen, PODS 08.
- [G09] “Containment of conjunctive queries on annotated relations”, Green, ICDT 09.
- [GP10] “On database query languages for K-relations”, Geerts, Poggi, J. Appl. Logic 2010.
* See also companion paper in PODS 2017 proceedings.
Slide 47: Framework references (II)
- [ADT11a] “Provenance for aggregate queries”, Amsterdamer, Deutch, Tannen, PODS 11.
- [ADT11b] “On the limitations of provenance for queries with difference”, Amsterdamer, Deutch, Tannen, TaPP 11.
- [T13] “Provenance propagation in complex queries”, Tannen, Buneman Festschrift 2013.
- [DMRT14] “Circuits for Datalog provenance”, Deutch, Milo, Roy, Tannen, ICDT 14.
- [DMT15] “Provenance-based analysis of data-centric processes”, Deutch, Moskovitch, Tannen, VLDB J. 2015.
Slide 48: Framework references (III)
- [GUKFC16] “Algebraic structures for capturing the provenance of SPARQL queries”, Geerts, Unger, Karvounarakis, Fundulaki, Christophides, JACM 2016.
- [T16] “About the provenance of truth”, Tannen, Simons Inst. website 16. https://simons.berkeley.edu/talks/val-tannen-2016-12-09
- [T17] “Provenance analysis for FOL model checking”, Tannen, SIGLOG News 2017.
Slide 49: Other references
- [IL84] “Incomplete information in relational databases”, Imieliński, Lipski, JACM 1984.
- [FR97] “A probabilistic relational algebra”, Fuhr, Rölleke, TOIS 1997.
- [Z97] “Query evaluation in probabilistic relational databases”, Zimányi, DDS 1997.
- [CWW00] “Tracing the lineage of view data in a warehousing environment”, Cui, Widom, Wiener, TODS 2000.
- [BKT01] “Why and where: a characterization of data provenance”, Buneman, Khanna, Tan, ICDT 2001.
- [BDHTW08] “Databases with uncertainty and lineage”, Benjelloun, Das Sarma, Halevy, Theobald, Widom, VLDB J. 2008.
- [SORK11] “Probabilistic databases”, Suciu, Olteanu, Ré, Koch, SLDM 2011.
Slide 50
Thank you!