/
CS 519:  Big Data Exploration and Analytics CS 519:  Big Data Exploration and Analytics

CS 519: Big Data Exploration and Analytics - PowerPoint Presentation

blindnessinfluenced
blindnessinfluenced . @blindnessinfluenced
Follow
346 views
Uploaded On 2020-10-01

CS 519: Big Data Exploration and Analytics - PPT Presentation

Review Relational Query Languages Announcements The due data for selecting papers is Tuesday Your top 5 choices 2 Relational model and query languages Relational model defines data organization ID: 812913

actor movie year plays movie actor plays year mid aid query relational gross total title datalog homomorphism queries database

Share:

Link:

Embed:

Download Presentation from below link

Download The PPT/PDF document "CS 519: Big Data Exploration and Analyt..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

CS 519: Big Data Exploration and Analytics

Review: Relational Query Languages

Slide2

Announcements

The due data for selecting papers is Tuesday.

Your top 5 choices

2

Slide3

Relational model and query languages

Relational model defines data organization

Relational query languages define data retrieval/manipulation operations.

3

Slide4

4

Relational model

Title Price Category Year

MySQL $102.1 computer 2001

Cell biology $201.69 biology 1954

French cinema $53.99 art 2002

NBA History $63.65 sport 2010

tuples

Attribute names

Relation name

Book:

Slide5

Relational Model

Attributes

Atomic values

Domain: string, integer, real, Keys: no duplicate valuesEach relation must have keys A relation does not contain duplicate tuples.

5

Slide6

Database Schema vs. Database Instance

Schema: S

:

Book(Title, Price, Category, Year)Instance:

Values of each attribute A in I:

active domain

of A,

adom

(A)

6

Title Price Category YearMySQL $102.1 computer 2001Cell biology $201.69 biology 1954French cinema $53.99 art 2002NBA History $63.65 sport 2010

Slide7

SQL

A declarative language for querying data stored in relational databases.

Much easier to use than procedural languages.

Say what instead of howSELECT returned attribute(s)FROM table(s)WHERE conditions on the tuples of the table(s)

7

Slide8

SQL Example

Movie(

id

, title, year, total-gross)Actor(id, name, b-year)Plays(mid, aid

)

What movies are made in 1998?

SELECT title

FROM movie

WHERE year = 1998;

8

Slide9

SQL Example

Find actors who played in a movie whose total gross is more than $2,000,000.

SELECT * FROM Actor, Movie, Plays WHERE

Movie.id

=

Plays.mid

AND

Plays.aid = Actor.id

AND total-gross > 2000000; 9Movie(mid, title, year, total-gross)Actor(aid, name, b-year)Plays(mid, aid)

Slide10

Formal Relational Query Languages

Formal languages that express queries over relational schemas.

Relational Algebra

Datalog (recursion-free with negation) Relational calculusUsed to explore the properties of relational model.Easier to use than SQL in some application domains.

10

Slide11

Relational Algebra (RA)

Used by RDBMS to execute queries

Six operators

Selection σ Projection ΠJoin ∞Union

Difference –

Renaming

ρ

(for named perspective)

11

Slide12

Relational Algebra

Selection

σ

σtitle=‘Fargo’ (Movie)Projection Π

Π

b

-year

(Actor)

Join ∞Movie ∞

id=mid Plays

12Movie(mid, title, year, total-gross)Actor(aid, name, b-year)Plays(mid, aid)

Slide13

Datalog

First created to support recursive queries over relational databases.

Easier to use than relational algebra.

Used extensively in research and industry Data integration, networking, logic programming, learning, distributed processing, …We talk about the recursion-free datalog.

13

Slide14

Datalog

Each tuple in database is a

fact

Movie(236878, ‘Godfather I’, 1972, 40000000) Movie(879900, ‘Godfather II’, 1974, 3900000) Actor(090988,’Robert De Niro’, 1943)

Each query is a

rule

Movies that were produced in 1998 and made more than $2,000.

Q1(y):- Movie(x,y,1998,z),z > 2000.

14

Movie(mid, title, year, total-gross)Actor(aid, name, b-year)Plays(mid, aid)

Slide15

Datalog Example

Actors who played in a movie whose total gross is more than $2,000.

Q2(y):- Actor(

x,y,z),Plays(t,x

),Movie(

t,v,w,f

),

f > 2000.

Actors who played in a movie whose total gross is more than $2,000 and a movie made in 1998.

Q3(y):- Actor(x,y,z

),Plays(t,x),Movie(t,v,w,f), f > 2000, Plays(g,x), Movie(g,l,1998,h).15Movie(mid, title, year, total-gross)Actor(aid, name, b-year)Plays(mid,

aid)

Slide16

Datalog

Q2(y):- Actor(

x,y,z

),Plays(t,x),Movie(t,v,w,f

),

f > 2000.

y: head variable; x, z, t : existential variables

Extensional Database Predicates (EDB)

Movie, Actor, Plays

Intentional Database Predicate (IDB) Q2

16headatom

atom

body

Movie(

mid

, title, year, total-gross)Actor(aid, name, b-year)Plays(mid, aid)

Slide17

Datalog programs

A collection of rules: union

Actors who played in a movie with gross of more than $2,000 or a movie made after 1990.

Q4(y):- Actor(x,y,z

),Plays(

t,x

),Movie(

t,v,w,f

),

f > 2000.Q4(y):- Actor(x,y,z

),Plays(t,x),Movie(t,v,w,f), w > 1990.17Movie(mid, title, year, total-gross)Actor(aid, name, b-year)

Plays(mid, aid)

Slide18

Views

Similar to views in SQL

Actors who played in a movie with gross of more than $2000 and in a movie with

‘Robert De Niro’.

V(

x,y,z

):- Actor(

x,y,z

), Plays(

t,x), Movie(t,v,w,f

), f > 20000.Q5(y):- V(x,y,z), Plays(t,x), Plays(t,f), Actor(f, ’Robert De Niro’, g, h).UnfoldingQ5(y):- Actor(x,y,z), Plays(t,x), Movie(t,v,w,f), f > 20000, Plays(u,x

), Plays(u,f),

Actor(f,’Robert De

Niro’,g, h).

18

Definition of VMovie(

mid, title, year, total-gross)Actor(aid, name, b-year)Plays(mid, aid)

Slide19

Datalog with negation

All actors who did not play in a movie with ‘Robert De

Niro

’.U(

x,y,z

):- Actor(

x,y,z

), Plays(

t,x), Plays(

t,f),

Actor(f, ’Robert De Niro’, g).Q6(y):- Actor(x,y,z), not U(x,y,z).19Movie(mid, title, year, total-gross)Actor(aid

, name, b-year)Plays(mid, aid

)

Slide20

Safe datalog rules

Unsafe

rules:

V(x,y,z):- Actor(x,y,1998), z > 200.

W(

x,y,z

):- Actor(

x,y,z

),

not

Plays(t,x).A datalog rule is safe if every variable appear in at least one positive predicate20Movie(mid, title, year, total-gross)Actor(aid, name, b-year)Plays(mid, aid

)

Slide21

Datalog to SQL

Non-recursive datalog with negation represents the core functionalities of SQL

We can translate each non-recursive datalog program to a core SQL query and vice versa.

21

Slide22

Equivalency Theorem

RA and non-recursive datalog with negation the same set of queries.

Relational queries.

22

Slide23

Conjunctive queries (CQ)

One datalog rule.

SELECT-DISTINCT-FROM-WHERE.

Select/project/join (σ, Π, ∞) fragment of RA.

Existential/ conjunctive fragment of RC

There is not any comparison operator (<, ≠, …) in CQ.

If used the family is called CQ

<

, CQ

≠, …

23

Slide24

CQ examples

Actors who played in “LTR”.

Q7(y):- Actor(x, y, z), Plays(t, x),

Movie(t, ’LTR’, w, f).Non-CQ

: Actors who played in some movies with only one actor.

24

Movie(

mid

, title, year, total-gross)

Actor(

aid, name, b-year)Plays(mid, aid)

Slide25

Query equivalency and containment

Interesting and long standing problems in query processing.

Queries q

1 and q2 are

equivalent

if and only if for every database instance I, q

1

(I) = q

2(I)

Shown as q1 q

2Query q1 is contained in q2 if and only if for every database instance I, q1(I) q2(I)Shown as q1 q2 25

Slide26

Containment examples

Is q

1

q2?q1(x):- R(

x,y

),R(

y,z

),R(

z,w).

q2(x):- R(x,y),R(

y,z).q1(x):-R(x,y),R(y,’Joe’).q2(x):-R(x,y),R(y,z).q1(x):- R(x,y),R(y,z),R(z,x).q2(x):- R(x,y

),R(y,x).

26

Slide27

Containment examples

Is q

1

q2?q1(x):- R(

x,y

),R(

y,y

).

q2(x):- R(x,y

),R(y,z),R(

z,t).27

Slide28

Query semantics

Rules based form of a CQ

q(u):- R1(u1),…,

R

n

(u

n

).

ui

is shorthand for (x, y,…, z).Valuation v is a total function from a set of variables to domain (dom) and identity on constants in the domain.28

Slide29

Query semantics

The set of variables in

q

is shown as var(q) e.g. var

(q

1

)

= {

x,y}

q1(x):- R(x,y),R(

y,y).The image of database instance I under query q, q(u):- R1(u1),…,Rn(un) is q(I) = {v(u)| v is a valuation over var

(q), for each }

29

Slide30

Query homomorphism

A

homomorphism

is a function from var(

q

2

) to

var

(q

1)

s.t. for each atom R(x, y, …) in the query q1 there is an atom R(h(x), h(y), …) in q2 .h leaves the constants in q2 intact.

Exampleq1(x):- R(

x,y),R(y,z

),R(z,w).

q2(x):- R(x,y),R(

y,z).We treat head variables, ‘x’, as constants, i.e., the same in q1 and q2 .30

Slide31

Homomorphism Theorem

Given CQs q

1

and q2, we have q1 q

2

if and only if there exists a homomorphism .

Example:

q1(x):-R(

x,y),R(

y,z),R(z,w). q2(x):-R(x,y),R(y,z).Since is a homomorphism, we have .31

Slide32

Homomorphism examples

q1(x):-R(

x,y

),R(y,’Joe’).q2(x):-R(

x,y

),R(

y,z

).

q1(x):- R(

x,y

),R(y,z),R(z,x).q2(x):- R(x,y),R(y,x).There is no homomorphism: 32

Slide33

Homomorphism examples

Is q

1

q2?q1(x):- R(

x,y

),R(

y,y

).

q2(x):- R(x,y

),R(y,z),R(

z,t).33

Slide34

Homomorphism Theorem

Given CQs

q

and q’, we have q

q

if and only if there exists a homomorphism .

Proof:

For each w in q(I), there is a valuation v that maps free tuples in q to I such that v(u)= w. Thus, h(v) will map free tuples in q’ into I and h(v(u’)) = w, where w is in q’(I). Using canonical instances: read the book page 117.34

Slide35

Checking containment

Check if there exists a homomorphism between queries.

The problem is NP-complete, proved by reducing from 3-SAT.

Since the size of queries are relatively small, the process is sufficiently fast.

35

Slide36

Query minimization

A conjunctive query

q

is minimal if for every other conjunctive query q’ , if q’ q,

q

has at least as many atoms as

q.

Examples:

q1(x):- R(x,z),R(

z,t),R(x,w). q2(x):- R(x,z),R(z,t),R(x,’Joe’). 36

Slide37

Query minimization algorithm

1. Remove an atom from

q

. Let’s call new query q’.2. We have q q’.

3. Check to see if

q

q,

if it is then remove the atom permanently. Example:

q1(x):- R(x,z

),R(z,t),R(x,w). q2(x):- R(x,z),R(z,t). We have a homomorphism from q1 to q2. 37

Slide38

Larger families: UCQ

CQ with union

Movies that were produced in 1998 or made more than

$2,000. Q1(y):- Movie(x,y,1998,z). Q1(y):- Movie(

x,y,z,t

), t > 2000.

We can extend homomorphism theorem for UCQs.

38

Movie(

mid

, title, year, total-gross)Actor(aid, name, b-year)Plays(mid, aid)

Slide39

Homomorphism Theorem for UCQ

Given UCQs and , we have

if and only if for every

there is a , such that Thus, we can use apply homomorphism theorem to each CQ in a UCQ to check the containment.

Containment checking for UCQs is NP-complete.

39

Slide40

Larger families: relational queries

Containment checking for relational queries is undecidable.

Proved by reduction from finite satisfiability problem:

Given a query, is there any (finite) database where the query as at least one answer.

40