
Presentation Transcript

Slide1

Real Time Analytics: Algorithms and Systems

Type: Tutorial Paper

Authors: Arun Kejriwal (Machine Zone Inc.), Sanjeev Kulkarni, Karthik Ramasamy (Twitter Inc.)

Presented by: Siddhant Kulkarni

Term: Fall 2015

Slide2

Motivation

In-depth overview of streaming analytics:

Applications

Algorithms

Platforms

Slide3

Contribution

Description of various types of data contributing to the field of Big Data

Social Media

IoT

Healthcare

Machine Data (cloud)

Connected Vehicles

Slide4
Slide5
Slide6

KATARA: Reliable Data Cleaning with Knowledge Bases and Crowdsourcing

Type: Demo Paper

Authors: Xu Chu, John Morcos, Ihab Ilyas, Paolo Papotti, Mourad Ouzzani, Nan Tang (Qatar Computing Research Institute), Yin Ye (Google)

Presented by: Siddhant Kulkarni

Term: Fall 2015

Slide7

Motivation

Issue with Data Cleaning

What are the external sources?

Problem with External Sources

Slide8

VINERy: A Visual IDE for Information Extraction

Presented by: Omar Alqahtani

Term: Fall 2015

Type: Demonstration Paper

Slide9

Authors

Yunyao Li, IBM Research Almaden

Elmer Kim, Treasure Data, Inc.

Marc A. Touchette, IBM Silicon Valley Lab

Ramiya Venkatachalam, IBM Silicon Valley Lab

Hao Wang, IBM Silicon Valley Lab

Slide10

Motivation

Extractor development remains a major bottleneck in satisfying the increasing demands of real-world applications based on IE.

Lowering the barrier to entry for extractor development becomes a critical requirement.

Slide11

Related Works

Previous work has focused on reducing the manual effort involved in extractor development.

WizIE is a promising wizard-like environment, but it still requires a non-trivial rule language.

Special-purpose systems.

Slide12

Contribution

VINERY, a Visual INtegrated Development Environment for Information extRaction, consists of:

The foundation of VINERY is VAQL, a visual programming language for information extraction.

VINERY embeds VAQL in a web-based visual IDE for constructing extractors, which are translated into AQL and executed.

VINERY includes a rich set of easily customizable pre-built extractors to help jump-start extractor development.

VINERY provides features to support the entire life cycle of extractor development.

Slide13

WADaR: Joint Wrapper and Data Repair

Authors: Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche

Department of Computer Science, Oxford University, United Kingdom

Dipartimento di Matematica, Informatica ed Economia, Università della Basilicata, Italy

marcello.buoncristiano@yahoo.it

Paper Type: Demo

Presented by: Ranjan_KY

Fall 2015

Slide14

Motivation

Web scraping (or wrapping) is a popular means for acquiring data from the web.

The current generation of tools has made scalable wrapper generation possible, enabling data acquisition processes that involve thousands of sources.

However, no scalable tools exist that support these tasks.

Slide15

Problem

Modern wrapper-generation systems leverage a number of features, ranging from HTML and visual structures to knowledge bases and micro-data.

Nevertheless, automatically generated wrappers often suffer from errors, resulting in under- or over-segmented data together with missing or spurious content.

Under- and over-segmentation of attributes are commonly caused by irregular HTML markup or by multiple attributes occurring within the same DOM node.

Incorrect column types are instead associated with a lack of domain knowledge, supervision, or micro-data during wrapper generation.

The degraded quality of the generated relations argues for means to repair both the data and the corresponding wrapper, so that future wrapper executions can produce cleaner data.

Slide16

Demonstration

WADaR takes as input a (possibly incorrect) wrapper and a target relation schema, and iteratively repairs both the generated relations and the wrapper by observing the output of the wrapper execution.

A key observation is that errors in the extracted relations are likely to be systematic, as wrappers are often generated from templated websites.

Slide17

WADaR’s repair process

(i) Annotating the extracted relations with standard entity recognizers;

(ii) computing Markov chains describing the most likely segmentation of attribute values in the records; and

(iii) inducing regular expressions that re-segment the input relation according to the given target schema and that can possibly be encoded back into the wrapper.
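To make step (ii) concrete, below is a minimal Python sketch of learning a first-order Markov chain over attribute labels and using it to score candidate segmentations; the label set and function names are illustrative, not from the paper.

    from collections import Counter, defaultdict

    # Learn first-order transition probabilities over the attribute labels
    # that entity recognizers assigned to record tokens (step (i) output).
    def learn_transitions(labeled_records):
        counts = defaultdict(Counter)
        for labels in labeled_records:
            for prev, nxt in zip(labels, labels[1:]):
                counts[prev][nxt] += 1
        return {p: {n: c / sum(cs.values()) for n, c in cs.items()}
                for p, cs in counts.items()}

    # Score a candidate segmentation (a label sequence) under the chain;
    # the most likely segmentation drives the regex induction in step (iii).
    def score(segmentation, trans):
        p = 1.0
        for prev, nxt in zip(segmentation, segmentation[1:]):
            p *= trans.get(prev, {}).get(nxt, 0.0)
        return p

    trans = learn_transitions([["TITLE", "YEAR", "PRICE"]] * 9 +
                              [["TITLE", "PRICE", "YEAR"]])
    print(score(["TITLE", "YEAR", "PRICE"], trans))  # likely segmentation
    print(score(["YEAR", "TITLE", "PRICE"], trans))  # unlikely: scores 0.0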

Slide18

Related work

In this paper, related work was not evaluated in detail.

[1] M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Extraction and integration of partially overlapping web sources. PVLDB, 6(10):805–816, 2013.

[2] L. Chen, S. Ortona, G. Orsi, and M. Benedikt. Aggregating semantic annotators. PVLDB, 6(13):1486–1497, 2013.

[3] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. TEGRA: Table extraction by global record alignment. In SIGMOD, pages 1713–1728. ACM, 2015.

Slide19

Association Rules with Graph Patterns

Wenfei Fan 1,2, Xin Wang 3, Yinghui Wu 4, Jingbo Xu 1,2

1 Univ. of Edinburgh, 2 Beihang Univ., 3 Southwest

Presented by: Zohreh Raghebi

Fall 2015

Slide20

Motivation

We propose graph-pattern association rules (GPARs) for social media marketing.

Extending association rules for itemsets, GPARs help us discover regularities between entities in social graphs.

We study the problem of discovering top-k diversified GPARs.

We also study the problem of identifying potential customers with GPARs.

Slide21

Introduction

A graph-pattern association rule (GPAR) R(x, y) is defined as Q(x, y) ⇒ q(x, y), where Q(x, y) is a graph pattern in which x and y are two designated nodes, and q(x, y) is an edge labeled q from x to y, on which the same search conditions as in Q are imposed.

We refer to Q and q as the antecedent and consequent of R.

We model R(x, y) as a graph pattern PR by extending Q with a (dotted) edge q(x, y). We treat q(x, y) as pattern Pq, and q(x, G) as the set of matches of x in G by Pq.
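To make the definition concrete, a minimal Python sketch of a GPAR over an edge-labeled graph stored as (source, label, target) triples; the graph, the fixed antecedent shape, and all names are illustrative, not from the paper.

    from dataclasses import dataclass

    # Toy graph G as labeled triples.
    G = {("ann", "friend", "bob"), ("bob", "like", "brandX"),
         ("ann", "like", "brandX"), ("cat", "friend", "dan")}

    @dataclass
    class GPAR:
        consequent: str  # label q of the predicted edge q(x, y)

        # Antecedent Q(x, y) hard-coded here: x has a friend who likes y.
        def antecedent_matches(self, g):
            friends = [t for t in g if t[1] == "friend"]
            likes = [t for t in g if t[1] == "like"]
            return {(x, y) for (x, _, z) in friends
                           for (z2, _, y) in likes if z == z2}

    rule = GPAR(consequent="like")
    for x, y in rule.antecedent_matches(G):
        holds = (x, rule.consequent, y) in G  # does q(x, y) hold in G?
        print(f"Q({x}, {y}) matched; q({x}, {y}) holds: {holds}")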

Slide22

DIVERSIFIED RULE DISCOVERY

We are interested in GPARs for a particular event q(x, y). However, this often generates an excessive number of rules, which often pertain to the same or similar people.

This motivates us to study a diversified mining problem, to discover GPARs that are both interesting and diverse.

Problem. Based on the objective function, the diversified GPAR mining problem (DMP) is stated as follows.

Input: A graph G, a predicate q(x, y), a support bound σ, and positive integers k and d.

Output: A set Lk of k nontrivial GPARs pertaining to q(x, y) such that (a) F(Lk) is maximized; and (b) for each GPAR R ∈ Lk, supp(R, G) ≥ σ.
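The objective F(Lk) balances confidence and diversity; a common way to approximate such bi-criteria selection is greedy marginal-gain maximization. A minimal sketch under that assumption (a generic strategy, not the paper's exact algorithm):

    # Greedily build Lk: keep only rules with supp >= sigma (condition (b)),
    # then repeatedly add the rule with the largest marginal gain to F (a).
    def greedy_topk(candidates, k, support, sigma, gain):
        selected = []
        pool = [r for r in candidates if support[r] >= sigma]
        while pool and len(selected) < k:
            best = max(pool, key=lambda r: gain(r, selected))
            selected.append(best)
            pool.remove(best)
        return selected

    # Toy gain: confidence minus a redundancy penalty w.r.t. picked rules.
    conf = {"R1": 0.9, "R2": 0.85, "R3": 0.5}
    overlap = {("R2", "R1"): 0.8}
    gain = lambda r, sel: conf[r] - max((overlap.get((r, s), 0) for s in sel),
                                        default=0)
    print(greedy_topk(["R1", "R2", "R3"], k=2,
                      support={"R1": 5, "R2": 5, "R3": 5},
                      sigma=1, gain=gain))  # picks R1, then diverse R3 over R2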

Slide23

DIVERSIFIED RULE DISCOVERY

DMP is a bi-criteria optimization problem: discover GPARs for a particular event q(x, y) with high support and a balanced confidence and diversity.

In practice, users can freely specify the q(x, y) of interest; proper parameters (e.g., support, confidence, diversity) can be estimated from query logs or recommended by domain experts.

Slide24

IDENTIFYING CUSTOMERS

Consider a set Σ of GPARs pertaining to the same q(x, y), i.e., their consequents are the same event q(x, y).

We define the set of entities identified by Σ in a (social) graph G with confidence η.

Problem. We study the entity identification problem (EIP):

Input: A set Σ of GPARs pertaining to the same q(x, y), a confidence bound η > 0, and a graph G.

Output: Σ(x, G, η), i.e., the potential customers x of y in G identified by at least one GPAR in Σ, with confidence of at least η.
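A minimal sketch of evaluating Σ(x, G, η), assuming a helper matches(rule, G) that returns candidate entities x with the rule's confidence (all names illustrative):

    # Keep every x identified by at least one GPAR with confidence >= eta.
    def identify_customers(rules, G, eta, matches):
        result = set()
        for rule in rules:
            for x, confidence in matches(rule, G).items():
                if confidence >= eta:
                    result.add(x)
        return result

    # Toy usage with hard-coded per-rule match confidences.
    fake = {"R1": {"ann": 0.9, "bob": 0.4}, "R2": {"bob": 0.7}}
    print(identify_customers(["R1", "R2"], G=None, eta=0.6,
                             matches=lambda r, g: fake[r]))  # {'ann', 'bob'}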

Slide25

Keys for Graphs

Wenfei Fan 1,2, Zhe Fan 3, Chao Tian 1,2, Xin Luna Dong 4

1 University of Edinburgh, 2 Beihang University, 3 Hong Kong Baptist University, 4 Google Inc.

{wenfei@inf., chao.tian@}ed.ac.uk, zfan@comp.hkbu.edu.hk, lunadong@google.com

Slide26

Motivation

Keys for graphs aim to uniquely identify entities represented by vertices in a graph.

We propose a class of keys that are recursively defined in terms of graph patterns and are interpreted with subgraph isomorphism.

Extending conventional keys for relations and XML, these keys find applications in object identification, knowledge fusion, and social network reconciliation.

As an application, we study the entity matching problem: given a graph G and a set Σ of keys, find all pairs of entities (vertices) in G that are identified by keys in Σ. We provide two parallel scalable algorithms for entity matching: one in MapReduce and one in a vertex-centric asynchronous model.

Slide27

More details

Entity resolution is the task of identifying records that refer to the same real-world entity.

Keys for graphs yield a deterministic method to provide an invariant connection between vertices and the real-world entities they represent.

The quality of matches identified by keys depends highly on the keys discovered and used, although keys help us reduce false positives.

We defer the topic of key discovery to another paper and focus primarily on the efficiency of applying such constraints.

Slide28

Entity resolution

Finally, we remark that entity resolution is just one of the applications of keys for graphs; others include, e.g., digital citations and knowledge base expansion.

Entity matching is different from record matching, which identifies tuples in relations and does not enforce topological constraints in the matching process.

Slide29

Graph pattern matching

Consider a graph G and an entity e in G.

We say that G matches Q(x) at e if there exist a set S of triples in G and a valuation ν of Q(x) in S such that ν(x) = e and ν is a bijection between Q(x) and S. We refer to S as a match of Q(x) in G at e under ν.

Intuitively, ν is an isomorphism from Q(x) to S when Q(x) and S are depicted as graphs. That is, we adopt subgraph isomorphism for the semantics of graph pattern matching.
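A brute-force illustration of this semantics: try every injective assignment of pattern nodes to graph nodes that fixes ν(x) = e, and check that each pattern edge maps to a triple of G (toy-sized on purpose; the paper's algorithms are far more efficient):

    from itertools import permutations

    def matches_at(pattern_edges, pattern_nodes, x, G_edges, G_nodes, e):
        others = [v for v in pattern_nodes if v != x]
        for assign in permutations(G_nodes, len(others)):
            nu = dict(zip(others, assign), **{x: e})  # valuation, nu(x) = e
            if all((nu[u], lbl, nu[v]) in G_edges
                   for (u, lbl, v) in pattern_edges):
                return nu  # S is the image of the pattern edges under nu
        return None

    G = {("c1", "name_of", "AT&T"), ("c1", "parent_of", "c4")}
    Q = [("x", "name_of", "n"), ("x", "parent_of", "y")]
    print(matches_at(Q, {"x", "n", "y"}, "x", G, {"c1", "c4", "AT&T"}, "c1"))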

Slide30

Examples

Example 4: Consider Q4(x) and G2, and a set S1 of triples in G2: {(com 1, name of, "AT&T"), (com 4, name of, "AT&T"), (com 1, parent of, com 4), (com 3, parent of, com 4)}. Then S1 is a match of Q4(x) in G2 at com 4, which maps variable x to com 4, name∗ to "AT&T", wildcard company to com 1, and company to com 3.

Keys for Graphs: Keys. A key for entities of type τ is a graph pattern Q(x), where x is a designated entity variable of type τ.

We provide two parallel scalable algorithms for entity matching: MapReduce and a vertex-centric asynchronous model.
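A sketch of the MapReduce flavor, assuming a simple blocking signature (the value reached by a name_of edge) stands in for the key pattern; the signature choice and names are illustrative only:

    from collections import defaultdict

    # Map: emit (signature, entity); Reduce: verify candidate pairs per bucket.
    def map_phase(G_edges):
        for (src, lbl, dst) in G_edges:
            if lbl == "name_of":
                yield (dst, src)

    def reduce_phase(pairs, verify):
        buckets = defaultdict(list)
        for k, v in pairs:
            buckets[k].append(v)
        for ents in buckets.values():
            for i, a in enumerate(ents):
                for b in ents[i + 1:]:
                    if verify(a, b):  # full key check on the candidate pair
                        yield (a, b)

    G = {("com1", "name_of", "AT&T"), ("com4", "name_of", "AT&T"),
         ("com3", "name_of", "Verizon")}
    print(list(reduce_phase(map_phase(G), verify=lambda a, b: True)))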

Slide31

Optimization of Common Table Expressions in MPP Database Systems

Amr El-Helw, Venkatesh Raghavan, Mohamed A. Soliman, George Caragea, Zhongxian Gu, Michalis Petropoulos

Pivotal Inc., Palo Alto, CA, USA; Datometry Inc., San Francisco, CA, USA; Amazon Web Services, Palo Alto, CA, USA

Presented by: Zohreh Raghebi

Fall 2015

Slide32

Motivation

Big Data analytics is becoming increasingly common in many business domains, including financial corporations, government agencies, and insurance providers.

Big Data analytics often involves complex queries with similar or identical expressions.

Massively Parallel Processing (MPP) databases address these challenges by distributing storage and query processing across multiple nodes and processes.

Common Table Expressions (CTEs) are commonly used in complex analytical queries that often have many repeated computations. A CTE can be seen as a temporary table that exists just for one query. The purpose of CTEs is to avoid re-execution of expressions referenced more than once within a query. CTEs may be defined explicitly, or generated implicitly by the query optimizer.

Slide33

Background

CTEs follow a producer/consumer model, where the data is produced by the CTE definition and consumed in all the locations where that CTE is referenced.

One possible approach to executing CTEs is to expand (inline) all CTE consumers, rewriting the query internally to replace each reference to the CTE with its definition. This approach simplifies query execution logic, but may incur performance overhead due to executing the same expression multiple times.
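To make the inlining rewrite concrete, a toy plan-tree transformation in Python: every CTE reference is replaced by a copy of the CTE definition, so each consumer re-executes the expression (node shapes are illustrative, not the paper's optimizer):

    import copy

    def inline_ctes(node, cte_defs):
        # Replace a CTERef node by a deep copy of its definition subtree.
        if node[0] == "CTERef":
            return copy.deepcopy(cte_defs[node[1]])
        return (node[0],) + tuple(
            inline_ctes(c, cte_defs) if isinstance(c, tuple) else c
            for c in node[1:])

    cte_defs = {"v": ("Scan", "item")}
    plan = ("Join", ("CTERef", "v"),
            ("Filter", ("CTERef", "v"), "i_color = 'red'"))
    print(inline_ctes(plan, cte_defs))
    # ('Join', ('Scan', 'item'), ('Filter', ('Scan', 'item'), "i_color = 'red'"))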

Slide34

Background

Alternatively, the CTE expression is optimized separately and executed only once; the results are kept in memory, or written to disk if the data does not fit in memory. The data is then read whenever the CTE is referenced.

This approach avoids the cost of repeated execution of the same expression, although it may incur an overhead of disk I/O.

The impact of this approach on query optimization time is rather limited, since the optimizer chooses one plan to be shared by all CTE consumers. However, important optimization opportunities could be missed due to fixing one execution plan for all consumers.

Slide35

Challenges : Deadlock Hazard

MPP systems leverage parallel query execution, where different parts of the query plan execute simultaneously as separate processes, possibly running on different machines.

In some cases, a process has to wait until another process produces the data it needs.

For complicated queries involving multiple CTEs, the optimizer needs to guarantee that no two or more processes can be waiting on each other during query execution.

CTE constructs need to be cleanly abstracted within the query optimization framework to guarantee deadlock-free plans.

Slide36

Enumerating Inlining Alternatives and Contextualized Optimization

The approaches of always inlining CTEs, or never inlining them, can easily be proven sub-optimal.

The query optimizer needs to efficiently enumerate and cost plan alternatives that combine the benefits of these approaches.

CTEs should not be optimized in isolation, without taking into account the context in which they occur; isolated optimization can easily miss several optimization opportunities.

Slide37

1. This approach avoids repeated computation. However, it does not take advantage of the index on i_color.

2. The opposite approach: all occurrences of the CTE are replaced by the expansion of the CTE. This allows the optimizer to utilize the index on i_color; however, it suffers from the repeated computation.

3. Figure 1(c) depicts a possible plan in which one occurrence of the CTE is expanded, allowing the use of the index, while the other two occurrences are not inlined, to avoid recomputing the common expression.

Slide38

Contributions

A novel framework for the optimization of CTEs in MPP database systems. Our framework extends and builds upon our optimizer infrastructure to allow optimization of CTEs within the context where they are used in a query.

A new technique in which a CTE does not get re-optimized for every reference in the query, but only when there are optimization opportunities, e.g., pushing down filters or sort operations. This ensures that the optimization time does not grow exponentially with the number of CTE consumers.

Slide39

Contribution

A cost-based approach for deciding whether or not to expand CTEs in a given query. The cost model takes into account disk I/O as well as the cost of repeated CTE execution.

A query execution model that guarantees that the CTE producer is always executed before the CTE consumer(s). In MPP settings, this is crucial for deadlock-free execution.
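A back-of-the-envelope sketch of the inline-vs-materialize decision; all cost terms are illustrative placeholders, not the paper's cost model:

    # Inlining runs the CTE once per consumer; materializing runs it once
    # and pays a write plus one read per consumer.
    def choose_cte_strategy(exec_cost, n_consumers, write_cost, read_cost):
        inline = exec_cost * n_consumers
        materialize = exec_cost + write_cost + read_cost * n_consumers
        return "inline" if inline <= materialize else "materialize"

    print(choose_cte_strategy(100, n_consumers=1, write_cost=50, read_cost=10))
    # inline
    print(choose_cte_strategy(100, n_consumers=3, write_cost=50, read_cost=10))
    # materialize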

Slide40

Fuzzy Joins in MapReduce: An Experimental Study

Ben Kimmett, Venkatesh Srinivasan, Alex Thomo

University of Victoria, Canada

{blk,srinivas,thomo}@uvic.ca

Presented by: Zohreh Raghebi

Fall 2015

Slide41

Motivation

We report experimental results for the MapReduce algorithms proposed by Afrati, Das Sarma, Menestrina, Parameswaran, and Ullman in ICDE '12 ("Fuzzy Joins Using MapReduce") to compute fuzzy joins of binary strings under Hamming distance.

Their algorithms come with a complete theoretical analysis; however, no experimental evaluation is provided.

Slide42

Methods

Several algorithms have been proposed for performing a "fuzzy join" (an operation that finds pairs of similar items) in MapReduce; the paper concentrates on binary strings and Hamming distance.

The algorithms proposed are:

Naive, which compares every string in the set with every other.

Ball-Hashing, which sends each string to a 'ball' of all 'nearby' strings within a certain similarity.

Slide43

Methods

Anchor Points, a randomized algorithm that selects a set of strings and compares any pair of strings that have a close enough distance to a member of the set.

Splitting, an algorithm that splits the strings into pieces and compares only strings with matching pieces.
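The Splitting idea rests on a pigeonhole argument: if hamming(s, t) ≤ d, then cutting both strings into d + 1 equal pieces leaves at least one piece identical. A single-process sketch standing in for the MapReduce version (names illustrative):

    from collections import defaultdict
    from itertools import combinations

    def hamming(s, t):
        return sum(a != b for a, b in zip(s, t))

    def splitting_join(strings, d):
        pieces = d + 1
        buckets = defaultdict(set)
        for s in strings:  # "map": key each string by (piece index, piece)
            step = len(s) // pieces
            for i in range(pieces):
                buckets[(i, s[i * step:(i + 1) * step])].add(s)
        out = set()
        for group in buckets.values():  # "reduce": verify pairs per bucket
            for s, t in combinations(sorted(group), 2):
                if hamming(s, t) <= d:
                    out.add((s, t))
        return out

    print(splitting_join(["000000", "000001", "111111", "000011"], d=1))
    # {('000000', '000001'), ('000001', '000011')}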

Slide44

Conclusion

It is argued in the paper that there is a tradeoff between communication cost and processing cost, and that there is a skyline of the proposed algorithms; i.e., none dominates another.

One of our objectives is to see whether we can observe this skyline in practical terms.

We observe via experiments that some algorithms are almost always preferable to others:

Splitting is a clear winner.

Ball-Hashing suffers for all distance thresholds except the very small ones.

Slide45

A Natural Language Interface for Querying General and Individual Knowledge

Presented by: Shahab Helmi

Fall 2015

Slide46

Paper Info

Authors:

Yael Amsterdamer, Tel Aviv University

Anna Kukliansky, Tel Aviv University

Tova Milo, Tel Aviv University

Publication: VLDB 2015

Type: Research Paper

Slide47

Motivation

Many real-life scenarios (queries) require the joint analysis of general knowledge, which includes facts about the world, with individual knowledge, which relates to the opinions or habits of individuals.

"What are the most interesting places near Forest Hotel, Buffalo, we should visit in the fall?" Locations and opening hours are general knowledge; which locations are interesting depends on people's opinions or habits.

Existing platforms require users to specify their information needs in a formal, declarative language, which may be too complicated for naive users. Hence, a question in natural language should be translated into a well-formed query.

Slide48

Related Work

The NL-to-query translation problem has been previously studied for queries over general data (knowledge), including SQL/XQuery/SPARQL queries.

Crowdsourcing: asking users to refine the translated query.

NL tools for parsing and detecting the semantics of NL sentences.

Slide49

Challenges

The mix of general and individual knowledge leads to unique challenges:

Distinguishing the individual and general parts of the question (query).

The crowd information regarding the individual part of the NL question may not be in the knowledge base, so most current techniques, which are based on aligning questions to the knowledge base, do not apply.

Integrating the generated queries for the individual and general parts of the question into a well-formed query.

Slide50

Contributions

The modular design of a translation framework, to solve the challenges mentioned in the previous slide.

The development of new modules.

Slide51

Knowledge Representation

Knowledge representation must be expressive enough to account both for general knowledge, to be queried from an ontology, and for individual knowledge, to be collected from the crowd.

RDF: publicly available knowledge bases such as DBPedia and LinkedGeoData.

{Buffalo, NY inside USA}. {Buffalo, NY hasLabel "interesting"}. {I visit Buffalo, NY}.

Slide52

Query Language

The query language to which NL questions are translated should naturally match the knowledge representation.

QASSIS-QL is a query language that extends SPARQL, the RDF query language, with crowd-mining capabilities.

Slide53

NL Processing Tools

Distinguishes the individual and general parts of the question (query) according to the grammatical roles.

Slide54

IX Detector

Dependency Parser: this tool parses a given text into a standard structure called a dependency graph. This structure is a directed graph (typically, a tree) with labels on the edges. It exposes different types of semantic dependencies between the terms of a sentence (the grammatical roles of the words).
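For illustration, a minimal dependency parse with spaCy (an example parser chosen here; it is not necessarily the tool used in the paper): each token receives a labeled edge to its head, forming the dependency graph described above.

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English pipeline
    doc = nlp("What are the most interesting places near Forest Hotel?")
    for tok in doc:
        # token --relation--> head (the word's grammatical role)
        print(f"{tok.text:12} --{tok.dep_}--> {tok.head.text}")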

Slide55

Query Generators

It remains to perform the translation from the NL representation to the query language representation.

Slide56

Missing Parameters

Limit

Threshold

Slide57

Experimental Results

In this experiment, we have arbitrarily chosen the first 500 questions from the Yahoo! Answers repositories.

Slide58

Aggregate Estimations Over Location Based Services

Presented by: Shahab Helmi

Fall 2015

Slide59

Paper Info

Authors:

Weimo Liu, The George Washington University

Md Farhadur Rahman, University of Texas at Arlington

Saravanan Thirumuruganathan, University of Texas at Arlington

Nan Zhang, The George Washington University

Gautam Das, University of Texas at Arlington

Publication: VLDB 2015

Type: Research Paper

Slide60

Introduction

Location-returned services (LR-LBS): these services return the locations of the k returned tuples, e.g., Google Maps.

Location-not-returned services (LNR-LBS): these services do not return the locations of the k tuples, returning other attributes such as ID, ranking, etc., e.g., WeChat and Sina Weibo.

A k-nearest-neighbors (kNN) query returns the k nearest tuples to the query point according to a ranking function (Euclidean distance in this paper).

Slide61

Introduction (2)

LBS with a kNN interface: hidden databases with limited access, usually through a public web query interface or API.

These interfaces impose some constraints:

Query limitation: e.g., 10,000 queries per user per day in Google Maps.

Maximum coverage limit: e.g., results no more than 5 miles away from the query point.

Aggregate estimations: for many applications, it is important to collect aggregate statistics over such hidden databases, such as SUM, COUNT, or distributions of the tuples satisfying certain selection conditions:

A hotel recommendation application would like to know the average review scores for Marriott vs. Hilton hotels in Google Maps.

A cafe chain startup would like to know the number of Starbucks restaurants in a certain geographical region.

A demographics researcher may wish to know the gender ratio of users of social networks in China, etc.

Slide62

Motivation / Goals

Aggregate information can be obtained by entering into data-sharing agreements with the location-based service providers, but this approach can often be extremely expensive, and sometimes impossible if the data owners are unwilling to share their data. Retrieving the whole database through the limited interface would take too long.

Goals:

Approximate estimates of such aggregates by querying the database only via its restrictive public interface.

Minimizing the query cost (i.e., asking as few queries as possible).

Making the aggregate estimations as accurate as possible.

Slide63

Related Work

Analytics and inference over LBS:

Estimating COUNT and SUM aggregates.

Error reduction, such as bias correction.

Aggregate estimations over hidden web repositories:

Unbiased estimators for COUNT and SUM aggregates for static databases.

Efficient techniques to obtain random samples from hidden web databases that can then be utilized to perform aggregate estimation.

Estimating the size of search engines.

Slide64

Contributions

For LR-LBS interfaces: the developed algorithm (LR-LBS-AGG), for estimating COUNT and SUM aggregates, represents a significant improvement over prior work along multiple dimensions: a novel way of precisely calculating Voronoi cells leads to completely unbiased estimations; the top-k returned tuples are leveraged rather than only the top-1; and several innovative techniques are developed for reducing error and increasing efficiency.

For LNR-LBS interfaces: the developed algorithm (LNR-LBS-AGG) addresses a novel problem with no prior work. The algorithm is not bias-free, but the bias can be controlled to any desired precision.

Slide65

Background: Voronoi Diagrams

[Figure panels: Top-1 Voronoi | Top-2 Voronoi]

In a Voronoi diagram, for each point there is a corresponding region consisting of all points closer to that point than to any other.

Slide66

LR-LBS-AGG | LNR-LBS-AGG Algorithms

Precisely compute Voronoi cells [formula omitted in transcript]; Count(*) = [formula omitted in transcript].

Extensions:

Computing Voronoi cells faster

Error reduction
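The transcript omits the formulas, but the standard importance-sampling reading is: a uniform query point q lands in tuple t's top-1 Voronoi cell with probability p(t) = area(V(t)) / area(region), so averaging 1/p(t) over sampled points gives an unbiased COUNT(*) estimate. A hedged Monte Carlo sketch of that reconstruction (cells measured by sampling rather than the paper's exact geometry):

    import random

    random.seed(0)
    tuples = [(random.random(), random.random()) for _ in range(50)]

    def top1(q):  # the kNN interface with k = 1
        return min(tuples, key=lambda t: (t[0] - q[0])**2 + (t[1] - q[1])**2)

    def cell_fraction(t, trials=2000):
        # Monte Carlo stand-in for area(V(t)) / area(region).
        hits = sum(top1((random.random(), random.random())) == t
                   for _ in range(trials))
        return max(hits, 1) / trials  # guard against a zero estimate

    samples = [top1((random.random(), random.random())) for _ in range(30)]
    estimate = sum(1 / cell_fraction(t) for t in samples) / len(samples)
    print(f"true COUNT = {len(tuples)}, estimate = {estimate:.1f}")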

Slide67

Experimental Results

Datasets:

Offline real-world dataset (OpenStreetMap, USA portion): to verify the correctness of the algorithm.

Online LBS demonstrations: to evaluate the efficiency of the algorithm.

Google Maps

WeChat

Sina Weibo