Real Time Analytics: Algorithms and Systems

Type: Tutorial Paper
Authors: Arun Kejriwal (Machine Zone Inc.), Sanjeev Kulkarni, Karthik Ramasamy (Twitter Inc.)
Presented by: Siddhant Kulkarni
Term: Fall 2015

Motivation

In-depth overview of streaming analytics:
Applications
Algorithms
Platforms

Contribution

Description of the various types of data contributing to the field of Big Data:
Social Media
IoT
Healthcare
Machine Data (cloud)
Connected Vehicles

KATARA: Reliable Data Cleaning with Knowledge Bases and Crowdsourcing

Type: Demo Paper
Authors: Xu Chu, John Morcos, Ihab Ilyas, Paolo Papotti, Mourad Ouzzani, Nan Tang (Qatar Computing Research Institute), Yin Ye (Google)
Presented by: Siddhant Kulkarni
Term: Fall 2015

Motivation

Issues with data cleaning
What are the external sources?
Problems with external sources

VINERy: A Visual IDE for Information Extraction

Type: Demonstration Paper
Presented by: Omar Alqahtani
Term: Fall 2015

Authors:
Yunyao Li, IBM Research Almaden
Elmer Kim, Treasure Data, Inc.
Marc A. Touchette, IBM Silicon Valley Lab
Ramiya Venkatachalam, IBM Silicon Valley Lab
Hao Wang, IBM Silicon Valley Lab

Motivation

Extractor development remains a major bottleneck in satisfying the increasing demands of real-world applications based on IE.
Lowering the barrier to entry for extractor development has become a critical requirement.

Related Work

Previous work has focused on reducing the manual effort involved in extractor development.
WizIE is a promising wizard-like environment, but it still requires a non-trivial rule language.
Special-purpose systems.

Contribution

VINERy, a Visual INtegrated Development Environment for information extRaction, consists of:
The foundation of VINERy is VAQL, a visual programming language for information extraction.
VINERy embeds VAQL in a web-based visual IDE for constructing extractors, which are translated into AQL and executed.
VINERy includes a rich set of easily customizable pre-built extractors to help jump-start extractor development.
VINERy provides features to support the entire life cycle of extractor development.

WADaR: Joint Wrapper and Data Repair

Authors: Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, and Tim Furche
Department of Computer Science, Oxford University, United Kingdom
Dipartimento di Matematica, Informatica ed Economia, Università della Basilicata, Italy
marcello.buoncristiano@yahoo.it
Paper Type: Demo
Presented by: Ranjan_KY
Term: Fall 2015

Motivation

Web scraping (or wrapping) is a popular means of acquiring data from the web.
Current-generation tools have made scalable wrapper generation possible, enabling data acquisition processes involving thousands of sources.
No scalable tools exist that support these tasks.

Problem

Modern wrapper-generation systems leverage a number of features, ranging from HTML and visual structure to knowledge bases and micro-data.
Nevertheless, automatically-generated wrappers often suffer from errors, resulting in under- or over-segmented data, together with missing or spurious content.
Under- and over-segmentation of attributes are commonly caused by irregular HTML markup or by multiple attributes occurring within the same DOM node.
Incorrect column types are instead associated with a lack of domain knowledge, supervision, or micro-data during wrapper generation.
The degraded quality of the generated relations argues for means to repair both the data and the corresponding wrapper, so that future wrapper executions produce cleaner data.

Demonstration

WADaR takes as input a (possibly incorrect) wrapper and a target relation schema, and iteratively repairs both the generated relations and the wrapper by observing the output of the wrapper execution.
A key observation is that errors in the extracted relations are likely to be systematic, since wrappers are often generated from templated websites.

WADaR's repair process

(i) Annotating the extracted relations with standard entity recognizers,
(ii) Computing Markov chains describing the most likely segmentation of attribute values in the records, and
(iii) Inducing regular expressions that re-segment the input relation according to the given target schema and that can possibly be encoded back into the wrapper.

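Step (iii) can be illustrated with a hand-written regular expression. The records and target schema below are invented for illustration, and the expression is written manually rather than induced as WADaR would do:

```python
import re

# Hypothetical under-segmented records: three attributes collapsed
# into a single extracted field (title, year, rating).
rows = ["The Matrix 1999 8.7", "Blade Runner 1982 8.1"]

# A regular expression that re-segments each value according to the
# target schema (title: text, year: 4 digits, rating: decimal).
seg = re.compile(r"^(?P<title>.+?)\s+(?P<year>\d{4})\s+(?P<rating>\d+\.\d+)$")

repaired = [seg.match(r).groupdict() for r in rows]
for rec in repaired:
    print(rec["title"], "|", rec["year"], "|", rec["rating"])
```

Such an expression, once induced, can be encoded back into the wrapper so future executions segment the field correctly.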
Related work

In this paper, related work was not evaluated in detail.
[1] M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Extraction and integration of partially overlapping web sources. PVLDB, 6(10):805-816, 2013.
[2] L. Chen, S. Ortona, G. Orsi, and M. Benedikt. Aggregating semantic annotators. PVLDB, 6(13):1486-1497, 2013.
[3] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. Tegra: Table extraction by global record alignment. In SIGMOD, pages 1713-1728. ACM, 2015.

Association Rules with Graph Patterns

Authors: Wenfei Fan (1,2), Xin Wang (3), Yinghui Wu (4), Jingbo Xu (1,2)
Affiliations: 1 Univ. of Edinburgh, 2 Beihang Univ., 3 Southwest
Presented by: Zohreh Raghebi
Term: Fall 2015

Motivation

We propose graph-pattern association rules (GPARs) for social media marketing.
Extending association rules for itemsets, GPARs help us discover regularities between entities in social graphs.
We study the problem of discovering top-k diversified GPARs.
We also study the problem of identifying potential customers with GPARs.

Introduction

A graph-pattern association rule (GPAR) R(x, y) is defined as Q(x, y) ⇒ q(x, y), where Q(x, y) is a graph pattern in which x and y are two designated nodes, and q(x, y) is an edge labeled q from x to y, on which the same search conditions as in Q are imposed.
We refer to Q and q as the antecedent and consequent of R.
We model R(x, y) as a graph pattern PR by extending Q with a (dotted) edge q(x, y). We treat q(x, y) as a pattern Pq, and q(x, G) as the set of matches of x in G by Pq.

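The antecedent/consequent reading of a GPAR can be sketched on a toy graph. The edge labels `friend` and `buy`, the pattern, and the graph below are all made up for illustration; the paper's pattern matching is far richer:

```python
# Toy social graph as (source, label, target) triples.
edges = {
    ("ann", "friend", "bob"), ("bob", "buy", "phone"),
    ("ann", "buy", "phone"),
    ("carl", "friend", "bob"),
}
nodes = {s for s, _, t in edges} | {t for s, _, t in edges}

def matches_Q(x, y):
    # Antecedent Q(x, y): there exists z with friend(x, z) and buy(z, y).
    return any((x, "friend", z) in edges and (z, "buy", y) in edges
               for z in nodes)

def q(x, y):
    # Consequent q(x, y): the edge buy(x, y) is present.
    return (x, "buy", y) in edges

antecedent = {(x, y) for x in nodes for y in nodes if matches_Q(x, y)}
support = {(x, y) for (x, y) in antecedent if q(x, y)}
confidence = len(support) / len(antecedent)
print(sorted(antecedent), sorted(support), confidence)
```

Here the rule "friends of a buyer of y tend to buy y" matches two (x, y) pairs, of which one also satisfies the consequent, giving confidence 0.5.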
DIVERSIFIED RULE DISCOVERY

We are interested in GPARs for a particular event q(x, y).
However, this often generates an excessive number of rules, which often pertain to the same or similar people.
This motivates us to study a diversified mining problem: discover GPARs that are both interesting and diverse.
Problem. Based on the objective function, the diversified GPAR mining problem (DMP) is stated as follows.
Input: A graph G, a predicate q(x, y), a support bound σ, and positive integers k and d.
Output: A set Lk of k nontrivial GPARs pertaining to q(x, y) such that (a) F(Lk) is maximized; and (b) for each GPAR R ∈ Lk, supp(R, G) ≥ σ.

DIVERSIFIED RULE DISCOVERY

DMP is a bi-criteria optimization problem: discover GPARs for a particular event q(x, y) with high support and a balanced confidence and diversity.
In practice, users can freely specify the q(x, y) of interest; proper parameters (e.g., support, confidence, diversity) can be estimated from query logs or recommended by domain experts.

IDENTIFYING CUSTOMERS

Consider a set Σ of GPARs pertaining to the same q(x, y), i.e., their consequents are the same event q(x, y).
We define the set of entities identified by Σ in a (social) graph G with confidence η.
Problem. We study the entity identification problem (EIP):
Input: A set Σ of GPARs pertaining to the same q(x, y), a confidence bound η > 0, and a graph G.
Output: Σ(x, G, η), i.e., all potential customers x of y in G identified by at least one GPAR in Σ, with confidence of at least η.

Keys for Graphs

Authors: Wenfei Fan (1,2), Zhe Fan (3), Chao Tian (1,2), Xin Luna Dong (4)
Affiliations: 1 University of Edinburgh, 2 Beihang University, 3 Hong Kong Baptist University, 4 Google Inc.
{wenfei@inf., chao.tian@}ed.ac.uk, zfan@comp.hkbu.edu.hk, lunadong@google.com

Motivation

Keys for graphs aim to uniquely identify entities represented by vertices in a graph.
We propose a class of keys that are recursively defined in terms of graph patterns and are interpreted with subgraph isomorphism.
Extending conventional keys for relations and XML, these keys find applications in object identification, knowledge fusion, and social network reconciliation.
As an application, we study the entity matching problem: given a graph G and a set Σ of keys, find all pairs of entities (vertices) in G that are identified by keys in Σ.
We provide two parallel scalable algorithms for entity matching: one in MapReduce and one in a vertex-centric asynchronous model.

More details

Entity resolution is the task of identifying records that refer to the same real-world entity.
Keys for graphs yield a deterministic method that provides an invariant connection between vertices and the real-world entities they represent.
The quality of matches identified by keys depends heavily on the keys discovered and used, although keys help us reduce false positives.
We defer the topic of key discovery to another paper and focus primarily on the efficiency of applying such constraints.

Entity resolution

Finally, we remark that entity resolution is just one of the applications of keys for graphs; others include, e.g., digital citations and knowledge base expansion.
Entity matching is different from record matching, which identifies tuples in relations and does not enforce topological constraints in the matching process.

Graph pattern matching

Consider a graph G and an entity e in G. We say that G matches Q(x) at e if there exist a set S of triples in G and a valuation ν of Q(x) in S such that ν(x) = e and ν is a bijection between Q(x) and S. We refer to S as a match of Q(x) in G at e under ν.
Intuitively, ν is an isomorphism from Q(x) to S when Q(x) and S are depicted as graphs.
That is, we adopt subgraph isomorphism for the semantics of graph pattern matching.

Examples

Example 4: Consider Q(x) and G, and a set S1 of triples in G2:
{(com 1, name of, "AT&T"), (com 4, name of, "AT&T"), (com 1, parent of, com 4), (com 3, parent of, com 4)}.
Then S1 is a match of Q4(x) in G2 at com 4, which maps variable x to com 4, name* to "AT&T", wildcard company to com 1, and company to com 3.

Keys for Graphs

Keys. A key for entities of type τ is a graph pattern Q(x), where x is a designated entity variable of type τ.
We provide two parallel scalable algorithms for entity matching: one in MapReduce and one in a vertex-centric asynchronous model.

Optimization of Common Table Expressions in MPP Database Systems

Authors: Amr El-Helw (*), Venkatesh Raghavan (*), Mohamed A. Soliman (*), George Caragea (*), Zhongxian Gu (†), Michalis Petropoulos (‡)
Affiliations: * Pivotal Inc., Palo Alto, CA, USA; † Datometry Inc., San Francisco, CA, USA; ‡ Amazon Web Services, Palo Alto, CA, USA
Presented by: Zohreh Raghebi
Term: Fall 2015

Motivation

Big Data analytics is becoming increasingly common in many business domains, including financial corporations, government agencies, and insurance providers.
Big Data analytics often involve complex queries with similar or identical expressions.
Massively Parallel Processing (MPP) databases address these challenges by distributing storage and query processing across multiple nodes and processes.
Common Table Expressions (CTEs) are commonly used in complex analytical queries that often have many repeated computations. A CTE can be seen as a temporary table that exists just for one query. The purpose of CTEs is to avoid re-execution of expressions referenced more than once within a query. CTEs may be defined explicitly, or generated implicitly by the query optimizer.

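As a minimal single-node illustration (not the paper's MPP setting), a CTE referenced more than once can be run in SQLite from Python; the `sales` table and its values are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INT)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 10), ("east", 30), ("west", 25)])

# The CTE `totals` is referenced twice below; an optimizer may either
# inline each reference or materialize the result once and share it.
rows = con.execute("""
    WITH totals AS (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    )
    SELECT t1.region, t1.total
    FROM totals t1
    WHERE t1.total >= (SELECT AVG(total) FROM totals)
""").fetchall()
print(rows)  # regions whose total is at least the average total
```

The inline-vs-materialize decision discussed in the following slides is exactly about how an engine executes the two references to `totals`.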
Background

CTEs follow a producer/consumer model, where the data is produced by the CTE definition and consumed in all the locations where that CTE is referenced.
One possible approach to executing CTEs is to expand (inline) all CTE consumers, rewriting the query internally to replace each reference to the CTE. This approach simplifies query execution logic, but may incur performance overhead due to executing the same expression multiple times.

Background

In the alternative approach, the CTE expression is separately optimized and executed only once; the results are kept in memory, or written to disk if the data does not fit in memory.
The data is then read whenever the CTE is referenced.
This approach avoids the cost of repeatedly executing the same expression, although it may incur an overhead of disk I/O.
The impact of this approach on query optimization time is rather limited, since the optimizer chooses one plan to be shared by all CTE consumers. However, important optimization opportunities could be missed due to fixing one execution plan for all consumers.

Challenges: Deadlock Hazard

MPP systems leverage parallel query execution, where different parts of the query plan execute simultaneously as separate processes, possibly running on different machines.
In some cases, a process has to wait until another process produces the data it needs.
For complicated queries involving multiple CTEs, the optimizer needs to guarantee that no two or more processes could be waiting on each other during query execution.
CTE constructs need to be cleanly abstracted within the query optimization framework to guarantee a deadlock-free plan.

Enumerating Inlining Alternatives and Contextualized Optimization

The approaches of always inlining CTEs, or never inlining CTEs, can easily be proven to be sub-optimal.
The query optimizer needs to efficiently enumerate and cost plan alternatives that combine the benefits of these approaches.
CTEs should not be optimized in isolation, without taking into account the context in which they occur.
Isolated optimization can easily miss several optimization opportunities.

1. This approach avoids repeated computation. However, it does not take advantage of the index on i_color.
2. The opposite approach: all occurrences of the CTE are replaced by the expansion of the CTE. This allows the optimizer to utilize the index on i_color; however, it suffers from repeated computation.
3. Figure 1(c) depicts a possible plan in which one occurrence of the CTE is expanded, allowing the use of the index, while the other two occurrences are not inlined, to avoid recomputing the common expression.

Contributions

A novel framework for the optimization of CTEs in MPP database systems.
Our framework extends and builds upon our optimizer infrastructure to allow optimization of CTEs within the context where they are used in a query.
A new technique in which a CTE does not get re-optimized for every reference in the query, but only when there are optimization opportunities, e.g., pushing down filters or sort operations. This ensures that the optimization time does not grow exponentially with the number of CTE consumers.

Contribution

A cost-based approach for deciding whether or not to expand CTEs in a given query. The cost model takes into account disk I/O as well as the cost of repeated CTE execution.
A query execution model that guarantees that the CTE producer is always executed before the CTE consumer(s). In MPP settings, this is crucial for deadlock-free execution.

Fuzzy Joins in MapReduce: An Experimental Study

Authors: Ben Kimmett, Venkatesh Srinivasan, Alex Thomo
University of Victoria, Canada
{blk,srinivas,thomo}@uvic.ca
Presented by: Zohreh Raghebi
Term: Fall 2015

Motivation

We report experimental results for the MapReduce algorithms proposed by Afrati, Das Sarma, Menestrina, Parameswaran, and Ullman in ICDE'12 ("Fuzzy joins using MapReduce") to compute fuzzy joins of binary strings using Hamming distance.
Their algorithms come with a complete theoretical analysis; however, no experimental evaluation is provided.

Methods

Several algorithms have been proposed for performing a "fuzzy join" (an operation that finds pairs of similar items) in MapReduce; this work concentrates on binary strings and Hamming distance.
The algorithms proposed are:
Naive, which compares every string in the set with every other.
Ball-Hashing, which sends each string to a "ball" of all nearby strings within a certain similarity.

Methods

Anchor Points, a randomized algorithm that selects a set of strings and compares any pair of strings that have a close enough distance to a member of the set.
Splitting, an algorithm that splits the strings into pieces and compares only strings with matching pieces.

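The Splitting idea can be sketched for a self-join of binary strings under Hamming distance. This is a simplified single-machine sketch, assuming d+1 pieces and a final verification pass; the MapReduce formulation (mappers keyed by piece, reducers comparing within buckets) is omitted:

```python
from itertools import combinations

def hamming(a, b):
    # Hamming distance of two equal-length binary strings.
    return sum(x != y for x, y in zip(a, b))

def split_pieces(s, k):
    # Split s into k nearly equal pieces; by the pigeonhole principle,
    # two strings within Hamming distance d < k agree on some piece.
    n = len(s)
    bounds = [n * i // k for i in range(k + 1)]
    return [s[bounds[i]:bounds[i + 1]] for i in range(k)]

def splitting_join(strings, d):
    # Bucket strings by (piece position, piece value); candidates are
    # pairs sharing an identical piece at the same position.
    buckets = {}
    for s in strings:
        for i, p in enumerate(split_pieces(s, d + 1)):
            buckets.setdefault((i, p), set()).add(s)
    candidates = set()
    for group in buckets.values():
        candidates.update(frozenset(p) for p in combinations(sorted(group), 2))
    # Verify candidates against the actual distance threshold.
    return {tuple(sorted(p)) for p in candidates
            if hamming(*sorted(p)) <= d}

strings = ["0000", "0001", "0011", "1111"]
print(splitting_join(strings, 1))  # pairs within Hamming distance 1
```

Only strings sharing a piece are compared, which is why Splitting avoids the quadratic comparison cost of the Naive algorithm.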
Conclusion

It is argued that there is a tradeoff between communication cost and processing cost, and that there is a skyline of the proposed algorithms, i.e., none dominates another.
One of our objectives is to see whether we can observe this skyline in practical terms.
We observe via experiments that some algorithms are almost always preferable to others:
Splitting is a clear winner.
Ball-Hashing suffers for all distance thresholds except the very small ones.

A Natural Language Interface for Querying General and Individual Knowledge

Presented by: Shahab Helmi
Term: Fall 2015

Paper Info

Authors:
Yael Amsterdamer, Tel Aviv University
Anna Kukliansky, Tel Aviv University
Tova Milo, Tel Aviv University
Publication: VLDB 2015
Type: Research Paper

Motivation

Many real-life scenarios (queries) require the joint analysis of general knowledge, which includes facts about the world, with individual knowledge, which relates to the opinions or habits of individuals.
"What are the most interesting places near Forest Hotel, Buffalo, we should visit in the fall?"
General: locations, opening hours. Individual: which locations are interesting depends on people's opinions or habits.
Existing platforms require users to specify their information needs in a formal, declarative language, which may be too complicated for naive users.
Hence, a question in natural language should be translated into a well-formed query.

Related Work

The NL-to-query translation problem has been previously studied for queries over general data (knowledge), including SQL/XQuery/SPARQL queries.
Crowdsourcing: asking users to refine the translated query.
NL tools for parsing and detecting the semantics of NL sentences.

Challenges

The mix of general and individual knowledge leads to unique challenges:
Distinguishing the individual and general parts of the question (query).
The crowd information regarding the individual part of the NL question may not be in the knowledge base, so most current techniques, which are based on aligning questions to the knowledge base, do not apply.
Integrating the generated queries for the individual and general parts of the question into a well-formed query.

Contributions

The modular design of a translation framework, to solve the challenges mentioned in the previous slide.
The development of new modules.

Knowledge Representation

Knowledge representation must be expressive enough:
To account for general knowledge, to be queried from an ontology.
For individual knowledge, to be collected from the crowd.
RDF: publicly available knowledge bases such as DBPedia and LinkedGeoData.
Example triples: {Buffalo, NY inside USA}. {Buffalo, NY has Label "interesting"}. {I visit Buffalo, NY}.

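The RDF-style triples above can be mimicked with a minimal, hypothetical pattern matcher over Python tuples. This is not the paper's QASSIS-QL machinery, only an illustration of how triple patterns with variables (here prefixed with "?") bind against a triple store:

```python
# The triples follow the slide's examples.
triples = [
    ("Buffalo, NY", "inside", "USA"),
    ("Buffalo, NY", "has Label", "interesting"),
    ("I", "visit", "Buffalo, NY"),
]

def match(pattern):
    # Return one binding dict per triple that fits the pattern;
    # pattern positions starting with "?" are variables.
    results = []
    for t in triples:
        if all(p.startswith("?") or p == v for p, v in zip(pattern, t)):
            results.append({p: v for p, v in zip(pattern, t)
                            if p.startswith("?")})
    return results

# "Which places are labeled interesting?"
print(match(("?place", "has Label", "interesting")))
```

SPARQL (and its extension QASSIS-QL) generalizes this idea to conjunctions of such patterns over RDF graphs.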
Query Language

The query language to which NL questions are translated should naturally match the knowledge representation.
QASSIS-QL is a query language that extends SPARQL, the RDF query language, with crowd mining capabilities.

NL Processing Tools

Distinguish the individual and general parts of the question (query) according to the grammatical roles.

IX Detector

Dependency Parser: this tool parses a given text into a standard structure called a dependency graph. This structure is a directed graph (typically, a tree) with labels on the edges. It exposes different types of semantic dependencies between the terms of a sentence (the grammatical roles of the words).

Query Generators

It remains to perform the translation from the NL representation to the query language representation.

Missing Parameters

Limit
Threshold

Experimental Results

In this experiment, we arbitrarily chose the first 500 questions from the Yahoo! Answers repositories.

Aggregate Estimations Over Location Based Services

Presented by: Shahab Helmi
Term: Fall 2015

Paper Info

Authors:
Weimo Liu, The George Washington University
Md Farhadur Rahman, University of Texas at Arlington
Saravanan Thirumuruganathan, University of Texas at Arlington
Nan Zhang, The George Washington University
Gautam Das, University of Texas at Arlington
Publication: VLDB 2015
Type: Research Paper

Introduction

Location-returned services (LR-LBS): these services return the locations of the k returned tuples. Example: Google Maps.
Location-not-returned services (LNR-LBS): these services do not return the locations of the k tuples, but instead return other attributes such as ID, ranking, etc. Examples: WeChat, Sina Weibo.
A k-nearest-neighbors (kNN) query returns the k tuples nearest to the query point according to a ranking function (Euclidean distance in this paper).

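A minimal sketch of such a kNN interface, over made-up (id, latitude, longitude) tuples ranked by Euclidean distance to the query point:

```python
import math

# Hypothetical hidden database of located tuples.
tuples = [("a", 0.0, 0.0), ("b", 1.0, 1.0), ("c", 0.5, 0.0), ("d", 3.0, 4.0)]

def knn(q, k):
    # Return the k tuples nearest to query point q = (lat, lon).
    return sorted(tuples, key=lambda t: math.dist(q, (t[1], t[2])))[:k]

print([t[0] for t in knn((0.0, 0.0), 2)])  # ids of the two nearest tuples
```

The paper's setting is exactly this interface, except that the caller sees only the top-k answers (and, for LNR-LBS, not even their locations).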
Introduction (2)

LBS with a kNN interface: hidden databases with limited access, usually through a public web query interface or API.
These interfaces impose some constraints:
Query limitation: e.g., 10,000 queries per user per day in Google Maps.
Maximum coverage limit: e.g., results at most 5 miles away from the query point.
Aggregate estimations: for many applications, it is important to collect aggregate statistics over such hidden databases, such as SUM, COUNT, or distributions of the tuples satisfying certain selection conditions. For example:
A hotel recommendation application would like to know the average review scores for Marriott vs. Hilton hotels in Google Maps.
A cafe chain startup would like to know the number of Starbucks restaurants in a certain geographical region.
A demographics researcher may wish to know the gender ratio of users of social networks in China.

Motivation / Goals

Aggregate information can be obtained by:
Entering into data-sharing agreements with the location-based service providers, but this approach can often be extremely expensive, and sometimes impossible if the data owners are unwilling to share their data.
Crawling the whole database through the limited interface, which would take too long.
Goals:
Approximate estimates of such aggregates by only querying the database via its restrictive public interface.
Minimizing the query cost (i.e., asking as few queries as possible).
Making the aggregate estimations as accurate as possible.

Related Work

Analytics and inference over LBS:
Estimating COUNT and SUM aggregates.
Error reduction, such as bias correction.
Aggregate estimations over hidden web repositories:
Unbiased estimators for COUNT and SUM aggregates over static databases.
Efficient techniques to obtain random samples from hidden web databases that can then be utilized to perform aggregate estimation.
Estimating the size of search engines.

Contributions

For LR-LBS interfaces: the developed algorithm (LR-LBS-AGG) for estimating COUNT and SUM aggregates represents a significant improvement over prior work along multiple dimensions: a novel way of precisely calculating Voronoi cells leads to completely unbiased estimations; the top-k returned tuples are leveraged rather than only the top-1; and several innovative techniques are developed for reducing error and increasing efficiency.
For LNR-LBS interfaces: the developed algorithm (LNR-LBS-AGG) addresses a novel problem with no prior work. The algorithm is not bias-free, but the bias can be controlled to any desired precision.

Background: Voronoi Diagrams

In a Voronoi diagram, each point has a corresponding region consisting of all locations closer to that point than to any other.
(Figures: top-1 Voronoi diagram; top-2 Voronoi diagram.)

LR-LBS-AGG | LNR-LBS-AGG Algorithms

Precisely compute Voronoi cells: for a uniformly chosen query location, the probability that a tuple is returned as the top-1 answer equals the fraction of the region covered by its Voronoi cell, which yields an unbiased estimator for COUNT(*).
Extensions:
Computing Voronoi cells faster.
Error reduction.

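A toy illustration of the Voronoi-cell idea behind these estimators: if each tuple's probability of being the top-1 answer at a uniformly random query location is the area of its Voronoi cell, then averaging the inverse of that probability over sampled queries estimates COUNT(*). The point set is invented, and cell areas are approximated on a grid rather than computed precisely as in the paper:

```python
import random

# Hypothetical located tuples inside the unit square.
points = [(0.2, 0.3), (0.7, 0.8), (0.9, 0.1), (0.4, 0.6)]

def nearest(q):
    # The top-1 answer of a kNN interface at query location q.
    return min(points, key=lambda p: (p[0] - q[0])**2 + (p[1] - q[1])**2)

# p(t): fraction of the unit square whose top-1 answer is t,
# i.e. the (approximate) area of t's Voronoi cell.
grid = [((i + 0.5) / 50, (j + 0.5) / 50) for i in range(50) for j in range(50)]
area = {}
for g in grid:
    t = nearest(g)
    area[t] = area.get(t, 0) + 1 / len(grid)

# Estimator: issue top-1 queries at uniformly random locations and
# average 1/p(t) over the returned tuples t; its expectation is the
# number of tuples, since E[1/p(t)] = sum_t p(t) * (1/p(t)) = COUNT(*).
random.seed(0)
samples = [nearest((random.random(), random.random())) for _ in range(200)]
estimate = sum(1 / area[t] for t in samples) / len(samples)
print(round(estimate, 1))  # close to len(points) == 4
```

The paper computes the cell areas exactly (hence unbiased estimates) and extends the idea to top-k answers and to SUM aggregates.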
Experimental Results

Datasets:
Offline real-world dataset (OpenStreetMap, USA portion): to verify the correctness of the algorithm.
Online LBS demonstrations (Google Maps, WeChat, Sina Weibo): to evaluate the efficiency of the algorithm.