
Slide1

Designing a Scalable Data Cleaning Infrastructure

Daniel Haas
In collaboration with: Sanjay Krishnan, Jiannan Wang, Juan Sanchez, Wenbo Tao, Eugene Wu, Ken Goldberg, Mike Franklin

Slide2

Outline

What we think matters for data cleaning
Our system design
Releases/opportunities for collaboration

Slide3

Outline

What we think matters for data cleaning
Our system design
Releases/opportunities for collaboration

Slide4

An Example Cleaning Lifecycle

Goal: extract addresses from a dataset of webpages


Slide5

An Example Cleaning Lifecycle

Goal: extract addresses from a dataset of webpages
First: try simple rules on a sample
Works great!

[Diagram, step 1: webpages, Sample, Rule: Extract address, Count(*)]

Slide6

An Example Cleaning Lifecycle

Goal: extract addresses from a dataset of webpages
Next: apply rules to whole data
Lots of errors, feel sad

[Diagram, step 2: webpages, Rule: Extract address]

Slide7

An Example Cleaning Lifecycle

Goal: extract addresses from a dataset of webpages
So, try the crowd!
Great results
Lots of engineering
Very slow

[Diagram, step 3: webpages, Crowd: Extract address]

Slide8

An Example Cleaning Lifecycle

Goal: extract addresses from a dataset of webpages
Finally, settle on a hybrid approach.
Rules for simple cases
Crowds for hard cases
ML to make crowds scale

[Diagram, step 4: webpages, Rule: Extract address, Crowd + Active Learning: Extract address]
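The hybrid approach can be illustrated with a minimal sketch: a rule handles the pages it matches, and everything else is queued for the crowd. The regular expression, the function names, and the sample pages are all hypothetical and are not taken from the talk. The active-learning step that makes the crowd scale is not shown.

```python
import re

# Hypothetical rule: a US-style street address such as "1600 Amphitheatre Parkway".
ADDRESS_RULE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+)+(?:Street|St|Avenue|Ave|Road|Rd|Parkway|Blvd)\b"
)


def extract_with_rule(page_text):
    """Return the first rule match, or None if the rule does not fire."""
    match = ADDRESS_RULE.search(page_text)
    return match.group(0) if match else None


def route(pages):
    """Split pages into rule-extracted results and a queue of hard cases."""
    extracted, crowd_queue = {}, []
    for page_id, text in pages.items():
        address = extract_with_rule(text)
        if address is None:
            crowd_queue.append(page_id)   # hard case: defer to human workers
        else:
            extracted[page_id] = address  # simple case: the rule suffices
    return extracted, crowd_queue


# Illustrative input only.
pages = {
    "p1": "Visit us at 1600 Amphitheatre Parkway today.",
    "p2": "We are next to the old mill, by the river.",
}
extracted, crowd_queue = route(pages)
```

Here the rule fires on `p1`, while `p2` has no matching pattern and lands in the crowd queue.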

Slide9

How to make the lifecycle easier?

General, composable operators
Support for iteration on workflows
Optimization for workflow search
Integrated tools for crowdsourcing

Slide10

Outline

What we think matters for data cleaning
Our system design
Releases/opportunities for collaboration

Slide11

“Our System”


Slide12

General, composable operators

Logical Operators: Sampling, Similarity Join, Filtering, Extraction
Physical Operators: Rule-based, Learning-based, Crowd-based
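One way to picture the split between logical and physical operators is the sketch below. A logical operator names a step, a physical operator is whichever implementation is currently bound to it, and steps compose into a pipeline. The class and function names are hypothetical and do not reflect the system's actual API (which the slides describe as Scala-based).

```python
from typing import Callable, List

Record = dict
# A physical operator is any callable from a list of records to a list of records.
PhysicalOp = Callable[[List[Record]], List[Record]]


class LogicalOperator:
    """A named logical step bound to one interchangeable physical implementation."""

    def __init__(self, name: str, impl: PhysicalOp):
        self.name = name
        self.impl = impl

    def __call__(self, records: List[Record]) -> List[Record]:
        return self.impl(records)


def compose(*ops: LogicalOperator) -> PhysicalOp:
    """Chain operators so that each one consumes the previous one's output."""
    def pipeline(records: List[Record]) -> List[Record]:
        for op in ops:
            records = op(records)
        return records
    return pipeline


def rule_based_filter(records: List[Record]) -> List[Record]:
    """Hypothetical rule-based physical operator: drop records with empty text."""
    return [r for r in records if r.get("text")]


def take_first_two(records: List[Record]) -> List[Record]:
    """Hypothetical stand-in for sampling: keep the first two records."""
    return records[:2]


workflow = compose(
    LogicalOperator("Filtering", rule_based_filter),
    LogicalOperator("Sampling", take_first_two),
)
result = workflow([{"text": "a"}, {"text": ""}, {"text": "b"}, {"text": "c"}])
```

Replacing `rule_based_filter` with a learning-based or crowd-based callable would leave the rest of the pipeline unchanged, which is the point of the separation.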

Slide13

Support for iteration

Observation: Cleaning workflows require many changes to work well
Solution: “Hot-swapping”, which:
  Can modify in-flight logical operators
  Uses caching and lineage to avoid re-computing intermediate results
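The caching-and-lineage idea can be sketched as follows. Intermediate results are cached under the chain of step names that produced them, so swapping a later step reuses everything upstream. This is a simplified illustration under stated assumptions: the input data is fixed, step names uniquely identify implementations, and the swap happens between runs and not mid-flight. The class name and methods are hypothetical.

```python
class CachedWorkflow:
    """Re-runs only the steps whose lineage has not been computed before."""

    def __init__(self, steps):
        # steps: list of (name, function) pairs, applied in order
        self.steps = list(steps)
        self.cache = {}   # lineage (tuple of step names) -> intermediate result
        self.calls = []   # names of steps actually executed, for inspection

    def swap(self, index, name, fn):
        """Replace one step; upstream cache entries remain valid."""
        self.steps[index] = (name, fn)

    def run(self, data):
        # Assumption: `data` is the same on every run, so lineage alone is a valid key.
        lineage = ()
        result = data
        for name, fn in self.steps:
            lineage = lineage + (name,)
            if lineage not in self.cache:
                self.calls.append(name)
                self.cache[lineage] = fn(result)
            result = self.cache[lineage]
        return result


rows = [" A ", "B "]
wf = CachedWorkflow([
    ("lowercase", lambda rs: [r.lower() for r in rs]),
    ("strip", lambda rs: [r.strip() for r in rs]),
])
first = wf.run(rows)

# Swap the second step; "lowercase" is served from the cache on the next run.
wf.swap(1, "collapse_spaces", lambda rs: [" ".join(r.split()) for r in rs])
second = wf.run(rows)
```

After both runs, `wf.calls` shows that `lowercase` executed only once even though the workflow ran twice.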

Slide14

Optimization for workflow search

Observation: Data scientists tweak workflows using heuristics and intuition
Solution: An eval operator, which:
  Gathers ground truth
  Estimates the cost / quality of a workflow
  Recommends changes to improve quality / decrease cost
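A rough sketch of the "estimates the cost / quality" step, assuming a small labeled sample is already available as ground truth. Quality is measured here as precision and recall of the extracted values, and cost as a flat per-record price in cents. Both choices, and all names and numbers, are illustrative assumptions and not the talk's definition of the eval operator. The recommendation step is not shown.

```python
def evaluate(predictions, ground_truth, cost_per_record_cents):
    """Estimate precision, recall, and cost of a workflow on a labeled sample.

    predictions / ground_truth: dict of record id -> extracted value, or None
    when nothing was (or should be) extracted.
    """
    emitted = correct = expected = 0
    for record_id, truth in ground_truth.items():
        guess = predictions.get(record_id)
        if truth is not None:
            expected += 1
        if guess is not None:
            emitted += 1
            if guess == truth:
                correct += 1
    return {
        "precision": correct / emitted if emitted else 0.0,
        "recall": correct / expected if expected else 0.0,
        "cost_cents": cost_per_record_cents * len(predictions),
    }


# Illustrative data only.
ground_truth = {"p1": "1 Main St", "p2": "2 Oak Ave", "p3": None}
rule_output = {"p1": "1 Main St", "p2": None, "p3": "9 Elm Rd"}
crowd_output = {"p1": "1 Main St", "p2": "2 Oak Ave", "p3": None}

rule_report = evaluate(rule_output, ground_truth, cost_per_record_cents=0)
crowd_report = evaluate(crowd_output, ground_truth, cost_per_record_cents=5)
```

Comparing the two reports makes the tradeoff explicit: the rule is free but less accurate on this sample, while the crowd is accurate but costs money per record.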

Slide15

Integrated crowdsourcing

Observation: Many cleaning operations require human guidance but need to scale
Solution: AMPCrowd, a standalone web service with:
  Support for MTurk or an internal crowd
  Built-in quality control (voting, EM)
  Extensibility to new task interfaces, new crowd platforms
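Voting-based quality control can be illustrated with a generic majority vote over redundant worker answers. This is a standalone sketch and not AMPCrowd's code or API. The function name and the tie-handling policy are assumptions, and the EM-based alternative is not shown.

```python
from collections import Counter


def majority_vote(responses):
    """Aggregate worker answers per task.

    responses: dict of task id -> list of answers from different workers.
    Returns a dict of task id -> winning answer, or None when there is no
    answer or the top two answers tie (such tasks could be re-assigned).
    """
    consensus = {}
    for task_id, answers in responses.items():
        ranked = Counter(answers).most_common(2)
        if not ranked:
            consensus[task_id] = None
        elif len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            consensus[task_id] = None
        else:
            consensus[task_id] = ranked[0][0]
    return consensus


# Illustrative answers only.
votes = majority_vote({
    "t1": ["12 Elm St", "12 Elm St", "12 Elm Street"],
    "t2": ["5 Oak Ave", "7 Oak Ave"],
    "t3": [],
})
```

Task `t1` resolves to the answer given by two of three workers, while `t2` (a tie) and `t3` (no answers) are left unresolved.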

Slide16

Summary:

Operators: logical, physical, composable
Iteration: hot-swapping mid-flight
Optimization: the eval operator
Crowdsourcing: the AMPCrowd platform

Slide17

Outline

What we think matters for data cleaning
Our system design
Releases/opportunities for collaboration

Slide18

Initial System Release

Built on the BDAS stack (Scala)
Apache licensed
Release within the next month!

Slide19

AMPCrowd Release

amplab.github.io/ampcrowd
Python/Django/PostgreSQL
Apache licensed

Slide20

[Architecture diagram. Components: User, Planning UI, DSL Compiler, Optimizer, Rec. Engine, SAQP, Hot Swapper, Data Cleaning Plan Executor, Crowd Manager, Cleaning UI, Crowd, Lineage and Storage. Edge labels: Queries & Results, Swap Cmds, Swap Recs, Cleaning Tasks.]

Slide21

Questions for you

For discussion now:
  How do you handle dirty data?
  Would our system be useful?
  … and many more
Take our survey! Goals:
  Inform our system design
  Publish our findings

Slide22

Questions for us?

Thanks!
{dhaas, sanjay, jnwang}@cs.berkeley.edu
ewu@cs.columbia.edu
sampleclean.org

Slide23

SAQP: Tradeoff Between Accuracy and Cleaning

[Plot. Axes: Query Error, Sample Size. Series: No Cleaning, SampleClean, BlinkDB]

SIGMOD 2014. SampleClean: Fast and Accurate Query Processing on Dirty Data
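The tradeoff on this slide can be made concrete with a small sketch in the spirit of sample-then-clean query processing: draw a random sample, clean only the sample, and answer an AVG query from the cleaned sample with a normal-approximation confidence interval. This is an illustration under stated assumptions and not the estimator from the SampleClean paper. The function name, the cleaning rule, and the data are all hypothetical.

```python
import math
import random
import statistics


def estimate_mean_from_cleaned_sample(dirty_values, clean, sample_size, seed=0):
    """Estimate the mean of the clean data by cleaning only a random sample.

    Returns (estimate, half_width), where half_width is the half-width of an
    approximate 95% confidence interval based on the normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(dirty_values, sample_size)
    cleaned = [clean(v) for v in sample]
    estimate = statistics.mean(cleaned)
    half_width = 1.96 * statistics.stdev(cleaned) / math.sqrt(sample_size)
    return estimate, half_width


# Illustrative data: true values lie between 8 and 12, but every tenth
# value was recorded 100 times too large.
dirty_values = [float(8 + i % 5) for i in range(100)]
for i in range(0, 100, 10):
    dirty_values[i] *= 100


def fix_scale_error(value):
    """Hypothetical cleaning rule: undo the 100x recording error."""
    return value / 100 if value > 100 else value


dirty_mean = statistics.mean(dirty_values)
estimate, half_width = estimate_mean_from_cleaned_sample(
    dirty_values, fix_scale_error, sample_size=20
)
```

The uncleaned mean is dominated by the corrupted values, whereas the estimate from the cleaned sample stays within the range of the true values. A larger sample narrows the interval but requires cleaning more records.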

Slide24

Broad View of Data Cleaning

[Diagram labels: Base Data, Updates, Materialized View, Sample, View, Outlier Index, Query, Approx. Result]

Submitted VLDB 2015. Stale View Cleaning: Getting Fresh Answers From Materialized Views

Slide25

Data cleaning for Machine Learning?

[Diagram labels: Dirty Data, Clean Data, Correction, Θ*]

Slide26

Tackling crowd latency

Our approach: treat crowd workers like nodes in a distributed system!
  Detect slow/low-quality workers
  Mitigate straggling workers
  Tune active learning hyper-parameters for performance
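Two of these ideas can be sketched with simple heuristics borrowed from distributed systems: flag workers whose typical latency is far above the pool's, and replicate a task so it completes when the fastest worker responds. The threshold, function names, and latency figures (in seconds) are assumptions for illustration and are not the talk's actual method.

```python
import statistics


def find_slow_workers(latencies, factor=2.0):
    """Flag workers whose median latency exceeds `factor` times the pool median.

    latencies: dict of worker id -> list of observed task latencies (seconds).
    """
    per_worker = {w: statistics.median(ts) for w, ts in latencies.items()}
    pool_median = statistics.median(per_worker.values())
    return sorted(w for w, m in per_worker.items() if m > factor * pool_median)


def completion_time_with_replication(replica_latencies):
    """A task sent to several workers finishes when the fastest one responds."""
    return min(replica_latencies)


# Illustrative observations only.
observed = {
    "w1": [10, 12, 11],
    "w2": [9, 14, 10],
    "w3": [60, 75, 58],
}
slow = find_slow_workers(observed)
finish = completion_time_with_replication([60, 11])
```

Here `w3` is flagged as a straggler, and replicating its task to a second worker cuts the completion time from 60 seconds to 11 at the price of paying for a redundant answer.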

Slide27