Fine-grained Partitioning for Aggressive Data Skipping

Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin. UC Berkeley and Databricks Inc. VLDB 2014. March 17, 2015. Heymo Kou.


Presentation Transcript

Slide1

Fine-grained Partitioning for Aggressive Data Skipping

Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin†

UC Berkeley and †Databricks Inc.

VLDB 2014

March 17, 2015

Heymo Kou

Slide2

Contents

Introduction
Overview
Workload Analysis
The Partitioning Problem
Feature-based Data Skipping
Discussion
Experimental Evaluation
Related Work & Conclusion

Slide3

Introduction

Several ways to improve data scan throughput
  Memory caching
  Parallelization
  Data compression
  Reducing data access (data skipping)
Increasing interest in reducing data access

Slide4

Recall Google’s PowerDrill

Slide5

Traditionally, range partitioning
PowerDrill
  Composite range partitioning
Logic difference
Skew is inevitable

Slide6

Contributions

Feature Selection
  Analyze frequent query features
Optimal Partitioning
  Formulate the Balanced MaxSkip partitioning problem
Scalability

Slide7

Overview
Workload Assumptions

Filter Commonality
  Only a small set of filters is commonly used
Filter Stability
  Future queries reuse filters that have occurred before

Slide8

Overview
Blocking Workflow

Workload Analyzer
  Extract features
Featurization
  Evaluate filters
  tuple → (vector, tuple)
Reduction
  Group by (vector, tuple)
Partitioner
  Split data
Shuffle
  Augment partitioned data
Catalog Update
  Union vectors for each block
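The featurization and reduction steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three features, the row schema, and the function names are all hypothetical, and it assumes that bit i of a tuple's vector is 1 when the tuple satisfies feature i.

```python
from collections import Counter

# Hypothetical features: each is a named filter predicate over a tuple (a dict).
FEATURES = [
    ("product = 'shoes'", lambda t: t["product"] == "shoes"),
    ("revenue > 100", lambda t: t["revenue"] > 100),
    ("country = 'US'", lambda t: t["country"] == "US"),
]

def featurize(t):
    """Evaluate every feature on tuple t; bit i is 1 if t satisfies feature i."""
    return tuple(int(pred(t)) for _, pred in FEATURES)

def reduce_vectors(tuples):
    """Group tuples by feature vector and count them: {vector: count}."""
    return Counter(featurize(t) for t in tuples)

# Made-up sample rows, for illustration only.
rows = [
    {"product": "shoes", "revenue": 150, "country": "US"},
    {"product": "shirts", "revenue": 20, "country": "US"},
    {"product": "shirts", "revenue": 35, "country": "US"},
    {"product": "shoes", "revenue": 80, "country": "DE"},
]

vector_counts = reduce_vectors(rows)
```

Grouping by vector means the partitioner only has to work on the distinct vectors and their counts, which is typically far fewer items than the number of tuples.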

Slide9

Workload Analysis

Goal: extract features from the query traces
Given
Predicate Augmentation
Reduce Redundancy
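As a rough illustration of extracting features from query traces, the sketch below simply counts how often each filter predicate appears in a log and keeps the most frequent ones. It is a simplification under stated assumptions: the log format (one list of predicate strings per query) and the function name are hypothetical, and the predicate augmentation and redundancy reduction steps named on the slide are not modeled.

```python
from collections import Counter

def select_features(query_log, num_feat):
    """Return the num_feat most frequent filter predicates in the log."""
    counts = Counter()
    for filters in query_log:
        # Count each predicate at most once per query.
        counts.update(set(filters))
    # Sort by descending frequency, then by text so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pred for pred, _ in ranked[:num_feat]]

# Made-up query log: each entry lists the filter predicates of one query.
query_log = [
    ["country = 'US'", "revenue > 100"],
    ["country = 'US'"],
    ["product = 'shoes'", "country = 'US'"],
    ["revenue > 100"],
]

top_features = select_features(query_log, 2)
```

Under the filter commonality and stability assumptions, a small selected set like this would be expected to cover most future queries.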

Slide10

Partitioning Problem
Problem Definition

Set of m features
Collection V of m-dimensional bit vectors
Partitioning over V
Union vector of all vectors in P_i
Cost function (sum of tuples that can be skipped)
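The formulas themselves did not survive in this transcript. One way to write the definitions listed above, consistent with "sum of tuples that can be skipped", is given below. The symbol names and the per-feature weights w_j (for example, how many workload queries a feature covers) are assumptions made for this reconstruction.

```latex
\begin{align*}
F &= \{F_1, \dots, F_m\} && \text{features, with assumed weights } w_1, \dots, w_m \\
V &= \{v_1, \dots, v_n\}, \quad v_i \in \{0,1\}^m && \text{one bit vector per tuple} \\
P &= \{P_1, \dots, P_k\} && \text{a partitioning over } V \\
v(P_i) &= \bigvee_{v \in P_i} v && \text{union (bitwise OR) vector of block } P_i \\
C(P_i) &= |P_i| \sum_{j=1}^{m} w_j \bigl(1 - v(P_i)_j\bigr) && \text{tuples skipped in block } P_i \\
C(P) &= \sum_{i=1}^{k} C(P_i) && \text{total tuples skipped}
\end{align*}
```

A block contributes to the cost only through features whose bit in its union vector is 0, since those are the features for which every tuple in the block can be skipped.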

Slide11

Partitioning Problem
Balanced MaxSkip Partitioning

Cost function over a partitioning
Problem 1 (Balanced MaxSkip Partitioning)
NP-hard (proof via hypergraph bisection)
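Since the problem is NP-hard, a heuristic is needed in practice. The slide does not show which heuristic is used, so the following is only an illustrative greedy bottom-up merge, not the authors' algorithm. It assumes the cost of a block is its size times the total weight of features whose union-vector bit is 0; the function names and sample data are hypothetical.

```python
def union(a, b):
    """Bitwise OR of two equal-length bit vectors."""
    return tuple(x | y for x, y in zip(a, b))

def block_cost(vec, size, weights):
    """Tuples skipped for one block: size times the weight of features with bit 0."""
    return size * sum(w for bit, w in zip(vec, weights) if bit == 0)

def total_cost(blocks, weights):
    """Sum of block costs over a partitioning given as (union_vector, size) pairs."""
    return sum(block_cost(v, n, weights) for v, n in blocks)

def greedy_blocks(vector_counts, weights, min_size):
    """Merge blocks until each holds at least min_size tuples.

    Each step merges the smallest block with the partner whose merge
    loses the least skipping. Starts with one block per distinct vector.
    """
    blocks = sorted(vector_counts.items())
    while len(blocks) > 1 and any(n < min_size for _, n in blocks):
        i = min(range(len(blocks)), key=lambda k: blocks[k][1])
        vi, ni = blocks[i]

        def loss(j):
            vj, nj = blocks[j]
            before = block_cost(vi, ni, weights) + block_cost(vj, nj, weights)
            return before - block_cost(union(vi, vj), ni + nj, weights)

        j = min((k for k in range(len(blocks)) if k != i), key=loss)
        vj, nj = blocks[j]
        merged = (union(vi, vj), ni + nj)
        blocks = [b for k, b in enumerate(blocks) if k not in (i, j)] + [merged]
    return blocks

# Made-up input: distinct feature vectors with tuple counts, unit weights.
vector_counts = {(1, 0, 0): 2, (1, 0, 1): 2, (0, 1, 0): 2, (0, 1, 1): 2}
weights = (1, 1, 1)
blocks = greedy_blocks(vector_counts, weights, min_size=4)
```

On this input the sketch pairs vectors that already share their zero bits, so each resulting block keeps one skippable feature.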

Slide12

Partitioning Problem

Example of Blocking

Slide13

Feature-Based Data Skipping

Query Execution
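A minimal sketch of how skipping could work at query time, assuming the catalog stores one union vector per block as in the blocking workflow. A feature is taken to subsume a query when every tuple that satisfies the query also satisfies the feature, so a 0 bit for that feature means no tuple in the block can match. The catalog layout, block IDs, and function names here are hypothetical.

```python
def can_skip(union_vector, subsuming_features):
    """True if some feature subsuming the query has bit 0 in the block's union vector."""
    return any(union_vector[j] == 0 for j in subsuming_features)

def blocks_to_scan(catalog, subsuming_features):
    """Return the IDs of blocks that cannot be skipped for this query."""
    return [
        block_id
        for block_id, union_vector in catalog.items()
        if not can_skip(union_vector, subsuming_features)
    ]

# Made-up catalog: block ID -> union vector over three features.
catalog = {
    "b0": (0, 1, 1),
    "b1": (1, 0, 1),
    "b2": (1, 1, 0),
}

# A query subsumed by feature 0 only needs the blocks whose bit 0 is set.
scan_for_f0 = blocks_to_scan(catalog, [0])
# A query subsumed by features 0 and 1 can skip any block missing either one.
scan_for_f0_f1 = blocks_to_scan(catalog, [0, 1])
```

A query that no feature subsumes gets an empty feature list and falls back to scanning every block, so skipping never affects correctness in this sketch.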

Slide14

Discussion

Data Update
  Infrequent ad-hoc updates; data is batch-inserted and batch-deleted
  Fine-grained blocking can still be applied to each batch separately
Parameter Selection
  Two key parameters in the blocking process
    numFeat: number of features
    minSize: minimum number of tuples per block
Default Parameters
  numFeat: < 50
  minSize: 64-128 MB (which fits in an HDFS block)

Slide15

Experiment [1/3]

Environment
  Amazon Spark EC2 cluster
  25 m2.4xlarge instances
    8 x 2.66 GHz CPU cores
    64.8 GB RAM
    2 x 840 GB disk storage
  HDFS
Datasets
  TPC-H benchmark data
  TPC-H Skewed
  Conviva
    Anonymized user access log of video streams

Slide16

Experiment [2/3]

FullScan: data skipping disabled
Range1: Shark’s data skipping
Range2: Composite range partitioning (PowerDrill)

Slide17

Experiment [3/3]

Effect of numFeat
Breakdown of blocking time

Slide18

Conclusion

Fine-grained data blocking techniques
  Partition data tuples into blocks
Data skipping reduces data access by 5-7x
2-5x improvement in query response time
Compared to range-based blocking techniques
