Fine-grained Partitioning for Aggressive Data Skipping - PowerPoint Presentation



Slide1

Fine-grained Partitioning for Aggressive Data Skipping

Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin†

UC Berkeley and †Databricks Inc.

VLDB 2014

March 17, 2015

Heymo Kou

Slide2

Contents

Introduction
Overview
Workload Analysis
The Partitioning Problem
Feature-based Data Skipping
Discussion
Experimental Evaluation
Related Work & Conclusion

Slide3

Introduction

Several ways to improve data scan throughput
  Memory caching
  Parallelization
  Data compression
  Reducing data access (data skipping)
Increasing interest in reducing data access

Slide4

Recall Google’s PowerDrill

Slide5

Traditionally, range partitioning
PowerDrill
  Composite range partitioning
Logic difference
Skew inevitable

Slide6

Contributions

Feature Selection
  Analyze frequent query features
Optimal Partitioning
  Formulate the Balanced MaxSkip partitioning problem
Scalability

Slide7

Overview
Workload Assumptions

Filter Commonality
  Only a small set of filters is commonly used
Filter Stability
  Filters in future queries have occurred in past queries

Slide8

Overview
Blocking Workflow

Workload Analyzer
  Extract features
Featurization
  Evaluate filters
  tuple -> (vector, tuple)
Reduction
  Group by (vector, tuple)
Partitioner
  Split data
Shuffle
  Augment partitioned data
Catalog Update
  Union vectors for each block
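
The featurization and reduction steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the two predicates in `features`, the helper names `featurize` and `reduce_vectors`, and the sample tuples are all hypothetical, and the reduction is assumed to need only a count per distinct feature vector.

```python
from collections import Counter

# Hypothetical features: each one is a filter predicate over a tuple (a dict here).
features = [
    lambda t: t["product"] == "shoes",
    lambda t: t["revenue"] > 100,
]

def featurize(t, features):
    """Evaluate every feature filter on a tuple and return its bit vector."""
    return tuple(int(f(t)) for f in features)

def reduce_vectors(tuples, features):
    """Group tuples by feature vector, keeping a count per distinct vector."""
    return Counter(featurize(t, features) for t in tuples)

tuples = [
    {"product": "shoes", "revenue": 150},
    {"product": "shoes", "revenue": 50},
    {"product": "shirt", "revenue": 200},
    {"product": "shoes", "revenue": 120},
]

vector_counts = reduce_vectors(tuples, features)
for vector, count in vector_counts.items():
    print(vector, count)
```

Working on distinct vectors instead of raw tuples keeps the input to the partitioner small when many tuples share the same vector.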

Slide9

Workload Analysis

Goal: extract features from the query traces
Given
Predicate Augmentation
Reduce Redundancy
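
As a rough illustration of extracting features from a query trace, the sketch below counts how often each filter appears and keeps the most frequent ones. It is a simplification under stated assumptions: filters are plain strings, `select_features` and the sample trace are hypothetical, and predicate augmentation and redundancy reduction are not modeled.

```python
from collections import Counter

def select_features(query_trace, num_feat):
    """Return the num_feat filters that occur in the most queries.

    query_trace: list of queries, each given as a list of filter strings.
    """
    counts = Counter()
    for filters in query_trace:
        # Count each filter at most once per query.
        counts.update(set(filters))
    return [f for f, _ in counts.most_common(num_feat)]

# Hypothetical query trace.
query_trace = [
    ["product = 'shoes'", "revenue > 100"],
    ["product = 'shoes'"],
    ["product = 'shoes'", "country = 'US'"],
    ["revenue > 100"],
]

top = select_features(query_trace, num_feat=2)
print(top)
```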

Slide10

Partitioning Problem
Problem Definition

Set of m features
Collection of m-dimensional bit vectors
Partitioning over V
Union vector of all vectors in P_i
Cost function (sum of tuples that can be skipped)
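
The definitions above can be made concrete with a small sketch. It assumes a block is represented as a list of (vector, count) pairs and that a feature whose bit is 0 in the block's union vector lets every tuple of that block be skipped for that feature. The function names and the optional per-feature `weights` are assumptions made for illustration.

```python
def union_vector(vectors):
    """Bitwise OR of all feature vectors in a block."""
    return tuple(int(any(bits)) for bits in zip(*vectors))

def block_cost(block, weights=None):
    """Skipped tuples for one block: block size times the (weighted)
    number of features whose bit in the union vector is 0."""
    size = sum(count for _, count in block)
    union = union_vector([vector for vector, _ in block])
    if weights is None:
        weights = [1] * len(union)  # assumption: all features weigh the same
    return size * sum(w for w, bit in zip(weights, union) if bit == 0)

def partitioning_cost(partitioning, weights=None):
    """Sum of block costs over all blocks of a partitioning."""
    return sum(block_cost(block, weights) for block in partitioning)

# Hypothetical blocks over m = 3 features.
block_a = [((1, 0, 0), 3), ((1, 0, 1), 2)]
block_b = [((0, 1, 0), 4)]

print(partitioning_cost([block_a, block_b]))   # two separate blocks
print(partitioning_cost([block_a + block_b]))  # everything in one block
```

Merging the two blocks sets every bit of the union vector, so nothing can be skipped; keeping them apart preserves the zero bits.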

Slide11

Partitioning Problem
Balanced MaxSkip Partitioning

Cost function over a partitioning
Problem 1 (Balanced MaxSkip Partitioning)
NP-hard (by reduction from hypergraph bisection)
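
To see what the problem asks for, a tiny instance can be solved by exhaustive search. The sketch below is only an illustration of the objective and the balance constraint, not a practical algorithm and not the method from the talk: it enumerates every partitioning, so it is feasible only for a handful of distinct vectors. All names, the unweighted cost, and the sample data are assumptions.

```python
def set_partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # ... or open a new block for it.
        yield [[first]] + partial

def cost(partitioning):
    """Unweighted skip count: block size times number of zero union bits."""
    total = 0
    for block in partitioning:
        size = sum(count for _, count in block)
        union = [int(any(bits)) for bits in zip(*(v for v, _ in block))]
        total += size * union.count(0)
    return total

def balanced_maxskip_bruteforce(vector_counts, min_size):
    """Best partitioning in which every block holds at least min_size tuples."""
    best, best_cost = None, -1
    for partitioning in set_partitions(vector_counts):
        if all(sum(c for _, c in block) >= min_size for block in partitioning):
            c = cost(partitioning)
            if c > best_cost:
                best, best_cost = partitioning, c
    return best, best_cost

# Hypothetical (vector, count) pairs over 3 features.
vector_counts = [((1, 0, 0), 2), ((1, 0, 1), 2), ((0, 1, 0), 3), ((0, 1, 1), 1)]
best, best_cost = balanced_maxskip_bruteforce(vector_counts, min_size=3)
print(best_cost, best)
```

The number of partitionings grows extremely fast with the number of distinct vectors, which is why exhaustive search does not scale.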

Slide12

Partitioning Problem
Example of Blocking

Slide13

Query Execution

Feature-Based Data Skipping
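
Feature-based skipping at query time can be sketched as a lookup against the per-block union vectors in the catalog. This is a minimal sketch under assumptions: `catalog`, `blocks_to_scan`, and the block ids are hypothetical, and a query is assumed to be given as the indices of the features whose filters it must satisfy.

```python
def blocks_to_scan(query_features, catalog):
    """Return the ids of blocks that cannot be skipped.

    query_features: indices of features the query's filter must satisfy.
    catalog: dict mapping block id -> union vector of that block.
    """
    scan = []
    for block_id, union in catalog.items():
        # A 0 bit means no tuple in the block satisfies that feature,
        # so the whole block can be skipped for this query.
        if any(union[j] == 0 for j in query_features):
            continue
        scan.append(block_id)
    return scan

# Hypothetical catalog with three blocks over 3 features.
catalog = {
    "b1": (1, 0, 1),
    "b2": (0, 1, 1),
    "b3": (1, 1, 0),
}

print(blocks_to_scan([0], catalog))
print(blocks_to_scan([1, 2], catalog))
print(blocks_to_scan([], catalog))
```

A query that matches none of the selected features gets an empty index list and falls back to scanning every block.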

Slide14

Discussion

Data Update
  Infrequent ad-hoc updates; data is batch-inserted and batch-deleted
  Fine-grained blocking is still applied to each partition separately
Parameter Selection
  Two key parameters in the blocking process
    numFeat: number of features
    minSize: minimum number of tuples per block
  Default parameters
    numFeat: < 50
    minSize: 64–128 MB (which fits in an HDFS block)

Slide15

Experiment [1/3]

Environment
  Spark cluster on Amazon EC2
  25 m2.4xlarge instances
    8 x 2.66 GHz CPU cores
    68.4 GB RAM
    2 x 840 GB disk storage
  HDFS
Datasets
  TPC-H benchmark data
  TPC-H Skewed
  Conviva
    Anonymized user access log of video streams

Slide16

Experiment [2/3]

FullScan: data skipping disabled
Range1: Shark’s data skipping
Range2: composite range partitioning (PowerDrill)

Slide17

Experiment [3/3]

Effect of numFeat
Breakdown of blocking time

Slide18

Conclusion

Fine-grained data blocking techniques
  Partition data tuples into blocks
Data skipping reduces data access by 5-7x
2-5x improvement in query response time
Compared to range-based blocking techniques
