Liwen Sun Michael J Franklin Sanjay Krishnan Reynold S Xin UC Berkeley and Databricks Inc VLDB 2014 March 17 2015 Heymo Kou Introduction Overview Workload Analysis The Partitioning Problem ID: 403565
Slide1
Fine-grained Partitioning for Aggressive Data Skipping
Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin†
UC Berkeley and †Databricks Inc.
VLDB 2014
March 17, 2015
Heymo Kou
Slide2
Contents
  Introduction
  Overview
  Workload Analysis
  The Partitioning Problem
  Feature-based Data Skipping
  Discussion
  Experimental Evaluation
  Related Work & Conclusion
Slide3
Introduction
  Several ways to improve data scan throughput
    Memory caching
    Parallelization
    Data compression
    Reducing data access (data skipping)
  Increasing interest in reducing data access
Slide4
Recall Google’s PowerDrill
Slide5
Traditionally, range partitioning
PowerDrill
  Composite range partitioning
Logic difference
Skew inevitable
Slide6
Contributions
  Feature Selection
    Analyze frequent query features
  Optimal Partitioning
    Formulate the Balanced MaxSkip partitioning problem
  Scalability
Slide7
Overview: Workload Assumptions
  Filter Commonality
    Only a small set of filters is commonly used
  Filter Stability
    Most filters in future queries have occurred in past queries
Slide8
Overview: Blocking Workflow
  Workload Analyzer
    Extract features
  Featurization
    Evaluate filters: tuple -> (vector, tuple)
  Reduction
    Group (vector, tuple) pairs by vector
  Partitioner
    Split data
  Shuffle
    Augment partitioned data
  Catalog Update
    Union vectors for each block
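The featurization and reduction steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three predicates in FEATURES, the column names, and the sample rows are all hypothetical, and bit j of a vector is assumed to be 1 when the tuple satisfies feature j.

```python
from collections import Counter

# Hypothetical features: each one is a predicate over a tuple (a dict here).
FEATURES = [
    lambda t: t["country"] == "US",
    lambda t: t["price"] > 100,
    lambda t: t["year"] >= 2014,
]

def featurize(row):
    """Evaluate every feature on one tuple and return its bit vector."""
    return tuple(int(f(row)) for f in FEATURES)

def reduce_vectors(rows):
    """Group tuples by feature vector and count the tuples per vector."""
    return Counter(featurize(r) for r in rows)

# Made-up sample data, for illustration only.
rows = [
    {"country": "US", "price": 150, "year": 2014},
    {"country": "US", "price": 20, "year": 2013},
    {"country": "DE", "price": 150, "year": 2014},
    {"country": "US", "price": 300, "year": 2015},
]
vector_counts = reduce_vectors(rows)
```

After reduction, the partitioner only needs to look at the distinct vectors and their counts rather than at every tuple.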
Slide9
Workload Analysis
  Goal: extract features from the query traces
  Given
  Predicate Augmentation
  Reduce Redundancy
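A much simplified sketch of picking features from a query trace is shown below. It only counts how often each filter appears and keeps the most frequent ones; the predicate augmentation and redundancy reduction steps named on the slide are not modeled. The function name select_features, the num_feat argument, and the sample log are assumptions made for illustration.

```python
from collections import Counter

def select_features(query_log, num_feat):
    """Return the num_feat most frequent filters in the log.

    query_log is a list of queries, each given as a list of filter strings.
    """
    counts = Counter()
    for filters in query_log:
        counts.update(set(filters))  # count each filter once per query
    return [f for f, _ in counts.most_common(num_feat)]

# Made-up query trace, for illustration only.
query_log = [
    ["country = 'US'", "year >= 2014"],
    ["country = 'US'"],
    ["price > 100", "country = 'US'"],
    ["year >= 2014"],
]
features = select_features(query_log, num_feat=2)
```

Under the filter commonality assumption, a small num_feat already covers most queries in the trace.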
Slide10
Partitioning Problem: Problem Definition
  Set of m features
  Collection V of m-dimensional bit vectors
  Partitioning over V
  Union vector of all vectors in Pi
  Cost function (sum of tuples that can be skipped)
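The union vector and cost function can be sketched in a few lines. The formulas on the slide did not survive extraction, so this is a reading under stated assumptions: the cost of a block is its tuple count times the total weight of the features whose bit in the union vector is 0, and the per-feature weights (here, how many workload queries each feature covers) are hypothetical values.

```python
def union_vector(vectors):
    """Bitwise OR of all feature vectors in one block."""
    return tuple(int(any(bits)) for bits in zip(*vectors))

def block_cost(vectors, weights):
    """Weighted count of tuples that can be skipped for one block.

    A 0 at position j of the union vector means no tuple in the block
    satisfies feature j, so queries covered by feature j skip the block.
    """
    union = union_vector(vectors)
    skippable = sum(w for w, bit in zip(weights, union) if bit == 0)
    return len(vectors) * skippable

def partitioning_cost(blocks, weights):
    """Sum of the block costs over a whole partitioning."""
    return sum(block_cost(b, weights) for b in blocks)

weights = [3, 2, 1]  # assumed: queries covered by each of the 3 features
blocks = [
    [(1, 0, 0), (1, 0, 0)],
    [(0, 1, 1), (0, 1, 0)],
]
total = partitioning_cost(blocks, weights)
```

Putting all four vectors into a single block would give a union vector of (1, 1, 1) and a cost of 0, which is why grouping similar vectors together matters.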
Slide11
Partitioning Problem: Balanced MaxSkip Partitioning
  Cost function over a partitioning
  Problem 1 (Balanced MaxSkip Partitioning)
  NP-hard, by reduction from hypergraph bisection
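To make the objective and the balance constraint concrete, here is a brute-force toy that is not the method from the talk. It assumes the cost of a block is its size times the total weight of features absent from the block, restricts the search to two blocks, and requires each block to hold at least min_size tuples. The weights and vectors are made up.

```python
from itertools import combinations

def cost(block, weights):
    """Block size times the total weight of features absent from the block."""
    union = [any(bits) for bits in zip(*block)]
    return len(block) * sum(w for w, hit in zip(weights, union) if not hit)

def best_two_way_split(vectors, weights, min_size):
    """Try every split into two blocks of at least min_size tuples each.

    Returns (score, left, right) for the first best split found, or None
    if no split satisfies the size constraint.
    """
    n = len(vectors)
    best = None
    for k in range(min_size, n - min_size + 1):
        for left_idx in combinations(range(n), k):
            left = [vectors[i] for i in left_idx]
            right = [vectors[i] for i in range(n) if i not in left_idx]
            score = cost(left, weights) + cost(right, weights)
            if best is None or score > best[0]:
                best = (score, left, right)
    return best

vectors = [(1, 0, 0), (0, 1, 1), (1, 0, 0), (0, 1, 0)]
weights = [3, 2, 1]  # assumed feature weights
best = best_two_way_split(vectors, weights, min_size=2)
```

The enumeration grows exponentially with the number of vectors, so it is only usable on toy inputs; that is consistent with the NP-hardness noted on the slide.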
Slide12
Partitioning Problem: Example of Blocking
Slide13
Feature-Based Data Skipping: Query Execution
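A minimal sketch of how a query might use the per-block union vectors at execution time is given below. The catalog layout, the block ids, and the function name blocks_to_scan are hypothetical. It assumes the features that cover the query have already been identified, meaning that every tuple matching the query also satisfies those features.

```python
def blocks_to_scan(catalog, query_features):
    """Return the ids of blocks that cannot be skipped.

    catalog maps block id -> union vector. query_features lists the indices
    of the features that cover the query. A block is skipped when its union
    vector has a 0 for any of those features.
    """
    return [
        block_id
        for block_id, union in catalog.items()
        if all(union[j] == 1 for j in query_features)
    ]

# Made-up catalog with three blocks and three features.
catalog = {"b0": (1, 0, 0), "b1": (0, 1, 1), "b2": (1, 1, 0)}
to_scan = blocks_to_scan(catalog, query_features=[0])
```

A query that no feature covers gets an empty query_features list and scans every block, so skipping never changes the query result.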
Slide14
Discussion
  Data Update
    Infrequent ad-hoc updates; data is batch-inserted and batch-deleted
    Fine-grained blocking can still be applied to each partition separately
  Parameter Selection
    Two key parameters in the blocking process
      numFeat: number of features
      minSize: minimum number of tuples per block
    Default parameters
      numFeat: < 50
      minSize: 64-128 MB (fits in an HDFS block)
Slide15
Experiment [1/3]
  Environment
    Spark cluster on Amazon EC2
      25 m2.4xlarge instances
      8 x 2.66 GHz CPU cores
      68.4 GB RAM
      2 x 840 GB disk storage
    HDFS
  Datasets
    TPC-H benchmark data
    TPC-H Skewed
    Conviva
      Anonymized user access log of video streams
Slide16
Experiment [2/3]
  FullScan: data skipping disabled
  Range1: Shark's data skipping
  Range2: composite range partitioning (PowerDrill)
Slide17
Experiment [3/3]
  Effect of numFeat
  Breakdown of blocking time
Slide18
Conclusion
  Fine-grained data blocking techniques
    Partition data tuples into blocks
  Data skipping reduces data access by 5-7x
  2-5x improvement in query response time
  Both compared to range-based blocking techniques