Synthesizing Effective Data Compression Algorithms for GPUs
Annie Yang and Martin Burtscher*
Department of Computer Science
Highlights
- MPC compression algorithm
  - Brand-new lossless compression algorithm for single- and double-precision floating-point data
  - Systematically derived to work well on GPUs
- MPC features
  - Compression ratio is similar to the best CPU algorithms
  - Throughput is much higher
  - Requires little internal state (no tables or dictionaries)
Introduction
- High-performance computing systems
  - Depend increasingly on accelerators
  - Process large amounts of floating-point (FP) data
  - Moving this data is often the performance bottleneck
- Data compression
  - Can increase transfer throughput
  - Can reduce storage requirements
  - But only if it is effective, fast (real-time), and lossless
Problem Statement
- Existing FP compression algorithms for GPUs are fast but compress poorly
- Existing FP compression algorithms for CPUs compress much better but are slow
  - Parallel codes run serial algorithms on multiple chunks
  - Too much state per thread for a GPU implementation
  - The best serial algorithms may not be scalably parallelizable
- Do effective FP compression algorithms for GPUs exist? And if so, how can we create such an algorithm?
Our Approach
- Need a brand-new massively parallel algorithm
- Study existing FP compression algorithms
  - Break them down into their constituent parts
  - Only keep the GPU-friendly parts
  - Generalize them as much as possible
- Resulted in 24 algorithmic components
  - CUDA implementation: each component takes a sequence of values as input and outputs a transformed sequence
  - Components operate on the integer representation of the data
Our Approach (cont.)
- Automatically synthesize compression algorithms by chaining components
  - Use exhaustive search to find the best four-component chains
- Synthesize the decompressor
  - Employ the inverse components
  - Perform the opposite transformation on the data
Mutator Components
- Mutators computationally transform each value
  - Do not use information about any other value
- NUL outputs the input block (identity)
- INV flips all the bits
- │, called cut, is a singleton pseudo component that converts a block of words into a block of bytes
  - Merely a type cast, i.e., no computation or data copying
  - Byte granularity can be better for compression
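The three mutators can be sketched sequentially in Python (illustrative only: the real components are CUDA kernels over blocks of 32-bit words, and these function names are ours, not the paper's):

```python
def nul(block):
    """NUL: identity -- output the input block unchanged."""
    return list(block)

def inv(block, bits=32):
    """INV: flip all bits of every word."""
    mask = (1 << bits) - 1
    return [v ^ mask for v in block]

def cut(block, bits=32):
    """Cut: reinterpret a block of words as a block of bytes.
    In CUDA this is a mere type cast; here the split is explicit,
    assuming little-endian byte order."""
    return [(v >> (8 * i)) & 0xFF for v in block for i in range(bits // 8)]
```

For example, under the little-endian assumption, cut([0x04030201]) yields the byte sequence 1, 2, 3, 4.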
Shuffler Components
- Shufflers reorder whole values or the bits of values
  - Do not perform any computation
- Each thread block operates on a chunk of values
- BIT emits the most significant bits of all values, followed by the second most significant bits, etc.
- DIMn groups values by dimension n
  - Tested n = 2, 3, 4, 5, 8, 16, and 32
  - For example, DIM2 has the following effect: the sequence A, B, C, D, E, F becomes A, C, E, B, D, F
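The two shufflers can be sketched as follows (sequential Python for illustration; the real components are parallel CUDA kernels, and BIT repacks the emitted bits into words rather than returning a bit list):

```python
def bit_shuffle(block, bits=32):
    """BIT: emit the most significant bits of all values, then the
    second most significant bits, etc. Returned as a flat bit list
    for clarity."""
    return [(v >> pos) & 1 for pos in range(bits - 1, -1, -1) for v in block]

def dim_shuffle(block, n):
    """DIMn: group values by dimension, so an interleaved sequence
    A, B, C, D, E, F under DIM2 becomes A, C, E, B, D, F."""
    return [block[i] for d in range(n) for i in range(d, len(block), n)]
```

The DIM2 example from above reproduces directly: dim_shuffle(['A', 'B', 'C', 'D', 'E', 'F'], 2) returns ['A', 'C', 'E', 'B', 'D', 'F'].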
Predictor Components
- Predictors guess values based on previous values and compute residuals (true minus guessed value)
  - Residuals tend to cluster around zero, making them easier to compress than the original sequence
- Each thread block operates on a chunk of values
- LNVns subtracts the nth prior value from the current value
  - Tested n = 1, 2, 3, 5, 6, and 8
- LNVnx XORs the current value with the nth prior value
  - Tested n = 1, 2, 3, 5, 6, and 8
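The two predictor families can be sketched sequentially (illustrative only; treating prior values before the start of the chunk as zero is an assumption of this sketch, and the modular subtraction mimics unsigned integer wrap-around):

```python
def lnv_s(block, n, bits=32):
    """LNVns: subtract the nth prior value from the current value,
    modulo 2^bits."""
    mask = (1 << bits) - 1
    return [(v - (block[i - n] if i >= n else 0)) & mask
            for i, v in enumerate(block)]

def lnv_x(block, n):
    """LNVnx: XOR the current value with the nth prior value."""
    return [v ^ (block[i - n] if i >= n else 0)
            for i, v in enumerate(block)]
```

On a constant sequence, both predictors produce residuals that are zero after the first value, which is exactly what makes the subsequent reducer effective.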
Reducer Components
- Reducers eliminate redundancies in the value sequence
  - All other components cannot change the length of the sequence, i.e., only reducers can compress it
- Each thread block operates on a chunk of values
- ZE emits a bitmap of the zeros followed by the non-zero values
  - Effective if the input sequence contains many zeros
- RLE performs run-length encoding, i.e., replaces repeating values by a count and a single value
  - Effective if the input contains many repeating values
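The two reducers can be sketched sequentially (illustrative only; the CUDA versions compact their output with parallel prefix scans, and the exact on-the-wire layout here is an assumption):

```python
def ze(block):
    """ZE: emit a bitmap marking the non-zero positions, followed by
    the non-zero values themselves."""
    bitmap = [1 if v != 0 else 0 for v in block]
    return bitmap, [v for v in block if v != 0]

def rle(block):
    """RLE: replace each run of repeating values by a [count, value] pair."""
    out = []
    for v in block:
        if out and out[-1][1] == v:
            out[-1][0] += 1
        else:
            out.append([1, v])
    return out
```

ZE shrinks the sequence whenever zeros are frequent even if they are scattered, whereas RLE only pays off when equal values appear back to back, matching the analysis later in the deck.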
Algorithm Synthesis
- Determine the best four-stage algorithms with a cut
  - Exhaustive search of all 138,240 possible combinations
  - 13 double-precision data sets (19–277 MB)
  - Observational data, simulation results, MPI messages
  - Single-precision data derived from the double-precision data
- Create a general GPU-friendly compression algorithm
  - Analyze the best algorithm for each data set and precision
  - Find commonalities and generalize them into one algorithm
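The search step can be pictured as a brute-force loop (a hedged sketch only: the component list, the chain-application order, and the size metric below are placeholders, not the paper's actual setup, which also places a cut and constrains the final stage):

```python
from itertools import product

def best_chain(components, data, compressed_size):
    """Exhaustively try every four-component chain on the data and
    keep the one yielding the smallest compressed output."""
    best, best_size = None, float('inf')
    for chain in product(components, repeat=4):
        out = data
        for comp in chain:          # apply the stages left to right
            out = comp(out)
        size = compressed_size(out)
        if size < best_size:
            best, best_size = chain, size
    return best, best_size
```

With 24 components and constrained stage roles, this kind of enumeration yields the 138,240 candidates reported above, each scored per data set.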
Best of 138,240 Algorithms
Analysis of Reducers
- Double-precision results only; single-precision results are similar
- ZE or RLE required at the end (not counting the cut or encoder)
- ZE dominates
  - Many zeros, but not in a row
- The first three stages contain almost no reducers
  - Transformations are key to making the reducer effective
  - Chaining whole compression algorithms may be futile
Analysis of Mutators
- NUL and INV never used
  - No need to invert bits
  - Fewer stages perform worse
- Cut often at the end (i.e., not used)
  - Word granularity suffices
  - Easier/faster to implement
- DIM8 right after the cut (DIM4 with single precision)
  - Used to separate the byte positions of each word
- Synthesis yielded an unforeseen use of the DIM component
Analysis of Shufflers
- Shufflers are important
  - Almost always included
- BIT used very frequently
  - FP bit positions correlate more strongly than values
- DIM has two uses
  - Separate bytes (see before): right after the cut
  - Separate values of multi-dimensional data sets (intended use): early stages
Analysis of Predictors
- Predictors are very important (data model)
  - Used in every case
  - Often two predictors used
- LNVns dominates LNVnx
  - The arithmetic (subtraction) difference is superior to the bit-wise (XOR) difference in the residual
- Dimension n
  - Separates values of multi-dimensional data sets (in the 1st stage)
Analysis of Overall Best Algorithm
- Same algorithm for SP and DP
  - Few components mismatch, but the LNV6s dimension is off
- Most frequent pattern: LNV*s BIT LNV1s ZE
  - The star denotes the dimensionality
- Why 6 in the starred position?
  - Not used in the individual algorithms
  - 6 is the least common multiple of 1, 2, and 3
  - Did not test n > 8
MPC: Generalization of Overall Best
- MPC algorithm: Massively Parallel Compression
- Uses the generalized pattern "LNVds BIT LNV1s ZE", where d is the data set dimensionality
- Matches the best algorithm on several DP and SP data sets
- Performs even better when the true dimensionality is used
Evaluation Methodology

- System
  - Dual 10-core Xeon E5-2680 v2 CPUs
  - K40 GPU with 15 SMs (2880 cores)
- 13 DP and 13 SP real-world data sets (same as before)
- Compression algorithms
  - CPU: bzip2, gzip, lzop, and pFPC
  - GPU: GFC and MPC (our algorithm)
Compression Ratio (Double Precision)
- MPC delivers record compression on 5 data sets
  - In spite of the "GPU-friendly components" constraint
- MPC outperformed by bzip2 and pFPC on average
  - Due to msg_sppm and num_plasma
- MPC superior to GFC (the only other GPU compressor)
Compression Ratio (Single Precision)
- MPC delivers record compression on 8 data sets
  - In spite of the "GPU-friendly components" constraint
- MPC is outperformed by bzip2 on average
  - Due to num_plasma
- MPC is "superior" to GFC and pFPC
  - They do not support single-precision data; MPC does
Throughput (Gigabytes per Second)
- MPC outperforms all CPU compressors
  - Including pFPC running on two 10-core CPUs, by 7.5x
- MPC slower than GFC but mostly faster than PCIe
- MPC uses a slow O(n log n) prefix-scan implementation
Summary
- Goal of research
  - Create an effective FP data compression algorithm that is suitable for massively parallel GPUs
- Approach
  - Extracted 24 GPU-friendly components and evaluated 138,240 combinations to find the best 4-stage algorithms
  - Generalized the findings to derive the MPC algorithm
- Result
  - Brand-new compression algorithm for SP and DP data
  - Compresses about as well as CPU algorithms but is much faster
Future Work and Acknowledgments
- Future work
  - Faster implementation, more components, longer chains, and other inputs, data types, and constraints
- Acknowledgments
  - National Science Foundation
  - NVIDIA Corporation
  - Texas Advanced Computing Center
- Contact information: burtscher@txstate.edu
Number of Stages
- 3 stages reach about 95% of the compression ratio
Single- vs Double-Precision Algorithms
MPC Operation
- What does "LNVds BIT LNV1s ZE" do?
- LNVds predicts each value using a similar value to obtain a residual sequence with many small values
  - Similar value = the most recent prior value from the same dimension
- BIT groups the residuals by bit position
  - All LSBs, then all second LSBs, etc.
- LNV1s turns identical consecutive words into zeros
- ZE eliminates these zero words
- GPU friendly
  - All four components are massively parallel
  - Can be implemented with prefix scans or simpler operations
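The four steps can be chained end to end in a toy sequential sketch (our own illustration, not the paper's code: 8-bit words and LSB-first bit grouping for readability, whereas real MPC runs as massively parallel CUDA kernels on the 32/64-bit integer representation of the FP values):

```python
BITS = 8  # toy word width, an assumption for readability

def lnv_s(block, n):
    # subtract the nth prior value (zero before the start), modulo 2^BITS
    return [(v - (block[i - n] if i >= n else 0)) % (1 << BITS)
            for i, v in enumerate(block)]

def bit_shuffle(block):
    # group bits by position: all LSBs first, then all second LSBs, etc.
    return [(v >> pos) & 1 for pos in range(BITS) for v in block]

def ze(block):
    # bitmap of non-zero positions followed by the non-zero values
    return [1 if v else 0 for v in block] + [v for v in block if v]

def mpc_sketch(block, d):
    residuals = lnv_s(block, d)    # LNVds: residual per dimension
    bits = bit_shuffle(residuals)  # BIT: residual bits grouped by position
    diffs = lnv_s(bits, 1)         # LNV1s: identical neighbors become zero
    return ze(diffs)               # ZE: drop the zero words
```

On a chunk of identical values, every stage after the first residual produces zeros, so ZE reduces the chunk to little more than its bitmap.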