
Slide 1

Synthesizing Effective Data Compression Algorithms for GPUs

Annie Yang and Martin Burtscher*
Department of Computer Science

Slide 2

Highlights

- MPC compression algorithm
  - Brand-new lossless compression algorithm for single- and double-precision floating-point data
  - Systematically derived to work well on GPUs
- MPC features
  - Compression ratio is similar to the best CPU algorithms
  - Throughput is much higher
  - Requires little internal state (no tables or dictionaries)

Slide 3

Introduction

- High-performance computing systems
  - Depend increasingly on accelerators
  - Process large amounts of floating-point (FP) data
  - Moving this data is often the performance bottleneck
- Data compression
  - Can increase transfer throughput
  - Can reduce storage requirements
  - But only if it is effective, fast (real-time), and lossless

Slide 4

Problem Statement

- Existing FP compression algorithms for GPUs
  - Fast but compress poorly
- Existing FP compression algorithms for CPUs
  - Compress much better but are slow
  - Parallel codes run serial algorithms on multiple chunks
  - Too much state per thread for a GPU implementation
  - Best serial algorithms may not be scalably parallelizable
- Do effective FP compression algorithms for GPUs exist? And if so, how can we create such an algorithm?

Slide 5

Our Approach

- Need a brand-new massively parallel algorithm
- Study existing FP compression algorithms
  - Break them down into constituent parts
  - Only keep GPU-friendly parts
  - Generalize them as much as possible
- Resulted in 24 algorithmic components
  - CUDA implementation: each component takes a sequence of values as input and outputs a transformed sequence
  - Components operate on the integer representation of the data (see the sketch below)
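As a minimal illustration of that last point (a sketch only, not the authors' code; the kernel name is made up), the raw bit pattern of each double can be reinterpreted as a 64-bit integer word before any component runs:

    #include <cstdint>

    // Sketch: components never touch float/double values directly; the buffer
    // is reinterpreted as integer words so every transformation stays lossless.
    __global__ void doublesToWords(const double* values, uint64_t* words, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Copy the 64-bit pattern of the double unchanged into an integer.
            words[i] = static_cast<uint64_t>(__double_as_longlong(values[i]));
        }
    }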

Slide 6

Our Approach (cont.)

- Automatically synthesize compression algorithms by chaining components
  - Use exhaustive search to find the best four-component chains
- Synthesize the decompressor
  - Employ inverse components
  - Perform the opposite transformation on the data

Slide 7

Mutator Components

- Mutators computationally transform each value
  - Do not use information about any other value (see the sketch below)
- NUL outputs the input block unchanged (identity)
- INV flips all the bits
- │, called cut, is a singleton pseudo-component that converts a block of words into a block of bytes
  - Merely a type cast, i.e., no computation or data copying
  - Byte granularity can be better for compression
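A minimal mutator sketch, assuming one thread per 32-bit word (the kernel name is hypothetical): INV complements every bit of a value without looking at any other value, while the cut would be a mere pointer reinterpretation and needs no kernel at all.

    #include <cstdint>

    // INV mutator sketch: flip all bits of each word independently.
    __global__ void invMutator(const uint32_t* in, uint32_t* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = ~in[i];
        // NUL would simply be out[i] = in[i]; the cut launches no kernel because
        // it only reinterprets the same buffer at byte granularity.
    }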

Slide 8

Shuffler Components

- Shufflers reorder whole values or bits of values
  - Do not perform any computation
  - Each thread block operates on a chunk of values
- BIT emits the most significant bits of all values, followed by the second most significant bits, etc.
- DIMn groups values by dimension n
  - Tested n = 2, 3, 4, 5, 8, 16, and 32
  - For example, DIM2 has the following effect: the sequence A, B, C, D, E, F becomes A, C, E, B, D, F (see the sketch below)
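A sketch of DIMn as a single gather over the whole buffer (hypothetical kernel name; MPC applies it per thread-block chunk, and BIT performs the analogous regrouping at the bit level). It assumes the length is a multiple of n.

    #include <cstdint>

    // DIMn shuffler sketch: value j belongs to dimension j % n; all values of
    // dimension 0 are emitted first, then dimension 1, and so on.
    // With n = 2: A B C D E F  ->  A C E B D F.
    __global__ void dimShuffler(const uint32_t* in, uint32_t* out, int len, int n)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j < len) {
            int d = j % n;                      // dimension of this value
            int p = j / n;                      // position within that dimension
            out[d * (len / n) + p] = in[j];     // assumes len is a multiple of n
        }
    }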

Slide 9

Predictor Components

- Predictors guess values based on previous values and compute residuals (true minus guessed value)
  - Residuals tend to cluster around zero, making them easier to compress than the original sequence
  - Each thread block operates on a chunk of values
- LNVns subtracts the nth prior value from the current value (see the sketch below)
  - Tested n = 1, 2, 3, 5, 6, and 8
- LNVnx XORs the current value with the nth prior value
  - Tested n = 1, 2, 3, 5, 6, and 8
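A sketch of the subtracting predictor (hypothetical kernel name, per-thread-block chunking omitted): each word's residual is the word minus the nth prior word, with the first n words passed through unchanged; LNVnx would use XOR instead of the subtraction.

    #include <cstdint>

    // LNVns predictor sketch: residual = current word minus the n-th prior word.
    __global__ void lnvSubPredictor(const uint32_t* in, uint32_t* out, int len, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len) {
            uint32_t pred = (i >= n) ? in[i - n] : 0;   // no predecessor -> predict 0
            out[i] = in[i] - pred;                      // wraps mod 2^32, so it is reversible
        }
    }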

Slide 10

Reducer Components

- Reducers eliminate redundancies in the value sequence
  - All other components cannot change the length of the sequence, i.e., only reducers can compress the sequence
  - Each thread block operates on a chunk of values
- ZE emits a bitmap of the zeros followed by the non-zero values (see the sketch below)
  - Effective if the input sequence contains many zeros
- RLE performs run-length encoding, i.e., replaces repeating values by a count and a single value
  - Effective if the input contains many repeating values
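A sketch of ZE in two steps (not the authors' implementation): a kernel builds the zero/non-zero bitmap, and the non-zero words are then compacted. MPC performs the compaction with its own prefix scan; the sketch below substitutes thrust::copy_if for that step.

    #include <cstdint>
    #include <thrust/copy.h>
    #include <thrust/execution_policy.h>

    struct IsNonZero {
        __host__ __device__ bool operator()(uint32_t w) const { return w != 0; }
    };

    // Step 1: one thread per group of 32 words records which words are non-zero.
    __global__ void buildZeroBitmap(const uint32_t* in, uint32_t* bitmap, int len)
    {
        int g = blockIdx.x * blockDim.x + threadIdx.x;
        if (g * 32 < len) {
            uint32_t bits = 0;
            for (int b = 0; b < 32 && g * 32 + b < len; b++)
                if (in[g * 32 + b] != 0) bits |= 1u << b;
            bitmap[g] = bits;
        }
    }

    // Step 2: compact the non-zero words (device pointers); returns their count.
    int compactNonZeros(const uint32_t* d_in, uint32_t* d_out, int len)
    {
        uint32_t* end = thrust::copy_if(thrust::device, d_in, d_in + len,
                                        d_out, IsNonZero());
        return static_cast<int>(end - d_out);
    }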

Slide 11

Algorithm Synthesis

- Determine the best four-stage algorithms with a cut
  - Exhaustive search of all possible 138,240 combinations (see the sketch below)
- 13 double-precision data sets (19 – 277 MB)
  - Observational data, simulation results, MPI messages
  - Single-precision data derived from the double-precision data
- Create a general GPU-friendly compression algorithm
  - Analyze the best algorithm for each data set and precision
  - Find commonalities and generalize them into one algorithm
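Conceptually the search is a brute-force enumeration. The hypothetical host-side driver below shows its shape only; the names are placeholders, and the cut placement and the loop over data sets are omitted.

    #include <cstddef>

    struct DataSet { const void* data; size_t bytes; };

    // Placeholder: in the real search this would run the four-component chain
    // on the GPU and return the compressed size of the data set.
    static size_t compressWithChain(const int chain[4], const DataSet& ds)
    {
        (void)chain;
        return ds.bytes;
    }

    // Try every four-stage chain and remember the one with the best ratio.
    static double findBestChain(const DataSet& ds, int numComponents, int best[4])
    {
        double bestRatio = 0.0;
        int chain[4];
        for (chain[0] = 0; chain[0] < numComponents; chain[0]++)
          for (chain[1] = 0; chain[1] < numComponents; chain[1]++)
            for (chain[2] = 0; chain[2] < numComponents; chain[2]++)
              for (chain[3] = 0; chain[3] < numComponents; chain[3]++) {
                  double ratio = double(ds.bytes) / double(compressWithChain(chain, ds));
                  if (ratio > bestRatio) {
                      bestRatio = ratio;
                      for (int s = 0; s < 4; s++) best[s] = chain[s];
                  }
              }
        return bestRatio;
    }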

Slide 12

Best of 138,240 Algorithms

Slide 13

Analysis of Reducers

- Double-precision results only (single-precision results are similar)
- ZE or RLE is required at the end, not counting the cut (this final stage acts as the encoder)
  - ZE dominates: many 0s, but not in a row
- The first three stages contain almost no reducers
  - Transformations are key to making the reducer effective
  - Chaining whole compression algorithms may be futile

Slide 14

Analysis of Mutators

- NUL and INV never used
  - No need to invert bits
  - Fewer stages perform worse
- Cut often at the end (i.e., not used)
  - Word granularity suffices
  - Easier/faster to implement
- DIM8 right after the cut (DIM4 with single precision)
  - Used to separate the byte positions of each word
- Synthesis yielded an unforeseen use of the DIM component

Slide 15

Analysis of Shufflers

- Shufflers are important
  - Almost always included
- BIT used very frequently
  - FP bit positions correlate more strongly than values
- DIM has two uses
  - Separate bytes (see before), right after the cut
  - Separate values of multi-dimensional data sets (intended use), in early stages

Slide 16

Analysis of Predictors

- Predictors are very important (they serve as the data model)
  - Used in every case
  - Often 2 predictors are used
- LNVns dominates LNVnx
  - The arithmetic (sub) difference is superior to the bit-wise (xor) difference in the residual
- Dimension n separates values of multi-dimensional data sets (in the 1st stage)

Slide 17

Analysis of Overall Best Algorithm

- Same algorithm for SP and DP
  - Few components mismatch
  - But the LNV6s dimension is off
- Most frequent pattern: LNV*s BIT LNV1s ZE
  - The star denotes the dimensionality
- Why 6 in the starred position?
  - Not used in the individual algorithms
  - 6 is the least common multiple of 1, 2, and 3
  - Did not test n > 8

Slide 18

MPC: Generalization of Overall Best

- MPC algorithm: Massively Parallel Compression
  - Uses the generalized pattern "LNVds BIT LNV1s ZE", where d is the data set dimensionality
- Matches the best algorithm on several DP and SP data sets
  - Performs even better when the true dimensionality is used

Slide 19

Evaluation Methodology

- System
  - Dual 10-core Xeon E5-2680 v2 CPU
  - K40 GPU with 15 SMs (2880 cores)
- 13 DP and 13 SP real-world data sets
  - Same as before
- Compression algorithms
  - CPU: bzip2, gzip, lzop, and pFPC
  - GPU: GFC and MPC (our algorithm)

Slide 20

Compression Ratio (Double Precision)

- MPC delivers record compression on 5 data sets
  - In spite of the "GPU-friendly components" constraint
- MPC is outperformed by bzip2 and pFPC on average
  - Due to msg_sppm and num_plasma
- MPC is superior to GFC (the only other GPU compressor)

Slide 21

Compression Ratio (Single Precision)

- MPC delivers record compression on 8 data sets
  - In spite of the "GPU-friendly components" constraint
- MPC is outperformed by bzip2 on average
  - Due to num_plasma
- MPC is "superior" to GFC and pFPC
  - They do not support single-precision data; MPC does

Slide 22

Throughput (Gigabytes per Second)

- MPC outperforms all CPU compressors
  - Including pFPC running on two 10-core CPUs (by 7.5x)
- MPC is slower than GFC but mostly faster than PCIe
  - MPC uses a slow O(n log n) prefix scan implementation (a generic scan sketch follows below)
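For context, a generic O(n log n)-work scan of the kind the slide alludes to is the Hillis-Steele formulation sketched below for a single thread block (this is not MPC's code); a work-efficient O(n) scan is the obvious replacement.

    #include <cstdint>

    // Hillis-Steele inclusive prefix sum for one block: simple, but each of the
    // log2(n) rounds touches all n elements, i.e., O(n log n) additions overall.
    // Launch with: inclusiveScanBlock<<<1, n, n * sizeof(uint32_t)>>>(in, out, n);
    __global__ void inclusiveScanBlock(const uint32_t* in, uint32_t* out, int n)
    {
        extern __shared__ uint32_t temp[];
        int tid = threadIdx.x;
        if (tid < n) temp[tid] = in[tid];
        __syncthreads();

        for (int offset = 1; offset < n; offset *= 2) {
            uint32_t add = 0;
            if (tid >= offset && tid < n) add = temp[tid - offset];
            __syncthreads();
            if (tid >= offset && tid < n) temp[tid] += add;
            __syncthreads();
        }
        if (tid < n) out[tid] = temp[tid];
    }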

Slide 23

Summary

- Goal of research
  - Create an effective algorithm for FP data compression that is suitable for massively parallel GPUs
- Approach
  - Extracted 24 GPU-friendly components and evaluated 138,240 combinations to find the best 4-stage algorithms
  - Generalized the findings to derive the MPC algorithm
- Result
  - Brand-new compression algorithm for SP and DP data
  - Compresses about as well as CPU algorithms, but much faster

Slide 24

Future Work and Acknowledgments

- Future work
  - Faster implementation, more components, longer chains, and other inputs, data types, and constraints
- Acknowledgments
  - National Science Foundation
  - NVIDIA Corporation
  - Texas Advanced Computing Center
- Contact information
  - burtscher@txstate.edu

Slide 25

Number of Stages

- 3 stages reach about 95% of the compression ratio

Slide 26

Single- vs Double-Precision Algorithms

Slide 27

MPC Operation

What does "LNVds BIT LNV1s ZE" do?
- LNVds predicts each value using a similar value to obtain a residual sequence with many small values
  - Similar value = the most recent prior value from the same dimension
- BIT groups the residuals by bit position
  - All LSBs, then all second LSBs, etc.
- LNV1s turns identical consecutive words into zeros
- ZE eliminates these zero words
- GPU friendly
  - All four components are massively parallel
  - Can be implemented with prefix scans or simpler operations (a serial reference sketch follows below)
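To make the four steps concrete, here is a serial host-side reference of the chain on 32-bit words. This is an illustrative sketch only: MPC runs these stages as parallel kernels on per-thread-block chunks, and details such as chunk size and bit order may differ.

    #include <cstdint>
    #include <vector>

    // Serial reference sketch of "LNVd s  BIT  LNV1 s  ZE" on 32-bit words.
    // Assumes the number of words is a multiple of 32.
    std::vector<uint32_t> mpcChainReference(const std::vector<uint32_t>& w, int dim)
    {
        size_t n = w.size();
        std::vector<uint32_t> a(n), b(n, 0);

        // LNVd s: residual = word minus the value 'dim' positions earlier.
        for (size_t i = 0; i < n; i++)
            a[i] = w[i] - (i >= (size_t)dim ? w[i - dim] : 0);

        // BIT: 32x32 bit transpose within each group of 32 words, so output
        // word k collects bit k of every input word in the group.
        for (size_t g = 0; g < n; g += 32)
            for (int k = 0; k < 32; k++)
                for (int j = 0; j < 32; j++)
                    b[g + k] |= ((a[g + j] >> k) & 1u) << j;

        // LNV1 s: identical consecutive words become zero (done back to front
        // so each subtraction still sees the original predecessor).
        for (size_t i = n; i-- > 1; )
            b[i] -= b[i - 1];

        // ZE: emit the zero/non-zero bitmap, then the non-zero words.
        std::vector<uint32_t> out;
        for (size_t g = 0; g < n; g += 32) {
            uint32_t bits = 0;
            for (int j = 0; j < 32; j++)
                if (b[g + j] != 0) bits |= 1u << j;
            out.push_back(bits);
        }
        for (size_t i = 0; i < n; i++)
            if (b[i] != 0) out.push_back(b[i]);
        return out;
    }

The bitmap lets the decompressor reinsert the zero words, after which each stage is undone in reverse order by its inverse component, as described on the earlier decompressor slide.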
