Good data structure experiments are - PPT Presentation
Uploaded On 2018-03-15
Presentation Transcript

Good data structure experiments are r.a.r.e.

Trevor Brown
Technion
slides at http://tbrown.pro

Why do we perform experiments? To answer questions about data structures.
Is one data structure faster than another? Why?
We are asking about algorithmic differences, NOT engineering differences.

The problem

Typical data structure experiment:
2x Intel E7-4830, 48 threads, 128GB RAM
Ubuntu 16.04 LTS, G++ 6.3.0 with flags -mcx16 -O3
Binary search tree benchmark
Five 3-second trials, tree prefilled to half-full
24 threads do 50% insert, 50% delete, 100k keys

Which is the "true" performance?
[Chart: operations per microsecond across the five trials]

Which is the "right" comparison?
[NM14] lock-free external BST vs. [BCCO10] optimistic AVL tree
[Charts: in one configuration, BCCO10 is 168% faster than NM14; in another, NM14 is 84% faster than BCCO10]

Good data structure experiments are:
Reproducible
Apples-to-apples (fair)
Realistic
Explainable

Reproducibility: crucial configuration parameters
Operating system: memory allocator, huge pages, thread pinning
Processor: prefetching mode, hyper-threading, turbo boost
Data structure: memory reclamation, object pooling

Apples-to-apples comparisons
All data structures should use the same:
Configuration parameters
ADT semantics: set vs. dictionary, insert-replace vs. insert-if-absent
Engineering practices: inlining, int vs. long

Realistic experiments
Appropriate benchmarks
Realistic system configuration: fast scalable allocator
Realistic data structure implementation: memory reclamation (and free() calls)
Eliminate implementation errors

Comparison in [NM14]
[Charts: NM14 33% faster; BCCO10 106% faster; NM14 1-6% faster, depending on configuration]
Thread pinning reveals sensitivity to NUMA

Explaining results
Investigate with systems tools
Use performance counters (PAPI, Linux perf tools)
L1/L2/L3 cache misses, stalls, cycles, instructions
Construct experiments to confirm explanations

How to perform R.A.R.E. experiments
Forgotten or unrealistic parameters
Unfair or unrealistic comparisons
Bugs, unfair or unrealistic engineering

Common implementation errors
Test harness overhead
Misuse of C/C++ volatile
Memory leaks
False sharing
Bad padding/alignment
Data structure memory layout anomalies

[#1] Test harness overhead: impact on different data structures
[Charts: operations per microsecond vs. concurrent threads on an 8-thread Intel i7-4770; original test harness vs. after reducing overhead; speedups of 2.2x and 3.1x]

Overhead of timing measurements
Data structure operations are no-ops
[Chart: # operations per get_time() call, 64-thread AMD system]

[#2] C++ volatile keyword
Informs the compiler that an address may be changed by another thread
Prevents some optimizations that are illegal in a concurrent setting

Value-based validation:

    v1 = *addr;
    [...]
    v2 = *addr;
    if (v1 != v2) return FAIL;

Without volatile, the compiler may fold the second read into the first:

    v1 = *addr;
    [...]
    v2 = v1;
    if (v1 != v2) return FAIL;  // impossible to fail: validation eliminated!

Misuse of C/C++ volatile
What is "left"?

    node_t * left;
    volatile node_t * left;
    node_t volatile * left;
    node_t * volatile left;
    volatile node_t * volatile left;

Examples in the wild: missing volatiles
The original implementation of the [NM14] BST uses the following node type:

    struct node_t {
        int key;
        AO_double_t children;
    };

AO_double_t is defined by the Atomic Ops (AO) library
It is NOT volatile by default
Need "volatile AO_double_t children"

Examples in the wild: misplaced volatiles
The ASCYLIB implementation of the [NM14] BST uses the following node type:

    struct node_t {
        skey_t key;
        sval_t value;
        volatile node_t * left;
        volatile node_t * right;
        char padding[32];
    };

Here "left" is a pointer to a volatile node.
We want a volatile pointer to a node:

    node_t * volatile left;

[#3] Checking for memory leaks: using jemalloc

Profiling leaks in ./myprogram:

    env MALLOC_CONF=prof_leak:true,lg_prof_sample:0,prof_final:true LD_PRELOAD=libjemalloc.so ./myprogram
    <jemalloc>: Leak approximation summary: 8458392 bytes [...]
    <jemalloc>: Run jeprof on "jeprof.1592.0.f.heap" [...]

PDF graph output:

    jeprof --show_bytes --pdf ./myprogram jeprof.1592.0.f.heap > output.pdf

PDF output: tracking down ~8MB of leaked memory
The leak was caused by a serious algorithmic bug!

Checking for memory leaks: using valgrind

    $ valgrind --fair-sched=yes --leak-check=full ./myprogram
    ==28550== 233,072 (3,696 direct, 229,376 indirect) bytes in 154 blocks are definitely lost in loss record 13 of 16
    ==28550==    by 0x429518: Prepare<...> (snapcollector.h:307)
    ==28550==    by 0x429518: traversal_end (rq_snap.h:313)
    ...

[#4] False sharing
[Diagram: a 64-byte (8-word) cache line holding words w1..w8, viewed from Thread 1's cache and Thread 2's cache.
Thread 1 reads w2: the line enters Thread 1's cache in Shared (S) state.
Thread 2 reads w7: the line enters Thread 2's cache in Shared (S) state.
Thread 2 writes w7: Thread 1's copy of the line is invalidated (X), even though the two threads touch different words.]

False sharing in the test harness
Typically revealed by sanity checks
For example: a read-only workload with empty data structures

Searches in empty data structures
[Chart: operations per microsecond for the lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST, and lock-free (a,b)-tree; 48 threads on a 2x24-thread Intel E7-4830; all use the same search code]

Locating the false sharing
Using Linux performance tools: perf
Record the performance counter MEM_LOAD_UOPS_RETIRED_HIT_LFB (≅ memory contention)

    perf record -e cpu/event=0xd1,umask=0x40/pp ./myprogram

Use high-precision event data (more accurate line numbers)

Exploring the perf data
[perf report screenshots, shown across three slides]

What are these variables?

    while (!done) {
        ++cnt;
        if (cnt is a multiple of 50) {
            if (get_time() - startTime >= run_time) {
                done = true;
                memory_fence();
                break;
            }
        }
        ...
        [perform a random operation]
    }

The offending data layout

    volatile long rngs[NUM_THREADS * PADDING];
    volatile long startTime;
    volatile bool done;

Thread t's random # generator = rngs[t * PADDING]
[Diagram: each rngs slot is 8 bytes of data followed by 2 cache lines - 8 bytes of empty padding]

Expected data layout: ... rngs[...] startTime done, with startTime and done on their own cache line (no false sharing)
Actual data layout: ... rngs[...] startTime done, with startTime and done sharing a cache line with an rngs slot

Brittle solution

    volatile long rngs[NUM_THREADS * PADDING];
    volatile char padding[128];
    volatile long startTime;
    volatile bool done;

[Layout diagram: rngs[...], padding[...], startTime, done]
Data could still be reordered!

Better solution

    struct {
        volatile char pad0[128];
        volatile long rngs[NUM_THREADS * PADDING];
        volatile long startTime;
        volatile bool done;
        volatile char pad1[128];
    } g;

[Layout diagram: g.pad0[...], g.rngs[...], g.startTime, g.done, g.pad1[...]]

Searches in empty data structures (revisited)
[Charts across two slides: operations per microsecond for the lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST, and lock-free (a,b)-tree]

Why is the skiplist slow?
Use PAPI measurements to investigate

                   Lock-free BST    Lock-free skiplist
    L1 miss / op   0.11             0.14
    L2 miss / op   0.11             0.14
    L3 miss / op   0.04             0.05
    Cycles / op    347              656
    Instr. / op    307              700

Digging deeper with perf

    perf record -e cpu-cycles:pp ./myprogram ; perf report

Confirm with an experiment
Flatten the skiplist (MAX_LEVEL=1)
This works because of the empty data structure workload

Searches in empty data structures
[Chart: operations per microsecond for the six data structures, with the flattened skiplist]

[#5] Problematic padding
[Charts: operations per microsecond; original: 2.05 L3 misses/op; with padding removed: 0.01 L3 misses/op]

[#6] Data structure memory layout anomalies
48 threads; prefill with 1M insertions, then do 100% searches
More L2 misses and L3 misses, but the external BST contains more nodes!?
[Charts: operations per microsecond; memory layout NNNNNNNN vs. NDNDNDND; a "Fixed" variant is shown]

My top 10 sanity checks
Empty data structures
Read-only workloads
Inspect object sizes and first k object allocations per thread
Trial length: 3s vs. 60s
10^2 vs. 10^5 vs. 10^7 keys
Key checksums
Valgrind
Extremely high contention
Memory reclamation: efficient vs. eager
Variance measurements

Conclusion
Join me in performing R.A.R.E. experiments:
find problems with sanity checks
find solutions with systems tools
explain everything (with evidence!)
Question: are Java experiments useful?
Ongoing work: a new test harness with tools to make R.A.R.E. experiments easier