Good data structure experiments are r.a.r.e.


2018-03-15 23K 23 0 0


Slide1

Good data structure experiments are r.a.r.e.

Trevor Brown

Technion

slides at http://tbrown.pro

Slide2

Why do we perform experiments?

To answer questions about data structures

Is one data structure faster than another? Why?

We are asking about algorithmic differences, NOT engineering differences

Slide3

The problem

Typical data structure experiment

2x Intel E7-4830, 48 threads, 128GB RAM

Ubuntu 16.04LTS, G++ 6.3.0 with flags -mcx16 -O3

Binary search tree benchmark

Five 3-second trials, tree prefilled to half-full

24 threads do 50% insert, 50% delete, 100k keys

Slide4

Which is the “true” performance?

operations per microsecond
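One way to avoid picking a single "true" number is to report a summary over all trials of a configuration. The sketch below is illustrative only: `TrialSummary` and `summarize` are hypothetical names, not part of the harness used in the talk, and the input is whatever per-trial throughputs the benchmark produced.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of per-trial throughputs (e.g. operations per microsecond).
struct TrialSummary {
    double min;
    double max;
    double mean;
    double stddev;  // sample standard deviation (n - 1 in the denominator)
};

// Summarizes the throughput measured in each trial of one configuration.
inline TrialSummary summarize(const std::vector<double>& trials) {
    assert(!trials.empty());
    TrialSummary s{};
    s.min = *std::min_element(trials.begin(), trials.end());
    s.max = *std::max_element(trials.begin(), trials.end());

    double sum = 0.0;
    for (double t : trials) sum += t;
    s.mean = sum / static_cast<double>(trials.size());

    double squared_dev = 0.0;
    for (double t : trials) squared_dev += (t - s.mean) * (t - s.mean);
    const std::size_t n = trials.size();
    s.stddev = (n > 1) ? std::sqrt(squared_dev / static_cast<double>(n - 1)) : 0.0;
    return s;
}
```

Reporting the minimum and maximum next to the mean shows how far individual trials are from each other, which a single averaged bar hides.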

Slide5

Which is the “right” comparison?

[NM14] Lock-free External BST

[BCCO10] Optimistic AVL tree

BCCO10 168% faster than NM14

NM14 84% faster than BCCO10
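Both claims on this slide come from the same kind of calculation applied to different measurements. A minimal sketch of that calculation, assuming throughput is measured in operations per microsecond; `percent_faster` and `ratio_for_percent` are hypothetical helpers, and the numbers in the comments are made up for illustration:

```cpp
// Returns how much faster throughput a is than throughput b, in percent.
// Made-up example: a = 2.0 ops/us and b = 1.0 ops/us gives 100 (a is 100% faster).
constexpr double percent_faster(double a, double b) {
    return (a / b - 1.0) * 100.0;
}

// Converts "p% faster" back into a throughput ratio.
// "168% faster" corresponds to a ratio of about 2.68; "84% faster" to about 1.84.
constexpr double ratio_for_percent(double percent) {
    return 1.0 + percent / 100.0;
}
```

Read this way, the two experiments disagree not only on the size of the gap but on its direction, so at least one of them is measuring something other than the algorithms.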

Slide6

Good data structure Experiments are

Reproducible

Apples-to-apples (fair)

Realistic

Explainable

Slide7

Reproducibility: Crucial configuration parameters

Operating system: memory allocator, huge pages, thread pinning

Processor: prefetching mode, hyper threading, turbo boost

Data structure: memory reclamation, object pooling

Slide8

Apples-to-apples Comparisons

All data structures should use the same:

Configuration parameters

ADT differences

set vs dictionary

insert-replace vs insert-if-absent

Engineering practices

inlining, int vs long
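The insert-replace vs insert-if-absent distinction is easy to overlook because both are usually called "insert". The sketch below illustrates the two semantics, with `std::map` standing in for the data structure under test; the wrapper names are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <utility>

// insert-if-absent: stores value only when key is not present.
// Returns true if the key was inserted.
inline bool insert_if_absent(std::map<int, long>& dict, int key, long value) {
    return dict.insert(std::make_pair(key, value)).second;
}

// insert-replace: always stores value, overwriting any existing mapping.
// Returns true if an existing value was replaced.
inline bool insert_replace(std::map<int, long>& dict, int key, long value) {
    auto result = dict.insert(std::make_pair(key, value));
    if (!result.second) result.first->second = value;  // key existed: overwrite
    return !result.second;
}

// The same call sequence leaves different contents depending on the semantics.
inline void demo_insert_semantics() {
    std::map<int, long> a, b;
    insert_if_absent(a, 7, 1);
    insert_if_absent(a, 7, 2);  // no effect: key 7 is already present
    insert_replace(b, 7, 1);
    insert_replace(b, 7, 2);    // overwrites the value stored for key 7
    assert(a.at(7) == 1);
    assert(b.at(7) == 2);
}
```

An insert-replace operation writes on every call, while insert-if-absent writes only when the key is missing, so a benchmark that mixes the two is comparing different workloads.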

Slide9

Realistic experiments

Appropriate benchmarks

Realistic system configuration

Fast scalable allocator

Realistic data structure implementation

Memory reclamation (and free() calls)

Eliminate implementation errors

Slide10

NM14 33% faster

BCCO10 106% faster

NM14 1-6% faster

Comparison in [NM14]

Thread pinning reveals sensitivity to NUMA

Slide11

Explaining results

Investigate with systems tools

Use performance counters (PAPI, Linux perftools)

L1/L2/L3 cache misses, stalls, cycles, instructions

Construct experiments to confirm explanations

Slide12

How to perform R.A.R.E. experiments

Forgotten or unrealistic parameters

Unfair or unrealistic comparisons

Bugs, unfair or unrealistic engineering

Slide13

Common Implementation errors

Test harness overhead

Misuse of C/C++ volatile

Memory leaks

False sharing

Bad padding/alignment

Data structure memory layout anomalies

Slide14

[#1] Test harness Overhead: impact on different Data structures

[Chart: original test harness vs. after reducing overhead, with annotations 2.2x and 3.1x. X-axis: concurrent threads. Y-axis: operations per microsecond. System: 8-thread Intel i7-4770]

Slide15

Overhead of timing measurements

Data structure operations are no-ops

64-thread AMD system

# operations per get_time() call
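The experiment on this slide can be approximated with a loop whose "operations" do nothing, so that almost all measured time is harness overhead. This is a sketch under assumptions: `run_noop_trial` is a hypothetical name, and `std::chrono::steady_clock` stands in for the harness's `get_time()`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Runs no-op "operations" for roughly run_ms milliseconds and returns how many
// completed. The clock is read once per ops_per_check operations.
inline std::uint64_t run_noop_trial(std::uint64_t ops_per_check, int run_ms) {
    assert(ops_per_check > 0);
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + std::chrono::milliseconds(run_ms);

    std::uint64_t ops = 0;
    volatile std::uint64_t sink = 0;  // keeps the no-op from being optimized away
    for (;;) {
        for (std::uint64_t i = 0; i < ops_per_check; ++i) {
            sink = sink + 1;  // stand-in for a data structure operation
            ++ops;
        }
        if (clock::now() >= deadline) break;  // one clock read per batch
    }
    return ops;
}
```

Calling it with `ops_per_check` set to 1, 10, 100 and so on, then comparing the returned counts, gives a rough picture of how much throughput the clock reads themselves consume on a given machine.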

Slide16

[#2] C++ volatile keyword

Informs compiler an address may be changed by another thread

Prevents some optimizations that are illegal in a concurrent setting

Value-based validation

v1 = *addr;
[…]
v2 = *addr;
if (v1 != v2) return FAIL;

Optimize

Eliminated validation!

v1 = *addr;
[…]
v2 = v1;
if (v1 != v2) return FAIL;

Impossible!
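A compilable version of the validation pattern might look as follows. It is a sketch: `Status` and `validate` are hypothetical names, and the elided work between the two reads is left as a comment. In C++11 and later, `std::atomic` is the standard-sanctioned way to express accesses shared between threads; `volatile` is used here only because it is the mechanism the slides discuss.

```cpp
#include <cassert>

enum class Status { Ok, Fail };

// Value-based validation: read *addr, do some work, read it again, compare.
// Because addr points to a volatile int, each read is an observable access,
// so the compiler may not replace the second load with the value of v1.
inline Status validate(const volatile int* addr) {
    assert(addr != nullptr);
    int v1 = *addr;
    // [...] work that relies on the value read into v1
    int v2 = *addr;
    if (v1 != v2) return Status::Fail;  // another thread changed *addr
    return Status::Ok;
}
```

With a plain `const int*` parameter the compiler is free to assume no one else writes `*addr` and to drop the comparison, which is the "eliminated validation" case on the slide.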

Slide17

Misuse of C/C++ Volatile

What is “left?”

node_t * left;
volatile node_t * left;
node_t volatile * left;
node_t * volatile left;
volatile node_t * volatile left;
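The five declarations can be told apart mechanically with `<type_traits>`. The sketch below uses a placeholder `node_t` and hypothetical alias names; the `static_assert`s state which part of each type is volatile.

```cpp
#include <type_traits>

struct node_t { int key; };  // placeholder node type for illustration

using plain_ptr                = node_t*;                    // nothing is volatile
using ptr_to_volatile          = volatile node_t*;           // the node is volatile
using ptr_to_volatile_alt      = node_t volatile*;           // same type, different spelling
using volatile_ptr             = node_t* volatile;           // the pointer is volatile
using volatile_ptr_to_volatile = volatile node_t* volatile;  // both are volatile

// volatile written before the '*' applies to the pointed-to node...
static_assert(!std::is_volatile<ptr_to_volatile>::value, "pointer itself is not volatile");
static_assert(std::is_volatile<std::remove_pointer<ptr_to_volatile>::type>::value,
              "pointed-to node is volatile");
static_assert(std::is_same<ptr_to_volatile, ptr_to_volatile_alt>::value,
              "both spellings name the same type");

// ...while volatile written after the '*' applies to the pointer.
static_assert(std::is_volatile<volatile_ptr>::value, "pointer itself is volatile");
static_assert(!std::is_volatile<std::remove_pointer<volatile_ptr>::type>::value,
              "pointed-to node is not volatile");

static_assert(!std::is_volatile<plain_ptr>::value, "plain pointer is not volatile");
static_assert(std::is_volatile<volatile_ptr_to_volatile>::value, "pointer is volatile");
```

For a child pointer that other threads overwrite, the property that matters is that the pointer field itself is volatile, which only the last two forms provide.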

Slide18

Examples in the wild: Missing volatiles

The original implementation of the [NM14] BST uses the following node type:

struct node_t {
    int key;
    AO_double_t children;
};

AO_double_t is defined by the Atomic Ops (AO) library

NOT volatile by default

Need “volatile AO_double_t children”

Slide19

Examples in the wild: Misplaced volatiles

The ASCYLIB implementation of the [NM14] BST uses the following node type:

struct node_t {
    skey_t key;
    sval_t value;
    volatile node_t * left;
    volatile node_t * right;
    char padding[32];
};

“left” is a pointer to a volatile node

We want a volatile pointer to a node: node_t * volatile left;

Slide20

[#3] Checking for memory leaks: Using jemalloc

Profiling leaks in ./myprogram

env MALLOC_CONF=prof_leak:true,lg_prof_sample:0,prof_final:true LD_PRELOAD=libjemalloc.so ./myprogram

<jemalloc>: Leak approximation summary: 8458392 bytes [...]
<jemalloc>: Run jeprof on "jeprof.1592.0.f.heap" [...]

PDF graph output

jeprof --show_bytes --pdf ./myprogram jeprof.1592.0.f.heap > output.pdf

Slide21

PDF output: tracking down ~8MB of leaked memory

Leak was caused by a serious algorithmic bug!

Slide22

Checking for memory leaks: Using valgrind

$ valgrind --fair-sched=yes --leak-check=full ./myprogram

==28550== 233,072 (3,696 direct, 229,376 indirect) bytes in 154 blocks are definitely lost in loss record 13 of 16
==28550== by 0x429518: Prepare<...> (snapcollector.h:307)
==28550== by 0x429518: traversal_end (rq_snap.h:313)
...

Slide23

[#4] False sharing

64 byte (8 word) cache line: w1 w2 w3 w4 w5 w6 w7 w8

Thread 1 reads w2: Thread 1’s cache holds the line w1 ... w8 in state S

Thread 2 reads w7: Thread 2’s cache holds the line w1 ... w8 in state S

Thread 2 writes w7: the line w1 ... w8 is in state X
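A common way to avoid this situation for per-thread data is to give each thread's word its own cache line. The sketch below assumes 64-byte lines, as on the slide; `PackedCounter`, `PaddedCounter` and `lines_touched` are hypothetical names, and only the layout is shown, not the threads.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// Packed: eight of these 8-byte counters fit in a single cache line, so
// a write by one thread invalidates the line for the other seven.
struct PackedCounter {
    std::uint64_t value;
};

// Padded: alignas rounds the size and alignment up to a full line, so
// every counter in an array lives in its own cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value;
};

static_assert(sizeof(PackedCounter) == 8, "packed counter is one word");
static_assert(sizeof(PaddedCounter) == kCacheLine, "padded counter fills a line");
static_assert(alignof(PaddedCounter) == kCacheLine, "padded counter is line-aligned");

// Number of cache lines covered by n consecutive elements of elem_size bytes,
// assuming the array starts on a line boundary.
constexpr std::size_t lines_touched(std::size_t elem_size, std::size_t n) {
    return (elem_size * n + kCacheLine - 1) / kCacheLine;
}
```

With eight threads, an array of `PackedCounter` occupies one line shared by all of them, while an array of `PaddedCounter` occupies eight lines with one writer each.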

Slide24

False sharing in the test harness

Typically revealed by sanity checks

For example:

read-only workload with empty data structures

Slide25

Searches in empty data structures

[Chart. Y-axis: operations per microsecond. Series: Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree]

48 threads on: 2x24 thread Intel E7-4830

Same search code

Slide26

Locating the false sharing

Using Linux Performance Tools: perf

Record performance counter MEM_LOAD_UOPS_RETIRED_HIT_LFB ≅ memory contention

Command

perf record -e cpu/event=0xd1,umask=0x40/pp ./myprogram

The pp suffix: use high precision event data (more accurate line numbers)

Slide27

Exploring the perf data

Slide28

Exploring the perf data

Slide29

Exploring the perf data

Slide30

What are these variables?

while (!done) {
    ++cnt;
    if (cnt % 50 == 0) {  // check the clock only every 50 operations
        if (get_time() - startTime >= run_time) {
            done = true;
            memory_fence();
            break;
        }
    }
    ...
    [perform a random operation]
}

Slide31

The offending data layout

volatile long rngs[NUM_THREADS * PADDING];
volatile long startTime;
volatile bool done;

Thread t’s random # generator = rngs[t * PADDING]

Each rngs slot: 8 bytes of data, then (2 cache lines - 8 bytes) of empty padding

Slide32

Expected data layout

[Diagram: ... rngs[...] startTime done]

Actual data layout

[Diagram: ... rngs[...] startTime done]

No false sharing

Slide33

Brittle Solution

volatile long rngs[NUM_THREADS * PADDING];
volatile char padding[128];
volatile long startTime;
volatile bool done;

[Diagram of the intended layout: ... rngs[...] padding[…] startTime done]

Data could still be reordered!

Slide34

Better Solution

struct {
    volatile char pad0[128];
    volatile long rngs[NUM_THREADS * PADDING];
    volatile long startTime;
    volatile bool done;
    volatile char pad1[128];
} g;

[Diagram of the layout: ... g.pad0[…] g.rngs[...] g.startTime g.done g.pad1[…]]
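Grouping the variables in a struct helps because members of a struct with the same access control are laid out in declaration order, whereas separate globals can be placed anywhere by the compiler or linker. The layout can be checked at compile time, as in the sketch below. The values of `NUM_THREADS` and `PADDING` are assumptions chosen to match the "2 cache lines" figure, and `std::int64_t` replaces `long` so that the 8-byte size holds on every platform.

```cpp
#include <cstddef>
#include <cstdint>

constexpr int NUM_THREADS = 48;  // assumption: thread count used for illustration
constexpr int PADDING = 16;      // assumption: 16 words * 8 bytes = 128 bytes = 2 cache lines

struct Globals {
    volatile char pad0[128];  // keeps whatever precedes the struct off our lines
    volatile std::int64_t rngs[NUM_THREADS * PADDING];
    volatile std::int64_t startTime;
    volatile bool done;
    volatile char pad1[128];  // keeps whatever follows the struct off our lines
};

// Members stay in declaration order, so these offsets are guaranteed.
static_assert(offsetof(Globals, rngs) == 128, "pad0 precedes the shared data");
static_assert(offsetof(Globals, startTime) ==
                  128 + sizeof(std::int64_t) * NUM_THREADS * PADDING,
              "startTime directly follows the rngs array");
static_assert(offsetof(Globals, done) > offsetof(Globals, startTime),
              "done follows startTime");
static_assert(sizeof(Globals) >= offsetof(Globals, pad1) + 128,
              "pad1 follows done and spans at least 128 bytes");
```

If a later edit reorders the members or changes the padding, the build fails instead of silently reintroducing false sharing.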

Slide35

Searches in empty data structures

[Chart. Y-axis: operations per microsecond. Series: Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree]

Slide36

Searches in empty data structures

[Chart. Y-axis: operations per microsecond. Series: Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree]

Slide37

Why is the skiplist slow?

Use PAPI measurements to investigate

                 Lock-free BST    Lock-free skiplist
L1 miss / op     0.11             0.14
L2 miss / op     0.11             0.14
L3 miss / op     0.04             0.05
Cycles / op      347              656
Instr. / op      307              700
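Dividing the two columns makes the pattern easier to see. The sketch below only does arithmetic on the PAPI numbers listed above; the struct and constant names are illustrative.

```cpp
// Per-operation counter values copied from the PAPI measurements above.
struct Counters {
    double l1_miss;
    double l2_miss;
    double l3_miss;
    double cycles;
    double instructions;
};

constexpr Counters kBst{0.11, 0.11, 0.04, 347.0, 307.0};
constexpr Counters kSkiplist{0.14, 0.14, 0.05, 656.0, 700.0};

// Skiplist relative to BST.
constexpr double kInstrRatio = kSkiplist.instructions / kBst.instructions;  // about 2.28
constexpr double kCycleRatio = kSkiplist.cycles / kBst.cycles;              // about 1.89

// Absolute difference in L3 misses per operation.
constexpr double kL3MissDelta = kSkiplist.l3_miss - kBst.l3_miss;           // about 0.01
```

The skiplist executes about 2.28 times as many instructions and takes about 1.89 times as many cycles per operation, while its miss counts differ by at most 0.03 per operation. One reading of these numbers, not a claim from the slides, is that the slowdown comes from extra instructions rather than from cache behaviour.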

Slide38

Digging deeper with perf

perf record -e cpu-cycles:pp ./myprogram; perf report

Slide39

Confirm with an experiment

Flatten skiplist (MAX_LEVEL=1)

Works because of empty data structure workload

Slide40

Searches in empty data structures

[Chart. Y-axis: operations per microsecond. Series: Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree]

Slide41

[#5] Problematic padding

Original: 2.05 L3 miss/op

Padding removed: 0.01 L3 miss/op

[Chart. Y-axis: operations per microsecond]

Slide42

[#6] Data structure memory layout anomalies

48 threads

Prefill with 1M insertions

Then do 100% searches

More L2 misses and L3 misses

But the external BST contains more nodes!?

Fixed

Memory layout:

NNNNNNNN

Memory layout:

NDNDNDND

operations per microsecond

Slide43

My top 10 Sanity checks

Empty data structures

Read-only workloads

Inspect object sizes and first k object allocations per thread

Trial length: 3s vs 60s

10^2 vs 10^5 vs 10^7 keys

Key checksums

Valgrind

Extremely high contention

Memory reclamation: efficient vs eager

Variance measurements
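The slides do not spell out how the key checksum is computed. One common reading is that the harness adds every successfully inserted key to a running sum, subtracts every successfully deleted key, and compares the result with the sum of the keys left in the structure at the end. The sketch below illustrates that reading in single-threaded form, with `std::set` standing in for the data structure under test; all names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Running checksum of keys that the harness believes are in the structure.
struct ChecksumTracker {
    std::int64_t sum = 0;
    void on_insert_success(int key) { sum += key; }
    void on_delete_success(int key) { sum -= key; }
};

// Only operations that report success change the checksum.
inline void tracked_insert(std::set<int>& s, ChecksumTracker& t, int key) {
    if (s.insert(key).second) t.on_insert_success(key);
}
inline void tracked_erase(std::set<int>& s, ChecksumTracker& t, int key) {
    if (s.erase(key) == 1) t.on_delete_success(key);
}

// Sum of the keys actually present, obtained by traversing the structure.
inline std::int64_t sum_of_keys(const std::set<int>& s) {
    std::int64_t total = 0;
    for (int k : s) total += k;
    return total;
}

inline void demo_key_checksum() {
    std::set<int> s;
    ChecksumTracker t;
    tracked_insert(s, t, 5);
    tracked_insert(s, t, 9);
    tracked_insert(s, t, 5);   // fails: 5 is already present
    tracked_erase(s, t, 9);
    tracked_erase(s, t, 42);   // fails: 42 was never inserted
    assert(t.sum == sum_of_keys(s));  // a mismatch would indicate a bug
}
```

In a concurrent harness each thread would keep its own tracker and the per-thread sums would be added together after all threads have stopped.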

Slide44

Conclusion

Join me in performing R.A.R.E. experiments

find problems with sanity checks

find solutions with systems tools

explain everything (with evidence!)

Question: are Java experiments useful?

Ongoing work: new test harness with tools to make R.A.R.E. experiments easier


