
Good data structure experiments are - PowerPoint Presentation

Uploaded On 2018-03-13



Presentation Transcript

Slide1

Good data structure experiments are R.A.R.E.

Trevor Brown

Technion

Slides at http://tbrown.pro

Slide2

Why do we perform experiments?

To answer questions about data structures

Is one data structure faster than another? Why?

We are asking about algorithmic differences, NOT engineering differences

Slide3

The problem

Typical data structure experiment

2x Intel E7-4830, 48 threads, 128GB RAM

Ubuntu 16.04 LTS, G++ 6.3.0 with flags -mcx16 -O3

Binary search tree benchmark

Five 3-second trials, tree prefilled to half-full

24 threads do 50% insert, 50% delete, 100k keys

Slide4

Which is the “true” performance?
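A minimal sketch of one way to avoid picking a single "true" number: report the spread of the per-trial throughput (minimum, maximum, mean and sample standard deviation). The names `TrialStats` and `summarize` are hypothetical and are not part of the original benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Summary of the throughput (operations per microsecond) over all trials.
struct TrialStats {
    double min;
    double max;
    double mean;
    double stddev;  // sample standard deviation (n - 1 in the denominator)
};

// Computes min, max, mean and sample standard deviation of the trial results.
// Requires at least one trial; stddev is reported as 0 for a single trial.
TrialStats summarize(const std::vector<double>& trials) {
    assert(!trials.empty());
    TrialStats s{};
    s.min = *std::min_element(trials.begin(), trials.end());
    s.max = *std::max_element(trials.begin(), trials.end());

    double sum = 0.0;
    for (double t : trials) sum += t;
    s.mean = sum / trials.size();

    double squared_dev = 0.0;
    for (double t : trials) squared_dev += (t - s.mean) * (t - s.mean);
    s.stddev = trials.size() > 1
                   ? std::sqrt(squared_dev / (trials.size() - 1))
                   : 0.0;
    return s;
}
```

A standard deviation that is large relative to the mean suggests that no single trial should be reported as the result.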

[Figure: y-axis in operations per microsecond]

Slide5

Which is the “right” comparison?

[NM14] Lock-free External BST

[BCCO10] Optimistic AVL tree

BCCO10 168% faster than NM14

NM14 84% faster than BCCO10

Slide6

Good data structure experiments are

Reproducible

Apples-to-apples (fair)

Realistic

Explainable

Slide7

Reproducibility: Crucial configuration parameters

Operating system: memory allocator, huge pages, thread pinning

Processor: prefetching mode, hyper-threading, turbo boost

Data structure: memory reclamation, object pooling

Slide8

Apples-to-apples comparisons

All data structures should use the same:

Configuration parameters

Abstract data type: set vs dictionary, insert-replace vs insert-if-absent

Engineering practices: inlining, int vs long

Slide9

Realistic experiments

Realistic system configuration: fast scalable allocator

Realistic data structure implementation: memory reclamation (and free() calls)

Eliminate implementation errors

Slide10

NM14 1-6% faster

Thread pinning reveals sensitivity to NUMA

Slide11

How to perform R.A.R.E. experiments

Forgotten or unrealistic parameters

Unfair or unrealistic comparisons

Bugs, unfair or unrealistic engineering

Slide12

Common implementation errors

Test harness overhead

Misuse of C/C++ volatile

Memory leaks

False sharing

Bad padding/alignment

Data structure memory layout anomalies

Slide13

[#1] Test harness overhead: impact on different data structures

[Figure: two panels, "Original test harness" and "After reducing overhead"; x-axis: concurrent threads; y-axis: operations per microsecond; annotations: 2.2x and 3.1x; machine: 8-thread Intel i7-4770]

Slide14

[#2] C++ volatile keyword

Informs the compiler that an address may be changed by another thread

Prevents some optimizations that are illegal in a concurrent setting
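A minimal, compilable sketch of the value-based validation pattern this slide discusses. The names `validated_read` and `work` are hypothetical. Because `addr` points to a volatile `long`, the compiler has to perform both loads and cannot substitute the first value for the second. Note that volatile by itself provides neither atomicity nor memory ordering; `std::atomic` is the mechanism the C++ standard defines for that.

```cpp
#include <cassert>

// Value-based validation: read a location, do some work, re-read the
// location, and fail if the value changed in between.
// `addr` is a pointer to volatile long, so both reads must be performed.
template <typename Work>
bool validated_read(const volatile long* addr, Work work, long& out) {
    assert(addr != nullptr);
    long v1 = *addr;             // first read
    work();                      // [...] the rest of the operation
    long v2 = *addr;             // second read: may not be replaced by v1
    if (v1 != v2) return false;  // FAIL: the location changed
    out = v1;
    return true;
}
```

In a single-threaded check, passing a `work` callable that modifies the location makes the validation fail, which is the behavior the optimized version on this slide would lose.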

Value-based validation

v1 = *addr;
[…]
v2 = *addr;
if (v1 != v2) return FAIL;

Without volatile, the compiler may optimize this to:

v1 = *addr;
[…]
v2 = v1;
if (v1 != v2) return FAIL;

Eliminated validation! With volatile, this optimization is impossible.

Slide15

Misuse of C/C++ Volatile

What is “left?”
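The declarations on this slide differ in whether the pointer `left` itself, the node it points to, or both are volatile. A minimal sketch of how to let the compiler answer the question with standard type traits; the helper names `pointer_is_volatile` and `pointee_is_volatile` are hypothetical.

```cpp
#include <type_traits>

struct node_t;  // opaque node type, as on the slide

// True if the pointer variable itself is volatile,
// i.e. every read of `left` must actually be performed.
template <typename P>
constexpr bool pointer_is_volatile() {
    return std::is_volatile<P>::value;
}

// True if the object pointed to is volatile,
// i.e. reads through `*left` must actually be performed.
template <typename P>
constexpr bool pointee_is_volatile() {
    return std::is_volatile<typename std::remove_pointer<P>::type>::value;
}

// node_t * left;                   nothing is volatile
static_assert(!pointer_is_volatile<node_t*>() &&
              !pointee_is_volatile<node_t*>(), "plain pointer");
// volatile node_t * left;          only the node is volatile
static_assert(!pointer_is_volatile<volatile node_t*>() &&
              pointee_is_volatile<volatile node_t*>(), "volatile pointee");
// node_t volatile * left;          same type as the previous declaration
static_assert(std::is_same<node_t volatile*, volatile node_t*>::value,
              "same type");
// node_t * volatile left;          only the pointer is volatile
static_assert(pointer_is_volatile<node_t* volatile>() &&
              !pointee_is_volatile<node_t* volatile>(), "volatile pointer");
// volatile node_t * volatile left; both are volatile
static_assert(pointer_is_volatile<volatile node_t* volatile>() &&
              pointee_is_volatile<volatile node_t* volatile>(), "both");
```

For a child pointer that other threads may change, the qualifier has to apply to the pointer, so `volatile node_t * left;` does not protect reads of `left` itself.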

node_t * left;

volatile node_t * left;

node_t volatile * left;

node_t * volatile left;

volatile node_t * volatile left;

Slide16

[#3] Checking for memory leaks: Using jemalloc

Profiling leaks in ./myprogram

env MALLOC_CONF=prof_leak:true,lg_prof_sample:0,prof_final:true LD_PRELOAD=libjemalloc.so ./myprogram

<jemalloc>: Leak approximation summary: 8458392 bytes [...]

<jemalloc>: Run jeprof on "jeprof.1592.0.f.heap" [...]

PDF graph output

jeprof --show_bytes --pdf ./myprogram jeprof.1592.0.f.heap > output.pdf

Slide17

PDF output: tracking down ~8MB of leaked memory

Leak was caused by a serious algorithmic bug!

Slide18

Checking for memory leaks: Using valgrind

$ valgrind --fair-sched=yes --leak-check=full ./myprogram

==28550== 233,072 (3,696 direct, 229,376 indirect) bytes in 154 blocks are definitely lost in loss record 13 of 16

==28550== by 0x429518: Prepare<...> (snapcollector.h:307)

==28550== by 0x429518: traversal_end (rq_snap.h:313)

...

Slide19

[#4] False sharing

[Figure: a 64-byte (8-word) cache line holding words w1 to w8, shown in Thread 1's cache and Thread 2's cache. Thread 1 reads w2 and Thread 2 reads w7, so both caches hold the line in shared (S) state. Thread 2 then writes w7, which invalidates (X) the copy of the line in Thread 1's cache.]

Slide20

False sharing in the test harness

Typically revealed by sanity checks

For example: read-only workload with empty data structures

Slide21

Searches in empty data structures

[Figure: operations per microsecond for the lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST and lock-free (a,b)-tree; 48 threads on 2x 24-thread Intel E7-4830]

Same search code

Slide22

Locating the false sharing

Using Linux Performance Tools: perf

Record performance counter MEM_LOAD_UOPS_RETIRED_HIT_LFB (≅ memory contention)

Commands

perf record -e cpu/event=0xd1,umask=0x40/pp ./myprogram

perf report

Slide23

Exploring the perf data

Slide24

Exploring the perf data

Slide25

Exploring the perf data

Slide26

What are these variables?

while (!done) {
    ++cnt;
    if (cnt is a multiple of 50) {
        if (get_time() - startTime >= run_time) {
            done = true;
            memory_fence();
            break;
        }
    }
    ...
    [perform a random operation]
}

Slide27

The offending data layout

volatile long rngs[NUM_THREADS * PADDING];
volatile long startTime;
volatile bool done;

Thread t's random # generator = rngs[t * PADDING]

[Figure: each thread's slot in rngs holds 8 bytes of data followed by empty padding of 2 cache lines - 8 bytes]

Slide28

[Figure: expected data layout vs actual data layout of rngs[...], startTime and done in memory; annotation: "No false sharing"]

Slide29

Solution

struct {
    volatile char pad0[128];
    volatile long rngs[NUM_THREADS * PADDING];
    volatile long startTime;
    volatile bool done;
    volatile char pad1[128];
} g;
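A minimal sketch of the padded struct with compile-time checks on its layout. The values of `NUM_THREADS`, `PADDING` and `PAD_BYTES`, the type name `Globals`, and the `alignas` specifier are assumptions added for illustration; the slide only shows the two padding arrays.

```cpp
#include <cstddef>

constexpr std::size_t PAD_BYTES = 128;  // two 64-byte cache lines
constexpr std::size_t NUM_THREADS = 48; // assumed thread count
// Assumed: one rng slot per thread, each slot PAD_BYTES wide.
constexpr std::size_t PADDING = PAD_BYTES / sizeof(long);

// The shared harness variables, fenced off by padding on both sides so that
// no unrelated variable can share a cache line with them.
// alignas is an addition to the slide: it starts the struct on a
// PAD_BYTES boundary.
struct alignas(PAD_BYTES) Globals {
    volatile char pad0[PAD_BYTES];
    volatile long rngs[NUM_THREADS * PADDING];
    volatile long startTime;
    volatile bool done;
    volatile char pad1[PAD_BYTES];
};

// The data starts after a full PAD_BYTES of leading padding ...
static_assert(offsetof(Globals, rngs) == PAD_BYTES,
              "rngs must start after pad0");
// ... and is followed by at least PAD_BYTES of trailing padding.
static_assert(offsetof(Globals, pad1) + PAD_BYTES <= sizeof(Globals),
              "pad1 must fit inside the struct");
// Each thread's slot starts on its own PAD_BYTES boundary within rngs.
static_assert((PADDING * sizeof(long)) % PAD_BYTES == 0,
              "rng slots must be PAD_BYTES apart");

Globals g;
```

Grouping the variables in one struct matters because members are laid out in declaration order, whereas the placement of separate globals is up to the compiler and linker.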

[Figure: resulting memory layout: ... g.pad0[…] g.rngs[...] g.startTime g.done g.pad1[…] ...]

Slide30

Searches in empty data structures

[Figure: operations per microsecond for the lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST and lock-free (a,b)-tree]

Slide31

Searches in empty data structures

[Figure: operations per microsecond for the lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST and lock-free (a,b)-tree]

Slide32

Why is the skiplist slow?

Use PAPI measurements to investigate

                Lock-free BST    Lock-free skiplist
L1 miss / op    0.11             0.14
L2 miss / op    0.11             0.14
L3 miss / op    0.04             0.05
Cycles / op     347              656
Instr. / op     307              700

Slide33

Digging deeper with perf

perf record -e cpu-cycles:pp ./myprogram; perf report

Slide34

Confirm with an experiment

Flatten skiplist (MAX_LEVEL=1)

Works because of empty data structure workload

Slide35

Searches in empty data structures

[Figure: operations per microsecond for the lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST and lock-free (a,b)-tree]

Slide36

[#5] Problematic padding

[Figure: y-axis in operations per microsecond]

Slide37

[#6] Data structure memory layout anomalies

48 threads

Prefill with 1M insertions

Then do 100% searches

More L2 misses and L3 misses

But the external BST contains more nodes!?

Fixed

Memory layout: NNNNNNNN

Memory layout: NDNDNDND

[Figure: y-axis in operations per microsecond]

Slide38

My top 10 sanity checks

Variance measurements

Empty data structures

Read-only workloads

Inspect object addresses

Very large data structures

Valgrind

Key checksums

Extremely high contention

Eager reclamation

Artificial delays
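The slide does not spell out how a key checksum works, so the following is a sketch under an assumed interpretation: each thread adds a key to its own running sum when an insert succeeds and subtracts it when a delete succeeds, and after the trial the total must equal the sum of the keys still in the data structure. The names `PaddedSum` and `KeyChecksum` are hypothetical. Per-thread sums are padded to 128 bytes so that the check does not itself introduce false sharing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One running sum per thread, padded so that two threads' sums never share
// a cache line. Unsigned arithmetic makes wraparound well defined.
struct PaddedSum {
    unsigned long long sum = 0;
    char pad[128 - sizeof(unsigned long long)];
};

class KeyChecksum {
public:
    explicit KeyChecksum(std::size_t num_threads) : sums_(num_threads) {}

    // Call only when the insert actually added the key.
    void inserted(std::size_t tid, long long key) {
        assert(tid < sums_.size());
        sums_[tid].sum += static_cast<unsigned long long>(key);
    }

    // Call only when the delete actually removed the key.
    void deleted(std::size_t tid, long long key) {
        assert(tid < sums_.size());
        sums_[tid].sum -= static_cast<unsigned long long>(key);
    }

    // Sum of all per-thread checksums (call after all threads have stopped).
    unsigned long long expected() const {
        unsigned long long total = 0;
        for (const PaddedSum& s : sums_) total += s.sum;
        return total;
    }

    // Compares the checksum with the keys found by traversing the structure.
    template <typename It>
    bool matches(It first, It last) const {
        unsigned long long actual = 0;
        for (; first != last; ++first)
            actual += static_cast<unsigned long long>(*first);
        return actual == expected();
    }

private:
    std::vector<PaddedSum> sums_;
};
```

A mismatch after a trial indicates that some operation reported success without taking effect, or took effect without reporting success.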

[Legend: Correctness, Performance]

Slide39

Conclusion

Join me in performing R.A.R.E. experiments

expose problems with sanity checks

find solutions with systems tools

explain everything

Question: are Java experiments useful?

Ongoing work: new C/C++ test harness with tools to make R.A.R.E. data structure experiments easier

Tutorial series

Data structure microbenchmarks and application benchmarks

Scripts to run sanity checks

Simple memory reclamation

More control over memory layout

Easy PAPI integration

Results stored in SQL database

Graph generation scripts