Good data structure experiments are


2018-03-13 29K 29 0 0


Slide1

Good data structure experiments are r.a.r.e.

Trevor Brown

Technion

slides at http://tbrown.pro

Slide2

Why do we perform experiments?

To answer questions about data structures

Is one data structure faster than another? Why?

We are asking about algorithmic differences,

NOT engineering differences

Slide3

The problem

Typical data structure experiment

2x Intel E7-4830, 48 threads, 128GB RAM

Ubuntu 16.04 LTS, G++ 6.3.0 with flags -mcx16 -O3

Binary search tree benchmark

Five 3-second trials, tree prefilled to half-full

24 threads do 50% insert, 50% delete, 100k keys

Slide4

Which is the “true” performance?

operations per microsecond

Slide5

Which is the “right” comparison?

[NM14] Lock-free External BST

[BCCO10] Optimistic AVL tree

BCCO10 168% faster than NM14

NM14 84% faster than BCCO10

Slide6

Good data structure Experiments are

Reproducible

Apples-to-apples (fair)

Realistic

Explainable

Slide7

Reproducibility: Crucial configuration parameters

Operating system: memory allocator, huge pages, thread pinning

Processor: prefetching mode, hyper threading, turbo boost

Data structure: memory reclamation, object pooling

Slide8

Apples-to-apples Comparisons

All data structures should use the same:

Configuration parameters

Abstract data type

set vs dictionary

insert-replace vs insert-if-absent

Engineering practices

inlining, int vs long

Slide9

Realistic experiments

Realistic system configuration

Fast scalable allocator

Realistic data structure implementation

Memory reclamation (and free() calls)

Eliminate implementation errors

Slide10

NM14 1-6% faster

Thread pinning reveals sensitivity to NUMA

Slide11

How to perform R.A.R.E. experiments

Forgotten or unrealistic parameters

Unfair or unrealistic comparisons

Bugs, unfair or unrealistic engineering

Slide12

Common Implementation errors

Test harness overhead

Misuse of C/C++ volatile

Memory leaks

False sharing

Bad padding/alignment

Data structure memory layout anomalies

Slide13

[#1] Test harness Overhead: impact on different Data structures

[Figure: operations per microsecond vs. concurrent threads on an 8-thread Intel i7-4770, with the original test harness and after reducing overhead; annotations: 2.2x, 3.1x]

Slide14

[#2] C++ volatile keyword

Informs compiler an address may be changed by another thread

Prevents some optimizations that are illegal in a concurrent setting

Value-based validation

v1 = *addr;
[…]
v2 = *addr;
if (v1 != v2) return FAIL;

Optimize

v1 = *addr;
[…]
v2 = v1;
if (v1 != v2) return FAIL;

Eliminated validation!

Impossible!

Slide15

Misuse of C/C++ Volatile

What is “left?”

node_t * left;
volatile node_t * left;
node_t volatile * left;
node_t * volatile left;
volatile node_t * volatile left;
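
The declarations above differ in what the qualifier binds to: the pointer itself, or the node it points to. The following is a minimal sketch, assuming a hypothetical node_t with a single key field and illustrative alias names that are not from the slides. It uses standard type traits to check which forms make the pointer volatile and which only qualify the pointee.

```cpp
#include <cassert>
#include <type_traits>

struct node_t { long key; };  // hypothetical node type, for illustration only

// Illustrative aliases for the five declarations (names are assumptions).
using plain_ptr        = node_t *;                    // nothing is volatile
using ptr_to_volatile  = volatile node_t *;           // the node is volatile
using ptr_to_volatile2 = node_t volatile *;           // same type as above
using volatile_ptr     = node_t * volatile;           // the pointer is volatile
using both_volatile    = volatile node_t * volatile;  // pointer and node

// True if the pointer object itself is volatile-qualified.
template <typename P>
constexpr bool pointer_is_volatile() {
    return std::is_volatile<P>::value;
}

// True if the object the pointer refers to is volatile-qualified.
template <typename P>
constexpr bool pointee_is_volatile() {
    return std::is_volatile<typename std::remove_pointer<P>::type>::value;
}

// "volatile node_t *" and "node_t volatile *" are the same type.
static_assert(std::is_same<ptr_to_volatile, ptr_to_volatile2>::value,
              "qualifier position before the * does not matter");
static_assert(pointer_is_volatile<volatile_ptr>() &&
              !pointee_is_volatile<volatile_ptr>(),
              "only the pointer is volatile");
static_assert(!pointer_is_volatile<ptr_to_volatile>() &&
              pointee_is_volatile<ptr_to_volatile>(),
              "only the node is volatile");
```

If the intent is that every read of left is reloaded from memory, the forms that qualify the pointer (the last two declarations) are the ones that express it; the second and third only qualify accesses to the node's fields.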

Slide16

[#3] Checking for memory leaks: Using jemalloc

Profiling leaks in ./myprogram

env MALLOC_CONF=prof_leak:true,lg_prof_sample:0,prof_final:true LD_PRELOAD=libjemalloc.so ./myprogram

<jemalloc>: Leak approximation summary: 8458392 bytes [...]

<jemalloc>: Run jeprof on "jeprof.1592.0.f.heap" [...]

PDF graph output

jeprof --show_bytes --pdf ./myprogram jeprof.1592.0.f.heap > output.pdf

Slide17

PDF output: tracking down ~8MB of leaked memory

Leak was caused by a serious algorithmic bug!

Slide18

Checking for memory leaks: Using valgrind

$ valgrind --fair-sched=yes --leak-check=full ./myprogram

==28550== 233,072 (3,696 direct, 229,376 indirect) bytes in 154 blocks are definitely lost in loss record 13 of 16

==28550== by 0x429518: Prepare<...> (snapcollector.h:307)

==28550== by 0x429518: traversal_end (rq_snap.h:313)

...

Slide19

[#4] False sharing

[Figure: a 64 byte (8 word) cache line holding w1 ... w8, shown in Thread 1’s cache and Thread 2’s cache. Thread 1 reads w2: its cache holds w1 ... w8 in state S. Thread 2 reads w7: its cache holds w1 ... w8 in state S. Thread 2 writes w7: X.]
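
A common way to avoid this is to give each independently written word its own cache line. The following is a minimal sketch, assuming a 64-byte line as on the slide; same_cache_line, padded_word, packed_words and padded_words are illustrative names, not from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t CACHE_LINE = 64;  // assumed line size, as on the slide

// True if both addresses fall into the same 64-byte line.
inline bool same_cache_line(const volatile void* a, const volatile void* b) {
    auto line = [](const volatile void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / CACHE_LINE;
    };
    return line(a) == line(b);
}

// Packed layout: w1..w8 share one line, so a write to one word by one
// thread invalidates the line in every other thread's cache.
alignas(CACHE_LINE) std::int64_t packed_words[8];

// Padded layout: alignas rounds the size of each element up to a full
// line, so no two elements ever share one.
struct alignas(CACHE_LINE) padded_word {
    std::int64_t value;
};
padded_word padded_words[8];

static_assert(sizeof(padded_word) == CACHE_LINE,
              "each padded word occupies exactly one line");
```

The cost is memory: the padded array is eight times larger, which is why padding is usually reserved for data that different threads write frequently.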

Slide20

False sharing in the test harness

Typically revealed by sanity checks

For example: read-only workload with empty data structures

Slide21

Searches in empty data structures

[Figure: operations per microsecond for Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree; 48 threads on 2x24 thread Intel E7-4830]

Same search code

Slide22

Locating the false sharing

Using Linux Performance Tools: perf

Record performance counter MEM_LOAD_UOPS_RETIRED_HIT_LFB (≅ memory contention)

Commands

perf record -e cpu/event=0xd1,umask=0x40/pp ./myprogram

perf report

Slide23

Exploring the perf data

Slide24

Exploring the perf data

Slide25

Exploring the perf data

Slide26

What are these variables?

while (!done) {
    ++cnt;
    if (cnt is a multiple of 50) {
        if (get_time() - startTime >= run_time) {
            done = true;
            memory_fence();
            break;
        }
    }
    ...
    [perform a random operation]
}
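
For reference, the loop above can be written as compilable C++. This is a sketch under stated assumptions, not the harness from the talk: trial_done, run_trial and the counter standing in for "perform a random operation" are hypothetical, and std::atomic with std::chrono replaces the slide's volatile flag, get_time() and memory_fence().

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

// Shared stop flag. Every thread reads it on every iteration, which is why
// its placement in memory matters for false sharing.
std::atomic<bool> trial_done{false};

// Hypothetical single-thread trial; returns the number of operations done.
inline long run_trial(std::chrono::milliseconds run_time) {
    using clock = std::chrono::steady_clock;
    const auto start_time = clock::now();
    long cnt = 0;
    long ops = 0;
    while (!trial_done.load(std::memory_order_acquire)) {
        ++cnt;
        if (cnt % 50 == 0) {  // consult the clock only every 50 iterations
            if (clock::now() - start_time >= run_time) {
                // The release store plays the role of done = true plus the fence.
                trial_done.store(true, std::memory_order_release);
                break;
            }
        }
        ++ops;  // stand-in for "perform a random operation"
    }
    return ops;
}
```

Checking the clock only every 50th iteration keeps timing overhead out of the measured operations, at the price of overshooting run_time by up to 49 operations.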

Slide27

The offending data layout

volatile long rngs[NUM_THREADS * PADDING];
volatile long startTime;
volatile bool done;

Thread t’s random # generator = rngs[t * PADDING]

[Figure: each rngs slot is 8 bytes of data followed by empty padding of 2 cache lines - 8 bytes]

Slide28

Expected data layout

... rngs[...] startTime done

Actual data layout

... rngs[...] startTime done

No false sharing

Slide29

Solution

struct {
    volatile char pad0[128];
    volatile long rngs[NUM_THREADS * PADDING];
    volatile long startTime;
    volatile bool done;
    volatile char pad1[128];
} g;

[Figure: resulting layout: g.pad0[…] ... g.rngs[...] g.startTime g.done g.pad1[…]]
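
Because members of one struct are laid out in declaration order, the layout can be checked at compile time rather than assumed. The following is a minimal sketch: the struct name globals_t, the value of NUM_THREADS, the choice of PADDING as 128 bytes' worth of longs, and the alignas on g are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t NUM_THREADS = 48;              // assumed thread count
constexpr std::size_t PADDING = 128 / sizeof(long);  // assumed: one slot per 2 cache lines

struct globals_t {
    volatile char pad0[128];
    volatile long rngs[NUM_THREADS * PADDING];
    volatile long startTime;
    volatile bool done;
    volatile char pad1[128];
};

alignas(128) globals_t g;  // assumption: start the group on a 2-line boundary

// Byte offset of the last thread's generator, i.e. rngs[(NUM_THREADS - 1) * PADDING].
constexpr std::size_t last_rng_offset =
    offsetof(globals_t, rngs) + (NUM_THREADS - 1) * PADDING * sizeof(long);

static_assert(offsetof(globals_t, rngs) == 128,
              "pad0 separates rngs from whatever precedes g");
static_assert(offsetof(globals_t, startTime) - last_rng_offset == 128,
              "startTime is two lines away from the last generator");
static_assert(sizeof(globals_t) - offsetof(globals_t, done) > 128,
              "pad1 separates done from whatever follows g");
```

With separate global variables the compiler and linker are free to place them in any order, which is how the actual layout can differ from the expected one.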

Slide30

Searches in empty data structures

[Figure: operations per microsecond for Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree]

Slide31

Searches in empty data structures

[Figure: operations per microsecond for Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree]

Slide32

Why is the skiplist slow?

Use PAPI measurements to investigate

                Lock-free BST    Lock-free skiplist
L1 miss / op    0.11             0.14
L2 miss / op    0.11             0.14
L3 miss / op    0.04             0.05
Cycles / op     347              656
Instr. / op     307              700
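
The cache-miss rows are nearly identical, so the gap has to come from the instruction counts. The following is a small worked calculation over the numbers in the table; papi_row and ratio are illustrative names, not part of PAPI.

```cpp
#include <cassert>

// One column of the table; every field is a per-operation average.
struct papi_row {
    double l1_miss, l2_miss, l3_miss, cycles, instructions;
};

constexpr papi_row lock_free_bst      {0.11, 0.11, 0.04, 347.0, 307.0};
constexpr papi_row lock_free_skiplist {0.14, 0.14, 0.05, 656.0, 700.0};

// How many times larger the skiplist's value is than the BST's.
constexpr double ratio(double skiplist, double bst) {
    return skiplist / bst;
}

// Instructions per operation: 700 / 307, roughly 2.28x.
constexpr double instr_ratio =
    ratio(lock_free_skiplist.instructions, lock_free_bst.instructions);

// Cycles per operation: 656 / 347, roughly 1.89x.
constexpr double cycle_ratio =
    ratio(lock_free_skiplist.cycles, lock_free_bst.cycles);
```

The skiplist executes more than twice as many instructions per search while missing in cache about as often, which points at extra work in the search path rather than at memory behavior.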

Slide33

Digging deeper with perf

perf record -e cpu-cycles:pp ./myprogram; perf report

Slide34

Confirm with an experiment

Flatten skiplist (MAX_LEVEL=1)

Works because of empty data structure workload

Slide35

Searches in empty data structures

[Figure: operations per microsecond for Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree]

Slide36

[#5] Problematic padding

operations per microsecond

Slide37

[#6] Data structure memory layout anomalies

48 threads

Prefill with 1M insertions

Then do 100% searches

More L2 misses and L3 misses

But the external BST contains more nodes!?

[Figure: operations per microsecond; labels: Fixed, Memory layout: NNNNNNNN, Memory layout: NDNDNDND]

Slide38

My top 10 Sanity checks

Variance measurements

Empty data structures

Read-only workloads

Inspect object addresses

Very large data structures

Valgrind

Key checksums

Extremely high contention

Eager reclamation

Artificial delays

Correctness

Performance
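
As an illustration of the key checksum check, here is one common way to implement it. This is a minimal single-threaded sketch under stated assumptions: std::set stands in for the data structure under test, and key_checksum_holds is a hypothetical helper. The running checksum adds each key that was successfully inserted and subtracts each key that was successfully deleted; at the end it must equal the sum of the keys still in the structure.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <set>

// Runs a random 50% insert, 50% delete workload and reports whether the
// key checksum matches the final contents of the set.
inline bool key_checksum_holds(int num_ops, int key_range, unsigned seed) {
    std::set<int> s;  // stand-in for the data structure under test
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> key_dist(1, key_range);
    std::bernoulli_distribution do_insert(0.5);

    // In a concurrent run, each thread would keep its own checksum and the
    // per-thread values would be summed after all threads have stopped.
    std::int64_t checksum = 0;
    for (int i = 0; i < num_ops; ++i) {
        const int key = key_dist(rng);
        if (do_insert(rng)) {
            if (s.insert(key).second) checksum += key;  // successful inserts only
        } else {
            if (s.erase(key) == 1) checksum -= key;     // successful deletes only
        }
    }

    // Sum of the keys actually present at the end of the trial.
    std::int64_t actual = 0;
    for (int key : s) actual += key;
    return checksum == actual;
}
```

A mismatch means some operation reported a result that the final contents do not support, for example a lost insertion or a key deleted twice.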

Slide39

Conclusion

Join me in performing R.A.R.E. experiments

expose problems with sanity checks

find solutions with systems tools

explain everything

Question: are Java experiments useful?

Ongoing work: new C/C++ test harness with tools to make R.A.R.E. data structure experiments easier

tutorial series

Data structure microbenchmarks and application benchmarks

Scripts to run sanity checks

Simple memory reclamation

More control over memory layout

Easy PAPI integration

Results stored in SQL database

Graph generation scripts

Slide40

