Slide 1
Good data structure experiments are R.A.R.E.
Trevor Brown
Technion
slides at http://tbrown.pro
Slide 2
Why do we perform experiments?
To answer questions about data structures:
Is one data structure faster than another? Why?
We are asking about algorithmic differences, NOT engineering differences.
Slide 3
The problem
A typical data structure experiment:
2x Intel E7-4830, 48 threads, 128GB RAM
Ubuntu 16.04 LTS, G++ 6.3.0 with flags -mcx16 -O3
Binary search tree benchmark
Five 3-second trials, tree prefilled to half-full
24 threads do 50% insert, 50% delete, 100k keys
Slide 4
Which is the “true” performance?
[Chart: operations per microsecond]
Which is the “right” comparison?
[NM14]
Lock-free External
BST
[BCCO10] Optimistic
AVL
tree
BCCO10
168% faster
than NM14
NM14
84% faster
than BCCO10Slide6
Slide 6
Good data structure experiments are:
Reproducible
Apples-to-apples (fair)
Realistic
Explainable
Slide 7
Reproducibility: crucial configuration parameters
Operating system: memory allocator, huge pages, thread pinning
Processor: prefetching mode, hyper-threading, turbo boost
Data structure: memory reclamation, object pooling
Slide 8
Apples-to-apples comparisons
All data structures should use the same:
Configuration parameters
Abstract data type (set vs. dictionary; insert-replace vs. insert-if-absent)
Engineering practices (inlining, int vs. long)
Slide 9
Realistic experiments
Realistic system configuration: a fast, scalable allocator
Realistic data structure implementation: memory reclamation (and free() calls)
Eliminate implementation errors
Slide 10
NM14 is 1-6% faster.
Thread pinning reveals sensitivity to NUMA.
Slide 11
How to perform R.A.R.E. experiments
Forgotten or unrealistic parameters
Unfair or unrealistic comparisons
Bugs, unfair or unrealistic engineering
Slide 12
Common implementation errors
Test harness overhead
Misuse of C/C++ volatile
Memory leaks
False sharing
Bad padding/alignment
Data structure memory layout anomalies
Slide 13
[#1] Test harness overhead: impact on different data structures
[Chart: operations per microsecond vs. concurrent threads on an 8-thread Intel i7-4770; reducing the overhead of the original test harness yields 2.2x and 3.1x speedups]
Slide 14
[#2] C++ volatile keyword
Informs the compiler that an address may be changed by another thread
Prevents some optimizations that are illegal in a concurrent setting

Value-based validation:
    v1 = *addr;
    [...]
    v2 = *addr;
    if (v1 != v2) return FAIL;

Without volatile, the compiler may optimize away the second read, eliminating the validation:
    v1 = *addr;
    [...]
    v2 = v1;
    if (v1 != v2) return FAIL;   // can never fail — validation is impossible!
Slide 15
Misuse of C/C++ volatile
What is “left”?
    node_t* left;                     // nothing is volatile
    volatile node_t* left;            // the node pointed to is volatile
    node_t volatile* left;            // same as the previous line
    node_t* volatile left;            // the pointer itself is volatile
    volatile node_t* volatile left;   // both are volatile
Slide 16
[#3] Checking for memory leaks: using jemalloc

Profiling leaks in ./myprogram:
    env MALLOC_CONF=prof_leak:true,lg_prof_sample:0,prof_final:true LD_PRELOAD=libjemalloc.so ./myprogram
    <jemalloc>: Leak approximation summary: 8458392 bytes [...]
    <jemalloc>: Run jeprof on "jeprof.1592.0.f.heap" [...]

PDF graph output:
    jeprof --show_bytes --pdf ./myprogram jeprof.1592.0.f.heap > output.pdf
Slide 17
PDF output: tracking down ~8MB of leaked memory
The leak was caused by a serious algorithmic bug!
Slide 18
Checking for memory leaks: using valgrind

    $ valgrind --fair-sched=yes --leak-check=full ./myprogram
    ==28550== 233,072 (3,696 direct, 229,376 indirect) bytes in 154 blocks are definitely lost in loss record 13 of 16
    ==28550==    by 0x429518: Prepare<...> (snapcollector.h:307)
    ==28550==    by 0x429518: traversal_end (rq_snap.h:313)
    ...
Slide 19
[#4] False sharing
[Diagram: a 64-byte (8-word) cache line holding words w1..w8. Thread 1 reads w2 and Thread 2 reads w7, so both threads' caches hold the line in the Shared (S) state. When Thread 2 writes w7, Thread 1's entire copy of the line is invalidated (X), even though Thread 1 never accesses w7.]
Slide 20
False sharing in the test harness
Typically revealed by sanity checks
For example: a read-only workload on empty data structures
Slide 21
Searches in empty data structures
[Chart: operations per microsecond for a lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST, and lock-free (a,b)-tree; 48 threads on 2x Intel E7-4830 (24 threads each); all use the same search code]
Slide 22
Locating the false sharing
Using Linux performance tools: perf
Record the performance counter MEM_LOAD_UOPS_RETIRED.HIT_LFB (≅ memory contention)
Commands:
    perf record -e cpu/event=0xd1,umask=0x40/pp ./myprogram
    perf report
Slide 23
Exploring the perf data
[Screenshot: perf report]

Slide 24
Exploring the perf data
[Screenshot: perf report]

Slide 25
Exploring the perf data
[Screenshot: perf report]
Slide 26
What are these variables?

    while (!done) {
        ++cnt;
        if (cnt % 50 == 0) {  // check the clock only every 50 operations
            if (get_time() - startTime >= run_time) {
                done = true;
                memory_fence();
                break;
            }
        }
        ...
        [perform a random operation]
    }
Slide 27
The offending data layout

    volatile long rngs[NUM_THREADS * PADDING];
    volatile long startTime;
    volatile bool done;

Thread t's random number generator is rngs[t * PADDING]: each slot holds 8 bytes of data followed by 2 cache lines minus 8 bytes of empty padding.
Slide 28
Expected data layout: ... rngs[...] startTime done, with startTime and done falling in the empty padding after rngs[...] (no false sharing)
Actual data layout: [diagram shows startTime and done placed differently]
Slide 29
Solution

    struct {
        volatile char pad0[128];
        volatile long rngs[NUM_THREADS * PADDING];
        volatile long startTime;
        volatile bool done;
        volatile char pad1[128];
    } g;

Access as g.rngs[...], g.startTime, and g.done; the pad0 and pad1 arrays insulate the struct from neighboring data.
Slide 30
Searches in empty data structures
[Chart: operations per microsecond for the lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST, and lock-free (a,b)-tree]

Slide 31
Searches in empty data structures
[Chart: the same data structures, operations per microsecond]
Slide 32
Why is the skiplist slow?
Use PAPI measurements to investigate:

                    Lock-free BST    Lock-free skiplist
    L1 miss / op    0.11             0.14
    L2 miss / op    0.11             0.14
    L3 miss / op    0.04             0.05
    Cycles / op     347              656
    Instr. / op     307              700
Slide 33
Digging deeper with perf

    perf record -e cpu-cycles:pp ./myprogram; perf report
Slide 34
Confirm with an experiment
Flatten the skiplist (MAX_LEVEL=1)
This works because the workload uses empty data structures
Slide 35
Searches in empty data structures
[Chart: operations per microsecond for the lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST, and lock-free (a,b)-tree]
Slide 36
[#5] Problematic padding
[Chart: operations per microsecond]
Slide 37
[#6] Data structure memory layout anomalies
48 threads; prefill with 1M insertions, then perform 100% searches
More L2 misses and L3 misses, but the external BST contains more nodes!?
[Chart: operations per microsecond; memory layout NNNNNNNN vs. NDNDNDND (fixed)]
Slide 38
My top 10 sanity checks (for correctness and performance)
Variance measurements
Empty data structures
Read-only workloads
Inspect object addresses
Very large data structures
Valgrind
Key checksums
Extremely high contention
Eager reclamation
Artificial delays
Slide 39
Conclusion
Join me in performing R.A.R.E. experiments:
Expose problems with sanity checks
Find solutions with systems tools
Explain everything
Question: are Java experiments useful?
Ongoing work: a new C/C++ test harness, with tools to make R.A.R.E. data structure experiments easier, and a tutorial series:
Data structure microbenchmarks and application benchmarks
Scripts to run sanity checks
Simple memory reclamation
More control over memory layout
Easy PAPI integration
Results stored in an SQL database
Graph generation scripts