Slide1
Good data structure experiments are r.a.r.e.
Trevor Brown
Technion
slides at http://tbrown.pro
Slide2
Why do we perform experiments?
To answer questions about data structures
Is one data structure faster than another? Why?
We are asking about algorithmic differences, NOT engineering differences
Slide3
The problem
Typical data structure experiment
2x Intel E7-4830, 48 threads, 128GB RAM
Ubuntu 16.04 LTS, G++ 6.3.0 with flags -mcx16 -O3
Binary search tree benchmark
Five 3-second trials, tree prefilled to half-full
24 threads do 50% insert, 50% delete, 100k keys
Slide4
Which is the “true” performance?
[Figure: operations per microsecond]
Slide5
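One hedged way to approach the “true” performance question from the preceding slide is to report the spread across trials instead of a single number. The sketch below is not from the talk: `TrialSummary` and `summarize` are hypothetical names, and each input value is assumed to be the throughput of one trial.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary statistics for the throughput (ops per microsecond) of repeated trials.
struct TrialSummary {
    double min;
    double median;
    double max;
    double mean;
    double stddev;  // sample standard deviation (n - 1 in the denominator)
};

// Takes the vector by value because it sorts its own copy.
TrialSummary summarize(std::vector<double> trials) {
    assert(!trials.empty());
    std::sort(trials.begin(), trials.end());
    const std::size_t n = trials.size();

    double sum = 0.0;
    for (double t : trials) sum += t;
    const double mean = sum / n;

    double sq = 0.0;
    for (double t : trials) sq += (t - mean) * (t - mean);
    const double stddev = (n > 1) ? std::sqrt(sq / (n - 1)) : 0.0;

    const double median = (n % 2 == 1)
        ? trials[n / 2]
        : 0.5 * (trials[n / 2 - 1] + trials[n / 2]);

    return TrialSummary{trials.front(), median, trials.back(), mean, stddev};
}
```

A large gap between min and max across the five trials is a hint that a configuration parameter is uncontrolled.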
Which is the “right” comparison?
[NM14] Lock-free External BST
[BCCO10] Optimistic AVL tree
BCCO10 168% faster than NM14
NM14 84% faster than BCCO10
Slide6
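For reading the “X% faster” claims on the preceding slide, a minimal sketch of the arithmetic, assuming both inputs are throughputs (higher is better). `percent_faster` is a hypothetical helper, not part of any benchmark mentioned in the talk.

```cpp
#include <cassert>
#include <cmath>

// How much faster is `a` than `b`, in percent? Both are throughputs
// (for example operations per microsecond), so higher is better.
double percent_faster(double a, double b) {
    assert(b > 0.0);
    return (a / b - 1.0) * 100.0;
}
```

The measure is not symmetric: with made-up throughputs 2.68 and 1.0, the first is 168% faster than the second, but the second is only about 63% slower than the first.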
Good data structure experiments are
Reproducible
Apples-to-apples (fair)
Realistic
Explainable
Slide7
Reproducibility: Crucial configuration parameters
Operating system: memory allocator, huge pages, thread pinning
Processor: prefetching mode, hyper threading, turbo boost
Data structure: memory reclamation, object pooling
Slide8
Apples-to-apples Comparisons
All data structures should use the same:
Configuration parameters
ADT differences
  set vs dictionary
  insert-replace vs insert-if-absent
Engineering practices
  inlining, int vs long
Slide9
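To make the insert-replace vs insert-if-absent difference from the preceding slide concrete, here is a small sketch using `std::map` as a stand-in dictionary. The function names are hypothetical; the point is only that the two ADT variants do different amounts of work on a key that is already present, so mixing them in one comparison is not apples-to-apples.

```cpp
#include <cassert>
#include <map>
#include <utility>

typedef std::map<int, int> Dict;

// insert-if-absent: keeps the old value when the key is already present.
// Returns true only if a new key was added.
bool insert_if_absent(Dict& d, int key, int value) {
    return d.insert(std::make_pair(key, value)).second;
}

// insert-replace: always stores the new value.
// Returns true only if a new key was added.
bool insert_replace(Dict& d, int key, int value) {
    const bool was_absent = (d.find(key) == d.end());
    d[key] = value;
    return was_absent;
}
```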
Realistic experiments
Appropriate benchmarks
Realistic system configuration
  Fast scalable allocator
Realistic data structure implementation
  Memory reclamation (and free() calls)
Eliminate implementation errors
Slide10
NM14 33% faster
BCCO10 106% faster
NM14 1-6% faster
Comparison in [NM14]
Thread pinning reveals sensitivity to NUMA
Slide11
Explaining results
Investigate with systems tools
Use performance counters (PAPI, Linux perf tools)
  L1/L2/L3 cache misses, stalls, cycles, instructions
Construct experiments to confirm explanations
Slide12
How to perform R.A.R.E. experiments
Forgotten or unrealistic parameters
Unfair or unrealistic comparisons
Bugs, unfair or unrealistic engineering
Slide13
Common implementation errors
Test harness overhead
Misuse of C/C++ volatile
Memory leaks
False sharing
Bad padding/alignment
Data structure memory layout anomalies
Slide14
[#1] Test harness overhead: impact on different data structures
[Figure: operations per microsecond vs. concurrent threads; original test harness vs. after reducing overhead; annotations 2.2x and 3.1x; 8-thread Intel i7-4770]
Slide15
Overhead of timing measurements
Data structure operations are no-ops
64-thread AMD system
[Figure axis: # operations per get_time() call]
Slide16
[#2] C++ volatile keyword
Informs compiler an address may be changed by another thread
Prevents some optimizations that are illegal in a concurrent setting
Value-based validation:
    v1 = *addr;
    […]
    v2 = *addr;
    if (v1 != v2) return FAIL;
Optimize (without volatile): Eliminated validation!
    v1 = *addr;
    […]
    v2 = v1;
    if (v1 != v2) return FAIL;    (Impossible!)
Slide17
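A minimal sketch of the value-based validation from the preceding slide, written so that both loads go through a pointer to volatile. `validated_read` and its `work` parameter are hypothetical names, and the demonstration is single-threaded. Note that in ISO C++ volatile only constrains the compiler; it does not by itself provide atomicity or ordering between threads, for which `std::atomic` is the standard tool.

```cpp
#include <cassert>

// Value-based validation: read *addr, do some work, read it again, and fail
// if the value changed in between. Because addr points to a volatile long,
// the compiler must perform both loads and cannot rewrite v2 = *addr as v2 = v1.
template <typename Work>
bool validated_read(const volatile long* addr, Work work, long* out) {
    const long v1 = *addr;
    work();                      // stands in for the [...] on the slide
    const long v2 = *addr;       // second real load
    if (v1 != v2) return false;  // FAIL: value changed during the work
    *out = v1;
    return true;
}
```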
Misuse of C/C++ volatile
What is “left”?
    node_t * left;
    volatile node_t * left;
    node_t volatile * left;
    node_t * volatile left;
    volatile node_t * volatile left;
Slide18
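The five declarations on the preceding slide can be checked mechanically. This sketch, which is an illustration and not code from the talk, puts them in a hypothetical struct `decls` and uses `<type_traits>` to ask the compiler which part is volatile: the pointer itself, or the node it points to.

```cpp
#include <cassert>
#include <type_traits>

struct node_t { int key; };

// The five declarations from the slide, as members so decltype can inspect them.
struct decls {
    node_t * a;                    // plain pointer to a plain node
    volatile node_t * b;           // pointer to a volatile node
    node_t volatile * c;           // same type as b
    node_t * volatile d;           // volatile pointer to a plain node
    volatile node_t * volatile e;  // volatile pointer to a volatile node
};

// True if the pointer variable itself is volatile.
template <typename P>
struct pointer_is_volatile : std::is_volatile<P> {};

// True if the node it points to is volatile.
template <typename P>
struct node_is_volatile
    : std::is_volatile<typename std::remove_pointer<P>::type> {};

static_assert(!pointer_is_volatile<decltype(decls::a)>::value, "a: nothing volatile");
static_assert(!pointer_is_volatile<decltype(decls::b)>::value, "b: pointer not volatile");
static_assert(node_is_volatile<decltype(decls::b)>::value, "b: node volatile");
static_assert(std::is_same<decltype(decls::b), decltype(decls::c)>::value, "b and c match");
static_assert(pointer_is_volatile<decltype(decls::d)>::value, "d: pointer volatile");
static_assert(!node_is_volatile<decltype(decls::d)>::value, "d: node not volatile");
static_assert(pointer_is_volatile<decltype(decls::e)>::value, "e: pointer volatile");
static_assert(node_is_volatile<decltype(decls::e)>::value, "e: node volatile");
```

Only `d` and `e` make reads of the pointer field itself volatile, which is what a search that re-reads child pointers needs.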
Examples in the wild: Missing volatiles
The original implementation of the [NM14] BST uses the following node type:
    struct node_t {
        int key;
        AO_double_t children;
    };
AO_double_t is defined by the Atomic Ops (AO) library
NOT volatile by default
Need “volatile AO_double_t children”
Slide19
Examples in the wild: Misplaced volatiles
The ASCYLIB implementation of the [NM14] BST uses the following node type:
    struct node_t {
        skey_t key;
        sval_t value;
        volatile node_t * left;
        volatile node_t * right;
        char padding[32];
    };
“left” is a pointer to a volatile node
We want a volatile pointer to a node:
    node_t * volatile left;
Slide20
[#3] Checking for memory leaks: Using jemalloc
Profiling leaks in ./myprogram
    env MALLOC_CONF=prof_leak:true,lg_prof_sample:0,prof_final:true LD_PRELOAD=libjemalloc.so ./myprogram
    <jemalloc>: Leak approximation summary: 8458392 bytes [...]
    <jemalloc>: Run jeprof on "jeprof.1592.0.f.heap" [...]
PDF graph output
    jeprof --show_bytes --pdf ./myprogram jeprof.1592.0.f.heap > output.pdf
Slide21
PDF output: tracking down ~8MB of leaked memory
Leak was caused by a serious algorithmic bug!
Slide22
Checking for memory leaks: Using valgrind
    $ valgrind --fair-sched=yes --leak-check=full ./myprogram
    ==28550== 233,072 (3,696 direct, 229,376 indirect) bytes in 154 blocks are definitely lost in loss record 13 of 16
    ==28550== by 0x429518: Prepare<...> (snapcollector.h:307)
    ==28550== by 0x429518: traversal_end (rq_snap.h:313)
    ...
Slide23
[#4] False sharing
64 byte (8 word) cache line: w1 w2 w3 w4 w5 w6 w7 w8
Thread 1 reads w2: the whole line is loaded into Thread 1’s cache (S)
Thread 2 reads w7: the whole line is loaded into Thread 2’s cache (S)
Thread 2 writes w7: Thread 1’s copy of the line is invalidated (X)
Slide24
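A hedged sketch of the usual remedy for the false sharing pictured on the preceding slide: give each thread's word its own cache line. The 64-byte line size is taken from the slide; `PaddedCounter` and `same_cache_line` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as drawn on the slide.
const std::size_t CACHE_LINE = 64;

// One counter per cache line: alignas pads sizeof(PaddedCounter) up to 64.
struct alignas(64) PaddedCounter {
    volatile long value;
};

// True if the two addresses fall in the same cache line.
bool same_cache_line(const volatile void* a, const volatile void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / CACHE_LINE ==
           reinterpret_cast<std::uintptr_t>(b) / CACHE_LINE;
}
```

Words w2 and w7 of one aligned 8-word array share a line, so a write to one invalidates the other thread's copy; two `PaddedCounter` objects never do.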
False sharing in the test harness
Typically revealed by sanity checks
For example: read-only workload with empty data structures
Slide25
Searches in empty data structures
[Figure: operations per microsecond for Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree]
48 threads on: 2x24 thread Intel E7-4830
Same search code
Slide26
Locating the false sharing
Using Linux Performance Tools: perf
Record performance counter MEM_LOAD_UOPS_RETIRED_HIT_LFB (≅ memory contention)
Command:
    perf record -e cpu/event=0xd1,umask=0x40/pp ./myprogram
pp: use high precision event data (more accurate line numbers)
Slide27
Exploring the perf data
Slide28
Exploring the perf data
Slide29
Exploring the perf data
Slide30
What are these variables?
    while (!done) {
        ++cnt;
        if (cnt is a multiple of 50) {
            if (get_time() - startTime >= run_time) {
                done = true;
                memory_fence();
                break;
            }
        }
        ...
        [perform a random operation]
    }
Slide31
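The harness loop on the preceding slide can be sketched in portable C++ as follows. This is an illustration under assumptions, not the author's harness: `run_trial` is a hypothetical name, `std::chrono::steady_clock` stands in for `get_time()`, and an atomic store with release ordering stands in for `done = true; memory_fence();`. The clock is read only once every `check_interval` iterations, which is how the loop amortizes timing overhead.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Runs `op` repeatedly until `run_time` has elapsed or another thread sets
// `done`. Returns the number of operations this thread performed.
template <typename Op>
std::uint64_t run_trial(std::atomic<bool>& done,
                        std::chrono::milliseconds run_time,
                        std::uint64_t check_interval,
                        Op op) {
    typedef std::chrono::steady_clock clock;
    const clock::time_point start_time = clock::now();
    std::uint64_t ops = 0;
    for (std::uint64_t cnt = 1; !done.load(std::memory_order_acquire); ++cnt) {
        // Read the clock only on every check_interval-th iteration.
        if (cnt % check_interval == 0 && clock::now() - start_time >= run_time) {
            done.store(true, std::memory_order_release);
            break;
        }
        op();  // stands in for [perform a random operation]
        ++ops;
    }
    return ops;
}
```

Every thread reads `done` on every iteration, so what shares a cache line with `done` matters a great deal.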
The offending data layout
    volatile long rngs[NUM_THREADS * PADDING];
    volatile long startTime;
    volatile bool done;
Thread t’s random # generator = rngs[t * PADDING]
[Figure: each rngs slot is 8 bytes of data followed by empty padding of 2 cache lines - 8 bytes]
Slide32
Expected data layout
[Figure labels: ..., rngs[...], startTime, done]
Actual data layout
[Figure labels: ..., rngs[...], startTime, done]
No false sharing
Slide33
Brittle Solution
    volatile long rngs[NUM_THREADS * PADDING];
    volatile char padding[128];
    volatile long startTime;
    volatile bool done;
[Figure labels: ..., rngs[...], startTime, done, padding[…]]
Data could still be reordered!
Slide34
Better Solution
    struct {
        volatile char pad0[128];
        volatile long rngs[NUM_THREADS * PADDING];
        volatile long startTime;
        volatile bool done;
        volatile char pad1[128];
    } g;
[Figure labels: ..., g.rngs[...], g.startTime, g.done, g.pad0[…], g.pad1[…]]
Slide35
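The struct-based layout from the preceding slide can be checked at compile time. In this sketch the values of `NUM_THREADS` and `PADDING` are assumptions chosen for illustration (128 bytes per thread slot), and `globals_t` and `rng_offset` are hypothetical names. It relies on the language rule that members of one struct with the same access level are laid out in declaration order, which separate global variables are not guaranteed to be.

```cpp
#include <cassert>
#include <cstddef>

// Assumptions for illustration: the thread count and a 128-byte slot per thread.
const std::size_t NUM_THREADS = 48;
const std::size_t PADDING = 128 / sizeof(long);  // longs per slot

struct globals_t {
    volatile char pad0[128];
    volatile long rngs[NUM_THREADS * PADDING];
    volatile long startTime;
    volatile bool done;
    volatile char pad1[128];
};

// Declaration order is preserved inside a struct, so these cannot drift.
static_assert(offsetof(globals_t, rngs) >= 128, "pad0 precedes rngs");
static_assert(offsetof(globals_t, startTime) >=
                  offsetof(globals_t, rngs) + sizeof(long) * NUM_THREADS * PADDING,
              "startTime follows the last rngs slot");
static_assert(sizeof(globals_t) >= offsetof(globals_t, done) + 1 + 128,
              "pad1 follows done");

// Byte offset of thread t's random number generator inside globals_t.
std::size_t rng_offset(std::size_t t) {
    return offsetof(globals_t, rngs) + t * PADDING * sizeof(long);
}
```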
Searches in empty data structures
[Figure: operations per microsecond for Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree]
Slide36
Searches in empty data structures
[Figure: operations per microsecond for Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree]
Slide37
Why is the skiplist slow?
Use PAPI measurements to investigate
                 Lock-free BST    Lock-free skiplist
L1 miss / op     0.11             0.14
L2 miss / op     0.11             0.14
L3 miss / op     0.04             0.05
Cycles / op      347              656
Instr. / op      307              700
Slide38
Digging deeper with perf
    perf record -e cpu-cycles:pp ./myprogram; perf report
Slide39
Confirm with an experiment
Flatten skiplist (MAX_LEVEL=1)
Works because of empty data structure workload
Slide40
Searches in empty data structures
[Figure: operations per microsecond for Lock-free skiplist, Lock-free list, Lazy list, RCU-based BST, Lock-free BST, Lock-free (a,b)-tree]
Slide41
[#5] Problematic padding
Original: 2.05 L3 miss/op
Padding removed: 0.01 L3 miss/op
[Figure: operations per microsecond]
Slide42
[#6] Data structure memory layout anomalies
48 threads
Prefill with 1M insertions
Then do 100% searches
More L2 misses and L3 misses
But the external BST contains more nodes!?
Fixed
Memory layout: NNNNNNNN
Memory layout: NDNDNDND
[Figure: operations per microsecond]
Slide43
My top 10 sanity checks
Empty data structures
Read-only workloads
Inspect object sizes and first k object allocations per thread
Trial length: 3s vs 60s
10^2 vs 10^5 vs 10^7 keys
Key checksums
Valgrind
Extremely high contention
Memory reclamation: efficient vs eager
Variance measurements
Slide44
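As an illustration of the “key checksums” item on the preceding slide, here is one common form of the check, sketched single-threaded with `std::set` standing in for the data structure under test. All names are hypothetical. The idea: keep a running sum of keys that were successfully inserted minus keys that were successfully deleted, and compare it at the end with the sum of the keys actually in the structure. In a concurrent harness each thread would keep its own running sum and the sums would be added after the trial.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <set>

struct ChecksumTrial {
    std::set<int> ds;      // stand-in for the data structure under test
    std::int64_t key_sum;  // successful inserts minus successful deletes

    ChecksumTrial() : key_sum(0) {}

    void insert(int key) { if (ds.insert(key).second) key_sum += key; }
    void erase(int key)  { if (ds.erase(key) == 1) key_sum -= key; }

    // Sum of the keys that are actually in the structure right now.
    std::int64_t structure_sum() const {
        std::int64_t sum = 0;
        for (int key : ds) sum += key;
        return sum;
    }

    bool checksum_ok() const { return key_sum == structure_sum(); }
};

// Performs num_ops random inserts and deletes on keys in [0, key_range).
void run_random_ops(ChecksumTrial& trial, int num_ops, int key_range, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick_key(0, key_range - 1);
    for (int i = 0; i < num_ops; ++i) {
        const int key = pick_key(rng);
        if (rng() & 1u) trial.insert(key); else trial.erase(key);
    }
}
```

A mismatch means an operation reported a result that the structure's contents do not reflect, which points to a bug in the data structure or the harness.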
Conclusion
Join me in performing R.A.R.E. experiments
  find problems with sanity checks
  find solutions with systems tools
  explain everything (with evidence!)
Question: are Java experiments useful?
Ongoing work: new test harness with tools to make R.A.R.E. experiments easier