Slide1
Rapid Identification of Architectural Bottlenecks via Precise Event Counting
John Demme, Simha Sethumadhavan
Columbia University
{jdd,simha}@cs.columbia.edu
Slide2
2002
Platforms
Source: TIOBE Index, http://www.tiobe.com/index.php/tiobe_index
CASTL: Computer Architecture and Security Technologies Lab
Slide3
2011
Platforms
Multicore
Moore’s Law
Source: TIOBE Index, http://www.tiobe.com/index.php/tiobe_index
Slide4
How can we possibly keep up?
Slide5
Architectural Lifecycle
Slide6
Performance Data Collection
Analytical Models
  Fast, but questionable accuracy
Simulation
  Often the gold standard
  Very detailed information
  Very slow
Production Hardware (performance counters)
  Very fast
  Not very detailed
Slide7
Performance Data Collection
Analytical Models
  Fast, but questionable accuracy
Simulation
  Often the gold standard
  Very detailed information
  Very slow
Production Hardware (Performance Counters)
  Very fast
  Relatively detailed (revising "Not very detailed")
Slide8
Accuracy, Precision & Perturbation
A comparison of performance monitoring techniques and the uncertainty principle
Slide9
Accuracy, Precision & Perturbation
In normal execution, the program interacts with the microarchitecture as expected
[Figure: timeline of normal program execution and the corresponding machine state (cache, branch predictor, etc.); horizontal axis: time]
Slide10
Precise Instrumentation
When instrumentation is inserted, the machine state is disrupted and measurements are inaccurate
[Figure: timeline of monitored program execution with instrumentation at the start of mutex_lock, mutex_unlock, and barrier_wait, comparing the "correct" machine state with the measured machine state (cache, branch predictor, etc.); horizontal axis: time]
Slide11
Performance Counter SW Landscape
Precise
  Reads counters whenever program or instrumentation requests a read
  Heavyweight
    Examples: PAPI, perf_event
    Overhead: proportional to # of reads
      PAPI: 1048ns
      Perf_event: 262ns
Slide12
Sampling vs. Instrumentation
[Figure: two timelines over time. Sampled program execution is interrupted every n cycles; traditional instrumented program execution is instrumented at the start of mutex_lock, mutex_unlock, and barrier_wait]
Traditional instrumentation is like polling
Sampling uses interrupts
Slide13
Performance Counter SW Landscape
Sampling
  Interrupts every n cycles and extrapolates
  Examples: vTune, OProfile
  Overhead: inversely proportional to n
    Up to 20%
    Usually much less
Precise
  Reads counters whenever program or instrumentation requests a read
  Heavyweight
    Examples: PAPI, perf_event
    Overhead: proportional to # of reads
      PAPI: 1048ns
      Perf_event: 262ns
Slide14
The Problem with Sampling
[Figure: a sample interrupt lands on the program timeline: "Is this a critical section?"]
Slide15
Corrected with Precision
[Figure: two "Read counter" points marked on the program timeline]
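The contrast between sampling and precise reads can be sketched with a toy timeline. This is only an illustration under stated assumptions, not LiMiT code: the region lengths, the sampling period, and the helper names (`in_critical_section`, `sampled_estimate`, `precise_count`) are all invented for the example.

```python
def in_critical_section(regions, cycle):
    """Return True if `cycle` falls inside a critical region of the timeline."""
    start = 0
    for length, is_critical in regions:
        if start <= cycle < start + length:
            return is_critical
        start += length
    raise ValueError("cycle is past the end of the timeline")


def sampled_estimate(regions, period):
    """Sample every `period` cycles and extrapolate each hit to `period` cycles."""
    total = sum(length for length, _ in regions)
    hits = sum(in_critical_section(regions, c) for c in range(0, total, period))
    return hits * period


def precise_count(regions):
    """Read the counter at region boundaries, so every critical cycle is counted."""
    return sum(length for length, is_critical in regions if is_critical)


# Hypothetical workload: five iterations of 90 ordinary cycles
# followed by a 10-cycle critical section.
timeline = [(90, False), (10, True)] * 5

coarse = sampled_estimate(timeline, period=100)  # every sample misses the short sections
exact = precise_count(timeline)
print(coarse, exact)
```

With this particular period every sample lands outside the short critical sections, so the sampled estimate is 0 cycles while the precise count is 50. Real samplers are less regular than this, but the effect on short regions is the same in kind.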
Slide16
But, Precision Adds Overhead
[Figure: timeline of monitored program execution, comparing the "correct" machine state with the measured machine state (cache, branch predictor, etc.); horizontal axis: time]
Slide17
Instrumentation Adds Perturbation
If instrumentation sections are short, perturbation is reduced and measurements become more accurate
[Figure: timeline of monitored program execution, comparing the "correct" machine state with the measured machine state (cache, branch predictor, etc.); horizontal axis: time]
Slide18
Performance Counter SW Landscape
Sampling
  Interrupts every n cycles and extrapolates
  Examples: vTune, OProfile
  Overhead: inversely proportional to n
    Up to 20%
    Usually much less
Precise
  Reads counters whenever program or instrumentation requests a read
  Heavyweight
    Examples: PAPI, perf_event
    Overhead: proportional to # of reads
      PAPI: 1048ns
      Perf_event: 262ns
  Lightweight
Slide19
Performance Counter SW Landscape
Sampling
  Interrupts every n cycles and extrapolates
  Examples: vTune, OProfile
  Overhead: inversely proportional to n
    Up to 20%
    Usually much less
Precise
  Reads counters whenever program or instrumentation requests a read
  Heavyweight
    Examples: PAPI, perf_event
    Overhead: proportional to # of reads
      PAPI: 1048ns
      Perf_event: 262ns
  Lightweight
    Examples: LiMiT
    Overhead: proportional to # of reads
      11ns
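The two overhead laws in the landscape can be written down as a small model. This is a sketch under assumptions: the per-read costs are the figures quoted on the slide, while the read count, the interrupt cost of 2,000 cycles, and the function names are hypothetical values chosen only to show the shape of each curve.

```python
# Per-read costs in nanoseconds, as quoted on the slide.
PER_READ_NS = {"PAPI": 1048, "perf_event": 262, "LiMiT": 11}


def precise_overhead_seconds(tool, reads):
    """Precise counting: total overhead grows linearly with the number of reads."""
    return reads * PER_READ_NS[tool] * 1e-9


def sampling_overhead_fraction(interrupt_cost_cycles, n):
    """Sampling: one interrupt every n cycles, so overhead scales as 1/n."""
    return interrupt_cost_cycles / n


reads = 1_000_000  # hypothetical number of counter reads
for tool in PER_READ_NS:
    print(tool, precise_overhead_seconds(tool, reads))

# Hypothetical interrupt cost of 2,000 cycles; not a figure from the slides.
for n in (10_000, 100_000, 1_000_000):
    print(n, sampling_overhead_fraction(2_000, n))
```

The model makes the trade-off visible: a sampler's cost is fixed by the choice of n regardless of how many regions are of interest, while a precise tool's cost is set entirely by how often the program asks for a value.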
Slide20
Related Work
No recent papers for better precise counting
  Original PAPI paper: Browne et al. 2000
  Some software, none offering LiMiT’s features
Characterizing performance counters
  Weaver & Dongarra 2010
Sampling
  Counter multiplexing techniques
    Mytkowicz et al. 2007
    Azimi et al. 2005
Trace Alignment
  Mytkowicz et al. 2006
Slide21
Reducing counter read overheads
Implementing lightweight, precise monitoring
Slide22
Why Precision is Slow
Avoid system calls to avoid overhead
Perfmon2 & Perf_event
  Program requests counter read
  System Call
  Kernel reads counter and returns result
  System Ret
  Program uses value
LiMiT
  Program reads counter
  Program uses value
Why is this so hard?
Slide23
A Self-Monitoring Process
Slide24
Run, process, run
[Figure: a running process with counters for L1 Misses, Branches, and Cycles]
Slide25
Overflow
[Figure: counters for L1 Misses, Branches, and Cycles; one counter climbs to 100, overflows, and signals it ("Psst!")]
Slide26
Overflow
[Figure: counters for L1 Misses, Branches, and Cycles next to an overflow space with one slot per counter; the overflowed counter is reset to 0 and 100 is added to its overflow slot]
Slide27
Modified Read
[Figure: a read adds the overflow-space value to the hardware counter value: 100 + 20 = 120]
Slide28
Overflow During Read
[Figure: a read picks up the counter value 99 while the overflow space still holds 0]
Slide29
Overflow!
[Figure: before the read completes, the counter overflows: it is reset to 0 and the overflow space becomes 100, while the read still holds the value 99]
Slide30
Atomicity Violation!
[Figure: the read adds the updated overflow-space value to the stale counter value: 100 + 99 = 199]
Slide31
OS Detection & Correction
[Figure: the same overflow during a read: the counter is reset to 0, the overflow space becomes 100, and the read still holds the value 99]
Slide32
OS Detection & Correction
[Figure: the OS notices the interrupted read ("Looks like he was reading that…") and replaces the stale value 99 with 0]
Slide33
Atomicity Violation Corrected
[Figure: the read now computes 100 + 0 = 100]
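The overflow sequence on the preceding slides can be replayed in a few lines. This is a toy model under assumptions, not LiMiT's implementation: the class `OverflowingCounter`, its `os_fixup` flag, and the limit of 100 are hypothetical names and values that mirror the numbers in the figures. A read is modeled as two steps (fetch the hardware counter, then add the overflow space), with an optional event landing between them.

```python
class OverflowingCounter:
    """Toy counter that spills into an in-memory overflow space at `limit`."""

    def __init__(self, limit=100, os_fixup=False):
        self.limit = limit
        self.hw = 0               # hardware counter register
        self.overflow_space = 0   # accumulated overflows, kept in memory
        self.in_flight = None     # value held by a read that has not finished
        self.os_fixup = os_fixup  # whether the overflow handler repairs reads

    def tick(self):
        """Count one event; on overflow, move the count to the overflow space."""
        self.hw += 1
        if self.hw == self.limit:
            self.hw = 0
            self.overflow_space += self.limit
            if self.os_fixup and self.in_flight is not None:
                # The handler sees an interrupted read and clears its stale value.
                self.in_flight = 0

    def read(self, event_mid_read=False):
        """Two-step read: fetch the hardware counter, then add the overflow space."""
        self.in_flight = self.hw
        if event_mid_read:
            self.tick()  # an event arrives between the two steps
        total = self.in_flight + self.overflow_space
        self.in_flight = None
        return total


def read_after(events, **kwargs):
    """Count `events`, then read while one more event arrives mid-read."""
    counter = OverflowingCounter(**kwargs)
    for _ in range(events):
        counter.tick()
    return counter.read(event_mid_read=True)


print(read_after(99, os_fixup=False))  # stale 99 plus new overflow space
print(read_after(99, os_fixup=True))   # handler cleared the stale value
```

Without the fix-up the interrupted read returns 199; with it the read returns 100, matching the corrected figure. How the real kernel recognizes an in-progress read is described in the paper rather than on the slides, so the `in_flight` flag here is purely a stand-in.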
So what does all this effort buy us?
Slide34
Time to collect 3×10^7 readings
Time     PAPI     Perf_event   LiMiT    Speedup
User     1.26s    0.53s        0.34s    3.7x / 1.56x
System   30.10s   7.30s        0        ∞
Wall     31.44s   7.87s        0.34s    92x / 23.1x
Average LiMiT Readout
Number of instructions   5
Number of cycles         37.14
Time                     11.3 ns
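The per-read costs and the wall-clock speedups follow directly from the wall times and the number of readings. A short check, using only figures from this slide (the function names are illustrative):

```python
READINGS = 3 * 10**7  # number of counter readings collected

# Wall-clock seconds from the table.
wall_seconds = {"PAPI": 31.44, "perf_event": 7.87, "LiMiT": 0.34}


def ns_per_reading(tool):
    """Average wall-clock cost of one reading, in nanoseconds."""
    return wall_seconds[tool] * 1e9 / READINGS


def speedup_over(tool):
    """How many times faster LiMiT is than `tool` in wall-clock time."""
    return wall_seconds[tool] / wall_seconds["LiMiT"]


for tool in wall_seconds:
    print(tool, round(ns_per_reading(tool), 1), round(speedup_over(tool), 1))
```

The results line up with the 1048 ns, 262 ns, and roughly 11 ns per-read figures quoted earlier in the deck, and with the 92x and 23.1x wall-clock speedups in the table.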
Slide35
LiMiT Enables Detailed Study
Short counter reads decrease perturbation
Little perturbation allows detailed study of
  Short synchronization regions
  Short function calls
Three Case Studies
  Synchronization in production web applications
    Not presented here, see paper
  Synchronization changes in MySQL over time
  User/Kernel code behavior in runtime libraries
Slide36
Case Study: LONGITUDINAL STUDY OF LOCKING BEHAVIOR IN MYSQL
Has MySQL gotten better since the advent of multi-cores?
Slide37
Evolution of Locking in MySQL
Questions to answer
  Has MySQL gotten better at locking?
  What techniques have been used?
Methodology
  Intercept pthread locking calls
  Count overheads and critical sections
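The methodology of intercepting lock calls and accounting separately for locking overhead and critical-section time can be illustrated with a wrapper. This is an analogy under assumptions, not the study's tooling: it wraps Python's `threading.Lock` instead of intercepting pthread calls, and it uses `time.perf_counter_ns` where the study read hardware counters. The class name `InstrumentedLock` and its fields are hypothetical.

```python
import threading
import time


class InstrumentedLock:
    """Lock wrapper that records acquisition overhead and time spent holding the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0  # number of times the lock was taken
        self.wait_ns = 0       # total time spent inside acquire()
        self.held_ns = 0       # total time spent in critical sections
        self._acquired_at = 0

    def __enter__(self):
        start = time.perf_counter_ns()
        self._lock.acquire()
        now = time.perf_counter_ns()
        # Statistics are updated while the lock is held, so updates do not race.
        self.wait_ns += now - start
        self.acquisitions += 1
        self._acquired_at = now
        return self

    def __exit__(self, exc_type, exc, tb):
        self.held_ns += time.perf_counter_ns() - self._acquired_at
        self._lock.release()
        return False  # do not swallow exceptions


lock = InstrumentedLock()
shared = []
for i in range(3):
    with lock:
        shared.append(i)  # the critical section being measured

print(lock.acquisitions, lock.wait_ns, lock.held_ns)
```

Each measurement here costs two clock reads per lock operation, which is the kind of per-read cost that makes short critical sections hard to study with heavyweight interfaces.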
Slide38
MySQL Synchronization Times
Slide39
MySQL Critical Sections
Slide40
Number of Locks in MySQL
Slide41
Observations & Implications
Coarser granularity, better performance
  Total critical section time has decreased
  Average CS times have increased
  Number of locks has decreased
Performance counters useful for software engineering studies
Slide42
Case Study: KERNEL/USERSPACE OVERHEADS IN RUNTIME LIBRARY
Does code in the kernel and runtime library behave?
Slide43
Full System Analysis w/o Simulation
Questions to answer
  How much time do system applications spend in runtime libraries?
  How well do they perform in them? Why?
Methodology
  Intercept common libc, libm and libpthread calls
  Count user-/kernel-space events during the calls
  Break down by purpose (I/O, Memory, Pthread)
Applications
  MySQL, Apache
Intel Nehalem Microarchitecture
Slide44
Execution Cycles in Library Calls
Slide45
MySQL Clocks per Instruction
Slide46
L3 Cache MPKI
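The two derived metrics named on these slides come from counter deltas taken around a region of interest. A minimal sketch, assuming the standard definitions (cycles divided by instructions, and misses per thousand instructions); the function names and the sample counts are hypothetical, not data from the study.

```python
def clocks_per_instruction(cycles, instructions):
    """CPI: elapsed cycles divided by retired instructions."""
    return cycles / instructions


def misses_per_kilo_instruction(misses, instructions):
    """MPKI: cache misses per 1,000 retired instructions."""
    return misses * 1000 / instructions


# Hypothetical counter deltas measured across one library call.
cycles, instructions, l3_misses = 3000, 2000, 4

print(clocks_per_instruction(cycles, instructions))
print(misses_per_kilo_instruction(l3_misses, instructions))
```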
Slide47
I-Cache Stall Cycles
22.4%
12.0%
Slide48
Observations & Implications
Apache is fundamentally I/O bound
  Optimization of the I/O subsystem necessary
Kernel code suffers from I-Cache stalls
  Speculation: bad interrupt instruction prefetching
LiMiT yields detailed performance data
  Not as accurate or detailed as simulation
  But gathered in hours rather than weeks
Slide49
Conclusions
Research Methodology Implications, Closing thoughts
Slide50
Conclusions
Implications from case studies
  MySQL’s multicore experience helped scalability
  Performance counting for non-architecture
  Libraries and kernels perform very differently
  I/O subsystems can be slow
Research Methodology
  LiMiT can provide detailed results quickly
  Simulators are more detailed but slow
  Opportunity to build microbenchmarks
    Identify bottlenecks with counters
    Verify representativeness with counters
    Then simulate
Slide51
Questions?
Slide52
Backup slides
Man down! Need backup!
Accuracy
Precision
Speed
Cost
Simulators
↑
↑
↓
↑
/
↓
Analytical Models
?
?
↑
↓
Prototype Hardware
↑
↑
↑
↑
Production
Hardware
↑
/
↓
↑
/
↓
↑
↓
Accuracy and Precision
are traded off
Production hardware provides performance counters
However, existing interfaces make accuracy/precision tradeoff
difficult
53
CASTL: Computer Architecture and Security Technologies Lab
Slide54
Sampling vs. LiMiT
[Figure: two timelines. Sampled program execution is interrupted every n cycles; LiMiT instrumented program execution is instrumented at the start of mutex_lock, mutex_unlock, and barrier_wait]
Slide55
Another process runs
[Figure: counters labeled Miles, Pushups, and Situps]
Slide56
Fix: Virtualization
[Figure: counters labeled Miles, Pushups, and Situps]
"30 Miles! I did pretty well today."
"No you didn’t."
Slide57
Avoiding Communication
[Figure: two sets of counters labeled Miles, Pushups, and Situps]
Slide58
LiMiT Operation
Slide59
RDTSC
Slide60
MySQL Instrumentation Overhead
Slide61
Case Study A: LOCKING IN WEB WORKLOADS
How does web-related software use locks?
Slide62
Locking on the Web
Questions to answer
  Is locking a significant concern?
  How can architects help?
  Are traditional benchmarks similar?
Methodology
  Intercept pthread mutex calls, time w/ LiMiT
Applications
  Firefox
  Apache
  MySQL
  PARSEC
Slide63
Execution Time by Region
Slide64
Locking Statistics
                                 Firefox   Apache   PARSEC   MySQL
Avg. Lock Held Time (cycles)     789       149      118      1076
Dynamic Locks per 10k Cycles     3.24      1.12     0.545    3.18
Static Locks                     57        1        17       13853
Slide65
Observations & Implications
Applications like Firefox and MySQL use locks differently from Apache and PARSEC
  Many notions of synchronization based on scientific computing probably don’t apply
Locking overheads up to 8 - 13%
  More efficient mechanisms may be helpful
  But, 13% is upper bound on speedup
MySQL has some very long critical sections
  Prime targets for micro-arch optimization
  If they run faster, MySQL scales better
Slide66
Hardware Enhancements
64-bit Reads and Writes
  Overflows are primary source of complexity
  64-bit counters w/ full read/write eliminates it
Destructive Reads
  Difference = 2 reads, store, load & subtract
  Destructive read difference = 2 reads
Combined Reads
  x86 counter read requires 2 instructions
  Combining should reduce overhead
AMD’s Lightweight Profiling Proposal
  Really good, depending on microarchitecture
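The destructive-read suggestion can be made concrete with a toy counter. This is a sketch of one reading of the slide, not a description of any real hardware: `EventCounter` and its `destructive_read` method (a read that also resets the counter to zero) are hypothetical, and the comments map each step to the operations the slide lists.

```python
class EventCounter:
    """Toy event counter; `destructive_read` is a hypothetical read-and-reset."""

    def __init__(self):
        self.value = 0

    def count(self, events):
        self.value += events

    def read(self):
        return self.value

    def destructive_read(self):
        value, self.value = self.value, 0
        return value


def measure_with_plain_reads(counter, work):
    start = counter.read()             # read 1, then store the start value
    work(counter)
    return counter.read() - start      # read 2, load the start value, subtract


def measure_with_destructive_reads(counter, work):
    counter.destructive_read()         # read 1 clears the counter; value discarded
    work(counter)
    return counter.destructive_read()  # read 2 already is the difference


def region(counter):
    counter.count(7)  # hypothetical region of interest that generates 7 events


a = EventCounter()
a.count(40)  # events that happened before the region of interest
b = EventCounter()
b.count(40)

print(measure_with_plain_reads(a, region), measure_with_destructive_reads(b, region))
```

Both approaches report the same difference; the destructive version simply drops the store, load, and subtract from the measured path.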