Control-Flow Decoupling. Rami Sheikh, James Tuck, and Eric Rotenberg. Department of Electrical and Computer Engineering, North Carolina State University. MICRO-45, 2012.
Slide1
Center for Efficient, Scalable, and Reliable Computing
Department of Electrical and Computer Engineering
North Carolina State University
Rami Sheikh, James Tuck, and Eric Rotenberg
Control-Flow Decoupling
Rami Sheikh © 2012 MICRO-45
Slide2
Single-thread Performance is Important
[Figure: processor generations annotated with 96, 128, 168, 192 and with pipeline depths of 15-cycle, 17-cycle, 14-17 cycles, 14-17 cycles (?)]
OoO scheduling window doubled
Pipeline depth remains high
*Source: Intel IDF presentations
14% - 33% generation-to-generation gains
*Source: AnandTech website
Slide3
Energy is Important
Slide4
ASTAR (Rivers)
Memory Latency Tolerance
Baseline uses ISL-TAGE predictor
[Figure: Energy Reduction, with values 63%, 65%, 67%, 68%, 69%, 67%, 65%]
Slide5
Better Branch Handling is Important
Improves performance
Reduces energy consumption
  Wrong path
  Preparing for recovery
  Recovery
Necessary catalyst for memory latency tolerance
Slide6
Interesting Observation
[Diagram labels: branch-slice, branch, control-dependent region]
Slide7
Control-Flow Decoupling
[Diagram labels, shown twice: branch-slice, branch, control-dependent region]
Slide8
Control-Flow Decoupling
[Diagram: the Original Loop (branch-slice, branch, control-dependent region) is split into CFD Loops. The first loop holds the branch-slice and Push_BQ; the second holds Branch_on_BQ and the control-dependent region.]
Generate a vector of predicates
BQ: 1 0 1 1 0 0 0 1 1 0 1 0
BQ drives fetch
Slide9
Control-Flow Decoupling
Original Loop
[Pipeline diagram: successive instructions, each flowing from IF to EX]
Problem #1: No fetch separation: need branch prediction
Problem #2: No mechanism to communicate predicates to the Fetch Unit
Slide10
Control-Flow Decoupling
CFD Loops
[Pipeline diagram: successive instructions, each flowing from IF to EX, with a BQ holding 1, 0, 1]
CFD provides:
Fetch Separation
Mechanism to communicate predicates to the Fetch Unit
Slide11
Control-Flow Decoupling: Example
SOPLEX: 31% contrib.
1 0 1 1 0 0 0 1 1 0 1 0
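The SOPLEX source shown on this slide did not survive extraction. As a stand-in, the sketch below applies the same transformation to a hypothetical loop. The original loop evaluates the condition and runs its control-dependent region in the same iteration; the CFD version splits this into a first loop that only computes predicates and pushes them, and a second loop that pops them to steer the control-dependent region. The function names, the condition `a[i] % 3 == 0`, and `MAX_TRIP_COUNT` are assumptions made for illustration, and the queue is a plain array standing in for the hardware BQ, in the spirit of the software-queue emulation described under the evaluation environment.

```c
#include <assert.h>

#define MAX_TRIP_COUNT 64   /* assumed upper bound for this sketch */

/* Original loop: branch-slice, branch, and control-dependent region
   all sit in the same iteration. */
long original_loop(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        int pred = (a[i] % 3 == 0);   /* branch-slice (hypothetical condition) */
        if (pred) {                   /* branch */
            sum += a[i] * 2;          /* control-dependent region */
        }
    }
    return sum;
}

/* CFD loops: the first loop generates the vector of predicates,
   the second loop consumes it in the same order. */
long cfd_loops(const int *a, int n) {
    unsigned char bq[MAX_TRIP_COUNT]; /* software stand-in for the BQ */
    long sum = 0;
    assert(n <= MAX_TRIP_COUNT);      /* no strip-mining in this sketch */
    for (int i = 0; i < n; i++) {
        bq[i] = (a[i] % 3 == 0);      /* stands in for Push_BQ */
    }
    for (int i = 0; i < n; i++) {
        if (bq[i]) {                  /* stands in for Branch_on_BQ */
            sum += a[i] * 2;
        }
    }
    return sum;
}
```

Calling both functions on `{3, 4, 9, 6, 5, 7, 8, 12, 15, 1, 18, 2}` gives the same sum, and the predicates pushed by the first loop are 1 0 1 1 0 0 0 1 1 0 1 0, the bit vector shown on the slide.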
Slide12
Agenda
Methodology
Control-flow classification
Control-flow decoupling (CFD)
Evaluation
Conclusion
Slide13
Methodology
Where do state-of-the-art branch predictors fall short?
4 benchmark suites (80 apps)
PIN with x86 binaries
27 out of 80 have misprediction rate >= 2%
6 of which had problems with cross-compiler
Remaining 21 apps contribute 78% of total MPKI (in the 80 apps)
Slide14
Control-Flow Classification
Classify targeted mispredictions into four classes
Hammock: If-conversion
Separable branches: Control-Flow Decoupling (CFD)
Inseparable branches (very serial): Other solutions required
Not analyzed
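For the hammock class, if-conversion removes the branch by turning the control dependence into a data dependence. The sketch below contrasts a hypothetical one-sided hammock with a branch-free form that a compiler could map onto a conditional move or a predicated instruction. The function names and the clamp computation are assumptions for illustration, and whether a particular compiler emits branch-free code for the second form is not guaranteed.

```c
#include <assert.h>

/* Hammock: a short conditional region that rejoins the main path
   immediately after the branch. */
int clamp_branchy(int x, int limit) {
    if (x > limit) {       /* branch */
        x = limit;         /* control-dependent region */
    }
    return x;
}

/* If-converted form: compute the predicate, then select between the
   two candidate values instead of branching. */
int clamp_if_converted(int x, int limit) {
    int pred = (x > limit);      /* predicate replaces the branch */
    return pred ? limit : x;     /* select, a candidate for a conditional move */
}
```

Both functions return the same value for every input; the difference lies only in how the control flow is expressed.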
Slide15
Control-Flow Classification
Classify targeted mispredictions into four classes
[Figure: breakdown into Hammock, Separable, Inseparable, Not analyzed]
Slide16
Agenda
Methodology
Control-flow classification
Control-flow decoupling (CFD)
Evaluation
Conclusion
Slide17
Control-Flow Decoupling
Targets separable branches with large, complex CD regions
ISA support
Software side
Hardware side
Slide18
ISA Support
BQ specification:
Size (N)
Content
Length
[Diagram: BQ of N elements, each a 1-bit flag <predicate>, with a Length register]
Two purposes:
Needed to save/restore BQ state
Flexible implementation (e.g., circular vs. shifting buffer)
Slide19
ISA Support
New instructions:
Push_BQ (push)
Branch_on_BQ (pop)
Push-Pop Ordering Rules
Rule #1: a push must precede its corresponding pop
Rule #2: N consecutive pushes must be followed by exactly N consecutive pops (same order)
Rule #3: N cannot exceed the BQ size
Align predicates with their corresponding branches
Prevent deadlock
[Timeline: push-1, push-2, ..., push-N, pop-1, pop-2, ..., pop-N, push-1, ...]
Slide20
Software Side
Working with finite-size BQ
CFD Loops:
for (large_trip_count) {body of first loop}
for (large_trip_count) {body of second loop}
Strip-mined CFD Loops:
for (large_trip_count/N) {
  for (1 ... N) {body of first loop}
  for (1 ... N) {body of second loop}
}
[Diagram: in both versions the first loop feeds the second through the BQ]
Slide21
Hardware Side
BQ implementation
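To make the finite-size BQ concrete, here is a small software model, assuming a circular buffer with head and tail indices and a length counter (one of the implementation options the ISA Support slide allows). `BQ_SIZE`, `bq_t`, `bq_push`, `bq_pop`, and the example loop are hypothetical names chosen for this sketch. The real Push_BQ and Branch_on_BQ are instructions, and hardware stalls a push on a full BQ where this model simply reports failure. The strip-mined loops push at most `BQ_SIZE` predicates before popping them, which keeps the program within Rule #3.

```c
#include <assert.h>

#define BQ_SIZE 128   /* N; 128 entries as in the evaluated configuration */

typedef struct {
    unsigned char entry[BQ_SIZE];  /* one predicate flag per entry */
    int head;                      /* next entry to pop */
    int tail;                      /* next free entry to push into */
    int length;                    /* number of valid entries */
} bq_t;

void bq_init(bq_t *q) { q->head = q->tail = q->length = 0; }

/* Push_BQ stand-in: returns 0 if the BQ is full (hardware would stall). */
int bq_push(bq_t *q, int pred) {
    if (q->length == BQ_SIZE) return 0;
    q->entry[q->tail] = (unsigned char)(pred != 0);
    q->tail = (q->tail + 1) % BQ_SIZE;
    q->length++;
    return 1;
}

/* Branch_on_BQ stand-in: returns 0 on an empty BQ, which in this
   sequential model would mean Rule #1 was violated. */
int bq_pop(bq_t *q, int *pred) {
    if (q->length == 0) return 0;
    *pred = q->entry[q->head];
    q->head = (q->head + 1) % BQ_SIZE;
    q->length--;
    return 1;
}

/* Strip-mined CFD loops over a trip count that may exceed BQ_SIZE. */
long strip_mined_cfd(const int *a, int n) {
    bq_t q;
    long sum = 0;
    bq_init(&q);
    for (int base = 0; base < n; base += BQ_SIZE) {
        int chunk = (n - base < BQ_SIZE) ? (n - base) : BQ_SIZE;
        for (int i = 0; i < chunk; i++) {        /* body of first loop */
            int ok = bq_push(&q, a[base + i] % 3 == 0);
            assert(ok);                          /* chunk <= BQ_SIZE, so never full */
        }
        for (int i = 0; i < chunk; i++) {        /* body of second loop */
            int pred = 0;
            int ok = bq_pop(&q, &pred);
            assert(ok);                          /* every pop follows its push */
            if (pred) sum += a[base + i];
        }
    }
    return sum;
}
```

Because each strip drains the queue completely before the next strip starts, the length counter returns to zero at every strip boundary, and the same model works for any trip count.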
Slide22
Hardware Side
Execution scenarios
BQ hit (Common Case)
BQ miss (Uncommon Case): Speculate or Stall
[Pipeline diagrams contrasting the slice and the branch between IF and EX for a BQ hit and a BQ miss]
Slide23
Hardware Side
BQ length
BQ size is N
[Diagram: Instruction Window over the timeline push-1, push-2, ..., push-N, pop-1, pop-2, ..., pop-N, push-1, ...; BQ Length ranges from 0 to N]
Slide24
Hardware Side
BQ length
BQ size is N
[Diagram: Instruction Window over the timeline push-1, push-2, ..., push-N, pop-1, pop-2, ..., pop-N, push-1, ...; BQ Length is N]
Stall push-1: BQ is full
Slide25
Hardware Side
BQ length
BQ size is N
[Diagram: Instruction Window over the timeline push-1, push-2, ..., push-N, pop-1, pop-2, ..., pop-N, push-1, ...; BQ Length is N - 1]
Unstall push-1
Slide26
Hardware Side
BQ length
BQ size is N
[Diagram: Instruction Window over the timeline push-1, push-2, ..., push-N, pop-1, pop-2, ..., pop-N, push-1, ...; BQ Length is N]
Slide27
Hardware Side
BQ recovery
Checkpoint: RMT, … etc
Committed State: AMT, … etc
Slide28
Hardware Side
BQ recovery
Checkpoint: RMT, … etc, BQ head ptr, BQ tail ptr
Committed State: AMT, … etc, Arch. BQ head ptr, Arch. BQ tail ptr
Slide29
Other Interesting Aspects of CFD
Supports partially separable branches
[Diagram labels: branch-slice, branch, control-dependent region (repeated)]
Slide30
Other Interesting Aspects of CFD
Supports partially separable branches
[Diagram labels, original code: branch-slice, branch, control-dependent region. CFD code: branch-slice, Push_BQ, if-converted hammock, Branch_on_BQ, control-dependent region]
Slide31
Other Interesting Aspects of CFD
Works with nested branches:
Combine predicates (if safe)
Multi-level decoupling
CFD overheads can be reduced through value communication (see CFD+ in the paper)
Slide32
Agenda
Methodology
Control-flow classification
Control-flow decoupling (CFD)
Evaluation
Conclusion
Slide33
Evaluation Environment
Simulator
In-house detailed execution-driven, execute-at-execute, cycle-level Alpha simulator
CFD microarchitecture is faithfully modeled
McPAT and CACTI are used to measure energy consumption
Benchmarks
Compiled with gcc and -O3 level optimization
Modified benchmarks are validated by compiling and running to completion on x86 host (emulate BQ with software queue)
When simulating modified binaries, we simulate as many retired instructions as needed in order to perform the same amount of work as the unmodified binaries.
Slide34
Evaluation Environment
Baseline
Branch Prediction
BP: 64KB ISL-TAGE predictor
- 16 tables: 1 bimodal, 15 partially-tagged, in addition to IUM, SC, and LP
- History lengths: {0, 3, 8, 12, 17, 33, 35, 67, 97, 138, 195, 330, 517, 1193, 1741, 1930}
BTB: 4K entries, 4-way set-associative
RAS: 64 entries
Memory Hierarchy
Block size: 64B
Victim caches: each cache has a 16-entry FA victim cache
L1: split, 64KB each, 4-way set-associative, 1-cycle access latency
L2: unified, private for each core, 512KB, 8-way set-associative, 20-cycle access latency
- L2 stream prefetcher: 4 streams, each of depth 16
L3: unified, shared among cores, 8MB, 16-way set-associative, 40-cycle access latency
Memory: 200-cycle access latency
Fetch/Issue/Retire Width: 4 instr./cycle
ROB/IQ/LDQ/STQ: 168/54/64/36 (modeled after Sandy Bridge)
Fetch-to-Execute Latency: 10-cycle
Physical RF: 236
Checkpoints: 8, OoO reclamation, confidence estimator (8K entries, 4-bit resetting counter, gshare index)
CFD
BQ: 96B (128 6-bit entries)
VQ renamer: 128B (128 8-bit entries)
Slide35
Results
Slide36
Results
Slide37
Results – Sensitivity Study
Fetch-to-execute depth
Bobcat/Power6: GeoMean = 1.16
Cortex A15: GeoMean = 1.18
Pentium 4: GeoMean = 1.22
Slide38
Results – Manual vs. Automated
Slide39
Conclusion
State-of-the-art branch predictors have limitations
A third of mispredictions come from separable branches
CFD is a software/hardware collaboration for exploiting separability with low complexity and high efficacy
CFD is comparable to if-conversion in terms of number of static branches and MPKI contribution
Slide40
Thanks!
Questions?