Slide1
DeNovo: A Software-Driven Rethinking of the Memory Hierarchy
Sarita Adve, Vikram Adve, Rob Bocchino, Nicholas Carter, Byn Choi, Ching-Tsun Chou, Stephen Heumann, Nima Honarmand, Rakesh Komuravelli, Maria Kotsifakou, Tatiana Shpeisman, Matthew Sinclair, Robert Smolinski, Prakalp Srivastava, Hyojin Sung, Adam Welc
University of Illinois at Urbana-Champaign, Intel
denovo@cs.illinois.edu
Slide2
Silver Bullets for the Energy Crisis?
Parallelism? Specialization, heterogeneity, ...? BUT: large impact on software, hardware, and the hardware-software interface.
Slide3
Multicore Parallelism: Current Practice
Multicore parallelism today: shared memory.
- Complex, power- and performance-inefficient hardware: complex directory coherence, unnecessary traffic, ...
- Difficult programming model: data races, non-determinism, composability?, testing?
- Mismatched interface between HW and SW, a.k.a. the memory model: can't specify "what value a read can return"; data races defy acceptable semantics
Fundamentally broken for hardware & software.
Slide4
Specialization/Heterogeneity: Current Practice
A modern smartphone: CPU, GPU, DSP, vector units, multimedia and audio-video accelerators.
- 6 different ISAs
- 7 different parallelism models
- Incompatible memory systems
Even more broken.
Slide5
Energy Crisis Demands Rethinking HW, SW
How to (co-)design software? Hardware? The HW/SW interface?
- Deterministic Parallel Java (DPJ)
- DeNovo
- Virtual Instruction Set Computing (VISC)
Today: focus on homogeneous parallelism.
Slides 6-8 (build on Slide 3)
Multicore Parallelism: Current Practice
Shared-memory multicore parallelism today is fundamentally broken for hardware & software.
Banish shared memory? No: banish wild shared memory! We need disciplined shared memory!
Slides 9-12
What is Shared-Memory?
(Wild) shared-memory = global address space + implicit, anywhere communication and synchronization.
Slide13
What is Shared-Memory?
Disciplined shared-memory = global address space + explicit, structured side-effects (replacing implicit, anywhere communication and synchronization).
- How to build disciplined shared-memory software?
- If software is more disciplined, can hardware be more efficient?
Slide14
Our Approach
Simple programming model AND efficient hardware: disciplined shared memory (explicit effects + structured parallel control).
- Deterministic Parallel Java (DPJ): strong safety properties; no data races, determinism-by-default, safe non-determinism; simple semantics, safety, and composability
- DeNovo: complexity-, performance-, and power-efficiency; simplify coherence and consistency; optimize communication and data storage
Slide15
Key Milestones
Software (DPJ):
- Determinism: OOPSLA'09
- Disciplined non-determinism: POPL'11
- Unstructured synchronization; legacy, OS (ongoing)
Hardware (DeNovo):
- Coherence, consistency, communication: PACT'11 (best paper)
- DeNovoND: ASPLOS'13 & IEEE Micro Top Picks '14
- DeNovoSynch (in review)
- Storage; a language-oblivious virtual ISA (ongoing)
Slide16
Current Hardware Limitations
- Complexity: subtle races and numerous transient states in the protocol; hard to verify and extend for optimizations
- Storage overhead: directory overhead for sharer lists
- Performance and power inefficiencies: invalidation and ack messages; indirection through the directory; false sharing (cache-line-based coherence); bandwidth waste (cache-line-based communication); cache pollution (cache-line-based allocation)
Slide17
Results for Deterministic Codes (Base DeNovo)
- Complexity: no transient states; simple to extend for optimizations (20X faster to verify vs. MESI)
- Still to address: directory overhead for sharer lists; invalidation and ack messages; indirection through the directory; false sharing (cache-line-based coherence); bandwidth waste (cache-line-based communication); cache pollution (cache-line-based allocation)
Slide18
Results for Deterministic Codes (Base DeNovo)
- Complexity: no transient states; simple to extend for optimizations (20X faster to verify vs. MESI)
- Storage overhead: no storage overhead for directory information
- Still to address: invalidation and ack messages; indirection through the directory; false sharing; bandwidth waste; cache pollution (all cache-line based)
Slide19
Results for Deterministic Codes (Base DeNovo)
- Complexity: no transient states; simple to extend for optimizations (20X faster to verify vs. MESI)
- Storage overhead: no storage overhead for directory information
- Performance and power: no invalidation or ack messages; no indirection through the directory; no false sharing (region-based coherence); region, not cache-line, communication; region, not cache-line, allocation (ongoing)
- Up to 77% lower memory stall time; up to 71% lower traffic
Slide20
DPJ Overview
- Structured parallel control: fork-join parallelism
- Region: name for a set of memory locations; assign a region to each field and array cell
- Effect: read or write on a region; summarize the effects of method bodies
- Compiler: simple type check; region types consistent, effect summaries correct, parallel tasks don't interfere (race-free)
[Figure residue: heap accessed by parallel ST/LD tasks.]
Type-checked programs are guaranteed deterministic (sequential semantics).
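The compiler's non-interference check can be illustrated with a small sketch (plain Python, not DPJ; the effect-summary encoding and the `interferes` rule are simplified assumptions for illustration):

```python
# Hypothetical model of DPJ-style effect summaries: each parallel task
# summarizes its effects as a set of (kind, region) pairs, and the
# checker rejects the phase if any two tasks interfere.

def interferes(e1, e2):
    """Two effects conflict if they touch the same region
    and at least one of them is a write."""
    return e1[1] == e2[1] and "write" in (e1[0], e2[0])

def check_parallel_tasks(tasks):
    """tasks: one effect summary per parallel task.
    Returns True iff no pair of distinct tasks interferes (race-free)."""
    for i in range(len(tasks)):
        for j in range(i + 1, len(tasks)):
            for e1 in tasks[i]:
                for e2 in tasks[j]:
                    if interferes(e1, e2):
                        return False
    return True

# Two tasks writing disjoint regions: race-free, hence deterministic.
ok = check_parallel_tasks([{("write", "X0")}, {("write", "X1")}])
# A write and a read on the same region: interference, rejected.
bad = check_parallel_tasks([{("write", "X0")}, {("read", "X0")}])
print(ok, bad)
```

In DPJ this check happens statically via the type system, not at runtime; the sketch only shows the interference rule itself.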
Slide21
Memory Consistency Model
Guaranteed determinism: a read returns the value of the last write in sequential order, either from the same task in this parallel phase or from before this parallel phase.
[Figure residue: parallel phase with ST 0xa / LD 0xa; the coherence mechanism enforces this.]
Slide22
Cache Coherence
- Coherence enforcement: invalidate stale copies in caches; track one up-to-date copy
- Explicit effects: the compiler knows all writeable regions in this parallel phase; a cache can self-invalidate before the next parallel phase, invalidating data in writeable regions it did not access itself
- Registration: the directory keeps track of the one up-to-date copy; the writer updates it before the next parallel phase
Slide23
Basic DeNovo Coherence [PACT'11]
Assume (for now): private L1, shared L2; single-word line; data-race freedom at word granularity.
- No transient states
- No invalidation traffic, no false sharing
- No directory storage overhead: the L2 data arrays double as the directory ("registry"), keeping either valid data or the registered core id
- Touched bit: set if the word was read in the phase
Per-word states: Invalid, Valid, Registered. Reads take Invalid to Valid; writes take Invalid or Valid to Registered; Registered words stay Registered on reads and writes.
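A minimal sketch of the three-state protocol, assuming the per-word transitions shown on the slide (a Python model for illustration, not the hardware):

```python
# Three-state DeNovo word protocol: Invalid / Valid / Registered.
# Every transition completes atomically in this model: no transient states.

TRANSITIONS = {
    ("Invalid", "read"): "Valid",        # fetch the up-to-date copy
    ("Invalid", "write"): "Registered",  # register at the directory (L2)
    ("Valid", "read"): "Valid",
    ("Valid", "write"): "Registered",
    ("Registered", "read"): "Registered",
    ("Registered", "write"): "Registered",
}

def step(state, access):
    return TRANSITIONS[(state, access)]

def self_invalidate(state, touched):
    """At a phase boundary, drop data that may be stale.
    Registered data is the core's own latest write, and touched data
    was read this phase and is still up to date: both are kept."""
    if state == "Valid" and not touched:
        return "Invalid"
    return state

s = step("Invalid", "read")   # word becomes Valid
s = step(s, "write")          # word becomes Registered
print(s, self_invalidate("Valid", touched=False))
```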
Slide24
Example Run

    class S_type {
        X in DeNovo-region ;
        Y in DeNovo-region ;
    }
    S_type S[size];
    ...
    Phase1 writes  { // DeNovo effect
        foreach i in 0, size {
            S[i].X = ...;
        }
        self_invalidate( );
    }

State legend: R = Registered, V = Valid, I = Invalid.

[Animation residue: word-state tables for L1 of Core 1, L1 of Core 2, and the shared L2.] All copies start Valid. During Phase1, Core 1 writes X0-X2 and Core 2 writes X3-X5; each write sends a registration to the shared L2, which acks and records the registered core id in place of the data:

    L1 of Core 1: R X0, R X1, R X2 | V X3, V X4, V X5 (Y0-Y5 all V)
    L1 of Core 2: V X0, V X1, V X2 | R X3, R X4, R X5 (Y0-Y5 all V)
    Shared L2:    R C1, R C1, R C1 | R C2, R C2, R C2 (Y0-Y5 all V)

At the end of the phase, each core self-invalidates the words of the written region that it did not register itself:

    L1 of Core 1: R X0-X2, I X3-X5 (Y words stay V)
    L1 of Core 2: I X0-X2, R X3-X5 (Y words stay V)
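The run above can be replayed with a toy model (an illustrative Python sketch; the registration and self-invalidation logic is simplified from the slide, not the real protocol):

```python
# Two cores write disjoint halves of region X, register with the shared
# L2, then self-invalidate the written region's words they do not own.

WORDS = [f"X{i}" for i in range(6)]

l1 = {1: {w: "V" for w in WORDS}, 2: {w: "V" for w in WORDS}}
l2_owner = {}  # word -> registered core id (L2 data array reused as registry)

def write(core, word):
    l1[core][word] = "R"   # own copy becomes Registered
    l2_owner[word] = core  # registration at L2, acked

def self_invalidate(core, written_region):
    # drop words of the written region this core did not register
    for w in written_region:
        if l1[core][w] != "R":
            l1[core][w] = "I"

for w in WORDS[:3]:
    write(1, w)            # Core 1 writes X0-X2
for w in WORDS[3:]:
    write(2, w)            # Core 2 writes X3-X5
self_invalidate(1, WORDS)
self_invalidate(2, WORDS)

print(l1[1])  # X0-X2 Registered, X3-X5 Invalid
print(l1[2])  # X0-X2 Invalid, X3-X5 Registered
```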
Slide25
Decoupling Coherence and Tag Granularity
- The basic protocol has a tag per word
- DeNovo line-based protocol: allocation/transfer granularity > coherence granularity
- Allocate and transfer a cache line at a time; coherence granularity stays at the word
- No word-level false sharing ("line merging")
[Figure residue: a cache line with per-word V/V/R states under one tag.]
Slide26
Current Hardware Limitations
- Complexity: subtle races and numerous transient states in the protocol; hard to extend for optimizations
- Storage overhead: directory overhead for sharer lists (makes up for new bits at ~20 cores)
- Performance and power inefficiencies: invalidation and ack messages; indirection through the directory; false sharing (cache-line-based coherence); traffic (cache-line-based communication); cache pollution (cache-line-based allocation)
(Checkmarks on the slide mark the four limitations addressed so far.)
Slide27
Flexible, Direct Communication
Insights:
1. A traditional directory must be updated at every transfer; DeNovo can copy valid data around freely.
2. Traditional systems send a cache line at a time; DeNovo uses regions to transfer only the relevant data.
Effect of an AoS-to-SoA transformation without programmer/compiler involvement.
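A toy sketch of insight 2, assuming an illustrative AoS line layout (X/Y/Z words interleaved) and region map; this shows the idea of region-based transfer, not the actual mechanism:

```python
# On a miss, ship only the words in the requested word's region,
# instead of the whole cache line.

line = ["X3", "Y3", "Z3", "X4", "Y4", "Z4", "X5", "Y5", "Z5"]  # AoS line
region_of = {w: w[0] for w in line}  # regions "X", "Y", "Z"

def line_transfer(miss_word):
    """MESI-style: the whole line travels, relevant or not."""
    return list(line)

def region_transfer(miss_word):
    """DeNovo-style: only words in the missed word's region travel."""
    r = region_of[miss_word]
    return [w for w in line if region_of[w] == r]

print(line_transfer("X3"))    # all 9 words
print(region_transfer("X3"))  # only the X-region words
```

A load of X3 thus pulls in X3-X5 but none of the unused Y/Z words, which is the AoS-to-SoA bandwidth effect without any source-level transformation.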
Slides 28-29
Flexible, Direct Communication
[Animation residue: word-state tables for L1 of Core 1, L1 of Core 2, and the shared L2 (R = Registered, V = Valid, I = Invalid) over words X0-X5, Y0-Y5, Z0-Z5.] On Core 1's LD X3 miss, the response is served without directory indirection and carries only the X-region words X3, X4, X5, not the full lines with their Y and Z words; Core 1's X3-X5 go from Invalid to Valid while its Y/Z words are untouched.
Slide30
Current Hardware Limitations
- Complexity: subtle races and numerous transient states in the protocol; hard to extend for optimizations ✔
- Storage overhead: directory overhead for sharer lists (makes up for new bits at ~20 cores) ✔
- Performance and power inefficiencies: invalidation and ack messages ✔; indirection through the directory ✔; false sharing (cache-line-based coherence) ✔; traffic (cache-line-based communication) ✔; cache pollution (cache-line-based allocation) ✔
Stash = cache + scratchpad: another talk.
Slide31
Evaluation
Verification: DeNovo vs. MESI (word) with the Murphi model checker.
- Correctness: six bugs in the MESI protocol (difficult to find and fix); three bugs in the DeNovo protocol (simple to fix)
- Complexity: 15x fewer reachable states for DeNovo; 20x difference in runtime
Performance: Simics + GEMS + Garnet; 64 cores, simple in-order core model.
Workloads: FFT, LU, Barnes-Hut, and radix from SPLASH-2; bodytrack and fluidanimate from PARSEC 2.1; kd-Tree (two versions) [HPG 09].
Slide32
Memory Stall Time for MESI vs. DeNovo
- DeNovo is comparable to or better than MESI
- DeNovo + opts shows 32% lower memory stalls vs. MESI (max 77%)
[Chart residue: memory stall time for FFT, LU, Barnes-Hut, kd-false, kd-padded, bodytrack, fluidanimate, radix; M = MESI, D = DeNovo, Dopt = DeNovo+Opt.]
Slide33
Network Traffic for MESI vs. DeNovo
- DeNovo has 36% less traffic than MESI (max 71%)
[Chart residue: network traffic for FFT, LU, Barnes-Hut, kd-false, kd-padded, bodytrack, fluidanimate, radix; M = MESI, D = DeNovo, Dopt = DeNovo+Opt.]
Slide34
Key Milestones (roadmap recap; same content as Slide 15)
Slide35
DPJ Support for Disciplined Non-Determinism
- Non-determinism comes from conflicting concurrent accesses
- Isolate interfering accesses as atomic: enclosed in atomic sections; atomic regions and effects
- Disciplined non-determinism: race freedom, strong isolation; determinism-by-default semantics
- DeNovoND converts atomic statements into locks
[Figure residue: parallel ST/LD tasks.]
Slide36
Memory Consistency Model
A non-deterministic read returns the value of the last write from:
- before this parallel phase,
- or the same task in this phase,
- or a preceding critical section of the same lock.
Self-invalidations as before.
[Figure residue: parallel phase with a critical section on a single core; ST 0xa / LD 0xa.]
Slide37
Coherence for Non-Deterministic Data
- When to invalidate? Between the start of a critical section and the read
- What to invalidate? The entire cache? Regions with "atomic" effects? Instead: track atomic writes in a signature, transferred with the lock
- Registration: the writer updates before the next critical section
- Coherence enforcement: invalidate stale copies in the private cache; track the up-to-date copy
Slide38
Tracking Data Write Signatures
- A small Bloom filter per core tracks the write signature
- Only track atomic effects; only 256 bits suffice
- Operations on the Bloom filter: on write, insert the address; on read, query the filter for the address for self-invalidation
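A sketch of such a write signature, assuming two illustrative hash functions over word addresses (the design's actual hashes are not specified here):

```python
# 256-bit Bloom-filter write signature, per the slide's "256 bits suffice".
# Bloom filters admit false positives (extra self-invalidations, safe)
# but never false negatives (a tracked write is never missed).

FILTER_BITS = 256

def _hashes(addr):
    # two cheap, roughly independent hashes over the word address
    return (addr * 2654435761 % FILTER_BITS,
            (addr ^ (addr >> 7)) * 40503 % FILTER_BITS)

class WriteSignature:
    def __init__(self):
        self.bits = 0
    def insert(self, addr):          # on an atomic write
        for h in _hashes(addr):
            self.bits |= 1 << h
    def maybe_written(self, addr):   # on read: query for self-invalidation
        return all(self.bits >> h & 1 for h in _hashes(addr))
    def reset(self):                 # after self-invalidation
        self.bits = 0

sig = WriteSignature()
sig.insert(0xa0)
print(sig.maybe_written(0xa0))  # no false negatives: always True
```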
Slide39
Distributed Queue-based Locks
- A lock primitive that works on DeNovoND: no sharer list, no write invalidation, no spinning for the lock
- Modeled after the QOSB lock [Goodman et al. '89]: lock requests form a distributed queue, but much simpler
- Details in ASPLOS'13
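A software sketch of the queueing idea (an assumption-laden Python model; the actual DeNovoND lock is a hardware mechanism described in ASPLOS'13):

```python
from collections import deque

# Queue-based lock in the spirit of QOSB: requesters enqueue instead of
# spinning, and release hands the lock directly to the next waiter.

class QueueLock:
    def __init__(self):
        self.holder = None
        self.waiters = deque()      # requests form a (distributed) queue
    def acquire(self, core):
        if self.holder is None:
            self.holder = core      # lock granted immediately
            return True
        self.waiters.append(core)   # enqueue; no spinning, no sharer list
        return False
    def release(self):
        # hand the lock (and, in DeNovoND, the piggybacked write
        # signature) directly to the next queued core
        self.holder = self.waiters.popleft() if self.waiters else None
        return self.holder

lock = QueueLock()
lock.acquire(1)        # core 1 holds the lock
lock.acquire(2)        # core 2 queues behind it
print(lock.release())  # lock transfers directly to core 2
```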
Slide40lock transfer
V
X
R
Y
V
Z
V
W
V
X
R
Y
I
Z
V
W
V
X
R
Y
V
Z
V
W
R
X
V
Y
V
Z
V
W
R
X
V
Y
R
Z
V
W
R
C1
R
C2
V
Z
V
W
R
C1
R
C2
R
C1
V
W
lock transfer
Example Run
ST
LD
ST
.
.
self-invalidate( )
L1 of Core 1
L1 of Core 2
Shared L2
Z W
Registration
Ack
Read miss
X in
DeNovo
-region
Y in
DeNovo
-region
Z in
atomic
DeNovo
-region
W in
atomic
DeNovo
-region
LD
V
X
R
Y
V
Z
R
W
R
X
V
Y
R
Z
I
W
Registration
R
C1
R
C2
R
C1
R
C
2
Ack
Read miss
R
X
V
Y
R
Z
V
W
self-invalidate( )
reset filter
R
X
V
Y
R
Z
I
W
V
X
R
Y
I
Z
R
W
Z W
Slide41
Optimizations to Reduce Self-Invalidation
- Loads in Registered state don't self-invalidate
- Touched-atomic bit: set on the first atomic load; subsequent loads don't self-invalidate
Regions: X, Y in DeNovo-region; Z, W in atomic DeNovo-region.
[Figure residue: ST/LD sequence with self-invalidate.]
Slide42
Overheads
- Hardware: Bloom filter, 256 bits per core
- Storage: one additional state, but no storage overhead (2 bits); touched-atomic bit per word in L1
- Communication: Bloom filter piggybacked on the lock-transfer message; writeback messages for locks; lock writebacks carry more info
Slide43
Evaluation of MESI vs. DeNovoND (16 cores)
- DeNovoND execution time is comparable to or better than MESI
- DeNovoND has 33% less traffic than MESI (67% max): no invalidation traffic; reduced load misses due to the lack of false sharing
Slide44
Key Milestones (roadmap recap; same content as Slide 15)
Slide45
Unstructured Synchronization
- Many programs (libraries) use unstructured synchronization, e.g., non-blocking, wait-free constructs
- Arbitrary synchronization races; several reads and writes
- Data ordered by such synchronization may still be disciplined: use static or signature-driven self-invalidations
- But what about the synchronization accesses themselves?
Slide46
Unstructured Synchronization
- Memory model: sequential consistency
- What to invalidate, when to invalidate? Every read? Every read to non-registered state
- Register reads (to enable future hits)
- Concurrent readers? Back off (delay) read registration
Slide47
Unstructured Synch: Execution Time (64 cores)
DeNovoSync reduces execution time by 28% over MESI (max 49%).
Slide48
Unstructured Synch: Network Traffic (64 cores)
- DeNovo reduces traffic by 44% vs. MESI (max 61%) for 11 of 12 cases
- Centralized barrier: many concurrent readers hurt DeNovo (and MESI); a tree barrier should be used even with MESI
Slide49
Key Milestones (roadmap recap; same content as Slide 15)
Slide50
Conclusions and Future Work (1 of 3)
Simple programming model AND efficient hardware: disciplined shared memory (explicit effects + structured parallel control).
- Deterministic Parallel Java (DPJ): strong safety properties; no data races, determinism-by-default, safe non-determinism; simple semantics, safety, and composability
- DeNovo: complexity-, performance-, and power-efficiency; simplify coherence and consistency; optimize communication and storage
Slide51
Conclusions and Future Work (2 of 3)
DeNovo rethinks hardware for disciplined models. For deterministic codes:
- Complexity: no transient states (20X faster to verify than MESI); extensible, with optimizations requiring no new states
- Storage: no directory overhead
- Performance and power: no invalidations, acks, false sharing, or indirection; flexible, not cache-line, communication; up to 77% lower memory stall time, up to 71% lower traffic
Added safe non-determinism and unstructured synchronization.
Slide52
Conclusions and Future Work (3 of 3)
- Broaden the software supported: OS, legacy, ...
- Region-driven memory hierarchy: also applies to heterogeneous memory; global address space; region-driven coherence, communication, layout; stash = best of cache and scratchpad
- Hardware/software interface: language-neutral virtual ISA
Parallelism and specialization may solve the energy crisis, but they require rethinking software, hardware, and the interface.