Slide 1: DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism

Byn Choi, Nima Honarmand, Rakesh Komuravelli, Robert Smolinski, Hyojin Sung, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, Ching-Tsun Chou

University of Illinois at Urbana-Champaign; Intel
Slide 2: Motivation

Goal: power-, complexity-, and performance-scalable hardware

Today: shared memory
- Directory-based coherence: complex and unscalable
- Difficult programming model: data races, non-determinism, no safety/composability/modularity
- Mismatched interface between HW and SW, a.k.a. the memory model:
  - can't specify "what value can a read return"
  - data races defy acceptable semantics

Fundamentally broken for hardware and software.
Slide 3: Solution? Banish shared memory?

"As you scale the number of cores on a cache-coherent system (CC), 'cost' in 'time and memory' grows to a point beyond which the additional cores are not useful in a single parallel program. This is the coherency wall."

- "A case for message-passing for many-core computing," Timothy Mattson et al.
Slide 4: Solution

The problems are not inherent to the shared-memory paradigm.

Shared memory = global address space + implicit, unstructured communication and synchronization
Slide 5: Solution

Banish wild shared memory!

Wild shared memory = global address space + implicit, unstructured communication and synchronization
Slide 6: Solution

Build disciplined shared memory!

Disciplined shared memory = global address space + explicit, structured side-effects (replacing implicit, unstructured communication and synchronization)
Slide 7: Disciplined Shared Memory

Disciplined shared memory = explicit effects + structured parallel control

- No data races, determinism-by-default, safe non-determinism
  → simple semantics, safety, and composability
- Simple coherence and consistency
- Software-aware address/communication/coherence granularity
  → power-, complexity-, performance-scalable HW
Slide 8: Research Strategy: Software Scope

- Many systems provide disciplined shared-memory features
- Current driver is DPJ (Deterministic Parallel Java): determinism-by-default, safe non-determinism
- End goal is a language-oblivious interface
- Current focus on deterministic codes: the common and best case
- Extend later to safe non-determinism and legacy codes
Slide 9: Research Strategy: Hardware Scope

Today's focus: cache coherence

Limitations of current directory-based protocols:
- Complexity: subtle races and numerous transient states in the protocol; hard to extend for optimizations
- Storage overhead: directory overhead for sharer lists
- Performance and power inefficiencies: invalidation and ack messages, false sharing, indirection through the directory, suboptimal communication granularity of a cache line, ...

Ongoing: rethinking the communication architecture and data layout
Slides 10-14: Contributions

- Simplicity: compared protocol complexity with MESI; 25x fewer reachable states in model checking
- Extensibility: direct cache-to-cache transfer; flexible communication granularity
- Storage overhead: no storage overhead for directory information; storage overheads beat MESI after tens of cores and keep scaling
- Performance/power: up to 73% reduction in memory stall time; up to 70% reduction in network traffic
Slide 15: Outline

- Motivation
- Background: DPJ
- DeNovo Protocol
- DeNovo Extensions
- Evaluation: protocol verification, performance
- Conclusion and Future Work
Slide 16: Background: DPJ

- Extension for modern OO languages
- Structured parallel control: nested fork-join style (foreach, cobegin)
- Type and effect system ensures non-interference:
  - Region names partition the heap
  - Effects declare which regions each method reads or writes
  - Type checking ensures race-freedom for parallel tasks → deterministic execution
- Recent support for safe non-determinism
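For concreteness, a plain-Java sketch of these ideas; the region and effect annotations (shown as comments) are DPJ syntax, not legal Java, and the class and region names here are hypothetical:

    // A plain-Java sketch; DPJ's region/effect annotations appear as
    // comments since they are not legal Java. Names are hypothetical.
    class Body {
        double mass;   // DPJ: "in Masses" -- region names partition the heap
        double force;  // DPJ: "in Forces"

        // DPJ: "writes Forces" -- the method's declared effect
        void addForce(double f) { force += f; }

        // DPJ: "reads Masses"
        double readMass() { return mass; }
    }
    // The type checker proves parallel tasks' effects don't interfere:
    // a foreach over distinct Body objects that only writes Forces is
    // race-free, so execution is deterministic.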
Slides 17-18: Memory Consistency Model

Guaranteed determinism: a read returns the value of the last write in sequential order, which comes either
- from the same task in this parallel phase, or
- from before this parallel phase.

[Figure: a parallel phase; an LD 0xa in one task may see the ST 0xa earlier in the same task, or the ST 0xa from before the phase, which the coherence mechanism must deliver.]
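A runnable Java sketch of the guarantee, using a parallel stream as a stand-in for a DPJ foreach phase (names hypothetical): each task writes only its own element, so every read in the phase sees either its own task's write or a pre-phase write, and the result is deterministic.

    import java.util.stream.IntStream;

    // Runnable sketch: the parallel phase writes x[i] in task i only, and
    // reads of y return pre-phase values, so the outcome is deterministic.
    class PhaseDemo {
        static int[] x = new int[6], y = new int[6];
        public static void main(String[] args) {
            for (int i = 0; i < 6; i++) y[i] = i;      // ST before the phase
            IntStream.range(0, 6).parallel()           // parallel phase (foreach)
                     .forEach(i -> x[i] = y[i] + 1);   // LD y: last pre-phase write
            System.out.println(x[5]);                  // always prints 6
        }
    }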
Slides 19-21: Cache Coherence

Coherence enforcement:
- Invalidate stale copies in caches
- Track the up-to-date copy

Explicit effects:
- The compiler knows all regions written in this parallel phase
- A cache can therefore self-invalidate before the next parallel phase: it invalidates data in writeable regions that it did not access itself (see the sketch below)

Registration:
- The directory keeps track of one up-to-date copy
- The writer registers it before the next parallel phase
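A minimal sketch of the self-invalidation step, assuming per-word region IDs and a touched bit in the L1 (types and names hypothetical):

    // End-of-phase self-invalidation: drop Valid words whose region was
    // written this phase but which this core did not itself touch.
    enum State { INVALID, VALID, REGISTERED }

    class L1Word {
        State state;
        int region;        // region ID recorded for this word
        boolean touched;   // set only if read in the current phase
    }

    class L1Cache {
        L1Word[] words;
        // writeEffects: region IDs the compiler reports as written this phase
        void selfInvalidate(java.util.Set<Integer> writeEffects) {
            for (L1Word w : words) {
                if (w.state == State.VALID
                        && writeEffects.contains(w.region) && !w.touched)
                    w.state = State.INVALID;   // may be stale next phase
                w.touched = false;             // reset for the next phase
            }
        }
    }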
Slide 22: Basic DeNovo Coherence

Assume (for now): private L1, shared L2; single-word lines; data-race freedom at word granularity.

- L2 data arrays double as the directory: each entry keeps either valid data or the registered core's ID, so there is no space overhead
- L1/L2 states: Invalid, Valid, Registered
- "Touched" bit: set only if the word is read in the phase

[State diagram: Invalid → Valid on a read; Invalid or Valid → Registered on a write, which registers the writer with the L2.]
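A minimal sketch of the registration idea, reusing the State enum from the sketch above (names hypothetical): the L2 word's data bits double as the directory entry.

    // L2 word: the data bits double as the directory entry, so keeping one
    // up-to-date copy costs no extra storage.
    class L2Word {
        State state;   // INVALID, VALID, or REGISTERED
        int payload;   // data when VALID; registered core's ID when REGISTERED

        // Writer registers before the next parallel phase.
        void register(int coreId) { state = State.REGISTERED; payload = coreId; }

        // Read request: serve data locally, or forward to the registered core.
        void onRead(int requester) {
            if (state == State.VALID) sendData(requester, payload);
            else if (state == State.REGISTERED) forward(payload, requester);
        }
        void sendData(int to, int data) { /* network reply, elided */ }
        void forward(int owner, int requester) { /* extra hop to owner, elided */ }
    }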
Slide 23: Example Run

    class S_type {
        X in DeNovo-region ;
        Y in DeNovo-region ;
    }
    S_type S[size];
    ...
    Phase1 writes { // DeNovo effect
        foreach i in 0, size {
            S[i].X = ...;
        }
        self_invalidate( );
    }

[Figure: per-word L1/L2 snapshots, with R = Registered, V = Valid, I = Invalid. All words start Valid everywhere. Core 1 writes X0-X2 and Core 2 writes X3-X5; each write makes the writer's L1 word Registered, and registration/ack messages make the shared L2 record the registered core in its data array (R C1 for X0-X2, R C2 for X3-X5) while every Y word stays Valid. After self_invalidate, each L1 marks the X words written by the other core Invalid (Core 1: X3-X5; Core 2: X0-X2); Y words remain Valid since Y is not in the write effect.]
Slide 24: Addressing Limitations

Limitations of current directory-based protocols (✔ = addressed by the basic protocol):
- Complexity: subtle races and numerous transient states; hard to extend for optimizations ✔
- Storage overhead: directory overhead for sharer lists ✔
- Performance and power overhead:
  - Invalidation and ack messages ✔
  - False sharing
  - Indirection through the directory
Slide 25: Practical DeNovo Coherence

The basic protocol is impractical: high tag storage overhead (a tag per word).

Address/transfer granularity > coherence granularity:
- DeNovo line-based protocol: traditional software-oblivious spatial locality
- Coherence granularity stays at word level → no word-level false sharing
- "Line merging": an incoming line is merged word-by-word into the cache frame (see the sketch below)

[Figure: a cache frame with one tag and per-word states (e.g., V V R); merging fills its Invalid words from an incoming line without disturbing the Registered word.]
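A minimal sketch of line merging under these assumptions (structure and names hypothetical): one tag per line, per-word states, and a merge rule that never overwrites a locally Registered or Valid word.

    // Line merging: one tag, per-word coherence state; an incoming copy
    // fills only Invalid words, so Registered/Valid words are preserved
    // and there is no word-level false sharing.
    class DeNovoLine {
        long tag;
        State[] wordState = new State[16];   // 64B line, 4B words
        int[] data = new int[16];

        void merge(int[] incoming, State[] incomingState) {
            for (int w = 0; w < 16; w++) {
                if (wordState[w] == State.INVALID
                        && incomingState[w] == State.VALID) {
                    data[w] = incoming[w];
                    wordState[w] = State.VALID;
                }
            }
        }
    }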
Slide 26: Storage Overhead

DeNovo line (64 bytes/line):
- L1: 2 state bits, 1 touched bit, 5 region bits per word (128 bits/line)
- L2: 1 valid bit and 1 dirty bit per line, 1 state bit per word (18 bits/line)

MESI (full-map, in-cache directory):
- L1: 5 state bits (5 bits/line)
- L2: 5 state bits plus [# of cores] sharer bits (5 + P bits/line)

Assumed size of L2 = 8 x [aggregate size of the L1s]
Slide 27: Storage Overhead

[Chart: storage overhead vs. core count for DeNovo line and MESI; the curves cross at 29 cores, beyond which MESI's per-line directory overhead exceeds DeNovo's.]
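As a quick check of the crossover, using the per-line budgets from Slide 26 (P cores, N L1 lines per core, hence 8PN L2 lines):

    DeNovo: 128*PN + 18*8PN = 272*PN bits
    MESI:   5*PN + (5+P)*8PN = (45 + 8P)*PN bits

The two are equal when 8P = 227, i.e., P ≈ 28.4, so MESI's overhead exceeds DeNovo's from 29 cores onward, matching the chart.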
Slide 28: Alternative Directory Designs

- Duplicate-tag directory: L1 tags duplicated in the directory; requires highly associative lookups (a 256-way associative lookup for a 64-core setup with 4-way L1s)
- Sparse directory: significant performance overhead at higher core counts
- Tagless directory (MICRO '09): imprecise sharer-list encoding using Bloom filters; even more protocol complexity (false positives); trades performance for lower storage overhead/power
Slide 29: Addressing Limitations

Limitations of current directory-based protocols (✔ = addressed so far):
- Complexity: subtle races and numerous transient states; hard to extend for optimizations ✔
- Storage overhead: directory overhead for sharer lists ✔
- Performance and power overhead:
  - Invalidation and ack messages ✔
  - False sharing ✔
  - Indirection through the directory
Slide 30: Extensions

- Traditional directory-based protocols: sharer lists must always contain all the true sharers
- DeNovo protocol: the registry only has to point to the latest copy at the end of the phase, so valid data can be copied around freely in between
Slide 31: Extensions (1 of 2)

Basic protocol with direct cache-to-cache transfer:
- Get data directly from the producer
- Through prediction and/or software assistance
- Converts 3-hop misses into 2-hop misses

[Figure: Core 1 holds X3-X5 Invalid while Core 2 is registered for them (and the L2 points to C2). Core 1's LD X3 is sent straight to Core 2, which replies with X3: two hops instead of the three-hop L1 → L2 → L1 path.]
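A minimal sketch of the direct-transfer path (the predictor interface is hypothetical): query a predicted producer first and fall back to the L2 registry on a misprediction. Because valid data can be copied around freely (Slide 30), no sharer list needs updating.

    // Direct transfer: try a predicted producer (2 hops); fall back to the
    // L2 registry (3 hops) if the prediction misses.
    class DirectTransfer {
        interface OwnerPredictor { int predict(long addr); }  // -1 = no guess

        int load(long addr, OwnerPredictor pred) {
            int owner = pred.predict(addr);
            if (owner >= 0) {
                Integer data = requestFrom(owner, addr);  // L1 -> L1 directly
                if (data != null) return data;            // hit at the producer
            }
            return requestFromL2(addr);                   // registry indirection
        }
        Integer requestFrom(int core, long addr) { return null; /* network, elided */ }
        int requestFromL2(long addr) { return 0; /* L2 registry lookup, elided */ }
    }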
Slides 32-34: Extensions (2 of 2)

Basic protocol with flexible communication:
- Software-directed data transfer: transfer "relevant" data together
- Achieves the effect of an AoS-to-SoA transformation without programmer or compiler intervention

[Figure: Core 1 misses on X3 while Core 2 is registered for X3-X5. Rather than answering LD X3, LD X4, LD X5 with line-sized replies full of unrelated Y and Z words, a single response carries X3, X4, and X5 together, leaving Core 1 with X3-X5 Valid.]
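A minimal sketch of the gathering step on the responder side (structure and names hypothetical): pack other words of the requested word's region instead of the line's unrelated neighbors.

    // Flexible communication: the responder packs other words from the
    // demand word's region (e.g., X4, X5 with X3) instead of the line's
    // unrelated neighbors (Y3, Z3).
    class FlexTransfer {
        int regionOf(long addr) { return 0; /* region-table lookup, elided */ }

        long[] packResponse(long demandAddr, long[] candidates, int max) {
            int r = regionOf(demandAddr);
            java.util.List<Long> out = new java.util.ArrayList<>();
            out.add(demandAddr);
            for (long a : candidates) {
                if (out.size() >= max) break;
                if (a != demandAddr && regionOf(a) == r) out.add(a);  // same region
            }
            return out.stream().mapToLong(Long::longValue).toArray();
        }
    }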
Slide 35: Outline

- Motivation
- Background: DPJ
- DeNovo Protocol
- DeNovo Extensions
- Evaluation: protocol verification, performance
- Conclusion and Future Work
Slide 36: Evaluation

- Simplicity: formal verification of the cache coherence protocol; comparing reachable states
- Performance/power: simulation experiments
- Extensibility: DeNovo extensions
Slide 37: Protocol Verification Methodology

- Murphi model-checking tool
- Verified the DeNovo-word and MESI-word protocols (MESI taken from a state-of-the-art GEMS implementation)
- Abstract model: single address/region, two data values; two cores with private L1s and a unified L2; unordered network
- Modeled DPJ's data-race-freedom guarantee for DeNovo, including cross-phase interactions
Slide 38: Protocol Verification: Results

Correctness:
- Three bugs in the DeNovo protocol: mistakes in translating from the high-level specification; simple to fix
- Six bugs in the MESI protocol: two deadlock scenarios and unhandled races due to L1 writebacks; several days to fix

Complexity:
- 25x fewer reachable states for DeNovo
- 30x difference in model-checking runtime
Slide 39: Performance Evaluation Methodology

Simulation environment: Wisconsin GEMS + Simics + Princeton Garnet network

System parameters:
- 64 cores
- Private L1s (128KB) and a unified L2 (32MB)
- Simple core model: 5-stage, one-issue, in-order core

Results report memory stall time only.
Slide 40: Benchmarks

- FFT and LU from SPLASH-2
- kdTree: construction of k-D trees for ray tracing; developed within UPCRC, presented at HPG 2010
- Two kdTree versions:
  - kdFalse: false sharing in an auxiliary structure
  - kdPad: padding to eliminate the false sharing
Slide 41: Simulated Protocols

- Word-based (4B cache line): MESI and DeNovo
- Line-based (64B cache line): MESI and DeNovo
- DeNovo extensions (proof of concept, word-based): DeNovo direct, DeNovo flex
Slide 42: Word Protocols

[Chart: memory stall time for FFT, LU, kdFalse, kdPad under Mword and Dword; off-scale bars labeled 563 and 547.]

- Dword is comparable to Mword
- Simplicity doesn't compromise performance
Slide 43: DeNovo Direct

[Chart: memory stall time for FFT, LU, kdFalse, kdPad; off-scale bars labeled 563, 547, and 505.]

- Ddirect reduces remote L1 hit time
Slide 44: Line Protocols

[Chart: memory stall time for FFT, LU, kdFalse, kdPad; off-scale bars labeled 563, 547, and 505.]

- Dline is not susceptible to false sharing
- Up to 66% reduction in total time
Slide 45: Word vs. Line

[Chart: memory stall time for FFT, LU, kdFalse, kdPad; off-scale bars labeled 563, 547, and 505.]

- Application-dependent benefit
- kdFalse: Mline worse by 12%
Slide 46: DeNovo Flex

[Chart: memory stall time for FFT, LU, kdFalse, kdPad; off-scale bars labeled 563, 547, and 505.]

- Dflex outperforms all systems
- Up to 73% reduction over Mline
Slide 47: Network Traffic

[Chart: network traffic for FFT, LU, kdFalse, kdPad.]

- Dline and Dflex generate less network traffic than Mline
- Up to 70% reduction
Slide 48: Conclusion

- Disciplined programming models are key for software
- DeNovo rethinks hardware for disciplined models
- Simplicity: 25x fewer reachable states with model checking; 30x runtime difference
- Extensibility: direct cache-to-cache transfer; flexible communication granularity
- Storage overhead: no storage overhead for directory information; storage overheads beat MESI after tens of cores
- Performance/power: 73% less time spent on memory requests; 70% reduction in network traffic
Slide 49: Future Work

- Rethinking cache data layout
- Extend to disciplined non-deterministic codes, synchronization, and legacy codes
- Extend to off-chip memory
- Automate the generation of hardware regions
- More extensive evaluations
Slide 50: Thank You!

Slide 51: Backup Slides
Slide 52: Flexible Coherence Granularity

- Byte-level sharing is uncommon: none of the apps studied so far; handle it correctly but not necessarily efficiently
- The compiler aligns byte-granularity regions at word boundaries
- If that fails, hardware "clones" the line into 4 cache frames; with at least 4-way associativity, all clones reside in the same set

[Figure: a normal frame holds Tag | State | Byte 0 | Byte 1 | Byte 2 | Byte 3; in cloned mode, four frames of the same set each hold Tag | State | Byte k, one byte lane per frame.]
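A minimal sketch of the cloned-lookup path (layout and names hypothetical): frame k of the set serves byte lane k, so a byte access selects its clone by byte offset.

    // Cloned lines: four frames in one set share a tag; a cloned frame
    // holds one byte lane of each word, with its own state bits, so
    // byte-granularity regions stay coherent without per-byte state in
    // every normal line.
    class ClonedSet {
        static final int WAYS = 4;             // >= 4-way: clones share a set
        long[] tag = new long[WAYS];
        int[] lane = new int[WAYS];            // -1 = normal; 0..3 = byte lane
        byte[][] data = new byte[WAYS][64];    // cloned frames use 1 byte/word

        byte readByte(long t, int word, int byteInWord) {
            for (int w = 0; w < WAYS; w++) {
                if (tag[w] != t) continue;
                if (lane[w] == -1) return data[w][4 * word + byteInWord];
                if (lane[w] == byteInWord) return data[w][word];
            }
            throw new IllegalStateException("miss: fetch (and maybe clone)");
        }
    }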