Slide1
Delegated Persist Ordering
Aasheesh Kolli, Jeff Rosen, Stephan Diestelhorst, Ali Saidi, Steven Pelley, Sihang Liu, Peter M. Chen, Thomas F. Wenisch
MICRO 2016 - Taipei
Slide2
Promise of Persistent Memory (PM)
Non-volatility
Performance*
Density
Byte-addressable, load-store interface to storage
Slide3
Recoverable data structures in PM
Correct recovery relies on ordering writes to PM
[Figure: prepareUndoLog (P), mutateData (M), commitTx (C); labeled "Recoverable"]
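The P, M, C ordering above can be illustrated with a small crash simulation. This is a minimal sketch under stated assumptions, not code from the talk: PM is modeled as a Python dict, a crash as "only the first k writes reached PM", and the names `tx_writes`, `crash_after`, `recover`, and `survives_all_crashes` are hypothetical.

```python
# Toy model: persistent memory (PM) is a dict, and a crash keeps only the
# first k writes that reached PM. All names are illustrative assumptions.

def tx_writes(old, new):
    """PM writes of one undo-logged update of fields a and b, in persist order."""
    return [
        ("log", old),     # P: prepareUndoLog saves the old values
        ("a", new[0]),    # M: mutateData updates in place
        ("b", new[1]),    # M
        ("log", None),    # C: commitTx invalidates the undo log
    ]

def crash_after(writes, k, pm):
    """Return a copy of pm with only the first k writes applied."""
    pm = dict(pm)
    for key, value in writes[:k]:
        pm[key] = value
    return pm

def recover(pm):
    """Roll back an uncommitted transaction using the undo log, if present."""
    pm = dict(pm)
    if pm["log"] is not None:
        pm["a"], pm["b"] = pm["log"]
        pm["log"] = None
    return pm

def survives_all_crashes(writes, initial, valid):
    """True if recovery yields a valid (a, b) pair for every crash point."""
    for k in range(len(writes) + 1):
        state = recover(crash_after(writes, k, initial))
        if (state["a"], state["b"]) not in valid:
            return False
    return True

if __name__ == "__main__":
    initial = {"a": 1, "b": 2, "log": None}
    valid = {(1, 2), (20, 30)}
    ordered = tx_writes((1, 2), (20, 30))
    reordered = [ordered[1], ordered[0], ordered[2], ordered[3]]  # M before P
    print(survives_all_crashes(ordered, initial, valid))
    print(survives_all_crashes(reordered, initial, valid))
```

With the writes persisted in P, M, C order, every crash point recovers to either the old or the new pair. If one data write persists ahead of the log entry, a crash right after it leaves a torn state that recovery cannot detect.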
Need primitives to express desired write order to PM
Slide4
Memory persistency models
Programmers can express write order to PM
Hardware enforces write order
Similar to memory consistency models
[Condit ‘09] [Pelley ‘14] [Intel ‘14] [Joshi ‘15] …
Slide5
Contributions
Implications of Synchronous Ordering [Intel ‘14]
  Semantics and performance
  7.2x slowdown over volatile execution
Delegated Persist Ordering
  Implementation strategy for persistency models
  3.5x speedup over Synchronous Ordering
Slide6
Outline
Background
Synchronous Ordering (SO)
SO drawbacks
Delegated Persist Ordering (DPO)
Evaluation
Slide7
Persistent memory system
[Figure: Core-1 to Core-4, each with an L1 $; LLC; DRAM; Persistent Memory (PM); label "Recovery"]
Slide8
Terminology
Persist: act of making a store durable in PM
Persist memory order (PMO)
  Memory events ordered by persistency model
  Governs the order in which stores persist
Slide9
Persistency model
[Figure: Core-1 and Core-2, each with an L1 $; LLC; PM holding a and b; labels: Consistency, Persistency, Writeback caching, Mem Ctrl reordering]
/* Init: a = 0, b = 0; a, b are persistent */
Core-1      Core-2
St a = x    while (a==0){}
            St b = y
Constraint: St a < St b
Persistency models must minimize additional performance overheads
Slide10
Epoch persistency
[Condit ‘09, Pelley ‘14, Joshi ‘15]
FENCEs break thread execution into epochs
Persists across epochs are ordered
No ordering within epoch
prepareUndoLog (P)
FENCE
mutateData (M)
[Figure: PMO; Epoch-1 contains persists P1, P2, …, Pn; Epoch-2 contains persists M1, M2, …, Mn]
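The epoch rule above (ordered across epochs, unordered within an epoch) can be written down as a small checker. This is an illustrative sketch only: the trace format, the `"FENCE"` token, and the function names `split_epochs` and `respects_epochs` are assumptions, and each persist label is assumed to be unique within the trace.

```python
def split_epochs(trace):
    """Group one thread's persists into epochs separated by "FENCE"."""
    epochs, current = [], []
    for op in trace:
        if op == "FENCE":
            epochs.append(current)
            current = []
        else:
            current.append(op)
    epochs.append(current)
    return epochs

def respects_epochs(trace, persist_order):
    """True if no persist reaches PM before a persist from an earlier epoch.

    Persists within the same epoch may appear in any order.
    """
    epoch_of = {
        op: index
        for index, epoch in enumerate(split_epochs(trace))
        for op in epoch
    }
    ranks = [epoch_of[op] for op in persist_order]
    return ranks == sorted(ranks)

if __name__ == "__main__":
    trace = ["P1", "P2", "FENCE", "M1", "M2"]
    print(respects_epochs(trace, ["P2", "P1", "M2", "M1"]))  # reorder inside epochs
    print(respects_epochs(trace, ["P1", "M1", "P2", "M2"]))  # M1 overtakes P2
```

Reordering P1 and P2 is allowed because they share Epoch-1, while letting M1 persist before P2 crosses the FENCE and is rejected.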
However, many ways to skin a cat …
Slide11
Synchronous Ordering (SO)
Based on x86 ISA changes for PM
CLWB <Addr>: writes back cache block to Memory Controller (MC)
PCOMMIT: persists all accepted stores at MC
[Figure: Cores, Caches, Mem Ctrl, PM; steps labeled 1. CLWB and 2. PCOMMIT]
Slide12
Ordering via stalling
Objective: persist A and B in order
St A = x
CLWB <A>
fence
PCOMMIT
fence
St B = y
[Figure: Core and Caches with blocks a and b; steps 1 to 4, with step 3 labeled "completion ACK"; the core is marked "Stalled"]
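The instruction sequence above can be traced with a toy single-core model. This is a sketch under stated assumptions rather than a description of real hardware: the class `SyncOrderingModel` and its fields are hypothetical, fences are implicit because the model runs operations strictly in program order, and the stall charged per PCOMMIT is an arbitrary latency parameter.

```python
class SyncOrderingModel:
    """Toy model: cache -> memory controller (MC) queue -> PM."""

    def __init__(self, pm_latency=100):
        self.cache = {}
        self.mc_queue = []       # stores accepted by the MC, not yet durable
        self.pm = {}
        self.persist_log = []    # order in which blocks became durable
        self.stall_cycles = 0
        self.pm_latency = pm_latency  # hypothetical cycles per PCOMMIT

    def store(self, addr, value):
        self.cache[addr] = value

    def clwb(self, addr):
        # Write the cache block back to the MC.
        self.mc_queue.append((addr, self.cache[addr]))

    def pcommit(self):
        # Persist every store the MC has accepted. The fence that follows
        # makes the core wait for the completion ACK, modeled as a stall.
        for addr, value in self.mc_queue:
            self.pm[addr] = value
            self.persist_log.append(addr)
        self.mc_queue.clear()
        self.stall_cycles += self.pm_latency

if __name__ == "__main__":
    m = SyncOrderingModel()
    m.store("A", "x"); m.clwb("A"); m.pcommit()
    m.store("B", "y"); m.clwb("B"); m.pcommit()
    print(m.persist_log, m.stall_cycles)
```

In this model A always becomes durable before B, but the core pays one full PM latency for every ordering point.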
Ordering via stalling leads to frequent processor stalls
Slide13
Further drawbacks
Places the PCOMMIT on critical path
Negates scheduling optimizations at the MC
Has to respect all accepted stores at the MC
Explicit CLWB annotations from user
Challenge for multi-party software dev
Affects cache coherence permissions
7.2x slowdown for PCM main memories
Slide14
Ideal ordering
Objective: persist A and B in order
St A = x
fence
St B = y
[Figure: Core and Caches with blocks a and b; steps 1 to 3]
Buffered persistency [Condit ‘09] [Pelley ‘14] [Joshi ‘15]
Ordering without stalling
Slide15
Delegated Persist Ordering (DPO)
Goals
  Reduce processor stalls
  More scheduling freedom for MC
Key ideas
  Buffer persists & ordering info alongside cache hierarchy
  Decouple persistence from core execution
Challenge
  Accurately track persist order across cores
Slide16
Proposed architecture
Persist buffer at L1-D $
  Houses persist for each store to PM
  Retired fences
Dependency tracking
  Via coherence traffic
Persists drain to MC
  Decoupled from WBs
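The drain behavior listed above can be sketched as a toy scheduler. This is not the paper's hardware design: `Entry` and `drain` are hypothetical names, cross-core dependencies are supplied by hand (the slides say they are discovered via coherence traffic), and each buffer is assumed to drain in FIFO order.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    core: int
    ident: int                                # ID within the core's persist buffer
    block: str                                # cache block to persist
    deps: set = field(default_factory=set)    # (core, ident) pairs that must drain first

def drain(buffers):
    """Drain per-core persist buffers to the MC while honoring dependencies.

    buffers maps a core number to its entries in program order. Returns the
    drain order as (core, ident, block) tuples.
    """
    pending = {core: list(entries) for core, entries in buffers.items()}
    drained, order = set(), []
    progress = True
    while progress:
        progress = False
        for entries in pending.values():
            # Only the head of a buffer may drain, and only once every
            # entry it depends on has already drained.
            if entries and entries[0].deps <= drained:
                head = entries.pop(0)
                drained.add((head.core, head.ident))
                order.append((head.core, head.ident, head.block))
                progress = True
    if any(pending.values()):
        raise RuntimeError("dependencies can never be satisfied")
    return order

if __name__ == "__main__":
    buffers = {
        2: [Entry(2, 1, "b", {(1, 1)})],  # St b on Core-2 depends on Core-1's persist of a
        1: [Entry(1, 1, "a")],            # St a on Core-1
    }
    print(drain(buffers))
```

Even though Core-2's buffer is visited first, block b waits until Core-1's persist of a has drained, and neither core has to stall while that happens.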
[Figure: Core-1 to Core-4, each with an L1 $; LLC; MC; PM]
Slide17
DPO in action
/* Init: a = b = 0; a, b are persistent */
Core-1      Core-2
St a = x    while (a==0){}
            fence
            St b = y
Constraint: St a <p St b
Slide18
DPO in action
[Figure: animation of the persist buffers at Core-1 and Core-2 (columns: ID | Deps | Blk), the LLC, the MC, and PM while the code below executes; buffer entries are numbered 1 to 3]
/* Init: a = b = 0; a, b are persistent */
Core-1      Core-2
St a = x    while (a==0){}
            fence
            St b = y
Constraint: St a <p St b
Slide19
DPO highlights
Decouples persistence from core execution
  Better scheduling at the MC
  Fewer stalls at the core
Slide20
Experimental setup
gem5 simulator – details in paper
Workloads: write-intensive micro-benchmarks
  Concurrent queue: insert/delete entries
  Array swaps: random swaps of array elements
  TATP: update location transaction from TATP
  RB-Tree: insert/delete entries
  TPCC: new order transaction from TPCC
PM configs
  PCM
  DRAM
  PWQ: assuming a persistent write queue at the MC
Slide21
Performance
[Figure: bar chart of normalized execution times*, with a "Better" arrow; bar labels 35 and 16]
* Execution times normalized to volatile execution
Execution times down with lower memory latencies
Slide22
Performance
[Figure: bar chart of normalized execution times*, with a "Better" arrow; bar labels 35 and 16; annotations 3.5x, 2.5x, 1.3x]
* Execution times normalized to volatile execution
DPO consistently outperforms SO for all mem configs
Slide23
Recent developments
[Figure: Cores, Caches, Mem Ctrl, PM, with a region labeled "Persistent domain"]
4 weeks ago
Require persistent write queue (at MC)
Slide24
Performance
[Figure: bar chart of normalized execution times*, with a "Better" arrow; bar labels 35 and 16; annotation 1.3x]
* Execution times normalized to volatile execution
Slide25
Conclusions
Persistency model semantics affect performance
SO couples ordering with completion of persists
  Often unnecessary
DPO decouples persistence from core execution
DPO significantly outperforms SO
Slide26
Questions?
Slide27
Thank you!
Slide28
Backup
Slide29
DPO in action
Core-1      Core-2
St a = x    while (a==0){}
            fence
            St b = y
PBE: ID | Deps | Addr & Data
Slide30
DPO in action
[Figure: animation of the persist buffers at Core-1 and Core-2, the LLC, the MC, and PM while the code below executes; persist buffer entries (PBE: ID | Deps | Addr & Data) carry IDs 1 to 3]
Core-1      Core-2
St a = x    while (a==0){}
            fence
            St b = y
Slide31
Root cause – barrier semantics
SO
  Ordering + Completion
  High latency
  Completion not always necessary [Chidambaram ‘13]
Buffered persistency [Pelley ‘14]
  Ordering only
  Low latency
  Pipeline persists
Slide32
SO example
Objective: persist A and B in order
St A = x
CLWB A
fence
PCOMMIT
fence
St B = y
[Label on the code: Barrier]
(1) Ensures ‘A’ is accepted to MC before PCOMMIT
(2) Ensures ‘B’ cannot reach cache before ‘A’ persists
(1) + (2) => A and B persist in order
Slide33
The dark side of PCOMMIT
St A = x
CLWB A
fence
PCOMMIT
fence
St B = y
[Label on the code: Barrier]
(1) Ensures ‘A’ is accepted to MC before PCOMMIT
(2) Ensures ‘B’ cannot reach cache before ‘A’ persists
Exposes PM latency to the core on every PCOMMIT
Frequent processor stalls
Slide34
Recent developments
Intel deprecated PCOMMIT 4 weeks ago
Requires a persistent write queue at MC
[Figure: Cores, Caches, Mem Ctrl, PM, with a region labeled "Persistent domain"]
Slide35
SO example revisited
Objective: persist A and B in order
St A = x
CLWB A
sfence
St B = y
Ensures ‘A’ is accepted to MC before ‘B’ reaches cache
MC access latency still exposed to core (vs PM latency)
1.5x slowdown vs 7.2x
Slide36
RCBSP
Relaxed Consistency Buffered Strict Persistency
Relaxed consistency
  We use ARMv7-A
Strict persistency
  Persist order = Store order
Buffering
/* Init: a = 0, b = 0; a, b are persistent */
Core-1      Core-2
St a = x    while (a==0){}
            dmb
            St b = y
St a <m St b
St a <p St b
Slide37
Delegated Persist Ordering
Aasheesh Kolli, Jeff Rosen, Stephan Diestelhorst, Ali Saidi, Steven Pelley, Sihang Liu, Peter M. Chen, Thomas F. Wenisch
MICRO 2016 - Taipei
Slide38
Delegated Persist Ordering
Aasheesh Kolli, Jeff Rosen, Stephan Diestelhorst, Ali Saidi, Steven Pelley, Sihang Liu, Peter M. Chen, Thomas F. Wenisch
MICRO 2016 - Taipei
Slide39
Performance
[Figure: bar chart of normalized execution times*; bar labels 35 and 16; annotations 3.5x, 2.5x, 1.3x]
* Execution times normalized to volatile execution
Execution times down with lower memory latencies
Array swaps, most write intensive, suffers the most, except …
Concurrent queue, thread contention exposes CLWBs
RB tree, least write intensive, dominated by volatile execution
RCBSP consistently outperforms SO for all mem configs
Slide40
Memory persistency models
Allow programmers to express write order to PM
Hardware responsible for enforcing write order
Similar to memory consistency models
Synchronous Ordering (SO) – based on [Intel ‘15]
Slide41
SO barrier v. RCBSP barrier
SO (based on [Intel ‘15])
  Ordering + Completion
  Stronger guarantee
  Easier to reason
  Potentially fewer barriers
  Conservative stalling
  Completion not always req.
  Synchronous barrier
RCBSP [Pelley ‘14]
  Ordering only (mostly)
  Weaker guarantee
  Harder to reason
  Potentially more barriers
  Stall when no buffer space
  Ordering mostly sufficient
  Delegation barrier
Slide42
Barriers in action
Synchronous: Epoch-1 persists before Epoch-2 begins
Delegation: Epoch-1 ordered to persist before Epoch-2
[Figure: two timelines, each showing a Core and its $ running epochs E1 and E2 against Time, with a Finish marker]
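The difference between the two barriers can be approximated with a back-of-the-envelope timing model. This is a rough sketch with made-up numbers, not a reproduction of the evaluation: `synchronous` and `delegated` are hypothetical helpers, each epoch is reduced to one compute time plus one fixed persist latency, persists of successive epochs are serialized, and the persist buffer is assumed never to fill.

```python
def synchronous(epochs, persist_latency):
    """Core finish time when each epoch must persist before the next begins."""
    t = 0
    for compute in epochs:
        t += compute + persist_latency  # the core stalls for every persist
    return t

def delegated(epochs, persist_latency):
    """Return (core_done, persists_done) when persists drain in the background.

    Epochs still persist in order, but the core does not wait for them.
    """
    core_done = 0
    persists_done = 0
    for compute in epochs:
        core_done += compute
        # An epoch's persist starts once the epoch has executed and the
        # previous epoch's persist has completed.
        persists_done = max(persists_done, core_done) + persist_latency
    return core_done, persists_done

if __name__ == "__main__":
    epochs = [10, 10]        # hypothetical compute time of E1 and E2
    latency = 30             # hypothetical persist latency
    print(synchronous(epochs, latency))
    print(delegated(epochs, latency))
```

Under these assumptions the synchronous barrier charges every persist latency to the core, while the delegation barrier lets the core finish after its compute time and overlaps the persist of Epoch-1 with the execution of Epoch-2.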