Delegated Persist Ordering - PowerPoint Presentation

Uploaded On 2018-11-03


Presentation Transcript

Slide1

Delegated Persist Ordering

Aasheesh Kolli, Jeff Rosen, Stephan Diestelhorst, Ali Saidi, Steven Pelley, Sihang Liu, Peter M. Chen, Thomas F. Wenisch

MICRO 2016 - Taipei

Slide2

Promise of Persistent Memory (PM)

Non-volatility

Performance*

Byte-addressable, load-store interface to storage

Density

Slide3

Recoverable data structures in PM

Correct recovery relies on ordering writes to PM

[Figure: prepareUndoLog (P) -> mutateData (M) -> commitTx (C), labeled "Recoverable"]
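The dependence of recovery on persist order can be sketched with a toy model. This is a minimal illustration, not the authors' code: PM is assumed to be a Python dict, each of P, M and C is assumed to persist atomically, and the names `persists`, `recover` and `outcomes` are hypothetical.

```python
def persists(pm, key, new_value):
    """Writes of one undo-logged update, with values fixed in program order."""
    return {
        "P": {"log": (key, pm[key]), "log_valid": True},  # prepareUndoLog
        "M": {key: new_value},                            # mutateData
        "C": {"log_valid": False},                        # commitTx
    }

def recover(pm):
    """Roll back an uncommitted update if a valid undo log is found."""
    if pm["log_valid"]:
        key, old = pm["log"]
        pm[key] = old
        pm["log_valid"] = False
    return pm

def outcomes(order):
    """Value of x after recovery, for a crash after 0..3 persists."""
    results = []
    for crash_after in range(len(order) + 1):
        pm = {"x": 1, "log": None, "log_valid": False}
        writes = persists(pm, "x", 2)
        for name in order[:crash_after]:   # only these writes reached PM
            pm.update(writes[name])
        results.append(recover(pm)["x"])
    return results

print(outcomes("PMC"))  # [1, 1, 1, 2]
print(outcomes("MPC"))  # [1, 2, 1, 2]
```

With the order P, M, C the new value survives only once the commit has persisted. If M persists before P, a crash after the first persist leaves the new value in PM with no log to roll it back, even though the update never committed.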

Need primitives to express the desired write order to PM

Slide4

Memory persistency models

Programmers can express write order to PM
Hardware enforces write order
Similar to memory consistency models
[Condit ‘09] [Pelley ‘14] [Intel ‘14] [Joshi ‘15] …

Slide5

Contributions

Implications of Synchronous Ordering [Intel ‘14]
  Semantics and performance
  7.2x slowdown over volatile execution
Delegated Persist Ordering
  Implementation strategy for persistency models
  3.5x speedup over Synchronous Ordering

Slide6

Outline

Background
Synchronous Ordering (SO)
SO drawbacks
Delegated Persist Ordering (DPO)
Evaluation

Slide7

Persistent memory system

[Figure: Core-1 through Core-4, L1 $, LLC, DRAM, Persistent Memory (PM); label "Recovery"]

Slide8

Terminology

Persist
  Act of making a store durable in PM
Persist memory order (PMO)
  Memory events ordered by persistency model
  Governs the order in which stores persist

Slide9

Persistency model

[Figure: Core-1 and Core-2, each with an L1 $, and the LLC]

/* Init: a = 0, b = 0; a, b are persistent */

Core-1        Core-2
St a = x      while (a==0){}
              St b = y

Constraint: St a < St b

Consistency    Persistency

Writeback caching

Mem Ctrl reordering

Persistency models must minimize additional performance overheads

Slide10

Epoch persistency

[Condit ‘09, Pelley ‘14, Joshi ‘15]
FENCEs break thread execution into epochs
Persists across epochs are ordered
No ordering within epoch

[Figure: prepareUndoLog (P), mutateData (M), FENCE, …; PMO orders Epoch-1 before Epoch-2]
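The epoch rules above can be sketched as a small checker. This is a hedged illustration for a single thread, assuming every store in the trace has a unique name; `epochs` and `allowed` are hypothetical helper names, not part of any persistency model specification.

```python
def epochs(trace):
    """Split a single-thread trace into epochs at each FENCE."""
    result, current = [], []
    for op in trace:
        if op == "FENCE":
            result.append(current)
            current = []
        else:
            current.append(op)
    result.append(current)
    return result

def allowed(trace, persist_order):
    """True if no store persists before a store from an earlier epoch."""
    epoch_of = {st: i for i, ep in enumerate(epochs(trace)) for st in ep}
    ranks = [epoch_of[st] for st in persist_order]
    return ranks == sorted(ranks)

trace = ["A", "B", "FENCE", "C"]
print(epochs(trace))                    # [['A', 'B'], ['C']]
print(allowed(trace, ["B", "A", "C"]))  # True: A and B share an epoch
print(allowed(trace, ["A", "C", "B"]))  # False: C persisted before B
```

Stores inside one epoch may persist in either order, which is the freedom the hardware can exploit; only the epoch boundary constrains it.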

However, many ways to skin a cat …

Slide11

Synchronous Ordering (SO)

Based on x86 ISA changes for PM
CLWB <Addr>
  Writes back cache block to Memory Controller (MC)
PCOMMIT
  Persists all accepted stores at MC
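The two-step path described above (CLWB to the MC, then PCOMMIT to PM) can be mimicked with a toy model. This is only a sketch under stated assumptions: `ToyMachine` is a hypothetical class, it executes sequentially so fences are not modeled, and it says nothing about real instruction timing.

```python
class ToyMachine:
    """Hypothetical three-level model: cache -> MC queue -> PM."""

    def __init__(self):
        self.cache = {}      # volatile cache contents
        self.mc_queue = []   # stores accepted at the memory controller
        self.pm = {}         # durable state

    def store(self, addr, value):
        self.cache[addr] = value

    def clwb(self, addr):
        # Write the cache block back to the MC; it is not yet durable.
        self.mc_queue.append((addr, self.cache[addr]))

    def pcommit(self):
        # Persist every store the MC has accepted so far.
        for addr, value in self.mc_queue:
            self.pm[addr] = value
        self.mc_queue.clear()

m = ToyMachine()
m.store("A", 1)
m.clwb("A")
assert m.pm == {}   # accepted at the MC, not yet persistent
m.pcommit()
m.store("B", 2)     # B is still only in the cache
print(m.pm)         # {'A': 1}
```

In this model A becomes durable only at PCOMMIT, and B cannot be durable before A because it has not left the cache.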

[Figure: Cores, Caches, Mem Ctrl; 1. CLWB, 2. PCOMMIT]

Slide12

Ordering via stalling

Objective: persist A and B in order

St A = x
CLWB <A>
fence
PCOMMIT
fence
St B = y

[Figure: Core and Caches; step 3 is the completion ACK; the core is marked "Stalled"]

Ordering via stalling -> frequent processor stalls

Slide13

Further drawbacks

Places the PCOMMIT on critical path
Negates scheduling optimizations at the MC
Has to respect all accepted stores at the MC
Explicit CLWB annotations from user
Challenge for multi-party software dev
Affects cache coherence permissions

7.2x slowdown for PCM main memories

Slide14

Ideal ordering

Objective: persist A and B in order

St A = x
fence
St B = y

[Figure: Core and Caches]

Buffered persistency [Condit ‘09] [Pelley ‘14] [Joshi ‘15]

Ordering without stalling

Slide15

Delegated Persist Ordering (DPO)

Goals
  Reduce processor stalls
  More scheduling freedom for MC
Key ideas
  Buffer persists & ordering info alongside cache hierarchy
  Decouple persistence from core execution
Challenge
  Accurately track persist order across cores

Slide16

Proposed architecture

Persist buffer at L1-D $
  Houses persist for each store to PM
  Retired fences
Dependency tracking
  Via coherence traffic
Persists drain to MC
  Decoupled from WBs
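The idea of per-core persist buffers with cross-core dependencies can be sketched as follows. This is a simplified illustration, not the proposed hardware: `PersistBuffer` and `drain` are hypothetical names, each buffer is assumed to drain strictly in FIFO order, fences are not modeled as separate entries, and dependencies are passed in by hand where the real design would learn them from coherence traffic.

```python
from collections import deque

class PersistBuffer:
    """Per-core FIFO of pending persists (hypothetical simplification)."""

    def __init__(self, core):
        self.core = core
        self.entries = deque()   # (entry_id, block, deps)
        self.next_seq = 0

    def store(self, block, deps=()):
        # The core does not wait here; the persist is only recorded.
        entry_id = (self.core, self.next_seq)
        self.next_seq += 1
        self.entries.append((entry_id, block, set(deps)))
        return entry_id

def drain(buffers):
    """Drain buffer heads to the MC once their dependencies have drained."""
    drained, order = set(), []
    progress = True
    while progress:
        progress = False
        for buf in buffers:
            if buf.entries and buf.entries[0][2] <= drained:
                entry_id, block, _ = buf.entries.popleft()
                drained.add(entry_id)
                order.append(block)
                progress = True
    return order

core1, core2 = PersistBuffer(1), PersistBuffer(2)
a_id = core1.store("a")
core2.store("b", deps=[a_id])   # Core-2 observed a before storing b
print(drain([core2, core1]))    # ['a', 'b']
```

Even when Core-2's buffer is visited first, b waits until a has drained, so the ordering is enforced at drain time and neither core stalls at the store.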

[Figure: Core-1 through Core-4, L1 $, LLC, MC]

Slide17

DPO in action

/* Init: a = b = 0; a, b are persistent */

Core-1        Core-2
St a = x      while (a==0){}
              fence
              St b = y

Constraint: St a <p St b

Slide18

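DPO in action

/* Init: a = b = 0; a, b are persistent */

Core-1        Core-2
St a = x      while (a==0){}
              fence
              St b = y

Constraint: St a <p St b

[Figure: animation of the persist buffers (columns Deps and Blk) at Core-1 and Core-2 and the LLC as St a = x, fence and St b = y execute; entries shown include "a |" and "a | 1"]

Slide19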

DPO highlights

Decouples persistence from core execution
Better scheduling at the MC
Fewer stalls at the core

Slide20

Experimental setup

gem5 simulator – details in paper
Workloads: write-intensive micro-benchmarks
  Concurrent queue: insert/delete entries
  Array swaps: random swaps of array elements
  TATP: update location transaction from TATP
  RB-Tree: insert/delete entries
  TPCC: new order transaction from TPCC
PM configs
  PCM
  DRAM
  PWQ: assuming a persistent write queue at the MC

Slide21

Performance

[Figure: plot of normalized execution times, with an arrow marking the "Better" direction]

* Execution times normalized to volatile execution

Execution times go down with lower memory latencies

Slide22

Performance

[Figure: plot of normalized execution times, with an arrow marking the "Better" direction]

* Execution times normalized to volatile execution

DPO consistently outperforms SO for all memory configs: 3.5x, 2.5x, 1.3x

Slide23

Recent developments

[Figure: Cores, Caches, Mem Ctrl; the persistent domain is marked]

Intel deprecated PCOMMIT 4 weeks ago -> requires a persistent write queue (at MC)

Slide24

Performance

[Figure: plot of normalized execution times, with an arrow marking the "Better" direction; annotated 1.3x]

* Execution times normalized to volatile execution

Slide25

Conclusions

Persistency model semantics affect performance
SO couples ordering with completion of persists
  Often unnecessary
DPO decouples persistence from core execution
DPO significantly outperforms SO

Slide26

Questions?

Slide27

Thank you!

Slide28

Backup

Slide29

DPO in action

Core-1        Core-2
St a = x      while (a==0){}
              fence
              St b = y

PBE: | Deps | Addr & Data

Slide30

DPO in action

Core-1        Core-2
St a = x      while (a==0){}
              fence
              St b = y

[Figure: animation of the persist buffers at Core-1 and Core-2 and the LLC; PBE fields shown are ID, Deps, and Addr & Data; entries are numbered ID: 1, ID: 2, ID: 3]

Slide31

Root cause – barrier semantics

SO
  Ordering + Completion
  High latency
  Completion not always necessary [Chidambaram ‘13]
Buffered persistency [Pelley ‘14]
  Ordering only
  Low latency
  Pipeline persists

Slide32

SO example

Objective: persist A and B in order

St A = x
CLWB A
fence      (1)
PCOMMIT
fence      (2)
St B = y

Barrier: fence, PCOMMIT, fence

(1) Ensures ‘A’ is accepted to MC before PCOMMIT
(2) Ensures ‘B’ cannot reach cache before ‘A’ persists

(1) + (2) => A and B persist in order

Slide33

The dark side of PCOMMIT

St A = x
CLWB A
fence      (1)
PCOMMIT
fence      (2)
St B = y

Barrier: fence, PCOMMIT, fence

(1) Ensures ‘A’ is accepted to MC before PCOMMIT
(2) Ensures ‘B’ cannot reach cache before ‘A’ persists

Exposes PM latency to the core on every PCOMMIT
Frequent processor stalls

Slide34

Recent developments

Intel deprecated PCOMMIT 4 weeks ago
Requires a persistent write queue at MC

[Figure: Cores, Caches, Mem Ctrl; the persistent domain is marked]

Slide35

SO example revisited

Objective: persist A and B in order

St A = x
CLWB A
Sfence
St B = y

Ensures ‘A’ is accepted to MC before ‘B’ reaches cache

MC access latency still exposed to core (vs PM latency)

1.5x slowdown vs 7.3x

Slide36

RCBSP

Relaxed Consistency Buffered Strict Persistency
Relaxed consistency
  We use ARM v7a
Strict persistency
  Persist order = Store order
Buffering

/* Init: a = 0, b = 0; a, b are persistent */

Core-1        Core-2
St a = x      while (a==0){}
              dmb
              St b = y

St a < St b (store order) => St a < St b (persist order)

Slide37

Delegated Persist Ordering

Aasheesh Kolli, Jeff Rosen, Stephan Diestelhorst, Ali Saidi, Steven Pelley, Sihang Liu, Peter M. Chen, Thomas F. Wenisch

MICRO 2016 - Taipei

Slide38

Delegated Persist Ordering

Aasheesh Kolli, Jeff Rosen, Stephan Diestelhorst, Ali Saidi, Steven Pelley, Sihang Liu, Peter M. Chen, Thomas F. Wenisch

MICRO 2016 - Taipei

Slide39

Performance

[Figure: plot of normalized execution times]

* Execution times normalized to volatile execution

Execution times go down with lower memory latencies
Array swaps, most write intensive, suffers the most, except …
Concurrent queue, thread contention exposes CLWBs
RB tree, least write intensive, dominated by volatile execution
RCBSP consistently outperforms SO for all memory configs: 3.5x, 2.5x, 1.3x

Slide40

Memory persistency models

Allow programmers to express write order to PM
Hardware responsible for enforcing write order
Similar to memory consistency models
Synchronous Ordering (SO) – based on [Intel ‘15]

Slide41

SO barrier v. RCBSP barrier

SO (based on [Intel ‘15])        RCBSP [Pelley ‘14]
Ordering + Completion            Ordering only (mostly)
Stronger guarantee               Weaker guarantee
Easier to reason                 Harder to reason
Potentially fewer barriers       Potentially more barriers
Conservative stalling            Stall when no buffer space
Completion not always req.       Ordering mostly sufficient
Synchronous barrier              Delegation barrier

Slide42

Barriers in action

Synchronous: Epoch-1 persists before Epoch-2 begins
Delegation: Epoch-1 ordered to persist before Epoch-2

[Figure: two core timelines (Time axis, Finish marker), one per barrier type]
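The difference between the two barriers can be sketched with a back-of-the-envelope timing model. Everything here is an assumption made for illustration: the function names are hypothetical, each epoch is given a fixed execution time and persist time in arbitrary units, persists are serialized, and the persist buffer is unbounded (so the "stall when no buffer space" case is not modeled). The numbers are not taken from the evaluation.

```python
def synchronous_finish(num_epochs, exec_time, persist_time):
    """Core stalls after each epoch until that epoch's persists complete."""
    return num_epochs * (exec_time + persist_time)

def delegated_finish(num_epochs, exec_time, persist_time):
    """Core runs ahead; persists drain in order in the background.

    Returns (core finish time, time the last persist completes).
    """
    core_time, persist_done = 0, 0
    for _ in range(num_epochs):
        core_time += exec_time
        # An epoch's persist starts once it has executed and the
        # previous epoch's persist has completed.
        persist_done = max(persist_done, core_time) + persist_time
    return core_time, persist_done

print(synchronous_finish(4, 10, 30))  # 160
print(delegated_finish(4, 10, 30))    # (40, 130)
```

Under these assumptions the core with a delegation barrier finishes executing at time 40 instead of 160, while the persists still complete in epoch order.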