Presentation Transcript

Slide 1: Reducing Memory Interference in Multicore Systems

Lavanya Subramanian, Department of ECE, 11/04/2011

Slide 2: Main Memory is a Bottleneck

- Main memory latency is long, which reduces performance by stalling the core.
- In a multicore system, applications running on multiple cores share the main memory.

[Diagram: multiple cores connected to memory through a shared channel]

Slide 3: Problem of Inter-Application Interference

- Applications' requests interfere at the main memory.
- This inter-application interference degrades system performance.
- The problem is further exacerbated by:
  - fast-growing core counts
  - limited off-chip pin bandwidth

[Diagram: requests from multiple cores queuing at a shared memory channel]

Slide 4: Talk Summary

Goal: Address the problem of inter-application interference at main memory, with the goal of improving performance.

Outline of this talk:
- Background/Motivation
- Previous Approaches
- Our Approach

Slide 5: Background: Main Memory Organization

Slide 6: DRAM Main Memory Organization

[Diagram: a core connected over a channel to main memory, which is organized into banks]

Slide 7: DRAM Organization: Bank Organization

- A bank is a 2D array of DRAM cells.

[Diagram: the row decoder selects a row (4 KB) using the row address; the selected row sits in the row buffer, from which the column mux selects a column (8 bytes) using the column address]

Slide 8: DRAM Organization: Accessing Data

[Diagram: a row/column access reads the required data out of the row buffer]

Slide 9: DRAM Organization: The Row Buffer

- An access destructively reads the addressed row into the row buffer.
- The required data is then sent onto the channel.

Slide 10: DRAM Organization: Row Hit

- If the addressed row is already in the row buffer, the required data is sent onto the channel directly.

Slide 11: DRAM Organization: Row Miss

1. The data in the row buffer is written back to the array.
2. The addressed row is destructively read into the row buffer.
3. The required data is sent onto the channel.

Row miss latency = 2 x row hit latency.
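The 2x relationship can be sketched as a toy open-row latency model in Python. The 200/400-cycle figures come from the methodology slide later in the talk; the `Bank` class itself is an illustrative assumption, not the talk's simulator:

```python
# Toy open-row latency model. Cycle counts are taken from the talk's
# methodology slide (row hit: 200 cycles, row miss: 400 cycles).
ROW_HIT_CYCLES = 200
ROW_MISS_CYCLES = 400  # write-back of old row + destructive read of new row

class Bank:
    """One DRAM bank with a single row buffer."""
    def __init__(self):
        self.open_row = None  # row currently held in the row buffer

    def access(self, row):
        """Return the access latency for `row` and update the row buffer."""
        if self.open_row == row:
            return ROW_HIT_CYCLES   # row hit: data already in the buffer
        self.open_row = row         # row miss: write back, then read new row
        return ROW_MISS_CYCLES

bank = Bank()
print(bank.access(5))  # 400 (miss: row 5 not yet in the buffer)
print(bank.access(5))  # 200 (hit: row 5 is now the open row)
```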

Slide 12: The Memory Controller

- Mediates between the core and main memory.
- Buffers memory requests from the core in a request buffer.
- Re-orders and schedules requests to the main memory banks.

[Diagram: core -> memory controller (request buffer) -> channel -> Bank 0 and Bank 1]

Slide 13: FR-FCFS (Rixner et al., ISCA'00)

- FR-FCFS exploits row hits to minimize overall DRAM access latency.

[Diagram: service timelines at one bank contrasting FCFS with FR-FCFS on the same set of requests]
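The FR-FCFS policy can be sketched in a few lines of Python. This is a simplified single-bank model (one open row, an age-ordered request buffer), not Rixner et al.'s full scheduler:

```python
def fr_fcfs_pick(request_buffer, open_row):
    """FR-FCFS: first-ready (row hits) first, then first-come-first-served.
    `request_buffer` is an age-ordered list of (req_id, row) pairs."""
    for req_id, row in request_buffer:   # scan oldest to youngest
        if row == open_row:              # "first ready": a row-buffer hit
            return req_id
    # No row hits pending: fall back to plain FCFS (the oldest request).
    return request_buffer[0][0] if request_buffer else None

buf = [(1, 7), (2, 3), (3, 3)]
print(fr_fcfs_pick(buf, open_row=3))  # 2: oldest request that hits row 3
print(fr_fcfs_pick(buf, open_row=9))  # 1: no hits, so the oldest request
```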

Slide 14: Memory Scheduling in Multicore Systems

- Under FR-FCFS, Application 2's single request starves behind three of Application 1's requests.
- The low memory-intensity application (Application 2) starves behind Application 1.
- Minimizing overall DRAM access latency != maximizing system performance.

[Diagram: service timeline at one bank shared by Core 1 (App 1) and Core 2 (App 2), showing App 2's request served last]

Slide 15: Need for Application Awareness

- The memory scheduler needs to be aware of application characteristics.
- Thread Cluster Memory (TCM) scheduling (Kim et al., MICRO'10) is the current best application-aware memory scheduling policy:
  - always prioritizes low memory-intensity applications
  - shuffles between high memory-intensity applications
- Strength: provides good system performance.
- Shortcoming: high hardware complexity due to ranking and prioritization logic.

Slide 16: Modern Systems Have Multiple Channels

- Allocation of data to channels is a new degree of freedom.

[Diagram: two cores, two memory controllers, each driving its own channel to memory]

Slide 17: Interleaving Rows Across Channels

- Enables parallel access to rows on different channels.

[Diagram: consecutive rows mapped to alternating channels]

Slide 18: Interleaving Cache Lines Across Channels

- Enables finer-grained parallelism, at cache line granularity.

[Diagram: consecutive cache lines mapped to alternating channels]
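The difference between the two interleaving schemes is just which address bits select the channel. A sketch, assuming the 4 KB row from the bank-organization slide, a 64-byte cache line, and the two channels shown in the diagrams (the line size and channel count are assumptions):

```python
ROW_BYTES = 4096   # 4 KB row, per the bank-organization slide
LINE_BYTES = 64    # assumed cache-line size
NUM_CHANNELS = 2   # two channels, as in the slides' diagrams

def channel_row_interleaved(addr):
    """Row interleaving: consecutive rows map to alternating channels."""
    return (addr // ROW_BYTES) % NUM_CHANNELS

def channel_line_interleaved(addr):
    """Cache-line interleaving: consecutive lines alternate channels
    (finer-grained parallelism)."""
    return (addr // LINE_BYTES) % NUM_CHANNELS

print(channel_row_interleaved(0), channel_row_interleaved(4096))  # 0 1
print(channel_line_interleaved(0), channel_line_interleaved(64))  # 0 1
```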

Slide 19: Key Insight 1

- High memory-intensity applications interfere with low memory-intensity applications in shared memory channels.
- Solution: map the data of low and high memory-intensity applications to different channels.

[Diagram: timelines (5 time units) for App A on Core 0 and App B on Core 1 under conventional page mapping vs. channel partitioning, across Channels 0 and 1 with two banks each]

Slide 20: Key Insight 2

[Diagram: request buffer states and service orders (1-6) for requests A-F under conventional page mapping vs. channel partitioning, across two channels with two banks each; partitioning changes which requests queue behind one another and hence the order in which they are serviced]

Slide 21: Memory Channel Partitioning (MCP)

1. Profile applications
2. Classify applications into groups
3. Partition available channels between groups
4. Assign a preferred channel to each application
5. Allocate application pages to the preferred channel

(The steps are split between hardware and system software.)

Slide 22: Profile/Classify Applications

Profiling:
- Collect last-level-cache Misses Per Kilo Instruction (MPKI) and row-buffer hit rate (RBH) of applications online.

Classification:
- MPKI <= MPKI_t: low intensity
- MPKI > MPKI_t and RBH <= RBH_t: high intensity, low row-buffer locality
- MPKI > MPKI_t and RBH > RBH_t: high intensity, high row-buffer locality
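The classification above amounts to a two-level decision tree. A minimal sketch; the default threshold values are hypothetical, since the talk names MPKI_t and RBH_t without giving numbers:

```python
def classify(mpki, rbh, mpki_t=10.0, rbh_t=0.5):
    """Classify an application first by memory intensity (MPKI), then by
    row-buffer locality (RBH). Threshold defaults are illustrative
    assumptions, not values from the talk."""
    if mpki <= mpki_t:
        return "low intensity"
    if rbh > rbh_t:
        return "high intensity, high row-buffer locality"
    return "high intensity, low row-buffer locality"

print(classify(mpki=0.5, rbh=0.9))   # low intensity
print(classify(mpki=30.0, rbh=0.2))  # high intensity, low row-buffer locality
```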

Slide 23: Partition Between Low and High Intensity Groups

- Assign channels to the low- and high-intensity groups in proportion to the number of applications in each group.

[Diagram: Channels 1 and 2 assigned to the low-intensity group, Channels 3 and 4 to the high-intensity group]

Slide 24: Partition Between Low and High RBH Groups

- Within the high-intensity group, assign channels in proportion to the bandwidth demand of the low and high row-buffer-locality subgroups.

[Diagram: Channels 3 and 4 split between the high-intensity/low-RBH and high-intensity/high-RBH subgroups]
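Both partitioning steps divide a set of channels between two groups in proportion to some weight: application count for the intensity split, bandwidth demand for the row-buffer-locality split. A minimal sketch of that shared operation; the rounding rule and one-channel-minimum are assumptions:

```python
def split_channels(channels, share_a, share_b):
    """Split `channels` between two groups in proportion to their shares
    (application count or bandwidth demand), keeping at least one channel
    per group. The exact rounding is an illustrative assumption."""
    n = round(len(channels) * share_a / (share_a + share_b))
    n = min(max(n, 1), len(channels) - 1)  # neither group goes empty
    return channels[:n], channels[n:]

# Intensity split: 2 low-intensity vs. 2 high-intensity applications.
low, high = split_channels([1, 2, 3, 4], share_a=2, share_b=2)
print(low, high)  # [1, 2] [3, 4]
```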

Slide 25: Preferred Channel Assignment/Allocation

Preferred channel assignment:
- Load-balance each group's bandwidth demand across the group's allocated channels.
- Each application now has a preferred channel.

Page allocation to the preferred channel on first touch:
- The operating system assigns a page to the preferred channel if a free page is available.
- Otherwise, a modified replacement policy preferentially chooses a replacement candidate from the preferred channel.
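The load-balancing step can be sketched as a greedy assignment: walk the group's applications and give each one the currently least-loaded of the group's channels. The greedy heuristic and the heaviest-first ordering are illustrative assumptions, not the talk's exact algorithm:

```python
def assign_preferred(app_bandwidth, group_channels):
    """Greedily balance a group's bandwidth demand across its channels.
    `app_bandwidth` maps application -> bandwidth demand (e.g., MPKI)."""
    load = {ch: 0.0 for ch in group_channels}
    preferred = {}
    # Heaviest applications first, so large demands get spread out early.
    for app, bw in sorted(app_bandwidth.items(), key=lambda kv: -kv[1]):
        ch = min(load, key=load.get)  # least-loaded channel so far
        preferred[app] = ch
        load[ch] += bw
    return preferred

print(assign_preferred({"A": 10.0, "B": 5.0, "C": 1.0}, [0, 1]))
```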

Slide 26: Integrating Partitioning and Scheduling

- Inter-application interference can be mitigated by memory scheduling, by memory partitioning, or by integrated memory partitioning and scheduling.

Slide 27: Integrated Memory Partitioning and Scheduling (IMPS)

- Applications with very low memory intensity (< 1 MPKI) do not need dedicated bandwidth; in fact, dedicating bandwidth to them results in wastage.
- These applications need short access latencies and interfere minimally with other applications.
- Solution: always prioritize them in the scheduler; handle all other applications via memory channel partitioning.
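The IMPS scheduling rule itself is tiny. A sketch, with the request-buffer representation and the oldest-first fallback as assumptions (in the real design the fallback is the underlying scheduling policy, with other applications handled by channel partitioning):

```python
VERY_LOW_MPKI = 1.0  # the slide's "< 1 MPKI" threshold

def imps_pick(request_buffer, mpki_of):
    """Always prioritize requests from very-low-intensity applications;
    everything else falls back to the baseline (here, oldest-first) order.
    `request_buffer` is an age-ordered list of (req_id, app) pairs."""
    for req_id, app in request_buffer:
        if mpki_of[app] < VERY_LOW_MPKI:
            return req_id  # very low intensity: serve immediately
    return request_buffer[0][0] if request_buffer else None

buf = [(1, "A"), (2, "B")]
print(imps_pick(buf, {"A": 25.0, "B": 0.4}))  # 2: app B is below 1 MPKI
```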

Slide 28: Methodology

Core model:
- 4 GHz out-of-order processor
- 128-entry instruction window
- 512 KB cache per core

Memory model:
- DDR2, 1 GB capacity
- 4 channels, 4 banks/channel
- Row interleaved
- Row hit: 200 cycles; row miss: 400 cycles

Slide 29: Comparison to Previous Scheduling Policies

- MCP performs 1% better than TCM (the best previous scheduler) with no extra hardware complexity.
- IMPS performs 5% better than TCM with minimal extra hardware complexity.
- Both perform consistently well across all intensity categories.

Slide 30: Comparison to AFT/DPM (Awasthi et al., PACT'11)

- MCP and IMPS outperform AFT and DPM by 7% and 12.4%, respectively, across 40 workloads.
- Application-aware page allocation mitigates inter-application interference better.

Slide 31: Future Work

- Further exploration of integrated memory partitioning and scheduling for system performance
- Integrated partitioning and scheduling for fairness
- Workload-aware memory scheduling