Reducing Memory Interference in Multicore Systems
Lavanya Subramanian
Department of ECE
11/04/2011
Main Memory is a Bottleneck

[Diagram: multiple cores sharing a channel to main memory]

- Main memory latency is long, stalling cores and reducing performance.
- In a multicore system, applications running on multiple cores share the main memory.
Problem of Inter-Application Interference

[Diagram: requests from two cores contending on a shared channel to memory]

- Applications' requests interfere at the main memory.
- This inter-application interference degrades system performance.
- The problem is further exacerbated by:
  - Fast-growing core counts
  - Limited off-chip pin bandwidth
Talk Summary

Goal: Address the problem of inter-application interference at main memory, with the goal of improving performance.

Outline of this talk:
- Background/Motivation
- Previous Approaches
- Our Approach
Background: Main Memory Organization
DRAM Main Memory Organization

[Diagram: a core connected over a channel to main memory, which is divided into banks]
DRAM Organization: Bank Organization

[Diagram: a bank as a 2D array of DRAM cells, with a row decoder, row buffer, and column mux]

- A bank is a 2D array of DRAM cells.
- The row address drives the row decoder, which selects one row (4 KB) into the row buffer.
- The column address drives the column mux, which selects one column (8 bytes) out of the row buffer.
DRAM Organization: Accessing Data

[Diagram: a row address selects a row into the row buffer; a column address then selects the required data]
DRAM Organization: The Row Buffer

[Diagram: the selected row is destructively read into the row buffer; the required data is then sent onto the channel]
DRAM Organization: Row Hit

[Diagram: the requested row is already in the row buffer, so the required data is sent directly onto the channel]
DRAM Organization: Row Miss

[Diagram: 1. the data in the row buffer is written back to the array; 2. the requested row is destructively read into the row buffer; the required data is then sent onto the channel]

Row miss latency = 2 x row hit latency
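The hit/miss behavior above can be sketched as a toy per-bank model. This is a minimal illustration, not from the talk: the class and method names are invented, and the 200/400-cycle latencies are taken from the methodology slide later in this deck.

```python
# Toy model of per-bank row-buffer behavior (illustrative sketch).
ROW_HIT_CYCLES = 200   # requested row already open in the row buffer
ROW_MISS_CYCLES = 400  # write back the open row, then read the new row

class Bank:
    def __init__(self):
        self.open_row = None  # row currently held in the row buffer

    def access(self, row):
        """Return the latency (in cycles) of accessing `row`."""
        if row == self.open_row:
            return ROW_HIT_CYCLES    # row hit: data sent straight from the row buffer
        self.open_row = row          # row miss: write-back + destructive read
        return ROW_MISS_CYCLES

bank = Bank()
print(bank.access(row=7))  # first access to row 7: miss -> 400
print(bank.access(row=7))  # same row again: hit -> 200
print(bank.access(row=3))  # different row: miss -> 400
```

The 2x relationship falls out directly: a miss pays both the write-back and the destructive read, each comparable to one row-buffer access.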
The Memory Controller

[Diagram: a core's requests flow through the memory controller's request buffer, over the channel, to the main memory banks]

- Medium between the core and the main memory
- Buffers memory requests from the core in a request buffer
- Re-orders and schedules requests to the main memory banks
FR-FCFS (Rixner et al., ISCA'00)

- Exploits row hits to minimize overall DRAM access latency.

[Diagram: service timelines for the same bank request queue under FCFS and FR-FCFS; FR-FCFS reorders row hits ahead of other requests and finishes sooner]
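The FR-FCFS selection rule for one bank's queue can be sketched in a few lines. This is a simplified illustration; the request representation (arrival time, row) is an assumption, not the controller's actual structure.

```python
# Sketch of FR-FCFS: first-ready (row hits) over others,
# FCFS (oldest first) as the tie-breaker.

def frfcfs_pick(queue, open_row):
    """queue: list of (arrival_time, row), oldest first.
    Returns the index of the request serviced next."""
    # 1. Prefer the oldest request that hits the currently open row.
    for i, (_, row) in enumerate(queue):
        if row == open_row:
            return i
    # 2. Otherwise fall back to plain FCFS: the oldest request.
    return 0

# Three queued requests to row 5, one to row 9:
queue = [(0, 5), (1, 5), (2, 5), (1, 9)]
print(frfcfs_pick(queue, open_row=5))  # -> 0 (oldest row hit served first)
print(frfcfs_pick(queue, open_row=9))  # -> 3 (the request to row 9 is the hit)
```

Note that when row 5 is open, the request to row 9 waits until all row hits drain, which is exactly the starvation scenario on the next slide.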
Memory Scheduling in Multicore Systems

[Diagram: under FR-FCFS, Application 2's single request starves behind three of Application 1's requests to the same bank]

- The low-memory-intensity Application 2 starves behind Application 1.
- Minimizing overall DRAM access latency != maximizing system performance.
Need for Application Awareness

- The memory scheduler needs to be aware of application characteristics.
- Thread Cluster Memory (TCM) scheduling (Kim et al., MICRO'10) is the current best application-aware memory scheduling policy:
  - Always prioritizes low-memory-intensity applications
  - Shuffles between high-memory-intensity applications
- Strength: provides good system performance
- Shortcoming: high hardware complexity due to ranking and prioritization logic
Modern Systems Have Multiple Channels

[Diagram: two cores, two memory controllers, each driving its own channel to a memory]

- Allocation of data to channels is a new degree of freedom.
Interleaving Rows Across Channels

[Diagram: consecutive rows mapped to alternate channels]

- Enables parallelism in access to rows on different channels.
Interleaving Cache Lines Across Channels

[Diagram: consecutive cache lines mapped to alternate channels]

- Enables finer-grained parallelism, at the cache-line granularity.
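The two interleaving schemes differ only in which address bits pick the channel. A minimal sketch, assuming 4 KB rows, 64-byte cache lines, and 2 channels (all illustrative sizes, not stated on these slides):

```python
# Address-to-channel mapping under the two interleaving schemes.
ROW_BYTES = 4096     # assumed row size (matches the 4 KB row shown earlier)
LINE_BYTES = 64      # assumed cache-line size
NUM_CHANNELS = 2

def channel_row_interleaved(addr):
    # Consecutive 4 KB rows alternate between channels.
    return (addr // ROW_BYTES) % NUM_CHANNELS

def channel_line_interleaved(addr):
    # Consecutive 64-byte cache lines alternate between channels.
    return (addr // LINE_BYTES) % NUM_CHANNELS

# Two consecutive cache lines within the same row:
print(channel_row_interleaved(0), channel_row_interleaved(64))    # same channel
print(channel_line_interleaved(0), channel_line_interleaved(64))  # different channels
```

Row interleaving keeps a whole row in one channel (preserving row-buffer locality); cache-line interleaving spreads even a single row's lines across channels for finer-grained parallelism.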
Key Insight 1

- High-memory-intensity applications interfere with low-memory-intensity applications in shared memory channels.
- Solution: map the data of low- and high-memory-intensity applications to different channels.

[Diagram: with conventional page mapping, App A (Core 0) and App B (Core 1) spread requests across the banks of both channels and interfere over 5 time units; with channel partitioning, each application's data is confined to its own channel and the interference disappears]
Key Insight 2

[Diagram: request buffer states and resulting service orders (1-6) for requests A-E under conventional page mapping vs channel partitioning, across two channels with two banks each; channel partitioning changes how requests queue behind one another and improves the service order]
Memory Channel Partitioning (MCP)

1. Profile applications (hardware)
2. Classify applications into groups
3. Partition available channels between groups
4. Assign a preferred channel to each application
5. Allocate application pages to the preferred channel

Step 1 is done in hardware; steps 2-5 are done in system software.
Profile/Classify Applications

Profiling:
- Collect last-level cache Misses Per Kilo Instruction (MPKI) and row-buffer hit rate (RBH) of applications online.

Classification:
- MPKI > MPKI_t? No: Low Intensity. Yes: High Intensity.
- For high-intensity applications, RBH > RBH_t? No: Low Row-Buffer Locality. Yes: High Row-Buffer Locality.
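The classification step above can be sketched as a small decision function. The threshold values below are illustrative placeholders; the slides only name them MPKI_t and RBH_t.

```python
# Sketch of MCP's application classification (threshold values assumed).
MPKI_T = 10.0   # memory-intensity threshold (placeholder value)
RBH_T = 0.5     # row-buffer hit-rate threshold (placeholder value)

def classify(mpki, rbh):
    """Classify an application from its online-profiled MPKI and RBH."""
    if mpki <= MPKI_T:
        return "low-intensity"
    # High-intensity applications are further split by row-buffer locality.
    if rbh > RBH_T:
        return "high-intensity, high-RBH"
    return "high-intensity, low-RBH"

print(classify(mpki=0.8, rbh=0.9))   # low-intensity
print(classify(mpki=25.0, rbh=0.9))  # high-intensity, high-RBH
print(classify(mpki=25.0, rbh=0.2))  # high-intensity, low-RBH
```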
Partition Between Low and High Intensity Groups

[Diagram: the four channels divided between the low-intensity and high-intensity groups]

- Assign channels proportional to the number of applications in each group.
Partition Between Low and High RBH Groups

[Diagram: the high-intensity group's channels (Channels 3 and 4) divided between its low row-buffer-locality and high row-buffer-locality subgroups]

- Assign channels proportional to the bandwidth demand of each group.
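Both partitioning rules on these two slides reduce to splitting a set of channels between two groups in proportion to a weight (application count in one case, bandwidth demand in the other). A minimal sketch, with the function name and the rounding/minimum-share choices as assumptions:

```python
# Proportional channel partitioning (illustrative sketch).

def split_channels(n_channels, weight_a, weight_b):
    """Split n_channels between two groups in proportion to their
    weights, giving each group at least one channel."""
    share_a = round(n_channels * weight_a / (weight_a + weight_b))
    share_a = min(max(share_a, 1), n_channels - 1)
    return share_a, n_channels - share_a

# 4 channels, 2 low-intensity apps vs 2 high-intensity apps:
low, high = split_channels(4, weight_a=2, weight_b=2)
print(low, high)  # 2 2

# The high-intensity group's channels, split by bandwidth demand
# (hypothetical demands of 3 GB/s each for the two RBH subgroups):
low_rbh, high_rbh = split_channels(high, weight_a=3.0, weight_b=3.0)
print(low_rbh, high_rbh)  # 1 1
```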
Preferred Channel Assignment/Allocation

Preferred channel assignment:
- Load-balance each group's bandwidth demand across the group's allocated channels.
- Each application now has a preferred channel.

Page allocation to the preferred channel on first touch:
- The operating system assigns a page to the preferred channel if a free page is available there.
- Else, a modified replacement policy preferentially chooses a replacement candidate from the preferred channel.
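The first-touch allocation policy above can be sketched as follows. The free-list representation and the simple eviction fallback are simplified illustrations of the modified replacement policy, not the OS's actual data structures.

```python
# Sketch of first-touch page allocation to the preferred channel.

def allocate_page(free_pages, used_pages, preferred):
    """free_pages/used_pages: dict mapping channel -> list of frames.
    Returns (channel, frame) for the newly touched page."""
    if free_pages[preferred]:           # free frame on the preferred channel
        return preferred, free_pages[preferred].pop()
    if used_pages[preferred]:           # else evict a candidate from it
        return preferred, used_pages[preferred].pop(0)
    # Last resort: a free frame on any other channel.
    for ch, frames in free_pages.items():
        if frames:
            return ch, frames.pop()
    raise MemoryError("no frames available")

free_pages = {0: [100], 1: []}
used_pages = {0: [], 1: [200]}
print(allocate_page(free_pages, used_pages, preferred=0))  # (0, 100): free frame
print(allocate_page(free_pages, used_pages, preferred=1))  # (1, 200): evicted frame
```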
Integrating Partitioning and Scheduling

Inter-application interference mitigation:
- Memory scheduling
- Memory partitioning
- Integrated memory partitioning and scheduling
Integrated Memory Partitioning and Scheduling (IMPS)

- Applications with very low memory intensities (< 1 MPKI) do not need dedicated bandwidth; in fact, dedicating bandwidth to them results in wastage.
- These applications need short access latencies and interfere minimally with other applications.
- Solution:
  - Always prioritize them in the scheduler.
  - Handle other applications via memory channel partitioning.
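The IMPS scheduling rule can be sketched as one extra priority layer on top of FR-FCFS: requests from very-low-intensity applications (< 1 MPKI) always win; the rest fall back to row-hit-first, oldest-first. The request fields are illustrative assumptions.

```python
# Sketch of the IMPS request-selection rule for one bank's queue.
VERY_LOW_MPKI = 1.0  # the < 1 MPKI cutoff from this slide

def imps_pick(queue, open_row, app_mpki):
    """queue: list of (arrival_time, app, row), oldest first.
    Returns the index of the request serviced next."""
    def rank(i):
        t, app, row = queue[i]
        return (
            app_mpki[app] >= VERY_LOW_MPKI,  # very-low-intensity apps first
            row != open_row,                 # then row hits (FR-FCFS)
            t,                               # then oldest first (FCFS)
        )
    return min(range(len(queue)), key=rank)

app_mpki = {"A": 25.0, "B": 0.3}
queue = [(0, "A", 5), (1, "A", 5), (2, "B", 9)]
# B's request jumps the queue despite arriving last and missing the row:
print(imps_pick(queue, open_row=5, app_mpki=app_mpki))  # -> 2
```

This mirrors the starvation example from the FR-FCFS slides: the single request from the low-intensity application no longer waits behind the other application's row hits.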
Methodology

Core model:
- 4 GHz out-of-order processor
- 128-entry instruction window
- 512 KB cache/core

Memory model:
- DDR2, 1 GB capacity
- 4 channels, 4 banks/channel
- Row-interleaved
- Row hit: 200 cycles; row miss: 400 cycles
Comparison to Previous Scheduling Policies

- MCP performs 1% better than TCM (the best previous scheduler) at no extra hardware complexity.
- IMPS performs 5% better than TCM at minimal extra hardware complexity.
- Both perform consistently well across all intensity categories.
Comparison to AFT/DPM (Awasthi et al., PACT'11)

- MCP/IMPS outperform AFT and DPM by 7%/12.4% respectively (across 40 workloads).
- Application-aware page allocation mitigates inter-application interference better.
Future Work

- Further exploration of integrated memory partitioning and scheduling for system performance
- Integrated partitioning and scheduling for fairness
- Workload-aware memory scheduling