Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman S. Ünsal
Slide1
Redundant Memory Mappings for Fast Access to Large Memories
Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman S. Ünsal
Slide2
Executive Summary
Problem: Virtual memory overheads are high (up to 41%)
Proposal: Redundant Memory Mappings
  Propose compact representation called range translation
  Range Translation: arbitrarily large contiguous mapping
  Effectively cache, manage and facilitate range translations
  Retain flexibility of 4KB paging
Result: Reduces overheads of virtual memory to less than 1%
Slide3
Outline
Motivation
  Virtual Memory Refresher + Key Technology Trends
  Previous Approaches
  Goals + Key Observation
Design: Redundant Memory Mappings
Results
Conclusion
Slide4
Virtual Memory Refresher
[Figure labels: TLB (Translation Lookaside Buffer), Process 1, Process 2, Virtual Address Space, Physical Memory, Page Table]
Challenge: How to reduce costly page walks?
Slide5
Two Technology Trends
[Figure footnote: *Inflation-adjusted 2011 USD, from: jcmit.com]
TLB reach is limited

Year  Processor  L1 DTLB entries
1999  Pent. III  72
2001  Pent. 4    64
2008  Nehalem    96
2012  IvyBridge  100
2015  Broadwell  100
Slide6
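To make "TLB reach is limited" concrete, the following minimal sketch multiplies the L1 DTLB entry counts from the table above by a page size. It assumes every entry maps exactly one 4KB page and ignores any large-page entries; the function name `tlb_reach_bytes` is made up for this illustration.

```python
# L1 DTLB entry counts copied from the table above.
L1_DTLB_ENTRIES = {
    "Pent. III (1999)": 72,
    "Pent. 4 (2001)": 64,
    "Nehalem (2008)": 96,
    "IvyBridge (2012)": 100,
    "Broadwell (2015)": 100,
}

PAGE_SIZE_BYTES = 4 * 1024  # assumption: each entry maps one 4KB page


def tlb_reach_bytes(entries, page_size=PAGE_SIZE_BYTES):
    """Memory covered when every TLB entry maps one page (hypothetical helper)."""
    return entries * page_size


for name, entries in L1_DTLB_ENTRIES.items():
    print(f"{name}: {tlb_reach_bytes(entries) // 1024} KB of reach")
```

Under these assumptions, 100 entries cover only 400 KB, which is tiny next to memories measured in gigabytes.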
0. Page-based Translation
[Figure labels: Virtual Memory, Physical Memory; TLB entry: VPN0 -> PFN0]
Slide7
1. Multipage Mapping
Sub-blocked TLB/CoLT, Clustered TLB [ASPLOS’94, MICRO’12 and HPCA’14]
[Figure labels: Virtual Memory, Physical Memory; TLB entry: VPN(0-3), PFN(0-3), Bitmap, Map]
Slide8
2. Large Pages
[Transparent Huge Pages and libhugetlbfs]
[Figure labels: Virtual Memory, Physical Memory; Large Page TLB entry: VPN0 -> PFN0]
Slide9
3. Direct Segments
[ISCA’13 and MICRO’14]
[Figure labels: Virtual Memory, Physical Memory; Direct Segment: (BASE, LIMIT) -> OFFSET; registers BASE, LIMIT, OFFSET]
Slide10
Can we get best of many worlds?
Approaches: Multipage Mapping, Large Pages, Direct Segments, Our Proposal
Compared on:
  Flexible alignment
  Arbitrary reach
  Multiple entries
  Transparent to applications
  Applicable to all workloads
Slide11
Key Observation
[Figure labels: Virtual Memory, Physical Memory]
Slide12
Key Observation
Large contiguous regions of virtual memory
Limited in number: only a handful
[Figure labels: Virtual Memory (Code, Heap, Stack, Shared Lib.), Physical Memory]
Slide13
Compact Representation: Range Translation
[Figure labels: Virtual Memory, Physical Memory; Range Translation 1: BASE1, LIMIT1, OFFSET1]
Range Translation: a mapping from contiguous virtual pages to contiguous physical pages with uniform protection
Slide14
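The range translation defined above can be sketched as a small record with a base, a limit, an offset, and one protection setting. This is only an illustration: the field encodings (page numbers rather than byte addresses, an exclusive limit, a string for protection) and the names `RangeTranslation`, `covers`, and `translate` are assumptions, not the authors' format.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12  # 4KB pages


@dataclass(frozen=True)
class RangeTranslation:
    base: int    # first virtual page number in the range (inclusive)
    limit: int   # one past the last virtual page number (exclusive; assumed encoding)
    offset: int  # physical page number minus virtual page number
    prot: str    # one protection setting shared by the whole range

    def covers(self, vpn):
        """True if the virtual page number falls inside [base, limit)."""
        return self.base <= vpn < self.limit

    def translate(self, vaddr):
        """Translate a virtual byte address to a physical byte address."""
        vpn = vaddr >> PAGE_SHIFT
        if not self.covers(vpn):
            raise ValueError("address is outside this range translation")
        pfn = vpn + self.offset
        page_offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        return (pfn << PAGE_SHIFT) | page_offset


# Hypothetical range: virtual pages 0x1000..0x13FF map to physical pages 0x8000..0x83FF.
rt = RangeTranslation(base=0x1000, limit=0x1400, offset=0x8000 - 0x1000, prot="rw")
paddr = rt.translate((0x1234 << PAGE_SHIFT) | 0xABC)
```

A single record of this kind covers 1024 pages here, and nothing in the representation ties the range size to a power of two or to an alignment boundary.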
Redundant Memory Mappings
[Figure labels: Virtual Memory, Physical Memory; Range Translation 1 through Range Translation 5]
Map most of process’s virtual address space redundantly with modest number of range translations in addition to page mappings
Slide15
Outline
Motivation
Design: Redundant Memory Mappings
  A. Caching Range Translations
  B. Managing Range Translations
  C. Facilitating Range Translations
Results
Conclusion
Slide16
A. Caching Range Translations
[Figure labels: V47 … V12, P47 … P12; L1 DTLB, L2 DTLB, Range TLB, Page Table Walker, Enhanced Page Table Walker]
Slide17
A. Caching Range Translations
[Figure labels: Hit; V47 … V12, P47 … P12; L1 DTLB, Range TLB, Enhanced Page Table Walker, L2 DTLB]
Slide18
A. Caching Range Translations
[Figure labels: Miss, Hit, Refill; V47 … V12, P47 … P12; L1 DTLB, Range TLB, Enhanced Page Table Walker, L2 DTLB]
Slide19
A. Caching Range Translations
[Figure labels: Miss, Hit, Refill; V47 … V12, P47 … P12; L1 DTLB, Range TLB, Enhanced Page Table Walker, L2 DTLB]
Slide20
A. Caching Range Translations
[Figure labels: Miss, Hit, Refill; V47 … V12, P47 … P12; L1 DTLB, Range TLB, L2 DTLB]
Range TLB entries:
  Entry 1: BASE 1 (≤), LIMIT 1 (>), OFFSET 1, Protection 1
  Entry N: BASE N (≤), LIMIT N (>), OFFSET N, Protection N
L1 TLB Entry Generator Logic: (Virtual Address + OFFSET), Protection
Slide21
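The Range TLB lookup and the L1 TLB entry generator shown above can be sketched in software as follows. This is a hedged illustration only: hardware compares all entries in parallel, whereas the loop here is sequential, and the FIFO eviction, the page-number encoding, and the names `RangeEntry` and `RangeTLB` are assumptions. The 32-entry capacity follows the fully-associative configuration assumed later in the results.

```python
from collections import namedtuple

# All fields are in units of 4KB page numbers (assumed encoding).
RangeEntry = namedtuple("RangeEntry", "base limit offset prot")


class RangeTLB:
    """Software sketch of a small fully-associative Range TLB."""

    def __init__(self, capacity=32):
        self.capacity = capacity
        self.entries = []

    def insert(self, entry):
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)  # FIFO eviction: an assumption, not from the slides
        self.entries.append(entry)

    def lookup(self, vpn):
        """On a hit, generate an L1 TLB entry (vpn, pfn, prot); on a miss, return None."""
        for e in self.entries:
            # Hit condition from the slide: BASE <= address and LIMIT > address.
            if e.base <= vpn < e.limit:
                return (vpn, vpn + e.offset, e.prot)
        return None


rtlb = RangeTLB()
rtlb.insert(RangeEntry(base=0x100, limit=0x200, offset=0x900, prot="rw"))
hit = rtlb.lookup(0x150)   # inside the range
miss = rtlb.lookup(0x200)  # LIMIT is exclusive, so this misses
```

On a hit, the generated tuple is what would be refilled into the L1 DTLB, so later accesses to that page hit in L1 without consulting the Range TLB again.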
A. Caching Range Translations
[Figure labels: Miss, Miss, Miss; V47 … V12, P47 … P12; L1 DTLB, Range TLB, Enhanced Page Table Walker, L2 DTLB]
Slide22
B. Managing Range Translations
Stores all the range translations in an OS-managed structure
Per-process, like the page table
[Figure labels: Range Table; CR-RT; RTC, RTD, RTF, RTG, RTA, RTB, RTE]
Slide23
B. Managing Range Translations
On an L2+Range TLB miss, what structure to walk?
  A) Page Table
  B) Range Table
  C) Both A) and B)
  D) Either?
Is a virtual page part of a range? Not known at a miss
Slide24
B. Managing Range Translations
Redundancy to the rescue
One bit in page table entry denotes that page is part of a range
Page Table Walk:
  1. Insert into L1 TLB
  2. Application resumes memory access
  3. Range Table Walk (Background): Insert into Range TLB
[Figure labels: Part of a range; CR-3; CR-RT; RTC, RTD, RTF, RTG, RTA, RTB, RTE]
Slide25
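The miss-handling steps above can be sketched as a single function. Everything about the data layout is an assumption made for illustration: the page table is a dictionary from virtual page number to `(pfn, prot, in_range_bit)`, the range table is a plain list of `(base, limit, offset, prot)` tuples rather than whatever structure the OS actually uses, and the background walk is simply run inline.

```python
def handle_tlb_miss(vpn, page_table, range_table, l1_tlb, range_tlb):
    """Sketch of the walk performed after an L2 + Range TLB miss."""
    # Page table walk: always done first, since the page table maps every page.
    pfn, prot, in_range = page_table[vpn]

    # 1. Insert the page translation into the L1 TLB.
    l1_tlb[vpn] = (pfn, prot)

    # 2. The application could resume its memory access at this point.

    # 3. Range table walk, only if the page table entry's bit says the page
    #    belongs to a range. In hardware this happens in the background.
    if in_range:
        for base, limit, offset, rprot in range_table:
            if base <= vpn < limit:
                range_tlb.append((base, limit, offset, rprot))
                break
    return pfn


# Hypothetical state: page 0x10 belongs to a range, page 0x50 does not.
page_table = {0x10: (0x90, "rw", True), 0x50: (0x33, "r", False)}
range_table = [(0x10, 0x20, 0x80, "rw")]
l1_tlb, range_tlb = {}, []

pfn_a = handle_tlb_miss(0x10, page_table, range_table, l1_tlb, range_tlb)
pfn_b = handle_tlb_miss(0x50, page_table, range_table, l1_tlb, range_tlb)
```

Because the page table holds every mapping redundantly, the miss is always resolved by the page walk alone; the range table walk is an optimization for future accesses and stays off the critical path.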
C. Facilitating Range Translations
Demand Paging
Does not facilitate physical page contiguity for range creation
[Figure labels: Virtual Memory, Physical Memory]
Slide26
C. Facilitating Range Translations
Eager Paging
Allocate physical pages when virtual memory is allocated
Increases range sizes
Reduces number of ranges
[Figure labels: Virtual Memory, Physical Memory]
Slide27
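The contrast between demand paging and eager paging can be illustrated with a toy simulation. It assumes a trivial allocator that hands out the next free physical frame, which is far simpler than a real kernel allocator; the function names and the example regions and touch order are hypothetical.

```python
def count_ranges(mapping):
    """Count maximal runs where virtual and physical page numbers both advance by 1."""
    ranges = 0
    prev = None
    for vpn in sorted(mapping):
        pfn = mapping[vpn]
        if prev is None or vpn != prev[0] + 1 or pfn != prev[1] + 1:
            ranges += 1
        prev = (vpn, pfn)
    return ranges


def demand_paging(touch_order):
    """Allocate a physical frame only when a virtual page is first touched."""
    mapping, next_frame = {}, 0
    for vpn in touch_order:
        if vpn not in mapping:
            mapping[vpn] = next_frame
            next_frame += 1
    return mapping


def eager_paging(regions):
    """Allocate all physical frames for a region when the region is created."""
    mapping, next_frame = {}, 0
    for start, npages in regions:
        for i in range(npages):
            mapping[start + i] = next_frame + i
        next_frame += npages
    return mapping


regions = [(0, 4), (100, 4)]                 # two 4-page virtual regions
touches = [0, 100, 1, 101, 2, 102, 3, 103]   # accesses interleave the regions

demand_ranges = count_ranges(demand_paging(touches))
eager_ranges = count_ranges(eager_paging(regions))
print(demand_ranges, eager_ranges)
```

In this toy setup the interleaved touches scatter each region across physical memory under demand paging, while eager paging yields one range per region.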
Outline
Motivation
Design: Redundant Memory Mappings
Results
  Methodology
  Performance Results
  Virtual Contiguity
Conclusion
Slide28
Methodology
Measure cost of page walks on real hardware
  Intel 12-core Sandy Bridge with 96GB memory
  64-entry L1 TLB + 512-entry L2 TLB, 4-way associative, for 4KB pages
  32-entry L1 TLB, 4-way associative, for 2MB pages
Prototype Eager Paging and Emulator in Linux v3.15.5
  BadgerTrap for online analysis of TLB misses and emulation of the Range TLB
Linear model to predict performance
Workloads
  Big-memory workloads, SPEC 2006, BioBench, PARSEC
Slide29
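The slides do not spell out the linear model. One plausible form, assumed here purely for illustration, scales the page-walk cycles measured on real hardware by the fraction of TLB misses that the emulated design still incurs. The function names and all numbers below are made up.

```python
def predicted_walk_cycles(measured_walk_cycles, measured_misses, emulated_misses):
    """Assumed linear model: walk cycles scale with the number of remaining misses."""
    if measured_misses == 0:
        return 0.0
    return measured_walk_cycles * emulated_misses / measured_misses


def overhead_percent(walk_cycles, total_cycles, measured_walk_cycles):
    """Walk cycles relative to an ideal run with no page-walk cycles at all."""
    ideal_cycles = total_cycles - measured_walk_cycles
    return 100.0 * walk_cycles / ideal_cycles


# Hypothetical measurements for one workload (not data from the paper).
total_cycles = 1200
measured_walk_cycles = 200
measured_misses = 1000
emulated_misses = 50  # misses left once the emulated Range TLB filters hits

baseline = overhead_percent(measured_walk_cycles, total_cycles, measured_walk_cycles)
modeled_cycles = predicted_walk_cycles(measured_walk_cycles, measured_misses, emulated_misses)
modeled = overhead_percent(modeled_cycles, total_cycles, measured_walk_cycles)
print(baseline, modeled)
```

The point of such a model is that only the miss count needs to come from the emulator; the per-miss cost is taken from hardware performance counters.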
Comparisons
4KB: Baseline using 4KB paging
THP: Transparent Huge Pages using 2MB paging [Transparent Huge Pages]
CTLB: Clustered TLB with clusters of eight 4KB entries [HPCA’14]
DS: Direct Segments [ISCA’13 and MICRO’14]
RMM: Our proposal: Redundant Memory Mappings [ISCA’15]
Slide30
Performance Results
Measured using performance counters
Modeled based on emulator
5/14 workloads; rest in paper
Assumptions:
  CTLB: 512 entry fully-associative
  RMM: 32 entry fully-associative
  Both in parallel with L2
Slide31
Performance Results
Overheads of using 4KB pages are very high
Slide32
Performance Results
Clustered TLB works well, but limited by 8x reach
Slide33
Performance Results
2MB page helps with 512x reach: Overheads not very low
Slide34
Performance Results
Direct Segment perfect for some but not all workloads
Slide35
Performance Results
RMM achieves low overheads robustly across all workloads
Slide36
Why low overheads? Virtual Contiguity
Benchmark   Paging: 4KB + 2MB THP   Ideal RMM: # of ranges   Ideal RMM: # of ranges to cover more than 99% of memory
cactusADM   1365 + 333              112                      49
canneal     10016 + 359             77                       4
graph500    8983 + 35725            86                       3
mcf         1737 + 839              55                       1
tigr        28299 + 235             163                      36

Paging: 1000s of TLB entries required
Only 10s-100s of ranges per application
Only few ranges for 99% coverage
Slide37
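The "ranges to cover more than 99% of memory" column can be understood through a simple greedy count: take ranges from largest to smallest until their combined size exceeds 99% of the mapped memory. The sketch below is an assumed way to compute such a number, not the authors' tooling, and the range sizes in the example are invented.

```python
def ranges_for_coverage(range_sizes, target=0.99):
    """Smallest number of ranges, largest first, covering more than `target` of memory."""
    total = sum(range_sizes)
    covered = 0
    for count, size in enumerate(sorted(range_sizes, reverse=True), start=1):
        covered += size
        if covered > target * total:
            return count
    return len(range_sizes)


# Hypothetical range sizes in pages for one process (not data from the paper).
sizes = [6000, 3000, 950, 30, 15, 5]
needed = ranges_for_coverage(sizes)
print(f"{needed} of {len(sizes)} ranges cover more than 99% of memory")
```

When a few large regions such as the heap dominate the address space, this count stays small even if the process has many tiny ranges, which is why a small Range TLB can be effective.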
Summary
Problem: Virtual memory overheads are high
Proposal: Redundant Memory Mappings
  Propose compact representation called range translation
  Range Translation: arbitrarily large contiguous mapping
  Effectively cache, manage and facilitate range translations
  Retain flexibility of 4KB paging
Result: Reduces overheads of virtual memory to less than 1%
Slide38
Questions?