HPMMAP: Lightweight Memory Management for Commodity Operating Systems

Brian Kocoloski and Jack Lange
University of Pittsburgh
Lightweight Experience in a Consolidated Environment

- HPC applications need lightweight resource management
  - Tightly synchronized, massively parallel
  - Inconsistency is a huge problem
- Problem: modern HPC environments require commodity OS/R features
  - Cloud computing / consolidation with general-purpose workloads
  - In-situ visualization
- This talk: how can we provide lightweight memory management in a fullweight environment?
Lightweight vs. Commodity Resource Management

- Commodity management has a fundamentally different focus than lightweight management
  - Dynamic, fine-grained resource allocation
  - Resource utilization, fairness, security
  - Degrades applications fairly in response to heavy loads
- Example: memory management
  - Demand paging
  - Serialized, coarse-grained address space operations
- Serious HPC implications
  - Resource efficiency vs. resource isolation
  - System overhead
  - Cannot fully support HPC features (e.g., large pages)
HPMMAP: High Performance Memory Mapping and Allocation Platform

- Independent and isolated memory management layers
- Linux kernel module: NO kernel modifications
- System call interception: NO application modifications
- Lightweight memory management: NO page faults
- Up to 50% performance improvement for HPC apps

[Architecture diagram: within a node, a commodity application uses the standard system call interface and the Linux memory manager (Linux memory), while an HPC application uses a modified system call interface backed by HPMMAP (HPMMAP memory), all on the same Linux kernel.]
Talk Roadmap

- Detailed analysis of Linux memory management
  - Focus on the demand paging architecture
  - Issues with prominent large page solutions
- Design and implementation of HPMMAP
  - No kernel or application modification
- Single-node evaluation illustrating HPMMAP performance benefits
- Multi-node evaluation illustrating scalability
Linux Memory Management

- Default Linux: on-demand paging (illustrated below)
  - Primary goal: optimize memory utilization
  - Reduce overhead of common behavior (fork/exec)
- Optimized Linux: large pages
  - Transparent Huge Pages
  - HugeTLBfs
  - Both integrated with the demand paging architecture
- Our work: determine the implications of these features for HPC
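To make the default demand paging behavior concrete, here is a minimal user-space sketch (not from the paper) contrasting a lazily faulted anonymous mapping with one prefaulted via MAP_POPULATE:

```c
/* Demand paging vs. prefaulting with mmap(2). Minimal sketch. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION (64UL * 1024 * 1024)  /* 64 MB */

int main(void)
{
    /* Default anonymous mapping: mmap() only installs the VMA.
     * Each 4 KB page is allocated by the page fault handler on
     * first touch (demand paging). */
    char *lazy = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* MAP_POPULATE asks the kernel to prefault the whole region
     * up front, moving the cost out of the compute phase. */
    char *eager = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    if (lazy == MAP_FAILED || eager == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(lazy, 1, REGION);   /* ~16K minor faults taken here */
    memset(eager, 1, REGION);  /* pages already present */

    munmap(lazy, REGION);
    munmap(eager, REGION);
    return 0;
}
```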
Transparent Huge Pages

- Transparent Huge Pages (THP)
  1. Page fault handler uses large pages when possible
  2. khugepaged address space merging
- khugepaged
  - Background kernel thread
  - Periodically allocates and "merges" a large page into the address space of any process requesting THP support (opt-in sketched below)
  - Requires the global page table lock
  - Driven by OS heuristics, with no knowledge of the application workload
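For reference, when THP runs in "madvise" mode a process requests large page backing per region through madvise(2). A minimal sketch of that interface (not code from the paper):

```c
/* Opting a region into THP with madvise(2). Minimal sketch;
 * assumes a kernel with THP support. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION (512UL * 1024 * 1024)

int main(void)
{
    char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Mark the region as a huge page candidate: the fault handler
     * may back it with 2 MB pages, and khugepaged may later collapse
     * remaining 4 KB pages into huge pages. Those collapses are the
     * "merge" operations measured on the next slides. */
    if (madvise(buf, REGION, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    memset(buf, 1, REGION);
    munmap(buf, REGION);
    return 0;
}
```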
Transparent Huge Pages

Ran the miniMD benchmark from Mantevo twice:
- As the only application
- With a co-located parallel kernel build

"Merge": small page faults stalled by a THP merge operation

- Large page overhead increased by nearly 100% with added load
- Total number of merges increased by 50% with added load
- Merge overhead increased by over 300% with added load
- Merge standard deviation increased by nearly 800% with added load
Transparent Huge Pages

[Figure: page fault cycles (0 to 5M) over application runtime, with no competition (349 s) and with a parallel kernel build (368 s); large page faults shown in green, small faults delayed by merges in blue.]

- Generally periodic, but not synchronized
- Variability increases dramatically under load
HugeTLBfs

- HugeTLBfs
  - RAM-based filesystem supporting large page allocation
  - Requires pre-allocated memory pools reserved by the system administrator (see the sketch below)
  - Access generally managed through libhugetlbfs
- Limitations
  - Cannot back process stacks
  - Configuration challenges
  - Highly susceptible to overhead from system load
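As an illustration of the pool-based model (not code from the paper), the sketch below maps anonymous memory from the hugetlb pool via MAP_HUGETLB; it assumes the administrator has already reserved pages, e.g. with `echo 128 > /proc/sys/vm/nr_hugepages`:

```c
/* Allocating from the pre-reserved huge page pool. Minimal sketch. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION (256UL * 1024 * 1024)  /* multiple of the huge page size */

int main(void)
{
    /* MAP_HUGETLB draws directly from the reserved pool. Unlike THP
     * there is no transparent fallback: if the pool is too small or
     * unconfigured, the mapping simply fails. */
    char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }

    memset(buf, 1, REGION);
    munmap(buf, REGION);
    return 0;
}
```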
HugeTLBfs

Ran the miniMD benchmark from Mantevo twice:
- As the only application
- With a co-located parallel kernel build

Results:
- Large page fault performance generally unaffected by added load
  - Demonstrates the effectiveness of pre-reserved memory pools
- Small page fault overhead increases by nearly 475,000 cycles on average
- Performance considerably more variable
  - Standard deviation roughly 30x higher than the average!
HugeTLBfs

[Figure: page fault cycles over application runtime, with no competition vs. with a parallel kernel build; HPCCG (0 to 10M cycles; 51 s vs. 60 s), CoMD (0 to 3M; 248 s vs. 281 s), miniFE (0 to 3M; 54 s vs. 59 s).]

- Overhead of small page faults increases substantially
- Ample memory available via reserved memory pools, but inaccessible for small faults
- Illustrates configuration challenges
Linux Memory Management: HPC Implications

Conclusions of the Linux memory management analysis:
- Memory isolation is insufficient for HPC when the system is under significant load
- Large page solutions are not fully HPC-compatible
- Demand paging is not an HPC feature
  - Poses problems when adopting HPC features like large pages
  - Both Linux large page solutions are impacted, in different ways
- Solution: HPMMAP
HPMMAP: High Performance Memory Mapping and Allocation Platform

- Independent and isolated memory management layers
- Lightweight memory management
  - Large pages are the default memory mapping unit
  - 0 page faults during application execution

[Architecture diagram repeated from the earlier HPMMAP slide: commodity application on the standard system call interface and Linux memory manager; HPC application on the modified system call interface backed by HPMMAP.]
Kitten Lightweight Kernel

- Lightweight kernel from Sandia National Labs
  - Mostly Linux-compatible user environment
  - Open source, freely available
- Kitten memory management
  - Moves memory management as close to the application as possible
  - Virtual address regions (heap, stack, etc.) statically sized and mapped at process creation
  - Large pages are the default unit of memory mapping
  - No page fault handling
- https://software.sandia.gov/trac/kitten
HPMMAP Overview

- Lightweight versions of the memory management system calls (brk, mmap, etc.), intercepted as sketched below
- "On-request" memory management
  - 0 page faults during application execution
- Memory offlining
  - Management of large (128 MB+) contiguous regions
  - Utilizes the vast unused address space on 64-bit systems
  - Linux has no knowledge of HPMMAP'd regions
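The slides do not show how the interception is wired up. Below is a heavily simplified, hypothetical sketch of the classic syscall-table hooking technique a module like this could use, assuming an older x86_64 kernel where kallsyms_lookup_name() is exported to modules and CR0.WP can be toggled; is_hpmmap_process() and hpmmap_brk() are invented stand-ins for HPMMAP's process registry and allocator, stubbed out here so the sketch compiles:

```c
/* Hypothetical sketch of syscall interception; NOT the HPMMAP source. */
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/unistd.h>
#include <linux/sched.h>
#include <asm/processor-flags.h>
#include <asm/special_insns.h>

static unsigned long **sys_call_table;
static asmlinkage long (*linux_brk)(unsigned long brk);

/* Invented stand-ins: the real module tracks registered HPC
 * processes and serves them from isolated, large-page regions. */
static bool is_hpmmap_process(struct task_struct *t) { return false; }
static long hpmmap_brk(struct task_struct *t, unsigned long brk) { return -ENOMEM; }

/* Assumes an older kernel where CR0.WP is not pinned. */
static void wp_disable(void) { write_cr0(read_cr0() & ~X86_CR0_WP); }
static void wp_enable(void)  { write_cr0(read_cr0() |  X86_CR0_WP); }

/* HPC processes get HPMMAP's lightweight brk(); everything else
 * falls through to the unmodified Linux path. */
static asmlinkage long hooked_brk(unsigned long brk)
{
    if (is_hpmmap_process(current))
        return hpmmap_brk(current, brk);
    return linux_brk(brk);
}

static int __init hook_init(void)
{
    sys_call_table = (unsigned long **)kallsyms_lookup_name("sys_call_table");
    if (!sys_call_table)
        return -ENOENT;
    wp_disable();
    linux_brk = (void *)sys_call_table[__NR_brk];
    sys_call_table[__NR_brk] = (unsigned long *)hooked_brk;
    wp_enable();
    return 0;
}

static void __exit hook_exit(void)
{
    wp_disable();
    sys_call_table[__NR_brk] = (unsigned long *)linux_brk;
    wp_enable();
}

module_init(hook_init);
module_exit(hook_exit);
MODULE_LICENSE("GPL");
```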
Evaluation Methodology

- Consolidated workloads
  - Evaluate HPC performance with co-located commodity workloads (parallel kernel builds)
  - Evaluate THP, HugeTLBfs, and HPMMAP configurations
  - Benchmarks selected from the Mantevo and Sequoia benchmark suites
- Goal: limit hardware contention
  - Apply CPU and memory pinning for each workload where possible (see the sketch below)
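As an illustration of such pinning (not the paper's actual harness), this sketch confines a process to one 6-core socket with sched_setaffinity(2) and binds its allocations to the local NUMA node with set_mempolicy(2); build with -lnuma:

```c
/* CPU and NUMA memory pinning for a workload. Minimal sketch;
 * core/node counts match the AMD Opteron test node described below. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <numaif.h>   /* set_mempolicy(); link with -lnuma */

int main(void)
{
    /* Pin this process to cores 0-5 (one 6-core socket). */
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    for (int c = 0; c < 6; c++)
        CPU_SET(c, &cpus);
    if (sched_setaffinity(0, sizeof(cpus), &cpus) != 0)
        perror("sched_setaffinity");

    /* Bind future allocations to NUMA node 0 so a co-located
     * workload pinned elsewhere cannot consume this socket's memory. */
    unsigned long nodemask = 1UL << 0;
    if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8) != 0)
        perror("set_mempolicy");

    /* ... exec the benchmark or kernel build here ... */
    return 0;
}
```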
Single Node Evaluation

- Benchmarks
  - Mantevo (HPCCG, CoMD, miniMD, miniFE)
  - Run in weak-scaling mode
- AMD Opteron node
  - Two 6-core NUMA sockets
  - 8 GB RAM per socket
- Workloads
  - Commodity profile A: 1 co-located kernel build
  - Commodity profile B: 2 co-located kernel builds
  - Up to 4 cores over-committed
Single Node Evaluation: Commodity Profile A

[Figure: runtime comparisons for HPCCG, CoMD, miniMD, and miniFE.]

- Average 8-core improvement across applications of 15% over THP, 9% over HugeTLBfs
- THP becomes increasingly variable with scale
Single Node Evaluation: Commodity Profile B

[Figure: runtime comparisons for HPCCG, CoMD, miniMD, and miniFE.]

- Average 8-core improvement across applications of 16% over THP, 36% over HugeTLBfs
- HugeTLBfs degrades significantly in all cases at 8 cores: memory pressure due to the weak-scaling configuration
Multi-Node Scaling Evaluation

- Benchmarks
  - Mantevo (HPCCG, miniFE) and Sequoia (LAMMPS)
  - Run in weak-scaling mode
- Eight-node Sandia test cluster
  - Two 4-core NUMA sockets (Intel Xeon cores)
  - 12 GB RAM per socket
  - Gigabit Ethernet
- Workloads
  - Commodity profile C: 2 co-located kernel builds per node
  - Up to 4 cores over-committed
Multi-Node Evaluation: Commodity Profile C

[Figure: scaling results for HPCCG, miniFE, and LAMMPS.]

- 32-rank improvement: HPCCG 11%, miniFE 6%, LAMMPS 4%
- HPMMAP shows very few outliers
- miniFE: impact of single-node variability on scalability (3% improvement on a single node)
- LAMMPS also beginning to show divergence
Future Work

- Memory management is not the only barrier to HPC deployment in consolidated environments
  - Other system software overheads
  - OS noise
- Idea: fully independent system software stacks
  - Lightweight virtualization (Palacios VMM)
  - Lightweight "co-kernel"
  - We've built a system that can launch Kitten on a subset of offlined CPU cores, memory blocks, and PCI devices (hotplug interface sketched below)
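For context, stock Linux already exposes the CPU and memory hotplug knobs such a launcher can drive through sysfs. A minimal, hypothetical sketch (cpu5 and memory32 are arbitrary example IDs; both writes require root and a hotplug-capable kernel):

```c
/* Offlining a CPU core and a memory block via the standard Linux
 * hotplug sysfs interface. Minimal sketch. */
#include <stdio.h>

static int sysfs_write(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(val, f);
    return fclose(f);
}

int main(void)
{
    /* Remove CPU 5 from the Linux scheduler... */
    sysfs_write("/sys/devices/system/cpu/cpu5/online", "0");

    /* ...and offline one memory block. A co-kernel such as Kitten
     * could then be booted on resources Linux no longer manages. */
    sysfs_write("/sys/devices/system/memory/memory32/state", "offline");
    return 0;
}
```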
Conclusion

- Commodity memory management strategies cannot isolate HPC workloads in consolidated environments
  - Page fault performance illustrates the effects of contention
  - Large page solutions are not fully HPC-compatible
- HPMMAP
  - Independent and isolated lightweight memory manager
  - Requires no kernel or application modification
  - HPC applications using HPMMAP achieve up to 50% better performance
Thank You

Brian Kocoloski
briankoco@cs.pitt.edu
http://people.cs.pitt.edu/~briankoco

Kitten
https://software.sandia.gov/trac/kitten