HPMMAP: Lightweight Memory Management for Commodity Operating Systems

Brian Kocoloski and Jack Lange
University of Pittsburgh
Lightweight Experience in a Consolidated Environment

- HPC applications need lightweight resource management
  - Tightly synchronized, massively parallel
  - Inconsistency is a huge problem
- Problem: modern HPC environments require commodity OS/R features
  - Cloud computing / consolidation with general-purpose workloads
  - In-situ visualization
- This talk: how can we provide lightweight memory management in a fullweight environment?
Lightweight vs. Commodity Resource Management

- Commodity management has a fundamentally different focus than lightweight management
  - Dynamic, fine-grained resource allocation
  - Resource utilization, fairness, security
  - Degrades applications fairly in response to heavy loads
- Example: memory management
  - Demand paging
  - Serialized, coarse-grained address space operations
- Serious HPC implications
  - Resource efficiency vs. resource isolation
  - System overhead
  - Cannot fully support HPC features (e.g., large pages)
HPMMAP: High Performance Memory Mapping and Allocation Platform

- Independent and isolated memory management layers
- Linux kernel module: NO kernel modifications
- System call interception: NO application modifications
- Lightweight memory management: NO page faults
- Up to 50% performance improvement for HPC apps

[Architecture diagram: within a node, a commodity application uses the standard system call interface and the Linux memory manager (Linux memory), while an HPC application uses a modified system call interface backed by HPMMAP (HPMMAP memory), all on the same Linux kernel.]
Talk Roadmap

- Detailed analysis of Linux memory management
  - Focus on the demand paging architecture
  - Issues with prominent large page solutions
- Design and implementation of HPMMAP
  - No kernel or application modification
- Single-node evaluation illustrating HPMMAP performance benefits
- Multi-node evaluation illustrating scalability
Linux Memory Management

- Default Linux: on-demand paging (illustrated below)
  - Primary goal: optimize memory utilization
  - Reduce overhead of common behavior (fork/exec)
- Optimized Linux: large pages
  - Transparent Huge Pages
  - HugeTLBfs
  - Both integrated with the demand paging architecture
- Our work: determine the implications of these features for HPC
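To make the default demand paging behavior concrete, here is a minimal user-space sketch (not from the paper) contrasting a lazily faulted anonymous mapping with one prefaulted via MAP_POPULATE:

```c
/* Demand paging vs. prefaulting with mmap(2). Minimal sketch. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION (64UL * 1024 * 1024)  /* 64 MB */

int main(void)
{
    /* Default anonymous mapping: mmap() only installs the VMA.
     * Each 4 KB page is allocated by the page fault handler on
     * first touch (demand paging). */
    char *lazy = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* MAP_POPULATE asks the kernel to prefault the whole region
     * up front, moving the cost out of the compute phase. */
    char *eager = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

    if (lazy == MAP_FAILED || eager == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(lazy, 1, REGION);   /* ~16K minor faults taken here */
    memset(eager, 1, REGION);  /* pages already present */

    munmap(lazy, REGION);
    munmap(eager, REGION);
    return 0;
}
```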
Transparent Huge Pages

- Transparent Huge Pages (THP)
  1. Page fault handler uses large pages when possible
  2. khugepaged address space merging
- khugepaged
  - Background kernel thread
  - Periodically allocates and "merges" a large page into the address space of any process requesting THP support (opt-in sketched below)
  - Requires the global page table lock
  - Driven by OS heuristics, with no knowledge of the application workload
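For reference, when THP runs in "madvise" mode a process requests large page backing per region through madvise(2). A minimal sketch of that interface (not code from the paper):

```c
/* Opting a region into THP with madvise(2). Minimal sketch;
 * assumes a kernel with THP support. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION (512UL * 1024 * 1024)

int main(void)
{
    char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Mark the region as a huge page candidate: the fault handler
     * may back it with 2 MB pages, and khugepaged may later collapse
     * remaining 4 KB pages into huge pages. Those collapses are the
     * "merge" operations measured on the next slides. */
    if (madvise(buf, REGION, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    memset(buf, 1, REGION);
    munmap(buf, REGION);
    return 0;
}
```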
Transparent Huge Pages

Ran the miniMD benchmark from Mantevo twice:
- As the only application
- With a co-located parallel kernel build

"Merge": small page faults stalled by a THP merge operation

- Large page overhead increased by nearly 100% with added load
- Total number of merges increased by 50% with added load
- Merge overhead increased by over 300% with added load
- Merge standard deviation increased by nearly 800% with added load
Transparent Huge Pages

[Figure: page fault cycles (0 to 5M) over application runtime, with no competition (349 s) and with a parallel kernel build (368 s); large page faults shown in green, small faults delayed by merges in blue.]

- Generally periodic, but not synchronized
- Variability increases dramatically under load
HugeTLBfs

- HugeTLBfs
  - RAM-based filesystem supporting large page allocation
  - Requires pre-allocated memory pools reserved by the system administrator (see the sketch below)
  - Access generally managed through libhugetlbfs
- Limitations
  - Cannot back process stacks
  - Configuration challenges
  - Highly susceptible to overhead from system load
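As an illustration of the pool-based model (not code from the paper), the sketch below maps anonymous memory from the hugetlb pool via MAP_HUGETLB; it assumes the administrator has already reserved pages, e.g. with `echo 128 > /proc/sys/vm/nr_hugepages`:

```c
/* Allocating from the pre-reserved huge page pool. Minimal sketch. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION (256UL * 1024 * 1024)  /* multiple of the huge page size */

int main(void)
{
    /* MAP_HUGETLB draws directly from the reserved pool. Unlike THP
     * there is no transparent fallback: if the pool is too small or
     * unconfigured, the mapping simply fails. */
    char *buf = mmap(NULL, REGION, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }

    memset(buf, 1, REGION);
    munmap(buf, REGION);
    return 0;
}
```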
HugeTLBfs

Ran the miniMD benchmark from Mantevo twice:
- As the only application
- With a co-located parallel kernel build

Results:
- Large page fault performance generally unaffected by added load
  - Demonstrates the effectiveness of pre-reserved memory pools
- Small page fault overhead increases by nearly 475,000 cycles on average
- Performance considerably more variable
  - Standard deviation roughly 30x higher than the average!
HugeTLBfs

[Figure: page fault cycles over application runtime, with no competition vs. with a parallel kernel build; HPCCG (0 to 10M cycles; 51 s vs. 60 s), CoMD (0 to 3M; 248 s vs. 281 s), miniFE (0 to 3M; 54 s vs. 59 s).]

- Overhead of small page faults increases substantially
- Ample memory available via reserved memory pools, but inaccessible for small faults
- Illustrates configuration challenges
Linux Memory Management: HPC Implications

Conclusions of the Linux memory management analysis:
- Memory isolation is insufficient for HPC when the system is under significant load
- Large page solutions are not fully HPC-compatible
- Demand paging is not an HPC feature
  - Poses problems when adopting HPC features like large pages
  - Both Linux large page solutions are impacted, in different ways
- Solution: HPMMAP
HPMMAP: High Performance Memory Mapping and Allocation Platform

- Independent and isolated memory management layers
- Lightweight memory management
  - Large pages are the default memory mapping unit
  - 0 page faults during application execution

[Architecture diagram repeated from the earlier HPMMAP slide: commodity application on the standard system call interface and Linux memory manager; HPC application on the modified system call interface backed by HPMMAP.]
Kitten Lightweight Kernel

- Lightweight kernel from Sandia National Labs
  - Mostly Linux-compatible user environment
  - Open source, freely available
- Kitten memory management
  - Moves memory management as close to the application as possible
  - Virtual address regions (heap, stack, etc.) statically sized and mapped at process creation
  - Large pages are the default unit of memory mapping
  - No page fault handling
- https://software.sandia.gov/trac/kitten
HPMMAP Overview

- Lightweight versions of the memory management system calls (brk, mmap, etc.), intercepted as sketched below
- "On-request" memory management
  - 0 page faults during application execution
- Memory offlining
  - Management of large (128 MB+) contiguous regions
  - Utilizes the vast unused address space on 64-bit systems
  - Linux has no knowledge of HPMMAP'd regions
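The slides do not show how the interception is wired up. Below is a heavily simplified, hypothetical sketch of the classic syscall-table hooking technique a module like this could use, assuming an older x86_64 kernel where kallsyms_lookup_name() is exported to modules and CR0.WP can be toggled; is_hpmmap_process() and hpmmap_brk() are invented stand-ins for HPMMAP's process registry and allocator, stubbed out here so the sketch compiles:

```c
/* Hypothetical sketch of syscall interception; NOT the HPMMAP source. */
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/unistd.h>
#include <linux/sched.h>
#include <asm/processor-flags.h>
#include <asm/special_insns.h>

static unsigned long **sys_call_table;
static asmlinkage long (*linux_brk)(unsigned long brk);

/* Invented stand-ins: the real module tracks registered HPC
 * processes and serves them from isolated, large-page regions. */
static bool is_hpmmap_process(struct task_struct *t) { return false; }
static long hpmmap_brk(struct task_struct *t, unsigned long brk) { return -ENOMEM; }

/* Assumes an older kernel where CR0.WP is not pinned. */
static void wp_disable(void) { write_cr0(read_cr0() & ~X86_CR0_WP); }
static void wp_enable(void)  { write_cr0(read_cr0() |  X86_CR0_WP); }

/* HPC processes get HPMMAP's lightweight brk(); everything else
 * falls through to the unmodified Linux path. */
static asmlinkage long hooked_brk(unsigned long brk)
{
    if (is_hpmmap_process(current))
        return hpmmap_brk(current, brk);
    return linux_brk(brk);
}

static int __init hook_init(void)
{
    sys_call_table = (unsigned long **)kallsyms_lookup_name("sys_call_table");
    if (!sys_call_table)
        return -ENOENT;
    wp_disable();
    linux_brk = (void *)sys_call_table[__NR_brk];
    sys_call_table[__NR_brk] = (unsigned long *)hooked_brk;
    wp_enable();
    return 0;
}

static void __exit hook_exit(void)
{
    wp_disable();
    sys_call_table[__NR_brk] = (unsigned long *)linux_brk;
    wp_enable();
}

module_init(hook_init);
module_exit(hook_exit);
MODULE_LICENSE("GPL");
```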
Evaluation Methodology

- Consolidated workloads
  - Evaluate HPC performance with co-located commodity workloads (parallel kernel builds)
  - Evaluate THP, HugeTLBfs, and HPMMAP configurations
  - Benchmarks selected from the Mantevo and Sequoia benchmark suites
- Goal: limit hardware contention
  - Apply CPU and memory pinning for each workload where possible (see the sketch below)
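As an illustration of such pinning (not the paper's actual harness), this sketch confines a process to one 6-core socket with sched_setaffinity(2) and binds its allocations to the local NUMA node with set_mempolicy(2); build with -lnuma:

```c
/* CPU and NUMA memory pinning for a workload. Minimal sketch;
 * core/node counts match the AMD Opteron test node described below. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <numaif.h>   /* set_mempolicy(); link with -lnuma */

int main(void)
{
    /* Pin this process to cores 0-5 (one 6-core socket). */
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    for (int c = 0; c < 6; c++)
        CPU_SET(c, &cpus);
    if (sched_setaffinity(0, sizeof(cpus), &cpus) != 0)
        perror("sched_setaffinity");

    /* Bind future allocations to NUMA node 0 so a co-located
     * workload pinned elsewhere cannot consume this socket's memory. */
    unsigned long nodemask = 1UL << 0;
    if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8) != 0)
        perror("set_mempolicy");

    /* ... exec the benchmark or kernel build here ... */
    return 0;
}
```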
Single Node Evaluation

- Benchmarks
  - Mantevo (HPCCG, CoMD, miniMD, miniFE)
  - Run in weak-scaling mode
- AMD Opteron node
  - Two 6-core NUMA sockets
  - 8 GB RAM per socket
- Workloads
  - Commodity profile A: 1 co-located kernel build
  - Commodity profile B: 2 co-located kernel builds
  - Up to 4 cores over-committed
Single Node Evaluation: Commodity Profile A

[Figure: runtime comparisons for HPCCG, CoMD, miniMD, and miniFE.]

- Average 8-core improvement across applications of 15% over THP, 9% over HugeTLBfs
- THP becomes increasingly variable with scale
Single Node Evaluation: Commodity Profile B

[Figure: runtime comparisons for HPCCG, CoMD, miniMD, and miniFE.]

- Average 8-core improvement across applications of 16% over THP, 36% over HugeTLBfs
- HugeTLBfs degrades significantly in all cases at 8 cores: memory pressure due to the weak-scaling configuration
Multi-Node Scaling Evaluation

- Benchmarks
  - Mantevo (HPCCG, miniFE) and Sequoia (LAMMPS)
  - Run in weak-scaling mode
- Eight-node Sandia test cluster
  - Two 4-core NUMA sockets (Intel Xeon cores)
  - 12 GB RAM per socket
  - Gigabit Ethernet
- Workloads
  - Commodity profile C: 2 co-located kernel builds per node
  - Up to 4 cores over-committed
Multi-Node Evaluation: Commodity Profile C

[Figure: scaling results for HPCCG, miniFE, and LAMMPS.]

- 32-rank improvement: HPCCG 11%, miniFE 6%, LAMMPS 4%
- HPMMAP shows very few outliers
- miniFE: impact of single-node variability on scalability (3% improvement on a single node)
- LAMMPS also beginning to show divergence
Future Work

- Memory management is not the only barrier to HPC deployment in consolidated environments
  - Other system software overheads
  - OS noise
- Idea: fully independent system software stacks
  - Lightweight virtualization (Palacios VMM)
  - Lightweight "co-kernel"
  - We've built a system that can launch Kitten on a subset of offlined CPU cores, memory blocks, and PCI devices (hotplug interface sketched below)
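For context, stock Linux already exposes the CPU and memory hotplug knobs such a launcher can drive through sysfs. A minimal, hypothetical sketch (cpu5 and memory32 are arbitrary example IDs; both writes require root and a hotplug-capable kernel):

```c
/* Offlining a CPU core and a memory block via the standard Linux
 * hotplug sysfs interface. Minimal sketch. */
#include <stdio.h>

static int sysfs_write(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(val, f);
    return fclose(f);
}

int main(void)
{
    /* Remove CPU 5 from the Linux scheduler... */
    sysfs_write("/sys/devices/system/cpu/cpu5/online", "0");

    /* ...and offline one memory block. A co-kernel such as Kitten
     * could then be booted on resources Linux no longer manages. */
    sysfs_write("/sys/devices/system/memory/memory32/state", "offline");
    return 0;
}
```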
Conclusion

- Commodity memory management strategies cannot isolate HPC workloads in consolidated environments
  - Page fault performance illustrates the effects of contention
  - Large page solutions are not fully HPC-compatible
- HPMMAP
  - Independent and isolated lightweight memory manager
  - Requires no kernel or application modification
  - HPC applications using HPMMAP achieve up to 50% better performance
Thank You

Brian Kocoloski
briankoco@cs.pitt.edu
http://people.cs.pitt.edu/~briankoco

Kitten
https://software.sandia.gov/trac/kitten