The Effect of Multi-core on HPC Applications in Virtualized Systems

Presentation Transcript

Slide1

The Effect of Multi-core on HPC Applications in Virtualized Systems

Jaeung Han¹, Jeongseob Ahn¹, Changdae Kim¹, Youngjin Kwon¹, Young-ri Choi², and Jaehyuk Huh¹

¹ KAIST (Korea Advanced Institute of Science and Technology)

² KISTI (Korea Institute of Science and Technology Information)

Slide2

Outline

Virtualization for HPC

Virtualization on Multi-core

Virtualization for HPC on Multi-core

Methodology

PARSEC – shared memory model

NPB – MPI model

Conclusion

2

Slide3

Outline

Virtualization for HPC

Virtualization on Multi-core

Virtualization for HPC on Multi-core

Methodology

PARSEC – shared memory model

NPB – MPI model

Conclusion

3

Slide4

Benefits of Virtualization

4

[Figure: three VMs running on a Virtual Machine Monitor over shared hardware]

Improve system utilization by consolidation

Slide5

Benefits of Virtualization

5

[Figure: Windows, Linux, and Solaris VMs on a Virtual Machine Monitor over shared hardware]

Improve system utilization by consolidation

Support for multiple types of OSes on a system

Slide6

Benefits of Virtualization

6

[Figure: Windows, Linux, and Solaris VMs on a Virtual Machine Monitor over shared hardware]

Improve system utilization by consolidation

Support for multiple types of OSes on a system

Fault isolation

Slide7

Benefits of Virtualization

7

[Figure: Windows, Linux, and Solaris VMs on a Virtual Machine Monitor, with a second virtualized machine alongside]

Improve system utilization by consolidation

Support for multiple types of OSes on a system

Fault isolation

Flexible resource management

Slide8

Benefits of Virtualization

8

Improve system utilization by consolidation

Support for multiple types of OSes on a system

Fault isolation

Flexible resource management

[Figure: Windows, Linux, and Solaris VMs on a Virtual Machine Monitor, with a second virtualized machine alongside]

Slide9

Benefits of Virtualization

9

Improve system utilization by consolidation

Support for multiple types of OSes on a system

Fault isolation

Flexible resource management

Cloud computing

[Figure: Windows, Linux, and Solaris VMs consolidated onto cloud infrastructure (hardware plus Virtual Machine Monitor)]

Slide10

Virtualization for HPC

Benefits of virtualization

Improve system utilization by consolidation

Support for multiple types of OSes on a system

Fault isolation

Flexible resource management

Cloud computing

HPC is performance-sensitive → resource-sensitive

Virtualization can help HPC workloads

10

Slide11

Outline

Virtualization for HPC

Virtualization on Multi-core

Virtualization for HPC on Multi-core

Methodology

PARSEC – shared memory model

NPB – MPI model

Conclusion

11

Slide12

Virtualization on Multi-core

12

More VMs on a physical machine

More complex memory hierarchy (NUCA, NUMA)

[Figure: several VMs time-sharing each core, cores sharing L3 caches, and two NUMA memory nodes]

Slide13

Challenges

VM management cost

Semantic gaps: vCPU scheduling, NUMA

13

[Figure: a Virtual Machine Monitor hosting many VMs on multi-core NUMA hardware (scheduling, memory, communication, I/O multiplexing…), contrasted with a native OS managing cores, caches, and memory directly]

Slide14

Outline

Virtualization for HPC

Virtualization on Multi-core

Virtualization for HPC on Multi-core

Methodology

PARSEC – shared memory model

NPB – MPI model

Conclusion

14

Slide15

Virtualization for HPC on Multi-core

Virtualization may help HPC

Virtualization on multi-core may have some overheads

For servers, improving system utilization is a key factor

For HPC, performance is a key factor.

15

How much overheads are there?

Where do they come from?

Slide16

Outline

Virtualization for HPC

Virtualization on Multi-core

Virtualization for HPC on Multi-core

Methodology

PARSEC – shared memory model

NPB – MPI model

Conclusion

16

Slide17

Machines

Single Socket System

12-core AMD processor

Uniform memory access latency

Two 6MB L3 caches, each shared by 6 cores

Dual Socket System

2x 4-core Intel processors

Non-uniform memory access latency

Two 8MB L3 caches, each shared by 4 cores

17

[Figure: single socket, 12-core CPU: twelve cores with private L2 caches, two shared L3 caches, one memory node]

[Figure: dual socket, 2x 4-core CPUs: each socket has four cores with private L2 caches and a shared L3, plus its own memory node]
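The slides do not say how the topology of these machines was verified; as a minimal sketch (assuming a Linux host and the standard sysfs layout), the NUMA-node membership and last-level-cache sharing can be read directly:

```python
# Minimal sketch, assuming a Linux host: read which CPUs belong to each NUMA
# node and which CPUs share cpu0's last-level cache (index3 is usually the L3).
import glob
import pathlib

def read(path: str) -> str:
    return pathlib.Path(path).read_text().strip()

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    print(f"{pathlib.Path(node).name}: cpus {read(node + '/cpulist')}")

llc = "/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list"
print("cpu0 shares its L3 with cpus", read(llc))
```

The output should line up with the uniform versus non-uniform memory access descriptions above.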

Slide18

Workloads

PARSEC

Shared memory model

Input: native

On one machine

Single and Dual socket

Fix: One VM

Vary: 1, 4, 8 vCPUs

NAS Parallel Benchmark

MPI model

Input: class C

On two machines (dual socket)

1Gb Ethernet switch

Fix: 16 vCPUs

Vary: 2 ~ 16 VMs

18

[Figure: recap of the two setups: a native OS on multi-core NUMA hardware versus many VMs on a Virtual Machine Monitor, annotated with 'Semantic gaps' and 'VM management cost']
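The exact launch commands are not in the slides; the sketch below shows how PARSEC and NPB-MPI runs of this shape are commonly started. The package name, binary path, and hostfile name are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: typical launch commands for the two workload
# classes. Package/binary/hostfile names are assumptions, not from the slides.
import subprocess

# PARSEC, shared memory model, "native" input, 8 threads inside one VM.
subprocess.run(["parsecmgmt", "-a", "run", "-p", "streamcluster",
                "-i", "native", "-n", "8"], check=True)

# NPB (MPI model), class C input, 16 ranks spread over the VMs in a hostfile.
subprocess.run(["mpirun", "-np", "16", "--hostfile", "hosts", "bin/cg.C.16"],
               check=True)
```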

Slide19

Outline

Virtualization for HPC

Virtualization on Multi-core

Virtualization for HPC on Multi-core

Methodology

PARSEC – shared memory model

NPB – MPI model

Conclusion

19

Slide20

PARSEC – Single Socket

Single socket

No NUMA effect

Very low virtualization overheads

20

2~4 %

Execution times normalized to native runs
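All the result charts report execution time normalized to a native run; as a tiny worked example (the times below are invented, not measured values), the metric is just the ratio of virtualized to native runtime.

```python
# Invented numbers, for illustrating the metric only.
t_native = 100.0       # seconds, benchmark on native Linux
t_virtualized = 103.0  # seconds, same benchmark inside a VM

normalized = t_virtualized / t_native          # 1.03
overhead_pct = (normalized - 1.0) * 100.0      # 3.0 -> within the 2~4% range
print(f"normalized = {normalized:.2f}, overhead = {overhead_pct:.1f}%")
```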

Slide21

PARSEC – Single Socket

Single socket + pin each vCPU to a pCPU

Reduce semantic gaps by preventing vCPU migration

vCPU migration has negligible effect

21

Execution times normalized to native runs

Similar to unpinned
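The pinning commands are not shown; assuming the standard Xen toolstack (the experiments use Xen, see the later slides), a one-to-one vCPU-to-pCPU pinning for an 8-vCPU guest could be scripted roughly as below. The domain name is hypothetical.

```python
# Rough sketch, assuming Xen's "xl" toolstack: pin vCPU i of the guest to
# physical CPU i. "hpc-vm" is a hypothetical domain name.
import subprocess

DOMAIN = "hpc-vm"
NUM_VCPUS = 8

for vcpu in range(NUM_VCPUS):
    # xl vcpu-pin <domain> <vcpu-id> <physical-cpu>
    subprocess.run(["xl", "vcpu-pin", DOMAIN, str(vcpu), str(vcpu)], check=True)
```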

Slide22

PARSEC – Dual Socket

Dual socket, unpinned vCPUs

NUMA effect → semantic gap

Significant increase of overheads

22

16~37 %

Execution times normalized to native runs

Slide23

PARSEC – Dual Socket

Dual socket, pinned vCPUs

May also reduce the NUMA effect

Reduced overheads with 1 and 4 vCPUs

23

Execution times normalized to native runs

Slide24

XEN and NUMA machine

Memory allocation policy

Allocate up to 4GB chunk on one socket

Scheduling policy

Pinning to the allocated socket, nothing more

Pinning 1 ~ 4 vCPUs on the socket where the memory is allocated is possible

Impossible with 8 vCPUs

24

[Figure: four VMs (VM0 to VM3) on a dual socket machine with two memory nodes and per-socket shared caches]

Slide25

Mitigating NUMA Effects

Range pinning

Pin the vCPUs of a VM on a socket

Works only if # of vCPUs ≤ # of cores on a socket

Range-pinned (best): memory of the VM in the same socket

Range-pinned (worst): memory of the VM in the other socket

NUMA-first scheduler (sketched below)

If there is an idle core in the socket where the memory is allocated, pick it

If not, pick any core in the machine

Not all vCPUs are active all the time (sync. or I/O)

25
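The authors' actual VMM scheduler changes are not listed here; the selection rule from this slide can be sketched as follows, with sockets, cores, and idle state reduced to plain Python data (a simplified model, not Xen code).

```python
# Simplified model of the NUMA-first rule: prefer an idle core on the socket
# that holds the VM's memory, otherwise fall back to any idle core.
from typing import Optional

def numa_first_pick(idle_cores: set[int],
                    cores_by_socket: dict[int, list[int]],
                    vm_memory_socket: int) -> Optional[int]:
    # 1. An idle core on the VM's memory socket, if there is one.
    for core in cores_by_socket[vm_memory_socket]:
        if core in idle_cores:
            return core
    # 2. Otherwise any idle core in the machine (accepting remote memory).
    if idle_cores:
        return min(idle_cores)
    return None  # no idle core at all; the vCPU has to wait

# Dual socket, 4 cores per socket; the VM's memory lives on socket 0.
cores = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
print(numa_first_pick({2, 5}, cores, vm_memory_socket=0))  # -> 2 (local)
print(numa_first_pick({5, 6}, cores, vm_memory_socket=0))  # -> 5 (remote)
```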

Slide26

Range Pinning

For the 4 vCPUs case

Range-pinned (best) ≈ Pinned

26

Execution times normalized to native runs

Slide27

NUMA-first Scheduler

For the 8 vCPUs case

Significant improvement by the NUMA-first scheduler

27

Execution times normalized to native runs

Slide28

Outline

Virtualization for HPC

Virtualization on Multi-core

Virtualization for HPC on Multi-core

Methodology

PARSEC – shared memory model

NPB – MPI model

Conclusion

28

Slide29

VM Granularity for the MPI model

Fine-grained VMs

Few processes in a VM

Small VM: few vCPUs, small memory

Fault isolation among processes in different VMs

Many VMs on a machine

MPI communications mostly through the VMM

Coarse-grained VMs

Many processes in a VM

Large VM: many vCPUs, large memory

Single failure point for processes in a VM

Few VMs on a machine

MPI communications mostly within a VM

29

[Figure: fine-grained setup with many small VMs per VMM and machine versus coarse-grained setup with a few large VMs per VMM and machine]
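The slides do not show how the 16 MPI ranks were mapped onto VMs; as an illustrative sketch (the VM host names and the Open MPI "slots=" hostfile format are assumptions, not from the paper), the two granularities differ only in how the ranks are spread.

```python
# Illustrative only: build an Open MPI-style hostfile for a given granularity.
# VM host names are invented; 16 MPI ranks in total either way.
def hostfile(num_vms: int, total_ranks: int = 16) -> str:
    ranks_per_vm = total_ranks // num_vms
    return "\n".join(f"vm{i:02d} slots={ranks_per_vm}" for i in range(num_vms))

print(hostfile(2))   # coarse-grained: 2 VMs x 8 ranks, traffic mostly intra-VM
print(hostfile(16))  # fine-grained: 16 VMs x 1 rank, traffic through the VMM
```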

Slide30

NPB – VM Granularity

The work to do is the same for every granularity

2 VMs: each VM has 8 vCPUs, 8 MPI processes

16 VMs: each VM has 1 vCPU, 1 MPI process

30

Execution times normalized to native runs

11~54 %

Slide31

NPB – VM Granularity

Fine-grained VMs → significant overheads (avg. 54%)

MPI communications mostly through the VMM

Worst in CG, which has a high communication ratio

Small memory per VM

VM management costs of the VMM

Coarse-grained VMs → much lower overheads (avg. 11%)

Still dual socket, but lower overheads than the shared memory model → the bottleneck has moved to communication

MPI communication largely within a VM

Large memory per VM

31

Slide32

Outline

Virtualization for HPC

Virtualization on Multi-core

Virtualization for HPC on Multi-core

Methodology

PARSEC – shared memory model

NPB – MPI model

Conclusion

32

Slide33

Conclusion

Questions on virtualization for HPC on multi-core systems

How much overheads are there?

Where do they come from?

For shared memory model

Without NUMA → little overhead

With NUMA → large overheads from semantic gaps

For MPI model

Less NUMA effect → communication is important

Fine-grained VMs have large overheads

Communication mostly through VMM

Small memory / VM management cost

Future Work

NUMA-aware VMM scheduler

Optimize communication among VMs in a machine

33

Slide34

34

Thank you!

Slide35

35

Backup slides

Slide36

PARSEC CPU Usage

Environment: native Linux, with only 8 cores enabled (8-thread mode)

Sample the CPU usage every second, then average the samples

For all workloads, less than 800% (800% = fully parallel) → NUMA-first can work

36
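The measurement script is not given; a minimal sketch of "sample every second, then average", assuming a Linux host and the third-party psutil package, could look like this.

```python
# Minimal sketch, assuming Linux and psutil: sample total CPU usage once per
# second and average it. With 8 enabled cores, 800% means fully parallel.
import psutil

samples = []
for _ in range(60):  # e.g. sample for 60 seconds while the workload runs
    per_cpu = psutil.cpu_percent(interval=1, percpu=True)
    samples.append(sum(per_cpu))  # 8 cores -> up to 800%
average = sum(samples) / len(samples)
print(f"average CPU usage: {average:.0f}% "
      f"(below 800% leaves idle cores for NUMA-first)")
```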