Jia He & Penny Zheng

Uploaded On 2020-10-22

Presentation Transcript

Slide1

Jia He & Penny Zheng

4/30/2019

Kata Containers on Arm: Let's talk about our progress!

Slide2

Agenda

Kata Containers Design

Hardware Requirements

NVDIMM/DAX on AArch64

Memory Hot-add on AArch64

Slide3

Kata Containers Design

Slide4

Hardware Requirements

AArch64 bare metal

kata-runtime kata-check: test if the system can run Kata Containers

Host kernel version: v5.0 (recommended)

kvm: arm64: Dynamic & >40 bit IPA support

IPA Limit: ioctl request KVM_ARM_GET_MAX_VM_PHYS_SHIFT

KVM_CREATE_VM: "type" field

Merged in v4.20, but the nearest stable version is v5.0

$ kata-runtime kata-check
INFO[0000] Unable to know if the system is running inside a VM source=virtcontainers
INFO[0000] kernel property found arch=arm64 description="Kernel-based Virtual Machine" name=kvm pid=132472 source=runtime type=module
INFO[0000] kernel property found arch=arm64 description="Host kernel accelerator for virtio" name=vhost pid=132472 source=runtime type=module
INFO[0000] kernel property found arch=arm64 description="Host kernel accelerator for virtio network" name=vhost_net pid=132472 source=runtime type=module
INFO[0000] System is capable of running Kata Containers arch=arm64 name=kata-runtime pid=132472 source=runtime
INFO[0000] device available arch=arm64 check-type=full device=/dev/kvm name=kata-runtime pid=132472 source=runtime
INFO[0000] feature available arch=arm64 check-type=full feature=create-vm name=kata-runtime pid=132472 source=runtime
INFO[0000] System can currently create Kata Containers arch=arm64 name=kata-runtime pid=132472 source=runtime
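The kata-check output above is a series of key=value log lines, so it can be inspected by a script. Below is a minimal sketch, assuming the lines keep the shape shown on this slide; the function names `parse_log_line` and `can_create_containers` are hypothetical and are not part of kata-runtime.

```python
import shlex


def parse_log_line(line):
    """Split one log line into (level, message, fields).

    Assumes the 'LEVEL[ts] message key=value ...' shape shown on the slide.
    """
    tokens = shlex.split(line)  # keeps quoted values such as description="..." together
    level = tokens[0].split("[", 1)[0]
    message_words = []
    fields = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
        else:
            message_words.append(token)
    return level, " ".join(message_words), fields


def can_create_containers(lines):
    """Return True if any line carries the final success message from the slide."""
    wanted = "System can currently create Kata Containers"
    return any(parse_log_line(line)[1] == wanted for line in lines if line.strip())
```

Feeding the captured output of `kata-runtime kata-check` line by line into `can_create_containers` gives a simple yes/no answer for a host.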

Slide5

NVDIMM on AArch64

Why did we switch to NVDIMM on AArch64?

Virtio-blk as guest rootfs (original)

-device virtio-blk,drive=image-091275cdf53f819a,scsi=off,config-wce=off,romfile=
-drive id=image-091275cdf53f819a,file=/usr/share/kata-containers/kata-containers-2018-11-08-02:07:30.763626711+0800-osbuilder-0123f8f-agent-0f411fd,aio=threads,format=raw,if=none

Fatal error: couldn't launch more than two containers simultaneously (write lock error)

docker: Error response from daemon: OCI runtime create failed: qemu-system-aarch64: -device virtio-blk,drive=image-9f100592ac95eec6,scsi=off,config-wce=off: Failed to get "write" lock Is another process using the image?: unknown.
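The write lock failure can be illustrated with ordinary file locks: once one opener holds an exclusive lock on the shared image, a second opener cannot get it. The sketch below is only an analogy, assuming a Unix host. It uses `flock` as a stand-in, while QEMU implements image locking in its own way (fcntl-based byte-range locks, as far as we know); `try_exclusive_lock` and `demo` are hypothetical names.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(path):
    """Open path and try a non-blocking exclusive lock; return the fd, or None on conflict."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def demo():
    """Simulate two VMs opening the same rootfs image for writing."""
    with tempfile.NamedTemporaryFile() as image:
        first = try_exclusive_lock(image.name)   # first "VM" takes the lock
        second = try_exclusive_lock(image.name)  # second "VM" is refused
        result = (first is not None, second is not None)
        if first is not None:
            os.close(first)
        if second is not None:
            os.close(second)
        return result
```

A read-only, shared mapping of the image, which is what the NVDIMM approach provides, avoids the need for a write lock per VM.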

Slide6

NVDIMM/DAX on Kata Containers

Slide7

NVDIMM on AArch64

PoP (Point of Persistence)

ARMv8.2 at least

The point in non-volatile memory (aka. Persistent Memory)

DC CVAP

Clean data cache by address to Point of Persistence

HWCAP_DCPOP and dcpop in /proc/cpuinfo

Kernel will use DC CVAC operations if DC CVAP is not supported

[Diagram: CPU 0 to CPU 3, each with I and D caches; two L2 Caches (Unification); L3 Cache; Memory (coherency); Persistent Memory Device]
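Whether a host advertises DC CVAP can be read from the `dcpop` flag mentioned above. A minimal sketch, assuming the arm64 `/proc/cpuinfo` layout with a `Features` line per CPU; the helper names are hypothetical.

```python
def cpu_features(cpuinfo_text):
    """Collect the flags from every 'Features' line of an arm64 /proc/cpuinfo dump."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            features.update(value.split())
    return features


def has_dcpop(cpuinfo_text):
    """True if the kernel reports HWCAP_DCPOP (DC CVAP support) to user space."""
    return "dcpop" in cpu_features(cpuinfo_text)


def read_host_cpuinfo(path="/proc/cpuinfo"):
    """Read the real file on a host; kept separate so the parser can be fed sample text."""
    with open(path) as f:
        return f.read()
```

On a host, `has_dcpop(read_host_cpuinfo())` returning False means the kernel falls back to DC CVAC, as the slide notes.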

Slide8

vNVDIMM on Kata Containers

PoC (Point of Coherency) is enough for virtual NVDIMM

Slide9

Memory Hot-add on AArch64

Physical Memory Hot-add phase

Bridge communication between hardware/firmware and kernel

ACPI (automatically)

Probe Interface

Kernel recognizes new memory, makes new memory management tables, and makes sysfs files for new memory

Logical Memory Hot-add phase

Change memory state into available for users (online)

Add a goroutine that continuously listens for memory hot-add uevents (in kata)

Set CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y (automatically)

ACPI for x86_64 and Probe interface for AArch64
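The two phases map onto two sysfs writes in the guest: the start address goes to the `probe` file (physical phase), then `online` goes to each new block's `state` file (logical phase). The sketch below assumes the standard Linux memory hotplug sysfs layout. The function names are hypothetical, and the real kata-agent is written in Go and reacts to uevents rather than scanning.

```python
import os

MEMORY_SYSFS = "/sys/devices/system/memory"


def probe_memory(start_addr, sysfs_root=MEMORY_SYSFS):
    """Physical phase: tell the kernel about new memory at start_addr."""
    with open(os.path.join(sysfs_root, "probe"), "w") as f:
        f.write(hex(start_addr))


def online_offline_blocks(sysfs_root=MEMORY_SYSFS):
    """Logical phase: online every memoryN block that is still offline."""
    onlined = []
    for name in sorted(os.listdir(sysfs_root)):
        state_path = os.path.join(sysfs_root, name, "state")
        if not name.startswith("memory") or not os.path.isfile(state_path):
            continue
        with open(state_path) as f:
            state = f.read().strip()
        if state == "offline":
            with open(state_path, "w") as f:
                f.write("online")
            onlined.append(name)
    return onlined
```

The `sysfs_root` parameter exists only so the sketch can be tried against a scratch directory; on a real guest both writes need root.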

Slide10

Memory Hot-add on AArch64

First Step:

kata-runtime calls getGuestDetails on kata-agent

kata-agent checks the probe interface in the guest: /sys/devices/system/memory/probe

The result is stored

Slide11

Memory Hot-add on AArch64

Second Step:

Participants: qemu-system-aarch64, kata-runtime, kata-agent

query-memory-devices: free slot number

object_add memory-backend-ram

device_add pc-dimm

query-memory-devices: memory device addr

memHotplugByProbe

guest: echo addr > /sys/devices/system/memory/probe

acquire uevent

online memory
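The QMP side of the second step can be sketched as plain JSON messages. This is an illustration under assumptions, not the kata-runtime implementation (which is in Go): the ids `mem<slot>` and `dimm<slot>` are made up, and the `props` nesting follows the `object_add` form used by QEMU versions of that period.

```python
import json


def hot_add_memory_commands(slot, size_mib):
    """Build the QMP messages that add a memory backend and a pc-dimm using it."""
    mem_id = f"mem{slot}"  # hypothetical naming scheme
    return [
        {"execute": "object_add",
         "arguments": {"qom-type": "memory-backend-ram", "id": mem_id,
                       "props": {"size": size_mib * 1024 * 1024}}},
        {"execute": "device_add",
         "arguments": {"driver": "pc-dimm", "id": f"dimm{slot}", "memdev": mem_id}},
        {"execute": "query-memory-devices"},
    ]


def dimm_address(reply, dimm_id):
    """Pick the guest physical address of one DIMM out of a query-memory-devices reply."""
    for device in reply.get("return", []):
        data = device.get("data", {})
        if data.get("id") == dimm_id:
            return data.get("addr")
    return None


def to_wire(commands):
    """Serialize each command as one JSON line, the way QMP expects it on the socket."""
    return [json.dumps(command) for command in commands]
```

The address returned by `dimm_address` is what would then be written to the guest's probe file by memHotplugByProbe.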

Slide12

Memory Hot-add on AArch64

$ docker run -it --runtime kata-runtime -m 3G ubuntu
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
root@40bc669706b2:/# ls /sys/devices/system/memory/
auto_online_blocks  memory1  memory3  memory5  probe
block_size_bytes    memory2  memory4  power    uevent
root@40bc669706b2:/# cat /sys/devices/system/memory/block_size_bytes
40000000
root@40bc669706b2:/# lsmem
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x0000000040000000-0x000000017fffffff    5G online        no   1-5
Memory block size:       1G
Total online memory:     5G
Total offline memory:    0B

Current Status

guest kernel: v5.0 + Probe interface (packaging repo)

kata-runtime

kata-agent

qemu

upstream review
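The numbers in the lsmem output above are consistent with each other: `block_size_bytes` is hexadecimal, and the block ids follow from dividing the addresses by the block size. A short worked check (the helper name `block_range` is hypothetical):

```python
def block_range(start, end, block_size):
    """Return (first_block, last_block, total_bytes) for an inclusive address range."""
    first = start // block_size
    last = end // block_size
    return first, last, end - start + 1


# Values copied from the terminal output on the slide.
BLOCK_SIZE = int("40000000", 16)  # block_size_bytes is printed in hex
FIRST, LAST, TOTAL = block_range(0x0000000040000000, 0x000000017FFFFFFF, BLOCK_SIZE)

GIB = 1 << 30
SUMMARY = f"block size {BLOCK_SIZE // GIB}G, blocks {FIRST}-{LAST}, total {TOTAL // GIB}G"
```

`SUMMARY` evaluates to "block size 1G, blocks 1-5, total 5G", matching memory1 through memory5 in the directory listing.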

Slide13

Agenda

CPU hotplug in arm64 qemu

VM templating

SR-IOV virtual functions hotplug

TODO

Slide14

[Diagram: Kata runtime sends Update Resource to the Kata agent via gRPC; Mem/CPU hotplug grows the VM from 1 VCPU and 128M memory to n VCPUs and 2048M memory]

Slide15

CPU hotplug in arm64 qemu

Runtime (#1262, #1489), agent (#478)

Limitations on arm64:

ACPI-based CPU hotplug is not supported in the qemu guest so far, and there is no clear plan for it in the future.

The Arm GIC architecture can't handle hot-adding a vcpu after booting.

Mentioned by Mark Rutland & Christoffer Dall (commit 716139df):

"We'd need something along those lines. Each CPU has a notional point to point link to the interrupt controller (to the redistributor, to be precise), and this entity must pre-exist."

"When the vgic initializes its internal state it does so based on the number of VCPUs available at the time. If we allow KVM to create more VCPUs after the VGIC has been initialized, we are likely to error out in unfortunate ways later, perform buffer overflows etc."

Slide16

Participants: Kata (QMP), qemu, guest VM

Start guest with -smp 4 -append "maxcpus=1"

Start CPUs (1 online, 3 offline)

query_hotpluggable_cpus: return cpu node info

device_add/del cpu

Online/offline cpus by gRPC

Send DEVICE_DELETED event
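Inside the guest, bringing a pre-created CPU online or offline comes down to writing 1 or 0 to its sysfs `online` file. A minimal sketch of that guest-side step, assuming the standard Linux CPU sysfs layout; the function names are hypothetical and the real agent does this in Go after a gRPC request.

```python
import os

CPU_SYSFS = "/sys/devices/system/cpu"


def set_cpu_online(cpu, online, sysfs_root=CPU_SYSFS):
    """Write 1 or 0 to cpuN/online to bring one CPU up or down."""
    path = os.path.join(sysfs_root, f"cpu{cpu}", "online")
    with open(path, "w") as f:
        f.write("1" if online else "0")


def online_cpus(sysfs_root=CPU_SYSFS):
    """List the CPU numbers that are currently online."""
    result = []
    for name in os.listdir(sysfs_root):
        suffix = name[3:]
        if not (name.startswith("cpu") and suffix.isdigit()):
            continue  # skip entries such as cpufreq or cpuidle
        path = os.path.join(sysfs_root, name, "online")
        if not os.path.isfile(path):
            # Some platforms expose no 'online' file for the boot CPU; treat it as online.
            result.append(int(suffix))
            continue
        with open(path) as f:
            if f.read().strip() == "1":
                result.append(int(suffix))
    return sorted(result)
```

With a guest started as on the slide (4 CPUs present, 1 online), calling `set_cpu_online(1, True)` would correspond to one "online" step of the diagram.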

Slide17

Pros vs Cons:

Pros

Hypervisor doesn't need to support ACPI or other hardware CPU hotplug. This is really helpful to arm64 or even firecracker.

Code is easy to implement.

Cons

docker container density on arm64.

Concerns from community:

"this is an orchestration workaround that neither the hypervisor nor the architecture support. I'd prefer to get that feature implemented from the hypervisor."

"the biggest drawback of the proposed approach is security"

Slide18

VM templating

[Diagram: the Kata runtime saves a VM into the Template Pool of VM templates (memory backend file with RAM share=on, plus device state) and loads new VMs from a template via incoming migration]

Slide19

VM templating

Still more work is needed on arm64

Commit 18269069 ("migration: Introduce ignore-shared capability") adds the ignore-shared capability to bypass shared RAM blocks (e.g., a memory backend file with share=on). This benefits live migration.

It is expected that QEMU doesn't write anything to guest RAM until the VM starts, but it does in arm64 qemu.

The rom block "dtb" will be filled into RAM during rom_reset. In the incoming case, this rom filling seems not to be required, since all the data have already been stored in the memory backend file.

Catherine Ho submitted a proposal to fix this, which is under upstream review.
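The templating setup described above combines a shared, file-backed guest RAM with a migration capability that skips that RAM. The sketch below only builds the command-line fragment and the QMP message. It assumes the capability is exposed under the name `x-ignore-shared`, which is our understanding of upstream QEMU and should be checked against the QEMU version in use; the helper names and the `mem0` id are hypothetical.

```python
def template_memory_args(mem_path, size_mib, mem_id="mem0"):
    """QEMU arguments that back guest RAM with a shared file (share=on)."""
    backend = f"memory-backend-file,id={mem_id},size={size_mib}M,mem-path={mem_path},share=on"
    return ["-object", backend, "-numa", f"node,memdev={mem_id}"]


def ignore_shared_command(enabled=True):
    """QMP message that toggles the ignore-shared migration capability.

    The capability name is an assumption; older or patched QEMU builds may differ.
    """
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {
            "capabilities": [{"capability": "x-ignore-shared", "state": enabled}]
        },
    }
```

With the capability enabled, saving the template writes only device state, and a new VM started with the same memory file picks up the RAM contents without copying them.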

Slide20

SR-IOV vf hotplug

Eric Auger's work (KVM PCIe/MSI Passthrough on Arm/Arm64) in qemu and kernel in 2016

[Diagram: Kata runtime sends Update Resource; Mem/CPU hotplug and PCI hotplug grow the VM from 1 VCPU and 128M memory to n VCPUs, 2048M memory and PCI devices]

Slide21

SR-IOV vf hotplug

agent issue #414

$ sudo docker network create -d sriov --internal --opt pf_iface=enp1s0f0 --opt vlanid=100 --subnet=192.168.0.0/24 vfnet

The vf driver (e.g. mlx5) might generate a random MAC address which is different from the vf MAC address in the host.

Kata runtime and agent use the MAC address as the top priority to identify whether the link is the same or not.

Proposal: use the pci bdf address to search for the link instead
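Looking a link up by PCI BDF instead of MAC address can be sketched with sysfs, where a PCI network function lists its interface names under a `net` directory. This illustrates the proposal only and is not the agent's code; `interfaces_for_bdf` is a hypothetical name, and the layout assumed is the usual `/sys/bus/pci/devices/<bdf>/net/<ifname>`.

```python
import os

PCI_SYSFS = "/sys/bus/pci/devices"


def interfaces_for_bdf(bdf, sysfs_root=PCI_SYSFS):
    """Return the network interface names bound to the PCI function at bdf."""
    net_dir = os.path.join(sysfs_root, bdf, "net")
    if not os.path.isdir(net_dir):
        return []  # no such device, or the device has no network interface
    return sorted(os.listdir(net_dir))
```

Because the BDF is assigned when the vf is hot-plugged into the guest, this lookup does not depend on whichever MAC address the vf driver ends up choosing.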

Slide22

TODO on arm64

virtio-fs

nvdimm dax support

nemu support

firecracker support

Kubernetes integration test

Metrics

...
