Slide 1
Jia He & Penny Zheng
4/30/2019
Kata Containers on Arm:
Let's talk about our progress!
Slide 2: Agenda
- Kata Containers Design
- Hardware Requirements
- NVDIMM/DAX on AArch64
- Memory Hot-add on AArch64
Slide 3: Kata Containers Design
Slide 4: Hardware Requirements
- AArch64 bare metal
- kata-runtime kata-check: tests whether the system can run Kata Containers
- Host kernel version: v5.0 (recommended)
  - kvm: arm64: dynamic & >40-bit IPA support
  - IPA limit: ioctl request KVM_ARM_GET_MAX_VM_PHYS_SHIFT; KVM_CREATE_VM "type" field
  - Merged in v4.20, but the nearest stable version is v5.0
$ kata-runtime kata-check
INFO[0000] Unable to know if the system is running inside a VM  source=virtcontainers
INFO[0000] kernel property found  arch=arm64 description="Kernel-based Virtual Machine" name=kvm pid=132472 source=runtime type=module
INFO[0000] kernel property found  arch=arm64 description="Host kernel accelerator for virtio" name=vhost pid=132472 source=runtime type=module
INFO[0000] kernel property found  arch=arm64 description="Host kernel accelerator for virtio network" name=vhost_net pid=132472 source=runtime type=module
INFO[0000] System is capable of running Kata Containers  arch=arm64 name=kata-runtime pid=132472 source=runtime
INFO[0000] device available  arch=arm64 check-type=full device=/dev/kvm name=kata-runtime pid=132472 source=runtime
INFO[0000] feature available  arch=arm64 check-type=full feature=create-vm name=kata-runtime pid=132472 source=runtime
INFO[0000] System can currently create Kata Containers  arch=arm64 name=kata-runtime pid=132472 source=runtime
Slide 5: NVDIMM on AArch64
Virtio-blk as guest rootfs (original):
-device virtio-blk,drive=image-091275cdf53f819a,scsi=off,config-wce=off,romfile= -drive id=image-091275cdf53f819a,file=/usr/share/kata-containers/kata-containers-2018-11-08-02:07:30.763626711+0800-osbuilder-0123f8f-agent-0f411fd,aio=threads,format=raw,if=none

Fatal error: couldn't launch more than two containers simultaneously:
docker: Error response from daemon: OCI runtime create failed: qemu-system-aarch64: -device virtio-blk,drive=image-9f100592ac95eec6,scsi=off,config-wce=off: Failed to get "write" lock Is another process using the image?: unknown.
(write lock error)

Why did we switch to NVDIMM on AArch64?
Slide 6: NVDIMM/DAX on Kata Containers
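Switching the rootfs from virtio-blk to a virtual NVDIMM boils down to qemu options along these lines. This is only a sketch: the ids, image path, and sizes below are illustrative, not kata's exact command line.

```shell
# Illustrative vNVDIMM flags: back the guest rootfs with a file-backed memory
# object shared across VMs and expose it as an nvdimm device, so containers
# no longer contend for a virtio-blk "write" lock on the image.
IMG=/usr/share/kata-containers/kata-containers.img   # hypothetical image path
CMD="qemu-system-aarch64 -machine virt,nvdimm=on -m 2048M,slots=2,maxmem=4096M -object memory-backend-file,id=mem1,share=on,mem-path=$IMG,size=128M -device nvdimm,id=nv1,memdev=mem1"
echo "$CMD"
```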
Slide 7: NVDIMM on AArch64
- PoP (Point of Persistence): the point in non-volatile memory (a.k.a. persistent memory); requires ARMv8.2 at least
- DC CVAP: clean data cache by address to the Point of Persistence
- Advertised via HWCAP_DCPOP and the dcpop flag in /proc/cpuinfo
- The kernel falls back to DC CVAC operations if DC CVAP is not supported
[Diagram: cache hierarchy: per-CPU I/D caches, L2 cache (point of unification), L3 cache, memory (point of coherency), persistent memory device]
Slide 8: vNVDIMM on Kata Containers
PoC (Point of Coherency) is enough for a virtual NVDIMM.
Slide 9: Memory Hot-add on AArch64
Physical memory hot-add phase:
- Bridges communication between hardware/firmware and the kernel: ACPI (automatic) or the probe interface
- The kernel recognizes the new memory, builds new memory-management tables, and creates sysfs files for it
Logical memory hot-add phase:
- Changes the memory state to available for users (online)
- Either add a goroutine in kata that listens around the clock for memory hot-add uevents, or set CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y (automatic)
ACPI is used for x86_64; the probe interface for AArch64.
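The two phases can be sketched against the memory sysfs interface. The block index and address below are examples; in practice the agent reads block_size_bytes and finds the first free block itself.

```shell
# Sketch of probe-based memory hot-add (indexes/addresses are examples).
BLOCK_SIZE=$(( 0x40000000 ))   # in practice: cat /sys/devices/system/memory/block_size_bytes
NEXT_BLOCK=6                   # assume block 6 is the first free one
ADDR=$(printf '0x%x' $(( BLOCK_SIZE * NEXT_BLOCK )))
echo "$ADDR"                   # block-aligned physical address to probe

# Physical hot-add phase (no ACPI path on AArch64; needs root in the guest):
#   echo "$ADDR" > /sys/devices/system/memory/probe
# Logical hot-add phase, done on the uevent (or automatically with
# CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y):
#   echo online > /sys/devices/system/memory/memory$NEXT_BLOCK/state
```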
Slide 10: Memory Hot-add on AArch64
First step (kata-runtime -> kata-agent -> guest):
- getGuestDetails
- Check the probe interface: /sys/devices/system/memory/probe
- Store the result
Slide 11: Memory Hot-add on AArch64
Second step (kata-runtime <-> qemu-system-aarch64; kata-runtime -> kata-agent -> guest):
- query-memory-devices: get a free slot number
- object_add: memory-backend-ram
- device_add: pc-dimm
- query-memory-devices: get the memory device addr
- memHotplugByProbe (kata-agent), in the guest: echo addr > /sys/devices/system/memory/probe
- Acquire the uevent, then online the memory
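On the runtime side, the second step is a plain QMP exchange. Below is a sketch of the raw commands; the ids and the 1 GiB size are illustrative, and kata drives this through its QMP client rather than a heredoc.

```shell
# Raw QMP commands for hot-adding a 1 GiB pc-dimm (ids/sizes illustrative).
SIZE=$(( 1024 * 1024 * 1024 ))
QMP=$(cat <<EOF
{ "execute": "query-memory-devices" }
{ "execute": "object_add", "arguments": { "qom-type": "memory-backend-ram", "id": "mem1", "props": { "size": $SIZE } } }
{ "execute": "device_add", "arguments": { "driver": "pc-dimm", "id": "dimm1", "memdev": "mem1" } }
{ "execute": "query-memory-devices" }
EOF
)
echo "$QMP"
```

The second query-memory-devices returns the guest physical address of the new dimm, which the runtime then hands to the agent for the probe write.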
Slide 12: Memory Hot-add on AArch64
$ docker run -it --runtime kata-runtime -m 3G ubuntu
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
root@40bc669706b2:/# ls /sys/devices/system/memory/
auto_online_blocks  memory1  memory3  memory5  probe
block_size_bytes    memory2  memory4  power    uevent
root@40bc669706b2:/# cat /sys/devices/system/memory/block_size_bytes
40000000
root@40bc669706b2:/# lsmem
RANGE                                  SIZE  STATE REMOVABLE BLOCK
0x0000000040000000-0x000000017fffffff    5G online        no   1-5

Memory block size:       1G
Total online memory:     5G
Total offline memory:    0B

Current status:
- Guest kernel: v5.0+ with the probe interface (packaging repo)
- kata-runtime, kata-agent, qemu: upstream review
Slide 13: Agenda
- CPU hotplug in arm64 qemu
- VM templating
- SR-IOV virtual function hotplug
- TODO
Slide 14:
[Diagram: the kata-runtime starts each VM with 1 vCPU and 128 M memory, then grows it to n vCPUs and 2048 M memory via Mem/CPU hotplug, sending Update Resource requests to the kata-agent over gRPC]
Slide 15: CPU Hotplug in arm64 qemu
runtime (#1262, #1489), agent (#478)
Limitations on arm64:
- ACPI-based CPU hotplug is not yet supported for qemu guests, and there is no clear plan for it
- The Arm GIC architecture cannot handle hot-adding a vcpu after booting. As Mark Rutland & Christoffer Dall noted (commit 716139df): "We'd need something along those lines. Each CPU has a notional point to point link to the interrupt controller (to the redistributor, to be precise), and this entity must pre-exist. When the vgic initializes its internal state it does so based on the number of VCPUs available at the time. If we allow KVM to create more VCPUs after the VGIC has been initialized, we are likely to error out in unfortunate ways later, perform buffer overflows etc."
Slide 16:
Sequence (kata QMP <-> qemu <-> guest VM):
- Start the guest with -smp 4 and kernel cmdline "maxcpus=1"; the cpus start as 1 online, 3 offline
- query_hotpluggable_cpus: returns cpu node info
- device_add/del cpu
- Online/offline cpus via gRPC
- qemu sends the DEVICE_DELETED event
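This flow avoids real hot-add: every vcpu exists in qemu from boot, and only its logical online state changes later. A sketch, with illustrative flags rather than the full kata command line:

```shell
# Boot with 4 vcpus known to qemu, but only 1 brought online by the guest kernel.
BOOT='qemu-system-aarch64 -machine virt -smp 4 -append "maxcpus=1" ...'
echo "$BOOT"

# Later, "hotplug" is just the agent toggling logical state via sysfs
# (root in the guest):
#   echo 1 > /sys/devices/system/cpu/cpu2/online    # bring cpu2 online
#   echo 0 > /sys/devices/system/cpu/cpu2/online    # take it offline again
```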
Slide 17: Pros vs. Cons
Pros:
- The hypervisor doesn't need to support ACPI or other hardware CPU hotplug. This is really helpful for arm64, or even Firecracker.
- The code is easy to implement.
Cons:
- Docker container density on arm64.
- Concerns from the community: "this is an orchestration workaround that neither the hypervisor nor the architecture support. I'd prefer to get that feature implemented from the hypervisor."; "the biggest drawback of the proposed approach is security".
Slide 18: VM Templating
[Diagram: a pool of VM templates feeds the kata-runtime; a new VM is created via incoming migration. The memory backend file (RAM, share=on) and the device state are saved from the template and loaded into the clone]
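The save/load paths in the templating diagram map onto qemu options roughly as follows. All paths, ids, and sizes are illustrative, and the exact flags kata uses differ.

```shell
# Template VM: guest RAM lives in a shared, file-backed memory object, so
# clones can map the same pages; device state is saved separately.
TEMPLATE='qemu-system-aarch64 ... -object memory-backend-file,id=mem0,size=2048M,mem-path=/run/kata/template/memory,share=on -numa node,memdev=mem0'
# Clone VM: restore the saved device state via incoming migration.
CLONE='qemu-system-aarch64 ... -incoming "exec:cat /run/kata/template/state"'
echo "$TEMPLATE"
echo "$CLONE"
```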
Slide 19: VM Templating
Still more work is needed on arm64:
- Commit 18269069 ("migration: Introduce ignore-shared capability") adds the ignore-shared capability to bypass shared RAM blocks (e.g. a memory backend file with share=on). This benefits live migration.
- QEMU is expected not to write anything to guest RAM until the VM starts, but arm64 qemu does: the rom block "dtb" is filled into RAM during rom_reset. In the incoming case this rom filling seems unnecessary, since all the data is already stored in the memory backend file. Catherine Ho submitted a proposal to fix this, which is under upstream review.
Slide 20: SR-IOV VF Hotplug
Builds on Eric Auger's work (KVM PCIe/MSI passthrough on Arm/Arm64) in qemu and the kernel in 2016.
[Diagram: as in Slide 14, the kata-runtime resizes the VM via Mem/CPU hotplug and Update Resource, and additionally hot-adds PCI devices via PCI hotplug]
Slide 21: SR-IOV VF Hotplug
agent issue #414
$ sudo docker network create -d sriov --internal --opt pf_iface=enp1s0f0 --opt vlanid=100 --subnet=192.168.0.0/24 vfnet
- A VF driver (e.g. mlx5) might generate a random MAC address that differs from the VF's MAC address on the host.
- The kata runtime and agent use the MAC address as the top priority for identifying whether a link is the same.
- Proposal: search for the link by its PCI BDF address instead.
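A sketch of the proposed BDF-based matching. The real code would walk the /sys/class/net/*/device symlinks in the guest; here the iface-to-BDF table is passed in as arguments so the logic is self-contained, and all names and addresses are invented.

```shell
# Pick the interface whose PCI BDF matches, instead of trusting the MAC.
# In a real guest the table would come from: readlink -f /sys/class/net/<if>/device
lookup_by_bdf() {
  want=$1; shift
  for pair in "$@"; do              # each pair is "<iface> <bdf>"
    iface=${pair%% *}
    bdf=${pair##* }
    [ "$bdf" = "$want" ] && { echo "$iface"; return 0; }
  done
  return 1                          # no interface at that BDF
}

MATCH=$(lookup_by_bdf "0000:00:02.0" "eth0 0000:00:02.0" "eth1 0000:00:03.0")
echo "$MATCH"
```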
Slide 22: TODO on arm64
- virtio-fs
- NVDIMM DAX support
- NEMU support
- Firecracker support
- Kubernetes integration test
- Metrics
- ...
Slide 23