RDMA in Data Centers: From Cloud Computing to Machine Learning
Chuanxiong Guo
February 21 2018
Toutiao (Bytedance)
Outline

- Background
- RDMA over commodity Ethernet (RoCEv2)
- RoCEv2 safety and performance challenges
- Deployment experiences and lessons
- What's next
- Summary
Toutiao: A new AI-powered information and content platform

[Diagram] AI-assisted content creation and consumption:
- Creation: articles, images, video, live, Q&A, AR/VR, ... by human writers and machine writing
- Processing: process, analyze, data-mine, understand, organize
- Consumption: feeds, channels, apps, other entry points, ... by human readers and machine reading, via dissemination, discovery, interaction, search, filtering, and operation
- Foundation: AI infrastructure, platform, and services that learn semantic representations of every input-and-output (or task) based on big data and human-intelligence mining
Connecting people with information

[Diagram] Content (articles, video, topics, Q&A, wiki, photos, live, AR/VR, ...) connected with people (individuals, communities, interest groups): social & personalized + intelligent & ubiquitous
Toutiao AI Lab

Research + startup agility:
- Advance the state of the art in the technical areas important to the company's current and future business
- Directly participate in and work on the company's important products to integrate new technologies
- Attract and grow the best and brightest technical talent; create a continuous talent pipeline for the whole company

Research/technology: machine learning, computer vision, natural language, speech & audio, computer graphics, knowledge & data mining, AI infrastructure & cloud computing

Product/project: AI as a service, multimedia streaming, personalized agent and assistant, AI-assisted content moderation, AI camera, AI for consumer service support, AI for content creation and consumer service
The Rise of Cloud Computing
Data Centers
Data center networks (DCN)

- Cloud-scale services: IaaS, PaaS, search, big data, storage, machine learning, deep learning
- Services are latency sensitive, bandwidth hungry, or both
- Cloud-scale services need cloud-scale computing and communication infrastructure
Data center networks (DCN)

- Single ownership
- Large scale
- High bisection bandwidth
- Commodity Ethernet switches
- TCP/IP protocol suite

[Diagram] Clos topology: spine, leaf, and ToR layers; podsets and pods of servers
But TCP/IP is not doing well
TCP latency

Pingmesh measurement results: 405us (P50), 716us (P90), 2132us (P99). A long latency tail.
TCP processing overhead (40G)

Sender and receiver: 8 TCP connections, 40G NIC
An RDMA renaissance story
RDMA

- Remote Direct Memory Access (RDMA): a method of accessing memory on a remote system without interrupting the processing of the CPU(s) on that system
- RDMA offloads packet-processing protocols to the NIC
- IBVerbs/NetDirect for RDMA programming
- RDMA in Ethernet-based data centers (RoCEv2)
RoCEv2: RDMA over Commodity Ethernet

- RoCEv2 for Ethernet-based data centers
- RoCEv2 encapsulates packets in UDP
- The OS kernel is not in the data path
- The NIC handles network protocol processing and message DMA

[Diagram] TCP/IP stack: app in user space; TCP/IP and the NIC driver in the kernel; Ethernet in hardware. RoCEv2 stack: RDMA app and RDMA verbs in user space; DMA directly to hardware, where the NIC implements the RDMA transport, IP, and Ethernet over a lossless network.
RDMA benefit: latency reduction

Msg size | TCP P50 (us) | TCP P99 (us) | RDMA P50 (us) | RDMA P99 (us)
1KB      | 236          | 467          | 24            | 40
16KB     | 580          | 788          | 51            | 117
128KB    | 1483         | 2491         | 247           | 551
1MB      | 5290         | 6195         | 1783          | 2214

- For small msgs (<32KB), OS processing latency matters
- For large msgs (100KB+), speed matters
RDMA benefit: CPU overhead reduction

Sender and receiver: one ND connection, 40G NIC, 37Gb/s goodput
RDMA benefit: CPU overhead reduction

- RDMA: single QP, 88 Gb/s, 1.7% CPU
- TCP: eight connections, 30-50 Gb/s, client 2.6% CPU, server 4.3% CPU
- Testbed: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz, two sockets, 28 cores
RoCEv2 needs a lossless Ethernet network

- PFC for hop-by-hop flow control
- DCQCN for connection-level congestion control
Priority-based flow control (PFC)

- Hop-by-hop flow control, with eight priorities for HOL-blocking mitigation
- The priority of a data packet is carried in its VLAN tag
- A PFC pause frame tells the upstream device to stop sending
- PFC causes HOL blocking and collateral damage

[Diagram] When an ingress port's buffered data for a priority (p0-p7) crosses the XOFF threshold, the switch sends a PFC pause frame for that priority out of the corresponding upstream-facing port.
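The XOFF/XON mechanism can be illustrated with a toy model of a single ingress queue. The thresholds and units here are invented for illustration; real switches configure them per priority and per port based on buffer headroom and link RTT:

```python
# Toy model of PFC on one ingress queue. Real PFC runs per priority
# (p0-p7); the thresholds here are invented, in KB.
XOFF = 512   # pause upstream when buffered data exceeds this
XON = 256    # resume upstream when buffered data falls below this

class IngressQueue:
    def __init__(self):
        self.buffered = 0      # KB currently queued
        self.paused = False    # have we paused the upstream port?
        self.frames = []       # PFC frames sent upstream

    def enqueue(self, kb):
        self.buffered += kb
        if not self.paused and self.buffered > XOFF:
            self.paused = True
            self.frames.append("PAUSE")

    def dequeue(self, kb):
        self.buffered = max(0, self.buffered - kb)
        if self.paused and self.buffered < XON:
            self.paused = False
            self.frames.append("RESUME")

q = IngressQueue()
for _ in range(10):     # congestion: arrivals outpace the drain
    q.enqueue(100)
for _ in range(8):      # congestion clears: the queue drains
    q.dequeue(100)
```

Because the pause is issued before the buffer overflows, no packet is dropped; the price is that a paused port blocks every flow sharing that priority, which is the HOL blocking and collateral damage noted above.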
DCQCN

- CP (Congestion Point, the switch): marks packets with ECN
- NP (Notification Point, the receiver NIC): periodically checks whether ECN-marked packets arrived; if so, notifies the sender
- RP (Reaction Point, the sender NIC): adjusts the sending rate based on NP feedback

DCQCN = keep PFC + use ECN + hardware rate-based congestion control
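The RP behavior can be sketched with a simplified rate-control loop. The constants and the recovery rule below are illustrative only; the real DCQCN algorithm in the NIC keeps extra state such as byte counters and staged increase phases:

```python
# Simplified DCQCN reaction point (RP): on ECN feedback (a CNP from the
# NP), cut the rate multiplicatively; otherwise decay the congestion
# estimate and recover toward the last target rate.
LINE_RATE = 40.0   # Gb/s, illustrative
G = 1.0 / 16       # EWMA gain for the congestion estimate alpha

class ReactionPoint:
    def __init__(self):
        self.rate = LINE_RATE     # current sending rate
        self.target = LINE_RATE   # rate to recover toward
        self.alpha = 1.0          # estimated congestion level

    def on_feedback(self, cnp_received):
        if cnp_received:          # ECN marks seen: multiplicative decrease
            self.target = self.rate
            self.rate *= 1 - self.alpha / 2
            self.alpha = (1 - G) * self.alpha + G
        else:                     # no marks: decay alpha, recover the rate
            self.alpha = (1 - G) * self.alpha
            self.rate = (self.rate + self.target) / 2

rp = ReactionPoint()
rp.on_feedback(True)          # congestion: rate is cut in half
for _ in range(10):
    rp.on_feedback(False)     # clears: rate climbs back toward line rate
```

The shape is the point: a sharp cut on congestion feedback, then fast recovery toward the remembered target, all done in NIC hardware per connection.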
The safety and performance challenges

- RDMA transport livelock
- PFC deadlock
- PFC pause frame storm
- Slow-receiver symptom
RDMA transport livelock

[Diagram] Sender and receiver communicate through a switch with a packet drop rate of 1/256. With go-back-0, a NAK for packet N makes the sender restart from Send 0, so a long transfer never finishes (livelock). With go-back-N, a NAK for packet N makes the sender resume from Send N.
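A toy simulation of the two retransmission policies, assuming an idealized sender that learns of every loss immediately and a deterministic loss of every 256th transmission standing in for the measured 1/256 drop rate:

```python
# Deliver `num_pkts` packets in order through a link that drops every
# 256th transmission. Returns the number of transmissions used, or None
# if the transfer never completes (livelock).
def deliver(num_pkts, go_back_n, max_tx=1_000_000):
    tx = 0        # transmissions put on the wire
    expected = 0  # receiver's next expected sequence number
    seq = 0       # sender's next sequence number to transmit
    while expected < num_pkts:
        tx += 1
        if tx > max_tx:
            return None
        if tx % 256 == 0:
            # Loss: the receiver NAKs. Go-back-N resumes from the lost
            # packet; go-back-0 restarts the whole message.
            seq = expected if go_back_n else 0
        else:
            if seq == expected:
                expected += 1   # in-order delivery
            seq += 1            # duplicates are sent but ignored
    return tx
```

In this model `deliver(300, go_back_n=True)` finishes in 301 transmissions (one retransmission for the single loss), while `deliver(300, go_back_n=False)` never gets past packet 255: after every restart, the 255 transmissions between drops are spent resending packets the receiver already has. That is the livelock that motivated switching the NIC transport from go-back-0 to go-back-N.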
PFC deadlock

- Our data centers use a Clos network
- Packets first travel up, then go down
- No cyclic buffer dependency for up-down routing -> no deadlock
- But we did experience deadlock!

[Diagram] Clos topology: spine, leaf, and ToR layers; podsets and pods of servers
PFC deadlock

Preliminaries:
- ARP table: IP address to MAC address mapping
- MAC table: MAC address to port mapping
- If the MAC entry is missing, packets are flooded to all ports

ARP table:
IP  | MAC  | TTL
IP0 | MAC0 | 2h
IP1 | MAC1 | 1h

MAC table:
MAC  | Port  | TTL
MAC0 | Port0 | 10min
MAC1 | --    |

Input: Dst IP1 -> Output: flooded (MAC1 has no port entry)
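The lookup-and-flood behavior can be sketched directly as a toy forwarding function (real switches also learn source MACs and age entries out, which is how the stale-entry situation above arises):

```python
# Toy L2 forwarding: unicast when the destination MAC is known, flood
# to every other port when it is not. For lossless (PFC) traffic the
# flooded copies land in many ingress queues at once, which is what
# makes the missing-entry case dangerous.
def forward(mac_table, dst_mac, all_ports, in_port):
    port = mac_table.get(dst_mac)
    if port is not None:
        return [port]                                   # unicast
    return [p for p in all_ports if p != in_port]       # flood

mac_table = {"MAC0": "Port0"}          # MAC1 has aged out of the table
ports = ["Port0", "Port1", "Port2", "Port3"]
known = forward(mac_table, "MAC0", ports, "Port3")
unknown = forward(mac_table, "MAC1", ports, "Port3")
```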
PFC deadlock

[Diagram] Deadlock example: servers S1-S5 under ToRs T0 and T1, connected by leaves La and Lb. Traffic follows the paths {S1, T0, La, T1, S3}, {S1, T0, La, T1, S5}, and {S4, T1, Lb, T0, S2}. A dead server leads to packet flooding; PFC pause frames from congested ingress ports chain across T0, La, T1, and Lb into a cycle, and the network deadlocks.
PFC deadlock

- Root cause: the interaction between PFC flow control and Ethernet packet flooding
- Solution (short-term): drop lossless packets if the ARP entry is incomplete
- Recommendation: do not flood or multicast lossless traffic
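The reasoning above (up-down routing is deadlock-free because its buffer dependency graph is acyclic, and flooding breaks that property) can be sketched as a cycle check on a small dependency graph. This is an illustration, not the production algorithm:

```python
# Deadlock risk check: model "buffer A waits on buffer B" edges and
# test for a cycle, i.e. a cyclic buffer dependency (CBD).
def has_cycle(deps):
    # deps: {node: [nodes it depends on]}; DFS with three colors.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in deps}
    def visit(n):
        color[n] = GRAY
        for m in deps.get(n, []):
            if color.get(m, WHITE) == GRAY:
                return True          # back edge: cycle found
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and visit(n) for n in deps)

# Up-down routing: dependencies only go up then down -> acyclic.
updown = {"T0_up": ["La_down"], "La_down": ["T1_down"], "T1_down": []}

# Flooded packets re-enter an upward port, closing the loop (the
# T0 -> La -> T1 -> Lb -> T0 example above, abstracted).
flooded = {"T0": ["La"], "La": ["T1"], "T1": ["Lb"], "Lb": ["T0"]}
```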
Tagger: practical PFC deadlock prevention

- The Tagger algorithm works for general network topologies
- Deployable in existing switching ASICs
- Concept: the Expected Lossless Path (ELP) decouples Tagger from routing
- Strategy: move packets to a different lossless queue, using pre-calculated rules, before a cyclic buffer dependency (CBD) can form

[Diagram] Example topologies with leaves L0-L3, ToRs T0-T3, and servers S0-S1
NIC PFC pause frame storm

- A malfunctioning NIC may block the whole network
- PFC pause frame storms caused several incidents
- Solution: watchdogs on both the NIC and switch sides to stop the storm

[Diagram] Spine layer, leaf layer, ToRs 0-7, and servers in podsets 0 and 1, with one malfunctioning NIC pausing the fabric
The slow-receiver symptom

- ToR to NIC is 40Gb/s; NIC to server is 64Gb/s (PCIe Gen3 x8)
- But NICs may generate a large number of PFC pause frames
- Root cause: the NIC is resource constrained
- Mitigations: a larger page size for MTT (memory translation table) entries; dynamic buffer sharing at the ToR

[Diagram] Server with CPU and DRAM; the NIC holds the MTT, WQEs, and QPC; 40Gb/s QSFP link to the ToR, 64Gb/s PCIe Gen3 x8 to the host; pause frames flow back to the ToR
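Why a larger MTT page size helps can be shown with simple arithmetic (the region size below is illustrative):

```python
# Entries needed to map a registered memory region scale inversely with
# the page size, so larger pages keep the working set inside the NIC's
# small on-chip translation cache.
def mtt_entries(region_bytes, page_bytes):
    # One MTT entry translates one page of the registered region.
    return -(-region_bytes // page_bytes)   # ceiling division

GiB = 1 << 30
region = 4 * GiB                            # illustrative registered region

small_pages = mtt_entries(region, 4 * 1024)         # 4KB pages
large_pages = mtt_entries(region, 2 * 1024 * 1024)  # 2MB pages
```

With 4KB pages a 4GiB region needs 1,048,576 translation entries; 2MB pages shrink that by 512x to 2,048. Fewer entries mean fewer translation misses on the NIC, which is one way a resource-constrained NIC avoids becoming a slow receiver.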
Deployment experiences and lessons learned
Latency reduction

- RoCEv2 deployed in Bing world-wide for two and a half years
- Significant latency reduction
- Incast problem solved, since there are no packet drops
RDMA throughput

- Using two podsets, each with 500+ servers
- 5Tb/s capacity between the two podsets
- Achieved 3Tb/s inter-podset throughput
- Bottlenecked by ECMP routing
- Close to 0 CPU overhead
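One way ECMP caps aggregate throughput: flows are hashed onto equal-cost paths, and hash collisions overload some paths while others idle. A toy model (the hash function and flow counts are invented for illustration):

```python
# Toy ECMP: hash each flow's 5-tuple onto one of 16 equal-cost paths.
# With per-flow hashing, the busiest path saturates first, capping the
# achievable fraction of total capacity.
import hashlib

NUM_PATHS = 16

def ecmp_path(five_tuple):
    digest = hashlib.md5(repr(five_tuple).encode()).digest()
    return digest[0] % NUM_PATHS

flows = [("10.0.0.%d" % s, "10.1.0.%d" % d, 12345, 4791, "UDP")
         for s in range(8) for d in range(8)]      # 64 flows
load = [0] * NUM_PATHS
for f in flows:
    load[ecmp_path(f)] += 1

# If the busiest path saturates at link rate, aggregate utilization is
# at most this fraction of total capacity (1.0 only if perfectly even).
utilization_bound = sum(load) / (NUM_PATHS * max(load))
```

In practice the load is rarely even, so the bound sits below 1.0, which is consistent with achieving 3Tb/s out of 5Tb/s above.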
Latency and throughput tradeoff

[Diagram] Two racks of servers (S0,0-S0,23 and S1,0-S1,23) under ToRs T0 and T1 and leaves L0 and L1. RDMA latencies (us) increase as data shuffling starts: low latency vs high throughput, before and during the shuffle.
Lessons learned

- Providing losslessness is hard! Deadlock, livelock, and PFC pause frame propagation and storms did happen
- Be prepared for the unexpected: configuration management; latency/availability, PFC pause frame, and RDMA traffic monitoring
- NICs are the key to making RoCEv2 work
What’s next?
Applications, technologies, architectures, and protocols:
- RDMA for X (search, storage, HFT, DNN, etc.)
- Lossy vs lossless networks
- Practical, large-scale, deadlock-free, collateral-damage-free networks
- RDMA programming
- RDMA for heterogeneous computing systems
- RDMA virtualization
- Low latency, high bandwidth
- RDMA security
- Software vs hardware
- Inter-DC RDMA
- Historically, software-based packet processing won (multiple times): see the TCP processing overhead analysis by David Clark et al.; none of the stateful TCP offloads took off (e.g., TCP Chimney)
- The story is different this time: Moore's law is ending, accelerators are coming, network speeds keep increasing, and the demand for ultra-low latency is real
- Will software win (again)?
Is lossless mandatory for RDMA?

- There is no binding between RDMA and a lossless network
- But designing and implementing a more sophisticated transport protocol in hardware is a challenge
RDMA virtualization for container networking

- A router acts as a proxy for the containers
- Shared memory for improved performance; zero copy is possible
RDMA for DNN

- Distributed DNN training iterations: forward propagation, back propagation, update gradients
- Each iteration involves both computation and communication
RDMA for DNN

[Diagram] Training timeline: epochs 0 to M, each made of minibatches 0 to N; on every GPU (GPU0, GPU1, ..., GPU X), compute alternates with communication (data transmission) for each minibatch
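The timeline suggests why communication time matters even when it can be overlapped with compute. A toy model with invented times (ms), not measurements:

```python
# Each minibatch does compute (forward + backward) and then communicates
# gradients. Compare a fully serialized schedule with one that overlaps
# minibatch i's communication with minibatch i+1's compute.

def serialized(n_minibatches, compute_ms, comm_ms):
    return n_minibatches * (compute_ms + comm_ms)

def overlapped(n_minibatches, compute_ms, comm_ms):
    # Steady state takes max(compute, comm) per minibatch; the first
    # compute and the last communication cannot be hidden.
    steps = max(n_minibatches - 1, 0)
    return compute_ms + steps * max(compute_ms, comm_ms) + comm_ms

serial_ms = serialized(100, compute_ms=8, comm_ms=6)    # 1400 ms
overlap_ms = overlapped(100, compute_ms=8, comm_ms=6)   # 806 ms
```

When communication fits under compute, overlap hides it almost entirely; a faster transport (RDMA rather than TCP) both keeps comm under compute at larger model sizes and shrinks the non-hidden tail.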
RDMA for DNN

TensorFlow, 1 PS and 2 workers on 100GbE RoCEv2, ResNet-50 benchmark: TCP vs RDMA
RDMA programming

How many LOC for a "hello world" communication using RDMA?
- For TCP, it is 60 LOC for client or server code
- For RDMA, it is complicated: IBVerbs, 600 LOC; RDMA CM, 300 LOC; Rsocket, 60 LOC
RDMA programming

Make RDMA programming more accessible:
- Easy-to-set-up RDMA server and switch configurations
- Can I run and debug my RDMA code on my desktop/laptop?
- High-quality code samples
- Better abstractions: in the Unix/Linux world everything is a file, but RDMA QPs/CQs are not file descriptors
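For contrast, the 60-LOC TCP baseline shrinks even further in a high-level language, precisely because a socket behaves like a file. A complete loopback round trip as a sketch (verbs code must instead create contexts, QPs, and CQs, register memory, and exchange connection state out of band):

```python
# TCP "hello world": the sockets API gives a connected, file-like byte
# stream with almost no setup. This is the baseline the 600-LOC IBVerbs
# version is competing with.
import socket, threading

def server(listener):
    conn, _ = listener.accept()
    with conn:
        conn.sendall(conn.recv(1024).upper())   # uppercase echo

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))                 # pick any free port
listener.listen(1)
t = threading.Thread(target=server, args=(listener,))
t.start()

with socket.create_connection(listener.getsockname()) as c:
    c.sendall(b"hello world")
    reply = c.recv(1024)
t.join()
listener.close()
```

Rsocket gets RDMA down to roughly this size by preserving the sockets interface; the open question raised above is whether native RDMA objects (QPs, CQs) can become similarly file-like.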
Summary

- RDMA is experiencing a renaissance in data centers (RoCEv2)
- Many opportunities and interesting problems in high-speed, low-latency RDMA networking
- Many opportunities in making RDMA accessible to more developers
Acknowledgements

- Yan Cai, Gang Cheng, Zhong Deng, Daniel Firestone, Juncheng Gu, Shuihai Hu, Hongqiang Liu, Marina Lipshteyn, Ali Monfared, Jitendra Padhye, Gaurav Soni, Haitao Wu, Jianxi Ye, Yibo Zhu
- Azure, Bing, CNTK, and Philly collaborators
- Arista Networks, Cisco, Dell, and Mellanox partners
- Colleagues in the Toutiao AI lab and networking teams
We are hiring for both FTEs and interns
HR: rdus.staffing@bytedance.com / guochuanxiong@bytedance.com
Q&A