RDMA in Data Centers: From Cloud Computing to Machine Learning


Presentation Transcript

Slide1

RDMA in Data Centers: From Cloud Computing to Machine Learning

Chuanxiong Guo

February 21 2018

Toutiao

(

Bytedance

)Slide2

Outline

Background
RDMA over commodity Ethernet (RoCEv2)
RoCEv2 safety and performance challenges
Deployment experiences and lessons
What's next
Summary

Slide3

Toutiao: A new AI-powered information and content platform

Consumption: Feeds, Channels, Apps, Other Entry Points … (Machine Reading; Human Readers)
Dissemination: Discovery, Interaction, Search, Filtering, Operation; Process, Analyze, Data-Mine, Understand, Organize
Creation: Articles, Images, Video, Live, QnA, AR/VR (Human Writers; Machine Writing)
AI-assisted Content Creation and AI-assisted Content Consumption
AI Infrastructure, Platform, and Services: learn semantic representations of every input-and-output (or task) based on big data & human intelligence mining

Slide4

Connecting people with information

Articles, Video, Topics, Q&A, Wiki, Photos, Live
Individuals, Communities, Interest Groups
Social & Personalized + Intelligent & Ubiquitous
AR/VR

Slide5

Toutiao AI lab: Research + Startup Agility

Advance the state of the art in the technical areas important to the company's current and future business
Directly participate in and work on the company's important products to integrate new technologies
Attract and grow the best and brightest technical talent; create a continuous talent pipeline for the whole company

Research/Technology: Machine Learning, Computer Vision, Natural Language, Speech & Audio, Computer Graphics, Knowledge & Data Mining, AI Infrastructure & Cloud Computing

Product/Project: AI as a Service, Multimedia Streaming, Personalized Agent and Assistant, AI-assisted Content Moderation, AI Camera, AI for Consumer Service Support, AI for Content Creation and consumer service

Slide6

The Rise of Cloud Computing

Slide7

Data Centers

Slide8

Data center networks (DCN)

Cloud-scale services: IaaS, PaaS, Search, BigData, Storage, Machine Learning, Deep Learning
Services are latency sensitive, bandwidth hungry, or both
Cloud-scale services need cloud-scale computing and communication infrastructure

Slide9

Data center networks (DCN)

Single ownership
Large scale
High bisection bandwidth
Commodity Ethernet switches
TCP/IP protocol suite

Topology: servers grouped into pods and podsets, connected through ToR, Leaf, and Spine layers

Slide10

But TCP/IP is not doing well

Slide11

TCP latency

Pingmesh measurement results:
405us (P50)
716us (P90)
2132us (P99)

Long latency tail
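The long-tail numbers above are percentiles over many RTT probes. A minimal sketch of how such tail percentiles are computed; the RTT samples are synthetic placeholders, not real Pingmesh data:

```python
# Tail-latency percentiles in the style of the Pingmesh numbers above.
# The RTT samples here are synthetic (an assumption), not real measurements.

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample covering p% of the data."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# 100 synthetic RTTs (us): mostly fast, a few slow outliers form the tail
rtts_us = [400] * 50 + [700] * 40 + [2100] * 9 + [5000]
p50, p90, p99 = (percentile(rtts_us, p) for p in (50, 90, 99))
print(p50, p90, p99)  # 400 700 2100 -- the P99 is ~5x the P50
```

The point mirrors the slide: the median can look healthy while the P99 tail is several times worse.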

Slide12

TCP processing overhead (40G)

Sender and receiver, 8 TCP connections, 40G NIC

Slide13

An RDMA renaissance story

Slide14

RDMA

Remote Direct Memory Access (RDMA): a method of accessing memory on a remote system without interrupting the processing of the CPU(s) on that system
RDMA offloads packet processing protocols to the NIC
IBVerbs/NetDirect for RDMA programming
RDMA in Ethernet-based data centers (RoCEv2)

Slide15

RoCEv2: RDMA over Commodity Ethernet

RoCEv2 for Ethernet-based data centers
RoCEv2 encapsulates packets in UDP
The OS kernel is not in the data path; the NIC handles network protocol processing and message DMA

TCP/IP stack: RDMA app and RDMA verbs in user space, TCP/IP and the NIC driver in the kernel, Ethernet in hardware
RoCEv2 stack: RDMA app and RDMA verbs in user space; RDMA transport, IP, and Ethernet offloaded to the NIC, over a lossless network

Slide16

RDMA benefit: latency reduction

Msg size   TCP P50 (us)   TCP P99 (us)   RDMA P50 (us)   RDMA P99 (us)
1KB        236            467            24              40
16KB       580            788            51              117
128KB      1483           2491           247             551
1MB        5290           6195           1783            2214

For small msgs (<32KB), OS processing latency matters
For large msgs (100KB+), speed matters

Slide17

RDMA benefit: CPU overhead reduction

Sender and receiver, one ND connection, 40G NIC, 37Gb/s goodput

Slide18

RDMA benefit: CPU overhead reduction

RDMA: single QP, 88 Gb/s, 1.7% CPU
TCP: eight connections, 30-50Gb/s, client 2.6% and server 4.3% CPU
Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz, two sockets, 28 cores

Slide19

RoCEv2 needs a lossless Ethernet network

PFC for hop-by-hop flow control

DCQCN for connection-level congestion control

Slide20

Priority-based flow control (PFC)

Hop-by-hop flow control, with eight priorities (p0-p7) for HOL-blocking mitigation
The priority of a data packet is carried in the VLAN tag
When an ingress queue crosses the XOFF threshold, the switch sends a PFC pause frame to tell the upstream egress port to stop sending that priority
PFC causes HOL blocking and collateral damage
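The XOFF behavior above can be sketched as a toy per-priority pause state machine. The thresholds, queue sizes, and the XON resume level are illustrative assumptions, not values from the deck:

```python
# Sketch of per-priority PFC: when an ingress queue's occupancy crosses the
# XOFF threshold, pause that priority upstream; resume once it drains below
# an XON threshold. All sizes/thresholds here are invented for illustration.

XOFF, XON = 80, 40  # buffer thresholds in KB (illustrative values)

class IngressPort:
    def __init__(self, priorities=8):
        self.occupancy = [0] * priorities  # buffered KB per priority
        self.paused = [False] * priorities

    def enqueue(self, priority, size_kb):
        self.occupancy[priority] += size_kb
        if self.occupancy[priority] >= XOFF and not self.paused[priority]:
            self.paused[priority] = True   # would emit a PFC pause frame
        return self.paused[priority]

    def dequeue(self, priority, size_kb):
        self.occupancy[priority] = max(0, self.occupancy[priority] - size_kb)
        if self.occupancy[priority] <= XON and self.paused[priority]:
            self.paused[priority] = False  # would emit a PFC resume frame
        return self.paused[priority]

port = IngressPort()
for _ in range(9):
    port.enqueue(1, 10)                    # congest priority 1 only
print(port.paused[1], port.paused[0])      # True False
```

Note how only the congested priority is paused: that is the per-priority isolation PFC provides, while traffic within one paused priority still suffers HOL blocking.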

Slide21

DCQCN

CP (Congestion Point): switches use ECN for packet marking
NP (Notification Point, receiver NIC): periodically checks whether ECN-marked packets arrived; if so, notifies the sender
RP (Reaction Point, sender NIC): adjusts the sending rate based on NP feedback

DCQCN = keep PFC + use ECN + hardware rate-based congestion control
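A much-simplified sketch of the RP side of this loop. The line rate, the alpha update, and the recovery rule are toy assumptions that only capture the shape of DCQCN's cut-then-recover behavior, not the actual algorithm state (byte counters, timers, stages):

```python
# Toy model of the sender-side (RP) rate control loop described above:
# multiplicative decrease when the NP reports ECN marks, recovery otherwise.

LINE_RATE = 100.0  # Gb/s; an assumed link speed, not from the deck

class ReactionPoint:
    def __init__(self):
        self.rate = LINE_RATE     # current sending rate
        self.target = LINE_RATE   # rate to recover toward
        self.alpha = 1.0          # rough congestion-severity estimate

    def on_feedback(self, cnp_received):
        if cnp_received:                          # NP saw ECN-marked packets
            self.target = self.rate
            self.rate *= 1 - self.alpha / 2       # multiplicative decrease
            self.alpha = min(1.0, self.alpha + 0.5)
        else:                                     # no congestion reported
            self.alpha /= 2
            self.rate = (self.rate + self.target) / 2  # recover toward target

rp = ReactionPoint()
rp.on_feedback(True)          # congestion: rate drops sharply
congested = rp.rate
for _ in range(10):
    rp.on_feedback(False)     # rate climbs back toward the pre-cut level
print(congested, round(rp.rate, 2))  # 50.0 99.95
```

The key design point survives the simplification: the rate adjustment runs in NIC hardware per connection, so ECN handles congestion before PFC pauses are needed.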

Slide22

The safety and performance challenges

RDMA transport livelock
PFC deadlock
PFC pause frame storm
Slow-receiver symptom

Slide23

RDMA transport livelock

Experiment: sender and receiver connected through a switch with packet drop rate 1/256
Go-back-0: on NAK N, the sender restarts from RDMA Send 0; with a sustained drop rate the transfer keeps restarting and makes little or no progress (livelock)
Go-back-N: on NAK N, the sender resumes from RDMA Send N, so every delivered packet is progress
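The gap between the two policies above can be seen in a small simulation; the trial count and seed are arbitrary choices:

```python
# Simulation contrasting the two retransmission policies under the slide's
# 1/256 packet drop rate. Counts total transmissions needed to deliver 1000
# packets in order: go-back-0 restarts from packet 0 on any loss, go-back-N
# resumes at the lost packet.

import random

def transmissions(n_packets, drop_rate, go_back_zero, rng):
    """Total packets sent until all n_packets are delivered in order."""
    sent = nxt = 0
    while nxt < n_packets:
        sent += 1
        if rng.random() < drop_rate:
            nxt = 0 if go_back_zero else nxt   # restart from 0 vs resume at N
        else:
            nxt += 1
    return sent

rng = random.Random(42)
trials = 30
gbn = sum(transmissions(1000, 1 / 256, False, rng) for _ in range(trials)) / trials
gb0 = sum(transmissions(1000, 1 / 256, True, rng) for _ in range(trials)) / trials
print(gbn < gb0)  # go-back-0 needs far more transmissions; at higher drop
                  # rates or longer transfers it effectively never finishes
```

Go-back-N averages barely more than 1000 sends, while go-back-0 pays for a full restart on every loss, which is why the deck's fix was switching the transport's loss recovery.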

Slide24

PFC deadlock

Our data centers use a Clos network (Spine, Leaf, ToR, podsets, pods, servers)
Packets first travel up, then go down
No cyclic buffer dependency for up-down routing -> no deadlock
But we did experience deadlock!

Slide25

PFC deadlock

Preliminaries:
ARP table: IP address to MAC address mapping
MAC table: MAC address to port mapping
If the MAC entry is missing, packets are flooded to all ports

Example, for an input packet with Dst: IP1:
ARP table: IP0 -> MAC0 (TTL 2h); IP1 -> MAC1 (TTL 1h)
MAC table: MAC0 -> Port0 (TTL 10min); MAC1 -> -- (no entry, so the packet is flooded)

Slide26

PFC deadlock

(Figure: leaf switches La, Lb; ToRs T0, T1; servers S1-S5, including a dead server; PFC pause frames propagate between ingress and egress ports, with a congested port and packet drops)
Path: {S1, T0, La, T1, S3}
Path: {S1, T0, La, T1, S5}
Path: {S4, T1, Lb, T0, S2}

Slide27

PFC deadlock

The PFC deadlock root cause: the interaction between PFC flow control and Ethernet packet flooding
Solution (short-term): drop the lossless packets if the ARP entry is incomplete
Recommendation: do not flood or multicast lossless traffic
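The short-term fix above amounts to one extra branch in the forwarding decision. A sketch, where the lossless priority class and the port names are illustrative assumptions:

```python
# Sketch of the fix above: on a MAC-table miss, flood only lossy traffic;
# drop lossless-class packets instead of flooding them, since a flooded
# lossless packet can seed the PFC/flooding deadlock.

LOSSLESS_PRIORITIES = {3}  # e.g., the RoCEv2 priority class (an assumption)

def forward(mac_table, dst_mac, priority, all_ports):
    port = mac_table.get(dst_mac)
    if port is not None:
        return [port]                      # normal unicast forwarding
    if priority in LOSSLESS_PRIORITIES:
        return []                          # drop: never flood lossless traffic
    return list(all_ports)                 # flood lossy traffic as usual

table = {"MAC0": "Port0"}                  # MAC1 entry is missing
ports = ["Port0", "Port1", "Port2"]
print(forward(table, "MAC0", 3, ports))    # ['Port0']
print(forward(table, "MAC1", 0, ports))    # flooded to all ports
print(forward(table, "MAC1", 3, ports))    # [] -- dropped, not flooded
```

Dropping here is safe precisely because RDMA's transport retransmits: a rare drop costs a retransmission, while a flooded lossless packet risks a network-wide deadlock.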

Slide28

Tagger: practical PFC deadlock prevention

(Example topology: leaves L0-L3, ToRs T0-T3, spines S0-S1)

The Tagger algorithm works for general network topologies
Deployable in existing switching ASICs
Concept: Expected Lossless Paths (ELP) decouple Tagger from routing
Strategy: move packets to a different lossless queue, using pre-calculated rules, before a cyclic buffer dependency (CBD) forms
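The queue-switching idea can be illustrated with a toy tagging rule. This is not the Tagger algorithm itself: the layer numbering and the "bounce" rule are assumptions made for illustration only:

```python
# Illustrative sketch of queue switching: give each packet a tag that
# increases whenever its path "bounces" off the expected up-down lossless
# path, and map each tag value to a separate lossless queue. With a bounded
# number of bounces, buffer dependencies across queues cannot form a cycle.

LAYER = {"server": 0, "tor": 1, "leaf": 2, "spine": 3}

def tags_along_path(path):
    """Tag carried at each hop; bump it on every downward-to-upward turn."""
    tag, tags, going_down = 1, [1], False
    for prev, curr in zip(path, path[1:]):
        if LAYER[curr] < LAYER[prev]:
            going_down = True
        elif LAYER[curr] > LAYER[prev] and going_down:
            tag += 1            # bounce: switch to the next lossless queue
            going_down = False
        tags.append(tag)
    return tags

updown = ["server", "tor", "leaf", "tor", "server"]  # normal up-down path
bounce = ["server", "tor", "leaf", "tor", "leaf", "tor", "server"]
print(tags_along_path(updown))   # [1, 1, 1, 1, 1]
print(tags_along_path(bounce))   # [1, 1, 1, 1, 2, 2, 2]
```

Packets that stay on an expected lossless path keep one tag and one queue; only deviating packets consume extra queues, which keeps the rule table small enough for existing ASICs.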

Slide29

NIC PFC pause frame storm

A malfunctioning NIC may block the whole network
PFC pause frame storms caused several incidents
Solution: watchdogs at both the NIC and switch sides to stop the storm
(Figure: servers, ToRs 0-7, leaf and spine layers across Podset 0 and Podset 1, with one malfunctioning NIC)
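The watchdog idea above can be sketched as a simple counter over a time window; the window length and pause-frame limit are invented parameters, not the deployed values:

```python
# Sketch of a pause-frame watchdog: if a queue receives more PFC pause
# frames than a threshold within one time window, declare a storm and stop
# honoring pauses (trading losslessness for liveness of the rest of the
# network). Thresholds are illustrative assumptions.

WINDOW_MS, PAUSE_LIMIT = 100, 50  # illustrative watchdog parameters

class PauseWatchdog:
    def __init__(self):
        self.count = 0
        self.window_start = 0
        self.storm = False

    def on_pause_frame(self, now_ms):
        if now_ms - self.window_start >= WINDOW_MS:
            self.count, self.window_start = 0, now_ms  # start a new window
        self.count += 1
        if self.count > PAUSE_LIMIT:
            self.storm = True   # stop honoring pauses; traffic becomes lossy
        return self.storm

wd = PauseWatchdog()
for t in range(60):             # 60 pause frames within one 100ms window
    wd.on_pause_frame(t)
print(wd.storm)                 # True: the watchdog breaks the pause storm
```

Normal, occasional pauses never trip the limit; only a sustained storm does, which is why running the watchdog at both the NIC and the switch contains a single bad NIC.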

Slide30

The slow-receiver symptom

ToR to NIC is 40Gb/s; NIC to server is 64Gb/s (PCIe Gen3 8x8)
But NICs may generate a large number of PFC pause frames
Root cause: the NIC is resource constrained (MTT, WQEs, QPC state)
Mitigation:
Large page size for the MTT (memory translation table) entries
Dynamic buffer sharing at the ToR

Slide31

Deployment experiences and lessons learned

Slide32

Latency reduction

RoCEv2 deployed in Bing world-wide for two and a half years
Significant latency reduction
Incast problem solved, as there are no packet drops

Slide33

RDMA throughput

Using two podsets, each with 500+ servers
5Tb/s capacity between the two podsets
Achieved 3Tb/s inter-podset throughput
Bottlenecked by ECMP routing
Close to 0 CPU overhead

Slide34

Latency and throughput tradeoff

(Setup: ToRs T0, T1 under leaves L0, L1, with servers S0,0-S0,23 and S1,0-S1,23)
RDMA latencies (us) increase as data shuffling starts: before vs during data shuffling
Low latency vs high throughput

Slide35

Lessons learned

Providing losslessness is hard! Deadlock, livelock, and PFC pause frame propagation and storms did happen
Be prepared for the unexpected: configuration management, latency/availability, PFC pause frame, and RDMA traffic monitoring
NICs are the key to making RoCEv2 work

Slide36

What's next?

Slide37

Applications, Technologies, Architectures, Protocols:

RDMA for X (Search, Storage, HFT, DNN, etc.)
Lossy vs lossless networks
Practical, large-scale, deadlock-free, collateral-damage-free networks
RDMA programming
RDMA for heterogeneous computing systems
RDMA virtualization
Low latency, high bandwidth
RDMA security
Software vs hardware
Inter-DC RDMA

Slide38

Historically, software-based packet processing won (multiple times)

TCP processing overhead analysis by David Clark, et al.
None of the stateful TCP offloads took off (e.g., TCP Chimney)

The story is different this time:
Moore's law is ending
Accelerators are coming
Network speeds keep increasing
Demands for ultra-low latency are real

Will software win (again)?

Slide39

Is lossless mandatory for RDMA?

There is no binding between RDMA and lossless networks
But designing and implementing a more sophisticated transport protocol in hardware is a challenge

Slide40

RDMA virtualization for container networking

A router acts as a proxy for the containers
Shared memory for improved performance; zero copy is possible

Slide41

RDMA for DNN

Distributed DNN training iterations:
Forward propagation
Back propagation
Update gradients
Each iteration mixes computation and communication

Slide42

RDMA for DNN

(Figure: training runs as epochs Epoch0..EpochM, each split into minibatches Minibatch0..MinibatchN; for every minibatch, each GPU (GPU0, GPU1, ..., GPU X) alternates compute with communication, so data transmission interleaves with training)
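A toy timing model makes the figure's point concrete; all times and the tcp_like/rdma_like labels are invented for illustration:

```python
# Toy timing model for the minibatch pipeline above: if communication can
# overlap with the next minibatch's compute, the per-step cost is
# max(compute, comm) instead of their sum; faster networking shrinks the
# communication term directly. All times below are invented.

def epoch_time(n_minibatches, compute_ms, comm_ms, overlap):
    step = max(compute_ms, comm_ms) if overlap else compute_ms + comm_ms
    return n_minibatches * step

tcp_like = epoch_time(100, compute_ms=10, comm_ms=8, overlap=False)   # 1800 ms
rdma_like = epoch_time(100, compute_ms=10, comm_ms=2, overlap=False)  # 1200 ms
overlapped = epoch_time(100, compute_ms=10, comm_ms=8, overlap=True)  # 1000 ms
print(tcp_like, rdma_like, overlapped)
```

Once communication hides under compute, further network speedups stop helping, which is why both low latency and overlap matter for distributed training.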

Slide43

RDMA for DNN

TensorFlow: 1 PS, 2 workers, 100GbE RoCEv2
ResNet-50 benchmark: TCP vs RDMA

Slide44

RDMA programming

How many LOC for a "hello world" communication using RDMA?
For TCP, it is 60 LOC for client or server code
For RDMA, it is complicated …
IBVerbs: 600 LOC
RDMA CM: 300 LOC
Rsocket: 60 LOC
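For scale, the TCP baseline the slide counts can look like the sketch below; rsocket's appeal is that it preserves this familiar sockets API shape while running over RDMA. This is plain TCP over loopback, purely for comparison:

```python
# A minimal TCP "hello world": client sends a message, server echoes it.
# The equivalent rsocket program keeps this structure (rsocket mirrors the
# sockets API), which is why its LOC count matches TCP's on the slide.

import socket
import threading

def echo_once(listener):
    conn, _ = listener.accept()
    with conn:
        conn.sendall(conn.recv(64))        # echo the request back

srv = socket.socket()
srv.bind(("127.0.0.1", 0))                 # pick any free port
srv.listen(1)
threading.Thread(target=echo_once, args=(srv,), daemon=True).start()

cli = socket.socket()
cli.settimeout(5)
cli.connect(("127.0.0.1", srv.getsockname()[1]))
cli.sendall(b"hello, RDMA?")
reply = cli.recv(64).decode()
print(reply)                               # hello, RDMA?
cli.close()
srv.close()
```

The 10x LOC gap for raw IBVerbs comes from what this sketch never touches: registering memory regions, creating queue pairs and completion queues, and exchanging connection metadata by hand.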

Slide45

Make RDMA programming more accessible

Easy-to-set-up RDMA server and switch configurations
Can I run and debug my RDMA code on my desktop/laptop?
High-quality code samples
Better abstractions: in the Unix/Linux world everything is a file, but RDMA QP/CQ are not file handles …

Slide46

Summary

RDMA is experiencing a renaissance in data centers (RoCEv2)
Many opportunities and interesting problems in high-speed, low-latency RDMA networking
Many opportunities in making RDMA accessible to more developers

Slide47

Acknowledgements

Yan Cai, Gang Cheng, Zhong Deng, Daniel Firestone, Juncheng Gu, Shuihai Hu, Hongqiang Liu, Marina Lipshteyn, Ali Monfared, Jitendra Padhye, Gaurav Soni, Haitao Wu, Jianxi Ye, Yibo Zhu
Azure, Bing, CNTK, Philly collaborators
Arista Networks, Cisco, Dell, Mellanox partners
Colleagues in the Toutiao AI lab and networking teams

Slide48

We are hiring for both FTEs and interns

HR: rdus.staffing@bytedance.com
guochuanxiong@bytedance.com

Q&A