Presentation Transcript

1. DCTCP and DCQCN

2. How to read a systems/networking paper*
*Measurement papers excluded

3. I would have done this so differently, and so much better …

4. DCTCP
Presenting work of many teams at Microsoft

5.

6. Poor search performance due to TCP incast resulting from partition/aggregate pattern

7. [Diagram: a user asks Bing "Kamala Harris?"; the answer is assembled from web search, Wikipedia, news, tweets, …]

8. [Diagram: the query "Kamala Harris?" enters a top-level aggregator (TLA), which fans out to mid-level aggregators (MLAs) for Twitter, Web index, and Images, each of which fans out to its own workers]

9. [Same diagram: the workers' partial results flow back up through the MLAs to the TLA, which aggregates them into the answer (Wikipedia, news, tweets, …)]

10. [Diagram: an MLA and its workers (W) attached to a Top-of-Rack (ToR) router]

11. [Same diagram: the workers' responses all converge on the ToR router port toward the MLA]

12. [Diagram: the ASIC inside the ToR router, where the workers' (W) simultaneous responses to the MLA overflow the packet buffer and are dropped]

13. Why is this loss a problem?
Because responses are short (few KB)
Packet loss on short flows → TCP timeout
TCP minimum timeout (used to be) 300 ms on Windows …

14. Are you convinced that this is a real problem?

15.

16. Time is money
Return results "quickly", else no ad clicks
TLA and MLAs have tight latency budgets
Results from "late" workers are ignored
Poor-quality results: no ad clicks
The Cost of Latency – Perspectives (mvdirona.com)

17. In summary
Partition/aggregate → incast
Incast → packet loss
Loss → TCP timeouts
TCP timeouts → delayed/missing results
Delayed/missing results → loss of revenue

18. Do you agree that this problem is worth solving?

19.

20. How would you solve the problem?
Poor search performance due to TCP incast resulting from partition/aggregate pattern

21. Don't do scatter/gather
No other way to "scale horizontally"
More MLAs do not fundamentally fix incast
Caching etc. possible but not always feasible

22. Use more workers
Take the first "N" who finish
Often used as a strategy to deal with "stragglers"
Good idea if "straggling" is not correlated with load
Bad idea here: makes incast worse!

23. Schedule flows centrally to avoid collision
Proposed by MIT and Facebook
Does not scale

24. RTS/CTS to avoid collision
Request to send / clear to send
This is (partly) how WiFi works
Adds latency – trading median for tail
Used by Microsoft to beat the sorting record

25. Jitter to avoid collision
Workers add a random delay before responding
This is (partly) how the WiFi protocol works
Adds latency – trading median for tail
Original solution attempted at Bing

26. Bigger buffers at routers
Router buffer has to support line rates
64 × 100G → 6.4 Tbps read/write bandwidth!
Very expensive and power hungry!
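
The slide's arithmetic, made explicit: every bit arriving at line rate must be written into the shared buffer and later read back out, so the buffer memory needs the aggregate port bandwidth in each direction. A quick sanity check of the 64 × 100G figure:

```python
# Sanity check of the slide's figure: 64 ports at 100 Gbps each.
ports, port_speed_gbps = 64, 100
aggregate_tbps = ports * port_speed_gbps / 1000
print(f"{aggregate_tbps:.1f} Tbps of buffer writes at line rate, "
      f"plus the same again in reads")   # -> 6.4 Tbps each way
```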

27. Reduce TCP timeout
Microsecond timers can be CPU expensive
Does not avoid drops!
Must retransmit dropped packets
Proposed by CMU

28. What else can you think of?

29.

30. Prior work oversimplified the problem by focusing only on short flows

31. Database held by workers is continuously being refreshed
Long-lived, throughput-sensitive flows
These flows also eat up switch buffer

32. DCTCP key insights
Congestion from two sources:
  Long flows: throughput
  Short query flows: latency
Convey congestion as soon as possible
  Long flows react quickly
  Short flows have a chance to react
Maintain persistently low buffer occupancy

33. DCTCP at a glance (see paper for more details)
Congestion Point (CP), the router: if qlen >= K, mark ECN on data packets
Notification Point (NP), the receiver: echo the ECN marks back to the RP in ACKs
Reaction Point (RP), the sender: estimate the fraction of marked packets and adjust cwnd accordingly
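
As a concrete illustration of the RP step, here is a minimal, non-authoritative Python sketch of the sender-side reaction the DCTCP paper describes: keep a running estimate alpha of the fraction of ECN-marked packets and cut cwnd by alpha/2, so light congestion gives a gentle reduction rather than TCP's full halving. Class and method names are invented for this sketch; g = 1/16 is a commonly cited default.

```python
class DctcpSender:
    """Sketch of DCTCP's sender-side window adjustment (loss recovery,
    slow start, and the rest of the transport machinery are omitted)."""

    def __init__(self, cwnd_pkts=10, g=1.0 / 16):
        self.cwnd = float(cwnd_pkts)  # congestion window, in packets
        self.alpha = 0.0              # estimate of the fraction of marked packets
        self.g = g                    # EWMA gain

    def on_window_acked(self, acked_pkts, ecn_marked_pkts):
        """Called once per window of data, using the ECN echoes in the ACKs."""
        frac_marked = ecn_marked_pkts / max(acked_pkts, 1)
        self.alpha = (1 - self.g) * self.alpha + self.g * frac_marked
        if ecn_marked_pkts > 0:
            # Scale the window by the *extent* of congestion, not a fixed 1/2.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1            # ordinary additive increase


# Example: persistent light marking shrinks cwnd gently rather than halving it.
s = DctcpSender(cwnd_pkts=100)
for _ in range(5):
    s.on_window_acked(acked_pkts=100, ecn_marked_pkts=10)
print(round(s.alpha, 3), round(s.cwnd, 1))
```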

34. What would you have done differently?

35.

36. Results summary (see paper for details)
Comparison to TCP w/o RED and w/ RED
High throughput for long flows, short FCT for query flows
Lower queue length
Better/comparable convergence and fairness

37. Used by …

38.

39. DCTCP was about Bing. We will now switch to Azure.

40. Forget everything about DCTCP.

41. RDMA: Remote Direct Memory Access
Presenting work of many teams at Microsoft

42.

43. High-throughput, low-latency disk I/O with minimal CPU overhead

44. How does Azure make money?
We buy CPUs from Intel and AMD in bulk
We rent them out (VMs)
Minimize cores spent on non-chargeable tasks

45. Public cloud VMs
VMs "run" on compute servers
  Lots of cores, lots of RAM, and some flash
Mount disks from dedicated storage servers
  Few cores, some RAM, and a ton of flash drives
Allows for better resource management, fault tolerance, etc.
A wee bit of disaggregated computing

46. Disk I/O traffic
"East-west"
Dominant traffic in data centers (bytes, packets, flows, …)
Affects performance of databases, Spark clusters, …
We own "both ends"

47. "Write the block of data to remote disk y, address z"
TCP: waste of CPU, high latency due to CPU processing
[Diagram: the write traverses both hosts' TCP/IP stacks (post, packetize, stream, copy, indicate, interrupt), consuming CPU on sender and receiver]

48. Reducing CPU usage for disk I/O
More cores left to sell on compute VMs
Can buy fewer/cheaper storage servers
Millions of dollars to our bottom line
Better performance for customers

49. Do you agree that this is a real problem, worth solving?

50.

51. Nope … not when we started
Public clouds were (still) relatively new
The scale, the cost equations, and customer expectations were rapidly evolving
Somewhat akin to research in DC fabric design
Lots of recent work, some of which we will cover

52. How would you solve the problem?
High-throughput, low-latency disk I/O with minimal CPU overhead

53. Improve TCP performance
Stack optimizations
Large Send Offload
Not enough – latency due to context switches remains

54. Some other transport protocol
We do own both ends …
Congestion control etc. is not the key issue
If it burns CPUs, same problem

55. DPDK
Low latency: yes. Low CPU: most definitely not
If cores become cheap, this may be the way to go

56.

57. "Write the block of data to remote disk y, address z"
TCP: waste of CPU, high latency due to CPU processing
RDMA: the NIC handles all transfers via DMA
[Diagram: "write local buffer at address A to remote buffer at address B"; the NICs DMA the data directly between the two applications' memories and buffer B is filled without CPU involvement]
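
To make the contrast with the TCP path concrete, here is a toy Python model of what a one-sided RDMA write does; the names and structure are invented for illustration and are not the verbs API. The point is that bytes move between pre-registered memory regions without the remote CPU or application running.

```python
class Host:
    """Toy host: 'memory' stands in for a pre-registered RDMA buffer."""
    def __init__(self, name, mem_size):
        self.name = name
        self.memory = bytearray(mem_size)


def rdma_write(src, src_addr, dst, dst_addr, length):
    """Model of NIC-to-NIC DMA: copy bytes without involving dst's CPU."""
    dst.memory[dst_addr:dst_addr + length] = src.memory[src_addr:src_addr + length]


compute = Host("compute-server", 4096)
storage = Host("storage-server", 4096)
compute.memory[0:11] = b"hello block"
rdma_write(compute, 0, storage, 128, 11)   # "write local buffer A to remote buffer B"
print(bytes(storage.memory[128:139]))      # b'hello block'
```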

58. Microbenchmarks
[Plots: high throughput without CPU overhead; low latency]

59. Do you agree that RDMA is the way to go?

60. But …
Prior RDMA deployments: small scale – O(1K) servers only
InfiniBand: not compatible with Ethernet/IP
How to run RDMA over standard Ethernet/IP at datacenter scale?

61. Challenges
Mapping IB to Ethernet/IP
Scaled-out lossless fabric
Congestion control
Switch buffer management
Deadlocks
…

62. RoCEv2: Routable RDMA over Converged Ethernet
RoCEv2 packet format: Ethernet | IP | UDP | InfiniBand L4 | Payload
Very efficient, but requires a lossless* fabric for optimal performance
*Lossless: no congestion packet drops
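
For a feel of why the format is "very efficient", a rough back-of-the-envelope of the per-packet header cost, using typical header sizes (Ethernet 14 B, IPv4 20 B, UDP 8 B, InfiniBand transport header 12 B, ICRC 4 B); the 4 KB payload is just an example size:

```python
# Rough per-packet overhead for RoCEv2 (RDMA carried in UDP, destination port 4791).
headers_bytes = {"Ethernet": 14, "IPv4": 20, "UDP": 8, "IB transport": 12, "ICRC": 4}
payload_bytes = 4096                     # example payload size
overhead = sum(headers_bytes.values())   # 58 bytes
print(f"{overhead} B of headers per packet, "
      f"{overhead / (overhead + payload_bytes):.1%} of the bytes on the wire")
```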

63. Why lossless fabric
Fast start – reduce short-flow latency in common cases
No retransmission due to congestion drop – reduce tail latency, simplifies the transport protocol
ACK consolidation – improve efficiency

64. Lossless Ethernet
RoCEv2 packet format: Ethernet | IP | UDP | InfiniBand L4 | Payload
Needs a lossless fabric for fast start, no retransmission, …
Priority-based Flow Control (PFC), IEEE 802.1Qbb → lossless Ethernet

65. Priority-based Flow Control (PFC)
PAUSE the upstream neighbor when queue size reaches the PFC threshold
No congestion packet drops
PFC is troublesome in large-scale deployments
[Diagram: congestion fills a switch queue; when it reaches the PFC PAUSE threshold (3 in the example), a PAUSE is sent upstream]
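
A minimal sketch of the per-priority PAUSE logic the slide describes, with made-up thresholds and callback hooks; real PFC (IEEE 802.1Qbb) pauses in units of quanta and must reserve headroom for packets already in flight.

```python
class PfcIngressQueue:
    """Toy model of one priority's ingress queue on a PFC-enabled switch port."""

    def __init__(self, priority, pause_threshold_pkts=3, resume_threshold_pkts=1):
        self.priority = priority
        self.pause_threshold = pause_threshold_pkts
        self.resume_threshold = resume_threshold_pkts
        self.queue = []
        self.paused_upstream = False

    def enqueue(self, pkt, send_pause):
        self.queue.append(pkt)                 # never dropped due to congestion
        if len(self.queue) >= self.pause_threshold and not self.paused_upstream:
            send_pause(self.priority)          # stop the upstream hop for this priority
            self.paused_upstream = True        # note: per port/priority, not per flow

    def dequeue(self, send_resume):
        pkt = self.queue.pop(0) if self.queue else None
        if self.paused_upstream and len(self.queue) <= self.resume_threshold:
            send_resume(self.priority)         # let the upstream hop transmit again
            self.paused_upstream = False
        return pkt


# The third enqueue hits the threshold and pauses the upstream neighbor.
q = PfcIngressQueue(priority=3)
for i in range(4):
    q.enqueue(f"pkt{i}", send_pause=lambda prio: print(f"PAUSE priority {prio} upstream"))
```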

66. PFC problems
Need headroom reservation
  Enough to buffer all packets in flight
  Lots of buffer needed if the link is long
Deadlock (A -> B -> C -> B -> A)
  Circular buffer dependency
Victim flows …

67. Victim flow problem
Cascading link PAUSEing
PFC operates on port level, instead of flow level
The red flow is victimized
[Diagram and bar chart: hosts H1-H3 and S send through a small switch topology to receivers R and D; flows sharing the congested port get 40/3 = 13.3 Gbps each, the victim flow's ideal throughput is 40 - 13.3 = 26.6 Gbps, and the chart shows measured throughput (Gbps) for H1, H2, H3]
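
The slide's numbers, reproduced under the natural reading of the figure and assuming 40 Gbps links: three flows congest one 40 Gbps port and each gets its fair share, while the victim flow shares only an upstream link with one of them and should ideally be able to use the remainder of that link.

```python
link_gbps = 40
congested_share = link_gbps / 3              # ~13.3 Gbps per flow at the hot port
victim_ideal = link_gbps - congested_share   # ~26.6 Gbps left for the victim flow
print(f"{congested_share:.1f} Gbps per congested flow, "
      f"{victim_ideal:.1f} Gbps ideal for the victim flow")
```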

68. Possible solutions
Don't use PFC
  We'll come back to this later if we have time
Separate priorities for each flow
  Not scalable

69. Our solution: PFC as a last resort
Keep PFC
Use per-flow E2E congestion control
Per-flow CC reduces PFC generation, BUT we still get the lossless benefits: fast start, no retransmission, …

70. What kind of CC do we need?
Must work with commodity hardware
Must be configurable to minimize PFC generation
Can't rely on packet drops
Easy to implement in hardware

71. If you can't rely on packet drops …
ECN
Delay
Explicit feedback from routers

72. Forget everything about DCTCP. … Remember DCTCP?

73. DCQCN at a glance (see paper for more details)
Congestion Point (CP), the router: mark ECN; Kmin, Kmax and Pmax adjusted to ensure ECN fires before PFC
Notification Point (NP), the receiver: echo ECN to the RP via CNP
Reaction Point (RP), the sender: count CNPs and adjust the sending rate accordingly
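
As with DCTCP above, a minimal, non-authoritative Python sketch of the RP side: cut the sending rate in proportion to alpha when a CNP arrives, then recover by repeatedly moving halfway back toward a target rate (fast recovery) before raising the target additively. The constants and method names are illustrative; the paper's separate alpha/rate timers, byte counters, and hyper-increase stage are omitted.

```python
class DcqcnReactionPoint:
    """Sketch of DCQCN sender-side (RP) rate control."""

    def __init__(self, line_rate_gbps=40.0, g=1.0 / 16, rate_ai_gbps=0.5):
        self.rc = line_rate_gbps      # current sending rate
        self.rt = line_rate_gbps      # target rate to recover toward
        self.alpha = 1.0              # congestion estimate
        self.g = g
        self.rate_ai = rate_ai_gbps   # additive-increase step
        self.increase_iters = 0

    def on_cnp(self):
        """A CNP arrived: remember the old rate as the target, then cut."""
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.rt = self.rc
        self.rc = self.rc * (1 - self.alpha / 2)
        self.increase_iters = 0

    def on_increase_timer(self):
        """No recent CNP: decay alpha and climb back toward the target rate."""
        self.alpha = (1 - self.g) * self.alpha
        self.increase_iters += 1
        if self.increase_iters > 5:            # after fast recovery, raise the target
            self.rt += self.rate_ai
        self.rc = (self.rt + self.rc) / 2      # move halfway toward the target


# Example: one congestion notification, then gradual recovery.
rp = DcqcnReactionPoint()
rp.on_cnp()
for _ in range(8):
    rp.on_increase_timer()
print(round(rp.rc, 1), round(rp.rt, 1))
```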

74. Just to impress you with our math skills …

75. In summary
Azure makes money by renting CPUs
TCP wastes CPU
Hence the need for RDMA
RDMA needs PFC
PFC causes problems
Hence DCQCN

76. What would you have done?

77.

78. Results summary
DCQCN reduces PFCs by a factor of 1000
Unfairness, deadlock, etc. not seen
The Azure RDMA deployment would have been impossible without DCQCN
67% of Azure traffic is on RDMA: O(10^18) bytes/day
TCP is a distant second
The trend will accelerate

79. Finally
(We believe that) Azure has the world's largest RDMA deployment
It took a lot more than just DCQCN to make it work
See our papers for more … (and more will be written)
Lots of ongoing work: customized RDMA, long-haul, many other domains, RDMA to VMs, …

80.

81. Questions?
We are hiring …
H/T Sarah McClure for comments and editing