Understanding Congestion in High Performance Interconnection Networks Using Sampling


SC'19 · Philip A. Taffet, John Mellor-Crummey · Rice University · November 20, 2019



Presentation Transcript

1. Understanding Congestion in High Performance Interconnection Networks Using Sampling
SC'19
Philip A. Taffet, John Mellor-Crummey
Rice University
November 20th, 2019

2. Motivation
Image source: www.caps.ou.edu

3. Motivation

4.
compute_atmosphere_temp();        % Runtime:  5%
compute_ground_temp();                        3%
compute_wind_speed();                        11%
move_clouds();                                9%
send_boundary_to_neighbors();                42%  Why?

5. High level goal
- Enable tools that help developers analyze and tune the communication performance of their applications
- Capture information about where, when, and why congestion is occurring

6. State of the art in communication performance analysis tools
- INAM² combines normal InfiniBand switch counters, MPI process-level counters, and static route information [Subramoni, High Performance Computing 2016]
- Correlate per-packet stage latencies, events on switches, and calling contexts based on an ID stored in the packet [Yoga, Perf. Modeling, Benchmarking and Simulation of HPC Systems, 2017]

7. Weaknesses of the current state of the art
- Centralized collection server; privileged system access
- INAM: traffic and congestion not associated with program context
- Yoga & Chabbi: collected information is O(path length), not O(1)
- Node-level tools: traffic and congestion information not associated with network topology

8. High level goal
- Enable tools that help developers analyze and tune the communication performance of their applications
- Capture information about where, when, and why congestion is occurring
  - Sample the links through which a packet passes
  - Collect a small amount of performance information about one link
  - Store performance information in the packet itself

9. Outline
- Background and motivation
- Reservoir sampling for measuring network traffic and congestion
- Low-overhead variant using compact probabilistic encoding
- Diagnosing communication performance issues
- Case Study: miniGhost
- Conclusions and future work

10. Reservoir sampling
- Collect a uniform random sample of size k from a sequence of arbitrary length [Knuth, TAOCP, 1981]
- Sequence might be too large to store; collect the sample on the fly as the sequence streams by
[Diagram: elements Red, Blue, Green, Black, Orange, Purple stream past a reservoir]

11. Reservoir sampling
Step 1: Fill the reservoir with the first k elements [Knuth, TAOCP, 1981]
[Animation frame: elements Red, Blue, Green, Black, Orange, Purple; the first elements fill the reservoir]

12. Reservoir sampling
Step 1 (continued): Fill the reservoir with the first k elements [Knuth, TAOCP, 1981]
[Animation frame: the next element fills the reservoir]

13. Reservoir sampling
Step 2: The i-th record enters the reservoir with probability k/i
Step 2a: If it will be stored in the reservoir, pick an element to replace uniformly at random [Knuth, TAOCP, 1981]
[Animation frame] Step 2: YES · Step 2a: Replace 2nd element

14. Reservoir sampling
Step 2: The i-th record enters the reservoir with probability k/i
Step 2a: If it will be stored in the reservoir, pick an element to replace uniformly at random [Knuth, TAOCP, 1981]
[Animation frame] Step 2: NO · Step 2a: (no replacement)

15. Reservoir sampling
Step 2: The i-th record enters the reservoir with probability k/i
Step 2a: If it will be stored in the reservoir, pick an element to replace uniformly at random [Knuth, TAOCP, 1981]
[Animation frame] Step 2: YES · Step 2a: Replace 1st element

16. Reservoir sampling
Step 2: The i-th record enters the reservoir with probability k/i
Step 2a: If it will be stored in the reservoir, pick an element to replace uniformly at random [Knuth, TAOCP, 1981]
[Animation frame]

17. Reservoir sampling
Step 3: Return the final reservoir contents [Knuth, TAOCP, 1981]
[Animation frame: final reservoir contents]
We employ reservoir sampling with a reservoir of size 1.
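Specialized to the size-1 reservoir used in this work, one update per element suffices. A minimal software sketch (illustrative only; the paper performs this update in switch hardware, one iteration per hop):

```python
import random

def reservoir_sample_one(stream):
    """Uniform random sample of size 1 from a stream of unknown length.

    The i-th element (1-indexed) replaces the reservoir contents with
    probability 1/i, so every element of an n-element stream ends up
    in the final reservoir with probability exactly 1/n.
    """
    reservoir = None
    for i, item in enumerate(stream, start=1):
        if random.randrange(i) == 0:  # true with probability 1/i
            reservoir = item
    return reservoir

# Each switch on a packet's path would apply one iteration of this
# update to the tiny reservoir carried in the packet header.
links = ["Red", "Blue", "Green", "Black", "Orange", "Purple"]
sampled = reservoir_sample_one(links)
```

Because each switch only needs the current hop count and a random bit source, the per-hop work is O(1) regardless of path length.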

18. Reservoir sampling for measuring network traffic and congestion
[Diagram: packet travels from Compute Node 0 via Leaf Switch 1, Switch 2, and Leaf Switch 3 to Compute Node 99]
Packet: source: 0, destination: 99, sequence #, other headers
Traffic reservoir: link: —, hop count: 1, congested?: No

19. Reservoir sampling for measuring network traffic and congestion
[Diagram: same path, one hop later]
Packet: source: 0, destination: 99, sequence #, other headers
Traffic reservoir: link: —, hop count: 2, congested?: No

20. Reservoir sampling for measuring network traffic and congestion
[Diagram: same path, one hop later]
Packet: source: 0, destination: 99, sequence #, other headers
Traffic reservoir: link: —, hop count: 3, congested?: Yes

21. Reservoir sampling for measuring network traffic and congestion
[Diagram: same path, final hop]
Packet: source: 0, destination: 99, sequence #, other headers
Traffic reservoir: link: —, hop count: 4, congested?: Yes

22. Reservoir sampling for measuring network traffic and congestion
[Diagram: packet arrives at Compute Node 99]
Packet: source: 0, destination: 99, sequence #, other headers
Traffic reservoir: link: —, hop count: 4, congested?: Yes

23. Reservoir sampling for measuring network traffic and congestion
[Diagram: packet delivered to Compute Node 99]
Traffic reservoir: link: —, hop count: 4, congested?: Yes
Table stored on Compute Node 99:
Link | Weighted Traffic Count | Weighted Congested Count | Congested fraction
(all entries initially 0)

24. Reservoir sampling for measuring network traffic and congestion
[Diagram: 1000 packets (X 1000) flow from Compute Node 0 to Compute Node 99]
Table stored on Compute Node 99:
Link | Weighted Traffic Count | Weighted Congested Count | Congested fraction
(counts accumulate as the 1000 packets arrive)

25. Why are these measurements useful?
1. Weighted traffic count: how many packets traversed each link
2. Congested fraction: a measure of link congestion
3. Recording measurement data in each packet lets us associate link data with:
   - Program context (e.g. MPI call and calling context)
   - MPI tag
   - Collective vs. point-to-point
Polling switch counters gives 1 and 2 but not 3.
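One plausible receiver-side update consistent with the weighted counts on the previous slides (a sketch, assuming each sample is weighted by the packet's hop count to compensate for the 1-in-hop-count sampling probability; the link name is hypothetical):

```python
from collections import defaultdict

# table maps link -> (weighted traffic count, weighted congested count)
table = defaultdict(lambda: (0.0, 0.0))

def record_arrival(table, link, hop_count, congested):
    """Receiver-side update when a monitored packet arrives.

    With a size-1 reservoir over hop_count links, each link on the
    path was sampled with probability 1/hop_count, so weighting the
    sample by hop_count keeps the traffic estimate unbiased.
    """
    traffic, cong = table[link]
    table[link] = (traffic + hop_count,
                   cong + (hop_count if congested else 0))

def congested_fraction(table, link):
    """Fraction of (weighted) traffic on this link that saw congestion."""
    traffic, cong = table[link]
    return cong / traffic if traffic else 0.0
```

For example, four arrivals on a 4-hop path, two of them congested, yield a weighted traffic count of 16, a weighted congested count of 8, and a congested fraction of 0.5.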

26. Outline
- Background and motivation
- Reservoir sampling for measuring network traffic and congestion
- Low-overhead variant using compact probabilistic encoding
- Diagnosing communication performance issues
- Case Study: miniGhost
- Conclusions and future work

27. Compact probabilistic encoding
- Problem: Storing a link ID takes too many bits to fit in the packet header
- Solution: Probabilistic encoding scheme inspired by work on detecting memory leaks [Bond, ASPLOS 2006]
- Let H be a hash function that returns 1 bit
- Instead of storing a link ID, store H(packet ID, link ID) in the packet
- Upon receipt of the packet, test each potential link ID:
  - Compute H(packet ID, link ID)
  - If it matches the hash value stored in the packet, increase the count for that link ID
  - If it doesn't, decrease the count
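A toy version of this encode/decode loop (the hash below is a simple parity-of-AND stand-in rather than the exact finite-field hash from the paper, and the link IDs are illustrative):

```python
import random

def parity(x: int) -> int:
    """Parity of the popcount of x."""
    return bin(x).count("1") & 1

def h(packet_id: int, link_id: int) -> int:
    """1-bit hash: parity of (packet_id AND link_id).

    A bilinear stand-in for the paper's finite-field hash:
    h(p, a) ^ h(p, b) == parity(p & (a ^ b)), which is balanced over
    random p whenever a != b, so mismatching links cancel on average.
    """
    return parity(packet_id & link_id)

def decode(received, all_links):
    """received: list of (packet_id, stored_hash_bit) pairs.

    For each candidate link: +1 on a hash match, -1 on a mismatch.
    True samples always match; for every other link the +1s and -1s
    cancel in expectation, leaving roughly the true sample count.
    """
    counts = {link: 0 for link in all_links}
    for packet_id, bit in received:
        for link in all_links:
            counts[link] += 1 if h(packet_id, link) == bit else -1
    return counts
```

Note that the hash must mix the packet ID and link ID together; any hash of the form f(packet ID) XOR g(link ID) would make every packet match (or mismatch) a candidate link deterministically, destroying the cancellation.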

28. Reservoir sampling for measuring network traffic and congestion
[Diagram: packet travels from Compute Node 0 toward Compute Node 99]
Packet: src: 0, dest: 99, packet ID: 0xCAFE, other headers
Traffic reservoir: link hash: 0, hop count: 1, congested?: No

29. Reservoir sampling for measuring network traffic and congestion
[Diagram: same path, one hop later]
Packet: src: 0, dest: 99, packet ID: 0xCAFE, other headers
Traffic reservoir: link hash: 0, hop count: 2, congested?: No (hash not computed)

30. Reservoir sampling for measuring network traffic and congestion
[Diagram: same path, one hop later]
Packet: src: 0, dest: 99, packet ID: 0xCAFE, other headers
Traffic reservoir: link hash: 1, hop count: 3, congested?: Yes

31. Reservoir sampling for measuring network traffic and congestion
[Diagram: same path, final hop]
Packet: src: 0, dest: 99, packet ID: 0xCAFE, other headers
Traffic reservoir: link hash: 1, hop count: 4, congested?: Yes (hash not computed)

32. Reservoir sampling for measuring network traffic and congestion
[Diagram: packet arrives at Compute Node 99]
Packet: src: 0, dest: 99, packet ID: 0xCAFE, other headers
Traffic reservoir: link hash: 1, hop count: 4, congested?: Yes (hash not computed)

33. Reservoir sampling for measuring network traffic and congestion
[Diagram: packet delivered to Compute Node 99]
Packet: src: 0, dest: 99, packet ID: 0xCAFE, other headers
Traffic reservoir: link hash: 1, hop count: 4, congested?: Yes
Table stored on Compute Node 99:
Link | Weighted Traffic Count | Weighted Congested Count | Congested fraction
(node 99 tests each candidate link's hash against the stored bit and adjusts its counts)

34. Reservoir sampling for measuring network traffic and congestion
[Diagram: 1000 packets (X 1000) flow from Compute Node 0 to Compute Node 99]
Table stored on Compute Node 99:
Link | Weighted Traffic Count | Weighted Congested Count | Congested fraction
(counts accumulate as the 1000 packets arrive)

35. Statistical intuition: true negatives cancel false positives
- Let's look at the black link
- Of the 1000 packets, we expect 250 of them to arrive containing the hash created by the computation H(packet ID, black link ID)
  - These packets would have contained the black link's ID if we weren't using probabilistic encoding
- Of the other 750 packets, we expect half to be false positives (+1) and half to be true negatives (-1), so their contributions cancel and the expected count for the black link is 250
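The arithmetic on this slide, worked through for a 4-hop path (the final weighting-by-path-length step follows from the unbiasedness argument rather than from a formula quoted on the slide):

```python
packets = 1000      # all traverse the 4-hop path containing the black link
path_len = 4

true_samples = packets // path_len   # 250 packets sampled the black link
others = packets - true_samples      # 750 sampled one of the other three hops

# Among the 750 non-samples, the 1-bit hash matches by chance half the time:
false_positives = others // 2        # each contributes +1 to the black link
true_negatives = others // 2         # each contributes -1

net_count = true_samples + false_positives - true_negatives
# net_count == 250: the cancellation leaves only the true samples

# Weighting by path length recovers the traffic on the black link:
estimated_traffic = net_count * path_len
# estimated_traffic == 1000, matching the 1000 packets that crossed it
```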

36. Definition of H
Requirements:
- Must be very cheap to compute
- Should behave like a pseudo-random function
  - Returns 0 for half of inputs
  - Uncorrelated when one parameter is fixed
Approach:
- We use multiplication in a finite field
- Proofs that our H satisfies these properties are short
- With a small amount of pre-computation, computing H is 1 AND and a few XORs
- In software, using CPU vector instructions, testing a link takes ~1 cycle
- Embarrassingly parallel; very amenable to hardware acceleration

37. Visualizing H
[Figure: bit patterns of H(14, 1..1024) and H(15, 1..1024)]

38. Same benefits as the non-probabilistic approach
- Trades higher variance for fewer bits in each packet
  - 5-6 bits inside the packet header suffice
- Same benefits from storing performance data inside the packet
- See the paper for analysis of error and convergence

39. Outline
- Background and motivation
- Reservoir sampling for measuring network traffic and congestion
- Low-overhead variant using compact probabilistic encoding
- Diagnosing communication performance issues
- Case Study: miniGhost
- Conclusions and future work

40. Analyzing communication performance with congested fraction plots
- Overlay measurements on the network topology
- Top half of an edge corresponds to the down-pointing link
- Thicker = more traffic
- Darker = more congested
- More, darker black is worse
- Links to compute nodes shown as boxes

41. Diagnosing communication performance problems with congested fraction plots
- Congestion often backs up, forming tree-like patterns; we look for congestion roots
- Congestion rooted in the interior of the network
  - May be a problem with the mapping onto the physical network topology
  - Try optimizing the mapping
- Congestion rooted at an endpoint
  - Communication pattern problem; code changes likely needed
- Dark and thin links in the congested fraction plot
  - External interference from background traffic

42. Outline
- Background and motivation
- Reservoir sampling for measuring network traffic and congestion
- Low-overhead variant using compact probabilistic encoding
- Diagnosing communication performance issues
- Case Study: miniGhost
- Conclusions and future work

43. Experimental setup overview
- Experiments are simulated with the TraceR/CODES network simulator
- Network:
  - 100 Gbps 2:1 tapered fat tree
  - Based on Quartz at LLNL (#71 on TOP500)
  - Adaptive routing
- Application:
  - miniGhost: a stencil proxy app
  - 27-point, 3D stencil
  - Simulating 27k MPI ranks = 24 full leaf switches
- Additional case study with pF3D is in the paper

44. miniGhost congested fraction plot: default configuration
[Figure]

45. [miniGhost congested fraction plot, annotated]
- All links are fairly thick: lots of traffic on all links
- Up-pointing links are heavily congested
- Down-pointing links show less congestion
- Root of the congestion tree is in the interior
- Interpretation rules suggest trying a new mapping
(Slides 45-50 step through these annotations one at a time on the same plot.)


51. miniGhost congested fraction plot: linear mapping
[Figure]

52. What's wrong with a linear mapping?
[Figure: congested fraction plots split into the x, y, and z communication phases]
- Packet-centric data collection enables us to split into phases based on MPI tag
- Phases can (and do) overlap in time
- We need a mapping that balances locality in all three dimensions
- Try mapping with geometric tiling

53. Geometric tiling mapping is better
[Figure: congested fraction plots for the x, y, and z communication phases]
- Now all phases have some congestion
- The worst phase is much better than the worst phase in the linear mapping
- 2.9x reduction in time spent communicating vs. the original mapping
- 4.8% overall performance improvement

54. But there's still a lot of congestion
- No longer clear congestion trees rooted in the interior
- There appears to be some scattered congestion rooted at endpoints
- Pattern appears somewhat regular, but complicated
- If we consider the mapping, the roots form a pattern
[Figure: the congestion roots form a plane in the logical domain; heaviest congestion annotated]

55. Boundary conditions causing congestion
miniGhost pseudocode:
    if I have a +x neighbor: send boundary values to the +x neighbor
    if I have a -x neighbor: send boundary values to the -x neighbor
    etc.
A node on the domain boundary skips its missing neighbor and moves on early, so two nodes end up simultaneously trying to send to the same destination nodes
[Diagram]

56. Can we fix this?
- This is a problem because the point-to-point communication is preceded almost immediately by a global Allreduce
- Two possible solutions:
  - Wait instead of skipping non-existent neighbors
  - Introduce asynchrony: split long communication and compute phases into multiple smaller phases, and only synchronize at the end

57. Modified communication pattern reduces congestion
[Figure: congested fraction plots for three variants: original; small phases, sync after each phase; small phases, sync after each timestep]
- 60% reduction in communication time
- 1.9% overall performance improvement

58. miniGhost summary
- Measurements from reservoir sampling gave us insights to improve performance
- Congestion rooted in the network interior indicated a mapping problem
- Splitting congested fraction plots by phase showed congestion imbalance between phases
- Using a block mapping balanced locality and distributed the congestion better among phases
- Congestion rooted at endpoints indicated a latent communication pattern problem for nodes on the domain boundary
- Overall improvement: nearly 5x reduction in communication time

59. Outline
- Background and motivation
- Reservoir sampling for measuring network traffic and congestion
- Low-overhead variant using compact probabilistic encoding
- Diagnosing communication performance issues
- Case Study: miniGhost
- Conclusions and future work

60. Conclusions
- Using reservoir sampling, we can collect information about the path a packet takes and any congestion it encounters
  - Measurements are useful, effective, and cheap
- This information helps application developers understand communication performance in the context of the network's physical topology and of their application
  - Diagnosis based on information collected with reservoir sampling explained the problem and guided us to a fix
- A probabilistic variant dramatically reduces the data added to a monitored packet while still delivering useful measurements
- Requires modifications to networking hardware, but we've shown they are simple and fast [Taffet, Mellor-Crummey HOTI'19]

61. Future work
- Analyze communication performance of other applications
- Automate diagnosis
- Test with programmable Ethernet switches
- Explore tradeoffs to reduce the number of hashes required

62. Contributions
- New packet-centric approach to measuring congestion using reservoir sampling
  - Captures where in the network congestion is occurring
  - Captures why in the application congestion is occurring
- Distinguishes three types of problems with different fixes:
  - External interference (even without monitoring other jobs)
  - Problems with the logical-to-physical mapping
  - Problems with the communication pattern
- Probabilistic scheme for a more practical implementation
  - Hash function

63. Mapping to the InfiniBand packet header
- 1 bit: hash of a link ID selected with reservoir sampling
- 1 bit: was that link congested?
- 3-4 bits: current hop count, incremented using saturating arithmetic
- Fits in 6 reserved bits just after the FECN bits, not protected by the invariant CRC
- Packet Sequence Number used as the packet ID
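A sketch of one possible packing of these fields into the 6 reserved bits (the bit layout and the 4-bit hop-count width are assumptions for illustration, not the exact InfiniBand encoding):

```python
def pack(hash_bit: int, congested: int, hop_count: int) -> int:
    """Pack the monitoring fields into 6 bits: 1 hash, 1 congested, 4 hops."""
    hops = min(hop_count, 15)  # saturating: a 4-bit counter sticks at 15
    return ((hash_bit & 1) << 5) | ((congested & 1) << 4) | hops

def unpack(bits: int):
    """Return (hash_bit, congested, hop_count) from the 6-bit field."""
    return (bits >> 5) & 1, (bits >> 4) & 1, bits & 0xF

def increment_hops(bits: int) -> int:
    """Per-hop update: saturating increment of the hop-count subfield."""
    hash_bit, congested, hops = unpack(bits)
    return pack(hash_bit, congested, hops + 1)
```

Saturating arithmetic matters here: a wrapped hop count would silently corrupt the sampling weights, while a saturated one merely caps the path lengths the scheme can distinguish.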

64. Case Study I: pF3D
- pF3D is a multiphysics code used to simulate laser-plasma interactions at NIF
- Communication dominated by 2D FFTs in x-y planes
Image source: Langer, Bhatele, Still 2014

65. Case Study I: pF3D
- Under heavy background congestion, pF3D's parallel FFTs run 24% slower
  - The section takes 24.6s with no congestion and 30.4s with heavy background congestion
- Can we understand why? Can we improve the performance?

66. [Figure: pF3D congested fraction plot]

67. Insights from the congested fraction plot
- Congestion must be due to background traffic
  - Lots of dark and thin links: high congested fraction but low traffic
  - pF3D drives an average of 2.0 Gbps per link
- Reservoir sampling lets us implicitly detect traffic from other jobs on the network without needing to bug the sysadmins

68. Insights from the congested fraction plot (continued)
- (Same observations as slide 67)
- Can we make pF3D more robust to external interference?

69. Looking for a second diagnosis
[Figure: links 1-4 with the root of a congestion tree marked]
Link | Congested fraction
1    | 0.86
2    | 1.00
3    | 0.94
4    | 0.06

70. Looking for a second diagnosis
- Root is in the network interior
- Interpretation rules suggest trying to solve congestion with the mapping
- In particular, we need a mapping that reduces traffic entering the circled switches
[Figure: links 1-4 with the root of a congestion tree marked]

71. Designing a new mapping
- Size of an x-y plane is 144 MPI ranks = 4 nodes
- Most communication happens in 4-node groups
- By shifting the node numbering, we keep FFT traffic off congested links
[Figure: original vs. shifted mapping across leaf switches LS 0-LS 5; each rectangle represents 4 nodes]

72. Performance improvement
- 16% performance improvement with the same allocation from the scheduler
- New mapping was simple, but the congested fraction plot helped us know how to change the mapping
- Performance is almost as good as when run in isolation

Configuration                  | Time (s) | % difference
Congestion, default mapping    | 30.40    | -
Congestion, shifted mapping    | 25.50    | 16.1%
No congestion                  | 24.60    | 19.1%

73. Comparison with Yoga & Chabbi
- More flexible, but higher cost
  - Additional SRAM in each switch
  - Drainer component to forward performance information
- They could produce congested fraction plots
- Their work is also still theoretical, only evaluated in simulation
- Our work does not require a central collection server
  - No security/permission challenges; your job automatically gets the data it can use
- Their work is designed for Gen-Z, though it can be ported to InfiniBand with a specification change

74. Background on high performance interconnection networks for supercomputers
- We focus on fat trees, but almost everything is generalizable
- Cut-through routing: under light load, a message spends very little time in the buffers
- Packets can't change length
- Adaptive routing potentially in use
- Unlike Ethernet, a switch can't drop packets to deal with congestion; buffer occupancy is kept under control with credit-based flow control
[Diagram: switch with input buffers, crossbar network, and output buffers connecting incoming and outgoing links]

75. Convergence analysis
- Both schemes are unbiased estimators, i.e. the expected value is the true value
- Original full-reservoir scheme: [variance formula]
- Compact low-overhead variant: [variance formula]
- In terms of: the number of packets through the link, the path length, and the number of packets that might have gone through the link
- In practice, the amount of traffic required to get accurate values corresponds to about a second of traffic
- Diagrams were produced using the low-overhead variant

76. Computing H cheaply
- For a given link, the link ID is fixed, so its contribution (a bitmask) can be precomputed
- In GF(2^n), addition is bitwise XOR
- AND the packet ID with the precomputed bitmask, then compute a "sideways XOR": what is the parity of the popcount?
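The "1 AND and a few XORs" step can be sketched directly; the fold below computes the popcount parity of a 32-bit value (the bitmask is treated as given, and deriving it from the GF(2^n) multiplication is omitted here):

```python
def h(packet_id: int, link_mask: int) -> int:
    """1 AND plus a few XORs: parity of (packet_id AND link_mask).

    In GF(2^n), one output bit of the product packet_id * link_id is
    the parity of packet_id ANDed with a mask precomputed from the
    link ID; the XOR fold below computes that parity for 32-bit values.
    """
    x = packet_id & link_mask  # the single AND
    x ^= x >> 16               # fold: each step XORs the upper half
    x ^= x >> 8                # into the lower half, leaving the
    x ^= x >> 4                # popcount parity in bit 0
    x ^= x >> 2
    x ^= x >> 1
    return x & 1
```

For any nonzero mask, this returns 0 for exactly half of the packet IDs, matching the balance requirement on slide 36.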