Advanced Flow-control Mechanisms for the Sockets Direct Protocol over InfiniBand

1. Advanced Flow-control Mechanisms for the Sockets Direct Protocol over InfiniBand
P. Balaji, S. Bhagvat, D. K. Panda, R. Thakur, and W. Gropp
Mathematics and Computer Science, Argonne National Laboratory
High Performance Cluster Computing, Dell Inc.
Computer Science and Engineering, Ohio State University

2. High-speed Networking with InfiniBand
- High-speed networks: a significant driving force for ultra-large-scale systems
  - High performance and scalability are key
- InfiniBand is a popular choice as a high-speed network
- What does InfiniBand provide?
  - High raw performance (low latency and high bandwidth)
  - Rich features and capabilities
    - Hardware-offloaded protocol stack (data integrity, reliability, routing)
    - Zero-copy communication (memory-to-memory)
    - Remote direct memory access (read/write data in remote memory)
    - Hardware flow-control (the sender ensures the receiver is not overrun)
    - Atomic operations, multicast, QoS, and several others

3. TCP/IP on High-speed Networks
- TCP/IP is unable to keep pace with high-speed networks
  - Implemented purely in software (hardware TCP/IP is incompatible)
  - Utilizes only the raw network capability (e.g., a faster network link)
  - Performance is limited by the TCP/IP stack: on a 16 Gbps network, TCP/IP achieves 2-3 Gbps
  - Reason: it does NOT fully utilize network features
    - Hardware-offloaded protocol stack
    - RDMA operations
    - Hardware flow-control
- The advanced features of InfiniBand are great for new applications!
  - How should existing TCP/IP applications use them?

4. Sockets Direct Protocol (SDP)
- Industry-standard high-performance sockets
- Defined for two purposes:
  - Maintain compatibility for existing applications
  - Deliver the performance of the network to the applications
- Many implementations: OSU, OpenFabrics, Mellanox, Voltaire
- [Figure: protocol stack; sockets applications or libraries run either over the traditional path (sockets, TCP, IP, device driver) or over SDP, which maps directly onto the network's offloaded protocol and advanced features]
- SDP allows applications to utilize the network's performance and capabilities with ZERO modifications

5. SDP State of the Art
- The SDP standard specifies different communication designs
  - Large messages: synchronous zero-copy design using RDMA
  - Small messages: buffer-copy design with credit-based flow-control using send-recv operations
- These designs are often not the best!
- Previously, we proposed Asynchronous Zero-copy SDP to improve the performance of large messages [balaji07:azsdp]
- In this paper, we propose new flow-control techniques
  - Utilizing RDMA and hardware flow-control
  - Improving the performance of small messages
[balaji07:azsdp] "Asynchronous Zero-copy Communication for Synchronous Sockets in the Sockets Direct Protocol over InfiniBand". P. Balaji, S. Bhagvat, H.-W. Jin, and D. K. Panda. Workshop on Communication Architecture for Clusters (CAC), held with IPDPS 2007.

6. Presentation Layout
- Introduction
- Existing Credit-based Flow-control Design
- RDMA-based Flow-control
- NIC-assisted RDMA-based Flow-control
- Experimental Evaluation
- Conclusions and Future Work

7. Credit-based Flow-control
- Flow-control is needed to ensure the sender does not overrun the receiver
- A popular flow-control scheme for many programming models
  - SDP, MPI (MPICH2, Open MPI), file systems (PVFS2, Lustre)
  - Generic to many networks → does not utilize many exotic features
- TCP/IP-like behavior
  - The receiver presents N credits, i.e., it guarantees buffering for N segments
  - The sender sends N message segments before waiting for an ACK
  - When the receiver application reads out the data and a receive buffer is freed, an acknowledgment is sent out
- SDP credit-based flow-control uses static, compile-time-decided credits, unlike TCP/IP (a minimal sketch of the sender-side accounting follows below)
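
A minimal sketch of the sender side of this credit accounting, assuming a hypothetical post_send() that posts one InfiniBand send work request; this is illustrative, not the actual SDP implementation:

```c
/* Sketch of sender-side credit-based flow-control: the receiver grants
 * N credits (N preposted buffers); the sender consumes one credit per
 * segment and stalls when none are left, until an ACK returns freed
 * credits. All names here are illustrative assumptions. */
#include <pthread.h>
#include <stddef.h>

#define MAX_CREDITS 4                    /* N: fixed at compile time in SDP */

static int credits = MAX_CREDITS;        /* receive buffers granted to us */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void post_send(const void *seg, size_t len)  /* stub for the sketch */
{
    (void)seg; (void)len;                /* would post an IB send WQE here */
}

/* Sender: consume one credit per segment; stall when none remain. */
void send_segment(const void *seg, size_t len)
{
    pthread_mutex_lock(&lock);
    while (credits == 0)                 /* no preposted receive buffer */
        pthread_cond_wait(&cond, &lock); /* wait for an acknowledgment */
    credits--;
    pthread_mutex_unlock(&lock);
    post_send(seg, len);
}

/* ACK handler: the receiver read data out and freed n receive buffers. */
void ack_received(int n)
{
    pthread_mutex_lock(&lock);
    credits += n;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}
```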

8. Credit-based Flow-control
- [Figure: sender and receiver application buffers with sockets buffers in between; credits = 4; ACKs flow from receiver back to sender]
- The receiver has to pre-specify the buffers in which data should be received
  - An InfiniBand requirement for send-receive communication
- The sender manages the send buffers and the receiver manages the receive buffers
- Coordination between sender and receiver happens through explicit acknowledgments

9. Limitations of Credit-based Flow-control
- [Figure: sender-side sockets buffers are copied into statically sized receive buffers (credits = 4) while the receiver's application buffers are not posted; ACKs return as buffers free up]
- The receiver controls the buffers: statically sized temporary buffers
- Two primary disadvantages:
  - Inefficient resource usage → excessive wastage of buffers
  - Small messages are pushed directly to the network
    - Network performance is under-utilized for small messages

10. Presentation Layout
- Introduction
- Existing Credit-based Flow-control Design
- RDMA-based Flow-control
- NIC-assisted RDMA-based Flow-control
- Experimental Evaluation
- Conclusions and Future Work

11. InfiniBand RDMA Capabilities
- Remote Direct Memory Access
  - Receiver-transparent data placement → can help provide a shared-memory-like illusion
  - Sender-side buffer management → the sender can dictate at which position in the receive buffer the data should be placed
- RDMA with immediate data
  - Requires the receiver to explicitly check for the receipt of data
    - Allows the receiver to know when the data has arrived
    - Loses receiver transparency!
  - Still retains sender-side buffer management
- In this design, we utilize RDMA with immediate data (a minimal verbs sketch follows below)
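
A minimal sketch of an RDMA write with immediate data using the standard libibverbs API. It assumes an already-connected RC queue pair, a registered local buffer (mr), and a remote virtual address and rkey obtained out of band; none of that setup is shown.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int rdma_write_with_imm(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *buf, uint32_t len,
                        uint64_t remote_addr, uint32_t rkey,
                        uint32_t seq_no)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = seq_no,
        .sg_list    = &sge,
        .num_sge    = 1,
        /* Unlike a plain RDMA write, the immediate data consumes a
         * receive WQE at the target, so the receiver is notified. */
        .opcode     = IBV_WR_RDMA_WRITE_WITH_IMM,
        .send_flags = IBV_SEND_SIGNALED,
        .imm_data   = htonl(seq_no),   /* e.g., carry a sequence number */
    };
    struct ibv_send_wr *bad_wr;

    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```

On the receiver, the completion surfaces as IBV_WC_RECV_RDMA_WITH_IMM with the value in wc.imm_data, which is how this design regains the arrival notification that plain RDMA writes lack while keeping sender-side buffer management.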

12. RDMA-based Flow-control
- [Figure: the sender coalesces small messages in its sockets buffer and RDMA-writes them into the receiver's sockets buffer; immediate send threshold = 4; receiver's application buffers are not posted]
- Utilizes the InfiniBand RDMA-with-immediate-data feature
- Sender-side buffer management
  - Avoids buffer wastage for small and medium messages
- Uses an immediate-send threshold to improve throughput for small and medium messages via message coalescing (a sketch of this logic follows below)
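
A minimal sketch of the sender-side coalescing decision; the threshold value and every helper (rdma_write, stage_into_socket_buffer, socket_buffer_bytes, rdma_write_socket_buffer) are illustrative assumptions, not the paper's actual code.

```c
#include <stddef.h>

#define IMMEDIATE_SEND_THRESHOLD 4      /* max un-ACKed RDMA writes */

static int outstanding;                 /* RDMA writes awaiting an ACK */

/* Stubs standing in for the real buffer and verbs machinery. */
static void   rdma_write(const void *m, size_t n) { (void)m; (void)n; }
static void   stage_into_socket_buffer(const void *m, size_t n) { (void)m; (void)n; }
static size_t socket_buffer_bytes(void) { return 0; }
static void   rdma_write_socket_buffer(void) { }

void sdp_send(const void *msg, size_t len)
{
    if (outstanding < IMMEDIATE_SEND_THRESHOLD) {
        outstanding++;
        rdma_write(msg, len);           /* fast path: straight to the wire */
    } else {
        /* Network is already busy: stage the message so consecutive
         * small sends are coalesced into one larger RDMA write. */
        stage_into_socket_buffer(msg, len);
    }
}

void on_ack(void)                       /* remote buffer space was freed */
{
    outstanding--;
    if (socket_buffer_bytes() > 0) {
        outstanding++;
        rdma_write_socket_buffer();     /* flush the coalesced data at once */
    }
}
```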

13. Limitations of RDMA-based Flow-control
- [Figure: while the application computes, staged data sits in the sender's sockets buffer; immediate send threshold = 4]
- The application is computing
- Remote credits are available and data is present in the sockets buffer
- Yet communication progress does not take place, since the coalesced data is only flushed when the application re-enters the library

14. Presentation Layout
- Introduction
- Existing Credit-based Flow-control Design
- RDMA-based Flow-control
- NIC-assisted RDMA-based Flow-control
- Experimental Evaluation
- Conclusions and Future Work

15. Hardware vs. Software Flow-control
- InfiniBand hardware provides a naïve message-level flow-control mechanism
  - Guarantees that a message is not sent out until the receiver is ready
  - Hardware takes care of progress even if the application is busy with other computation
  - Does not guarantee that the receiver has posted a sufficiently large buffer → buffer overruns are errors!
  - Does not provide message-coalescing capabilities
- Software flow-control schemes are more intelligent
  - Message coalescing, segmentation, and reassembly
  - But no progress if the application is busy with other computation

16. NIC-assisted RDMA-based Flow-control
- A hybrid hardware/software approach
  - Takes the best of IB hardware flow-control and the software features of RDMA-based flow-control
- Contains two main mechanisms:
  - Virtual window mechanism
    - Mainly for correctness: avoids buffer overflows
  - Asynchronous interrupt mechanism
    - An enhancement to the virtual window mechanism
    - Improves performance by coalescing data

17. Virtual Window Mechanism
- [Figure: sender and receiver sockets buffers with N/W = 4 NIC-handled receive buffers posted; the application is computing; ACKs return as before]
- For a virtual window size of W, the receiver posts N/W work queue entries (N being the total buffer space), i.e., it is ready to receive N/W messages
- The sender always sends message segments no larger than W
- The first N/W messages are transmitted directly by the NIC
- Later send requests are queued by the hardware (a sketch of the sender-side segmentation follows below)
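
A minimal sketch of sender-side segmentation under the virtual window mechanism: every message is split into segments of at most W bytes so that each segment is guaranteed to fit one of the N/W receive WQEs the receiver preposted. The window size and post_segment() are illustrative assumptions.

```c
#include <stddef.h>

#define W 8192  /* virtual window size in bytes (assumed value) */

/* Stub: post one IB send WQE; the NIC sends it immediately if a
 * preposted receive exists, and queues it in hardware otherwise. */
static void post_segment(const void *seg, size_t len) { (void)seg; (void)len; }

void send_message(const char *msg, size_t len)
{
    size_t off = 0;
    while (off < len) {
        size_t chunk = len - off;
        if (chunk > W)
            chunk = W;              /* never exceed the virtual window */
        post_segment(msg + off, chunk);
        off += chunk;
    }
}
```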

18. Asynchronous Interrupt Mechanism
- [Figure: N/W = 4; the NIC raises an IB interrupt as its preposted buffers run low, and software-handled buffers take over while the application computes]
- After the NIC raises the interrupt, it still has some messages to send, which lets us effectively utilize the interrupt time instead of wasting it
- During this time we can coalesce small amounts of data, which is sufficient to reach the performance of RDMA-based flow-control (a sketch of fielding this interrupt in software follows below)
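
A minimal sketch of how such an interrupt could be fielded in software through the standard libibverbs completion-event channel; flush_coalesced_sends() is an assumed stand-in for the coalescing work described on the slide, while the verbs calls themselves are standard.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

static void flush_coalesced_sends(void) { /* coalesce + RDMA-write (stub) */ }

int event_loop(struct ibv_comp_channel *channel)
{
    struct ibv_cq *cq;
    void *cq_ctx;

    for (;;) {
        /* Blocks until the NIC raises a completion interrupt. */
        if (ibv_get_cq_event(channel, &cq, &cq_ctx))
            return -1;
        ibv_ack_cq_events(cq, 1);

        /* Re-arm notification before draining, to avoid missed events. */
        if (ibv_req_notify_cq(cq, 0))
            return -1;

        /* Drain completions, then use the remaining interrupt time to
         * coalesce whatever accumulated in the sockets buffer. */
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            if (wc.status != IBV_WC_SUCCESS)
                fprintf(stderr, "WC error: %d\n", wc.status);
        }
        flush_coalesced_sends();
    }
}
```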

19. Presentation Layout
- Introduction
- Existing Credit-based Flow-control Design
- RDMA-based Flow-control
- NIC-assisted RDMA-based Flow-control
- Experimental Evaluation
- Conclusions and Future Work

20. Experimental Testbed
- 16-node cluster
  - Dual Intel Xeon 3.6 GHz EM64T processors (single-core, dual-processor)
  - Each processor has a 2 MB L2 cache
  - Each node has 1 GB of 533 MHz DDR SDRAM
- Connected using Mellanox MT25208 InfiniBand DDR adapters (3rd-generation adapters)
- Mellanox MTS-2400 24-port fully non-blocking switch

21. SDP Latency and Bandwidth
- The RDMA-based and NIC-assisted flow-control designs outperform credit-based flow-control by almost 10X for some message sizes

22. SDP Buffer Utilization
- The RDMA-based and NIC-assisted flow-control designs utilize the SDP buffers much more effectively, which eventually leads to their better performance

23. Communication Progress
- [Figure: computation time vs. communication progress, contrasting good communication progress with bad communication progress]

24. Data-cutter Library: Component Framework for Combined Task/Data Parallelism
- Developed by U. Maryland
- A popular model for data-intensive applications
- Task parallelism: the user defines a sequence of pipelined components (filters and filter groups)
  - Stream-based communication
- Data parallelism: the user tells the runtime system to generate/instantiate copies of filters
  - Flow-control between filter copies
  - Transparent: single-stream illusion
- [Figure: Virtual Microscope application mapped across three clusters, with reader filters (R0-R2, Ra0-Ra2), processing filters (E0-EN), and a merge filter (M) distributed over hosts host1-host5]

25. Evaluating the Data-cutter Library
- The RDMA-based and NIC-assisted flow-control designs achieve about 10-15% better performance
- No difference between the RDMA-based and NIC-assisted designs → the application makes regular progress

26. Presentation Layout
- Introduction
- Existing Credit-based Flow-control Design
- RDMA-based Flow-control
- NIC-assisted RDMA-based Flow-control
- Experimental Evaluation
- Conclusions and Future Work

27. Conclusions and Future Work
- SDP is an industry standard that allows sockets applications to transparently utilize the performance and features of IB
- Previous designs allow SDP to utilize some of the features of IB
  - The capabilities of features such as hardware flow-control and RDMA for small messages had not been studied so far
- In this paper we present two flow-control mechanisms that utilize these features of IB
  - We have shown that our designs can improve performance by up to 10X in some cases
- Future work: integrate our designs into the OpenFabrics SDP implementation; study MPI flow-control techniques

28. Thank You!
Contacts:
- P. Balaji: balaji@mcs.anl.gov
- S. Bhagvat: sitha_bhagvat@dell.com
- D. K. Panda: panda@cse.ohio-state.edu
- R. Thakur: thakur@mcs.anl.gov
- W. Gropp: gropp@mcs.anl.gov
Web links:
- http://www.mcs.anl.gov/~balaji
- http://nowlab.cse.ohio-state.edu