NVMe™/TCP Development

Presentation Transcript

NVMe™/TCP Development Status and a Case Study of the SPDK User-Space Solution
2019 NVMe™ Annual Members Meeting and Developer Day
March 19, 2019
Sagi Grimberg, Lightbits Labs
Ben Walker and Ziye Yang, Intel

NVMe™/TCP Status
- TP ratified in November 2018
- Linux kernel NVMe/TCP inclusion made v5.0
- Interoperability tested with vendors and SPDK
- Running in large-scale production environments (backported, though)
- Main TODOs:
  - TLS support
  - Connection termination rework
  - I/O polling (leverage .sk_busy_loop() for polling)
  - Various performance optimizations (mainly on the host driver)
  - A few minor specification wording issues to fix up

Performance: Interrupt Affinity
- In NVMe™ we pay close attention to steering each interrupt to the application CPU core
- In TCP networking:
  - TX interrupts are usually steered to the submitting CPU core (XPS)
  - RX interrupt steering is determined by a hash of the 5-tuple, which is not local to the application CPU core
- But aRFS comes to the rescue!
  - The RPS mechanism is offloaded to the NIC
  - The NIC driver implements .ndo_rx_flow_steer (see the sketch below)
  - The RPS stack learns which CPU core processes the stream and teaches the HW with a dedicated steering rule
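
For illustration, a minimal sketch of how a NIC driver hooks into aRFS. The .ndo_rx_flow_steer callback is the real kernel hook named on the slide (available when CONFIG_RFS_ACCEL is set), but the driver functions example_rx_flow_steer and example_hw_add_steering_rule are hypothetical placeholders, not any real driver's code:

    /* Illustrative only: .ndo_rx_flow_steer is a real hook (CONFIG_RFS_ACCEL),
     * but this driver and its hardware-rule helper are hypothetical. */
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical helper that programs an RX steering rule into the NIC. */
    int example_hw_add_steering_rule(struct net_device *dev,
                                     const struct sk_buff *skb,
                                     u16 rxq_index, u32 flow_id);

    static int example_rx_flow_steer(struct net_device *dev,
                                     const struct sk_buff *skb,
                                     u16 rxq_index, u32 flow_id)
    {
        /* Steer future packets of this flow to rxq_index, the queue whose
         * interrupt is affine to the CPU core consuming the stream. */
        return example_hw_add_steering_rule(dev, skb, rxq_index, flow_id);
    }

    static const struct net_device_ops example_netdev_ops = {
        /* ... other callbacks ... */
        .ndo_rx_flow_steer = example_rx_flow_steer,
    };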

Canonical Latency Overhead Comparison
- The measurement tests the latency overhead of a QD=1 I/O operation
- NVMe™/TCP is faster than iSCSI but slower than NVMe/RDMA

Performance: Large Transfer Optimizations
- NVMe™ usually imposes minor CPU overhead for large I/O:
  - <= 8K (two pages): only assign two pointers
  - > 8K: set up a PRP/SGL
- In TCP networking:
  - Large TX transfers involve higher overhead for TCP segmentation and copy
    - Solution: TCP Segmentation Offload (TSO) and .sendpage() (see the sketch below)
  - Large RX transfers involve higher overhead for more interrupts and copy
    - Solution: Generic Receive Offload (GRO) and adaptive interrupt moderation
- Still more overhead than PCIe, though...
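
A hedged sketch of the TX side of this idea: kernel_sendpage() is a real kernel API (as used by sendpage-based transports of this era), but the helper below and its error handling are a simplified stand-in, not the actual nvme-tcp code:

    /* Sketch only: push data pages to the socket without copying, letting
     * TSO build large segments. Real code must also handle partial sends. */
    #include <linux/net.h>
    #include <linux/mm.h>
    #include <linux/socket.h>

    static int example_send_data_pages(struct socket *sock, struct page **pages,
                                       size_t npages, size_t last_len)
    {
        size_t i;

        for (i = 0; i < npages; i++) {
            /* MSG_MORE hints that more data follows, so the stack can
             * coalesce instead of flushing a segment per page. */
            int flags = (i + 1 < npages) ? (MSG_DONTWAIT | MSG_MORE)
                                         : MSG_DONTWAIT;
            size_t len = (i + 1 < npages) ? PAGE_SIZE : last_len;
            int ret = kernel_sendpage(sock, pages[i], 0, len, flags);

            if (ret < 0)
                return ret;
        }
        return 0;
    }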

Throughput Comparison
- Single-threaded NVMe™/TCP achieves 2x better throughput than iSCSI
- NVMe/TCP scales to saturate 100Gb/s with 2-3 threads, whereas iSCSI is blocked

NVMe™/TCP Parallel Interface
- Each NVMe queue maps to a dedicated bidirectional TCP connection (see the sketch below)
- No controller-wide sequencing
- No controller-wide reassembly constraints
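
A hedged kernel-side sketch of the per-queue connection idea: sock_create_kern() and kernel_connect() are real kernel APIs, but the queue structure and helper below are illustrative placeholders, not the actual nvme-tcp driver:

    /* Illustrative only: one TCP socket per NVMe queue. */
    #include <linux/net.h>
    #include <linux/in.h>
    #include <net/sock.h>
    #include <net/net_namespace.h>

    struct example_nvme_tcp_queue {
        struct socket *sock;   /* dedicated bidirectional connection */
        /* ... per-queue send/receive state ... */
    };

    static int example_connect_queue(struct example_nvme_tcp_queue *queue,
                                     struct sockaddr_storage *ctrl_addr,
                                     int addrlen)
    {
        int ret;

        ret = sock_create_kern(&init_net, ctrl_addr->ss_family, SOCK_STREAM,
                               IPPROTO_TCP, &queue->sock);
        if (ret)
            return ret;

        /* Each queue connects independently, so there is no controller-wide
         * sequencing or reassembly across queues. */
        return kernel_connect(queue->sock, (struct sockaddr *)ctrl_addr,
                              addrlen, 0);
    }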

4K IOPs Scalability
- iSCSI is heavily serialized and cannot scale with the number of threads
- NVMe™/TCP scales very well, reaching over 2M 4K IOPs

Performance: Read vs. Write I/O Queue Separation
- A common problem with TCP/IP is head-of-queue (HOQ) blocking
  - For example, a small 4KB read is blocked behind a large 1MB write until its data transfer completes
- Linux supports separate queue mappings since v5.0:
  - Default queue map
  - Read queue map
  - Poll queue map
- NVMe™/TCP leverages the separate queue maps to eliminate HOQ blocking (see the sketch below)
- In the future, priority-based queue arbitration can reduce the impact even further
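
A hedged sketch of how a blk-mq driver splits hardware contexts across the three map types. HCTX_TYPE_DEFAULT/READ/POLL and blk_mq_map_queues() are real blk-mq interfaces; the helper, its parameters, and the exact split below are illustrative and differ from the real nvme-tcp implementation:

    /* Illustrative sketch: give reads, writes, and polled I/O their own
     * queue ranges so a small read never queues behind a large write. */
    #include <linux/blk-mq.h>

    static void example_map_queues(struct blk_mq_tag_set *set,
                                   unsigned int nr_default,
                                   unsigned int nr_read,
                                   unsigned int nr_poll)
    {
        set->map[HCTX_TYPE_DEFAULT].nr_queues = nr_default;
        set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;

        set->map[HCTX_TYPE_READ].nr_queues = nr_read;
        set->map[HCTX_TYPE_READ].queue_offset = nr_default;

        set->map[HCTX_TYPE_POLL].nr_queues = nr_poll;
        set->map[HCTX_TYPE_POLL].queue_offset = nr_default + nr_read;

        /* Spread CPUs over each map independently. */
        blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
        blk_mq_map_queues(&set->map[HCTX_TYPE_READ]);
        blk_mq_map_queues(&set->map[HCTX_TYPE_POLL]);
    }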


Mixed Workloads Test
- Tests the impact of large write I/O on read latency:
  - 32 "readers" issuing synchronous READ I/O
  - 1 writer issuing 1MB writes at QD=16
- iSCSI latencies collapse in the presence of large writes (heavy serialization over a single channel)
- NVMe™/TCP is very much on par with NVMe/RDMA

Commercial Performance
- Software NVMe™/TCP controller performance (IOPs vs. latency)*
- * Commercial single 2U NVM subsystem that implements RAID and compression, with 8 attached hosts

Commercial Performance – Mixed Workloads
- Software NVMe™/TCP controller performance (IOPs vs. latency)*
- * Commercial single 2U NVM subsystem that implements RAID and compression, with 8 attached hosts

Slab, sendpage and Kernel Hardening
- We never copy buffers on the NVMe™/TCP TX side (not even PDU headers)
- As a proper blk_mq driver, our PDU headers were preallocated in advance
- PDU headers were allocated as normal Slab objects
- Can a Slab allocation be sent to the network with zero-copy?
  - linux-mm seemed to agree we can (see the discussion)...
- But every now and then, under some workloads, the kernel would panic:

    kernel BUG at mm/usercopy.c:72!
    CPU: 3 PID: 2335 Comm: dhclient Tainted: G O 4.12.10-1.el7.elrepo.x86_64 #1
    ...
    Call Trace:
     copy_page_to_iter_iovec+0x9c/0x180
     copy_page_to_iter+0x22/0x160
     skb_copy_datagram_iter+0x157/0x260
     packet_recvmsg+0xcb/0x460
     sock_recvmsg+0x3d/0x50
     ___sys_recvmsg+0xd7/0x1f0
     __sys_recvmsg+0x51/0x90
     SyS_recvmsg+0x12/0x20
     entry_SYSCALL_64_fastpath+0x1a/0xa5

Slab, sendpage and Kernel Hardening
- Root cause:
  - At high queue depth, the TCP stack coalesces PDU headers into a single fragment
  - At the same time, userspace programs apply BPF packet filters (in this case dhclient)
  - Kernel hardening applies heuristics to catch exploits: in this case, panic if a usercopy attempts to copy an skbuff containing a fragment that crosses a Slab object boundary
- Resolution:
  - Don't allocate PDU headers from the Slab allocators
  - Instead use a queue-private page_frag_cache (see the sketch below)
  - This resolved the panic, and also improved page referencing efficiency on the TX path!
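
A minimal sketch of the resolution, assuming a per-queue structure that embeds a page_frag_cache. page_frag_alloc() and page_frag_free() are real kernel APIs; the queue structure and helpers are hypothetical names, not the actual nvme-tcp code:

    /* Illustrative only: allocate PDU headers from page fragments rather
     * than Slab objects, so zero-copy TX never trips usercopy hardening. */
    #include <linux/gfp.h>
    #include <linux/mm.h>

    struct example_tcp_queue {
        struct page_frag_cache pf_cache;  /* queue-private header cache */
        /* ... socket, request lists, etc. ... */
    };

    static void *example_alloc_pdu_header(struct example_tcp_queue *queue,
                                          unsigned int hdr_len)
    {
        return page_frag_alloc(&queue->pf_cache, hdr_len, GFP_KERNEL);
    }

    static void example_free_pdu_header(void *hdr)
    {
        page_frag_free(hdr);
    }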

Ecosystem
- Linux kernel support is upstream since v5.0 (both host and NVM subsystem)
  - https://lwn.net/Articles/772556/
  - https://patchwork.kernel.org/patch/10729733/
- SPDK support (both host and NVM subsystem)
  - https://github.com/spdk/spdk/releases
  - https://spdk.io/news/2018/11/15/nvme_tcp/
- NVMe™ compliance program:
  - Interoperability testing started at UNH-IOL in the fall of 2018
  - Formal NVMe compliance testing at UNH-IOL planned to start in the fall of 2019
- For more information see: https://nvmexpress.org/welcome-nvme-tcp-to-the-nvme-of-family-of-transports/

Summary
- NVMe™/TCP is a new NVMe-oF™ transport
- NVMe/TCP is specified by TP 8000 (available at www.nvmexpress.org)
- Since TP 8000 is ratified, NVMe/TCP is officially part of NVMe-oF 1.0 and will be documented as part of the next NVMe-oF specification release
- NVMe/TCP offers a number of benefits:
  - Works with any fabric that supports TCP/IP
  - Does not require a "storage fabric" or any special hardware
  - Provides near direct-attached NAND SSD performance
  - Scalable solution that works within a data center or across the world

Storage Performance Development Kit
- User-space C libraries that implement a block stack
  - Includes an NVMe™ driver
  - Full-featured block stack
- Open source, 3-clause BSD
- Asynchronous, event-loop, polling design strategy
  - Very different from the traditional OS stack (but very similar to the new io_uring in Linux)
- 100% focus on performance (latency and bandwidth)
- https://spdk.io

NVMe-oF™ History
- NVMe™ over Fabrics target:
  - July 2016: initial release (RDMA transport)
  - July 2016 – Oct 2018: hardening, feature completeness, performance improvements (scalability), design changes (introduction of poll groups)
  - Jan 2019: TCP transport; compatible with the Linux kernel; based on POSIX sockets (option to swap in VPP)
- NVMe over Fabrics host:
  - December 2016: initial release (RDMA transport)
  - July 2016 – Oct 2018: hardening, feature completeness, performance improvements (zero copy)
  - Jan 2019: TCP transport; compatible with the Linux kernel; based on POSIX sockets (option to swap in VPP)

NVMe-oF™ Target Design Overview
- The target spawns one thread per core, each running an event loop
- The event loop is called a "poll group"
- New connections (sockets) are assigned to a poll group when accepted
- Each poll group polls the sockets it owns using epoll/kqueue for incoming requests
- Each poll group polls dedicated NVMe™ queue pairs on the back end for completions (indirectly, via the block device layer)
- I/O processing is run-to-completion and entirely lock-free (see the sketch below)
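
A hedged, userspace-only sketch of the poll-group idea using plain epoll. This is not SPDK's actual code; handle_incoming_pdu() and poll_backend_completions() are hypothetical functions standing in for the real request processing and back-end completion polling:

    /* Illustrative sketch of a per-core poll group event loop. */
    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    struct poll_group {
        int epfd;   /* epoll instance owning this group's sockets */
    };

    /* Hypothetical handlers provided elsewhere. */
    void handle_incoming_pdu(int fd);
    void poll_backend_completions(void);

    /* Called when the acceptor hands a new connection to this poll group. */
    static int poll_group_add_socket(struct poll_group *pg, int sockfd)
    {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = sockfd };
        return epoll_ctl(pg->epfd, EPOLL_CTL_ADD, sockfd, &ev);
    }

    /* One iteration of the event loop: run each ready request to completion,
     * then poll the back end, all on the same thread and without locks. */
    static void poll_group_poll(struct poll_group *pg)
    {
        struct epoll_event events[MAX_EVENTS];
        int i, n;

        /* Non-blocking poll (timeout 0) so the same thread can also poll
         * its dedicated NVMe queue pairs for completions. */
        n = epoll_wait(pg->epfd, events, MAX_EVENTS, 0);
        for (i = 0; i < n; i++)
            handle_incoming_pdu(events[i].data.fd);

        poll_backend_completions();
    }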

Adding a New Transport
- Transports are abstracted away from the common NVMe-oF™ code via a plugin system
- Plugins are a set of function pointers that are registered as a new transport (see the sketch below)
- The TCP transport is implemented in lib/nvmf/tcp.c
- [Diagram: a transport abstraction layered over RDMA, TCP, and possibly FC, with a socket abstraction over POSIX and VPP]
- Socket operations are also abstracted behind a plugin system
  - POSIX sockets and VPP are supported
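
A hedged sketch of the function-pointer plugin style described above. The ops structure, stub functions, and registration call are illustrative placeholders, not SPDK's real transport interface:

    /* Illustrative only: a generic function-pointer transport plugin. */
    #include <stddef.h>

    struct example_transport;   /* opaque per-transport state  */
    struct example_qpair;       /* opaque per-connection state */

    struct example_transport_ops {
        const char *name;
        struct example_transport *(*create)(void);
        void (*destroy)(struct example_transport *t);
        int  (*listen)(struct example_transport *t, const char *addr, int port);
        int  (*poll)(struct example_qpair *qp);   /* drive one connection */
    };

    /* The common NVMe-oF code only ever calls through the ops table, so
     * adding a transport means registering a new set of function pointers. */
    int example_register_transport(const struct example_transport_ops *ops);

    /* Stubs a real TCP transport would fill in with socket handling. */
    static struct example_transport *tcp_create(void) { return NULL; }
    static void tcp_destroy(struct example_transport *t) { (void)t; }
    static int tcp_listen(struct example_transport *t, const char *addr,
                          int port) { (void)t; (void)addr; (void)port; return 0; }
    static int tcp_poll(struct example_qpair *qp) { (void)qp; return 0; }

    static const struct example_transport_ops example_tcp_ops = {
        .name    = "TCP",
        .create  = tcp_create,
        .destroy = tcp_destroy,
        .listen  = tcp_listen,
        .poll    = tcp_poll,
    };

The common layer would then be handed &example_tcp_ops at startup, e.g. example_register_transport(&example_tcp_ops), and never needs to know which transport it is driving.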

Future Work
- Better socket syscall batching!
  - Calling epoll_wait, readv, and writev over and over is not efficient; we need to batch the syscalls for a given poll group
  - Abuse libaio's io_submit? io_uring? (see the sketch below)
  - Can likely reduce the number of syscalls by a factor of 3 or 4
- Better integration with VPP (eliminate a copy)
- Integrate with TCP acceleration available in NICs
- NVMe-oF offload support
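
A hedged sketch of what io_uring-based batching could look like for a poll group, assuming a ring already initialized with io_uring_queue_init(). The liburing calls are real; the conn structure and helper are illustrative, not SPDK code:

    /* Illustrative only: queue one readv per connection, then submit the
     * whole poll group's batch with a single syscall. */
    #include <liburing.h>
    #include <sys/uio.h>

    struct conn {
        int fd;
        struct iovec iov;   /* where this connection's next PDU bytes land */
    };

    static int poll_group_batch_reads(struct io_uring *ring,
                                      struct conn *conns, int nconns)
    {
        int i;

        /* Prepare SQEs in userspace; nothing enters the kernel yet. */
        for (i = 0; i < nconns; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            if (!sqe)
                break;
            io_uring_prep_readv(sqe, conns[i].fd, &conns[i].iov, 1, 0);
            io_uring_sqe_set_data(sqe, &conns[i]);
        }

        /* One syscall submits the entire batch for this poll group. */
        return io_uring_submit(ring);
    }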