Remote Direct Memory Access between NVIDIA GPUs with the APEnet 3D Torus Interconnect
davide.rossetti@roma1.infn.it
SC11, Seattle, Nov 14-17, 2011
Credits
- APEnet design and development in INFN by the APE™ team
- Partially supported by the EU EURETILE project (euretile.roma1.infn.it)
- GPU support developed in collaboration with Massimiliano Fatica, Timothy Murray, et al. at NVIDIA
Index
- APEnet cluster interconnect
- GPU features
- Performance plots
- Programming model
- Status & future plans
The APEnet+ History
- Custom HPC supercomputers: APE ('86), APE100 ('94), APEmille ('99), apeNEXT ('04)
- Cluster interconnects:
  - 2003-2004: APEnet V3
  - 2005: APEnet V3+, same HW with RDMA API
  - 2006-2009: DNP, or APEnet goes embedded
  - 2011: APEnet V4, aka APEnet+
APEnet interconnect
- APEnet 3D torus network
  - Ideal for large-scale scientific simulations (domain decomposition, stencil computation, ...)
  - Today scalable up to 32K nodes
  - No external switches; scaling cost: 1 card + 3 cables
- RDMA: zero-copy RX & TX
- Small latency & high bandwidth
- GPU cluster features:
  - RDMA support for GPUs: no buffer copies between GPU and host
  - Very good GPU-to-GPU latency
APEnet at a glance
- APEnet+ card:
  - FPGA based
  - 6 bidirectional links, up to 34 Gbps raw each
  - PCIe x8 Gen2 in x16 slot, peak BW 4+4 GB/s
  - Network Processor, off-loading engine integrated in the FPGA
  - Zero-copy RDMA host interface
  - Direct GPU interface
- Industry-standard QSFP+ cabling: passive copper / active copper / optical
GPU cluster a la APEnet
[Diagram: as simple as a cluster node with 1+ GPUs and one APEnet+ card providing the 6 torus links]
APEnet+ card architecture
Block diagram, main components (Altera Stratix IV FPGA):
- Router: 7x7 ports switch, routing logic, arbiter
- Six torus links: X+, X-, Y+, Y-, Z+, Z-
- TX/RX block
- GPU I/O accelerator
- PCIe core (x8 Gen2 x8 link) and 32-bit microcontroller
- Collective comm block
- Memory controller with DDR3 module
- 1Gb Eth port
GPU support: P2P
- CUDA 4.0: Unified Virtual Address space; GPUDirect 2.0, aka P2P, among up to 8 GPUs
- CUDA 4.1: P2P protocol with alien devices
- P2P between NVIDIA Fermi and APEnet+
  - First non-NVIDIA device to support it!
  - Jointly developed with NVIDIA
  - The APEnet+ card acts as a P2P peer
  - APElink I/O to/from GPU FB memory
The traditional flow …
[Diagram: the CPU acts as director, launching the calc kernel and driving transfers; data moves between GPU memory and CPU memory before reaching the network]
… and with APEnet P2P
[Diagram: the APEnet+ card moves data directly to/from GPU memory (P2P transfer); the CPU still directs kernel launches, but there is no staging through CPU memory]
P2P advantages
- P2P means:
  - Data exchange on the PCIe bus
  - No bounce buffers on host
- So:
  - Latency reduction for small messages
  - Avoids host cache pollution for large messages
  - Frees GPU resources, e.g. for same-host GPU-to-GPU memcpy
  - Less load on host, more room for computation/communication overlap
Benchmarking platform
- Preliminary benchmarks:
  - Coded with the APEnet RDMA API
  - One-way only, but …
  - CUDA 4.1 pre-release
- Caveat: APEnet test cards with reduced capabilities:
  - PCIe x8 Gen1
  - Link raw speed @ 14 Gbps
- 2 slightly different servers:
  - SuperMicro motherboards, CentOS 5.7 x86_64
  - Dual Xeon 56xx, 24GB
  - NVIDIA C2050 on x16 Gen2 slots
Latency benchmark
- OSU-like one-way latency test, re-coded using RDMA PUT (see the sketch below)
- No small-message optimizations
- No big difference with round-trip
- 6.7 µs on the GPU-GPU test!
- GPU TX is demanding, still …
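A minimal sketch of what such a one-way latency loop could look like when re-coded on the RDMA PUT primitives (am_put()/am_wait_event(), introduced on the "SW: RDMA API" slide further on). The prototypes, iteration count and timing helper are assumptions for illustration, not the actual benchmark code.

/* Sketch of an OSU-like one-way latency loop (am_* prototypes are assumed). */
#include <stdio.h>
#include <stddef.h>
#include <time.h>

extern int am_put(int dest_node, void *remote_vaddr, void *local_buf, size_t len); /* assumption */
extern int am_wait_event(void);  /* assumption: blocks until the next completion event */

#define ITERS 1000

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

void one_way_latency(int dest, void *remote_pbuf_vaddr, void *src, size_t msg_size)
{
    double t0 = now_us();
    for (int i = 0; i < ITERS; i++) {
        /* Non-blocking RDMA PUT into the pre-registered remote PBUF. */
        am_put(dest, remote_pbuf_vaddr, src, msg_size);
        /* Wait for the local put-done event before the next iteration. */
        am_wait_event();
    }
    printf("%zu bytes: %.2f us one-way\n", msg_size, (now_us() - t0) / ITERS);
}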
Latency benchmark: P2P effects
- No P2P = cuMemcpyD2H()/H2D() on host bounce buffers
  - Buffers pinned with cuMemHostRegister()
  - cuMemcpy() costs ~10 µs, so GPU-GPU latency without P2P pays roughly 2x cuMemcpy() on top
- MVAPICH2 points from the Ohio State U. web site*

* http://mvapich.cse.ohio-state.edu/performance/mvapich2/inter_gpu.shtml
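For contrast, a rough sketch (not the actual test code) of the no-P2P send path, with GPU data staged through a pinned host bounce buffer; the am_* prototypes are assumed as above, and the receive side pays a symmetric H2D copy, hence the ~2x cuMemcpy() overhead.

/* Sketch of the non-P2P ("bounce buffer") path on the sender. */
#include <cuda.h>
#include <stdlib.h>

extern int am_put(int dest_node, void *remote_vaddr, void *local_buf, size_t len); /* assumption */
extern int am_wait_event(void);                                                     /* assumption */

void send_from_gpu_no_p2p(CUdeviceptr d_src, size_t len, int dest, void *remote_vaddr)
{
    void *bounce = malloc(len);
    /* Pin the host bounce buffer so the NIC can DMA from it. */
    cuMemHostRegister(bounce, len, 0);

    /* Extra hop: device-to-host copy (~10 us for small sizes). */
    cuMemcpyDtoH(bounce, d_src, len);

    /* RDMA put from the host bounce buffer; with P2P this put would read
       GPU memory directly and the copy above would disappear. */
    am_put(dest, remote_vaddr, bounce, len);
    am_wait_event();

    cuMemHostUnregister(bounce);
    free(bounce);
}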
Bandwidth benchmark
- Very preliminary
- GPU RX is great! Better than host RX
- Host TX is better, but suffers from the PCIe Gen1 & link speed cap of the APEnet test card
- Link speed over 2 GB/s on the final APEnet HW
Low-level profiling
- OSU-like one-way latency benchmark, GPU to GPU, showing the TX side
- Sizes: 32B to 2KB
- Break-out of contributions to message time:
  - Software: application, user-space library, kernel driver
  - HW data: data-related time
  - HW metadata: remaining time
SW: RDMA API
- RDMA buffer management: expose memory buffers to remote access
  - am_register_buf(), am_unregister_buf()
  - 2 types: SBUF (use-once), PBUF (targets of RDMA_PUT)
  - Typically done at app init time
- Comm primitives: non-blocking, async progress (see the sketch below)
  - am_send() to SBUF
  - am_put() to remote PBUF, via the buffer's virtual address
  - am_get() from remote PBUF (future work)
- Event delivery: am_wait_event()
  - Events generated on: comm primitive completion, RDMA buffer access
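A minimal usage sketch of this API, assuming illustrative prototypes and buffer-type names (the real header may differ): register a PBUF at init time, put into the peer's PBUF by virtual address, then wait for the completion event.

/* Sketch: typical APEnet RDMA API usage (prototypes and enum are assumptions). */
#include <stdlib.h>

extern int am_register_buf(void *buf, size_t len, int type);                        /* assumption */
extern int am_unregister_buf(void *buf);                                            /* assumption */
extern int am_put(int dest_node, void *remote_vaddr, void *local_buf, size_t len);  /* assumption */
extern int am_wait_event(void);                                                     /* assumption */

enum { AM_SBUF, AM_PBUF };   /* illustrative names for the two buffer types */

int main(void)
{
    size_t len = 1 << 20;
    void *rx = malloc(len);   /* target of remote RDMA_PUTs */
    void *tx = malloc(len);

    /* Init time: expose the receive buffer to remote access as a PBUF,
       then exchange its virtual address with the peer (exchange not shown). */
    am_register_buf(rx, len, AM_PBUF);
    void *remote_rx_vaddr = NULL;   /* address received from the peer */

    /* Non-blocking put into the peer's PBUF, addressed by its virtual address. */
    am_put(/*dest_node=*/1, remote_rx_vaddr, tx, len);

    /* Wait for the put-done event (the receiver symmetrically waits for the
       "buffer written" event). */
    am_wait_event();

    am_unregister_buf(rx);
    free(rx);
    free(tx);
    return 0;
}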
SW: RDMA API
Typical stencil app:
- Init:
  - Allocate buffers for ghost cells
  - Register buffers
  - Exchange buffers' host addresses
- Computation loop:
  - Calc boundary
  - cuMemcpy of boundary to host buffer
  - am_put() buffer to neighbors
  - Calc bulk
  - Wait for put done and local ghost cells written
  - cuMemcpy of RX buffer to GPU ghost cells

Same app with P2P (thanks to P2P!), see the sketch below:
- Init:
  - cudaMalloc() buffers on the GPU
  - Register GPU buffers
  - Exchange GPU buffer addresses
- Computation loop:
  - Launch calc_bound kernel on stream0
  - Launch calc_bulk kernel on stream1
  - cudaStreamSync(stream0)
  - am_put(rem_gpu_addr)
  - Wait for put done and buffer written
  - cudaStreamSync(stream1)
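A sketch of the P2P computation loop above in CUDA C. The calc_bound/calc_bulk kernels are placeholders and the am_* prototypes are assumed, so this shows the structure of one time step rather than the actual application code.

/* Sketch of one time step of the P2P stencil loop (placeholder kernels,
   assumed am_* prototypes). */
#include <cuda_runtime.h>

extern int am_put(int dest_node, void *remote_vaddr, void *local_buf, size_t len);  /* assumption */
extern int am_wait_event(void);                                                     /* assumption */

__global__ void calc_bound(float *d) { /* boundary update (placeholder) */ }
__global__ void calc_bulk (float *d) { /* bulk update (placeholder)     */ }

void step(float *d_buf, size_t n, void *rem_gpu_addr, int neighbor,
          cudaStream_t stream0, cudaStream_t stream1)
{
    /* Boundary and bulk proceed concurrently on two streams. */
    calc_bound<<<128, 256, 0, stream0>>>(d_buf);
    calc_bulk <<<128, 256, 0, stream1>>>(d_buf);

    /* Boundary data must be ready before the RDMA put reads GPU memory. */
    cudaStreamSynchronize(stream0);

    /* P2P: the put reads directly from GPU memory and writes into the
       neighbor's registered GPU buffer -- no host bounce buffer. */
    am_put(neighbor, rem_gpu_addr, d_buf, n * sizeof(float));

    /* Wait for put-done (and, symmetrically, for our ghost cells to arrive). */
    am_wait_event();

    cudaStreamSynchronize(stream1);
}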
SW: MPI
- OpenMPI 1.5, APEnet BTL-level module
- 2 protocols, selected by a size threshold (see the sketch below):
  - Eager: small messages, plain send, no sync
  - Rendezvous: pre-register the destination buffer, use RDMA_PUT, sync needed
- Working on integration of P2P support
- Uses CUDA 4.x UVA
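An illustrative sketch of the threshold-based protocol choice in a BTL-style send path; EAGER_LIMIT, rendezvous_register_remote() and the overall structure are hypothetical, not the actual Open MPI module code.

/* Sketch: eager vs. rendezvous selection by message size (all names hypothetical). */
#include <stddef.h>

#define EAGER_LIMIT 4096   /* hypothetical threshold, bytes */

extern int   am_send(int dest_node, const void *buf, size_t len);                     /* assumption */
extern int   am_put(int dest_node, void *remote_vaddr, void *local_buf, size_t len);  /* assumption */
extern void *rendezvous_register_remote(int dest_node, size_t len);                   /* hypothetical */

int btl_send(int dest, void *buf, size_t len)
{
    if (len <= EAGER_LIMIT) {
        /* Eager: small message, plain send into a remote SBUF, no sync needed. */
        return am_send(dest, buf, len);
    }
    /* Rendezvous: agree on / register the destination buffer first,
       then RDMA_PUT straight into it; completion must be synchronized. */
    void *remote_vaddr = rendezvous_register_remote(dest, len);
    return am_put(dest, remote_vaddr, buf, len);
}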
Status & future plans
- Status:
  - Bring-up phase for the APEnet card
  - Mature FPGA firmware (beta)
  - OpenMPI coming soon (alpha)
- Future:
  - 8-16 node GPU+APEnet cluster available 1Q'12
  - HW roadmap: Gen2 x16, Gen3
  - GPU-initiated communications
  - Fault tolerance
  - Application acceleration via reconfigurable computing & new comm primitives
Game over
- See you at IGI booth #752 at 12:30PM for:
  - A close view of the real APEnet+ card
  - Going through sample code
  - Q&A
- APEnet web site: http://apegate.roma1.infn.it/
- Contact us at apenet@apegate.roma1.infn.it