Remote Direct Memory Access - PowerPoint Presentation

Presentation Transcript

Slide 1

Remote Direct Memory Access between NVIDIA GPUs with the APEnet 3D Torus Interconnect

davide.rossetti@roma1.infn.it
SC11, Seattle, Nov 14-17, 2011

Slide 2

Credits

APEnet design and development in INFN by the APE™ team
Partially supported by the EU EURETILE project (euretile.roma1.infn.it)
GPU support developed in collaboration with Massimiliano Fatica, Timothy Murray, et al. at NVIDIA

Slide 3

Index

APEnet cluster interconnect
GPU features
Performance plots
Programming model
Status & future plans

Slide 4

The APEnet+ History

Custom HPC supercomputers: APE ('86), APE100 ('94), APEmille ('99), apeNEXT ('04)
Cluster interconnects:
  2003-2004: APEnet V3
  2005: APEnet V3+, same HW with RDMA API
  2006-2009: DNP, or APEnet goes embedded
  2011: APEnet V4, aka APEnet+

Slide 5

APEnet interconnect

APEnet 3D Torus network:
  Ideal for large-scale scientific simulations (domain decomposition, stencil computation, …)
  Today scalable up to 32K nodes
  No external switches, so the scaling cost is 1 card + 3 cables
  RDMA: zero-copy RX & TX!
  Small latency & high bandwidth
GPU cluster features:
  RDMA support for GPUs: no buffer copies between GPU and host
  Very good GPU-to-GPU latency

Slide 6

APEnet at a glance

APEnet+ card:
  FPGA based
  6 bidirectional links, up to 34 Gbps raw each
  PCIe x8 Gen2 in x16 slot, peak BW 4+4 GB/s
  Network Processor, off-loading engine integrated in the FPGA
  Zero-copy RDMA host interface
  Direct GPU interface
  Industry-standard QSFP+ cabling: passive copper / active copper / optical

Slide 7

GPU cluster a la APEnet

[Diagram: a cluster node is as simple as 1+ GPUs plus an APEnet+ card with its 6 torus links]

Slide 8

APEnet+ card architecture

[Block diagram of the Altera Stratix IV FPGA: a router (7x7 ports switch, routing logic, arbiter) serving six torus links (X+, X-, Y+, Y-, Z+, Z-); a TX/RX block with the GPU I/O accelerator; PCIe core on a x8 Gen2 link; 32-bit microcontroller; collective comm block; memory controller with DDR3 module; 1Gb Ethernet port]

Slide 9

GPU support: P2P

CUDA 4.0: Uniform Virtual Address space, GPUDirect 2.0 aka P2P among up to 8 GPUs
CUDA 4.1: P2P protocol with alien devices
P2P between NVIDIA Fermi and APEnet+: first non-NVIDIA device to support it! Jointly developed with NVIDIA
APElink+ card acts as a P2P peer
APElink I/O to/from GPU FB memory
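As a point of reference, the sketch below shows only the GPU-to-GPU side of GPUDirect 2.0 that CUDA 4.0 exposes directly (cudaDeviceCanAccessPeer / cudaDeviceEnablePeerAccess); the APEnet+ card joins the same picture as an extra P2P peer through the CUDA 4.1 peer protocol, which is not a public API and is therefore not shown here.

    // Minimal sketch: enabling GPUDirect 2.0 peer access between two GPUs.
    // The APEnet+ card participates as an additional P2P peer, but its peer
    // protocol (CUDA 4.1) is not public API, so only the GPU-GPU side is shown.
    #include <cuda_runtime.h>

    int enable_gpu_p2p(int dev_a, int dev_b)
    {
        int a_to_b = 0, b_to_a = 0;
        cudaDeviceCanAccessPeer(&a_to_b, dev_a, dev_b);
        cudaDeviceCanAccessPeer(&b_to_a, dev_b, dev_a);
        if (!a_to_b || !b_to_a)
            return -1;                         // no common PCIe root complex

        cudaSetDevice(dev_a);
        cudaDeviceEnablePeerAccess(dev_b, 0);  // flags must be 0
        cudaSetDevice(dev_b);
        cudaDeviceEnablePeerAccess(dev_a, 0);
        return 0;  // cudaMemcpyPeer() and UVA accesses now bypass host memory
    }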

Slide 10

The traditional flow …

[Diagram: the CPU acts as director; the GPU runs the kernel calc, data is transferred from GPU memory to CPU memory, and only then handed to the network]

Slide 11

… and with APEnet P2P

[Diagram: the CPU still acts as director, but data moves with a P2P transfer directly between GPU memory and the APEnet+ card, bypassing CPU memory]

Slide 12

P2P advantages

P2P means:
  Data exchange on the PCIe bus
  No bounce buffers on the host
So:
  Latency reduction for small messages
  No host cache pollution for large messages
  Free GPU resources, e.g. for same-host GPU-to-GPU memcpy
  Less load on the host, more room for computation/communication overlap

Slide 13

Benchmarking platform

Preliminary benchmarks:
  Coded with the APEnet RDMA API
  One-way only but …
  CUDA 4.1 pre-release
Caveat: used APEnet test cards with reduced capabilities:
  PCIe x8 Gen1
  Link raw speed @ 14 Gbps
2 slightly different servers:
  SuperMicro motherboards, CentOS 5.7 x86_64
  Dual Xeon 56xx, 24 GB
  NVIDIA C2050 on x16 Gen2 slots

Slide 14

Latency benchmark

OSU-like one-way latency test, re-coded using the RDMA PUT primitive
No small-message optimizations
No big difference with round-trip
6.7 us on the GPU-GPU test!
GPU TX is demanding!! still …
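A minimal sketch of such a one-way latency loop follows, assuming an am_put()/am_wait_event() pair as on the RDMA API slide; the prototypes and completion semantics are illustrative assumptions, not the documented APEnet API.

    // Hedged sketch of an OSU-like one-way latency loop over RDMA PUT.
    // The am_* prototypes and event semantics are ASSUMED for illustration.
    #include <stddef.h>
    #include <sys/time.h>

    typedef struct am_event am_event_t;
    int am_put(int dest_node, void *remote_pbuf, const void *src, size_t len);  // assumed
    int am_wait_event(am_event_t *ev);                                          // assumed

    #define ITERS 1000

    // src may be a registered host buffer or, in the GPU-GPU test, a
    // registered GPU buffer that the APEnet+ card reads via P2P.
    double oneway_latency_us(int peer, void *remote_pbuf, const void *src, size_t msg_size)
    {
        struct timeval t0, t1;
        am_event_t ev;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++) {
            am_put(peer, remote_pbuf, src, msg_size);  // non-blocking RDMA PUT
            am_wait_event(&ev);                        // wait for TX completion
        }
        gettimeofday(&t1, NULL);

        double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        return usec / ITERS;  // average per-message time on the TX side
    }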

Slide 15

Latency benchmark: P2P effects

No P2P = cuMemcpyD2H/H2D() on host bounce buffers
Buffers pinned with cuMemHostRegister()
cuMemcpy() costs ~10 us
MVAPICH2 points from the Ohio State U. web site*
[Plot annotation: ~2 x cuMemcpy()]

* http://mvapich.cse.ohio-state.edu/performance/mvapich2/inter_gpu.shtml
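For concreteness, a sketch of the no-P2P staging path measured here: the cuMemHostRegister()/cuMemcpyDtoH() driver-API calls are the real CUDA 4.x ones, while send_to_network() is a hypothetical stand-in for the RDMA PUT.

    // The staging that P2P removes: copy GPU data into a pinned host bounce
    // buffer, then hand that buffer to the NIC. CUDA driver-API calls are
    // real; send_to_network() is a hypothetical stand-in for an RDMA PUT.
    #include <cuda.h>
    #include <stdlib.h>

    void send_to_network(int peer, const void *buf, size_t len);  // hypothetical

    void send_without_p2p(int peer, CUdeviceptr d_src, size_t len)
    {
        void *bounce = malloc(len);
        cuMemHostRegister(bounce, len, 0);   // pin it so DMA engines can reach it

        cuMemcpyDtoH(bounce, d_src, len);    // the ~10 us bounce copy per message
        send_to_network(peer, bounce, len);  // NIC reads from host memory

        cuMemHostUnregister(bounce);
        free(bounce);
    }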

Slide 16

Bandwidth benchmark

Very preliminary
GPU RX is great! Better than host RX
Host TX is better, but suffers from the PCIe Gen1 & link-speed cap of the APEnet test card
Link speed over 2 GB/s on the final APEnet HW

Slide 17

Low-level profiling

OSU-like one-way latency benchmark, GPU to GPU, showing the TX side
Message sizes: 32 B to 2 KB
Break-out of the contributions to message time:
  Software: application, user-space library, kernel driver
  HW data: data-related time
  HW metadata: remaining time

Slide 18

SW: RDMA API

RDMA buffer management:
  Expose memory buffers to remote access: am_register_buf(), am_unregister_buf()
  2 types: SBUFs are use-once, PBUFs are targets of RDMA_PUT
  Typically done at app init time
Comm primitives:
  Non-blocking, async progress
  am_send() to an SBUF
  am_put() to a remote PBUF via the buffer's virtual address
  am_get() from a remote PBUF (future work)
Event delivery:
  am_wait_event()
  Events are generated on: comm primitive completion, RDMA buffer access
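A minimal sketch tying the three groups together: register a PBUF at init, PUT into the peer's PBUF by virtual address, then wait for the events. All am_* prototypes and the SBUF/PBUF encoding are assumptions, since only the function names appear in the deck.

    // Hedged sketch of the API flow above: register a persistent buffer (PBUF),
    // RDMA_PUT into the peer's PBUF by virtual address, wait for completions.
    // All am_* prototypes and the SBUF/PBUF encoding are ASSUMED.
    #include <stdlib.h>

    typedef struct am_event am_event_t;
    enum { AM_SBUF = 0, AM_PBUF = 1 };                                     // assumed encoding
    int am_register_buf(void *buf, size_t len, int type);                  // assumed
    int am_unregister_buf(void *buf);                                      // assumed
    int am_put(int dest, void *remote_pbuf, const void *src, size_t len);  // assumed
    int am_wait_event(am_event_t *ev);                                     // assumed

    int main(void)
    {
        size_t len = 1 << 20;
        void *rx = malloc(len), *tx = malloc(len);
        am_event_t ev;

        am_register_buf(rx, len, AM_PBUF);  // init time: expose RX buffer to remote PUTs
        void *remote_rx = NULL;             // peer's PBUF address, exchanged out of band

        am_put(0 /* dest node */, remote_rx, tx, len);  // non-blocking, async progress
        am_wait_event(&ev);   // event: our PUT completed
        am_wait_event(&ev);   // event: our own PBUF was written by the peer

        am_unregister_buf(rx);
        free(rx); free(tx);
        return 0;
    }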

Slide 19

SW: RDMA API

Typical stencil app:
  Init:
    Allocate buffers for ghost cells
    Register buffers
    Exchange buffers' host addresses
  Computation loop:
    Calc boundary
    cuMemcpy of boundary to buffer
    am_put() buffer to neighbors
    Calc bulk
    Wait for put done and local ghost cells written
    cuMemcpy of rx buffer to GPU ghost cells

Same app with P2P:
  Init:
    cudaMalloc() buffers on GPU
    Register GPU buffers
    Exchange GPU buffers' addresses
  Computation loop:
    Launch calc_bound kernel on stream0
    Launch calc_bulk kernel on stream1
    cudaStreamSync(stream0)
    am_put(rem_gpu_addr)
    Wait for put done and buffer written
    cudaStreamSync(stream1)

Thanks to P2P!
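A hedged sketch of one iteration of the P2P loop above: the calc_bound/calc_bulk kernel names and the two-stream structure come from the slide, while the am_* prototypes, launch configurations, and event handling are illustrative assumptions.

    // Sketch of one P2P stencil iteration: the bulk kernel overlaps with the
    // halo exchange, and am_put() reads straight out of the registered GPU
    // buffer, with no host staging. am_* prototypes and events are ASSUMED.
    #include <cuda_runtime.h>

    typedef struct am_event am_event_t;
    int am_put(int dest, void *remote_gpu_addr, const void *src, size_t len);  // assumed
    int am_wait_event(am_event_t *ev);                                         // assumed

    __global__ void calc_bound(float *halo)  { /* boundary-cell update elided */ }
    __global__ void calc_bulk(float *field)  { /* interior-cell update elided */ }

    void stencil_step(int neighbor, float *d_field, float *d_halo,
                      void *remote_gpu_addr, size_t halo_bytes,
                      cudaStream_t stream0, cudaStream_t stream1)
    {
        am_event_t ev;

        calc_bound<<<64, 256, 0, stream0>>>(d_halo);    // boundary on stream0
        calc_bulk<<<1024, 256, 0, stream1>>>(d_field);  // bulk on stream1, overlapped

        cudaStreamSynchronize(stream0);                 // boundary now ready in GPU memory
        am_put(neighbor, remote_gpu_addr, d_halo, halo_bytes);  // P2P: NIC reads GPU memory

        am_wait_event(&ev);              // our put is done
        am_wait_event(&ev);              // our own ghost-cell buffer was written

        cudaStreamSynchronize(stream1);  // bulk finished: ready for the next iteration
    }

Running calc_bulk on its own stream while the halo is exchanged is what provides the computation/communication overlap claimed on the P2P-advantages slide.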

Slide 20

SW: MPI

OpenMPI 1.5
APEnet BTL-level module
2 protocols, based on a threshold size:
  Eager: small message sizes, uses a plain send, no sync
  Rendezvous: pre-register the destination buffer, use RDMA_PUT, sync needed
Working on integration of P2P support
Use of CUDA 4.x UVA
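The eager/rendezvous split can be sketched as below; the function names, threshold value, and structure are illustrative only and not the actual OpenMPI BTL module code.

    // Illustrative sketch of the eager/rendezvous decision described above;
    // names and the threshold value are NOT the actual OpenMPI BTL code.
    #include <stddef.h>

    #define EAGER_LIMIT (4 * 1024)  // illustrative threshold size

    // assumed lower-level primitives (see the RDMA API slides)
    int am_send(int dest, const void *buf, size_t len);                    // to a remote SBUF
    int am_put(int dest, void *remote_pbuf, const void *buf, size_t len);  // to a registered PBUF

    int btl_send(int dest, const void *buf, size_t len, void *remote_pbuf)
    {
        if (len <= EAGER_LIMIT) {
            // Eager: small message, plain send into a pre-posted SBUF, no sync
            return am_send(dest, buf, len);
        }
        // Rendezvous: the destination PBUF was pre-registered and its address
        // exchanged; RDMA_PUT the payload, then the peers synchronize
        return am_put(dest, remote_pbuf, buf, len);
    }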

Slide 21

Status & future plans

Status:
  Bring-up phase for the APEnet card
  Mature FPGA firmware (beta)
  OpenMPI coming soon (alpha)
Future:
  8-16 node GPU+APEnet cluster available 1Q'12
  HW roadmap: Gen2 x16, Gen3
  GPU-initiated communications
  Fault tolerance
  Application acceleration via reconfigurable computing & new comm primitives

Slide 22

Game over

See you at IGI booth #752 at 12:30 PM for:
  A close view of the real APEnet+ card
  Going through sample code
  Q&A
APEnet web site: http://apegate.roma1.infn.it/
Contact us at apenet@apegate.roma1.infn.it