Presentation on theme: "HiPS: Hierarchical Parameter Synchronization" — Presentation transcript
Slide 1
HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning
Jinkun Geng, Dan Li, Yang Cheng, Shuai Wang, and Junfeng Li
Slide 2
ACM SIGCOMM Workshop on NetAI
Slide 3
Background
Distributed Machine Learning
- Computation
- Communication
Slide 4
Background
Strong Computation Power (GPU & TPU)
Slide 5
Background
Communication Challenge
- TCP: high latency & low throughput, kernel overheads, etc.
- RDMA: a promising alternative to TCP
Slide 6
Background
An MNIST benchmark with 1 million parameters
Slide 7
Background
- RoCE/RDMA: multi-vendor ecosystem
- Many problems in Fat-Tree-based deployment
Slide 8
Background
Fat-Tree-based Deployment
- PFC pause frame storm [SIGCOMM'15, '16; NS-3 simulation]
- Resilient RoCE: performance sacrifice [Chelsio-Tech]
- Synchronization performance
Slide 9
Background
Fat-Tree-based Deployment
- PFC pause frame storm [SIGCOMM'15, '16]
- Resilient RoCE: performance sacrifice
- Server-centric networks
Slide 10
Background
Fat-Tree-based Deployment
- Synchronization performance
- Hierarchical synchronization
Slide 11
Background
Server-Centric Networks
- Fewer hops lead to fewer PFC pause frames
- Servers prevent the cascading effect of PFC pause frames
Slide 12
Background
Synchronization Algorithms
- PS-based
- Mesh-based
- Ring-based
Slide 13
Background
Synchronization Algorithm: PS-based (Pull + Push)
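Slide 13 gives only the two phase names. As a rough illustration of the pattern, here is a minimal Python sketch of one PS-based round, in which workers push gradients to a central parameter server and then pull the updated model back; the names (`ParameterServer`, `push_gradient`, `pull_params`) are hypothetical, not from the talk or from BML:

```python
import numpy as np

# PS-based synchronization (hypothetical sketch): workers push gradients to
# a central parameter server, then pull the updated parameters back.
class ParameterServer:
    def __init__(self, params, lr=0.01):
        self.params = params      # global model parameters
        self.lr = lr              # learning rate

    def push_gradient(self, grad):
        # Push phase: apply one worker's gradient to the global model.
        self.params -= self.lr * grad

    def pull_params(self):
        # Pull phase: workers fetch the latest global parameters.
        return self.params.copy()

ps = ParameterServer(np.zeros(4))
for g in [np.ones(4), 2 * np.ones(4)]:    # gradients from two workers
    ps.push_gradient(g)                   # push
print(ps.pull_params())                   # pull -> [-0.03 -0.03 -0.03 -0.03]
```

The central server is also the scheme's weakness: all traffic converges on it, which is part of the motivation for the mesh- and ring-based alternatives on the next slides.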
Slide 14
Background
Synchronization Algorithm: Mesh-based (Diffuse + Collect)
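Likewise, the mesh-based scheme is shown only by its phase names: each worker diffuses its local gradient directly to every peer and then collects and aggregates what it receives. A minimal sketch, assuming simple averaging; `mesh_sync` is a made-up name:

```python
import numpy as np

# Mesh-based (P2P) synchronization sketch: every worker sends its gradient
# to all peers (diffuse), then averages what it received (collect).
def mesh_sync(local_grads):
    n = len(local_grads)
    inbox = [[] for _ in range(n)]
    # Diffuse: worker i sends its gradient to every other worker j.
    for i, g in enumerate(local_grads):
        for j in range(n):
            if j != i:
                inbox[j].append(g)
    # Collect: each worker averages its own gradient with the received ones.
    return [sum(inbox[i], local_grads[i]) / n for i in range(n)]

grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
print(mesh_sync(grads))   # every worker ends up with the mean [3. 4.]
```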
Slide 15
Background
Synchronization Algorithm: Ring-based (Scatter + Gather)
Slide 16
Background
Synchronization Algorithm: Ring-based (Scatter + Gather)
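The scatter+gather pattern stepped through on slides 15-16 is the classic ring all-reduce: a scatter-reduce pass followed by an all-gather pass, each taking n-1 steps around the ring. Below is a minimal single-process simulation; the function name `ring_allreduce` and the send-staging are mine, not the paper's implementation:

```python
import numpy as np

def ring_allreduce(grads):
    n = len(grads)
    # Split every worker's gradient into n chunks (chunk c lives at index c).
    chunks = [list(np.array_split(g.astype(float), n)) for g in grads]

    # Scatter-reduce: at step t, worker i sends chunk (i - t) % n to worker
    # (i + 1) % n, which adds it to its own copy. Sends are snapshotted
    # first so each step moves data exactly one hop.
    for t in range(n - 1):
        sends = [(i, (i - t) % n, chunks[i][(i - t) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data

    # All-gather: worker i now owns the fully reduced chunk (i + 1) % n;
    # circulate the reduced chunks around the ring, overwriting stale copies.
    for t in range(n - 1):
        sends = [(i, (i + 1 - t) % n, chunks[i][(i + 1 - t) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data

    return [np.concatenate(chunks[i]) for i in range(n)]

grads = [np.arange(6, dtype=float) * (w + 1) for w in range(3)]
out = ring_allreduce(grads)
print(np.allclose(out[0], grads[0] + grads[1] + grads[2]))  # True
```

Each worker sends only 2(n-1)/n of the gradient size in total, which is why ring all-reduce is bandwidth-optimal and a natural baseline for HiPS.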
Slide 17
HiPS Design
- Maps the logical view onto the physical structure
- Flexible (topology-aware)
- Hierarchical (efficient)
Slide 18
HiPS Design
HiPS in BCube
Slide 19
HiPS Design
HiPS in BCube
Slide 20
HiPS Design
HiPS in BCube
Slide 21
HiPS Design
HiPS in BCube (Server <01>)
Slide 22
HiPS Design
HiPS in BCube
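Slides 18-22 walk through HiPS in BCube pictorially. As a hedged reconstruction of the idea: in BCube(n, k) each server has a (k+1)-digit base-n address, and the servers that differ only in digit ℓ share a level-ℓ switch, so they can synchronize as one group, level by level. The grouping helper and the averaging scheme below are illustrative assumptions, not the paper's algorithm:

```python
from itertools import product

# In BCube(n, k), servers whose addresses differ only in digit `level`
# hang off the same level-`level` switch and form one sync group.
def bcube_groups(n, k, level):
    groups = {}
    for addr in product(range(n), repeat=k + 1):
        key = addr[:level] + addr[level + 1:]   # all digits except `level`
        groups.setdefault(key, []).append(addr)
    return list(groups.values())

# Hypothetical hierarchical synchronization: average within level-0 groups,
# then level-1 groups, and so on; after the last level every server holds
# the global average.
def hierarchical_sync(n, k, values):
    for level in range(k + 1):
        for group in bcube_groups(n, k, level):
            avg = sum(values[a] for a in group) / len(group)
            for a in group:
                values[a] = avg
    return values

# BCube(3, 1): 9 servers with 2-digit base-3 addresses, as in the testbed.
values = {a: float(i) for i, a in enumerate(product(range(3), repeat=2))}
print(set(hierarchical_sync(3, 1, values).values()))  # {4.0}: the global mean
```

The point of the hierarchy is that every group's traffic stays on one switch hop, which is what lets HiPS exploit server-centric topologies like BCube and Torus.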
Slide 23
HiPS Design
HiPS in Torus
Slide 24
Theoretical Evaluation
Slide 25
Theoretical Evaluation
Slide 26
Theoretical Evaluation
Slide 27
Future Work
- Conduct further comparative study
- Integrate HiPS into DML systems
Slide 28
Simulation Evaluation
GST (global synchronization time) comparison with RDMA in Torus and in BCube; NS-3 simulation with a VGG workload
- BCube: GST reduced by 37.5%~61.9%
- Torus: GST reduced by 49.6%~66.4%
Slide 29
Testbed Evaluation
- System instance of HiPS: BML
- Added as an OP in TensorFlow
- 9 servers, each equipped with 2 RNICs (BCube(3,1))
- MNIST and VGG19 as benchmarks
- Ring AllReduce in Ring and Mesh-based (P2P) sync in Fat-Tree as baselines
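The testbed figures are internally consistent: BCube(n, k) has n^(k+1) servers with k+1 ports each, so BCube(3, 1) yields exactly the 9 servers and 2 RNICs per server listed above. A quick sketch of that arithmetic (`bcube_size` is a made-up helper, not from the talk):

```python
# BCube(n, k) sizing: n^(k+1) servers, each with k+1 ports, and k+1 levels
# of switches with n^k switches per level. Checks the testbed numbers.
def bcube_size(n, k):
    servers = n ** (k + 1)
    ports_per_server = k + 1
    switches_per_level = n ** k
    return servers, ports_per_server, switches_per_level

print(bcube_size(3, 1))  # (9, 2, 3): 9 servers, 2 RNICs each, 3 switches/level
```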
Slide 30
Testbed Evaluation
Slide 31
Testbed Evaluation
18.7%~56.4%
Slide 32
Ongoing Work
- Conduct further comparative study
- Optimize HiPS in DML systems
- More cases of Network for AI
Slide 33
Thanks!
NASP Research Group
https://nasp.cs.tsinghua.edu.cn/