Slide 1
HiPS: Hierarchical Parameter Synchronization in Large-Scale Distributed Machine Learning
Jinkun Geng, Dan Li, Yang Cheng, Shuai Wang, and Junfeng Li

Slide 2
Net for AI: ACM SIGCOMM Workshop on NetAI

Slide 3
Background
Distributed Machine Learning: Computation and Communication

Slide 4
Background
Strong computation power (GPU & TPU)

Slide 5
Background: Communication Challenge
TCP: high latency, low throughput, kernel overheads, etc.
RDMA: a promising alternative to TCP

Slide 6
Background
An MNIST benchmark with 1 million parameters

Slide 7
Background
RoCE/RDMA: multi-vendor ecosystem
Many problems in Fat-Tree based deployment

Slide 8
Background: Fat-Tree based Deployment
PFC pause frame storm [SIGCOMM'15, '16; NS-3 simulation]
Resilient RoCE: performance sacrifice [Chelsio-Tech]
Synchronization performance

Slide 9
Background: Fat-Tree based Deployment
PFC pause frame storm [SIGCOMM'15, '16]
Resilient RoCE: performance sacrifice
Server-centric networks

Slide 10
Background: Fat-Tree based Deployment
Synchronization performance
Hierarchical synchronization

Slide 11
Background: Server-Centric Networks
Fewer hops lead to fewer PFC pause frames
Servers prevent the cascading effect of PFC pause frames

Slide 12
Background: Synchronization Algorithms
PS-based, Mesh-based, Ring-based

Slide 13
Background: Synchronization Algorithms
PS-based (Pull + Push)

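For reference, a minimal in-memory sketch (Python) of the pull+push pattern with a single parameter server; the class and method names (ParameterServer, pull, push) are illustrative only, not the interfaces of any system named in the talk, and real deployments shard parameters across servers and communicate over TCP/RDMA.

import numpy as np

class ParameterServer:
    # Holds the global model; workers pull weights and push gradients.
    def __init__(self, dim):
        self.weights = np.zeros(dim)

    def pull(self):
        return self.weights.copy()

    def push(self, grad, lr=0.01):
        self.weights -= lr * grad

# One synchronous step: every worker pulls the current weights, computes a
# local gradient (stubbed here as random noise), and pushes it back.
ps = ParameterServer(dim=8)
for _ in range(4):                      # 4 simulated workers
    local_w = ps.pull()
    local_grad = np.random.randn(*local_w.shape)
    ps.push(local_grad)
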
Slide 14
Background: Synchronization Algorithms
Mesh-based (Diffuse + Collect)

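A minimal sketch of diffuse+collect among P fully meshed workers, simulated with in-process numpy arrays instead of real peer-to-peer transfers; the partitioning scheme and the helper name mesh_sync are assumptions for illustration, not code from the talk.

import numpy as np

def mesh_sync(local_grads):
    # local_grads: one gradient array per worker; returns the averaged
    # gradient every worker would hold after diffuse + collect.
    P = len(local_grads)
    parts = [np.array_split(g, P) for g in local_grads]
    # Diffuse: worker j receives the j-th partition from every peer and
    # aggregates (here: averages) it.
    reduced = [sum(parts[i][j] for i in range(P)) / P for j in range(P)]
    # Collect: every worker gathers all aggregated partitions back.
    full = np.concatenate(reduced)
    return [full.copy() for _ in range(P)]
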
Slides 15-16
Background: Synchronization Algorithms
Ring-based (Scatter + Gather)

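A minimal sketch of scatter+gather over a logical ring (i.e. ring all-reduce), again simulated in-process; the chunk and step indexing follows the textbook ring all-reduce schedule and may differ in detail from the implementations evaluated in the talk.

import numpy as np

def ring_allreduce(grads):
    # grads: one gradient array per worker; returns the averaged result
    # that every worker would hold after the two phases.
    P = len(grads)
    chunks = [list(np.array_split(g, P)) for g in grads]
    # Phase 1 - Scatter (reduce-scatter): after P-1 steps, worker r holds
    # the fully reduced chunk (r + 1) % P.
    for s in range(P - 1):
        msgs = [(r, (r - s) % P, chunks[r][(r - s) % P].copy()) for r in range(P)]
        for r, c, payload in msgs:
            chunks[(r + 1) % P][c] = chunks[(r + 1) % P][c] + payload
    # Phase 2 - Gather (all-gather): circulate the reduced chunks so that
    # every worker ends up with all of them.
    for s in range(P - 1):
        msgs = [(r, (r + 1 - s) % P, chunks[r][(r + 1 - s) % P].copy()) for r in range(P)]
        for r, c, payload in msgs:
            chunks[(r + 1) % P][c] = payload
    return [np.concatenate(ch) / P for ch in chunks]
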
Slide 17
HiPS Design
Map the logical view onto the physical structure
Flexible (topology-aware)
Hierarchical (efficient)

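A simplified two-level sketch of the hierarchical idea: aggregate inside each group first, synchronize across groups, then broadcast the global result back. HiPS maps these levels onto the server-centric topology itself (BCube levels, Torus dimensions); the flat, fixed-size grouping and the helper name hierarchical_sync below are assumptions for illustration only.

import numpy as np

def hierarchical_sync(grads, group_size):
    # grads: one numpy gradient array per worker.
    P = len(grads)
    groups = [list(range(i, min(i + group_size, P)))
              for i in range(0, P, group_size)]
    # Level 1: intra-group aggregation (in HiPS this runs a small-scale
    # PS/mesh/ring sync inside each group).
    partial = [sum(grads[m] for m in g) for g in groups]
    # Level 2: inter-group synchronization among group representatives,
    # yielding the global average.
    total = sum(partial) / P
    # Broadcast the global result back inside every group.
    return [total.copy() for _ in range(P)]
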
Slides 18-22
HiPS Design
HiPS in BCube (illustrated step by step with Server <01>)

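A small sketch of how a BCube(n, k) server's per-level synchronization groups can be derived from its address: at level l it synchronizes with the n-1 peers whose addresses differ only in digit a_l. The least-significant-first digit order and the helper name bcube_groups are assumptions for illustration, not code from the talk.

def bcube_groups(addr, n):
    # addr: tuple of k+1 base-n digits, least significant first,
    # so addr=(1, 0) denotes server <01> in BCube(3, 1).
    groups = []
    for level in range(len(addr)):
        peers = [addr[:level] + (d,) + addr[level + 1:]
                 for d in range(n) if d != addr[level]]
        groups.append(peers)
    return groups

# bcube_groups((1, 0), 3) -> [[(0, 0), (2, 0)], [(1, 1), (1, 2)]]
# i.e. server <01> syncs with <00>, <02> at level 0 and <11>, <21> at level 1.
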
Slide 23
HiPS Design
HiPS in Torus

Slides 24-26
Theoretical Evaluation

Slide 27
Future Work
Conduct further comparative study
Integrate HiPS into DML systems

Slide 28
Simulation Evaluation
GST (global synchronization time) comparison with RDMA in Torus
GST comparison with RDMA in BCube
NS-3 simulation with VGG workload
BCube: GST reduced by 37.5%~61.9%
Torus: GST reduced by 49.6%~66.4%

Slide 29
Testbed Evaluation
System instance of HiPS: BML
Add an OP in TensorFlow (a sketch follows after this slide)
9 servers, each equipped with 2 RNICs (BCube(3,1))
MNIST and VGG19 as benchmarks
Ring AllReduce in Ring and mesh-based (P2P) sync in Fat-Tree as baselines

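A hedged sketch of how a custom synchronization OP could be wired into a TensorFlow 1.x training graph. The slides only state that an OP is added; the library name hips_sync.so and the op name hips_all_reduce below are hypothetical placeholders, and the compiled C++ kernel they refer to is not shown, so this snippet is not runnable as-is.

import tensorflow as tf

# Placeholder path/name for a compiled custom-op library; the real kernel
# would perform the hierarchical synchronization over RDMA.
sync_module = tf.load_op_library("./hips_sync.so")

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])
logits = tf.layers.dense(x, 10)
loss = tf.losses.softmax_cross_entropy(y, logits)

optimizer = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = optimizer.compute_gradients(loss)
# Replace each local gradient with its globally synchronized value before
# applying the update.
synced = [(sync_module.hips_all_reduce(g), v) for g, v in grads_and_vars]
train_op = optimizer.apply_gradients(synced)
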
Slide 30
Testbed Evaluation

Slide 31
Testbed Evaluation
GST reduced by 18.7%~56.4%

Slide 32
Ongoing Work
Conduct further comparative study
Optimize HiPS in DML systems
More cases of Network for AI

Slide 33
Thanks!
NASP Research Group
https://nasp.cs.tsinghua.edu.cn/