Slide 1: Search-Based Approaches to Accelerate Deep Learning
Zhihao Jia, Stanford University
6/23/19
Slide 2: Deep Learning is Everywhere
- Recurrent Neural Networks
- Convolutional Neural Networks
- Neural Architecture Search
- Reinforcement Learning
Slide 3: Deep Learning Deployment is Challenging
Diverse and complex DNN models must be deployed on distributed, heterogeneous hardware platforms. Two questions arise:
- Which operators should be executed?
- How should these operators be distributed across devices?
Slide 4: Existing Approach: Heuristic Optimizations
A DNN architecture is first transformed by rule-based graph optimizations (e.g., operator fusion), then parallelized with data/model parallelism across devices 1 through N. These heuristics miss model- and hardware-specific optimizations, so performance is suboptimal.
Slide 5: Search-Based Optimizations
A search space of possible strategies + a cost model and a search algorithm = optimized strategies
Challenge 1: How to build a search space that includes optimized strategies?
Challenge 2: How to efficiently explore the search space?
Slide 6: Overview
A search space of possible strategies + a cost model and a search algorithm = optimized strategies
- Parallelization across devices 1..N: the SOAP search space + Markov chain Monte Carlo search → fast parallelization strategies, outperforming data/model parallelism by up to 3.3x.
- Graph optimizations: auto-generated graph substitutions + cost-based backtracking search → optimized computation graphs, outperforming rule-based operator fusion by up to 2.9x.
Slide 7: Overview (transition to Part 1: fast parallelization with the SOAP search space and MCMC search)
Slide 8: Beyond Data and Model Parallelism for Deep Neural Networks
ICML'18, SysML'19
Slide 9: Current Approaches: Data and Model Parallelism
- Data parallelism is the default strategy in existing DNN frameworks.
- Manually designed strategies [1, 2] combine data and model parallelism to accelerate specific DNNs.
- Automatically generated strategies: ColocRL [3] uses reinforcement learning to find device placements for model parallelism.
Exploring dimensions beyond data and model parallelism can further accelerate DNN training (by up to 3.3x).

[1] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. 2014.
[2] Wu et al. Google's neural machine translation system: Bridging the gap between human and machine translation. 2016.
[3] Mirhoseini et al. Device placement optimization with reinforcement learning. 2017.
Slide 10: The SOAP Search Space
SOAP: Samples, Operators, Attributes, Parameters
Slide 11: The SOAP Search Space
- Samples: partitioning training samples (data parallelism)
- Operators
- Attributes
- Parameters
[Figure: parallelizing a 1D convolution along the sample dimension across GPU1-GPU4; axes are sample, parameter, and pixel.]
Slide 12: The SOAP Search Space
- Samples: partitioning training samples (data parallelism)
- Operators: partitioning DNN operators (model parallelism)
- Attributes
- Parameters
[Figure: Convolution#1, Convolution#2, and Convolution#3, each with sample, parameter, and pixel dimensions, assigned to GPU1, GPU2, and GPU3.]
Slide 13: The SOAP Search Space
- Samples: partitioning training samples (data parallelism)
- Operators: partitioning DNN operators (model parallelism)
- Attributes: partitioning attributes within a sample (e.g., different pixels)
- Parameters
[Figure: parallelizing a 1D convolution along the pixel (attribute) dimension across GPU1-GPU4.]
Slide 14: The SOAP Search Space
- Samples: partitioning training samples (data parallelism)
- Operators: partitioning DNN operators (model parallelism)
- Attributes: partitioning attributes within a sample (e.g., different pixels)
- Parameters: partitioning parameters within an operator
[Figure: parallelizing a 1D convolution along the parameter dimension across GPU1-GPU4.]
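To make the four SOAP dimensions concrete, here is a minimal sketch, assuming a hypothetical ParallelConfig type rather than FlexFlow's actual API, of how a parallelization strategy can be written down: each operator picks a partition degree along the sample, attribute, and parameter dimensions, and the operator dimension arises from giving different operators different configurations and devices.

```python
# A minimal sketch (not FlexFlow's actual API) of describing a strategy
# in the SOAP space. All names here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ParallelConfig:
    """How one operator is partitioned along the S, A, and P dimensions."""
    sample: int = 1      # S: partition training samples (data parallelism)
    attribute: int = 1   # A: partition attributes in a sample (e.g., pixels)
    parameter: int = 1   # P: partition parameters in the operator
    devices: list = field(default_factory=list)  # device assignment

    def degree(self) -> int:
        # Total number of partitions; must match the number of devices.
        return self.sample * self.attribute * self.parameter

# The O (operator) dimension: different operators may use different
# configurations on different devices (model parallelism).
strategy = {
    "conv1d": ParallelConfig(sample=2, parameter=2, devices=[0, 1, 2, 3]),
    "matmul": ParallelConfig(attribute=4, devices=[4, 5, 6, 7]),
}
assert all(c.degree() == len(c.devices) for c in strategy.values())
```

Data parallelism is then the special case where every operator partitions only the sample dimension.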
Slide 15: Hybrid Parallelism in SOAP
[Figure: example parallelization strategies for a 1D convolution that combine partitions along multiple SOAP dimensions.]
Different strategies perform the same computation.
Slide 16: Parallelization Approaches and Their SOAP Dimensions

Parallelization Approach                      | Sample | Operator | Attribute | Parameter
----------------------------------------------|--------|----------|-----------|----------
Data parallelism                              |   ✓    |          |           |
Model parallelism                             |        |    ✓     |           |    ✓
Manually designed (Krizhevsky, 2014)          |   ✓    |          |           |    ✓
Manually designed (Wu et al., 2016)           |   ✓    |    ✓     |           |
Mesh-TensorFlow                               |   ✓    |          |           |    ✓
Automatically generated: ColocRL              |        |    ✓     |           |
Automatically generated: Tofu and SoyBean     |   ✓    |          |           |    ✓
Automatically generated: GPipe and PipeDream  |   ✓    |    ✓     |           |
The SOAP search space (our work)              |   ✓    |    ✓     |     ✓     |    ✓
Slide 17: Data Parallelism
A possible parallelization strategy in the SOAP search space.
[Figure: the sample dimension partitioned across GPU1-GPU4; the parameter dimension is replicated.]
Slide 19: FlexFlow
Inputs: a DNN architecture (e.g., Conv → Conv → Concat → MatMul) and a device topology (CPUs, GPUs, and the network connecting them).
The execution optimizer couples an MCMC search algorithm with an execution simulator that serves as the cost model: the search proposes candidate strategies, the simulator predicts each candidate's performance, and the best found strategy is handed to the distributed runtime.
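The search side of slide 19 is a Metropolis-style MCMC loop. Below is a minimal sketch; the `simulate` callable (standing in for the execution simulator) and `random_mutation` (re-partitioning one operator) are illustrative assumptions, not FlexFlow's actual interfaces.

```python
# A minimal sketch of an MCMC search loop in the style of FlexFlow's
# execution optimizer (not its actual implementation).
import math
import random

def mcmc_search(initial_strategy, simulate, random_mutation,
                budget=10000, beta=0.05):
    current = initial_strategy
    current_cost = simulate(current)          # predicted execution time
    best, best_cost = current, current_cost
    for _ in range(budget):
        proposal = random_mutation(current)   # e.g., change one operator's config
        proposal_cost = simulate(proposal)
        # Metropolis acceptance: always accept improvements; accept
        # regressions with probability exp(-beta * delta) to escape
        # local minima.
        delta = proposal_cost - current_cost
        if delta < 0 or random.random() < math.exp(-beta * delta):
            current, current_cost = proposal, proposal_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best
```

Simulating a candidate instead of running it is what makes exploring thousands of strategies cheap.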
Slide 20: Evaluation
Speedup over SOTA (training throughput in samples per second, on clusters with four K80 GPUs per node):

DNN      | AlexNet | ResNet-50 | Inception-v3 | RNNTC | RNNLM | GNMT
FlexFlow |  3.3x   |   1.1x    |     1.6x     | 1.7x  | 1.9x  | 2.4x

[Figure: training throughput vs. number of nodes; FlexFlow is 1.7x faster.]
Slide 21: Overview (transition to Part 2: optimized computation graphs via auto-generated substitutions and cost-based backtracking search)
Slide 22: Optimizing DNN Computation with Automated Generation of Graph Substitutions
SysML'19
Slide 23: Current Practice: Rule-Based Graph Transformations
Apply graph transformations designed by domain experts, e.g., fuse a convolution and a ReLU into a single "conv + relu" operator.
[Figure: a graph of Conv3x3, Conv1x1, add, and relu nodes; fusing each conv with its relu yields "Conv3x3 + Relu" and "Conv1x1 + Relu" nodes.]
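A rule such as the conv + relu fusion above can be sketched as a pattern-matching rewrite over the computation graph. The Node type and the rule below are illustrative assumptions, not TensorFlow's or XFlow's code.

```python
# A minimal sketch of a rule-based fusion pass: match a small pattern
# and replace it with a fused operator.
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    op: str                  # e.g., "conv3x3", "relu", "conv3x3+relu"
    inputs: tuple = ()

def fuse_conv_relu(node: Node) -> Node:
    """Rewrite relu(conv3x3(x)) into a single fused 'conv3x3+relu' node."""
    # Rewrite the subgraph below this node first (bottom-up traversal).
    node = Node(node.op, tuple(fuse_conv_relu(i) for i in node.inputs))
    if node.op == "relu" and len(node.inputs) == 1 \
            and node.inputs[0].op == "conv3x3":
        conv = node.inputs[0]
        return Node("conv3x3+relu", conv.inputs)
    return node

# relu(conv3x3(x)) becomes conv3x3+relu(x)
x = Node("input")
g = Node("relu", (Node("conv3x3", (x,)),))
assert fuse_conv_relu(g).op == "conv3x3+relu"
```

Every such rule is hand-written, which is the root of the limitations on the next two slides.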
Slide 24: Limitations of Rule-Based Approaches
Robustness: experts' heuristics do not apply to all DNNs and hardware. User reports about XLA (TensorFlow's graph optimizer) illustrate this: "When I turned on XLA, the training speed was about 20% slower." "With XLA, my program is almost 2x slower than without XLA."
Slide 25: Limitations of Rule-Based Approaches
- Robustness: experts' heuristics do not apply to all DNNs and hardware.
- Scalability: new operators and graph structures require more rules; TensorFlow involves ~4K LOC to optimize a new operator.
- Performance: subtle optimizations for specific DNNs and hardware are missed.
Slide 26: A Missing Graph Optimization
[Figure: a graph with two Conv3x3 + Relu branches and a Conv1x1 + Relu branch feeding an Add and a Relu.]
A sequence of substitutions transforms the graph: enlarge the Conv1x1 to a Conv3x3, fuse the matching convs, fuse the conv and the add, fuse the conv and the relu, and split the result. The final graph is 1.3x faster on V100 but 10% slower on K80. Rule-based optimizers miss this sequence because the intermediate steps (e.g., enlarging a convolution) do not improve performance on their own, and because no fixed rule set is best on all hardware.
Slide 27: Can We Automatically Find These Optimizations?
Approach: automatically generated graph substitutions.
Slide 28: XFlow
From operator specifications, a graph substitution generator enumerates candidate substitutions, and a graph substitution verifier checks them, yielding a set of verified substitutions. Given an input computation graph, a cost-based search algorithm applies the verified substitutions to produce an optimized computation graph.
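The cost-based search can be sketched as a best-first backtracking loop over verified substitutions. The graph representation, the `cost` function, and the `alpha` pruning factor below are illustrative assumptions rather than XFlow's exact implementation.

```python
# A minimal sketch of cost-based backtracking search: keep a priority
# queue of candidate graphs ordered by cost, and prune candidates whose
# cost exceeds alpha times the best cost found so far.
import heapq
import itertools

def backtracking_search(graph, substitutions, cost, alpha=1.05):
    """graph: an immutable, hashable computation graph; substitutions: a
    list of rewrite functions, each yielding rewritten graphs;
    cost: graph -> float (the cost model)."""
    counter = itertools.count()           # tie-breaker for the heap
    best, best_cost = graph, cost(graph)
    queue = [(best_cost, next(counter), graph)]
    seen = {graph}
    while queue:
        c, _, g = heapq.heappop(queue)
        if c > alpha * best_cost:         # prune unpromising branches
            continue
        for subst in substitutions:
            for g2 in subst(g):           # all ways to apply this rule to g
                if g2 in seen:
                    continue
                seen.add(g2)
                c2 = cost(g2)
                if c2 < best_cost:
                    best, best_cost = g2, c2
                heapq.heappush(queue, (c2, next(counter), g2))
    return best
```

Because pruning keeps candidates up to alpha times worse than the current best, the search can pass through the temporarily non-improving steps that slide 26 showed are necessary.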
Slide 29: End-to-End Inference Performance
Using ~500 automatically generated substitutions, XFlow is competitive with SOTA on standard DNNs and outperforms SOTA on unconventional DNNs.
[Figure: per-DNN speedups of 1.0x, 1.3x, 2.9x, 1.5x, and 1.4x over SOTA.]
Slide 30: Open Problems
- Can we design better search spaces for parallelization and graph optimizations?
- Can we find more efficient search algorithms?
- Can we use search-based optimizations in other domains?
Slide 31: Conclusion
A search space of possible strategies + a cost model and a search algorithm = optimized strategies
- Parallelization across devices 1..N: the SOAP search space + MCMC search → fast parallelization strategies.
- Graph optimizations: auto-generated graph substitutions + cost-based backtracking search → optimized computation graphs.
https://github.com/flexflow/FlexFlow
Slide 32: Backup Slides