Presentation Transcript

Slide1

Search-Based Approaches to Accelerate Deep Learning

Zhihao Jia

6/23/19

Stanford University

Slide2

Deep Learning is Everywhere

Recurrent Neural Networks

Convolutional Neural Networks

Neural Architecture Search

Reinforcement Learning

Slide3

Deep Learning Deployment is Challenging

Diverse and complex DNN models

Distributed, heterogeneous hardware platforms

What operators to execute?

How to distribute these operators?

Slide4

Existing Approach: Heuristic Optimizations

Pipeline: DNN architecture → graph optimizations (rule-based, e.g., operator fusion) → parallelization (data/model parallelism) → Device 1 … Device N

Heuristic optimizations miss model- and hardware-specific optimizations, so performance is suboptimal.

Slide5

Search-Based Optimizations

A search space of possible strategies + a cost model and a search algorithm = optimized strategies

Challenge 1: How to build a search space that includes optimized strategies?

Challenge 2: How to efficiently explore the search space?
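To make this recipe concrete, here is a minimal, generic sketch (added for illustration, not taken from the talk): a search space given by a neighbors function, a cost model, and a simple greedy search algorithm. All function names are hypothetical.

```python
# Minimal sketch of "search space + cost model + search algorithm = optimized strategies".
# All names are hypothetical; this is not FlexFlow or XFlow code.
import random

def optimize(initial, neighbors, cost, steps=1000):
    """Greedy local search: `neighbors` defines the search space,
    `cost` is the cost model, and this loop is the search algorithm."""
    current, current_cost = initial, cost(initial)
    for _ in range(steps):
        candidate = random.choice(neighbors(current))  # draw from the search space
        candidate_cost = cost(candidate)               # score with the cost model
        if candidate_cost < current_cost:              # keep only improvements
            current, current_cost = candidate, candidate_cost
    return current, current_cost
```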

Slide6

Overview

A search space of possible strategies + a cost model and a search algorithm = optimized strategies

Parallelization: the SOAP search space + Markov Chain Monte Carlo search → fast parallelization strategies (outperform data/model parallelism by up to 3.3x)

Graph optimizations: auto-generated graph substitutions + cost-based backtracking search → optimized computation graphs (outperform rule-based operator fusion by up to 2.9x)

Slide7

Overview (recap)

Parallelization: the SOAP search space + Markov Chain Monte Carlo search → fast parallelization strategies

Graph optimizations: auto-generated graph substitutions + cost-based backtracking search → optimized computation graphs

Slide8

Beyond Data and Model Parallelism for Deep Neural Networks

ICML’18, SysML’19

Slide9

Current Approaches: Data and Model Parallelism

Data parallelism is the default strategy in existing DNN frameworks

Manually-designed strategies [1, 2]: combine data and model parallelism to accelerate specific DNNs

Automatically generated strategies: ColocRL [3] uses RL to find device placements for model parallelism

Exploring dimensions beyond data and model parallelism can further accelerate DNN training (by up to 3.3x).

[1] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. 2014.
[2] Wu et al. Google's neural machine translation system: Bridging the gap between human and machine translation. 2016.
[3] Mirhoseini et al. Device placement optimization with reinforcement learning. 2017.

Slide10

The SOAP Search Space

Samples, Operators, Attributes, Parameters

Slide11

The SOAP Search Space

Samples: partitioning training samples (data parallelism)
Operators
Attributes
Parameters

Example: parallelizing a 1D convolution by partitioning its sample dimension across GPU1–GPU4 (the pixel and parameter dimensions stay whole).

Slide12

The SOAP Search Space

Samples: partitioning training samples (data parallelism)
Operators: partitioning DNN operators (model parallelism)
Attributes
Parameters

Example: three 1D convolutions (Convolution#1, #2, #3), each with sample, pixel, and parameter dimensions, placed on different GPUs (GPU1–GPU3).

Slide13

The SOAP Search Space

Samples: partitioning training samples (data parallelism)
Operators: partitioning DNN operators (model parallelism)
Attributes: partitioning attributes in a sample (e.g., different pixels)
Parameters

Example: parallelizing a 1D convolution by partitioning its pixel (attribute) dimension across GPU1–GPU4.

Slide14

The SOAP Search Space

Samples: partitioning training samples (data parallelism)
Operators: partitioning DNN operators (model parallelism)
Attributes: partitioning attributes in a sample (e.g., different pixels)
Parameters: partitioning parameters in an operator

Example: parallelizing a 1D convolution by partitioning its parameter dimension across GPU1–GPU4.
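As an illustration of what a point in the SOAP space for a single operator might look like, here is a small sketch (hypothetical, not FlexFlow's actual data structures): a partition degree for each parallelizable dimension of a 1D convolution plus the devices that hold the partitions.

```python
# Hypothetical sketch of a SOAP parallelization configuration for one operator.
# Field names and the helper are illustrative, not FlexFlow's data structures.
from dataclasses import dataclass
from itertools import product

@dataclass
class ParallelConfig:
    sample: int      # S: how many ways to partition training samples (data parallelism)
    pixel: int       # A: how many ways to partition attributes within a sample
    parameter: int   # P: how many ways to partition the operator's parameters
    devices: tuple   # one device per partition; the O dimension comes from giving
                     # different operators different device assignments

def configs_for_1d_conv(num_gpus=4):
    """Enumerate per-operator partitionings of a 1D convolution that use
    exactly `num_gpus` devices."""
    gpus = tuple(f"GPU{i + 1}" for i in range(num_gpus))
    for s, a, p in product([1, 2, 4], repeat=3):
        if s * a * p == num_gpus:
            yield ParallelConfig(sample=s, pixel=a, parameter=p, devices=gpus)

# Data parallelism is the configuration with sample=4, pixel=1, parameter=1;
# the other combinations are the hybrid strategies in the SOAP space.
for cfg in configs_for_1d_conv():
    print(cfg.sample, cfg.pixel, cfg.parameter)
```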

Slide15

Hybrid Parallelism in SOAP

Example parallelization strategies for a 1D convolution

Different strategies perform the same computation.
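A quick numerical check, added here as a NumPy sketch, that two such strategies really do perform the same computation: partitioning a naive 1D convolution along the sample dimension or along the parameter (output-channel) dimension across two hypothetical devices reproduces the unpartitioned result. The conv1d helper is illustrative.

```python
# Added sketch: check that two SOAP strategies compute the same 1D convolution.
import numpy as np

def conv1d(x, w):
    """Naive 1D convolution: x is (samples, pixels), w is (out_channels, kernel);
    returns (samples, out_channels, pixels - kernel + 1)."""
    b, n = x.shape
    c, k = w.shape
    out = np.zeros((b, c, n - k + 1))
    for i in range(n - k + 1):
        out[:, :, i] = x[:, i:i + k] @ w.T
    return out

x = np.random.randn(8, 16)   # 8 samples, 16 pixels
w = np.random.randn(4, 3)    # 4 output channels (parameters), kernel size 3
full = conv1d(x, w)

# Sample partitioning: each "device" handles half of the samples.
by_sample = np.concatenate([conv1d(x[:4], w), conv1d(x[4:], w)], axis=0)
# Parameter partitioning: each "device" handles half of the output channels.
by_param = np.concatenate([conv1d(x, w[:2]), conv1d(x, w[2:])], axis=1)

assert np.allclose(full, by_sample) and np.allclose(full, by_param)
```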

Slide16

Parallelization approaches and the SOAP dimensions (Sample, Operator, Attribute, Parameter) they cover:

Data parallelism
Model parallelism
Manually-designed strategies: (Krizhevsky, 2012), (Wu et al., 2014), Mesh-TensorFlow
Automatically generated strategies: ColocRL, Tofu and SoyBean, GPipe and PipeDream
The SOAP search space: our work

Slide17

Data parallelism, shown as one possible parallelization strategy in the SOAP search space: the sample dimension is partitioned across GPU1–GPU4 and the parameters are replicated.

Slide18

Data parallelism shown again as a point in the SOAP search space (sample dimension partitioned across GPU1–GPU4).

Slide19

FlexFlow

Inputs: a DNN architecture (operators such as Conv, Conv, Concat, MatMul) and a device topology (CPUs, GPUs, and the network connecting them).

Execution optimizer: an MCMC search algorithm proposes candidate parallelization strategies, and an execution simulator (the cost model) returns each candidate's simulated performance.

The best found strategy is handed to the distributed runtime for execution.
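The execution optimizer's loop could be sketched roughly as below: a Metropolis-style MCMC search that proposes a mutated strategy, asks the execution simulator for its cost, and accepts worse candidates with a probability that decays with the cost increase. propose and simulate are hypothetical stand-ins for FlexFlow's strategy mutations and simulator, not its real API.

```python
# Sketch of a Metropolis-style MCMC search over parallelization strategies.
# `propose` and `simulate` are hypothetical stand-ins for FlexFlow's
# strategy mutations and execution simulator.
import math
import random

def mcmc_search(initial, propose, simulate, iterations=10000, beta=1.0):
    current, current_cost = initial, simulate(initial)
    best, best_cost = current, current_cost
    for _ in range(iterations):
        candidate = propose(current)            # e.g., re-partition one operator
        candidate_cost = simulate(candidate)    # simulated execution time
        delta = candidate_cost - current_cost
        # Always accept improvements; accept regressions with decaying probability.
        if delta < 0 or random.random() < math.exp(-beta * delta):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost
```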

Slide20

Evaluation: Speedup over SOTA

DNN:               AlexNet  ResNet-50  Inception-v3  RNNTC  RNNLM  GNMT
FlexFlow speedup:  3.3x     1.1x       1.6x          1.7x   1.9x   2.4x

Training throughput (samples per second) vs. number of nodes (four K80 GPUs per node); e.g., 1.7x faster on one of the plotted configurations.

Slide21

Overview (recap)

Parallelization: the SOAP search space + Markov Chain Monte Carlo search → fast parallelization strategies

Graph optimizations: auto-generated graph substitutions + cost-based backtracking search → optimized computation graphs

Slide22

Optimizing DNN Computation with Automated Generation of Graph Substitutions

SysML’19

Slide23

Current Practice: Rule-Based Graph Transformations

Apply graph transformations designed by domain experts, e.g., fuse a convolution and a relu into a "conv + relu" operator.

Example: Input feeds a Conv3x3 and a Conv1x1, each followed by a Relu; their outputs go through another Conv3x3, an Add, and a Relu. Fusing each conv with its relu yields Conv3x3 + Relu and Conv1x1 + Relu nodes.
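For concreteness, here is a toy sketch (not XFlow's or TensorFlow's internals) of what such a hand-written rule looks like: a pass over a dictionary-based computation graph that rewrites every Conv whose only consumer is a Relu into a fused ConvRelu node.

```python
# Toy computation graph: {node_name: (op_type, [input_names])}.
def fuse_conv_relu(graph):
    """Rule-based pass: rewrite Conv -> Relu chains into a fused 'ConvRelu' node
    whenever the Relu is the conv's only consumer."""
    consumers = {}
    for name, (_, inputs) in graph.items():
        for src in inputs:
            consumers.setdefault(src, []).append(name)
    fused = dict(graph)
    for name, (op, inputs) in graph.items():
        if op == "Relu" and len(inputs) == 1:
            src = inputs[0]
            if src in graph and graph[src][0] == "Conv" and consumers.get(src) == [name]:
                fused[name] = ("ConvRelu", graph[src][1])  # inherit the conv's inputs
                del fused[src]
    return fused

graph = {
    "conv3x3": ("Conv", ["input"]),
    "relu1":   ("Relu", ["conv3x3"]),
}
print(fuse_conv_relu(graph))   # {'relu1': ('ConvRelu', ['input'])}
```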

Slide24

Limitations of Rule-based Approaches

"When I turned on XLA (TensorFlow's graph optimizer), the training speed is about 20% slower."

"With XLA, my program is almost 2x slower than without XLA."

Robustness: experts' heuristics do not apply to all DNNs/hardware

Slide25

Limitations of Rule-based Approaches

Robustness: experts' heuristics do not apply to all DNNs/hardware

Scalability: new operators and graph structures require more rules

Performance: misses subtle optimizations for specific DNNs/hardware

TensorFlow involves ~4K LOC to optimize a new operator.

Slide26

A Missing Graph Optimization

Starting graph: Input feeds a Conv3x3 + Relu and a Conv1x1 + Relu; their outputs go through another Conv3x3, an Add, and a Relu.

Substitution sequence: enlarge the Conv1x1 into a Conv3x3, fuse the two Conv3x3 + Relu operators into one (adding a Split to recover the two outputs), fuse the conv & add, and fuse the conv & relu.

The final graph is 1.3x faster on V100 but 10% slower on K80.
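One way to see why greedy rule application misses this sequence: the "enlarge convs" step temporarily makes the graph more expensive, and only the later fusions pay that cost back. The numbers below are purely illustrative, not measurements; they only show the shape of the cost curve a search procedure has to tolerate.

```python
# Purely illustrative costs (not measurements) for the substitution sequence above.
steps = [
    ("initial graph",    1.00),
    ("enlarge convs",    1.05),   # temporarily worse than the previous graph
    ("fuse convs",       0.90),
    ("fuse conv & add",  0.82),
    ("fuse conv & relu", 0.77),   # roughly 1.3x faster than the initial graph
]
# A greedy rewriter stops at the first step that raises the cost;
# a search with backtracking can pass through it to reach the best graph.
greedy_stop = next(i for i in range(1, len(steps)) if steps[i][1] > steps[i - 1][1])
print("greedy stops after:", steps[greedy_stop - 1][0])
print("cost reachable with search:", min(c for _, c in steps))
```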

Slide27

Can we automatically find these optimizations?

Automatically generated graph substitutions

Slide28

XFlow

Operator specifications → Graph Substitution Generator → candidate substitutions → Graph Substitution Verifier → verified substitutions

Input computation graph + verified substitutions → Cost-Based Search Algorithm → optimized computation graph
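The cost-based search stage could be organized roughly as below: a best-first search with backtracking that keeps a priority queue of candidate graphs, expands the cheapest one by applying each verified substitution at every match site, and prunes candidates whose cost exceeds the best found so far by more than a slack factor alpha. apply_substitution, cost, and key are placeholders, not XFlow's actual interfaces.

```python
import heapq
from itertools import count

def backtracking_search(initial_graph, substitutions, apply_substitution, cost,
                        key=repr, alpha=1.05):
    """Best-first search with backtracking over verified graph substitutions,
    pruning candidates more than `alpha` times worse than the best graph found."""
    tie = count()                                   # tie-breaker for the heap
    best_graph, best_cost = initial_graph, cost(initial_graph)
    queue = [(best_cost, next(tie), initial_graph)]
    seen = {key(initial_graph)}                     # `key` canonicalizes a graph
    while queue:
        c, _, g = heapq.heappop(queue)
        if c > alpha * best_cost:
            continue                                # prune this branch
        for subst in substitutions:
            for new_g in apply_substitution(g, subst):   # every match site
                k = key(new_g)
                if k in seen:
                    continue
                seen.add(k)
                new_cost = cost(new_g)
                if new_cost < best_cost:
                    best_graph, best_cost = new_g, new_cost
                if new_cost <= alpha * best_cost:
                    heapq.heappush(queue, (new_cost, next(tie), new_g))
    return best_graph, best_cost
```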

Slide29

End-to-end Inference Performance

Speedups over SOTA across the evaluated DNNs: 1.0x, 1.3x, 2.9x, 1.5x, 1.4x.

Uses ~500 automatically generated substitutions.

Competitive with SOTA on standard DNNs; outperforms SOTA on unconventional DNNs.

Slide30

Open Problems

Can we design better search spaces for parallelization and graph optimizations?

Can we find more efficient search algorithms?

Can we use search-based optimizations in other domains?

Slide31

Conclusion

A search space of possible strategies + a cost model and a search algorithm = optimized strategies

Parallelization: the SOAP search space + Markov Chain Monte Carlo search → fast parallelization strategies

Graph optimizations: auto-generated graph substitutions + cost-based backtracking search → optimized computation graphs

https://github.com/flexflow/FlexFlow

Slide32

Backup Slides
