
Slide1

Training DNNs with up to 5x less memory via optimal tensor rematerialization

Paras Jain. Joint work with: Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez.

To appear at MLSys 2020.

Slide2

Deep learning continues to adapt to increasingly complex applications: handwriting classification (LeCun et al., 1998), image classification (Krizhevsky et al., 2012), reinforcement learning (Silver et al., 2016), and image generation (Karras et al., 2018).

Slide3

New applications drove 330x growth in model size.

[Figure adapted from NVIDIA: parameter count (10⁹) over time.]

Slide4

Compute has been a key enabler for big models: 60x faster accelerators.

[Figure: TOPS per chip over time.]

But…

Slide5

At a per-GPU level, recent state-of-the-art models have hit a memory capacity wall. Limited GPU memory is slowing progress on new deep learning models!

[Figure: per-GPU RAM usage of recent models.]

Slide6

[Same per-GPU RAM figure as Slide5: recent state-of-the-art models have hit a memory capacity wall.]

Problem: How do we train models both efficiently and beyond memory limits?

Slide7

This work: trade off RAM for compute, efficiently.

[Chart: the RAM-vs-compute trade-off space. One extreme keeps all layers in RAM (most RAM, least compute); the other keeps no layers in RAM (least RAM, most compute). The Chen et al. (2016) heuristic of keeping every √n layers in RAM sits in between. Checkmate explores the optimal trade-off: 5x larger models with 2x cost.]

Slide8

[Same trade-off chart, highlighting one extreme.]

RAM-hungry backpropagation policy: keep all layers in RAM.

Slide9

RAM-hungry backpropagation policy: keep all layers in RAM.

[Diagram: forward pass A → B → C → D → E → Loss (with Label); backward pass ∇E → ∇D → ∇C → ∇B → ∇A; a plot of RAM used over time. During the forward pass, activations A–E accumulate in RAM, and ∇E is added once the loss is reached.]

Slide10

RAM-hungry backpropagation policy: keep all layers in RAM.

[Same diagram, one step into the backward pass: ∇E and ∇D have been computed while activations A–E are still resident.]

Slide11

RAM-hungry backpropagation policy: keep all layers in RAM.

[Same diagram, further along: ∇C has been computed; RAM usage remains near its maximum.]

Slide12

RAM-hungry backpropagation policy: keep all layers in RAM.

[Final frame: the full backward pass ∇E … ∇A completes; the RAM-used curve is annotated with its Peak RAM, reached while all activations plus the first gradients are simultaneously resident.]
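To make the peak-RAM picture concrete, here is a minimal sketch (my own illustration, not from the talk; the activation sizes and units are made-up) that simulates the keep-all policy on a linear chain:

```python
# Minimal sketch (not from the talk): peak RAM of the "keep all layers
# in RAM" policy on a linear chain A -> B -> ... -> E -> Loss.
# Activation sizes are made-up numbers; gradient tensors are ignored.

def keep_all_peak_ram(act_sizes):
    resident = []  # indices of activations currently held in RAM
    peak = 0
    # Forward pass: every layer's output is stored.
    for i in range(len(act_sizes)):
        resident.append(i)
        peak = max(peak, sum(act_sizes[j] for j in resident))
    # Backward pass: activation i is freed only after grad_i is computed,
    # so the maximum occurs at the start of the backward pass.
    for i in reversed(range(len(act_sizes))):
        peak = max(peak, sum(act_sizes[j] for j in resident))
        resident.remove(i)
    return peak

sizes = [4, 2, 3, 1, 2]            # MB for activations A..E
print(keep_all_peak_ram(sizes))    # 12, the sum of all activations
```

Under this policy, peak RAM grows linearly with network depth, which is exactly the capacity wall from the earlier slides.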

Slide13

[Trade-off chart again, highlighting the opposite extreme.]

Compute-hungry backpropagation policy: recompute all layers as needed, storing none.

Slide14

How can we use less memory? Free early & recompute.

Compute-hungry backpropagation policy: recompute all layers.

[Diagram: the same network A–E with Loss and Label; the RAM-used plot carries a dashed "Peak RAM (no recomputation)" reference line.]

Slide15

How can we use less memory? Free early & recompute.

[Same diagram, mid-forward: each activation is freed as soon as its successor is computed, so the RAM-used curve stays far below the no-recomputation peak.]

Slide16

How can we use less memory? Free early & recompute.

[Same diagram: the backward pass begins; ∇E and ∇D are computed.]

Slide17

How can we use less memory? Free early & recompute.

[Same diagram: to compute ∇C, layers A → B → C must be recomputed from the input, since their activations were freed.]

Slide18

How can we use less memory? Free early & recompute.

[Final frame: Peak RAM under the recompute-all policy is far below the no-recomputation peak, at the cost of repeatedly re-running forward layers.]
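A companion sketch (same caveats: my own illustration with made-up accounting) counts forward operations under the recompute-all policy; memory stays flat, but forward work grows quadratically with depth:

```python
# Companion sketch (not from the talk): forward-op count when nothing is
# stored and every needed input is recomputed from scratch.

def recompute_all_forward_ops(n):
    ops = n                        # the initial forward pass
    for i in range(n - 1, 0, -1):  # backward through layers n-1 .. 1
        ops += i                   # re-run layers 0..i-1 to rebuild grad_i's input
    return ops

for n in (5, 50, 500):
    print(n, recompute_all_forward_ops(n))   # 15, 1275, 125250: O(n^2) growth
```

Peak memory stays roughly constant, but the quadratic forward count is why nobody trains at this extreme in practice.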

Slide19

[Trade-off chart: the Chen et al. (2016) heuristic of keeping every √n layers in RAM sits between the keep-all-layers and recompute-all-layers extremes.]

Prior heuristic as an intermediate trade-off point.
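The √n heuristic can be hand-rolled with TensorFlow's tf.recompute_grad wrapper. This sketch is my own construction on a toy MLP, assuming TF 2.x; it keeps only segment-boundary activations and recomputes segment interiors during backprop:

```python
import math
import tensorflow as tf

# Sketch of the Chen et al. (2016) sqrt(n) heuristic using TensorFlow's
# tf.recompute_grad (TF 2.x). The toy MLP and the fixed segmentation are
# illustrative; Checkmate instead picks a schedule per network.

n_layers, width = 16, 256
layers = [tf.keras.layers.Dense(width, activation="relu") for _ in range(n_layers)]
for layer in layers:
    layer.build((None, width))      # build now, so no variables are created
                                    # inside the recomputed function

seg_len = int(math.sqrt(n_layers))  # ~sqrt(n) layers per segment

def make_segment(seg):
    @tf.recompute_grad              # interior activations are recomputed in
    def run(x):                     # backprop instead of being stored
        for layer in seg:
            x = layer(x)
        return x
    return run

segments = [make_segment(layers[i:i + seg_len])
            for i in range(0, n_layers, seg_len)]

x = tf.random.normal([8, width])
with tf.GradientTape() as tape:
    h = x
    for seg in segments:            # only segment boundaries stay in RAM
        h = seg(h)
    loss = tf.reduce_mean(tf.square(h))

params = [v for layer in layers for v in layer.trainable_variables]
grads = tape.gradient(loss, params)
print(len(grads), "gradients computed")
```

Storing one checkpoint every √n layers bounds both the number of checkpoints and the length of each recomputation segment by O(√n), which is Chen et al.'s sublinear memory cost at the price of roughly one extra forward pass.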


Slide20

How do we trade off RAM for compute optimally?

[Trade-off chart: checkpoint every node vs. recompute all layers.]

Challenges:
1. Variable runtime per layer
2. Variable RAM usage per layer
3. Real DNNs are non-linear

Slide21

Why do fixed heuristics perform poorly?

1. Latency is not constant between layers: there is a 10⁶x compute gap between the biggest and smallest layer in VGG19.

[Figure: per-layer compute in VGG19 (×10⁶).]

Slide22

Why do fixed heuristics perform poorly?

2. Tensors are not all the same size: DenseNet-201 has large variability in activation sizes between layers.

[Figure: feature-memory profile per operation, 0–4 MB. Data from https://github.com/albanie/convnet-burden/blob/master/reports/densenet201.md]
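Both kinds of variability are easy to observe directly. The following sketch (my own measurement loop, not Checkmate's profiler) records rough per-layer eager-mode latency and output-activation size for a toy CNN:

```python
import time
import tensorflow as tf

# Rough profiling sketch (my own loop, not Checkmate's profiler): measure
# per-layer eager-mode latency and output-activation size for a toy CNN.
# First-call timings include one-time setup; the point is the spread.

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(4),
    tf.keras.layers.Conv2D(128, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

h = tf.random.normal([1, 224, 224, 3])
for layer in model.layers:
    start = time.perf_counter()
    h = layer(h)
    ms = (time.perf_counter() - start) * 1e3
    mib = h.shape.num_elements() * 4 / 2**20   # float32 bytes -> MiB
    print(f"{layer.name:24s} {ms:8.2f} ms  {mib:8.2f} MiB")
```

A fixed heuristic that treats all layers alike ignores both columns; Checkmate instead feeds profiles like these into its solver.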

Slide23

Why do fixed heuristics perform poorly?

3. Real DNN architectures are non-linear.

[Examples: U-Net (11k citations), DenseNet (7k citations).]

Slide24

A system for optimal tensor rematerialization.

Pipeline: profile layers → solve integer LP → rewrite the TF2.0 graph. The graph is statically optimized once (10s to 1hr), then the optimized graph is trained for weeks.

Checkmate is composed of 3 parts:
- Profiling: hardware- and RAM-aware schedules
- Integer LP: enables finding the optimal schedule
- TF2.0 graph pass: supports TPU, GPU, CPU
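The slide does not spell out the LP, so the following is a heavily simplified sketch in the style of the formulation: a linear chain only, coarse per-stage memory accounting, and made-up costs, sizes, and budget, using the open-source PuLP solver:

```python
import pulp

# Heavily simplified Checkmate-style ILP sketch: linear chain only, coarse
# memory accounting, made-up numbers. The real formulation handles
# arbitrary DAGs and tracks memory within each stage.

n = 5
cost = [4, 2, 3, 1, 2]   # profiled runtime per op (arbitrary units)
mem  = [4, 2, 3, 1, 2]   # output-tensor size per op (arbitrary units)
budget = 7               # RAM budget

prob = pulp.LpProblem("rematerialization", pulp.LpMinimize)
# R[t][i]: op i is (re)computed during stage t.
# S[t][i]: op i's output is kept in RAM entering stage t.
R = pulp.LpVariable.dicts("R", (range(n), range(n)), cat="Binary")
S = pulp.LpVariable.dicts("S", (range(n), range(n)), cat="Binary")

# Objective: total compute over all stages.
prob += pulp.lpSum(cost[i] * R[t][i] for t in range(n) for i in range(n))

for t in range(n):
    prob += R[t][t] == 1                      # stage t must yield op t
    for i in range(t + 1, n):                 # no running ahead of schedule
        prob += R[t][i] == 0
        prob += S[t][i] == 0
    for i in range(1, t + 1):                 # op i consumes op i-1's output
        prob += R[t][i] <= R[t][i - 1] + S[t][i - 1]
    # Coarse per-stage memory bound (Checkmate's is finer-grained).
    prob += pulp.lpSum(mem[i] * (R[t][i] + S[t][i]) for i in range(n)) <= budget
for i in range(n):
    prob += S[0][i] == 0                      # nothing cached before stage 0
for t in range(1, n):
    for i in range(n):                        # can only keep what existed
        prob += S[t][i] <= S[t - 1][i] + R[t - 1][i]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for t in range(n):
    ran = [i for i in range(n) if pulp.value(R[t][i]) > 0.5]
    print(f"stage {t}: compute ops {ran}")
```

In the full formulation, forward and backward operators are all nodes of one DAG, so the same variables decide which activations to keep and which to rematerialize, and the memory constraint is enforced at every point within a stage.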

Slide25

Concrete example: VGG19, 224x224 images.

Checkpointing all nodes: batch size 167.

[Chart: what is computed, when, and what is in memory.]

Slide26

Concrete example: VGG19, 224x224 images.

Square-root heuristic: batch size 197 (1.18x larger).

[Chart: what is computed, when, and what is in memory.]

Slide27

Concrete example: VGG19, 224x224 images.

Checkmate: batch size 289 (1.73x larger), with a 10-second solve time.

[Chart: what is computed, when, and what is in memory.]

Slide28

Evaluation: Checkmate supports complex models (U-Net)

[Chart: maximum batch size vs. GPU memory available (GB), Checkmate vs. the best heuristic.]

Slide29

Evaluation: Checkmate enables training with large inputs: 295x larger input size on MobileNet.

[Chart: maximum input size vs. GPU memory available (GB), Checkmate vs. the best heuristic.]

Slide30

Key ideas:
- GPU memory limits are preventing the development of new deep learning models.
- We present the first general solution for optimal graph recomputation.
- The formulation supports arbitrary DAGs and is both hardware-aware and memory-aware.
- Integration requires just one line of code.

Code and paper: checkmateai.github.io
Email me: parasj@berkeley.edu