Slide 1
Training DNNs with up to 5x less memory via optimal tensor rematerialization
Paras Jain
Joint work with: Ajay Jain, Ani Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, Joseph Gonzalez
To appear at MLSys 2020
Slide 2
Deep learning continues to adapt to increasingly complex applications:
- Handwriting classification (LeCun et al., 1998)
- Image classification (Krizhevsky et al., 2012)
- Reinforcement learning (Silver et al., 2016)
- Image generation (Karras et al., 2018)
Slide 3
New applications drove 330x growth in model size.
[Chart: parameter count (10^9) over time; figure adapted from NVIDIA]
Slide4Compute has been a key enabler for big models:
60x faster accelerators4TOPSper chip
Z
But…
Slide5RAM usage
per-GPUAt a per-GPU level, recent state-of-the-art models have hit a memory capacity wall.5Limited GPU memory is slowing progress in new deep learning models!
Slide6RAM usage
per-GPUAt a per-GPU level, recent state-of-the-art models have hit a memory capacity wall.6Limited GPU memory is slowing progress in new deep learning models!
How do we train
models
both efficiently
and beyond memory limits?Problem:
Slide 7
This work: trade off RAM for compute, efficiently.
[Chart: RAM vs. compute. Extremes: keep all layers in RAM (high RAM, low compute) and keep no layers in RAM (low RAM, high compute); keeping every √n layers in RAM (Chen et al., 2016) sits in between. Checkmate explores the optimal trade-off: 5x larger models at 2x cost.]
Slide 8
RAM-hungry backpropagation policy: keep all layers in RAM.
[Same RAM-vs.-compute chart, highlighting the "keep all layers in RAM" extreme]
Slides 9-12
RAM-hungry backpropagation policy: keep all layers in RAM.
[Animated diagram: the forward pass computes activations A, B, C, D, E and the loss from the label; the backward pass computes gradients ∇E, ∇D, ∇C, ∇B, ∇A. The plot of RAM used over time grows as each activation is retained, and peak RAM is reached at the start of the backward pass.]
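The animation above can be summarized with a toy simulation. This is a sketch (not the paper's code), assuming a linear chain of n layers where every activation and gradient occupies one unit of memory: under the keep-all policy, peak RAM grows linearly with depth.

```python
# Sketch: simulate peak RAM for the "keep all layers in RAM"
# backprop policy on a linear chain of n layers.
# Assumption: every activation and every gradient costs 1 memory unit.

def keep_all_peak_ram(n):
    """Peak number of live tensors when all activations are retained."""
    live = set()
    peak = 0
    # Forward pass: compute and keep every activation.
    for i in range(n):
        live.add(("act", i))
        peak = max(peak, len(live))
    # Backward pass: the gradient of layer i needs activation i; once
    # grad i is produced, activation i and grad i+1 can be freed.
    for i in reversed(range(n)):
        live.add(("grad", i))
        peak = max(peak, len(live))
        live.discard(("act", i))
        live.discard(("grad", i + 1))
    return peak

print(keep_all_peak_ram(5))  # peak grows linearly with depth
```

For a 5-layer chain the peak is 6 live tensors (all activations plus the first gradient), matching the "Peak RAM" marker at the start of the backward pass in the diagram.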
Slide 13
Compute-hungry backpropagation policy: recompute all layers as needed, storing none.
[Same RAM-vs.-compute chart, highlighting the "recompute all layers" extreme]
Slides 14-18
Compute-hungry backpropagation policy: recompute all layers.
How can we use less memory? Free activations early and recompute them.
[Animated diagram: activations A-E are freed as soon as possible during the forward pass and recomputed from the input whenever the backward pass needs them for ∇E, ∇D, ∇C, ∇B, ∇A. The RAM-used-over-time plot stays far below the no-recomputation peak, at the cost of repeated forward computation.]
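A matching toy simulation of the recompute-everything extreme (again a sketch with unit layer cost and unit-size tensors, not the paper's code) shows the opposite trade-off: near-constant memory but quadratic compute.

```python
# Sketch: the opposite extreme, "recompute all layers, store none".
# To obtain activation i during the backward pass we recompute the
# chain from the input, keeping only the tensors currently needed.
# Assumption: unit cost per layer evaluation, unit memory per tensor.

def recompute_all_cost(n):
    """(total layer evaluations, peak live tensors) with no storage."""
    evals = n          # initial forward pass
    peak = 2           # current activation plus its successor
    for i in reversed(range(n)):
        evals += i + 1   # recompute act i from scratch: layers 0..i
        # live at this point: recomputed act i, grad i+1, new grad i
        peak = max(peak, 3)
    return evals, peak

evals, peak = recompute_all_cost(5)
print(evals, peak)  # quadratic compute, constant memory
```

For n = 5 this costs 20 layer evaluations instead of 5, while peak residency stays at 3 tensors regardless of depth. Checkmate's point is that neither extreme is necessary.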
Slide 19
Prior heuristics are intermediate trade-off points: keeping every √n layers in RAM (Chen et al., 2016) sits between the "keep all layers in RAM" and "recompute all layers" extremes on the RAM-vs.-compute chart.
Slide 20
How do we trade off RAM for compute optimally?
Challenges:
1. Variable runtime per layer
2. Variable RAM usage per layer
3. Real DNNs are non-linear
[Chart: RAM vs. compute, with "checkpoint every node" and "recompute all layers" as the extremes]
Slide 21
Why do fixed heuristics perform poorly?
1. Latency is not constant between layers: there is a 10^6x compute gap between the biggest and smallest layer in VGG19.
[Chart: per-layer compute in VGG19, x10^6 scale]
Slide 22
Why do fixed heuristics perform poorly?
2. Tensors are not all the same size: DenseNet-201 has large variability in activation sizes between layers.
[Chart: per-operation feature memory profile, 0-4 MB, x10^3; data from https://github.com/albanie/convnet-burden/blob/master/reports/densenet201.md]
Slide 23
Why do fixed heuristics perform poorly?
3. Real DNN architectures are non-linear, e.g. U-Net (11k citations) and DenseNet (7k citations).
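Why non-linearity matters can be seen with a tiny liveness calculation. This is a sketch on a hypothetical 6-layer graph (the edge lists are illustrative, not U-Net's actual topology): a single long-range skip connection keeps an early activation live for most of the forward pass, which chain-oriented heuristics do not account for.

```python
# Sketch: with skip connections, an early activation stays live until
# its last consumer. Compare a 6-layer chain against a graph where
# layer 5 also consumes layer 0's activation (hypothetical edges).

def peak_live(consumers, n):
    """Peak number of simultaneously live activations in one forward pass.

    consumers[i] = list of layers that read activation i.
    """
    last_use = {i: max(consumers[i], default=i) for i in range(n)}
    live, peak = set(), 0
    for t in range(n):
        live.add(t)
        peak = max(peak, len(live))
        # Free every activation whose last consumer has now run.
        live -= {i for i in live if last_use[i] <= t}
    return peak

chain = {i: [i + 1] for i in range(5)}
chain[5] = []
skip = dict(chain)
skip[0] = [1, 5]   # long-range skip edge, as in U-Net-style decoders
print(peak_live(chain, 6), peak_live(skip, 6))
```

The chain needs only 2 live activations at any time, while one skip edge raises the peak to 3; real U-Net/DenseNet graphs compound this effect, which is why Checkmate's formulation handles arbitrary DAGs.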
Slide 24
A system for optimal tensor rematerialization. Checkmate is composed of 3 parts:
1. Profiling: hardware- and RAM-aware schedules
2. Integer LP: enables finding the optimal schedule
3. TF2.0 graph pass: supports TPU, GPU, CPU
Pipeline: profile layers, solve the integer LP, then rewrite the TF2.0 graph. The graph is statically optimized once (10s to 1hr) and the optimized graph is then trained for weeks.
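At its core, the integer LP decides, for each execution stage t and operator v, whether v is (re)computed at that stage and whether its output is kept resident in memory. The following is a simplified sketch of such a formulation (notation and constraint selection are mine; the paper's exact memory accounting is more detailed), with binary variables R_{t,v} (compute v at stage t) and S_{t,v} (v's output is in memory at stage t), profiled runtime C_v, output size M_v, and RAM budget B:

```latex
\min_{R,\,S \in \{0,1\}^{T \times V}} \; \sum_{t=1}^{T} \sum_{v=1}^{V} C_v \, R_{t,v}
\qquad \text{subject to}
\begin{aligned}
\quad & R_{t,v} \le R_{t,u} + S_{t,u}       && \forall (u,v) \in E   && \text{(inputs resident or just computed)} \\
      & S_{t,v} \le R_{t-1,v} + S_{t-1,v}   && \forall t > 1, \forall v && \text{(only existing outputs can be kept)} \\
      & \textstyle\sum_{v} M_v \, S_{t,v} \le B && \forall t          && \text{(memory budget per stage)}
\end{aligned}
```

Additional constraints force each gradient operator to be evaluated by its stage, so every schedule satisfying the LP produces a complete backward pass. The objective minimizes total (re)computation time under the memory budget, which is exactly the RAM-for-compute trade-off from the earlier slides.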
Slide 25
Concrete example: VGG19, 224x224 images.
Checkpointing all nodes: batch size 167.
[Schedule visualization: what is computed, when, and what is in memory]
Slide 26
Concrete example: VGG19, 224x224 images.
Square root heuristic: batch size 197 (1.18x larger).
[Schedule visualization: what is computed, when, and what is in memory]
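The square-root heuristic used as the baseline here can be sketched in a few lines. This illustrates Chen et al.'s checkpoint-selection policy for linear chains (a sketch, not Checkmate's solver):

```python
# Sketch of the sqrt(n) checkpointing heuristic (Chen et al., 2016):
# retain roughly every ceil(sqrt(n))-th activation; everything in
# between is recomputed segment-by-segment during the backward pass.
import math

def sqrt_checkpoints(n):
    """Indices of activations kept in RAM under the sqrt(n) heuristic."""
    stride = math.ceil(math.sqrt(n))
    return list(range(0, n, stride))

print(sqrt_checkpoints(16))  # [0, 4, 8, 12] -> O(sqrt(n)) residency
```

Because the stride is fixed, the heuristic ignores per-layer runtime and tensor-size variability, which is exactly where Checkmate's profiled, per-operator LP wins the extra batch-size headroom shown on the next slide.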
Slide 27
Concrete example: VGG19, 224x224 images.
Checkmate: batch size 289 (1.73x larger), 10-second solve time.
[Schedule visualization: what is computed, when, and what is in memory]
Slide 28
Evaluation: Checkmate supports complex models (U-Net).
[Chart: Checkmate vs. the best heuristic across GPU memory available (GB)]
Slide 29
Evaluation: Checkmate enables training with large inputs: up to 5x larger input size (MobileNet).
[Chart: Checkmate vs. the best heuristic across GPU memory available (GB)]
Slide 30
Key ideas:
- GPU memory limits are preventing the development of new deep learning models.
- We present the first general solution for optimal graph recomputation.
- The formulation supports arbitrary DAGs and is both hardware-aware and memory-aware.
- Integration with just one line of code.
Code and paper: checkmateai.github.io
Email me: parasj@berkeley.edu