Slide1
Signal Processing and Networking for Big Data Applications
Lectures 12-13: Regularization and Optimization
Zhu Han, University of Houston
Thanks to Xusheng Du and Kevin Tsai for slide preparation
Slide2
Part 1: Regularization Outline
• Parameter Norm Penalties
• Dataset Augmentation
• Noise Robustness
• Semi-Supervised Learning
• Multi-Task Learning
• Early Stopping
• Parameter Tying and Parameter Sharing
• Sparse Representations
• Bagging and Other Ensemble Methods
• Dropout
• Adversarial Training
• Federated Training
Slide3
Regularization
• Reduce test error
• Solve overfitting (Pediatric ADHD)
• Regularization of an estimator works by trading increased bias for reduced variance (mention CRLB vs. MUSIC)
Slide4
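Parameter Norm Penalties
• Norm penalty $\Omega(\theta)$, where $\theta$ is the parameters/weights
• Note: In this chapter, we ignore bias for simplicity, so $w$ indicates the weights that should be affected by a norm penalty, where $\theta$ includes $w$ and the unregularized parameters
• Objective function: $J(\theta; X, y)$
• Regularized objective function: $\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)$
• $\alpha \in [0, \infty)$ is a hyperparameter; $\alpha = 0$ results in no regularization
Slide5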
Parameter Norm Penalties
• Minimizing the regularized objective $\tilde{J}$ decreases both the original objective $J$ and the size of the parameters $\theta$ (or some subset)
• The penalty typically applies only to the weights and leaves the biases unregularized
• Each bias controls only a single variable. Regularizing the bias parameters can introduce a significant amount of underfitting
Slide6
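Parameter Norm Penalties: L2 parameter regularization
• Also known as weight decay, ridge regression, or Tikhonov regularization
• Regularization term: $\Omega(\theta) = \frac{1}{2} \|w\|_2^2$
• Total objective function (to minimize): $\tilde{J}(w; X, y) = \frac{\alpha}{2} w^\top w + J(w; X, y)$
• Parameter gradient: $\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y)$
• Update weights: $w \leftarrow w - \epsilon (\alpha w + \nabla_w J(w; X, y))$
Slide7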
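Parameter Norm Penalties: L1 parameter regularization
• Regularization term: $\Omega(\theta) = \|w\|_1 = \sum_i |w_i|$
• Total objective function (to minimize): $\tilde{J}(w; X, y) = \alpha \|w\|_1 + J(w; X, y)$
• Parameter gradient: $\nabla_w \tilde{J}(w; X, y) = \alpha \, \mathrm{sign}(w) + \nabla_w J(w; X, y)$
• The penalty's contribution to the gradient no longer scales linearly with each $w_i$
• Usually used as a feature selection mechanism (LASSO)
Slide8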
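Looking back at the two parameter norm penalties, the following is a minimal sketch of one gradient step under each penalty. The function names, the step size `eps`, and the toy weight vector are assumptions for illustration only; the data term's gradient is set to zero so that only the effect of the penalty is visible.

```python
import numpy as np

def l2_step(w, grad_J, alpha, eps):
    """One gradient step on J(w) + (alpha / 2) * ||w||_2^2 (weight decay)."""
    return w - eps * (alpha * w + grad_J)

def l1_step(w, grad_J, alpha, eps):
    """One (sub)gradient step on J(w) + alpha * ||w||_1."""
    return w - eps * (alpha * np.sign(w) + grad_J)

w = np.array([2.0, -0.5, 0.0])
zero_grad = np.zeros_like(w)  # assumption: flat data term, to isolate the penalty

w_l2 = l2_step(w, zero_grad, alpha=0.1, eps=0.5)  # shrinks each weight proportionally
w_l1 = l1_step(w, zero_grad, alpha=0.1, eps=0.5)  # shrinks each weight by a constant
```

The L2 step multiplies every weight by the same factor, while the L1 step subtracts the same amount from every nonzero weight regardless of its size, which is why small weights are driven to zero under L1.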
Dataset augmentation
Slide9
Dataset augmentation
• Limited data: create fake data
• Very effective technique for object recognition
• Injecting noise: add noise to the data
• Difficult for some tasks, e.g., density estimation (generating new data requires solving the problem first)
Slide10
Dataset augmentation: things to look out for
• Data must be generated carefully
• Image rotation can produce images of a different class
• Too much noise
• And more…
Slide11
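Noise robustness
Slide12
Noise robustness
• Noise injection can be much more powerful than shrinking the parameters
• Noise can be added to the inputs or to the weights
• Original objective function: $J = \mathbb{E}_{p(x,y)} \left[ (\hat{y}(x) - y)^2 \right]$
• Objective function with weight noise $\epsilon_W$: $\tilde{J}_W = \mathbb{E}_{p(x,y,\epsilon_W)} \left[ (\hat{y}_{\epsilon_W}(x) - y)^2 \right]$
Slide13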
Semi-supervised learning
Slide14
Semi-supervised learning
• Goal: learn a representation in which inputs from the same class have similar representations
• Learn from both labeled and unlabeled inputs to estimate $P(y \mid x)$
• Usually construct separate models for the supervised and unsupervised parts, with shared parameters
Slide15
Semi-supervised learning
• Compare labeled input with unlabeled input. Fill in missing data.
• Usually used for pre-processing
• More detail: Salakhutdinov and Hinton (2008) and Chapelle et al. (2006)
• Example table: columns F1, F2, F3, Y (class); a labeled row (A, B, C, 1) and a second row (A, C) whose missing entry is filled in as B
Slide16
Multi-task learning
Slide17
Multi-task learning
• When part of a model is shared across tasks, that part of the model is more constrained towards good values
• Generic parameters
• Task-specific parameters
• Deep learning point of view: among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks
Slide18
Early stopping
Slide19
Early stopping
• Past some point, training error keeps decreasing but validation set error starts to increase
• Stop training at a local minimum of the validation error, i.e., when the validation error has not improved for a certain amount of time
• Save the parameters whenever the validation error improves
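The procedure described on this slide can be sketched as follows. This is a minimal sketch under stated assumptions: `train_step`, `validation_error`, and `patience` are hypothetical names, and the toy "model" is just a counter whose validation error is lowest at step 10.

```python
def early_stopping(train_step, validation_error, patience=5, max_steps=1000):
    """Train until the validation error has not improved for `patience` checks.

    Returns the best parameters seen and their validation error.
    """
    best_err = float("inf")
    best_params = None
    bad_checks = 0
    for _ in range(max_steps):
        params = train_step()
        err = validation_error(params)
        if err < best_err:
            # Validation error improved: save the parameters.
            best_err, best_params, bad_checks = err, params, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break
    return best_params, best_err

# Toy stand-ins (assumptions): the "parameters" are the number of steps taken,
# and the validation error is minimized after exactly 10 steps.
state = {"t": 0}

def toy_train_step():
    state["t"] += 1
    return state["t"]

def toy_validation_error(params):
    return (params - 10) ** 2

best_params, best_err = early_stopping(toy_train_step, toy_validation_error, patience=3)
```

Training continues for three checks past the best point before stopping, and the parameters returned are the saved ones rather than the last ones.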
Slide20
Early stopping: advantages and disadvantages
Advantages:
• Reduces computational cost
• Does not damage the learning dynamics
• The early-stopping check can run in parallel with the training process (ideally on a separate CPU or GPU)
Disadvantages:
• Requires a validation set
• Requires memory to save the parameters
Slide21
Early stopping: algorithm
Slide22
Parameter tying & parameter sharing
Slide23
Parameter tying
• We might not know precisely what values the parameters should take
• From knowledge of the domain and model architecture, there should be some dependencies between the model parameters
Slide24
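Parameter tying: example
• Two models ($A$ and $B$) performing the same classification task
• Assumption: different input distributions
• The two models map the input to different but related outputs: $\hat{y}^{(A)} = f(w^{(A)}, x)$ and $\hat{y}^{(B)} = g(w^{(B)}, x)$
• $w^{(A)}$ should be close to $w^{(B)}$, so we can use a penalty: $\Omega(w^{(A)}, w^{(B)}) = \|w^{(A)} - w^{(B)}\|_2^2$
Slide25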
Parameter sharing
• Force sets of parameters to be equal
• Leads to a significant reduction in the memory footprint of the model
• Example: convolutional neural networks (CNNs)
Slide26
Sparse representations
Slide27
Sparse representations
• Place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse
• Sparse parametrization: many of the parameters become zero (or close to zero)
• Penalty derived from a Student-t prior (Olshausen and Field, 1996; Bergstra, 2011), KL divergence penalties (Larochelle and Bengio, 2008), etc.
Slide28
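Bagging
Slide29
Bagging
• Definition: a technique for reducing generalization error by combining several models
• Model averaging: train several models and have them vote on the output
• Different models will usually not make the same errors on the test set
• With $k$ models whose errors $\epsilon_i$ have variances $\mathbb{E}[\epsilon_i^2] = v$ and covariances $\mathbb{E}[\epsilon_i \epsilon_j] = c$, the expected squared error of the averaged prediction is $\frac{1}{k} v + \frac{k-1}{k} c$
• $c = v$ means that model averaging does not help; $c = 0$ means that the errors are perfectly uncorrelated and the expected squared error drops to $\frac{1}{k} v$
Slide30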
Bagging: example
• Disadvantage: training multiple models and evaluating multiple models on each test example is costly
Slide31
Dropout
Slide32
Dropout
• Non-output units are removed by multiplying the unit's output value by zero
• Unlike bagging, dropout aims for an exponentially large number of neural networks
Slide33
Dropout
• Minibatch-based learning algorithm
• Randomly sample a different binary mask $\mu$ each time an example is loaded into a minibatch
• Typically an input unit is included with probability 0.8 and a hidden unit with probability 0.5
Slide34
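Dropout
• Let $\mu$ be the mask vector and $\theta$ be the parameters
• Dropout training consists in minimizing $\mathbb{E}_\mu J(\theta, \mu)$ (the expectation of the loss function over masks)
• The expectation contains exponentially many terms, but an unbiased estimate of its gradient can be obtained by sampling values of $\mu$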
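A minimal sketch of the mask sampling for one minibatch follows. The layer sizes, the ReLU nonlinearity, and all variable names are assumptions for illustration; only the inclusion probabilities 0.8 (input units) and 0.5 (hidden units) come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mask(shape, keep_prob):
    """Binary mask mu: 1 keeps a unit, 0 removes it."""
    return (rng.random(shape) < keep_prob).astype(float)

x = np.ones((4, 6))              # toy minibatch: 4 examples, 6 input units
W = rng.normal(size=(6, 3))      # toy weights of one hidden layer with 3 units

mu_in = sample_mask(x.shape, keep_prob=0.8)   # a fresh mask per example
h = np.maximum(0.0, (x * mu_in) @ W)          # hidden activations (ReLU)
mu_h = sample_mask(h.shape, keep_prob=0.5)
h_dropped = h * mu_h                          # dropped units output exactly zero
```

Each forward pass with a freshly sampled mask evaluates one of the exponentially many subnetworks, so averaging gradients over minibatches gives a sampled estimate of the expectation over masks.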
Slide35
Dropout: advantages and disadvantages
Advantages:
• Computationally cheap per example per update
• Does not significantly limit the type of model, as long as it can be trained with stochastic gradient descent
• Optimal validation set error is much lower
Disadvantages:
• Requires a large model
• Requires many more iterations when training
Slide36
Adversarial training
Slide37
Adversarial training
• Check the network’s level of understanding of a task
• Use an optimization procedure to search for inputs that make the neural network fail the task, then use those inputs to train the network
Slide38
Add-On Material: Federated Learning
Collaborative Machine Learning without Centralized Training Data
Slide39
Typical deep learning model
• Data from numerous phones
• Train in the cloud server
• Model back to phone
• Same model to all users
Slide40
Federated learning
• Model to phone
• Train on the phone
• Summarize changes
• Summary back to server
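The round trip above can be sketched as follows. This is a minimal sketch, not the production protocol: it assumes a linear least-squares model, a few local gradient steps per phone, and a server that combines the returned weights by a data-size-weighted average (the idea behind Federated Averaging). All function names and the toy data are hypothetical.

```python
import numpy as np

def local_update(w, X, y, lr=0.1, steps=5):
    """Train on one phone: a few gradient steps on its own data only."""
    w = w.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(w_global, clients):
    """Send the model out, train locally, and average the returned weights."""
    updates = [local_update(w_global, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(updates, axis=0, weights=sizes)

# Toy setup (assumption): three phones with different amounts of noiseless data.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
clients = []
for n in (20, 50, 100):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ w_true))

w_global = np.zeros(2)
for _ in range(50):
    w_global = federated_round(w_global, clients)
```

Only the updated weights leave each phone; the raw `(X, y)` pairs are never sent to the server in this sketch.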
Slide41
Advantages
• Smarter models
• Lower latency
• Less power consumption
• Privacy
• Personalized model, immediate usability
Slide42
Technical challenges
• Optimization algorithm: SGD is highly iterative
• Data is spread unevenly across devices, and device connections have low throughput
• Federated Averaging algorithm: 10-100x less communication compared to a naively federated version of SGD
• Use the powerful processors in modern mobile devices to compute higher quality updates than simple gradient steps
• Upload speed << download speed: compress the updates
• High-dimensional sparse convex models: click-through-rate prediction
Slide43
Security
• No user data is stored in the cloud
• Secure Aggregation protocol: the server can decrypt the average only when 100s or 1000s of users have participated
• No individual’s data can be inspected before averaging
Slide44
Scenario
• The device is idle
• Plugged in
• Free wireless connection
Slide45
Part II: Optimization outline
• Difference between pure optimization and optimization in training
• Challenges in optimization of neural networks
• Introduction to some optimization algorithms
Slide46
Goal of optimization in training
• Find the parameters θ of a neural network that significantly reduce a cost function J(θ), which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.
Slide47
Difference between pure optimization and optimization in training
• Pure optimization acts directly, while machine learning (ML) acts indirectly
• ML: training set vs. testing set
• Suppose we have a performance measure P. We optimize on the training set, but our goal is for P to be good on the testing set.
Slide48
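Difference between pure optimization and optimization in training
• Pure optimization acts directly, while machine learning (ML) acts indirectly
• Cost over the training set: $J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{data}} L(f(x; \theta), y)$, where $L$ is the per-example loss function, $f(x; \theta)$ is the predicted output, $y$ is the target, and $\hat{p}_{data}$ is the empirical distribution
• What we would prefer to minimize: $J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{data}} L(f(x; \theta), y)$, where $p_{data}$ is the data generating distribution
Slide49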
Difference between pure optimization and optimization in training
Empirical Risk Minimization
• The goal of ML is to reduce the expected generalization error given by $J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{data}} L(f(x; \theta), y)$, a quantity also known as the risk
• Here $L$ is the per-example loss function, $f(x; \theta)$ is the predicted output, $y$ is the target, and $p_{data}$ is the data generating distribution
• ML only knows the distribution of the training samples. If the true distribution were known, ML would reduce to an optimization problem.
Slide50
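Difference between pure optimization and optimization in training
Empirical Risk Minimization
• Convert ML to an optimization problem by minimizing the expected loss on the training set: $\mathbb{E}_{(x,y) \sim \hat{p}_{data}} [L(f(x; \theta), y)] = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$
• However, this approach is prone to overfitting: it gives up too much generalization.
Slide51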
Difference between pure optimization and optimization in training
Surrogate loss function
• Use the negative log-likelihood as a surrogate loss function.
• The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input. If the model can do that well, then it can pick the classes that yield the least classification error in expectation.
• Example: the 0-1 loss function, $L(\hat{y}, y) = 0$ if $\hat{y} = y$ and $1$ otherwise
Slide52
Difference between pure optimization and optimization in training
Surrogate loss function
• Example: the 0-1 loss function
• However, sometimes the loss is not simply 0 or 1.
• Even within the same class, every sample may have its own value of loss compared to other samples
Slide53
• Example: in hospital diagnosis, it is more costly to miss a positive case of disease (false negative) than to falsely diagnose disease (false positive). Loss of false negative > loss of false positive
Slide54
Difference between pure optimization and optimization in training
Surrogate loss function
• Example: the 0-1 loss function
• After the 0-1 loss reaches 0, the negative log-likelihood continues to decrease. This improves the robustness of the classifier by pushing the classes further apart and by taking the actual loss into account (some mistakes in diagnosis are more severe than others)
Slide55
Difference between pure optimization and optimization in training
Batch and minibatch
• One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples.
Slide56
Difference between pure optimization and optimization in training
Batch and minibatch
Slide57
Difference between pure optimization and optimization in training
Batch and minibatch
• To optimize, we need the objective and its gradient, each of which is an expectation over the training set
• Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset.
Slide58
Difference between pure optimization and optimization in training
Batch and minibatch
• In a huge dataset, there is a large amount of redundancy.
• In the worst case, all m samples in the training set could be identical copies of each other. A sampling-based estimate of the gradient could then compute the correct gradient with a single sample, using m times less computation than the naive approach.
• To exploit this redundancy, sampling is needed.
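The worst case above can be checked numerically. The sketch below assumes a linear model with a mean-squared-error cost; the function names and the toy data are hypothetical.

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of the mean squared error of a linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def minibatch_grad(w, X, y, batch_size, rng):
    """Estimate the gradient from a random subset of the training set."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return grad_mse(w, X[idx], y[idx])

# Worst case from the slide: m = 1000 identical copies of one example.
x0 = np.array([[1.0, 2.0]])
X_dup = np.repeat(x0, 1000, axis=0)
y_dup = np.full(1000, 3.0)
w = np.zeros(2)

rng = np.random.default_rng(0)
full_grad = grad_mse(w, X_dup, y_dup)                    # uses all 1000 examples
one_sample_grad = minibatch_grad(w, X_dup, y_dup, 1, rng)  # uses a single example
```

Here the single-sample estimate equals the full-batch gradient at a thousandth of the cost. On real data the estimate is noisy rather than exact, but much of the saving remains.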
Slide59
Difference between pure optimization and optimization in training
Batch and minibatch
• Batch training: put all training samples into training at once
• Stochastic training (online training): process one example at a time, like fetching “water” from a stream
• Deep learning uses something in between batch and stochastic training, which is called minibatch training.
Slide60
Difference between pure optimization and optimization in training
Batch and minibatch
The size of the minibatch is driven by the following factors:
• Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
• Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
• If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
Slide61
Difference between pure optimization and optimization in training
Batch and minibatch
• Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power-of-2 batch sizes to offer better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
Slide62
Difference between pure optimization and optimization in training
Batch and minibatch
• Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability due to the high variance in the estimate of the gradient. The total runtime can be very high due to the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
Slide63
Optimization outline
• Difference between pure optimization and optimization in training
• Challenges in optimization of neural networks
• Introduction to some optimization algorithms
Slide64
Ill-conditioning
• Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix H.
• Ill-conditioning can manifest by causing SGD to get “stuck” in the sense that even very small steps increase the cost function.
Slide65
Ill-conditioning
• Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix H.
• A second-order Taylor series expansion of the cost function predicts that a gradient descent step of $-\epsilon g$ will add $\frac{1}{2} \epsilon^2 g^\top H g - \epsilon g^\top g$ to the cost
Slide66
Ill-conditioning
• Ill-conditioning of the gradient becomes a problem when $\frac{1}{2} \epsilon^2 g^\top H g$ exceeds $\epsilon g^\top g$.
• To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm $g^\top g$ and the $g^\top H g$ term.
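A minimal sketch of this monitoring on a toy quadratic cost $\frac{1}{2} \theta^\top H \theta$, for which the second-order prediction is exact. The matrix, the starting point, the step size, and the function names are all assumptions for illustration.

```python
import numpy as np

def step_terms(g, H, eps):
    """Return (eps * g^T g, 0.5 * eps^2 * g^T H g) for a step of -eps * g."""
    gain = eps * (g @ g)
    curvature = 0.5 * eps**2 * (g @ H @ g)
    return gain, curvature

def cost(theta, H):
    return 0.5 * theta @ H @ theta

H = np.diag([1.0, 100.0])     # ill-conditioned Hessian (condition number 100)
theta = np.array([1.0, 1.0])
g = H @ theta                 # gradient of the quadratic cost at theta
eps = 0.05

gain, curvature = step_terms(g, H, eps)
change = cost(theta - eps * g, H) - cost(theta, H)
```

With these numbers the curvature term is larger than the gain, so the step increases the cost even though the gradient is large. A smaller `eps` would be needed to make progress.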
Slide67
Ill-conditioning
• In many cases, the gradient norm does not shrink significantly throughout learning, but the $g^\top H g$ term grows by more than an order of magnitude.
• The result is that learning becomes very slow despite the presence of a strong gradient, because the learning rate must be shrunk to compensate for even stronger curvature.
Slide68
Ill-conditioning
• Gradient descent often does not arrive at a critical point of any kind
• This is unlike pure optimization, where solutions satisfy the KKT conditions
Slide69
Local minima
• One of the most prominent features of a convex optimization problem is that it can be reduced to the problem of finding a local minimum.
• When optimizing a convex function, we know that we have reached a good solution if we find a critical point of any kind.
Slide70
Local minima
• Non-identifiability of neural networks causes a lot of local minima.
• For example, we could take a neural network and modify layer 1 by swapping the incoming weight vector for unit i with the incoming weight vector for unit j, then doing the same for the outgoing weight vectors. If we have m layers with n units each, then there are $n!^m$ ways of arranging the hidden units.
• This kind of non-identifiability is known as weight space symmetry.
Slide71
Local minima
• Local minima can be problematic if they have high cost in comparison to the global minimum.
• If local minima with high cost are common, this could pose a serious problem for gradient-based optimization algorithms.
Slide72
Local minima
• The problem remains an active area of research, but experts now suspect that, for sufficiently large neural networks, most local minima have a low cost function value, and that finding a true global minimum matters less than finding a point in parameter space that has low but not minimal cost.
Slide73
Cliffs and exploding gradients
• Neural networks with many layers often have extremely steep regions resembling cliffs. These result from the multiplication of several large weights together.
• On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off of the cliff structure altogether.
Slide74
Cliffs and exploding gradients
Slide75
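Long-term dependencies
• RNN as an example: repeatedly multiplying by a weight matrix $W = V \mathrm{diag}(\lambda) V^{-1}$ gives $W^t = V \mathrm{diag}(\lambda)^t V^{-1}$
• Any eigenvalues $\lambda_i$ that are not near an absolute value of 1 will either explode if they are greater than 1 in magnitude or vanish if they are less than 1 in magnitude.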
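The effect of eigenvalue magnitude can be sketched numerically. The eigenvalues, the eigenvector matrix `V`, and the horizon `t` below are assumptions chosen for illustration.

```python
import numpy as np

lam = np.array([1.1, 1.0, 0.9])   # eigenvalues above, at, and below magnitude 1
t = 100                           # number of repeated multiplications
lam_t = lam ** t                  # each eigenvalue raised to the power t

# A toy matrix with these eigenvalues: W = V diag(lam) V^{-1}
V = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
V_inv = np.linalg.inv(V)
W = V @ np.diag(lam) @ V_inv
W_t = np.linalg.matrix_power(W, t)
W_t_eig = V @ np.diag(lam_t) @ V_inv   # same matrix via the eigendecomposition
```

After 100 steps the component along the first eigenvector has grown by more than four orders of magnitude, while the component along the third has shrunk by more than four.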
Slide76
Long-term dependencies
Slide77
Theoretical limits of optimization
• Several theoretical results show that there are limits on the performance of any optimization algorithm we might design for neural networks
• Theoretical analysis of whether an optimization algorithm can accomplish this goal is extremely difficult.
Slide78
Optimization outline
• Difference between pure optimization and optimization in training
• Challenges in optimization of neural networks
• Introduction to some optimization algorithms
Slide79
Stochastic gradient descent
• Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular.
• It is possible to obtain an unbiased estimate of the gradient by taking the average gradient on a minibatch of m examples drawn i.i.d. from the data generating distribution.
Slide80
Stochastic gradient descent
Slide81
Stochastic gradient descent
• Because the SGD gradient estimator introduces a source of noise (the random sampling of m training examples) that does not vanish even when we arrive at a minimum, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at iteration k as $\epsilon_k$.
Slide82
Stochastic gradient descent
• A sufficient condition to guarantee convergence of SGD is that $\sum_{k=1}^{\infty} \epsilon_k = \infty$ and $\sum_{k=1}^{\infty} \epsilon_k^2 < \infty$
Slide83
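Stochastic gradient descent
• In practice, it is common to decay the learning rate linearly until iteration τ: $\epsilon_k = (1 - \alpha) \epsilon_0 + \alpha \epsilon_\tau$ with $\alpha = \frac{k}{\tau}$.
• After iteration τ, it is common to leave $\epsilon$ constant.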
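A minimal sketch of this linear decay schedule follows. The function name and the default values of `eps0`, `eps_tau`, and `tau` are assumptions for illustration.

```python
def learning_rate(k, eps0=0.1, eps_tau=0.001, tau=100):
    """Linear decay from eps0 to eps_tau over tau iterations, constant afterwards."""
    if k >= tau:
        return eps_tau
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau

# Learning rates at a few iterations: start, halfway, end of decay, and beyond.
schedule = [learning_rate(k) for k in (0, 50, 100, 500)]
```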
Slide84
Momentum
• A concept that comes from physics
• The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.
Slide85
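Momentum
• The velocity v accumulates the gradient elements: $v \leftarrow \alpha v - \epsilon \nabla_\theta \left( \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}) \right)$, $\theta \leftarrow \theta + v$
• The larger α is relative to ε, the more previous gradients affect the current direction.
Slide86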
Momentum
• A poorly conditioned quadratic objective looks like a long, narrow valley or canyon with steep sides.
• Momentum correctly traverses the canyon lengthwise, while gradient steps waste time moving back and forth across the narrow axis of the canyon.
Slide87
Momentum
• The step size is the largest when many successive gradients point in exactly the same direction. If the momentum algorithm always observes gradient $g$, then it will accelerate in the direction of $-g$ until reaching a terminal velocity where the size of each step is $\frac{\epsilon \|g\|}{1 - \alpha}$.
• It is thus helpful to think of the momentum hyperparameter in terms of $\frac{1}{1 - \alpha}$. For example, $\alpha = 0.9$ corresponds to multiplying the maximum speed by 10 relative to the gradient descent algorithm.
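The terminal velocity can be checked with a short simulation. This is a minimal sketch: the constant gradient, the values of `eps` and `alpha`, and the function name are assumptions for illustration.

```python
import numpy as np

def momentum_step(theta, v, grad, eps, alpha):
    """One momentum update: v <- alpha * v - eps * grad, theta <- theta + v."""
    v = alpha * v - eps * grad
    return theta + v, v

g = np.array([1.0, 0.0])        # the same gradient is observed at every step
eps, alpha = 0.01, 0.9
theta, v = np.zeros(2), np.zeros(2)
for _ in range(500):
    theta, v = momentum_step(theta, v, g, eps, alpha)

terminal_step = eps * np.linalg.norm(g) / (1 - alpha)   # predicted step size
plain_gd_step = eps * np.linalg.norm(g)                 # step size without momentum
```

With `alpha = 0.9` the step size settles at ten times the plain gradient descent step, matching the $\frac{1}{1 - \alpha}$ factor.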
Slide88
Momentum
Slide89
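Nesterov momentum
• The update rules in this case are given by: $v \leftarrow \alpha v - \epsilon \nabla_\theta \left[ \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta + \alpha v), y^{(i)}) \right]$, $\theta \leftarrow \theta + v$
Slide90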
Nesterov momentum
• The difference between Nesterov momentum and standard momentum is where the gradient is evaluated. With Nesterov momentum the gradient is evaluated after the current velocity is applied.
• Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum.
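A minimal sketch of the look-ahead gradient evaluation, run on a toy quadratic cost. The cost function, its Hessian `H`, the hyperparameter values, and the function name are assumptions for illustration.

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, eps, alpha):
    """Evaluate the gradient after applying the current velocity, then update."""
    g = grad_fn(theta + alpha * v)   # look-ahead point, not theta itself
    v = alpha * v - eps * g
    return theta + v, v

H = np.diag([1.0, 25.0])             # toy quadratic cost 0.5 * theta^T H theta

def grad_fn(theta):
    return H @ theta

theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(300):
    theta, v = nesterov_step(theta, v, grad_fn, eps=0.02, alpha=0.9)
```

The only change from standard momentum is the argument passed to `grad_fn`; replacing `theta + alpha * v` with `theta` recovers the standard method.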
Slide91
Nesterov momentum
Slide92
Parameter initialization strategy
• Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations.
Slide93
Parameter initialization strategy
• Some heuristics are available for choosing the initial scale of the weights. One heuristic is to initialize the weights of a fully connected layer with m inputs and n outputs by sampling each weight from $U\left(-\frac{1}{\sqrt{m}}, \frac{1}{\sqrt{m}}\right)$
• Glorot and Bengio (2010) suggest using the normalized initialization: $W_{i,j} \sim U\left(-\sqrt{\frac{6}{m+n}}, \sqrt{\frac{6}{m+n}}\right)$
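Both heuristics can be sketched in a few lines. The function names and the layer size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_init(m, n):
    """Sample each weight from U(-1/sqrt(m), 1/sqrt(m)) for m inputs, n outputs."""
    limit = 1.0 / np.sqrt(m)
    return rng.uniform(-limit, limit, size=(m, n))

def glorot_init(m, n):
    """Normalized initialization: U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(m, n))

W_simple = uniform_init(300, 100)
W_glorot = glorot_init(300, 100)
```

A uniform distribution on $(-a, a)$ has variance $a^2 / 3$, so the normalized initialization gives each weight a variance of $\frac{2}{m + n}$.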
Slide94
Adaptive learning rate
• Neural network researchers have long realized that the learning rate is reliably one of the hyperparameters that is the most difficult to set, because it has a significant impact on model performance.
• Recently, a number of incremental (or mini-batch-based) methods have been introduced that adapt the learning rates of model parameters.
Slide95
AdaGrad
• The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all of the historical squared values of the gradient.
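A minimal sketch of one AdaGrad update, following the common textbook form. The function name, the global learning rate `eps`, the small constant `delta`, and the toy gradient are assumptions for illustration.

```python
import numpy as np

def adagrad_step(theta, r, grad, eps=0.1, delta=1e-7):
    """Accumulate squared gradients in r, then scale each step by 1/sqrt(r)."""
    r = r + grad * grad
    theta = theta - eps / (delta + np.sqrt(r)) * grad
    return theta, r

grad = np.array([1.0, 10.0])     # two parameters with very different gradient scales
theta0, r0 = np.zeros(2), np.zeros(2)
theta1, r1 = adagrad_step(theta0, r0, grad)
theta2, r2 = adagrad_step(theta1, r1, grad)
```

On the first step both parameters move by about `eps` despite the tenfold difference in gradient scale. On the second step the move is smaller, because the accumulated sum in `r` only grows.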
Slide96
AdaGrad
Slide97
AdaGrad
• The AdaGrad algorithm enjoys some desirable theoretical properties.
• However, empirically it has been found that, for training deep neural network models, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate.
Slide98
RMSProp
• The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average.
• RMSProp uses an exponentially decaying average to discard history from the extreme past so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.
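A minimal sketch of one RMSProp update, following the common textbook form. The function name, the decay rate `rho`, the learning rate `eps`, the constant `delta`, and the toy gradient are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(theta, r, grad, eps=0.01, rho=0.9, delta=1e-6):
    """Keep an exponentially weighted moving average of squared gradients."""
    r = rho * r + (1 - rho) * grad * grad
    theta = theta - eps / np.sqrt(delta + r) * grad
    return theta, r

grad = np.array([2.0, -0.5])     # a constant gradient, for illustration only
theta, r = np.zeros(2), np.zeros(2)
for _ in range(200):
    previous = theta
    theta, r = rmsprop_step(theta, r, grad)
last_step = theta - previous
```

Under a constant gradient the moving average `r` settles at the squared gradient instead of growing without bound, so the step size stays near `eps` rather than shrinking toward zero as it would with AdaGrad.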
Slide99
RMSProp
Slide100
RMSProp
Slide101
RMSProp
• Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks.
• It is currently one of the go-to optimization methods being employed routinely by deep learning practitioners.
Slide102
Adam
• The name “Adam” derives from the phrase “adaptive moments.”
• First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient.
• Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin.
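A minimal sketch of one Adam update with both moment estimates and their bias corrections, following the common textbook form. The function name, the hyperparameter defaults, and the toy gradient are assumptions for illustration.

```python
import numpy as np

def adam_step(theta, s, r, grad, t, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam update at time step t (t starts at 1)."""
    s = rho1 * s + (1 - rho1) * grad           # first-moment estimate (momentum)
    r = rho2 * r + (1 - rho2) * grad * grad    # uncentered second-moment estimate
    s_hat = s / (1 - rho1**t)                  # correct the bias from starting at 0
    r_hat = r / (1 - rho2**t)
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r

grad = np.array([0.5, -3.0])
theta, s, r = np.zeros(2), np.zeros(2), np.zeros(2)
theta, s, r = adam_step(theta, s, r, grad, t=1)
```

Without the bias corrections the first step would be scaled by the near-zero initial moments. With them, the corrected estimates on step 1 equal the gradient and its square, so each parameter moves by about `eps` in the direction opposite its gradient.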
Slide103
Adam
Slide104
Which algorithm to choose
• Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, and Adam.
• Unfortunately, there is currently no consensus on this point.
• Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide range of learning tasks. While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged.
Slide105
Thanks