
Slide 1

Signal Processing and Networking for Big Data Applications
Lectures 12-13: Regularization and Optimization

Zhu Han, University of Houston. Thanks to Xusheng Du and Kevin Tsai for slide preparation.

Slide 2

Part 1: Regularization Outline
• Parameter Norm Penalties
• Dataset Augmentation
• Noise Robustness
• Semi-Supervised Learning
• Multi-Task Learning
• Early Stopping
• Parameter Tying and Parameter Sharing
• Sparse Representations
• Bagging and Other Ensemble Methods
• Dropout
• Adversarial Training
• Federated Training

Slide 3

Regularization
• Reduce test error.
• Solve overfitting (Pediatric ADHD example).
• Regularization of an estimator works by trading increased bias for reduced variance (mention CRLB vs. MUSIC).

Slide 4

Parameter Norm Penalties
• Norm penalty: $\Omega(\theta)$, where $\theta$ denotes the parameters/weights.
• Note: in this chapter we ignore the bias for simplicity, so $w$ indicates the weights that should be affected by a norm penalty, while $\theta$ includes both $w$ and the unregularized parameters.
• Objective function: $J(\theta; X, y)$.
• Regularized objective function: $\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \Omega(\theta)$.
• $\alpha \in [0, \infty)$ is a hyperparameter; $\alpha = 0$ results in no regularization.

Slide 5

Parameter Norm Penalties
• Minimizing $\tilde{J}$ decreases both the original objective $J$ and the size of the parameters $\theta$ (or some subset of them).
• Typically we penalize only the weights and leave the biases unregularized.
• Each bias controls only a single variable, so regularizing the bias parameters can introduce a significant amount of underfitting.

Slide 6

Parameter Norm Penalties: L2 Parameter Regularization
• Weight decay, also known as ridge regression or Tikhonov regularization.
• Regularization term: $\Omega(w) = \frac{1}{2}\|w\|_2^2$.
• Total objective function to minimize: $\tilde{J}(w; X, y) = \frac{\alpha}{2} w^\top w + J(w; X, y)$.
• Parameter gradient: $\nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y)$.
• Weight update: $w \leftarrow (1 - \epsilon\alpha)\, w - \epsilon \nabla_w J(w; X, y)$.
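
To make the update concrete, here is a minimal NumPy-style sketch of one SGD step on the L2-regularized objective (not from the slides; `grad_J`, `alpha`, and `eps` are illustrative placeholders):

```python
def l2_sgd_step(w, grad_J, alpha, eps):
    """One SGD step on the L2-regularized objective (a sketch).

    w      : weight vector (NumPy array)
    grad_J : gradient of the unregularized loss J at w
    alpha  : regularization strength (hyperparameter)
    eps    : learning rate
    """
    # Regularized gradient is alpha*w + grad_J; the update is equivalent to
    # multiplicatively shrinking the weights, then taking an ordinary step.
    return (1.0 - eps * alpha) * w - eps * grad_J
```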

Slide 7

Parameter Norm Penalties: L1 Parameter Regularization
• Regularization term: $\Omega(w) = \|w\|_1 = \sum_i |w_i|$.
• Total objective function to minimize: $\tilde{J}(w; X, y) = \alpha \|w\|_1 + J(w; X, y)$.
• Parameter gradient: $\nabla_w \tilde{J}(w; X, y) = \alpha\,\mathrm{sign}(w) + \nabla_w J(w; X, y)$.
• The gradient no longer scales linearly with each $w_i$; the regularization contribution is the constant $\alpha\,\mathrm{sign}(w_i)$.
• Because it induces sparsity, L1 regularization is often used as a feature selection mechanism (LASSO).

Slide 8

Dataset Augmentation

Slide 9

Dataset Augmentation
• Limited data → create fake data.
• A very effective technique for object recognition.
• Injecting noise: add noise to the data.
• Difficult for some tasks, e.g., density estimation (one would have to solve the density estimation problem first).
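
As a hedged illustration of creating "fake" data, here is a small NumPy sketch that applies label-preserving transforms to an image; the specific transforms (horizontal flip, padded random crop) are assumptions, not prescribed by the slides:

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and cropped copy of an (H, W, C) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                       # horizontal flip
    padded = np.pad(image, ((2, 2), (2, 2), (0, 0)), mode="reflect")
    top, left = rng.integers(0, 5), rng.integers(0, 5)  # random 2-pixel shift
    h, w = image.shape[:2]
    return padded[top:top + h, left:left + w, :]

rng = np.random.default_rng(0)
fake_sample = augment(np.zeros((32, 32, 3)), rng)       # same label as original
```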

Slide 10

Dataset Augmentation: Things to Look Out For
• Fake data must be generated carefully.
• Image rotation can produce images of a different class.
• Too much noise.
• And more...

Slide 11

Noise Robustness

Slide 12

Noise Robustness
• Noise injection can be much more powerful than simply shrinking the parameters.
• Noise can be added to the inputs or to the weights.
• Original objective function: $J = \mathbb{E}_{p(x,y)}\big[(\hat{y}(x) - y)^2\big]$.
• Objective function with weight noise $\epsilon_W \sim \mathcal{N}(\epsilon; 0, \eta I)$: $\tilde{J}_W = \mathbb{E}_{p(x,y,\epsilon_W)}\big[(\hat{y}_{\epsilon_W}(x) - y)^2\big]$.

Slide 13

Semi-Supervised Learning

Slide 14

Semi-Supervised Learning
• Goal: inputs that have similar representations should be classified into the same class.
• Learn from both labeled and unlabeled inputs to estimate $P(y \mid x)$.
• Usually separate supervised and unsupervised models are constructed with shared parameters.

Slide 15

Semi-Supervised Learning
• Compare labeled input with unlabeled input; fill in missing data.
• Usually used for pre-processing.
• More detail: Salakhutdinov and Hinton (2008) and Chapelle et al. (2006).

F1   F2   F3   Y (class)
A    B    C    1
A    ?    C    ?    (fill in the missing values, e.g., F2 = B, by comparison with the labeled row)

Slide 16

Multi-Task Learning

Slide 17

Multi-Task Learning
• When part of a model is shared across tasks, that part is more constrained toward good values.
• Parameters split into generic (shared) parameters and task-specific parameters.
• Deep learning point of view: among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.

Slide 18

Early Stopping

Slide 19

Early Stopping
• Training error decreases steadily, but validation set error begins to increase.
• Stop training when the validation error reaches a local minimum, i.e., when it has not improved for a certain amount of time.
• Save the parameters whenever the validation error improves.

Slide 20

Early Stopping: Advantages and Disadvantages
Advantages:
• Reduces computational cost.
• Does not damage the learning dynamics.
• The early-stopping check can run in parallel with the training process (ideally on a separate CPU or GPU).
Disadvantages:
• Requires a validation set.
• Requires memory to save the parameters.

Slide 21

Early Stopping: Algorithm
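
The algorithm box on this slide is not reproduced in the transcript; below is a hedged Python sketch of the usual early-stopping loop it describes (the `train_step`/`validation_error` hooks, the `model.params` attribute, and the patience value are hypothetical):

```python
import copy

def train_with_early_stopping(model, train_step, validation_error, patience=10):
    """Train until validation error stops improving for `patience` checks."""
    best_err = float("inf")
    best_params = copy.deepcopy(model.params)
    fails = 0
    while fails < patience:
        train_step(model)                   # e.g., one epoch of updates
        err = validation_error(model)
        if err < best_err:                  # improvement: save parameters
            best_err = err
            best_params = copy.deepcopy(model.params)
            fails = 0
        else:
            fails += 1
    model.params = best_params              # restore the best parameters seen
    return model, best_err
```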

Slide 22

Parameter Tying and Parameter Sharing

Slide 23

Parameter Tying
• We might not know precisely what values the parameters should take.
• From knowledge of the domain and model architecture, we may know that there should be some dependencies between the model parameters.

Slide 24

Parameter Tying: Example
• Two models, $A$ and $B$, perform the same classification task with parameters $w^{(A)}$ and $w^{(B)}$.
• Assumption: the two models see different input distributions.
• The two models map the input to different but related outputs: $\hat{y}^{(A)} = f(w^{(A)}, x)$ and $\hat{y}^{(B)} = g(w^{(B)}, x)$.
• If $w^{(A)}$ should be close to $w^{(B)}$, we can use the penalty $\Omega(w^{(A)}, w^{(B)}) = \|w^{(A)} - w^{(B)}\|_2^2$.

Slide 25

Parameter Sharing
• Force sets of parameters to be equal.
• Leads to a significant reduction in the memory footprint of the model.
• Example: convolutional neural networks (CNNs).

Slide 26

Sparse Representations

Slide 27

Sparse Representations
• Place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse.
• Sparse parametrization: many of the parameters become zero (or close to zero).
• Penalties can be derived from a Student-t prior (Olshausen and Field, 1996; Bergstra, 2011), KL divergence penalties (Larochelle and Bengio, 2008), etc.

Slide 28

Bagging

Slide 29

Bagging
• Definition: a technique for reducing generalization error by combining several models.
• Model averaging: train several models and have them vote on the output.
• Different models will usually not make the same errors on the test set. Suppose each of $k$ models makes error $\epsilon_i$ on each example, with variances $\mathbb{E}[\epsilon_i^2] = v$ and covariances $\mathbb{E}[\epsilon_i \epsilon_j] = c$; the expected squared error of the ensemble average is $\frac{1}{k}v + \frac{k-1}{k}c$.
• $c = v$ (perfectly correlated errors) means model averaging does not help.
• $c = 0$ (perfectly uncorrelated errors) means the expected squared error drops to $\frac{v}{k}$.

Slide 30

Bagging: Example
• Disadvantage: training multiple models, and evaluating multiple models on each test example, is costly.

Slide 31

Dropout

Slide 32

Dropout
• Remove non-output units by multiplying each unit's output value by zero.
• Unlike bagging, dropout effectively trains an exponentially large number of neural networks.

Slide 33

Dropout
• A minibatch-based learning algorithm.
• Randomly sample a different binary mask $\mu$ each time an example is loaded into a minibatch.
• Typical inclusion probabilities: 0.8 for each input unit and 0.5 for each hidden unit.

Slide 34

Dropout
• Let $\mu$ be the mask vector and $\theta$ be the parameters. Dropout training consists of minimizing $\mathbb{E}_\mu[J(\theta, \mu)]$, the expectation of the loss function over masks.
• The expectation contains exponentially many terms, but an unbiased estimate of its gradient can be obtained by sampling masks.
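
A minimal sketch of dropout mask sampling for one minibatch, using the inclusion probabilities from slide 33 (the two-layer shapes and ReLU activation are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, W_hidden, p_input=0.8, p_hidden=0.5):
    """Forward pass with a freshly sampled binary mask mu per minibatch."""
    mu_in = rng.random(x.shape) < p_input        # keep each input unit w.p. 0.8
    h = np.maximum(0.0, (x * mu_in) @ W_hidden)  # ReLU hidden layer
    mu_h = rng.random(h.shape) < p_hidden        # keep each hidden unit w.p. 0.5
    return h * mu_h
```

Each call samples a new mask $\mu$, so minimizing the minibatch loss approximates minimizing $\mathbb{E}_\mu[J(\theta, \mu)]$ by sampling.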

DropoutAdvantage & disadvantageAdv.:Computationally cheap

per example per updateNot significantly limit the type of model as long as it can be trained with stochastic gradient decentOptimal validation set error is much lowerDisadv.:

Large model

Much more iterations when training

 

35Slide36

Adversarial Training

Slide 37

Adversarial Training
• Probes the network's level of understanding of a task.
• Use an optimization procedure to search for inputs that make the neural network fail the task, and then use those inputs to TRAIN the network.

Slide 38

Add-On Material: Federated Learning
Collaborative machine learning without centralized training data.

Slide 39

Typical Deep Learning Model
• Data is collected from numerous phones.
• The model is trained on a cloud server.
• The trained model is sent back to the phones.
• The same model is served to all users.

Slide 40

Federated Learning
• The model is sent to the phone.
• Training happens on the phone.
• The phone summarizes the changes.
• Only the summary is sent back to the server.

Slide 41

Advantages
• Smarter models
• Lower latency
• Less power consumption
• Privacy
• Personalized models, immediately usable

Slide 42

Technical Challenges
• Optimization algorithm: SGD needs many iterations, and devices are heterogeneous, so a low-communication-throughput method is required.
• The Federated Averaging algorithm uses 10-100x less communication than a naive approach, exploiting the powerful processors in modern mobile devices to compute higher-quality updates than simple gradient steps.
• Upload speed << download speed, so updates are compressed.
• Suited to high-dimensional sparse convex models, e.g., click-through-rate prediction.

Slide 43

Security
• No user data is stored in the cloud.
• Secure Aggregation protocol: the server can decrypt only the average update, and only once hundreds or thousands of users have participated.
• No individual's update can be inspected before averaging.

Slide 44

Scenario: training runs only when
• the device is idle,
• plugged in, and
• on a free wireless connection.

Slide 45

Part II: Optimization Outline
• Difference between pure optimization and optimization in training
• Challenges in optimization of neural networks
• Introduction to some optimization algorithms

Slide 46

Goal of Optimization in Training
Find the parameters $\theta$ of a neural network that significantly reduce a cost function $J(\theta)$, which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.

Slide 47

Difference Between Pure Optimization and Optimization in Training
• Pure optimization acts directly, while machine learning (ML) acts indirectly.
• ML splits the data into a training set and a testing set.
• Suppose we have a performance measure P that is optimized on the training set; our actual goal is for the model to do a good job on the testing set.

Slide 48

Difference Between Pure Optimization and Optimization in Training
The training objective is an expectation over the empirical distribution,
$J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} L(f(x; \theta), y)$,
where $L$ is the per-example loss function, $f(x; \theta)$ is the predicted output for input $x$, $y$ is the target, $\hat{p}_{\text{data}}$ is the empirical distribution, and $p_{\text{data}}$ is the data-generating distribution.

Slide 49

Difference Between Pure Optimization and Optimization in Training: Empirical Risk Minimization
• The goal of ML is to reduce the expected generalization error
$J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}} L(f(x; \theta), y)$,
a quantity also known as the risk; here the expectation is over the true data-generating distribution $p_{\text{data}}$.
• But ML only knows the distribution of the training samples. If the true distribution were known, risk minimization would become a pure optimization problem.

Slide 50

Difference Between Pure Optimization and Optimization in Training: Empirical Risk Minimization
• Convert ML into an optimization problem by minimizing the expected loss on the training set, i.e., the empirical risk
$\mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}}[L(f(x; \theta), y)] = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$.
• However, this approach is prone to overfitting: it gives up too much generalization.

Slide 51

Difference Between Pure Optimization and Optimization in Training: Surrogate Loss Function
• Use the negative log-likelihood as a surrogate loss function.
• The negative log-likelihood allows the model to estimate the conditional probability of the classes given the input; if the model can do that well, it can pick the classes that yield the least classification error in expectation.
• Example: the 0-1 loss function, $L(\hat{y}, y) = \mathbf{1}[\hat{y} \neq y]$.

Slide 52

Difference Between Pure Optimization and Optimization in Training: Surrogate Loss Function
• Example: the 0-1 loss function. However, sometimes the loss is not simply 0 or 1.
• Even within the same class, every sample may have its own value of loss relative to other samples.

Slide 53

Example: diagnosis in a hospital. It is more costly to miss a positive case of disease (false negative) than to falsely diagnose disease (false positive): loss of a false negative > loss of a false positive.

Slide 54

Difference Between Pure Optimization and Optimization in Training: Surrogate Loss Function
• Example: 0-1 loss. Even after the 0-1 loss reaches 0, the negative log-likelihood continues to decrease, which improves the robustness of the classifier by pushing the classes further apart and accounting for the actual loss (some mistakes in diagnosis are more severe than others).

Slide 55

Difference Between Pure Optimization and Optimization in Training: Batch and Minibatch
One aspect of machine learning algorithms that separates them from general optimization algorithms is that the objective function usually decomposes as a sum over the training examples.

Slide 56

Difference Between Pure Optimization and Optimization in Training: Batch and Minibatch
$J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} L(f(x; \theta), y) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)})$

Slide 57

Difference Between Pure Optimization and Optimization in Training: Batch and Minibatch
To optimize, we need the gradient
$\nabla_\theta J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}} \nabla_\theta L(f(x; \theta), y)$.
Computing this expectation exactly is very expensive because it requires evaluating the model on every example in the entire dataset.

Slide 58

Difference Between Pure Optimization and Optimization in Training: Batch and Minibatch
• A huge dataset contains a great deal of redundancy. In the worst case, all $m$ samples in the training set could be identical copies of each other; a sampling-based estimate could then compute the correct gradient with a single sample, using $m$ times less computation than the naive approach.
• To exploit this redundancy, sampling is needed.

Slide 59

Difference Between Pure Optimization and Optimization in Training: Batch and Minibatch
• Batch training: put all training samples into training at once.
• Stochastic training (online training): draw one example at a time, like fetching water from a stream.
• Deep learning uses something in between batch and stochastic, called the minibatch.

Slide 60

Difference Between Pure Optimization and Optimization in Training: Batch and Minibatch
The size of the minibatch is driven by the following factors:
• Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
• Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
• If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.

Slide 61

Difference Between Pure Optimization and Optimization in Training: Batch and Minibatch
• Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power-of-2 batch sizes to offer better runtime. Typical power-of-2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.

Slide 62

Difference Between Pure Optimization and Optimization in Training: Batch and Minibatch
• Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability, due to the high variance in the estimate of the gradient. The total runtime can be very high because more steps are needed, both because of the reduced learning rate and because it takes more steps to observe the entire training set.

Slide 63

Optimization Outline
• Difference between pure optimization and optimization in training
• Challenges in optimization of neural networks
• Introduction to some optimization algorithms

Slide 64

Ill-Conditioning
• Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix $H$.
• Ill-conditioning can manifest by causing SGD to get "stuck," in the sense that even very small steps increase the cost function.

Slide 65

Ill-Conditioning
A second-order Taylor series expansion of the cost function predicts that a gradient descent step of $-\epsilon g$ will add
$\frac{1}{2}\epsilon^2 g^\top H g - \epsilon g^\top g$
to the cost.

Slide 66

Ill-Conditioning
• Ill-conditioning of the gradient becomes a problem when $\frac{1}{2}\epsilon^2 g^\top H g$ exceeds $\epsilon g^\top g$.
• To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm $g^\top g$ and the $g^\top H g$ term.

Slide 67

Ill-Conditioning
In many cases, the gradient norm does not shrink significantly throughout learning, but the $g^\top H g$ term grows by more than an order of magnitude. The result is that learning becomes very slow despite the presence of a strong gradient, because the learning rate must be shrunk to compensate for even stronger curvature.

Slide 68

Ill-Conditioning
• Gradient descent often does not arrive at a critical point of any kind.
• This is unlike the KKT conditions in classical optimization.

Slide 69

Local Minima
• One of the most prominent features of a convex optimization problem is that it can be reduced to the problem of finding a local minimum.
• When optimizing a convex function, we know that we have reached a good solution if we find a critical point of any kind.

Slide 70

Local Minima
• Non-identifiability of neural networks causes many local minima. For example, we could take a neural network and modify layer 1 by swapping the incoming weight vector for unit $i$ with the incoming weight vector for unit $j$, then doing the same for the outgoing weight vectors. If we have $m$ layers with $n$ units each, there are $(n!)^m$ ways of arranging the hidden units.
• This kind of non-identifiability is known as weight space symmetry.

Slide 71

Local Minima
• Local minima can be problematic if they have high cost in comparison to the global minimum.
• If local minima with high cost were common, this would pose a serious problem for gradient-based optimization algorithms.

Slide 72

Local Minima
The problem remains an active area of research, but experts now suspect that, for sufficiently large neural networks, most local minima have a low cost function value, and that it is more important to find a point in parameter space with low (but not necessarily minimal) cost than to find the true global minimum.

Slide 73

Cliffs and Exploding Gradients
Neural networks with many layers often have extremely steep regions resembling cliffs, which result from the multiplication of several large weights together. On the face of an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off the cliff structure altogether.

Slide 74

Cliffs and Exploding Gradients

Slide 75

Long-Term Dependencies
• RNNs are the canonical example: the same weight matrix $W$ is applied repeatedly, so after $t$ steps the contribution scales like $W^t$.
• Any eigenvalues $\lambda$ of $W$ that are not near an absolute value of 1 will either explode if they are greater than 1 in magnitude, or vanish if they are less than 1 in magnitude.

Slide 76

Long-Term Dependencies

Slide 77

Theoretical Limits of Optimization
• Several theoretical results show that there are limits on the performance of any optimization algorithm we might design for neural networks.
• Theoretical analysis of whether an optimization algorithm can accomplish its goal is extremely difficult.

Slide 78

Optimization Outline
• Difference between pure optimization and optimization in training
• Challenges in optimization of neural networks
• Introduction to some optimization algorithms

Slide 79

Stochastic Gradient Descent
• Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general, and for deep learning in particular.
• It is possible to obtain an unbiased estimate of the gradient by taking the average gradient over a minibatch of $m$ examples drawn i.i.d. from the data-generating distribution.

Slide 80

Stochastic Gradient Descent: Algorithm
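
The SGD algorithm box on this slide is not reproduced in the transcript; here is a hedged sketch of the minibatch SGD loop described on the previous slide (`grad_loss` is a hypothetical function returning the average gradient over a minibatch):

```python
import numpy as np

def sgd(theta, X, Y, grad_loss, eps=0.01, m=64, steps=1000, seed=0):
    """Minibatch SGD: unbiased gradient estimate from m i.i.d. examples."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.choice(len(X), size=m, replace=False)  # sample a minibatch
        g = grad_loss(theta, X[idx], Y[idx])             # average gradient
        theta = theta - eps * g                          # take a step
    return theta
```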

Slide 81

Stochastic Gradient Descent
Because the SGD gradient estimator introduces a source of noise (the random sampling of $m$ training examples) that does not vanish even when we arrive at a minimum, it is necessary to gradually decrease the learning rate over time; we therefore denote the learning rate at iteration $k$ as $\epsilon_k$.

Slide 82

Stochastic Gradient Descent
A sufficient condition to guarantee convergence of SGD is
$\sum_{k=1}^{\infty} \epsilon_k = \infty \quad\text{and}\quad \sum_{k=1}^{\infty} \epsilon_k^2 < \infty$.

Slide 83

Stochastic Gradient Descent
In practice, it is common to decay the learning rate linearly until iteration $\tau$:
$\epsilon_k = (1 - \alpha)\,\epsilon_0 + \alpha\,\epsilon_\tau$, with $\alpha = k/\tau$.
After iteration $\tau$, it is common to leave $\epsilon$ constant.
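
As a small illustration, the linear decay schedule above written as a helper function (a sketch; the endpoint values are hyperparameters):

```python
def learning_rate(k, eps0, eps_tau, tau):
    """Decay linearly from eps0 to eps_tau over tau iterations, then hold."""
    if k >= tau:
        return eps_tau
    alpha = k / tau
    return (1.0 - alpha) * eps0 + alpha * eps_tau
```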

Slide 84

Momentum
• A concept that comes from physics.
• The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.

Slide 85

Momentum
• Update rules: $v \leftarrow \alpha v - \epsilon \nabla_\theta J(\theta)$, then $\theta \leftarrow \theta + v$.
• The velocity $v$ accumulates the gradient elements $\epsilon \nabla_\theta J(\theta)$.
• The larger $\alpha$ is relative to $\epsilon$, the more previous gradients affect the current direction.
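
A one-line sketch of the momentum update just described (illustrative defaults for `alpha` and `eps`):

```python
def momentum_step(theta, v, grad, alpha=0.9, eps=0.01):
    """Classical momentum: v <- alpha*v - eps*grad, then theta <- theta + v."""
    v = alpha * v - eps * grad
    return theta + v, v
```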

Slide 86

Momentum
A poorly conditioned quadratic objective looks like a long, narrow valley or canyon with steep sides. Momentum correctly traverses the canyon lengthwise, while plain gradient steps waste time moving back and forth across the narrow axis of the canyon.

Slide 87

Momentum
• The step size is largest when many successive gradients point in exactly the same direction. If the momentum algorithm always observes gradient $g$, it will accelerate in the direction of $-g$ until reaching a terminal velocity, where the size of each step is $\frac{\epsilon \|g\|}{1 - \alpha}$.
• It is thus helpful to think of the momentum hyperparameter in terms of $\frac{1}{1 - \alpha}$. For example, $\alpha = 0.9$ corresponds to multiplying the maximum speed by 10 relative to the plain gradient descent algorithm.

Slide 88

Momentum

Slide 89

Nesterov Momentum
The update rules in this case are given by:
$v \leftarrow \alpha v - \epsilon \nabla_\theta J(\theta + \alpha v)$, then $\theta \leftarrow \theta + v$.
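
A sketch of the Nesterov update; the key difference from classical momentum is that the gradient is evaluated at the look-ahead point $\theta + \alpha v$ (`grad_J` is a hypothetical gradient function):

```python
def nesterov_step(theta, v, grad_J, alpha=0.9, eps=0.01):
    """Nesterov momentum: evaluate the gradient after applying the velocity."""
    g = grad_J(theta + alpha * v)    # look-ahead gradient
    v = alpha * v - eps * g
    return theta + v, v
```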

Slide 90

Nesterov Momentum
The difference between Nesterov momentum and standard momentum is where the gradient is evaluated. With Nesterov momentum, the gradient is evaluated after the current velocity is applied. Thus one can interpret Nesterov momentum as attempting to add a correction factor to the standard method of momentum.

Slide 91

Nesterov Momentum

Slide 92

Parameter Initialization Strategies
Training algorithms for deep learning models are usually iterative in nature and thus require the user to specify some initial point from which to begin the iterations.

Slide 93

Parameter Initialization Strategies
• Some heuristics are available for choosing the initial scale of the weights. One heuristic is to initialize the weights of a fully connected layer with $m$ inputs and $n$ outputs by sampling each weight from $U\left(-\frac{1}{\sqrt{m}}, \frac{1}{\sqrt{m}}\right)$.
• Glorot and Bengio (2010) suggest using the normalized initialization:
$W_{i,j} \sim U\left(-\sqrt{\frac{6}{m+n}}, \sqrt{\frac{6}{m+n}}\right)$.
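
The two heuristics above as a NumPy sketch for an $m$-input, $n$-output fully connected layer:

```python
import numpy as np

def init_uniform(m, n, rng):
    """Heuristic: sample each weight from U(-1/sqrt(m), 1/sqrt(m))."""
    bound = 1.0 / np.sqrt(m)
    return rng.uniform(-bound, bound, size=(m, n))

def init_glorot(m, n, rng):
    """Normalized (Glorot) initialization: U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    bound = np.sqrt(6.0 / (m + n))
    return rng.uniform(-bound, bound, size=(m, n))
```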

Slide 94

Adaptive Learning Rates
• Neural network researchers have long recognized that the learning rate is reliably one of the most difficult hyperparameters to set, because it has a significant impact on model performance.
• Recently, a number of incremental (minibatch-based) methods have been introduced that adapt the learning rates of model parameters.

Slide 95

AdaGrad
The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportionally to the square root of the sum of all of their historical squared values.

Slide 96

AdaGrad: Algorithm
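
The AdaGrad algorithm box is not reproduced in the transcript; here is a hedged sketch of the per-parameter update it describes (`delta` is a small constant for numerical stability):

```python
import numpy as np

def adagrad_step(theta, r, grad, eps=0.01, delta=1e-7):
    """AdaGrad: accumulate squared gradients; scale the step by 1/sqrt(sum)."""
    r = r + grad * grad                               # historical squared values
    theta = theta - eps * grad / (delta + np.sqrt(r))
    return theta, r
```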

Slide 97

AdaGrad
The AdaGrad algorithm enjoys some desirable theoretical properties. However, empirically it has been found that, for training deep neural network models, the accumulation of squared gradients from the beginning of training can result in a premature and excessive decrease in the effective learning rate.

Slide 98

RMSProp
• The RMSProp algorithm (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average.
• RMSProp uses an exponentially decaying average to discard history from the extreme past, so that it can converge rapidly after finding a convex bowl, as if it were an instance of the AdaGrad algorithm initialized within that bowl.

Slide 99

RMSProp: Algorithm

Slide 100

RMSProp: Algorithm (continued)
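
The RMSProp algorithm boxes on slides 99-100 are not reproduced in the transcript; here is a hedged sketch of the standard update described on slide 98, with decay rate `rho` controlling the exponentially weighted moving average:

```python
import numpy as np

def rmsprop_step(theta, r, grad, eps=0.001, rho=0.9, delta=1e-6):
    """RMSProp: exponentially decaying average of squared gradients."""
    r = rho * r + (1.0 - rho) * grad * grad          # discard distant history
    theta = theta - eps * grad / np.sqrt(delta + r)
    return theta, r
```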

Slide 101

RMSProp
Empirically, RMSProp has been shown to be an effective and practical optimization algorithm for deep neural networks. It is currently one of the go-to optimization methods employed routinely by deep learning practitioners.

Slide 102

Adam
• The name "Adam" derives from the phrase "adaptive moments."
• First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient.
• Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments, to account for their initialization at the origin.

Slide 103

Adam: Algorithm
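
The Adam algorithm box is not reproduced in the transcript; here is a hedged sketch of the update described on slide 102, including the bias corrections for the first- and second-moment estimates:

```python
import numpy as np

def adam_step(theta, s, r, t, grad, eps=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """Adam: bias-corrected moment estimates drive an adaptive step."""
    t += 1
    s = rho1 * s + (1.0 - rho1) * grad               # first moment (momentum)
    r = rho2 * r + (1.0 - rho2) * grad * grad        # second moment (uncentered)
    s_hat = s / (1.0 - rho1 ** t)                    # correct initialization bias
    r_hat = r / (1.0 - rho2 ** t)
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r, t
```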

Slide 104

Which Algorithm to Choose?
• Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, and Adam.
• Unfortunately, there is currently no consensus on this point. Schaul et al. (2014) presented a valuable comparison of a large number of optimization algorithms across a wide range of learning tasks. While the results suggest that the family of algorithms with adaptive learning rates (represented by RMSProp and AdaDelta) performed fairly robustly, no single best algorithm has emerged.

Slide 105

Thanks