CPH 636 - Dr. Charnigo, Chap. 3 Notes
Presentation Transcript

Slide 1

CPH 636 - Dr. Charnigo, Chap. 3 Notes

The authors discuss linear methods for regression, not only because of their historical importance but because (especially with some modern innovations) they may perform well when n is small, p is large, and/or the noise variance is large.

Formula (3.1) describes the model for f(x) := E[Y|X=x], which can accommodate non-linear functions of x, including dummy codes; regarding the latter, note part f of exercise 1 on your first team project.

Slide 2

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Even if f(x) is non-linear in x, f(x) is linear in the parameters β0, β1, …, βp. Thus, estimating the parameters by ordinary least squares – i.e., minimization of (3.2) – leads to the closed-form solution in (3.6). If you've not studied vector calculus, I will show you how the solution is obtained with p=1 and neglecting the intercept.

Figure 3.1 illustrates ordinary least squares when p=2.
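For illustration only (this Python/NumPy sketch is mine, not part of the course materials), here is the closed-form solution (3.6) computed directly, along with the special case p=1 with no intercept, where the estimate reduces to sum(x*y)/sum(x^2); the simulated data are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(scale=0.5, size=n)   # true slope 2, no intercept

    # General closed form (3.6): beta_hat = (X^T X)^(-1) X^T y.
    # Here X has a column of ones (intercept) plus the single predictor.
    X = np.column_stack([np.ones(n), x])
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solve() is preferred to an explicit inverse
    print("intercept and slope:", beta_hat)

    # Special case p = 1 with no intercept: beta_hat = sum(x*y) / sum(x^2).
    print("no-intercept slope:", np.sum(x * y) / np.sum(x * x))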

Slide 3

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Note that, once the parameters are estimated, we may predict Y for any x. In particular, (3.7) shows that predicting Y for x which occurred in the training data is accomplished by a matrix-vector multiplication, Y_predicted = H Y_training, where H := X (X^T X)^(-1) X^T.

To make predictions for a new data set (rather than for the training data set), simply replace the first X in H by its analogue from the new data set.
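A small sketch of the hat-matrix calculation in (3.7), again my own illustration with made-up data; note how predictions for new observations reuse (X^T X)^(-1) X^T y with a new design matrix in front.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 40, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + 2 predictors
    y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.2, size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T                  # hat matrix, n x n
    y_fitted = H @ y                       # fitted values for the training data, as in (3.7)

    # Predictions for a new data set: replace the first X by the new design matrix.
    X_new = np.column_stack([np.ones(5), rng.normal(size=(5, p))])
    y_new_pred = X_new @ XtX_inv @ X.T @ y
    print(y_fitted[:3], y_new_pred)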

Slide 4

CPH 636 - Dr. Charnigo, Chap. 3 Notes

We refer to H as the hat matrix. The hat matrix is what mathematicians call a projection matrix. Consequently, H^2 = H. Figure 3.2 provides an illustration with p=2, but I will provide an illustration with p=1 which may be easier to understand.

Assuming there are no redundancies in the columns of X, the trace of H is p+1. This trace is called the degrees of freedom (df) for the model, a concept which generalizes to nonparametric methods which are linear in the outcomes.
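As a quick numerical check (mine, not from the slides), one can verify that H is idempotent and that its trace equals p+1:

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 30, 4
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    H = X @ np.linalg.inv(X.T @ X) @ X.T

    print(np.allclose(H @ H, H))     # True: H is a projection matrix, so H^2 = H
    print(np.trace(H))               # approximately p + 1 = 5, the degrees of freedom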

Slide 5

CPH 636 - Dr. Charnigo, Chap. 3 Notes

If we regard the predictors as fixed rather than random (or, if we condition upon their observed values), then under the usual assumptions for linear regression (what are they?), we have result (3.10).

Combined with result (3.11), which says that the residual sum of squares divided by the error variance follows the chi-square distribution on n-p-1 degrees of freedom, result (3.10) forms the basis for inference on β0, β1, …, βp.

(Even if we are more interested in prediction, this is still worth your understanding.)
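A hedged sketch of how (3.10) and (3.11) yield standard errors, t statistics, and p-values in practice; the Python code and simulated data are my own illustration, not the authors':

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, p = 60, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    beta_true = np.array([1.0, 0.8, 0.0])
    y = X @ beta_true + rng.normal(size=n)

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p - 1)      # RSS / (n-p-1), unbiased for the error variance

    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))   # standard errors implied by (3.10)
    t_stats = beta_hat / se
    p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p - 1)   # T distribution on n-p-1 df
    print(np.round(np.column_stack([beta_hat, se, t_stats, p_values]), 3))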

Slide 6

CPH 636 - Dr. Charnigo, Chap. 3 Notes

The authors make the point that the T distribution which is proper to such inferences is often replaced by the Z distribution when n (or, rather, n-p-1) is large.

I think the authors have oversimplified a bit here, though, because the adequacy of the Z approximation to the T distribution depends on the desired level of confidence or significance. In any case, with modern computing powerful enough to implement methods in this book, why use such an approximation at all?

Slide 7

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Result (3.13) tells you how to test a null hypothesis that some but not all predictors in your model are unnecessary.

This is important because testing H0: β1 = β2 = 0 is not the same as testing H0: β1 = 0 and H0: β2 = 0. Possibly X1 could be deleted if X2 were kept in the model, or vice versa, while deleting both would be unwise. For example, consider X1 = SBP and X2 = DBP.
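To make (3.13) concrete, here is a minimal sketch (mine, with simulated, loosely SBP/DBP-like predictors) of the F statistic that compares the full model against a reduced model dropping X1 and X2 together:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    n = 100
    x1 = rng.normal(size=n)                        # e.g., SBP (standardized)
    x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)       # e.g., DBP, correlated with SBP
    x3 = rng.normal(size=n)
    y = 1.0 + 0.5 * (x1 + x2) + 0.7 * x3 + rng.normal(size=n)

    def rss(X, y):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        r = y - X @ beta
        return r @ r

    X_full = np.column_stack([np.ones(n), x1, x2, x3])    # 4 parameters
    X_reduced = np.column_stack([np.ones(n), x3])         # drop x1 and x2 together

    rss_full, rss_reduced = rss(X_full, y), rss(X_reduced, y)
    q = 2                                                 # number of restrictions (beta1 = beta2 = 0)
    df2 = n - X_full.shape[1]
    F = ((rss_reduced - rss_full) / q) / (rss_full / df2)  # F statistic as in (3.13)
    p_value = stats.f.sf(F, q, df2)
    print(F, p_value)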

Slide 8

CPH 636 - Dr. Charnigo, Chap. 3 Notes

One could (and sometimes does, especially in backward elimination) test whether to remove X2 and then, once X2 is gone, test whether to remove X1 as well. But that entails (a greater degree of) sequential hypothesis testing, which is not well understood in terms of its implications for actual versus nominal statistical significance.

Moreover, there are some situations in which X1 and X2 should either both be in the model or neither, such as when they are dummy variables.

Slide 9

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Result (3.15) describes how to make a confidence region for β0, β1, …, βp. I will illustrate what this looks like for β0 and β1 when p = 1. Importantly, such a region is not a rectangle.

Though not shown explicitly in (3.15), one may also make a confidence region for any subset of the parameters. I will illustrate what this looks like for β1 and β2 when p > 2, supposing that, for example, X1 = SBP and X2 = DBP.

Slide 10

CPH 636 - Dr. Charnigo, Chap. 3 Notes

The authors discuss the prostate cancer data set at some length.

Notice that they began (Figure 1.1) by exploring the data. Dr. Stromberg would be proud!

I don't quite agree with the authors' conflation of "strongest effect" with largest Z score, even though they did standardize the predictors to have unit variance. Let's discuss that…

Slide 11

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Let's also make sure we understand what the authors mean by "base error rate" and its reduction by 50%.

Returning to the idea of inference, the Gauss-Markov Theorem (and, likewise, the Cramer-Rao lower bound, for those of you who've heard of it) will permit us to conclude that, for a correctly specified model: (i) parameters are estimated unbiasedly; and (ii) parameters are estimated with minimal variance subject to (i).

Slide 12

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Although the Gauss-Markov Theorem sounds reassuring, there are some cases when we can achieve a huge reduction in variance by tolerating a modest amount of bias. This may substantially reduce both mean square error of estimation and mean square error of prediction.

Moreover, there's a big catch to the Gauss-Markov Theorem: "for a correctly specified model". How often do you suppose that (really) happens?

Slide 13

CPH 636 - Dr. Charnigo, Chap. 3 Notes

The authors proceed to describe how a multiple linear regression model can actually be viewed as the result of fitting several simple linear regression models.

They begin by noting that when the inputs are orthogonal (roughly akin to the idea of statistical independence), unadjusted and adjusted parameter estimates are identical.

Slide 14

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Usually inputs are not orthogonal. But imagine that, with standardized inputs and response (hence, no need for an intercept), we do the following:

1. Multiple linear regression of Xk on all other features.
2. Simple linear regression of Y on the residuals from step 1.

These residuals are orthogonal to the other features…

Slide 15

CPH 636 - Dr. Charnigo, Chap. 3 Notes

…and so the parameter estimate from step 2 will be the same as we would have obtained for Xk in a multiple linear regression of Y on X1, X2, …, Xp.

This gives us an alternate interpretation of an adjusted regression coefficient: we are quantifying the effect of Xk on that portion of Y which is orthogonal to (or, if you prefer, unexplained by) the other inputs.
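A small numerical check of this claim (my own sketch with simulated, standardized data): regressing Y on the residuals of Xk reproduces the coefficient of Xk from the full multiple regression.

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 200, 3
    X = rng.normal(size=(n, p))
    X[:, 2] += 0.5 * X[:, 0]                        # make the inputs correlated
    y = X @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=n)

    # Standardize inputs and response, as assumed on the slide (so no intercept is needed).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = (y - y.mean()) / y.std()

    k = 2
    others = np.delete(np.arange(p), k)

    # Step 1: multiple regression of X_k on the other features; keep the residuals.
    gamma = np.linalg.lstsq(X[:, others], X[:, k], rcond=None)[0]
    z = X[:, k] - X[:, others] @ gamma

    # Step 2: simple regression of Y on those residuals (no intercept).
    beta_k_from_residuals = (z @ y) / (z @ z)

    # Compare with the coefficient of X_k in the full multiple regression of Y on X_1, ..., X_p.
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
    print(beta_k_from_residuals, beta_full[k])      # these agree up to numerical error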

Slide 16

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Formula (3.29) then shows why it's difficult to estimate the coefficient for Xk if Xk is highly correlated with the other features. This condition, in its extreme form, is known as collinearity.

In fact, the quantity in the denominator of (3.29) is related to the so-called variance inflation factor, which is sometimes used as a diagnostic for collinearity.
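For reference, a minimal sketch (mine) of the usual variance inflation factor, VIF_k = 1 / (1 - R_k^2), where R_k^2 comes from regressing X_k on the other predictors; this is a standard diagnostic rather than formula (3.29) itself.

    import numpy as np

    def vif(X):
        """Variance inflation factor for each column of X (columns assumed centered)."""
        n, p = X.shape
        out = np.empty(p)
        for k in range(p):
            others = np.delete(np.arange(p), k)
            gamma = np.linalg.lstsq(X[:, others], X[:, k], rcond=None)[0]
            resid = X[:, k] - X[:, others] @ gamma
            r2 = 1.0 - (resid @ resid) / np.sum((X[:, k] - X[:, k].mean()) ** 2)
            out[k] = 1.0 / (1.0 - r2)
        return out

    rng = np.random.default_rng(6)
    x1 = rng.normal(size=500)
    x2 = 0.95 * x1 + 0.3 * rng.normal(size=500)     # nearly collinear with x1
    x3 = rng.normal(size=500)
    X = np.column_stack([x1, x2, x3])
    X = X - X.mean(axis=0)
    print(np.round(vif(X), 2))                      # large VIFs for x1 and x2, near 1 for x3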

Slide 17

CPH 636 - Dr. Charnigo, Chap. 3 Notes

You may have also heard of the distinction between ANOVA and MANOVA.

You may wonder, then, whether there is an analogue to MANOVA for situations when you have multiple continuous outcomes which are being regressed on multiple continuous predictors.

Slide 18

CPH 636 - Dr. Charnigo, Chap. 3 Notes

There is, but parameter estimates relating a particular outcome to the predictors do not depend on whether they are acquired for the one outcome by itself or for all outcomes simultaneously. This is true even if the multiple outcomes are correlated with each other.

This is in stark contrast, of course, to the dependence of parameter estimates relating a particular outcome to a particular predictor on whether other predictors are considered simultaneously.

Slide 19

CPH 636 - Dr. Charnigo, Chap. 3 Notes

The authors note that ordinary least squares may have little bias and large variability. (Here they are assuming that the model is specified correctly. If the model is a drastic simplification of reality, then ordinary least squares will have large bias and little variability in relation to viable competing paradigms for modeling and estimation.)

The authors therefore discuss subset selection, which may reduce variability and enhance interpretation.

Slide 20

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Best subset selection, which is computationally feasible for up to a few dozen candidate predictors, entails finding the best one-predictor model, the best two-predictor model, and so forth. This is illustrated in Figure 3.5.

Then, using either a validation data set or cross-validation (the authors do the latter in Figure 3.7), choose from among the best one-predictor model, best two-predictor model, and so forth.
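Purely as an illustration (not the authors' code), a short sketch of best subset selection followed by cross-validation, using scikit-learn on simulated data; the data and tuning choices are arbitrary.

    import itertools
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(7)
    n, p = 100, 6
    X = rng.normal(size=(n, p))
    y = 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)   # only two predictors matter

    results = {}
    for size in range(1, p + 1):
        # Find the best subset of each size, judged by training RSS (as in Figure 3.5).
        best_rss, best_cols = np.inf, None
        for subset in itertools.combinations(range(p), size):
            cols = list(subset)
            model = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - model.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_cols = rss, cols
        # Compare the size-specific winners by 10-fold cross-validated prediction error.
        cv_mse = -cross_val_score(LinearRegression(), X[:, best_cols], y,
                                  scoring="neg_mean_squared_error", cv=10).mean()
        results[size] = (best_cols, round(cv_mse, 3))

    print(results)   # pick the smallest (or a more parsimonious near-smallest) CV error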

Slide 21

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Note that the authors do not define "best" strictly by mean square error of prediction but also by considerations of parsimony.

Forward selection is an alternative to best subset selection when p is large. The authors refer to it as a "greedy algorithm" because, at each step, that predictor is chosen which explains the greatest part of the remaining variability in the outcome.

Slide 22

CPH 636 - Dr. Charnigo, Chap. 3 Notes

While that seems desirable, the end result may actually be sub-optimal if predictors are strongly correlated.

Backward elimination is another option. A disadvantage is that it may not be viable when p is large relative to n. A compelling advantage is that backward elimination can be easily implemented "manually" if not otherwise programmed into the statistical software. This may be useful in, for example, PROC MIXED of SAS.

Slide 23

CPH 636 - Dr. Charnigo, Chap. 3 Notes

In addition to explicitly choosing from among available predictors, we may also employ “shrinkage” methods for estimating parameters in a linear regression model.

These are so called because the resulting estimates are often smaller in magnitude than those acquired via ordinary least squares.

Until further notice, we assume that Y and X1, …, Xp have been standardized with respect to training data.

Slide 24

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Ridge regression is defined by formula (3.44) in the textbook and can be viewed as the solution to the penalized least squares problem expressed in (3.41).

Though perhaps not obvious, the constrained least squares problem in (3.42) is equivalent to (3.41), for an appropriate choice of t depending on λ. Moreover, (3.44) may be a good way to address collinearity. In particular, a correlation of ρ between X1 and X2 is, roughly speaking, effectively reduced to ρ / (1 + λ).
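A minimal sketch (mine, with standardized simulated data) of the ridge solution in (3.44), beta_hat = (X^T X + lambda I)^(-1) X^T y, showing how the estimates shrink as lambda grows:

    import numpy as np

    rng = np.random.default_rng(8)
    n, p = 80, 3
    X = rng.normal(size=(n, p))
    X[:, 1] = 0.9 * X[:, 0] + 0.4 * rng.normal(size=n)       # induce collinearity
    y = X @ np.array([1.0, 1.0, -0.5]) + rng.normal(size=n)

    # Standardize, as the slides assume (so no intercept is needed).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = y - y.mean()

    def ridge(X, y, lam):
        # Closed form (3.44): (X^T X + lambda I)^(-1) X^T y
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    print("OLS:      ", np.round(ridge(X, y, 0.0), 3))
    for lam in (1.0, 10.0, 100.0):
        print(f"lam={lam:>5}:", np.round(ridge(X, y, lam), 3))   # estimates shrink toward zero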

Slide 25

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Figure 3.8 displays a "ridge trace", which shows how the estimated parameters in ridge regression depend on λ. One may choose λ by cross validation, as the authors have done.

Ridge regression can also be viewed as finding a Bayesian posterior mode (whilst ordinary least squares is frequentist maximum likelihood), when the prior distribution on each parameter is normal with mean 0 and variance σ^2 / λ.

Slide 26

CPH 636 - Dr. Charnigo, Chap. 3 Notes

What do you think about ridge regression in terms of the bias / variance tradeoff?

One weakness of ridge regression is that, almost invariably, you are still "stuck" with all of the predictors. Even if the collinearity issue is satisfactorily resolved, why is this a weakness?

An alternative is the lasso, for which the corresponding optimization problems are (3.51) and (3.52).

Slide 27

CPH 636 - Dr. Charnigo, Chap. 3 Notes

There is no analytic solution for the parameter estimates with the lasso; one must use numerical optimization methods.

However, a favourable feature of the lasso is that some parameter estimates are "shrunk" to zero. In other words, some variables are effectively removed.

Figure 3.10 displays estimated parameters in relation to a shrinkage factor s which is proportional to t.
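As an illustration (again mine, not the authors'), a brief sketch using scikit-learn's lasso on simulated standardized data; its alpha parameter plays the role of lambda, and several coefficients come out exactly zero:

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    rng = np.random.default_rng(9)
    n, p = 100, 8
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)   # only two predictors matter

    # Standardize, consistent with the slides' convention.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = y - y.mean()

    ols = LinearRegression(fit_intercept=False).fit(X, y)
    lasso = Lasso(alpha=0.3, fit_intercept=False).fit(X, y)   # alpha is the penalty weight

    print("OLS:  ", np.round(ols.coef_, 2))
    print("lasso:", np.round(lasso.coef_, 2))   # several coefficients are exactly 0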

Slide 28

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Figure 3.11 explains why the lasso can effectively remove some predictors from the model.

Each red ellipse represents a contour on which the residual sum of squares equals a fixed value.

However, we are only allowed to accept parameter estimates within the blue geometric regions. So, the final parameter estimates will occur where an ellipse is tangent to a region. For a circular region, this will almost never happen along a coordinate axis.

Slide 29

CPH 636 - Dr. Charnigo, Chap. 3 Notes

The lasso also has a Bayesian interpretation, corresponding to a prior distribution on parameter estimates which has much heavier tails than a normal distribution. Thus, the lasso is less capable of reducing very large ordinary least squares estimates such as may occur with collinearity.

Table 3.4 nicely characterizes how subset selection, lasso, and ridge regression shrink ordinary least squares parameter estimates for uncorrelated predictors.

Slide 30

CPH 636 - Dr. Charnigo, Chap. 3 Notes

While having uncorrelated predictors is a fanciful notion for an observational study (versus a designed experiment), Table 3.4 helps explain why ridge regression does not produce zeroes and why the lasso is more of a "continuous" operation than subset selection.

The authors also mention the elastic net and least angle regression as shrinkage methods.

Slide 31

CPH 636 - Dr. Charnigo, Chap. 3 Notes

The former is a sort of compromise between ridge regression and the lasso, as suggested by Figure 18.5 later in the textbook.

The idea is to both reduce very large ordinary least squares estimates and eliminate extraneous predictors from the model.

The latter is similar to the lasso, as shown in Figure 3.15, and provides insight into how to compute parameter estimates for the lasso more efficiently.

Slide 32

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Besides subset selection and shrinkage methods, one may fit linear regression models via approaches based on derived input directions.

Principal components regression replaces X1, X2, …, Xp by a set of uncorrelated variables W1, W2, …, Wp such that Var(W1) ≥ Var(W2) ≥ … ≥ Var(Wp). Each W is a linear combination of the X's, such that the squared coefficients of the X's sum to 1.

Slide 33

CPH 636 - Dr. Charnigo, Chap. 3 Notes

One then uses some or all of W1, W2, …, Wp as predictors in lieu of X1, X2, …, Xp. This eliminates any problem that may exist with collinearity.

The downside of principal components regression is that a W may be difficult to interpret contextually, unless it should happen that the W is approximately proportional to an average of some of the X's or a "contrast" (the difference between the average of one subset of the X's and the average of another subset).
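A hedged sketch of principal components regression (my own, using scikit-learn's PCA on standardized simulated inputs), keeping only the first few W's as predictors:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(10)
    n, p = 150, 5
    X = rng.normal(size=(n, p))
    X[:, 1] = 0.9 * X[:, 0] + 0.4 * rng.normal(size=n)        # correlated inputs
    y = X @ np.array([1.0, 0.5, 0.0, -0.8, 0.0]) + rng.normal(size=n)

    # Standardize the inputs and center the response.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = y - y.mean()

    pca = PCA()                   # each component's loadings have squared coefficients summing to 1
    W = pca.fit_transform(X)      # W1, ..., Wp are uncorrelated, ordered by decreasing variance
    print(np.round(pca.explained_variance_, 2))

    m = 3                         # keep only the first m components
    pcr = LinearRegression(fit_intercept=False).fit(W[:, :m], y)
    print(np.round(pcr.coef_, 3))

    # The fit can be mapped back to coefficients on the original (standardized) X scale:
    beta_on_X = pca.components_[:m].T @ pcr.coef_
    print(np.round(beta_on_X, 3))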

Slide 34

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Partial least squares – which has been investigated by our own Dr. Rayens, among others – is similar to principal components regression, except that W1, W2, …, Wp are chosen in a way that Corr^2(Y, W1) Var(W1) ≥ Corr^2(Y, W2) Var(W2) ≥ … ≥ Corr^2(Y, Wp) Var(Wp).

If one intends to use only some of W1, W2, …, Wp as predictors, partial least squares may explain more variation in Y than principal components regression.
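An illustrative comparison (mine, not from the textbook) of partial least squares and principal components regression with the same small number of derived directions, using scikit-learn:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(11)
    n, p = 150, 6
    X = rng.normal(size=(n, p))
    y = X @ np.array([0.0, 0.0, 1.5, 0.0, -1.0, 0.0]) + rng.normal(size=n)

    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = y - y.mean()

    m = 2   # number of derived directions to keep

    pls = PLSRegression(n_components=m)
    pls_mse = -cross_val_score(pls, X, y, scoring="neg_mean_squared_error", cv=10).mean()

    # Principal components regression with the same number of components.
    pcr = make_pipeline(PCA(n_components=m), LinearRegression())
    pcr_mse = -cross_val_score(pcr, X, y, scoring="neg_mean_squared_error", cv=10).mean()

    print(round(pls_mse, 3), round(pcr_mse, 3))   # PLS often does at least as well with few components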

Slide 35

CPH 636 - Dr. Charnigo, Chap. 3 Notes

Figure 3.18 presents a nice illustrative comparison of ordinary least squares, best subset selection, ridge regression, lasso, principal components regression, and partial least squares.

To aid in the interpretation, note that X2 could be expressed as + or - (1/2) X1 + (sqrt(3)/2) Z, where Z is standard normal and independent of X1. Also, W1 = (X1 + or - X2) / sqrt(2) and W2 = (X1 - or + X2) / sqrt(2) for principal components regression.
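A quick numerical check (my own) of this construction: with X2 = (1/2) X1 + (sqrt(3)/2) Z, X2 has unit variance, Corr(X1, X2) is about 0.5, and the principal component directions are proportional to X1 + X2 and X1 - X2.

    import numpy as np

    rng = np.random.default_rng(12)
    n = 100_000
    x1 = rng.normal(size=n)
    z = rng.normal(size=n)
    x2 = 0.5 * x1 + (np.sqrt(3) / 2) * z             # the "+" version of the construction

    print(round(np.var(x2), 3))                      # about 1: X2 also has unit variance
    print(round(np.corrcoef(x1, x2)[0, 1], 3))       # about 0.5

    # Sample covariance matrix and its eigenvectors (the principal component directions):
    cov = np.cov(np.column_stack([x1, x2]), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    print(np.round(eigvecs, 2))    # columns proportional to (1, -1)/sqrt(2) and (1, 1)/sqrt(2), up to sign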
