CPH 636 - Dr. Charnigo, Chap. 3 Notes
The authors discuss linear methods for regression, not only because of their historical importance but because (especially with some modern innovations) they may perform well when n is small, p is large, and/or the noise variance is large.

Formula (3.1) describes the model for f(x) := E[Y|X=x], which can accommodate non-linear functions of x including dummy codes; regarding the latter, note part f of exercise 1 on your first team project.
Even if f(x) is non-linear in x, f(x) is linear in the parameters β0, β1, …, βp. Thus, estimating the parameters by ordinary least squares – i.e., minimization of (3.2) – leads to the closed-form solution in (3.6). If you’ve not studied vector calculus, I will show you how the solution is obtained with p=1 and neglecting the intercept.

Figure 3.1 illustrates ordinary least squares when p=2.
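The closed-form solution (3.6), and the p=1 no-intercept special case, can be sketched numerically. Here is a small numpy sketch on simulated data (the data and variable names are mine, not the textbook's):

```python
import numpy as np

# Simulated data for illustration only (not the textbook's example)
rng = np.random.default_rng(0)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column + p predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Closed-form OLS solution (3.6): beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Special case p=1 without an intercept: beta_hat = sum(x*y) / sum(x^2)
x = X[:, 1]
beta_simple = (x @ y) / (x @ x)
```

Solving the normal equations with `np.linalg.solve` avoids explicitly inverting XᵀX, which is both cheaper and numerically safer.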
Note that, once the parameters are estimated, we may predict Y for any x. In particular, (3.7) shows that predicting Y for x which occurred in the training data is accomplished by a matrix-vector multiplication, Y_predicted = H Y_training, where H := X(X^T X)^(-1) X^T.

To make predictions for a new data set (rather than for the training data set), simply replace the first X in H by its analogue from the new data set.
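The hat-matrix prediction in (3.7), and its new-data analogue, can be sketched in numpy as follows (simulated data; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T; fitted values are H @ y, per (3.7)
H = X @ np.linalg.solve(X.T @ X, X.T)
y_fitted = H @ y

# For new data, replace the leading X in H by the new design matrix
X_new = np.column_stack([np.ones(5), rng.normal(size=(5, p))])
y_pred_new = X_new @ np.linalg.solve(X.T @ X, X.T @ y)
```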
We refer to H as the hat matrix. The hat matrix is what mathematicians call a projection matrix. Consequently, H^2 = H. Figure 3.2 provides an illustration with p=2, but I will provide an illustration with p=1 which may be easier to understand.

Assuming there are no redundancies in the columns of X, the trace of H is p+1. This trace is called the degrees of freedom (df) for the model, a concept which generalizes to nonparametric methods which are linear in the outcomes.
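Both properties, idempotence and trace equal to p+1, are easy to verify numerically (a sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # full column rank
H = X @ np.linalg.solve(X.T @ X, X.T)

idempotent = np.allclose(H @ H, H)  # projection property H^2 = H
df = np.trace(H)                    # equals p + 1 when there are no redundant columns
```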
If we regard the predictors as fixed rather than random (or, if we condition upon their observed values), then under the usual assumptions for linear regression (what are they?), we have result (3.10).

Combined with result (3.11), which says that the residual sum of squares divided by the error variance follows the chi-square distribution on n-p-1 degrees of freedom, result (3.10) forms the basis for inference on β0, β1, …, βp. (Even if we are more interested in prediction, this is still worth your understanding.)
The authors make the point that the T distribution which is proper to such inferences is often replaced by the Z distribution when n (or, rather, n-p-1) is large.

I think the authors have oversimplified a bit here, though, because the adequacy of the Z approximation to the T distribution depends on the desired level of confidence or significance. In any case, with modern computing powerful enough to implement methods in this book, why use such an approximation at all?
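The level-dependence is easy to see by comparing quantiles directly. A sketch assuming scipy is available (the degrees of freedom value is hypothetical):

```python
import numpy as np
from scipy.stats import norm, t

df = 60  # hypothetical n - p - 1
gaps = []
for conf in (0.90, 0.95, 0.99):
    q = 1 - (1 - conf) / 2
    # How much wider a T-based interval is than a Z-based one, per unit of standard error
    gaps.append(t.ppf(q, df) - norm.ppf(q))
```

The gap between the T and Z critical values grows as the desired confidence level increases, so the Z approximation is least adequate exactly where high confidence is demanded.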
Result (3.13) tells you how to test a null hypothesis that some but not all predictors in your model are unnecessary.

This is important because testing H0: β1 = β2 = 0 is not the same as testing H0: β1 = 0 and H0: β2 = 0. Possibly X1 could be deleted if X2 were kept in the model, or vice versa, while deleting both would be unwise. For example, consider X1 = SBP and X2 = DBP.
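The joint F test of (3.13) compares the residual sum of squares under the reduced and full models. A minimal sketch, with simulated stand-ins for the SBP/DBP scenario:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.44 * rng.normal(size=n)  # strongly correlated, like SBP and DBP
y = 1.0 + x1 + x2 + rng.normal(size=n)

def rss(Z, y):
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    r = y - Z @ beta
    return r @ r

full = np.column_stack([np.ones(n), x1, x2])
reduced = np.ones((n, 1))  # H0: beta_1 = beta_2 = 0
# F statistic of (3.13): numerator df = 2 dropped coefficients, denominator df = n - 3
F = ((rss(reduced, y) - rss(full, y)) / 2) / (rss(full, y) / (n - 3))
```

Here both predictors together explain a lot of variability, so F is large even though either one alone might survive a marginal test.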
One could (and sometimes does, especially in backward elimination) test whether to remove X2 and then, once X2 is gone, test whether to remove X1 as well. But that entails (a greater degree of) sequential hypothesis testing, which is not well understood in terms of its implications for actual versus nominal statistical significance.

Moreover, there are some situations in which X1 and X2 should either both be in the model or both be excluded, such as when they are dummy variables.
Result (3.15) describes how to make a confidence region for β0, β1, …, βp. I will illustrate what this looks like for β0 and β1 when p = 1. Importantly, such a region is not a rectangle.

Though not shown explicitly in (3.15), one may also make a confidence region for any subset of the parameters. I will illustrate what this looks like for β1 and β2 when p ≥ 2 supposing that, for example, X1 = SBP and X2 = DBP.
The authors discuss the prostate cancer data set at some length.

Notice that they began (Figure 1.1) by exploring the data. Dr. Stromberg would be proud!

I don’t quite agree with the authors’ conflation of “strongest effect” with largest Z score, even though they did standardize the predictors to have unit variance. Let’s discuss that…
Let’s also make sure we understand what the authors mean by “base error rate” and its reduction by 50%.

Returning to the idea of inference, the Gauss-Markov Theorem (and, likewise, the Cramer-Rao lower bound, for those of you who’ve heard of it) will permit us to conclude that, for a correctly specified model: (i) parameters are estimated unbiasedly; and (ii) parameters are estimated with minimal variance subject to (i).
Although the Gauss-Markov Theorem sounds reassuring, there are some cases when we can achieve a huge reduction in variance by tolerating a modest amount of bias. This may substantially reduce both mean square error of estimation and mean square error of prediction.

Moreover, there’s a big catch to the Gauss-Markov Theorem: for a correctly specified model. How often do you suppose that (really) happens?
The authors proceed to describe how a multiple linear regression model can actually be viewed as the result of fitting several simple linear regression models.

They begin by noting that when the inputs are orthogonal (roughly akin to the idea of statistical independence), unadjusted and adjusted parameter estimates are identical.
Usually inputs are not orthogonal. But imagine that, with standardized inputs and response (hence, no need for an intercept), we do the following:

1. Multiple linear regression of Xk on all other features.
2. Simple linear regression of Y on the residuals from step 1.

These residuals are orthogonal to the other features…
…and so the parameter estimate from step 2 will be the same as we would have obtained for Xk in a multiple linear regression of Y on X1, X2, …, Xp.

This gives us an alternate interpretation of an adjusted regression coefficient: we are quantifying the effect of Xk on that portion of Y which is orthogonal to (or, if you prefer, unexplained by) the other inputs.
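The two-step procedure above can be checked numerically. A sketch with simulated, deliberately non-orthogonal inputs:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.normal(size=(n, p))
X[:, 1] += 0.5 * X[:, 0]                  # make the inputs non-orthogonal
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized inputs
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
y = y - y.mean()                          # centered response; no intercept needed

k = 1
others = [j for j in range(p) if j != k]
# Step 1: multiple regression of X_k on the other features; keep the residuals
gamma = np.linalg.lstsq(X[:, others], X[:, k], rcond=None)[0]
z = X[:, k] - X[:, others] @ gamma
# Step 2: simple regression of y on the residuals z
beta_k_two_step = (z @ y) / (z @ z)
# Same coefficient as the full multiple regression of y on X_1, ..., X_p
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
```

The two estimates agree to machine precision, which is exactly the claim being made on the slide.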
Formula (3.29) then shows why it’s difficult to estimate the coefficient for Xk if Xk is highly correlated with the other features. This condition, in its extreme form, is known as collinearity.

In fact, the quantity in the denominator of (3.29) is related to the so-called variance inflation factor, which is sometimes used as a diagnostic for collinearity.
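The variance inflation factor is straightforward to compute from the regression of each feature on the others; VIF_k = 1 / (1 - R_k^2). A sketch on simulated data with one strongly collinear pair:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)  # correlation ~0.95 with x1
x3 = rng.normal(size=n)                                     # unrelated to the others
X = np.column_stack([x1, x2, x3])

def vif(X, k):
    # Regress X_k on the remaining features; VIF_k = 1 / (1 - R^2)
    n, p = X.shape
    others = [j for j in range(p) if j != k]
    Z = np.column_stack([np.ones(n), X[:, others]])
    fitted = Z @ np.linalg.lstsq(Z, X[:, k], rcond=None)[0]
    r2 = 1 - np.sum((X[:, k] - fitted) ** 2) / np.sum((X[:, k] - X[:, k].mean()) ** 2)
    return 1.0 / (1.0 - r2)
```

With a correlation near 0.95, VIF for x1 is roughly 10, while the independent x3 has VIF near 1.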
You may have also heard of the distinction between ANOVA and MANOVA.

You may wonder, then, whether there is an analogue to MANOVA for situations when you have multiple continuous outcomes which are being regressed on multiple continuous predictors.
There is, but parameter estimates relating a particular outcome to the predictors do not depend on whether they are acquired for the one outcome by itself or for all outcomes simultaneously. This is true even if the multiple outcomes are correlated with each other.

This is in stark contrast, of course, to the dependence of parameter estimates relating a particular outcome to a particular predictor on whether other predictors are considered simultaneously.
The authors note that ordinary least squares may have little bias and large variability. (Here they are assuming that the model is specified correctly. If the model is a drastic simplification of reality, then ordinary least squares will have large bias and little variability in relation to viable competing paradigms for modeling and estimation.)

The authors therefore discuss subset selection, which may reduce variability and enhance interpretation.
Best subset selection, which is computationally feasible for up to a few dozen candidate predictors, entails finding the best one-predictor model, the best two-predictor model, and so forth. This is illustrated in Figure 3.5.

Then, using either a validation data set or cross-validation (the authors do the latter in Figure 3.7), choose from among the best one-predictor model, best two-predictor model, and so forth.
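The within-size search can be written as a brute-force enumeration; choosing among sizes would then use validation or cross-validation as described. A sketch (training-set RSS only, simulated data):

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(6)
n, p = 120, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=n)  # only X_1 and X_4 matter

def rss(cols):
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    r = y - Z @ beta
    return r @ r

# Best subset of each size, judged by residual sum of squares on the training data
best = {k: min(combinations(range(p), k), key=rss) for k in range(1, p + 1)}
```

The enumeration visits 2^p - 1 subsets, which is why the method is only feasible up to a few dozen candidate predictors.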
Note that the authors do not define “best” strictly by mean square error of prediction but also by considerations of parsimony.

Forward selection is an alternative to best subsets selection when p is large. The authors refer to it as a “greedy algorithm” because, at each step, that predictor is chosen which explains the greatest part of the remaining variability in the outcome.
While that seems desirable, the end result may actually be sub-optimal if predictors are strongly correlated.

Backward elimination is another option. A disadvantage is that it may not be viable when p is large relative to n. A compelling advantage is that backward elimination can be easily implemented “manually” if not otherwise programmed into the statistical software. This may be useful in, for example, PROC MIXED of SAS.
In addition to explicitly choosing from among available predictors, we may also employ “shrinkage” methods for estimating parameters in a linear regression model.

These are so called because the resulting estimates are often smaller in magnitude than those acquired via ordinary least squares.

Until further notice, we assume that Y and X1, …, Xp have been standardized with respect to training data.
Ridge regression is defined by formula (3.44) in the textbook and can be viewed as the solution to the penalized least squares problem expressed in (3.41).

Though perhaps not obvious, the constrained least squares problem in (3.42) is equivalent to (3.41), for an appropriate choice of t depending on λ. Moreover, (3.44) may be a good way to address collinearity. In particular, a correlation of ρ between X1 and X2 is, roughly speaking, effectively reduced to ρ / (1 + λ).
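Unlike the lasso, ridge regression has a closed form. A sketch of (3.44) on simulated standardized data, showing that λ = 0 recovers ordinary least squares while larger λ shrinks the estimates:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 4
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized inputs
y = X @ np.array([1.5, -1.0, 0.0, 0.5]) + rng.normal(size=n)
y = y - y.mean()

def ridge(X, y, lam):
    # (3.44): beta_hat = (X^T X + lambda * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = ridge(X, y, 0.0)     # lambda = 0 recovers ordinary least squares
beta_ridge = ridge(X, y, 50.0)  # a larger lambda shrinks the estimates toward zero
```

Adding λ to the diagonal of XᵀX is also what makes the system solvable even when collinearity renders XᵀX nearly singular.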
Figure 3.8 displays a “ridge trace”, which shows how the estimated parameters in ridge regression depend on λ. One may choose λ by cross validation, as the authors have done.

Ridge regression can also be viewed as finding a Bayesian posterior mode (whilst ordinary least squares is frequentist maximum likelihood), when the prior distribution on each estimated parameter is normal with mean 0 and variance σ^2 / λ.
What do you think about ridge regression in terms of the bias / variance tradeoff?

One weakness of ridge regression is that, almost invariably, you are still “stuck” with all of the predictors. Even if the collinearity issue is satisfactorily resolved, why is this a weakness?

An alternative is the lasso, for which the corresponding optimization problems are (3.51) and (3.52).
There is no analytic solution for the parameter estimates with the lasso; one must use numerical optimization methods.

However, a favourable feature of the lasso is that some parameter estimates are “shrunk” to zero. In other words, some variables are effectively removed.

Figure 3.10 displays estimated parameters in relation to a shrinkage factor s which is proportional to t.
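One standard numerical method is coordinate descent with soft-thresholding, which is also how modern lasso software proceeds. This is a minimal sketch, not the textbook's algorithm, on simulated standardized data:

```python
import numpy as np

def soft_threshold(a, lam):
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    # Coordinate descent for (1/2) * ||y - X beta||^2 + lam * sum_j |beta_j|
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for k in range(p):
            # Partial residual excluding feature k, then a univariate soft-threshold update
            r_k = y - X @ beta + X[:, k] * beta[k]
            beta[k] = soft_threshold(X[:, k] @ r_k, lam) / col_sq[k]
    return beta

rng = np.random.default_rng(8)
n, p = 150, 4
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(size=n)
y = y - y.mean()
```

At λ = 0 the iterations converge to ordinary least squares; at λ as large as max_j |X_jᵀy|, every coefficient is thresholded exactly to zero, illustrating the variable-removal property.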
Figure 3.11 explains why the lasso can effectively remove some predictors from the model. Each red ellipse represents a contour on which the residual sum of squares equals a fixed value.

However, we are only allowed to accept parameter estimates within the blue geometric regions. So, the final parameter estimates will occur where an ellipse is tangent to a region. For a circular region, this will almost never happen along a coordinate axis.
The lasso also has a Bayesian interpretation, corresponding to a prior distribution on parameter estimates which has much heavier tails than a normal distribution. Thus, the lasso is less capable of reducing very large ordinary least squares estimates such as may occur with collinearity.

Table 3.4 nicely characterizes how subset selection, lasso, and ridge regression shrink ordinary least squares parameter estimates for uncorrelated predictors.
While having uncorrelated predictors is a fanciful notion for an observational study (versus a designed experiment), Table 3.4 helps explain why ridge regression does not produce zeroes and why the lasso is more of a “continuous” operation than subset selection.

The authors also mention the elastic net and least angle regression as shrinkage methods.
The former is a sort of compromise between ridge regression and the lasso, as suggested by Figure 18.5 later in the textbook. The idea is to both reduce very large ordinary least squares estimates and eliminate extraneous predictors from the model.

The latter is similar to the lasso, as shown in Figure 3.15, and provides insight into how to compute parameter estimates for the lasso more efficiently.
Besides subset selection and shrinkage methods, one may fit linear regression models via approaches based on derived input directions.

Principal components regression replaces X1, X2, …, Xp by a set of uncorrelated variables W1, W2, …, Wp such that Var(W1) ≥ Var(W2) ≥ … ≥ Var(Wp). Each W is a linear combination of the X’s, such that the squared coefficients of the X’s sum to 1.
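The W's can be obtained from the eigenvectors of XᵀX for centered X; each eigenvector has unit length, so the squared coefficients sum to 1 as required. A sketch on simulated correlated predictors:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 300, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # deliberately correlated predictors
X = X - X.mean(axis=0)

# The loading vectors are eigenvectors of X^T X (unit length, so squared
# coefficients sum to 1); sort by eigenvalue so Var(W_1) >= Var(W_2) >= ...
eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
V = V[:, order]
W = X @ V  # derived, mutually uncorrelated inputs with decreasing variance
```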
One then uses some or all of W1, W2, …, Wp as predictors in lieu of X1, X2, …, Xp. This eliminates any problem that may exist with collinearity.

The downside of principal components regression is that a W may be difficult to interpret contextually, unless it should happen that the W is approximately proportional to an average of some of the X’s or a “contrast” (the difference between the average of one subset of the X’s and the average of another subset).
Partial least squares – which has been investigated by our own Dr. Rayens, among others – is similar to principal components regression, except that W1, W2, …, Wp are chosen in a way that Corr^2(Y,W1) Var(W1) ≥ Corr^2(Y,W2) Var(W2) ≥ … ≥ Corr^2(Y,Wp) Var(Wp).

If one intends to use only some of W1, W2, …, Wp as predictors, partial least squares may explain more variation in Y than principal components regression.
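The first partial least squares direction can be sketched directly: over unit-length loading vectors α, Corr^2(Y, Xα) Var(Xα) is proportional to (yᵀXα)^2, which is maximized by α proportional to Xᵀy. The comparison with the first principal component, which ignores Y entirely, looks like this on simulated data:

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 200, 3
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)
y = y - y.mean()

def criterion(w):
    # Corr^2(Y, W) * Var(W), the quantity partial least squares targets
    c = np.corrcoef(y, w)[0, 1]
    return c ** 2 * w.var()

# First PLS direction: weight each feature by its inner product with y
phi = X.T @ y
w1_pls = X @ (phi / np.linalg.norm(phi))

# First principal component direction ignores y entirely
eigvals, V = np.linalg.eigh(X.T @ X)
w1_pcr = X @ V[:, np.argmax(eigvals)]
```

By construction, the PLS component scores at least as high on the Corr^2 times Var criterion as the leading principal component, which is why a short list of PLS components may explain more variation in Y.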
Figure 3.18 presents a nice illustrative comparison of ordinary least squares, best subset selection, ridge regression, lasso, principal components regression, and partial least squares.

To aid in the interpretation, note that X2 could be expressed as + or - (1/2) X1 + (sqrt(3)/2) Z, where Z is standard normal and independent of X1. Also, W1 = (X1 + or - X2) / sqrt(2) and W2 = (X1 - or + X2) / sqrt(2) for principal components regression.