CORRELATION AND REGRESSION
Prepared by T.O. Antwi-Asare
2/2/2017
Correlation and Regression
Correlation
Scatter Diagram
Karl Pearson Coefficient of Correlation
Rank Correlation
Limits for the Correlation Coefficient
Definition of Regression
Lines of Regression
Regression Curves
Regression Coefficients
Properties of Regression Coefficients
Correlation Analysis vs. Regression Analysis
Definition of Correlation
Correlation means "co-" (together) and "relation".
Correlation analysis is a statistical technique for measuring the degree, direction and significance of co-variation between two or more variables.
If there is any relation between two variables, i.e. when one variable changes the other also changes in the same or in the opposite direction, we say that the two variables are correlated.
Correlation is thus the study of the existence, magnitude and direction of the relation between two or more variables.
Correlation: Is there any relation between:
fast food sales and different seasons?
specific crimes and religion?
smoking cigarettes and lung cancer?
maths scores and overall scores in exams?
temperature and earthquakes?
cost of advertisement and number of items sold?
To answer each question, two sets of corresponding data need to be collected randomly. Let random variable X represent the first group of data and random variable Y represent the second. Question: is it true that there is a relationship between the two variables? As a first step we can plot a graph of X against Y.
Imagine we have a random sample of scores in a school as follows:
In our example, the correlation between X and Y can be shown in a scatter diagram:
Our aim is to find out whether there is any linear association between X and Y.
In statistics, the technical term for linear association is "correlation". So we are looking to see if there is any correlation between the two scores.
"Linear association": the variables are considered in their levels, i.e. X with Y, not with Y², Y³, 1/Y, 1/Y², etc., or even √Y.
Correlation analysis consists of three steps:
Determining whether any relation exists
If it does, measuring its strength and direction
Testing whether the relationship is significant
Note:
Sometimes a cause-and-effect relationship is conjectured, but correlation alone is not conclusive. Other, more powerful statistical tools are used to show causality, e.g. the Granger causality test and the Toda-Yamamoto test.
Note:
A significant correlation between an increase in smoking and an increase in lung cancer does not prove that smoking causes cancer.
The proof of a cause-and-effect relationship can be developed only by means of an exhaustive study of the operative elements themselves.
Significance of the study of correlation:
Correlation analysis helps to measure the degree and direction of correlation in one number. E.g.: age and height; blood pressure and pulse rate.
When there is close correlation between two variables, the value of one variable can be estimated given the known value of the other variable. Predictions can be based on correlation analysis. There are also other methods for forecasting purposes.
Correlation may be due to the following reasons:
It may be due to pure chance, especially in small samples.
The correlated variables may be influenced by similar variables or a common trend. Example: correlation between yield per acre of rice and yield per acre of tea may be due to the fact that both depend upon the same amount of rainfall.
Correlation may be due to the following reasons:
Both variables may be mutually influencing each other, so that neither can be designated as the cause and the other the effect. Example: correlation between demand and supply, or price and production.
Correlation may be due to the fact that one variable is the cause and the other variable the effect.
Scatter Plots and Types of Correlation
Negative correlation: as x increases, y decreases (the green line just shows the direction).
X = hours of training of workers (horizontal axis); Y = number of accidents (vertical axis).
[Scatter plot: Hours of Training, 0 to 20, against Accidents, 0 to 60]
Scatter Plots and Types of Correlation
Positive correlation: as x increases, y increases.
X = SAT score; Y = GPA.
[Scatter plot: SAT, 300 to 800, against GPA, 1.50 to 4.00]
Scatter Plots and Types of Correlation
No linear correlation.
X = height; Y = IQ.
[Scatter plot: Height, 60 to 80, against IQ, 80 to 160]
Types of Correlation
Positive (direct) and negative (indirect) correlation
Linear and non-linear (curvilinear) correlation
Simple, partial and multiple correlation
Positive and Negative Correlation
If two variables change in the same direction, this is called positive correlation. For example: advertising and sales; height and weight.
If two variables change in the opposite direction, the correlation is called negative correlation. For example: TV registrations and cinema attendance; price and quantity demanded.
Direction of the Correlation
Positive relationship: variables change in the same direction (indicated by a + sign).
As X is increasing, Y is increasing; as X is decreasing, Y is decreasing. E.g., as height increases, so does weight.
Negative relationship: variables change in opposite directions (indicated by a − sign).
As X is increasing, Y is decreasing; as X is decreasing, Y is increasing. E.g., as TV time increases, grades decrease.
More Examples
Positive relationships: water consumption and day temperature; study time and grades.
Negative relationships: alcohol consumption and driving ability; price and quantity demanded.
iii) Linear and Non-Linear (Curvilinear) Correlation
If the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable, the correlation is said to be linear.
If the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable, the correlation is non-linear.
However, since the methods for analysing non-linear correlation are complicated, we usually assume a linear relationship between the variables.
Simple, Partial and Multiple Correlation
Simple correlation is a study between two variables.
When three or more variables are studied, it is a problem of either multiple or partial correlation. In multiple correlation, three or more variables are studied simultaneously. In partial correlation, three or more variables are recognized, but the correlation between any two variables is studied keeping the effect of the other influencing variable(s) constant.
Methods of Determining Correlation
The most commonly used methods are:
(1) Scatter Graph or Scatter Plot
(2) Pearson's Coefficient of Correlation
(3) Spearman's Rank Correlation Coefficient
(4) Regression Analysis (for partial and multiple regression)
Scatter Plot Method
In this method the values of the two variables are plotted on a graph.
One is taken along the horizontal (x-axis) and the other along the vertical (y-axis).
By plotting the data, we get points (dots) on the graph which are generally scattered, hence the name "scatter plot".
The manner in which these points are scattered suggests the degree and the direction of correlation.
The degree of correlation is denoted by "r" and its direction is given by its sign, positive or negative.
Scatter Plot Method
i) If all points lie on a rising straight line, the correlation is perfectly positive and r = +1 (see fig. 1).
ii) If all points lie on a falling straight line, the correlation is perfectly negative and r = −1 (see fig. 2).
iii) If the points lie in a narrow strip rising upwards, there is a high degree of positive correlation (see fig. 3).
iv) If the points lie in a narrow strip falling downwards, there is a high degree of negative correlation (see fig. 4).
v) If the points are spread widely over a broad strip rising upwards, there is a low degree of positive correlation (see fig. 5).
vi) If the points are spread widely over a broad strip falling downwards, there is a low degree of negative correlation (see fig. 6).
vii) If the points are scattered without any specific pattern, correlation is absent, i.e. r = 0 (see fig. 7).
Show fig. 5 and fig. 6 on the board in class.
Students could sketch them themselves.
Scatter Plot Method
Though this method is simple and gives a rough idea about the existence and the degree of correlation, it is not reliable.
As it is not a mathematical method, it cannot precisely measure the degree of correlation.
Pearson's Coefficient of Correlation
Karl Pearson's coefficient of correlation is denoted by "r".
Pearson's r is the most commonly used correlation coefficient. The coefficient of correlation r measures the degree of linear relationship between two variables, say X and Y.
Assumptions for Pearson's Correlation Coefficient:
For each value of X there is a normally distributed subpopulation of Y values.
For each value of Y there is a normally distributed subpopulation of X values.
The joint distribution of X and Y is a normal distribution, called the bivariate normal distribution.
The subpopulations of Y values have the same variance; the subpopulations of X values have the same variance.
Some problems with "r"
The correlation coefficient r is not a good summary of association if the data have very large differences in variance.
The correlation coefficient r is not a good summary of association if the data have outliers.
The correlation coefficient r measures only linear associations.
Formula: Pearson's Correlation Coefficient
For a sample,
r = [n(ΣXY) − (ΣX)(ΣY)] / {[n(ΣX²) − (ΣX)²][n(ΣY²) − (ΣY)²]}^0.5 = Cov(X,Y) / (S_X S_Y)
where n is the number of data pairs, X is the independent variable and Y the dependent variable;
Cov(X,Y) = sample covariance between X and Y;
S_X = sample standard deviation of X;
S_Y = sample standard deviation of Y.
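The raw-sum formula above can be sketched in Python. This is a minimal illustration; the function name pearson_r is ours, not from the slides.

```python
import math

def pearson_r(xs, ys):
    # r = [n*SXY - SX*SY] / sqrt{[n*SXX - SX^2] * [n*SYY - SY^2]}
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Perfectly linear data gives r = +1; reversing the direction gives r = -1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```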
Formula in Deviation Form
Note that x = X − X̄ (the deviation of X from its mean) and y = Y − Ȳ (the deviation of Y from its mean), so
r = Σxy / √[(Σx²)(Σy²)]
This is also known as the product-moment coefficient of correlation.
Correlation
Through the coefficient of correlation, we can measure the degree or extent of the correlation between two variables.
On the basis of the sign of the coefficient of correlation we can also determine whether the correlation is positive or negative.
Perfect correlation: if two variables change in the same direction and in the same proportion, the correlation between the two is perfect and positive.
Degree of Correlation
Absence of correlation: if two variables exhibit no relationship between them, or if a change in one variable does not lead to a change in the other variable, then r = 0.
Limited degree of correlation: if two variables are not perfectly correlated, we term the correlation limited correlation.
Karl Pearson's Coefficient of Correlation
It gives a numerical expression for the measure of correlation, denoted by "r". The value of "r" gives the magnitude of correlation and its sign denotes its direction.
Degrees of Correlation
High degree, moderate degree and low degree are the three categories of linear correlation. The following table shows how to interpret the coefficient of correlation.
Degrees of Correlation

Degrees                   Positive          Negative
Absence of correlation    0                 0
Perfect correlation       +1                −1
High degree               +0.75 to +1       −0.75 to −1
Moderate degree           +0.25 to +0.75    −0.25 to −0.75
Low degree                0 to +0.25        0 to −0.25
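The table can be turned into a small helper. The function name and the treatment of the boundary values (0.25 and 0.75 assigned to the higher category) are our own choices, not from the slides.

```python
def degree_of_correlation(r):
    # Classify a correlation coefficient using the table above.
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1")
    a = abs(r)
    if a == 0:
        return "absence of correlation"
    if a == 1:
        return "perfect correlation"
    if a >= 0.75:
        return "high degree"
    if a >= 0.25:
        return "moderate degree"
    return "low degree"

print(degree_of_correlation(0.9))    # high degree
print(degree_of_correlation(-0.4))   # moderate degree
```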
Karl Pearson's Coefficient of Correlation
Note: r is also known as the product-moment coefficient of correlation.
Karl Pearson's Coefficient of Correlation
Example: calculate the coefficient of correlation between the heights of fathers and sons for the following data.

Height of father (cm): 165  166  167  168  167  169  170  172
Height of son (cm):    167  168  165  172  168  172  169  171

Solution: n = 8 (pairs of observations)
Height of father x_i   Height of son y_i   x = x_i − x̄   y = y_i − ȳ   xy   x²   y²
165                    167                 −3             −2            6    9    4
166                    168                 −2             −1            2    4    1
167                    165                 −1             −4            4    1    16
167                    168                 −1             −1            1    1    1
168                    172                 0              3             0    0    9
169                    172                 1              3             3    1    9
170                    169                 2              0             0    4    0
172                    171                 4              2             8    16   4
Σx_i = 1344            Σy_i = 1352         0              0             Σxy = 24   Σx² = 36   Σy² = 44

Hence x̄ = 1344/8 = 168 and ȳ = 1352/8 = 169, and
r = Σxy / √[(Σx²)(Σy²)] = 24 / √(36 × 44) ≈ 0.60.
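The deviation-form computation from the table can be reproduced in a few lines. This is an illustrative sketch; the variable names are ours.

```python
import math

# Father-son height data from the worked example (cm)
fathers = [165, 166, 167, 168, 167, 169, 170, 172]
sons = [167, 168, 165, 172, 168, 172, 169, 171]

# Deviations from the means (fathers' mean = 168 cm, sons' mean = 169 cm)
x = [f - 168 for f in fathers]
y = [s - 169 for s in sons]

sxy = sum(a * b for a, b in zip(x, y))   # 24
sxx = sum(a * a for a in x)              # 36
syy = sum(b * b for b in y)              # 44
r = sxy / math.sqrt(sxx * syy)
print(round(r, 3))  # 0.603
```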
Solved Example
Problem: find whether the number of flowers on a plant is correlated with the height of the plant.

S. No.   Height of plant   Flowers on plant
1        4                 12
2        3                 10
3        4                 13
4        5                 15
5        5                 16
6        4                 11
7        6                 18
8        3                 9
9        5                 14
10       4                 12
S. No.   Height of plant (x)   Flowers on plant (y)   x²     y²     xy
1        4                     12                     16     144    48
2        3                     10                     9      100    30
3        4                     13                     16     169    52
4        5                     15                     25     225    75
5        5                     16                     25     256    80
6        4                     11                     16     121    44
7        6                     18                     36     324    108
8        3                     9                      9      81     27
9        5                     14                     25     196    70
10       4                     12                     16     144    48
Total    43                    130                    193    1760   582
With n = 10, Σx = 43, Σy = 130, Σx² = 193, Σy² = 1760 and Σxy = 582:
r = [n(Σxy) − (Σx)(Σy)] / √{[n(Σx²) − (Σx)²][n(Σy²) − (Σy)²]}
  = [10(582) − (43)(130)] / √{[10(193) − 43²][10(1760) − 130²]}
  = 230 / √(81 × 700) ≈ 0.97
There is a high degree of positive correlation between the height of the plant and the number of flowers on it.
Limits for the Correlation Coefficient
The Pearson correlation coefficient lies between −1 and +1.
Symbolically, −1 ≤ r ≤ +1.
Merits and Limitations of Pearson's Coefficient of Correlation
It summarizes in one number the degree and direction of correlation.
Karl Pearson's coefficient of correlation assumes a linear relationship, regardless of whether that assumption is correct or not.
The value of the coefficient is unduly affected by extreme items.
Very often there is a risk of misinterpreting the coefficient, hence great care must be exercised while interpreting the coefficient of correlation.
Properties of Pearson's Coefficient of Correlation
This measure of correlation has interesting properties, some of which are stated below:
It is independent of the units of measurement; it is in fact unit-free. For example, ρ between highest day temperature (in Centigrade) and rainfall per day (in mm) is not expressed either in terms of Centigrade or mm.
It is symmetric. This means that ρ between X and Y is exactly the same as ρ between Y and X.
Properties of Pearson's Coefficient of Correlation
Pearson's correlation coefficient is independent of change of origin and scale. Thus, ρ between temperature (in Centigrade) and rainfall (in mm) would numerically be equal to ρ between temperature (in Fahrenheit) and rainfall (in cm).
If the variables are independent of each other, then one would obtain ρ = 0. However, the converse is not true. In other words, ρ = 0 does not imply that the variables are independent; it may indicate the existence of a non-linear relationship.
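The invariance under change of origin and scale is easy to demonstrate numerically. The data and helper function below are illustrative, not from the slides.

```python
import math

def pearson_r(xs, ys):
    # Pearson's r in deviation form: r = Sxy / sqrt(Sxx * Syy)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical temperature/rainfall data
temp_c = [10, 15, 20, 25, 30]
rain_mm = [80, 75, 90, 60, 55]

# Change origin and scale: Centigrade -> Fahrenheit, mm -> cm
temp_f = [9 * t / 5 + 32 for t in temp_c]
rain_cm = [r / 10 for r in rain_mm]

# The two correlations agree to floating-point precision.
print(abs(pearson_r(temp_c, rain_mm) - pearson_r(temp_f, rain_cm)) < 1e-9)  # True
```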
Caveats and Warnings
While ρ is a powerful tool, it is a much-abused one and hence has to be handled carefully. People often tend to forget or gloss over the fact that ρ is a measure of linear relationship. Consequently a small value of ρ is often interpreted to mean non-existence of a relationship, when actually it only indicates non-existence of a linear relationship, or at best a very weak linear relationship. Under such circumstances it is possible that a non-linear relationship exists.
A scatter diagram can reveal this, and one is well advised to observe the graph before firmly concluding non-existence of a relationship. If the scatter diagram points to a non-linear relationship, an appropriate transformation can often attain linearity, in which case ρ can be recomputed.
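A classic numerical illustration of this caveat: a perfect quadratic relationship whose linear correlation is exactly zero. The helper function is our own sketch.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]   # y is completely determined by x ...
print(pearson_r(x, y))   # 0.0 -- ... yet the linear correlation vanishes
```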
Caveats and Warnings
One has to be careful in interpreting the value of ρ. For example, one could compute ρ between shoe size and intelligence of individuals, or heights and income. Irrespective of the value of ρ, such a correlation makes no sense and is hence termed chance or nonsense correlation.
ρ should not be used to say anything about cause and effect. Put differently, by examining the value of ρ we could conclude that variables X and Y are related. However, the same value of ρ does not tell us whether X influences Y or the other way round, a fact that is of grave import in regression analysis.
Significance Test for Correlation
Hypotheses: H0: ρ = 0 versus HA: ρ ≠ 0.
Test statistic: under the null hypothesis (when the null hypothesis is true),
t = r√(n − 2) / √(1 − r²)
follows a Student's t distribution with n − 2 degrees of freedom.
Decision rule: if we let α = 0.05, the critical value of t is ±w; we reject H0 if t falls outside −w ≤ t ≤ +w.
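The test statistic can be sketched as follows. The function name is ours; the critical value w would come from t tables with n − 2 degrees of freedom.

```python
import math

def corr_t_stat(r, n):
    # t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Father-son example: r ~= 0.603 with n = 8 pairs
t = corr_t_stat(0.603, 8)
print(round(t, 2))  # 1.85 -- compare against +/- w = t(0.025, 6) ~= 2.447
```

Here t falls inside the acceptance region, so at the 5% level this small sample would not reject H0: ρ = 0.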
Spearman's Rank Correlation
Spearman's rank correlation coefficient, or Spearman's rho, is a measure of statistical dependence between two ordinal variables.
The Spearman rank correlation coefficient is a non-parametric measure of correlation, using the ranks of the variables to calculate the correlation.
Advantages and Caveats
Other measures of correlation are parametric in the sense of being based on variables measured on the ratio scale.
Advantages and Caveats
Another advantage of this measure is that it is much easier to use, since it does not matter which way we rank the data, ascending or descending. We may assign rank 1 to the smallest value or to the largest value, provided we do the same thing for both sets of data. The only requirement is that the data should be ranked, or at least converted into ranks.
Interpretation
The sign of the Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable). If Y tends to increase when X increases, the Spearman correlation coefficient is positive. If Y tends to decrease when X increases, the Spearman correlation coefficient is negative. A Spearman correlation of zero indicates that there is no association between X and Y.
Repeated Ranks
If there is more than one item with the same value, they are given a common rank, which is the average of their respective ranks, as shown in the table.
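Average ranks for ties, and Spearman's rho as Pearson's r on the ranks, can be sketched as follows. The helper names are ours, not from the slides.

```python
import math

def average_ranks(values):
    # Tied values share the average of the ranks they would jointly occupy.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    # Spearman's rho = Pearson's r computed on the ranks.
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = math.sqrt(sum((a - mx) ** 2 for a in rx)
                    * sum((b - my) ** 2 for b in ry))
    return num / den

print(average_ranks([10, 20, 20, 30]))               # [1.0, 2.5, 2.5, 4.0]
print(spearman_rho([1, 2, 3, 4], [10, 30, 50, 70]))  # 1.0
```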
Example
The raw data in the table below is used to calculate the correlation between the IQ of a student and the number of hours spent in front of the TV per week.
[Data table and worked computation shown on the slides.]
Correlation and Causation
Closely related to confounding variables is the incorrect assumption that because two things correlate, there is a causal relationship. Causality is the area of statistics most commonly misused and misinterpreted by non-specialists.
Media sources, politicians and lobby groups often leap upon a perceived correlation and use it to 'prove' their own beliefs. They fail to understand that, just because results show a correlation, there is no proof of an underlying causality. Many people assume that because a poll or a statistic contains many numbers, it must be scientific, and therefore correct.
The Cost of Disregarding Correlation and Causation
Incorrectly linking correlation and causation is closely related to post hoc reasoning, where incorrect assumptions generate an incorrect link between two effects. The principle of correlation and causation is very important for anybody working as a scientist or researcher. It is also a useful principle for non-scientists, especially those studying politics, media and marketing.
Understanding causality promotes a greater understanding and honest evaluation of the alleged facts given by pollsters. Imagine an expensive advertising campaign, based around intense market research, where misunderstanding a correlation could cost a lot of money in advertising, production costs, and damage to the company's reputation.
Partial correlation analysis involves studying the linear relationship between two variables after excluding the effect of one or more independent factors.
Simple correlation does not prove to be an all-encompassing technique, especially under the above circumstances. In order to get a correct picture of the relationship between two variables, we should first eliminate the influence of other variables. For example, a study of the partial correlation between price and demand would involve studying the relationship between price and demand excluding the effect of money supply, exports, etc.
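One common way to sketch this: the first-order partial correlation of X and Y controlling for Z has a closed form in the pairwise correlations, and it equals the correlation of the residuals left after regressing each variable on Z. Both routes below, and all names and data, are illustrative assumptions of ours.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

def residuals(v, z):
    # Residuals of a simple least-squares regression of v on z.
    n = len(v)
    mz, mv = sum(z) / n, sum(v) / n
    b = (sum((zi - mz) * (vi - mv) for zi, vi in zip(z, v))
         / sum((zi - mz) ** 2 for zi in z))
    a = mv - b * mz
    return [vi - (a + b * zi) for vi, zi in zip(v, z)]

def partial_corr(x, y, z):
    # r_xy.z via the closed form in the pairwise correlations.
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

x = [2, 4, 3, 6, 5, 8]
y = [1, 3, 5, 4, 7, 8]
z = [1, 2, 2, 4, 3, 6]

# The residual route gives the same answer as the closed form.
print(abs(partial_corr(x, y, z)
          - pearson_r(residuals(x, z), residuals(y, z))) < 1e-9)  # True
```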
Limitations
However, this technique suffers from some limitations, some of which are stated below.
The calculation of the partial correlation coefficient is based on the simple correlation coefficient, which assumes a linear relationship. Generally this assumption is not valid, especially in the social sciences, as linear relationships rarely exist in such phenomena.
As the order of the partial correlation coefficient goes up, its reliability goes down. Its calculation is somewhat cumbersome and often difficult for the mathematically uninitiated (though software packages have made life a lot easier).
REGRESSION
Definition of Regression
Regression can be defined as a method that estimates the value of one variable when those of the other variables are known, provided the variables are correlated.
The dictionary meaning of regression is "to go backward". The term was used for the first time by Sir Francis Galton in his research paper "Regression towards mediocrity in hereditary stature".
Regression helps us to estimate one variable, the dependent variable, from other variables, the independent variables.
4. According to Blair, "Regression is the measure of the average relationship between two or more variables in terms of the original units of data."
5. According to Wallis and Roberts, "It is often more important to find out what the relation actually is, in order to estimate or predict one variable (the dependent variable), and the statistical technique appropriate in such cases is called Regression Analysis."
Regression Analysis
6. Regression analysis is the mathematical measure of the average relationship between two or more variables.
Regression analysis is a statistical tool for the investigation of relationships between variables. Usually, the investigator seeks to ascertain the causal effect of one variable upon another: the effect of a price increase upon demand, for example, or the effect of changes in the money supply upon the inflation rate. To explore such issues, the investigator assembles data on the underlying variables of interest and employs regression to estimate the quantitative effect of the causal variables upon the variable that they influence.
The investigator also typically assesses the "statistical significance" of the estimated relationships, that is, the degree of confidence that the true relationship is close to the estimated relationship.
Regression techniques have long been central to the field of economic statistics ("econometrics").
Regression Analysis
A method of modelling the relationships among three or more variables. It is used to predict the value of one variable given the values of the others. For example, a model might estimate sales based on age and gender. A regression analysis yields an equation that expresses the relationship.
The purposes of regression analysis include determining the general form of a regression equation, constructing estimates of unknown parameters occurring in a regression equation, and testing statistical regression hypotheses.
Summary of Ideas on Regression
Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
In correlation, the two variables are treated as equals. In regression, one or more variables are considered independent (predictor) variables (X) and the other the dependent (outcome) variable Y.
Dependent variable: usually denoted Y.
Independent variables: denoted X1, X2, ..., Xk.
Regression Lines
They are obtained:
(I) graphically, by scatter plot; or
(II) mathematically, by the method of least squares (or other methods, e.g. maximum likelihood).
SIMPLE REGRESSION MODEL
Y = β0 + β1X + ε
The model above is referred to as the simple linear regression model. We would be interested in estimating β0 and β1 from the data we collect. If you know something about X, this knowledge helps you predict something about Y.
Variables:
X = independent variable (we provide this)
Y = dependent variable (we observe this)
Parameters:
β0 = Y-intercept
β1 = slope
ε = the disturbance term, which is assumed to be a normal random variable
Why does the disturbance term exist? There are several reasons.
1. Omission of some explanatory variables: the relationship between Y and X is almost certain to be a simplification. In reality there will be other factors affecting Y that have been left out of the model Y = β0 + β1X + ε, and their influence will cause the points to lie off the line. It often happens that there are variables you would like to include in the regression equation but cannot, because you are unable to measure them. All of these other factors contribute to the disturbance term.
2. Aggregation of variables: in many cases the relationship is an attempt to summarize in aggregate a number of microeconomic relationships. For example, the aggregate consumption function is an attempt to summarize a set of individual expenditure decisions. Since the individual relationships are likely to have different parameters, any attempt to relate aggregate expenditure to aggregate income can only be an approximation. The discrepancy is attributed to the disturbance term.
3. Model misspecification: the model may be misspecified in terms of its structure. For example, if the relationship refers to time series data, the value of Y may depend not on the actual value of X but on the value that had been anticipated in the previous period. If the anticipated and actual values are closely related, there will appear to be a relationship between Y and X, but it will only be an approximation, and again the disturbance term will pick up the discrepancy.
4. Functional misspecification: the functional relationship between Y and X may be misspecified mathematically. For example, the true relationship may be nonlinear instead of linear. Obviously, one should try to avoid this problem by using an appropriate mathematical specification, but even the most sophisticated specification is likely to be only an approximation, and the discrepancy contributes to the disturbance term.
5. Measurement error: if the measurement of one or more of the variables in the relationship is subject to error, the observed values will not appear to conform to an exact relationship, and the discrepancy contributes to the disturbance term.
The disturbance term is the collective outcome of all these factors.
Assumptions of Linear Regression: When Can I Fit the Linear Regression Line?
The simple linear regression model assumes that:
1. The relationship between X and Y is linear. [Imagine a quadratic (parabolic) relationship between X and Y. Does it make sense to fit a straight line through this data?]
2. Y is distributed normally at each value of X. [At each X, Y is normally distributed around its mean value at that X.]
3. The variance of Y at every value of X is the same (homogeneity of variances). [Imagine data in the form of a cone: as we move away from the origin, the variance in Y increases drastically. Does it make sense to fit a straight line through this data?]
4. The observations are independent. [If the observations are dependent, there is already a trend in the data, and one trend line is not sufficient to model it.]
Advantages of Regression Analysis
Regression analysis provides estimates of values of the dependent variable from values of the independent variables.
It also helps to obtain a measure of the error involved in using the regression line as a basis for estimation.
It helps in obtaining a measure of the degree of association or correlation that exists between the variables.
Regression Line
The simple regression line is the line which gives the best estimate of one variable from the value of any other given variable. It gives the average relationship between the two variables in mathematical form.
Lines of Regression (Graphically, by the Scatter Plot Method)
(1) Line of regression of y on x: its form is y = a + bx. It is used to estimate y when x is given, where a is the intercept of the line and b is the slope of the line of y on x.
(2) Line of regression of x on y: its form is x = a + by. It is used to estimate x when y is given, where a is the intercept of the line and b is the slope of the line of x on y.
Regression Lines
In the scatter plot, we have seen that if the variables are highly correlated then the points (dots) lie in a narrow strip. If the strip is nearly straight, we can draw a straight line such that all points are close to it from both sides. This line is called the line of best fit if it minimizes the distances of all data points from it. This line is called the line of regression. Prediction is easier because all we need to do is extend the line and read off the value.
Regression Line
To obtain a line of regression, we need a line of best fit. But statisticians don't measure the distances by dropping perpendiculars from points onto the line. They measure deviations (or errors, or residuals, as they are called) (i) vertically and (ii) horizontally. Thus we get two lines of regression, as shown in figures (1) and (2).
The Least Squares Method
The line of regression is the line which gives the best estimate of the value of one variable for any specific value of the other variable. Thus the line of regression is the line of "best fit" and is obtained by the principle of least squares. This principle consists in minimizing the sum of the squares of the deviations of the actual values of y from their estimated values, giving the line of best fit. [i.e., minimize the error sum of squares with respect to the estimated coefficients of the model.]Slide101Slide102Slide103
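The least squares principle above can be sketched in a few lines of code. This is an illustrative sketch, not from the slides; the function name `fit_line` is my own, and the formulas are the usual closed-form solutions of the normal equations.

```python
# Illustrative sketch (names are my own): fit y = a + b*x by least squares
# using the closed-form solutions of the normal equations.
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
    a = (sy - b * sx) / n                          # intercept
    return a, b
```

For points lying exactly on y = 2x, this returns a close to 0 and b close to 2.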
SHOW that TSS = ESS + RSS (or SST = SSE + SSR). Starting from yi − ȳ = (yi − ŷi) + (ŷi − ȳ), squaring and summing (the cross term vanishes for a least squares fit) gives
Σ(yi − ȳ)² = Σ(yi − ŷi)² + Σ(ŷi − ȳ)²Slide104Slide105
Sometimes:
ESS = Error Sum of Squares (or Residual Sum of Squares)
RSS = Regression Sum of Squares (or Explained Sum of Squares)
TSS = Total Sum of Squares
NOTE: In some books, RSS = Residual Sum of Squares, which is the same as the Error Sum of Squares here.Slide106Slide107
A good fit will have:
• SSE (or ESS) minimized.
• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable.
The coefficient of determination is also called R-squared and is denoted R².
R² = RSS/TSS = 1 − ESS/TSS, where 0 ≤ R² ≤ 1.
In the simple regression model, the coefficient of determination is equal to the square of the simple correlation coefficient.Slide108
Standard Error of Estimate
• The standard deviation of the variation of observations around the regression line is estimated by
Se = [ESS/(n − k)]^0.5
where:
ESS = Error Sum of Squares
n = sample size
k = number of parameters in the model (= 2 for the simple regression model).Slide109Slide110Slide111Slide112Slide113Slide114Slide115Slide116Slide117Slide118
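The quantities from the last few slides can be computed together. A hedged sketch (the function name and structure are my own, not from the slides): given a fitted line y = a + bx, it returns the sums of squares, R², and the standard error of estimate.

```python
import math

# Illustrative sketch (names are my own): sums of squares, R-squared and the
# standard error of estimate Se = sqrt(ESS / (n - k)) for a fitted line y = a + b*x.
def goodness_of_fit(xs, ys, a, b, k=2):
    n = len(xs)
    ybar = sum(ys) / n
    yhat = [a + b * x for x in xs]                     # fitted values
    tss = sum((y - ybar) ** 2 for y in ys)             # total SS
    ess = sum((y - f) ** 2 for y, f in zip(ys, yhat))  # error / residual SS
    rss = sum((f - ybar) ** 2 for f in yhat)           # regression / explained SS
    r2 = rss / tss                                     # = 1 - ESS/TSS for an OLS fit
    se = math.sqrt(ess / (n - k))                      # standard error of estimate
    return tss, ess, rss, r2, se
```

For data lying exactly on the fitted line, ESS and Se are 0 and R² is 1, and TSS = ESS + RSS holds as the decomposition above requires.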
ASSUMPTIONS REQUIRED FOR A LINEAR REGRESSION MODEL
1. The mean of the probability distribution of the random error is 0: E(e) = 0. That is, the average of the errors over an infinitely long series of experiments is 0 for each setting of the independent variable x. This assumption states that the mean value of y for a given value of x is E(y) = A + Bx.
2. The variance of the random error is equal to a constant, say σ², for all values of x.
3. The probability distribution of the random error is normal.
4. The errors associated with any two different observations are independent. That is, the error associated with one value of y has no effect on the errors associated with other values.Slide119
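Assumption 1 concerns the unobservable true errors, but it has a small checkable counterpart: for a least squares fit that includes an intercept, the fitted residuals always sum to (numerically) zero, because the normal equation for the intercept forces this. A quick sketch with illustrative data (all names and values are my own):

```python
# For an OLS fit with an intercept, sum(residuals) == 0 up to rounding:
# this follows from the normal equation for the intercept, and is distinct
# from assumption E(e) = 0 about the true errors. Data here are illustrative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
sx, sy = sum(xs), sum(ys)
b = (n * sum(x * y for x, y in zip(xs, ys)) - sx * sy) / (n * sum(x * x for x in xs) - sx ** 2)
a = (sy - b * sx) / n
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(abs(sum(residuals)) < 1e-9)  # True
```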
Formulae (change all variables to capital letters and use):Slide120
Slide121
Example
Fit a least squares line to the following data.Slide122
SolutionSlide123
Lines of Regression by the Method of Least Squares
Line of regression of y on x: y = a + bx,
where b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²] and a = [Σy − bΣx] / n.
Line of regression of x on y: x = a′ + b′y,
where b′ = [nΣxy − (Σx)(Σy)] / [nΣy² − (Σy)²] and a′ = [Σx − b′Σy] / n.Slide124
Example on Regression by the Method of Least Squares
A panel of two judges, A and B, graded dramatic performances by independently awarding marks as follows:
Solution:Slide125
Example on Regression (continued …)Slide126
Example on Regression by the Method of Least Squares (continued …)Slide127Slide128
MATRIX APPROACH
Slide129
Example
Problem: The mid-term and final nitrogen measurements produced by ten treatment plants are given below. Develop a regression equation which may be used to predict the final yield from the mid-term score.

Treatment plant   Mid term   Final
1                 98         90
2                 66         74
3                 100        98
4                 96         88
5                 88         80
6                 45         62
7                 76         78
8                 60         74
9                 74         86
10                82         80
Slide130
Solution

Treatment plant   Mid term (X)   Final (Y)   X²      XY
1                 98             90          9604    8820
2                 66             74          4356    4884
3                 100            98          10000   9800
4                 96             88          9216    8448
5                 88             80          7744    7040
6                 45             62          2025    2790
7                 76             78          5776    5928
8                 60             74          3600    4440
9                 74             86          5476    6364
10                82             80          6724    6560
Total             785            810         64521   65074
Slide131
Numerator of b = 10×65074 − 785×810 = 650740 − 635850 = 14890
Denominator of b = 10×64521 − (785)² = 645210 − 616225 = 28985
Therefore b = 14890/28985 = 0.5137
Numerator of a = 810 − 785×0.5137 = 810 − 403.2545 = 406.7455
Denominator of a = 10Slide132
Thus, value of a = numerator of a / denominator of a = 406.7455/10 = 40.67455.
Considering the formula of the regression equation:
Ŷ = a + bX′
where
Ŷ = predicted value
a, b = values obtained above
X′ = value of X for which the prediction is desired.
Thus, for X′ = 50:
Ŷ = 40.67455 + (0.5137)(50) = 40.67455 + 25.685 = 66.3596Slide133
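The arithmetic of this worked example can be checked in a few lines of code, using the ten mid-term/final pairs from the table (variable names are my own):

```python
# Check of the worked example: b = 14890/28985, a and the prediction at X' = 50.
mid = [98, 66, 100, 96, 88, 45, 76, 60, 74, 82]   # X (mid term)
fin = [90, 74, 98, 88, 80, 62, 78, 74, 86, 80]    # Y (final)
n = len(mid)
sx, sy = sum(mid), sum(fin)                        # 785, 810
sxx = sum(x * x for x in mid)                      # 64521
sxy = sum(x * y for x, y in zip(mid, fin))         # 65074
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)      # 14890 / 28985, about 0.5137
a = (sy - b * sx) / n                              # about 40.67
y_at_50 = a + b * 50                               # about 66.36
```

The exact fractions agree with the slides' rounded values to the stated precision.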
Correlation Analysis vs. Regression Analysis
- Regression gives the average relationship between two variables.
- Correlation need not imply a cause-and-effect relationship between the variables under study; regression analysis clearly indicates the cause-and-effect relationship between the variables.
- There may be nonsense correlation between two variables; there is no such thing as nonsense regression.Slide134
Uses of Correlation and Regression
There are three main uses for correlation and regression.
The first is to test hypotheses about cause-and-effect relationships. In this case, the experimenter determines the values of the X variable and sees whether variation in X causes variation in Y. For example, giving people different amounts of a drug and measuring their blood pressure.
The second main use is to see whether two variables are associated, without necessarily inferring a cause-and-effect relationship.Slide135
In this case, neither variable is determined by the experimenter; both are naturally variable. If an association is found, the inference is that variation in X may cause variation in Y, or variation in Y may cause variation in X, or variation in some other factor may affect both X and Y.
The third common use of linear regression is estimating the value of one variable corresponding to a particular value of the other variable.Slide136
Interpretation of a Linear Regression Equation
This is a foolproof way of interpreting the coefficients of a linear regression
Y = β1 + β2X + ε
when Y and X are variables with straightforward natural units (not logarithms or other functions). The first step is to say that a one-unit increase in X (measured in units of X) will cause a β2-unit increase in Y (measured in units of Y).Slide137
Interpretation of a Linear Regression Equation
The second step is to check what the units of X and Y actually are, and to replace the word "unit" with the actual unit of measurement. The third step is to see whether the result could be expressed in a better way, without altering its substance. The constant, β1, gives the predicted value of Y (in units of Y) for X equal to 0. It may or may not have a plausible meaning, depending on the context.Slide138
TEST OF GOODNESS OF FIT AND CORRELATION
The closer the observations fall to the regression line (i.e., the smaller the residuals), the greater is the variation in Y "explained" by the estimated regression equation. The total variation in Y is equal to the explained plus the residual variation:
TSS = RSS + ESSSlide139
Dividing both sides by TSS gives: 1 = RSS/TSS + ESS/TSS.
The coefficient of determination, R², is then defined as the proportion of the total variation in Y "explained" by the regression of Y on X: R² = 1 − ESS/TSS = RSS/TSS.Slide140
What is the relationship between correlation and regression analysis?
Regression analysis implies (but does not prove) causality between the independent variable X and dependent variable Y. However, correlation analysis implies no causality or dependence but refers simply to the type and degree of association between two variables. For example, X and Y may be highly correlated because of another variable that strongly affects both. Slide141
Thus correlation analysis is a much less powerful tool than regression analysis and is seldom used by itself in the real world. In fact, the main use of correlation analysis is to determine the degree of association found in regression analysis. This is given by the coefficient of determination, which is the square of the correlation coefficient.Slide142
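The claim that R² is the square of the correlation coefficient can be checked numerically in simple regression. A quick sketch with illustrative data (all names and values are my own):

```python
import math

# Illustrative check: in simple regression, Pearson's r squared equals R^2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.5, 4.0, 6.5, 7.0]
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
syy = sum(y * y for y in ys)
sxy = sum(x * y for x, y in zip(xs, ys))

# Pearson correlation coefficient
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# OLS fit, then R^2 = RSS/TSS
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
a = (sy - b * sx) / n
ybar = sy / n
tss = sum((y - ybar) ** 2 for y in ys)
rss = sum((a + b * x - ybar) ** 2 for x in xs)
r_squared = rss / tss

print(abs(r ** 2 - r_squared) < 1e-9)  # True
```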
MULTIPLE REGRESSION
Slide143
Multiple regression analysis is a powerful technique used for predicting the unknown value of a variable from the known values of two or more variables, also called the predictors.
More precisely, multiple regression analysis helps us to predict the value of Y for given values of X2, X3, …, Xk.
For example, the yield of rice per acre depends upon the quality of seed, fertility of soil, fertilizer used, temperature, and rainfall. If one is interested in studying the joint effect of all these variables on rice yield, one can use this technique. An additional advantage of this technique is that it also enables us to study the individual influence of these variables on yield.Slide144
The Multiple Regression Model
In general, the multiple regression equation of Y on X2, X3, …, Xk is given by:
Y = b1 + b2X2 + b3X3 + … + bkXk + ei
Interpreting Regression Coefficients
Here b1 is the intercept and b2, b3, b4, …, bk are analogous to the slope in the linear regression equation and are also called regression coefficients. They can be interpreted the same way as the slope: if b2 = 2.5, it indicates that Y will increase by 2.5 units if X2 increases by 1 unit (the other X's held constant).
The appropriateness of the multiple regression model as a whole can be tested by the F-test in the ANOVA table. A significant F indicates a linear relationship between Y and at least one of the X's.Slide145
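The "matrix approach" slide earlier amounts to solving the normal equations (XᵀX)b = Xᵀy. A self-contained sketch in pure Python (all names are my own; a real analysis would use a statistics library, and the F-test and t-tests are not shown):

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v] for row, v in zip(A, rhs)]  # augmented matrix
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def multiple_regression(rows, y):
    """Fit Y = b1 + b2*X2 + ... + bk*Xk via the normal equations (X'X)b = X'y."""
    Z = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept b1
    k = len(Z[0])
    XtX = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    Xty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(k)]
    return solve(XtX, Xty)
```

Fed data generated exactly from Y = 1 + 2·X2 + 3·X3, this recovers the coefficients [1, 2, 3] up to rounding.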
How Good Is the Regression?
Once a multiple regression equation has been constructed, one can check how good it is (in terms of predictive ability) by examining the coefficient of determination (R²). R² always lies between 0 and 1.
R² (coefficient of determination): all software provides it whenever a regression procedure is run. The closer R² is to 1, the better the model and its prediction.
A related question is whether the independent variables individually influence the dependent variable significantly. Statistically, this is equivalent to testing the null hypothesis that the relevant regression coefficient is zero. This can be done using a t-test. If the t-test of a regression coefficient is significant, it indicates that the variable in question influences