
Slide1

CORRELATION AND REGRESSION

Prepared by T.O. Antwi-Asare

Slide2

Correlation and Regression

Correlation

Scatter Diagram,

Karl Pearson Coefficient of Correlation

Rank Correlation

Limits for the Correlation Coefficient

Definition of Regression

Lines of Regression

Regression Curves

Regression Coefficients and their Properties

Correlation Analysis vs. Regression Analysis

Slide3

Definition of Correlation

Correlation means "co-" (together) and "relation".

Correlation analysis is a statistical technique for measuring the degree, direction, and significance of co-variation between two or more variables.

If there is a relation between two variables, i.e. when one variable changes the other also changes in the same or the opposite direction, we say that the two variables are correlated.

Correlation is thus the study of the existence, magnitude, and direction of the relation between two or more variables.

Slide4

Correlation: Is there any relation between:

fast-food sales and the seasons?

specific crimes and religion?

smoking cigarettes and lung cancer?

maths scores and overall exam scores?

temperature and earthquakes?

advertising cost and the number of items sold?

To answer each question, two sets of corresponding data need to be collected randomly. Let the random variable X represent the first group of data and the random variable Y the second. Question: is there a relationship between the two variables? As a first step we can plot a graph of X against Y.

Slide5

Imagine we have a random sample of scores in a school as follows:

Slide6

In our example, the correlation between x and y can be shown in a scatter diagram:

Slide7

Our aim is to find out whether there is any linear association between X and Y.

In statistics, the technical term for linear association is "correlation". So we are looking to see if there is any correlation between the two scores.

"Linear association" means the variables are considered in their levels, i.e. X with Y, not with Y², Y³, 1/Y, 1/Y², etc., or even ΔY.

Slide8

Correlation analysis consists of three steps

Determining whether any relation exists

if it does, measuring its strength and direction

Testing whether the relationship is significant.

Note:

Sometimes a cause-and-effect relationship is conjectured, but correlation alone is not conclusive. Other, more powerful statistical tools are used to establish causality, e.g. the Granger causality test and the Toda-Yamamoto test.

Slide9

Note:

A significant correlation between an increase in smoking and an increase in lung cancer does not prove that smoking causes cancer.

The proof of a cause-and-effect relationship can be developed only by means of an exhaustive study of the operative elements themselves.

Slide10

Significance of the study of correlation:

Correlation analysis helps to measure the degree and direction of correlation in one number. E.g.: age and height; blood pressure and pulse rate.

When there is close correlation between two variables, the value of one variable can be estimated from the known value of the other. Predictions can be based on correlation analysis. (There are also other methods for forecasting purposes.)

Slide11

Correlation may be due to the following reasons:

It may be due to pure chance, especially in small samples.

The correlated variables may be influenced by similar variables or a common trend. Example: a correlation between the yield per acre of rice and the yield per acre of tea may be due to the fact that both depend on the same amount of rainfall.

Slide12

Correlation may be due to the following reasons:

Both variables may be mutually influencing each other, so that neither can be designated the cause and the other the effect. Example: the correlation between demand and supply, or between price and production.

Correlation may be due to the fact that one variable is the cause and the other variable the effect.

Slide13

Negative correlation: as x increases, y decreases. (The green line just shows the direction.)

X = hours of training of workers (horizontal axis)

Y = number of accidents (vertical axis)

Scatter Plots and Types of Correlation

[Scatter plot: Hours of Training (0 to 20) on the horizontal axis vs. Accidents (0 to 60) on the vertical axis; the points trend downward.]

Slide14

Positive correlation: as x increases, y increases.

X = SAT score (horizontal axis)

Y = GPA (vertical axis)

Scatter Plots and Types of Correlation

[Scatter plot: SAT score (300 to 800) on the horizontal axis vs. GPA (1.50 to 4.00) on the vertical axis; the points trend upward.]

Slide15

No linear correlation

X = height;

Y = IQ

Scatter Plots and Types of Correlation

[Scatter plot: Height (60 to 80) on the horizontal axis vs. IQ (80 to 160) on the vertical axis; the points show no pattern.]

Slide16

Types of Correlation

Positive (direct) and negative (indirect) correlation

Linear and non-linear (curvilinear) correlation

Simple, partial, and multiple correlation

Slide17

Positive and Negative correlation

If two variables change in the same direction, this is called positive correlation.

For example: advertising and sales; height and weight.

If two variables change in the opposite direction, the correlation is called negative correlation. For example: TV registrations and cinema attendance; price and quantity demanded.

Slide18

Direction of the Correlation

Positive relationship: variables change in the same direction (indicated by a + sign).

As X increases, Y increases; as X decreases, Y decreases. E.g., as height increases, so does weight.

Negative relationship: variables change in opposite directions (indicated by a − sign).

As X increases, Y decreases; as X decreases, Y increases. E.g., as TV time increases, grades decrease.

Slide19

More Examples

Positive relationships:

water consumption and day temperature

study time and grades

Negative relationships:

alcohol consumption and driving ability

price and quantity demanded

Slide20

iii) Linear and Non-Linear (Curvilinear) Correlation

If the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable, then the correlation is said to be LINEAR.

If the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable, then the correlation is NON-LINEAR.

However, since the method of analyzing non-linear correlation is very complicated, we usually assume a linear relationship between the variables.

Slide21

Simple, Partial and Multiple Correlation

Simple correlation is a study between two variables.

When three or more variables are studied, it is a problem of either multiple or partial correlation. In multiple correlation, three or more variables are studied simultaneously.

In partial correlation, three or more variables are recognized, but the correlation between any two variables is studied while keeping the effect of the other influencing variable(s) constant.

Slide22

Methods Of Determining Correlation

The most commonly used methods are:

(1) Scatter Graph or Scatter Plot

(2) Pearson's Coefficient of Correlation

(3) Spearman's Rank Correlation Coefficient

(4) Regression analysis (for partial and multiple regression)

Slide23

Scatter Plot Method

In this method the values of the two variables are plotted on graph paper.

One variable is taken along the horizontal (x-axis) and the other along the vertical (y-axis).

By plotting the data, we get points (dots) on the graph which are generally scattered, hence the name 'scatter plot'.

The manner in which these points are scattered suggests the degree and direction of correlation.

The degree of correlation is denoted by 'r' and its direction is given by its sign, positive or negative.
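To illustrate, here is a minimal Python sketch of the scatter-plot method (not part of the original slides; it assumes matplotlib is installed, and the training/accident pairs are invented in the spirit of the later example):

```python
import matplotlib.pyplot as plt

# Hypothetical data: hours of worker training (X) and accidents (Y).
hours = [2, 4, 6, 8, 10, 12, 14, 16]
accidents = [50, 44, 38, 30, 26, 18, 12, 6]

plt.scatter(hours, accidents)
plt.xlabel("Hours of Training")   # independent variable on the x-axis
plt.ylabel("Accidents")           # dependent variable on the y-axis
plt.title("A falling pattern suggests negative correlation")
plt.show()
```

Slide24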

Scatter Plot Method

i) If all points lie on a rising straight line, the correlation is perfectly positive and r = +1 (see fig. 1).

ii) If all points lie on a falling straight line, the correlation is perfectly negative and r = −1 (see fig. 2).

Slide25

Scatter Plot Method

iii) If the points lie in a narrow strip rising upwards, the correlation is positive and of high degree (see fig. 3).

iv) If the points lie in a narrow strip falling downwards, the correlation is negative and of high degree (see fig. 4).

Slide26

Scatter Plot Method

v) If the points are spread widely over a broad strip rising upwards, the correlation is positive but of low degree (see fig. 5).

vi) If the points are spread widely over a broad strip falling downwards, the correlation is negative and of low degree (see fig. 6).

vii) If the points are scattered without any specific pattern, correlation is absent, i.e. r = 0 (see fig. 7).

Slide28

Show fig. 5 and fig. 6 on the board in class. Students could sketch these themselves.

Slide30

Scatter Plot Method

Though this method is simple and gives a rough idea of the existence and degree of correlation, it is not reliable.

Because it is not a mathematical method, it cannot precisely measure the degree of correlation.

Slide31

Pearsonā€™s Coefficient of Correlation

Karl Pearson's coefficient of correlation is denoted by 'r'.

Pearson's 'r' is the most commonly used correlation coefficient.

The coefficient of correlation r measures the degree of linear relationship between two variables, say X and Y.

Slide32

Assumptions for Pearson's Correlation Coefficient:

For each value of X there is a normally distributed subpopulation of Y values.

For each value of Y there is a normally distributed subpopulation of X values.

The joint distribution of X and Y is a normal distribution called the bivariate normal distribution.

The subpopulations of Y values have the same variance, and the subpopulations of X values have the same variance.

Slide33

Some problems with 'r'

The correlation coefficient r is not a good summary of association if the data have very large differences in variance.

The correlation coefficient r is not a good summary of association if the data have outliers.

The correlation coefficient r measures only linear associations.
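To see the outlier problem concretely, here is a small Python sketch (an illustration of the point above, not from the original slides; it assumes numpy, and the data are invented):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2, 1, 4, 3, 5, 4, 6, 5], dtype=float)  # a modest upward trend

r_clean = np.corrcoef(x, y)[0, 1]

# Add a single extreme outlier and recompute.
x_out = np.append(x, 40.0)
y_out = np.append(y, 0.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without outlier: {r_clean:.2f}")    # strongly positive
print(f"r with one outlier: {r_outlier:.2f}") # can even flip sign
```

Slide34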

FORMULA: Pearson's Correlation Coefficient

For a sample,

r = [n∑XY − (∑X)(∑Y)] / √{[n∑X² − (∑X)²] [n∑Y² − (∑Y)²]} = Cov(X, Y) / (SX · SY)

where n is the number of data pairs, X is the independent variable and Y the dependent variable, and

Cov(X, Y) = sample covariance between X and Y

SX = sample standard deviation of X

SY = sample standard deviation of Y
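A direct Python translation of this formula (a minimal sketch using only the standard library; the `pairs` data are the father/son heights from the worked example later in the deck):

```python
from math import sqrt

def pearson_r(pairs):
    """Pearson's r from raw sums, following the formula above."""
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)
    sum_x2 = sum(x * x for x, _ in pairs)
    sum_y2 = sum(y * y for _, y in pairs)
    num = n * sum_xy - sum_x * sum_y
    den = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den

pairs = [(165, 167), (166, 168), (167, 165), (168, 172),
         (167, 168), (169, 172), (170, 169), (172, 171)]
print(round(pearson_r(pairs), 2))  # about 0.60
```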

Slide35

Formula in Deviation Form

Note that x = X − X̄ = deviation of X from its mean, and y = Y − Ȳ = deviation of Y from its mean, so

r = ∑xy / √[(∑x²)(∑y²)]

This is also known as the product-moment coefficient of correlation.

Slide36

Correlation

Through the coefficient of correlation, we can measure the degree or extent of the correlation between two variables.

On the basis of the sign of the coefficient of correlation we can also determine whether the correlation is positive or negative.

Perfect correlation: if two variables change in the same direction and in the same proportion, the correlation between the two is perfect and positive.

Slide37

Degree of Correlation

Absence of correlation: if two variables exhibit no relationship between them, i.e. a change in one variable does not lead to a change in the other, then r = 0.

Limited degree of correlation: if two variables are not perfectly correlated, we term the correlation limited.

Slide38

Karl Pearson's coefficient of correlation

It gives the numerical expression for the measure of correlation and is denoted by 'r'. The value of 'r' gives the magnitude of correlation and its sign denotes its direction.

Slide39

Degrees of Correlation

High degree, moderate degree, and low degree are the three categories of linear correlation. The following table shows how to interpret the coefficient of correlation.

Slide40

Degrees of Correlation

Degree                   Positive            Negative
Absence of correlation   0                   0
Perfect correlation      +1                  −1
High degree              +0.75 to +1         −0.75 to −1
Moderate degree          +0.25 to +0.75      −0.25 to −0.75
Low degree               0 to +0.25          0 to −0.25

Slide41

Karl Pearson's coefficient of correlation

Note: r is also known as the product-moment coefficient of correlation.

Slide42

Karl Pearson's coefficient of correlation

Example: Calculate the coefficient of correlation between the heights of fathers and sons for the following data.

Height of father (cm):  165  166  167  168  167  169  170  172
Height of son (cm):     167  168  165  172  168  172  169  171

Solution: n = 8 (pairs of observations)

Slide43

Father's height x_i   Son's height y_i   x = x_i − x̄   y = y_i − ȳ   xy   x²   y²
165                   167                −3             −2            6    9    4
166                   168                −2             −1            2    4    1
167                   165                −1             −4            4    1    16
167                   168                −1             −1            1    1    1
168                   172                0              3             0    0    9
169                   172                1              3             3    1    9
170                   169                2              0             0    4    0
172                   171                4              2             8    16   4
∑x_i = 1344           ∑y_i = 1352        0              0             ∑xy = 24   ∑x² = 36   ∑y² = 44

Here x̄ = 1344/8 = 168 and ȳ = 1352/8 = 169, so

r = ∑xy / √[(∑x²)(∑y²)] = 24 / √(36 × 44) ≈ 0.60

Slide44

Solved example

Problem: determine whether the number of flowers on a plant is correlated with the height of the plant.

S. No.   Height of plant   Flowers on plant
1        4                 12
2        3                 10
3        4                 13
4        5                 15
5        5                 16
6        4                 11
7        6                 18
8        3                 9
9        5                 14
10       4                 12

Slide45

Slide46

S. No.   Height of plant (x)   Flowers on plant (y)   x²    y²     xy
1        4                     12                     16    144    48
2        3                     10                     9     100    30
3        4                     13                     16    169    52
4        5                     15                     25    225    75
5        5                     16                     25    256    80
6        4                     11                     16    121    44
7        6                     18                     36    324    108
8        3                     9                      9     81     27
9        5                     14                     25    196    70
10       4                     12                     16    144    48
Total    43                    130                    193   1760   582

Slide47

r = [n∑xy − (∑x)(∑y)] / √{[n∑x² − (∑x)²] [n∑y² − (∑y)²]}
  = [10(582) − (43)(130)] / √{[10(193) − 43²] [10(1760) − 130²]}
  = 230 / √(81 × 700)
  ≈ 0.97

There is a high degree of positive correlation between plant height and the number of flowers.

Slide48

Limits for Correlation Coefficient

The Pearson correlation coefficient lies between −1 and +1. Symbolically,

−1 ≤ r ≤ +1

Slide49

Merits and limitations of Pearson's Coefficient of Correlation

It summarizes in one number the degree and direction of correlation.

Karl Pearson's coefficient of correlation assumes a linear relationship, regardless of whether that assumption is correct.

The value of the coefficient is unduly affected by extreme items.

Very often there is a risk of misinterpreting the coefficient, so great care must be exercised when interpreting it.

Slide50

Properties of Pearson's coefficient of correlation

This measure of correlation has interesting properties, some of which are stated below:

It is independent of the units of measurement; it is in fact unit-free. For example, ρ between highest day temperature (in Celsius) and rainfall per day (in mm) is not expressed in either degrees Celsius or mm.

It is symmetric: ρ between X and Y is exactly the same as ρ between Y and X.

Slide51

Properties of Pearson's coefficient of correlation

Pearson's correlation coefficient is independent of changes of origin and scale. Thus ρ between temperature (in Celsius) and rainfall (in mm) would numerically equal ρ between temperature (in Fahrenheit) and rainfall (in cm).

If the variables are independent of each other, then ρ = 0. However, the converse is not true: ρ = 0 does not imply that the variables are independent; it may indicate the existence of a non-linear relationship.
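Both properties are easy to check numerically. A small Python sketch (illustrative only; it assumes numpy, and the temperature/rainfall values are invented):

```python
import numpy as np

celsius = np.array([21.0, 25.0, 28.0, 30.0, 33.0, 35.0])
rain_mm = np.array([80.0, 65.0, 58.0, 50.0, 41.0, 35.0])

# Change of origin and scale: Celsius -> Fahrenheit, mm -> cm.
fahrenheit = celsius * 9 / 5 + 32
rain_cm = rain_mm / 10

r1 = np.corrcoef(celsius, rain_mm)[0, 1]
r2 = np.corrcoef(fahrenheit, rain_cm)[0, 1]
print(np.isclose(r1, r2))  # True: r is unaffected by linear rescaling

# Symmetry: r(X, Y) == r(Y, X).
print(np.isclose(np.corrcoef(rain_mm, celsius)[0, 1], r1))  # True
```

Slide52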

Caveats and Warnings

While ρ is a powerful tool, it is a much-abused one and has to be handled carefully. People often forget or gloss over the fact that ρ is a measure of linear relationship.

Consequently, a small value of ρ is often interpreted to mean non-existence of a relationship, when actually it only indicates non-existence of a linear relationship, or at best a very weak linear relationship. Under such circumstances it is possible that a non-linear relationship exists.

A scatter diagram can reveal such a relationship, and one is well advised to inspect the graph before firmly concluding that no relationship exists.

If the scatter diagram points to a non-linear relationship, an appropriate transformation can often attain linearity, in which case ρ can be recomputed.

Slide54

Caveats and Warnings

One has to be careful in interpreting the value of ρ. For example, one could compute ρ between shoe size and intelligence, or between height and income. Irrespective of the value of ρ, such a correlation makes no sense and is hence termed a chance or nonsense correlation.

ρ should not be used to say anything about cause and effect. Put differently, by examining the value of ρ we could conclude that variables X and Y are related; however, the same value of ρ does not tell us whether X influences Y or the other way round, a fact that is of grave import in regression analysis.

Slide55
Slide56

Significance test for correlation:

Hypothesis:  H0: ρ = 0
             HA: ρ ≠ 0

Test statistic: under the null hypothesis (when the null hypothesis is true),

t = r √(n − 2) / √(1 − r²)

follows a Student's t distribution with n − 2 degrees of freedom.

Decision rule: if we let α = 0.05, the critical value of t is ±w, and we reject H0 if t falls outside −w ≤ t ≤ +w.
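A quick Python sketch of this test (assuming scipy is available; r and n are taken from the father/son example above):

```python
from math import sqrt
from scipy import stats

r, n = 0.60, 8                       # sample correlation and sample size
t = r * sqrt(n - 2) / sqrt(1 - r * r)
w = stats.t.ppf(0.975, df=n - 2)     # critical value for alpha = 0.05

print(f"t = {t:.3f}, critical value = +/-{w:.3f}")
print("reject H0" if abs(t) > w else "fail to reject H0")
```

Slide57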

Spearman's Rank Correlation

Spearman's rank correlation coefficient, or Spearman's rho, is a measure of statistical dependence between two ordinal variables.

Slide58

Spearman Rank Correlation Coefficient

The Spearman rank correlation coefficient is a non-parametric measure of correlation, using the ranks of the variables to calculate the correlation.

Advantages and Caveats

Other measures of correlation are parametric in the sense of being based on variables measured on the ratio scale.

Slide59

Advantages and Caveats

Another advantage of this measure is that it is much easier to use, since it does not matter which way we rank the data, ascending or descending. We may assign rank 1 to the smallest value or to the largest value, provided we do the same for both sets of data.

The only requirement is that the data should be ranked, or at least convertible into ranks.

Slide60

Interpretation

The sign of the Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable).

If Y tends to increase when X increases, the Spearman correlation coefficient is positive. If Y tends to decrease when X increases, the Spearman correlation coefficient is negative. A Spearman correlation of zero indicates that there is no association between X and Y.

Slide61

Repeated Ranks

If there is more than one item with the same value, the tied items are given a common rank, which is the average of the ranks they would otherwise occupy, as in the sketch below.
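A minimal Python sketch of rank correlation with averaged ranks for ties (illustrative, not the slides' own table: it computes Spearman's rho as Pearson's r on the ranks, which handles repeated values; the IQ/TV data are hypothetical):

```python
import numpy as np

def average_ranks(values):
    """Rank data, assigning tied values the average of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):          # average the ranks of ties
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman_rho(x, y):
    rx, ry = average_ranks(x), average_ranks(y)
    return np.corrcoef(rx, ry)[0, 1]

iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
tv = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]
print(round(spearman_rho(iq, tv), 2))
```

Slide62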

Example

The raw data in the table below are used to calculate the correlation between the IQ of a student and the number of hours spent in front of the TV per week.

[The data table and the worked ranking calculation (slides 63-65, images) are omitted.]

Slide66

Correlation and causation

Closely related to confounding variables is the incorrect assumption that, because two things correlate, there is a causal relationship.

Causality is the area of statistics most commonly misused and misinterpreted by non-specialists.

Slide67

Media sources, politicians and lobby groups often leap upon a perceived correlation, and use it to 'prove' their own beliefs.

They fail to understand that, just because results show a correlation, there is no proof of an underlying causality. Many people assume that because a poll or a statistic contains many numbers, it must be scientific and therefore correct.

Slide68

The Cost of Disregarding Correlation and Causation

The principle of incorrectly linking correlation and causation is closely related to post hoc reasoning, where incorrect assumptions generate an incorrect link between two effects.

The principle of correlation and causation is very important for anybody working as a scientist or researcher. It is also a useful principle for non-scientists, especially those studying politics, media, and marketing.

Slide69

Understanding causality promotes a greater understanding and a more honest evaluation of the alleged facts given by pollsters.

Imagine an expensive advertising campaign based on intense market research: misunderstanding a correlation could cost a great deal of money in advertising, production costs, and damage to the company's reputation.

Slide70

Partial correlation analysis involves studying the linear relationship between two variables after excluding the effect of one or more independent factors.

Simple correlation does not prove to be an all-encompassing technique, especially under such circumstances. In order to get a correct picture of the relationship between two variables, we should first eliminate the influence of other variables.

For example, a study of the partial correlation between price and demand would involve studying the relationship between price and demand while excluding the effects of money supply, exports, etc.
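One common way to compute a first-order partial correlation, sketched in Python (an illustration, not the slides' own method: correlate the residuals left after regressing each variable on the control; the price/demand/money series are invented):

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y with the linear effect of z removed."""
    x, y, z = (np.asarray(a, dtype=float) for a in (x, y, z))
    # Residuals of x and y after a least-squares fit on z.
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return np.corrcoef(rx, ry)[0, 1]

price  = [10, 12, 13, 15, 18, 20, 22, 25]
demand = [95, 90, 92, 85, 80, 78, 74, 70]
money  = [50, 52, 53, 56, 60, 63, 66, 70]
print(round(partial_corr(price, demand, money), 2))
```

Slide71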

Limitations

However, this technique suffers from some limitations, some of which are stated below.

The calculation of the partial correlation coefficient is based on the simple correlation coefficient, which assumes a linear relationship.

Generally this assumption is not valid, especially in the social sciences, as linear relationships rarely exist in such phenomena.

Slide72

As the order of the partial correlation coefficient goes up, its reliability goes down.

Its calculation is somewhat cumbersome and often difficult for the mathematically uninitiated (though software has made life a lot easier).

Slide73

REGRESSION

Slide74

Definition of Regression

Regression can be defined as a method that estimates the value of one variable when the values of other variables are known, provided the variables are correlated.

The dictionary meaning of regression is "to go backward." The term was used for the first time by Sir Francis Galton in his research paper "Regression towards mediocrity in hereditary stature."

Regression helps us to estimate one variable (the dependent variable) from other variables (the independent variables).

Slide75

4. According to Blair, "Regression is the measure of the average relationship between two or more variables in terms of the original units of data."

5. According to Wallis and Roberts, "It is often more important to find out what the relation actually is, in order to estimate or predict one variable (the dependent variable); the statistical technique appropriate in such cases is called Regression Analysis."

Slide76

Regression Analysis

6. Regression analysis is the mathematical measure of the average relationship between two or more variables.

Slide77

Regression analysis is a statistical tool for the investigation of relationships between variables.

Usually, the investigator seeks to ascertain the causal effect of one variable upon another: the effect of a price increase upon demand, for example, or the effect of changes in the money supply upon the inflation rate. To explore such issues, the investigator assembles data on the underlying variables of interest and employs regression to estimate the quantitative effect of the causal variables upon the variable that they influence.

Slide78

The investigator also typically assesses the "statistical significance" of the estimated relationships, that is, the degree of confidence that the true relationship is close to the estimated relationship.

Regression techniques have long been central to the field of economic statistics ("econometrics").

Slide79

Regression analysis

A method of modelling the relationships among two or more variables, used to predict the value of one variable given the values of the others. For example, a model might estimate sales based on age and gender. A regression analysis yields an equation that expresses the relationship.

Slide80

The purposes of regression analysis include the determination of the general form of a regression equation, the construction of estimates of unknown parameters occurring in a regression equation, and the testing of statistical regression hypotheses.

Slide81

Summary of ideas on Regression

Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).

In correlation, the two variables are treated as equals. In regression, one or more variables are considered independent (predictor) variables (X) and the other is the dependent (outcome) variable Y.

Dependent variable: usually denoted Y.

Independent variables: denoted X₁, X₂, ..., Xₖ.

Slide82

Regression Lines

They are obtained

(I) graphically, by scatter plot;

(II) mathematically, by the method of least squares (or other methods, e.g. maximum likelihood).

Slide83

SIMPLE REGRESSION MODEL

Y = β₀ + β₁X + ε

Slide84

Y = β₀ + β₁X + ε

The model above is referred to as the simple linear regression model. We are interested in estimating β₀ and β₁ from the data we collect. If you know something about X, this knowledge helps you predict something about Y.

Variables:

X = independent variable (we provide this)

Y = dependent variable (we observe this)

Parameters:

β₀ = Y-intercept

β₁ = slope

ε = the disturbance term, which is assumed to be a normal random variable
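A compact Python sketch of fitting this model by ordinary least squares (assuming numpy; the data are invented for illustration):

```python
import numpy as np

# Hypothetical (X, Y) sample:
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Closed-form least-squares estimates of the slope and intercept:
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"fitted line: Y = {b0:.2f} + {b1:.2f} X")
# The residuals e = y - (b0 + b1 * x) estimate the disturbance term.
```

Slide85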

Why does the disturbance term exist? There are several reasons.

1. Omission of some explanatory variables: the relationship between Y and X is almost certain to be a simplification. In reality there will be other factors affecting Y that have been left out of the model Y = β₀ + β₁X + ε, and their influence will cause the points to lie off the line. It often happens that there are variables you would like to include in the regression equation but cannot, because you are unable to measure them. All of these other factors contribute to the disturbance term.

Slide86

Why does the disturbance term exist? There are several reasons

2. Aggregation of variables: in many cases the relationship is an attempt to summarize in aggregate a number of microeconomic relationships. For example, the aggregate consumption function is an attempt to summarize a set of individual expenditure decisions. Since the individual relationships are likely to have different parameters, any attempt to relate aggregate expenditure to aggregate income can only be an approximation. The discrepancy is attributed to the disturbance term.

Slide87

Why does the disturbance term exist? There are several reasons

3. Model misspecification: the model may be misspecified in terms of its structure. For example, if the relationship refers to time-series data, the value of Y may depend not on the actual value of X but on the value that had been anticipated in the previous period. If the anticipated and actual values are closely related, there will appear to be a relationship between Y and X, but it will only be an approximation, and again the disturbance term will pick up the discrepancy.

Slide88

Why does the disturbance term exist? There are several reasons

4. Functional misspecification: the functional relationship between Y and X may be misspecified mathematically. For example, the true relationship may be nonlinear instead of linear. Obviously, one should try to avoid this problem by using an appropriate mathematical specification, but even the most sophisticated specification is likely to be only an approximation, and the discrepancy contributes to the disturbance term.

Slide89

Why does the disturbance term exist? There are several reasons

5. Measurement error: if the measurement of one or more of the variables in the relationship is subject to error, the observed values will not appear to conform to an exact relationship, and the discrepancy contributes to the disturbance term.

The disturbance term is the collective outcome of all these factors.

Slide90

Assumptions of linear regression: when can I fit the linear regression line?

The simple linear regression model assumes that:

1. The relationship between X and Y is linear. [Imagine a quadratic (parabolic) relationship between X and Y; does it make sense to fit a straight line through such data?]

2. Y is distributed normally at each value of X. [At each X, Y is normally distributed around its mean value at that X.]

Slide91

3. The variance of Y at every value of X is the same (homogeneity of variance). [Imagine data in the form of a cone: as we move away from the origin, the variance in Y increases drastically. Does it make sense to fit a straight line through such data?]

4. The observations are independent. [If the observations are dependent, there is already a trend in the data, and one trend line is not sufficient to model it.]

Slide92

Advantages of Regression Analysis

Regression analysis provides estimates of values of the dependent variables from the values of independent variables.

Regression analysis also helps to obtain a measure of the error involved in using the regression line as a basis for estimation.

Regression analysis helps in obtaining a measure of the degree of association or correlation that exists between the variables.

Slide93

Regression line

The simple regression line is the line which gives the best estimate of one variable from the value of the other given variable.

The simple regression line gives the average relationship between the two variables in mathematical form.

Slide94

Lines of Regression, Graphically (by the Scatter Plot Method)

(1) Line of regression of y on x

Its form is y = a + bx. It is used to estimate y when x is given, where a is the intercept of the line and b is the slope of the line of y on x.

(2) Line of regression of x on y

Its form is x = a + by. It is used to estimate x when y is given, where a is the intercept of the line and b is the slope of the line of x on y.

Slide95

Regression Lines

In a scatter plot, we have seen that if the variables are highly correlated, the points (dots) lie in a narrow strip. If the strip is nearly straight, we can draw a straight line such that all points are close to it from both sides.

This line is called the line of best fit if it minimizes the distances of all data points from it. This line is called the line of regression. Prediction is then easy: all we need to do is extend the line and read off the value.

Slide96

Regression Line

To obtain a line of regression, we need a line of best fit. But statisticians do not measure the distances by dropping perpendiculars from points onto the line.

They measure deviations (or errors, or residuals, as they are called) (i) vertically and (ii) horizontally. Thus we get two lines of regression, as shown in figures (1) and (2).

Slide97

[Figures (slides 97-99, images): the two regression lines, y on x and x on y.]

Slide100

The Least Squares Method

The line of regression is the line which gives the best estimate of the value of one variable for any specified value of the other variable. Thus the line of regression is the line of "best fit", and it is obtained by the principle of least squares. This principle consists in minimizing the sum of the squares of the deviations of the actual values of y from their estimated values. [I.e., minimize the error sum of squares with respect to the estimated coefficients of the model.]

[Slides 101-102 (images) omitted.]

Slide103

Show that TSS = ESS + RSS (or SST = SSE + SSR), i.e. that the total variation decomposes as

∑(yᵢ − ȳ)² = ∑(ŷᵢ − ȳ)² + ∑(yᵢ − ŷᵢ)²

Slide105

Sometimes:

ESS = Error Sum of Squares, or Residual Sum of Squares

RSS = Regression Sum of Squares, or Explained Sum of Squares

TSS = Total Sum of Squares

NOTE: in some books, RSS = Residual Sum of Squares, which is the same as the Error Sum of Squares.

Slide107

A good fit will have the SSE (ESS) minimized.

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. It is also called R-squared and is denoted R².

R² = SSR/SST = 1 − ESS/TSS, where 0 ≤ R² ≤ 1.

In the simple regression model, the coefficient of determination is equal to the square of the simple correlation coefficient.
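A short Python sketch computing these sums of squares, R², and (anticipating the next slide) the standard error of estimate, for an assumed fitted line (illustrative data, assuming numpy):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y - y_hat) ** 2)         # error (residual) sum of squares
rss = np.sum((y_hat - y.mean()) ** 2)  # regression (explained) sum of squares

r_squared = 1 - ess / tss              # equals rss / tss
se = np.sqrt(ess / (len(x) - 2))       # standard error of estimate, k = 2
print(round(r_squared, 4), round(se, 4))
```

Slide108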

Standard Error of Estimate

The standard deviation of the variation of observations around the regression line is estimated by

Se = √[ESS / (n − k)]

where ESS = error sum of squares, n = sample size, and k = number of parameters in the model (k = 2 for the simple regression model).

Slide109
[Slides 109-117 (images) omitted.]

Slide118

ASSUMPTIONS REQUIRED FOR A LINEAR REGRESSION MODEL

1. The mean of the probability distribution of the random error is 0: E(ε) = 0. That is, the average of the errors over an infinitely long series of experiments is 0 for each setting of the independent variable x. This assumption states that the mean value of y for a given value of x is E(y) = A + Bx.

2. The variance of the random error is equal to a constant, say σ², for all values of x.

3. The probability distribution of the random error is normal.

4. The errors associated with any two different observations are independent. That is, the error associated with one value of y has no effect on the errors associated with other values.

Slide119

Formulae: change all variables to capital letters and use.

Slide120

[Slide 120 (formula images) omitted.]

Slide121

Example

Fit a least-squares line to the following data. [The data table and solution (slides 121-122, images) are omitted.]

Slide123

Lines of Regression, by the Method of Least Squares

Line of regression of y on x:

y − ȳ = byx (x − x̄),  where byx = r (σy / σx) = ∑xy / ∑x²  (x, y in deviation form)

Line of regression of x on y:

x − x̄ = bxy (y − ȳ),  where bxy = r (σx / σy) = ∑xy / ∑y²

Slide124

Example on Regression, by the Method of Least Squares

A panel of two judges, A and B, graded dramatic performances by independently awarding marks as follows. [The marks table and the worked solution (slides 124-127, images) are omitted.]

Slide128

MATRIX APPROACH

[Slide 128 content (images) omitted.]

Slide129

Example

Problem: nitrogen produced by ten treatment plants was measured at mid-term and at the final stage. Develop a regression equation which may be used to predict the final value from the mid-term value.

Treatment plant   Mid-term   Final
1                 98         90
2                 66         74
3                 100        98
4                 96         88
5                 88         80
6                 45         62
7                 76         78
8                 60         74
9                 74         86
10                82         80

Slide130

Solution

Treatment plant   Mid-term (X)   Final (Y)   X²      XY
1                 98             90          9604    8820
2                 66             74          4356    4884
3                 100            98          10000   9800
4                 96             88          9216    8448
5                 88             80          7744    7040
6                 45             62          2025    2790
7                 76             78          5776    5928
8                 60             74          3600    4440
9                 74             86          5476    6364
10                82             80          6724    6560
Total             785            810         64521   65074

Slide131

Numerator of b = 10 × 65074 − 785 × 810 = 650740 − 635850 = 14890

Denominator of b = 10 × 64521 − (785)² = 645210 − 616225 = 28985

Therefore b = 14890 / 28985 = 0.5137

Numerator of a = 810 − 785 × 0.5137 = 810 − 403.2545 = 406.7455

Denominator of a = 10

Slide132

Thus, value of a = numerator of a / denominator of a = 406.7455 / 10 = 40.67455.

Considering the formula of the regression equation Y = a + bX':

Y = predicted value
a, b = values obtained above
X' = value of the predictor for which a prediction is desired

Thus, for X' = 50:

Ŷ = 40.67455 + (0.5137)(50) = 40.6746 + 25.685 = 66.3596
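The same fit can be checked in a few lines of Python (a verification sketch, assuming numpy):

```python
import numpy as np

mid = [98, 66, 100, 96, 88, 45, 76, 60, 74, 82]
fin = [90, 74, 98, 88, 80, 62, 78, 74, 86, 80]

b, a = np.polyfit(mid, fin, 1)   # slope first, then intercept
print(round(a, 4), round(b, 4))  # about 40.6746 and 0.5137
print(round(a + b * 50, 4))      # predicted final value at X' = 50
```

Slide133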

Correlation analysis vs. regression analysis

Regression is the average relationship between two variables.

Correlation need not imply a cause-and-effect relationship between the variables under study; regression analysis more clearly indicates the cause-and-effect relationship between the variables.

There may be nonsense correlation between two variables; there is no such thing as nonsense regression.

Slide134

Uses of Correlation and Regression

There are three main uses for correlation and regression.

One is to test hypotheses about cause-and-effect relationships. In this case, the experimenter determines the values of the X variable and sees whether variation in X causes variation in Y, for example giving people different amounts of a drug and measuring their blood pressure.

The second main use is to see whether two variables are associated, without necessarily inferring a cause-and-effect relationship.

Slide135

In this case, neither variable is determined by the experimenter; both are naturally variable. If an association is found, the inference is that variation in X may cause variation in Y, or variation in Y may cause variation in X, or variation in some other factor may affect both X and Y.

The third common use of linear regression is estimating the value of one variable corresponding to a particular value of the other variable.

Slide136

Interpretation of a Linear Regression Equation

This is a foolproof way of interpreting the coefficients of a linear regression

Y = β₀ + β₁X + ε

when Y and X are variables with straightforward natural units (not logarithms or other functions). The first step is to say that a one-unit increase in X (measured in units of X) will cause a β₁-unit increase in Y (measured in units of Y).

Slide137

Interpretation of a Linear Regression Equation

The second step is to check what the units of X and Y actually are, and to replace the word "unit" with the actual unit of measurement. The third step is to see whether the result could be expressed in a better way without altering its substance.

The constant β₀ gives the predicted value of Y (in units of Y) for X equal to 0. It may or may not have a plausible meaning, depending on the context.

Slide138

TEST OF GOODNESS OF FIT AND CORRELATION

The closer the observations fall to the regression line (i.e., the smaller the residuals), the greater is the variation in Y "explained" by the estimated regression equation. The total variation in Y is equal to the explained plus the residual variation:

TSS = RSS + ESS

Slide139

Dividing both sides by TSS gives: 1 = RSS/TSS + ESS/TSS.

The coefficient of determination, R², is then defined as the proportion of the total variation in Y "explained" by the regression of Y on X:

R² = 1 − ESS/TSS = RSS/TSS

Slide140

What is the relationship between correlation and regression analysis?

Regression analysis implies (but does not prove) causality between the independent variable X and the dependent variable Y. Correlation analysis, by contrast, implies no causality or dependence but refers simply to the type and degree of association between two variables. For example, X and Y may be highly correlated because of another variable that strongly affects both.

Slide141

Thus correlation analysis is a much less powerful tool than regression analysis and is seldom used by itself in practice. In fact, the main use of correlation analysis is to determine the degree of association found in regression analysis. This is given by the coefficient of determination, which is the square of the correlation coefficient.

Slide142

MULTIPLE REGRESSION

Slide143

Multiple regression analysis is a powerful technique used for predicting the unknown value of a variable from the known values of two or more variables, also called the predictors.

More precisely, multiple regression analysis helps us to predict the value of Y for given values of X₂, X₃, ..., Xₖ.

For example, the yield of rice per acre depends upon the quality of seed, the fertility of the soil, the fertilizer used, temperature, and rainfall. If one is interested in studying the joint effect of all these variables on rice yield, one can use this technique. An additional advantage is that it also enables us to study the individual influence of each of these variables on yield.

Slide144

The Multiple Regression Model

In general, the multiple regression equation of Y on X₂, X₃, ..., Xₖ is given by:

Y = b₁ + b₂X₂ + b₃X₃ + ... + bₖXₖ + eᵢ

Interpreting Regression Coefficients

Here b₁ is the intercept, and b₂, b₃, b₄, ..., bₖ are analogous to the slope in the linear regression equation and are also called regression coefficients. They can be interpreted the same way as a slope. Thus if b₂ = 2.5, it indicates that Y will increase by 2.5 units if X₂ increases by 1 unit, holding the other X's constant.

The appropriateness of the multiple regression model as a whole can be tested by the F-test in the ANOVA table. A significant F indicates a linear relationship between Y and at least one of the X's.
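A brief Python sketch of fitting such a model by least squares (illustrative only: numpy's lstsq applied to a design matrix with a column of ones for the intercept; the rice-yield data are invented):

```python
import numpy as np

# Hypothetical data: rice yield with two predictors (fertilizer, rainfall).
x2 = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 28.0])
x3 = np.array([50.0, 45.0, 60.0, 55.0, 70.0, 65.0, 75.0, 80.0])
y  = np.array([30.0, 32.0, 38.0, 40.0, 47.0, 46.0, 52.0, 57.0])

# Design matrix [1, X2, X3], so b[0] is the intercept b1.
X = np.column_stack([np.ones_like(x2), x2, x3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print("b1 (intercept), b2, b3:", np.round(b, 3))
print("prediction at X2=20, X3=60:", round(b[0] + b[1] * 20 + b[2] * 60, 2))
```

Slide145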

How Good Is the Regression?

Once a multiple regression equation has been constructed, one can check how good it is (in terms of predictive ability) by examining the coefficient of determination (R²). R² always lies between 0 and 1.

All regression software reports R². The closer R² is to 1, the better the model and its predictions.

A related question is whether the independent variables individually influence the dependent variable significantly. Statistically, this is equivalent to testing the null hypothesis that the relevant regression coefficient is zero. This can be done using a t-test: if the t-test of a regression coefficient is significant, it indicates that the variable in question influences the dependent variable significantly.