Chapter 4 Describing Bivariate Numerical Data

Uploaded On 2022-06-11



Presentation Transcript

Slide1

Chapter 4

Describing Bivariate Numerical Data

Created by Kathy Fritz

Slide2

Forensic scientists must often estimate the age of an unidentified crime victim. Prior to 2010, this was usually done by analyzing teeth and bones, and the resulting estimates were not very reliable. A study described in the paper “Estimating Human Age from T-Cell DNA Rearrangements” (Current Biology [2010]) examined the relationship between age and a measure based on a blood test.

Age and the blood test measure were recorded for 195 people ranging in age from a few weeks to 80 years. A scatterplot of the data appears to the right.

Do you think there is a relationship? If so, what kind? If not, why not?

This line can be used to estimate the age of a crime victim from a blood test.

Slide3

Correlation

Pearson’s Sample Correlation Coefficient

Properties of r

Slide4

Does it look like there is a relationship between the two variables?

If so, is the relationship linear?

Yes

Yes

Slide5

Does it look like there is a relationship between the two variables?

If so, is the relationship linear?

Yes

Yes

Slide6

Does it look like there is a relationship between the two variables?

If so, is the relationship linear?

Yes

No, looks curved

Slide7

Does it look like there is a relationship between the two variables?

If so, is the relationship linear?

Yes

No, looks parabolic

Slide8

Does it look like there is a relationship between the two variables?

If so, is the relationship linear?

No

Slide9

Linear relationships can be either positive or negative in direction.

Are these linear relationships positive or negative?

Positive

Negative

Slide10

When the points in a scatterplot tend to cluster tightly around a line, the relationship is described as strong.

Try to order the scatterplots from the strongest relationship to the weakest.

These four scatterplots (A, B, C, D) were constructed using data from graphs in Archives of General Psychiatry (June 2010).

Answer: A, C, B, D

Slide11

Pearson’s Sample Correlation Coefficient

An important definition!

Usually referred to as just the correlation coefficient
Denoted by r
Measures the strength and direction of a linear relationship between two numerical variables

The strongest values of the correlation coefficient are r = +1 and r = -1. The weakest value of the correlation coefficient is r = 0.

Slide12

Properties of r

1. The sign of r matches the direction of the linear relationship.

[Two scatterplots, labeled "r is positive" and "r is negative"]

Slide13

Properties of r

2. The value of r is always greater than or equal to -1 and less than or equal to +1.

[Scale from -1 to +1 with regions labeled "Strong correlation", "Moderate correlation", and "Weak correlation"]

Slide14

Properties of r

3. r = 1 only when all the points in the scatterplot fall on a straight line that slopes upward. Similarly, r = -1 only when all the points fall on a downward sloping line.

Slide15

Properties of r

4. r is a measure of the extent to which x and y are linearly related.

Find the correlation for these points. Sketch the scatterplot, then compute the correlation coefficient.

x     2    4    6    8   10   12   14
y    40   20    8    4    8   20   40

r = 0

Does this mean that there is NO relationship between these points?

r = 0, but the data set has a definite relationship!
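
A short Python sketch can confirm the r = 0 result for these seven points. The helper name `pearson_r` is chosen here for illustration only; it applies the usual sum-of-products form of Pearson's coefficient.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient for paired lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# The seven points from the slide: y is symmetric about x = 8.
x = [2, 4, 6, 8, 10, 12, 14]
y = [40, 20, 8, 4, 8, 20, 40]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

Because the y-values are mirror images on either side of the mean of x, the positive and negative products cancel, even though the points follow a clear curve.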

Slide16

Properties of r

5. The value of r does not depend on the unit of measurement for either variable.

Calculate r for the data set of mares’ weights and the weights of their foals.

Mare Weight (kg)   Mare Weight (lb)   Foal Weight (kg)
556                1223.2             129.0
638                1403.6             119.0
588                1293.6             132.0
550                1210.0             123.5
580                1276.0             112.0
642                1412.4             113.5
568                1249.6              95.0
642                1412.4             104.0
556                1223.2             104.0
616                1355.2              93.5
549                1207.8             108.5
504                1108.8              95.0
515                1133.0             117.5
551                1212.2             128.0
594                1306.8             127.5

With mare weight in kg: r = -0.00359

Change the mare weights to pounds by multiplying kg by 2.2 and calculate r.

With mare weight in lb: r = -0.00359
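
The unit-invariance property can be checked numerically. The sketch below is a minimal illustration, assuming the kilogram values as listed on the slide and a conversion factor of 2.2; `pearson_r` is a hypothetical helper name. The particular digits printed depend on the table as transcribed; the point is that the two results agree.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's sample correlation coefficient for paired lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

mare_kg = [556, 638, 588, 550, 580, 642, 568, 642, 556, 616,
           549, 504, 515, 551, 594]
foal_kg = [129.0, 119.0, 132.0, 123.5, 112.0, 113.5, 95.0, 104.0,
           104.0, 93.5, 108.5, 95.0, 117.5, 128.0, 127.5]

# Rescale the mare weights to pounds; foal weights stay in kg.
mare_lb = [2.2 * w for w in mare_kg]

r_kg = pearson_r(mare_kg, foal_kg)
r_lb = pearson_r(mare_lb, foal_kg)
print(f"r with mare weight in kg: {r_kg:.5f}")
print(f"r with mare weight in lb: {r_lb:.5f}")
```

Multiplying every x-value by a positive constant rescales both the numerator and the denominator of r by the same factor, so the ratio is unchanged.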

Slide17

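Calculating Correlation Coefficient

The correlation coefficient is calculated using the following formula:

$r = \dfrac{\sum z_x z_y}{n - 1}$

where

$z_x = \dfrac{x - \bar{x}}{s_x}$ and $z_y = \dfrac{y - \bar{y}}{s_y}$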

Slide18

The web site www.collegeresults.org (The Education Trust) publishes data on U.S. colleges and universities. The following six-year graduation rates and student-related expenditures per full-time student for 2007 were reported for the seven primarily undergraduate public universities in California with enrollments between 10,000 and 20,000.

Here is the scatterplot:

Does the relationship appear linear?

Explain.

Expenditures       8810   7780   8112   8149   8477   7342   7984
Graduation rates   66.1   52.4   48.9   48.1   42.0   38.3   31.3

Slide19

College Expenditures Continued:

To compute the correlation coefficient, first find the z-scores.

x      y      z_x     z_y     z_x z_y
8810   66.1    1.52    1.74    2.64
7780   52.4   -0.66    0.51   -0.34
8112   48.9    0.04    0.19    0.01
8149   48.1    0.12    0.12    0.01
8477   42.0    0.81   -0.42   -0.34
7342   38.3   -1.59   -0.76    1.21
7984   31.3   -0.23   -1.38    0.32

 

 

To interpret the correlation coefficient, use the definition: there is a positive, moderate linear relationship between six-year graduation rates and student-related expenditures.
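
The z-score calculation for the college data can be reproduced in a few lines. This is a sketch using Python's standard `statistics` module (sample standard deviation, divisor n - 1); the variable names are illustrative. The result should come out near 0.58, consistent with the "moderate" description.

```python
from statistics import mean, stdev

expenditures = [8810, 7780, 8112, 8149, 8477, 7342, 7984]   # x
grad_rates = [66.1, 52.4, 48.9, 48.1, 42.0, 38.3, 31.3]     # y

n = len(expenditures)
x_bar, s_x = mean(expenditures), stdev(expenditures)
y_bar, s_y = mean(grad_rates), stdev(grad_rates)

# Standardize each variable, then average the products with divisor n - 1.
z_x = [(x - x_bar) / s_x for x in expenditures]
z_y = [(y - y_bar) / s_y for y in grad_rates]
r = sum(zx * zy for zx, zy in zip(z_x, z_y)) / (n - 1)

for zx, zy in zip(z_x, z_y):
    print(f"{zx:6.2f} {zy:6.2f} {zx * zy:6.2f}")
print(f"r = {r:.3f}")
```

Small differences from the table's product column are expected, since the slide rounds each z-score to two decimals before multiplying.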

Slide20

How the Correlation Coefficient Measures the Strength of a Linear Relationship

[Scatterplot divided into regions at the means of x and y, with these labels:]
z_x is positive, z_y is positive: z_x z_y is positive
z_x is negative, z_y is negative: z_x z_y is positive
z_x is negative, z_y is positive: z_x z_y is negative

Will the sum of z_x z_y be positive or negative?

Slide21

How the Correlation Coefficient Measures the Strength of a Linear Relationship

[Scatterplot divided into regions at the means of x and y, with these labels:]
z_x is positive, z_y is positive: z_x z_y is positive
z_x is negative, z_y is negative: z_x z_y is positive
z_x is negative, z_y is positive: z_x z_y is negative

Will the sum of z_x z_y be positive or negative?

Slide22

How the Correlation Coefficient Measures the Strength of a Linear Relationship

Will the sum of z_x z_y be positive or negative or zero?

Slide23

Does a value of r close to 1 or -1 mean that a change in one variable causes a change in the other variable?

Consider the following examples:

The relationship between the number of cavities in a child’s teeth and the size of his or her vocabulary is strong and positive. So does this mean I should feed children more candy to increase their vocabulary? (These variables are both strongly related to the age of the child.)

Consumption of hot chocolate is negatively correlated with crime rate. Should we all drink more hot chocolate to lower the crime rate? (Both are responses to cold weather.)

Association does NOT imply causation.

Causality can only be shown by carefully controlling values of all variables that might be related to the ones under study. In other words, with a well-controlled, well-designed experiment.

Slide24

Linear Regression

Least Squares Regression Line

Slide25

Suppose there is a relationship between two numerical variables.

Let x be the amount spent on advertising and y be the amount of sales for the product during a given period.

You might want to predict product sales (y) for a month when the amount spent on advertising is $10,000 (x).

The letter y is used to denote the variable you want to predict, called the response variable (or dependent variable).

The other variable, denoted by x, is the predictor variable (sometimes called independent or explanatory variable).

Slide26

The equation of a line is:

$y = a + bx$

Where:

b is the slope of the line
  it is the amount by which y increases when x increases by 1 unit

a is the intercept (also called y-intercept or vertical intercept)
  it is the height of the line above x = 0
  in some contexts, it is not reasonable to interpret the intercept

Slide27

The Deterministic Model

Notice, the y-value is determined by substituting the x-value into the equation of the line.

Also notice that the points fall on the line.

We often say x determines y.

But, when we fit a line to data, do all the points fall on the line?

Slide28

How do you find an appropriate line for describing a bivariate data set?

[Scatterplot with the line y = 10 + 2x]

To assess the fit of a line, we look at how the points deviate vertically from the line.

This point is (20,45). The predicted value for y when x = 20 is:

$\hat{y} = 10 + 2(20) = 50$

The deviation of the point (20,45) from the line is: 45 - 50 = -5

What is the meaning of a negative deviation?

The point (15,44) has a deviation of +4. What is the meaning of this deviation?

To assess the fit of a line, we need a way to combine the n deviations into a single measure of fit.

Slide29

Least squares regression line

The most widely used measure of the fit of a line y = a + bx to bivariate data is the sum of the squared deviations about the line:

$\sum [y - (a + bx)]^2$

The least squares regression line is the line that minimizes the sum of squared deviations.

The slope of the least squares regression line is:

$b = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

and the y-intercept is:

$a = \bar{y} - b\bar{x}$

The equation of the least squares regression line is:

$\hat{y} = a + bx$

Slide30

Let’s investigate the meaning of the least squares regression line. Suppose we have a data set that consists of the observations (0,0), (3,10) and (6,2).

Use a calculator to find the least squares regression line.

Find the vertical deviations from the line: -3, 6, -3

What is the sum of the deviations from the line? Will the sum always be zero?

Find the sum of the squares of the deviations from the line: Sum of the squares = 54

The line that minimizes the sum of squared deviations is the least squares regression line.

Hmmmmm . . . Why does this seem so familiar?
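
The three-point example can be worked without a calculator. The sketch below applies the standard least squares slope and intercept formulas; `fit_line` is a hypothetical helper name, not something from the slides.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

x = [0, 3, 6]
y = [0, 10, 2]

a, b = fit_line(x, y)
deviations = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sum_of_squares = sum(d ** 2 for d in deviations)

print(f"y-hat = {a:.3f} + {b:.3f} x")
print("deviations:", [round(d, 3) for d in deviations])
print("sum of deviations:", round(sum(deviations), 3))
print("sum of squared deviations:", round(sum_of_squares, 3))
```

The deviations match the -3, 6, -3 on the slide and sum to zero, which is always the case for a least squares line that includes an intercept.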

Slide31

Pomegranate, a fruit native to Persia, has been used in the folk medicines of many cultures to treat various ailments. Researchers are now investigating whether pomegranate's antioxidant properties are useful in the treatment of cancer.

In one study, mice were injected with cancer cells and randomly assigned to one of three groups: plain water, water supplemented with 0.1% pomegranate fruit extract (PFE), and water supplemented with 0.2% PFE. The average tumor volume for mice in each group was recorded for several points in time.

x = number of days after injection of cancer cells in mice assigned to plain water
y = average tumor volume (in mm³)

x    11    15    19    23    27
y   150   270   450   580   740

Sketch a scatterplot for this data set.

[Scatterplot: Average tumor volume vs. Number of days after injection]

Slide32

 

 

Interpretation of slope:

The average volume of the tumor increases by approximately 37.25 mm³ for each one-day increase in the number of days after injection.

Does the intercept have meaning in this context? Why or why not?

Computer software and graphing calculators can calculate the least squares regression line.

Slide33

Pomegranate study continued

Predict the average volume of the tumor for 20 days after injection.

Predict the average volume of the tumor for 5 days after injection.

 

 

Why? Can volume be negative?

This is the danger of extrapolation. The least squares line should not be used to make predictions for y using x-values outside the range in the data set.

It is unknown whether the pattern observed in the scatterplot continues outside the range of x-values.
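
The pomegranate predictions can be reproduced from the five data points. This is a minimal sketch using the least squares formulas; `fit_line` and `in_range` are illustrative names introduced here. It fits the line, predicts at 20 days (inside the observed range of 11 to 27 days) and at 5 days (outside it).

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

days = [11, 15, 19, 23, 27]            # x: days after injection
volume = [150, 270, 450, 580, 740]     # y: average tumor volume (mm^3)

a, b = fit_line(days, volume)

def in_range(x):
    """True when x lies within the observed x-values."""
    return min(days) <= x <= max(days)

for x_new in (20, 5):
    note = "" if in_range(x_new) else "  (extrapolation)"
    print(f"x = {x_new}: predicted volume = {a + b * x_new:.2f}{note}")

pred_20 = a + b * 20
pred_5 = a + b * 5
```

The prediction at 5 days is negative, which is impossible for a volume and illustrates why the line should not be trusted outside the observed range.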

Slide34

Why is the line used to summarize a linear relationship called the least squares regression line?

This terminology comes from the relationship between the least squares line and the correlation coefficient.

An alternate expression for the slope b is:

$b = r\,\dfrac{s_y}{s_x}$

The least squares regression line passes through the point of averages $(\bar{x}, \bar{y})$.

Using the point-slope form of a line, we can substitute the alternative slope and the point of averages:

$\hat{y} - \bar{y} = r\,\dfrac{s_y}{s_x}(x - \bar{x})$

which is

$\hat{y} = \bar{y} + r\,\dfrac{s_y}{s_x}(x - \bar{x})$

If r = 1, what do you know about the location of the points?

Suppose that a point on the line is one standard deviation above the mean of x. The value of this point would be $x = \bar{x} + s_x$. Substitute this value for x in our equation with r = 1:

$\hat{y} = \bar{y} + \dfrac{s_y}{s_x}(\bar{x} + s_x - \bar{x}) = \bar{y} + s_y$

Notice that when r = 1, the y-value will be one standard deviation above the mean of y, $\bar{y} + s_y$, for an x-value one standard deviation above the mean of x, $\bar{x} + s_x$.

 

Slide35

Why is the line used to summarize a linear relationship called the least squares regression line?

Let’s investigate what happens when r < 1.

Suppose r = 0.5 and $x = \bar{x} + s_x$. Substitute these values in our equation:

$\hat{y} = \bar{y} + 0.5\,\dfrac{s_y}{s_x}(\bar{x} + s_x - \bar{x}) = \bar{y} + 0.5\,s_y$

Notice that when r = 0.5, the y-value will be one-half standard deviation above the mean of y, $\bar{y} + 0.5\,s_y$, for an x-value one standard deviation above the mean of x, $\bar{x} + s_x$.

Using the least squares line, the predicted y is pulled back in (or regressed) toward $\bar{y}$.

What would happen if r = 0.4? . . . 0.3? . . . 0.2?
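
The closing question can be explored numerically. The sketch below assumes made-up summary statistics (the means and standard deviations are hypothetical, chosen only for illustration) and uses the slope form b = r(s_y / s_x) with the line passing through the point of averages.

```python
# Hypothetical summary statistics, for illustration only.
X_BAR, S_X = 50.0, 10.0
Y_BAR, S_Y = 100.0, 20.0

def predict(x, r):
    """Least squares prediction written in terms of r and the summary statistics."""
    return Y_BAR + r * (S_Y / S_X) * (x - X_BAR)

x_one_sd_above = X_BAR + S_X
for r in (1.0, 0.5, 0.4, 0.3, 0.2):
    y_hat = predict(x_one_sd_above, r)
    sds_above = (y_hat - Y_BAR) / S_Y
    print(f"r = {r}: predicted y is {sds_above:.1f} SDs above the mean of y")
```

As r shrinks, the prediction for an x-value one standard deviation above its mean moves closer to the mean of y: it sits r standard deviations above, whatever the units.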

Slide36

If you want to predict x from y, can you use the least squares line of y on x?

The regression line of y on x should not be used to predict x, because it is not the line that minimizes the sum of the squared deviations in the x direction.

The slope of the least squares line for predicting x will be $r\,\dfrac{s_x}{s_y}$, not $\dfrac{1}{r}\cdot\dfrac{s_x}{s_y}$ (the reciprocal of the slope for predicting y).

Also, the intercepts of the lines are almost always different.

 

Slide37

Assessing the Fit of a Line

Residuals

Residual Plots

Outliers and Influential Points

Coefficient of Determination

Standard Deviation about the Line

Slide38

Assessing the fit of a line

Once the least squares regression line is obtained, the next step is to examine how effectively the line summarizes the relationship between x and y.

Important questions are:

Is the line an appropriate way to summarize the relationship between x and y?
Are there any unusual aspects of the data set that you need to consider before proceeding to use the least squares regression line to make predictions?
If you decide that it is reasonable to use the line as a basis for prediction, how accurate can you expect predictions to be?

This section will look at graphical and numerical methods to answer these questions.

Slide39

Residuals

Recall the vertical deviations of points from the least squares regression line. These deviations are also called residuals.

residual = $y - \hat{y}$ (observed y minus predicted y)

 

Slide40

In a study, researchers were interested in how the distance a deer mouse will travel for food (y) is related to the distance from the food to the nearest pile of fine woody debris (x). Distances were measured in meters.

Calculate the predicted y and the residuals.

Distance from Debris (x)   Distance Traveled (y)   Predicted y   Residual
6.94                        0.00                   14.76         -14.76
5.23                        6.13                    9.23          -3.10
5.21                       11.29                    9.16           2.13
7.10                       14.35                   15.28          -0.93
8.16                       12.03                   18.70          -6.67
5.50                       22.72                   10.10          12.62
9.19                       20.11                   22.04          -1.93
9.05                       26.16                   21.58           4.58
9.36                       30.65                   22.59           8.06

[Scatterplot: Distance traveled vs. Distance to debris, with the fitted line]

If the point is below the line the residual will be negative.

If the point is above the line the residual will be positive.

Minitab was used to fit the least squares regression line. The regression line is:

$\hat{y} = -7.69 + 3.234x$
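
The predicted values and residuals can be computed directly. This sketch assumes the fitted line y-hat = -7.69 + 3.234x, with coefficients taken from the Minitab output reported later in the deck; because those coefficients are rounded, the results may differ from the slide's table in the last decimal place.

```python
# Coefficients as reported in the Minitab output for the deer mouse data.
A, B = -7.69, 3.234

distance_to_debris = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]    # x
distance_traveled = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]  # y

predicted = [A + B * x for x in distance_to_debris]
residuals = [y - y_hat for y, y_hat in zip(distance_traveled, predicted)]

print("    x      y   y-hat  residual")
for x, y, y_hat, res in zip(distance_to_debris, distance_traveled,
                            predicted, residuals):
    side = "above" if res > 0 else "below"
    print(f"{x:5.2f} {y:6.2f} {y_hat:7.2f} {res:9.2f}  ({side} the line)")
```

Each residual is observed y minus predicted y, so points below the line get negative residuals and points above it get positive ones.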

 

Slide41

Residual plots

A careful look at the residuals can reveal many potential problems. A residual plot is a graph of the residuals.

A residual plot is a scatterplot of the (x, residual) pairs.
Residuals can also be graphed against the predicted y-values.
Isolated points or a pattern of points in the residual plot indicate potential problems.

Slide42

Deer mice continued

[The slide repeats the table of x, y, predicted y, and residual values for the deer mouse data.]

Plot the residuals against the distance from debris (x).

Slide43

Deer mice continued

Are there any isolated points?

Is there a pattern in the points?

The points in the residual plot appear scattered at random. This indicates that a line is a reasonable way to describe the relationship between the distance from debris and the distance traveled.

Slide44

Deer mice continued

Residual plots can be plotted against either the x-values or the predicted y-values.

Slide45

Residual plots continued

Let’s examine the accompanying data on x = height (in inches) and y = average weight (in pounds) for American females, ages 30-39 (from The World Almanac and Book of Facts).

x    58   59   60   61   62   63   64   65   66   67   68   69   70   71   72
y   113  115  118  121  124  128  131  134  137  141  145  150  153  159  164

The scatterplot appears rather straight.

The residual plot displays a definite curved pattern.

Even though r = 0.99, it is not accurate to say that weight increases linearly with height.
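
The curved residual pattern can be seen numerically, even without a plot. The sketch below fits the least squares line to the height and weight data and prints the sign of each residual; `fit_line` is a hypothetical helper name.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

height = list(range(58, 73))   # x: 58, 59, ..., 72 inches
weight = [113, 115, 118, 121, 124, 128, 131, 134, 137, 141,
          145, 150, 153, 159, 164]

a, b = fit_line(height, weight)
residuals = [y - (a + b * x) for x, y in zip(height, weight)]

# Correlation computed from the same sums of squares.
n = len(height)
x_bar, y_bar = sum(height) / n, sum(weight) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, weight))
sxx = sum((x - x_bar) ** 2 for x in height)
syy = sum((y - y_bar) ** 2 for y in weight)
r = sxy / sqrt(sxx * syy)

print(f"r = {r:.3f}")
print("residual signs:", "".join("+" if e > 0 else "-" for e in residuals))
```

The residuals are positive at the short and tall ends and negative in the middle, which is the systematic curve that a high value of r alone does not reveal.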

Slide46

Let’s examine the data set for 12 black bears from the Boreal Forest.

x = age (in years) and y = weight (in kg)

x   10.5   6.5   28.5   10.5   6.5   7.5   6.5   5.5   7.5   11.5   9.5   5.5
y   54     40    62     51     55    56    62    42    40    59     51    50

Sketch a scatterplot with the fitted regression line.

Do you notice anything unusual about this data set?

This observation has an x-value that differs greatly from the others in the data set.

What would happen to the regression line if this point is removed?

If the point affects the placement of the least-squares regression line, then the point is considered an influential point.
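
One way to judge influence is to fit the line with and without the unusual observation. The sketch below does this for the black bear data, treating the bear with age 28.5 years as the observation in question (an assumption based on its x-value standing far apart from the rest); `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

age = [10.5, 6.5, 28.5, 10.5, 6.5, 7.5, 6.5, 5.5, 7.5, 11.5, 9.5, 5.5]
weight = [54, 40, 62, 51, 55, 56, 62, 42, 40, 59, 51, 50]

a_all, b_all = fit_line(age, weight)

# Drop the observation whose age (28.5) is far from the others.
kept = [(x, y) for x, y in zip(age, weight) if x != 28.5]
a_wo, b_wo = fit_line([x for x, _ in kept], [y for _, y in kept])

print(f"all 12 bears:       y-hat = {a_all:.2f} + {b_all:.3f} x")
print(f"without age 28.5:   y-hat = {a_wo:.2f} + {b_wo:.3f} x")
```

The slope changes substantially when the single point is removed, which is what it means for an observation to be influential.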

Slide47

Black bears continued

[Scatterplot of the same 12 black bear observations (x = age, y = weight) with the fitted regression line]

Notice that this observation falls far away from the regression line in the y direction.

An observation is an outlier if it has a large residual.

Slide48

Coefficient of Determination

The coefficient of determination is the proportion of variation in y that can be attributed to an approximate linear relationship between x and y.

Denoted by r²
The value of r² is often converted to a percentage.

Suppose that you would like to predict the price of houses in a particular city from the size of the house (in square feet). There will be variability in house price, and it is this variability that makes accurate price prediction a challenge. If you know that differences in house size account for a large proportion of the variability in house price, then knowing the size of a house will help you predict its price.

Slide49

Let’s explore the meaning of r² by revisiting the deer mouse data set.

x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food

x   6.94   5.23   5.21    7.10    8.16    5.50    9.19    9.05    9.36
y   0      6.13   11.29   14.35   12.03   22.72   20.11   26.16   30.65

Suppose you didn’t know any x-values. What distance would you expect deer mice to travel?

To find the total amount of variation in the distance traveled (y) you need to find the sum of the squares of these deviations from the mean:

$SSTo = \sum (y - \bar{y})^2$

Total amount of variation in the distance traveled (y) is SSTo = 773.95 m²

Why do we square the deviations?

 

Slide50

Deer mice continued

x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food

x   6.94   5.23   5.21    7.10    8.16    5.50    9.19    9.05    9.36
y   0      6.13   11.29   14.35   12.03   22.72   20.11   26.16   30.65

Now let’s find how much variation there is in the distance traveled (y) from the least squares regression line.

[Scatterplot: Distance traveled vs. Distance to debris, with the fitted line]

To find the amount of variation in the distance traveled (y), find the sum of the squared residuals:

$SSResid = \sum (y - \hat{y})^2$

The amount of variation in the distance traveled (y) from the least squares regression line is SSResid = 526.27 m²

Why do we square the residuals?

 

Slide51

Deer mice continued

x = the distance from the food to the nearest pile of fine woody debris
y = distance a deer mouse will travel for food

Total amount of variation in the distance traveled (y) is SSTo = 773.95 m².

The amount of variation in y values from the regression line is SSResid = 526.27 m².

How does the variation in y change when we use the least squares regression line?

Approximately what percent of the variation in distance traveled (y) can be explained by the linear relationship?

$r^2 = 1 - \dfrac{SSResid}{SSTo} = 1 - \dfrac{526.27}{773.95} \approx 0.32$

r² = 32%

If the relationship between the two variables is negative, then you would use $r = -\sqrt{r^2}$

 

Slide52

Standard Deviation about the Least Squares Regression Line

The standard deviation about the least squares regression line is

$s_e = \sqrt{\dfrac{SSResid}{n - 2}}$

The value of s_e can be interpreted as the typical amount an observation deviates from the least squares regression line.

The coefficient of determination (r²) measures the extent of variability about the least squares regression line relative to overall variability in y. This does not necessarily imply that the deviations from the line are small in an absolute sense.
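
Both summary measures can be computed from the deer mouse data. The sketch below fits the least squares line, then forms SSTo, SSResid, r² = 1 - SSResid/SSTo, and s_e with divisor n - 2. Variable names are illustrative.

```python
from math import sqrt

x = [6.94, 5.23, 5.21, 7.10, 8.16, 5.50, 9.19, 9.05, 9.36]       # distance to debris
y = [0.00, 6.13, 11.29, 14.35, 12.03, 22.72, 20.11, 26.16, 30.65]  # distance traveled

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares slope and intercept.
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

ss_to = sum((yi - y_bar) ** 2 for yi in y)                          # total variation
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))    # variation about the line

r_squared = 1 - ss_resid / ss_to
s_e = sqrt(ss_resid / (n - 2))

print(f"y-hat = {a:.2f} + {b:.3f} x")
print(f"SSTo = {ss_to:.2f}, SSResid = {ss_resid:.2f}")
print(f"r^2 = {r_squared:.3f}, s_e = {s_e:.2f}")
```

The two measures answer different questions: r² compares the variation about the line with the total variation, while s_e reports the typical size of a residual in the units of y (meters here).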

Slide53

Partial output from the regression analysis of deer mouse data:

Predictor            Coef     SE Coef   T       P
Constant             -7.69    13.33     -0.58   0.582
Distance to debris   3.234    1.782     1.82    0.112

S = 8.67071   R-sq = 32.0%   R-sq(adj) = 22.3%

Analysis of Variance
Source        DF   SS       MS       F      P
Regression    1    247.68   247.68   3.29   0.112
Resid Error   7    526.27   75.18
Total         8    773.95

In the Analysis of Variance table, the Resid Error sum of squares is SSResid and the Total sum of squares is SSTo.

The y-intercept (a):
This value has no meaning in context since it doesn't make sense to have a negative distance.

The slope (b):
The distance traveled to food increases by approximately 3.234 meters for an increase of 1 meter to the nearest debris pile.

The standard deviation (s):
This is the typical amount by which an observation deviates from the least squares regression line.

The coefficient of determination (r²):
Only 32% of the observed variability in the distance traveled for food can be explained by the approximate linear relationship between the distance traveled for food and the distance to the nearest debris pile.

Slide54

Interpreting the Values of s_e and r²

A small value of s_e indicates that residuals tend to be small. This value tells you how much accuracy you can expect when using the least squares regression line to make predictions.

A large value of r² indicates that a large proportion of the variability in y can be explained by the approximate linear relationship between x and y. This tells you that knowing the value of x is helpful for predicting y.

A useful regression line will have a reasonably small value of s_e and a reasonably large value of r².

Slide55

A study (Archives of General Psychiatry[2010]: 570-577) looked at how working memory capacity was related to scores on a test of cognitive functioning and to scores on an IQ test. Two groups were studied – one group consisted of patients diagnosed with schizophrenia and the other group consisted of healthy control subjects.

For the patient group, the typical deviation of the observations from the regression line is about 10.7, which is somewhat large. Approximately 14% (a relatively small amount) of the variation in the cognitive functioning score is explained by the linear relationship.

For the control group, the typical deviation of the observations from the regression line is about 6.1, which is smaller. Approximately 79% (a much larger amount) of the variation in the cognitive functioning score is explained by the regression line.

Thus, the regression line for the control group would produce more accurate predictions than the regression line for the patient group.

Slide56

Putting it All Together

Describing Linear Relationships

Making Predictions

Slide57

Steps in a Linear Regression Analysis

1. Summarize the data graphically by constructing a scatterplot.
2. Based on the scatterplot, decide if it looks like the relationship between x and y is approximately linear. If so, proceed to the next step.
3. Find the equation of the least squares regression line.
4. Construct a residual plot and look for any patterns or unusual features that may indicate that a line is not the best way to summarize the relationship between x and y. If none are found, proceed to the next step.
5. Compute the values of s_e and r² and interpret them in context.
6. Based on what you have learned from the residual plot and the values of s_e and r², decide whether the least squares regression line is useful for making predictions. If so, proceed to the last step.
7. Use the least squares regression line to make predictions.

Slide58

Revisit the crime scene DNA data

Recall the scientists were interested in predicting age of a crime scene victim (y) using the blood test measure (x).

Step 1: Scientists first constructed a scatterplot of the data.

Step 2: Based on the scatterplot, it does appear that there is a reasonably strong negative linear relationship between age and the blood test measure.

 

Slide59

Step 4: A residual plot constructed from these data showed a few observations with large residuals, but these observations were not far removed from the rest of the data in the x direction. The observations were not judged to be influential. Also there were no unusual patterns in the residual plot that would suggest a nonlinear relationship between age and the blood test measure.

Step 5: s_e = 8.9 and r² = 0.835

Approximately 83.5% of the variability in age can be explained by the linear relationship. A typical difference between the predicted age and the actual age would be about 9 years.

Slide60

Step 6: Based on the residual plot, the large value of r², and the relatively small value of s_e, the scientists proposed using the blood test measure and the least squares regression line as a way to estimate ages of crime victims.

Step 7: To illustrate predicting age, suppose that a blood sample is taken from an unidentified crime victim and that the value of the blood test measure is determined to be -10. The predicted age of the victim would be

 

Slide61

Modeling Nonlinear Relationships

Slide62

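Choosing a Nonlinear Function to Describe a Relationship

Function      Equation
Quadratic     $\hat{y} = a + b_1 x + b_2 x^2$
Square root   $\hat{y} = a + b\sqrt{x}$
Reciprocal    $\hat{y} = a + b\left(\dfrac{1}{x}\right)$

[The "Looks Like" column contains a sketch of each curve.]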

 

 

 

Slide63

Choosing a Nonlinear Function to Describe a Relationship

Function      Equation
Log           $\hat{y} = a + b\ln(x)$
Exponential   $\hat{y} = e^{a + bx}$
Power         $\hat{y} = a x^{b}$

[The "Looks Like" column contains a sketch of each curve.]

The common log (base 10) may also be used.

While statisticians often use these nonlinear regressions, in AP Statistics, we will linearize our data using transformations. Then we can use what we already know about the least squares regression line.

Slide64

Models that Involve Transforming Only x

The square root, reciprocal, and log models all have the form

$\hat{y} = a + b\,(\text{some function of } x)$

where the function of x is square root, reciprocal, or log.

This suggests that if the pattern in the scatterplot of (x, y) pairs looks like one of these curves, an appropriate transformation of the x values should result in transformed data that shows a linear relationship.

Model         Transformation
Square root   $x' = \sqrt{x}$
Reciprocal    $x' = \dfrac{1}{x}$
Log           $x' = \ln(x)$

($x'$ is read “x prime”.)

Let’s look at an example.

Slide65

Is electromagnetic radiation from phone antennae associated with declining bird populations? The accompanying data are on x = electromagnetic field strength (Volts per meter) and y = sparrow density (sparrows per hectare).

Field Strength   Sparrow Density
0.11             41.71
0.20             33.60
0.29             24.74
0.40             19.50
0.50             19.42
0.61             18.74
1.01             24.23
1.10             22.04
0.70             16.29
0.80             14.69
0.90             16.29
1.20             16.97
1.30             12.83
1.41             13.17
1.50              4.64
1.80              2.11
1.90              0.00
3.01              0.00
3.10             14.69
3.41              0.00

First look at a scatterplot of the data.

The data is curved and looks similar to the graph of the log model.

Slide66

Field Strength vs. Sparrow Density Continued

Second, we will transform the data by using $x' = \ln(x)$ . . .

Ln Field Strength   Sparrow Density
-2.207              41.71
-1.609              33.60
-1.238              24.74
-0.916              19.50
-0.693              19.42
-0.494              18.74
 0.010              24.23
 0.095              22.04
-0.357              16.29
-0.223              14.69
-0.105              16.29
 0.182              16.97
 0.262              12.83
 0.344              13.17
 0.405               4.64
 0.588               2.11
 0.642               0.00
 1.102               0.00
 1.131              14.69
 1.227               0.00

. . . and graph the scatterplot of y on x'.

Notice that the transformed data is now linear. We can find the least squares regression line.

Sparrow Density = 14.8 - 10.5 ln(Field Strength)

Predictor             Coef      SE Coef   T       P
Constant              14.805    1.238     11.96   0.000
Ln (field strength)   -10.546   1.389     -7.59   0.000

S = 5.50641   R-Sq = 76.2%   R-Sq(adj) = 74.9%

Slide67

Field Strength vs. Sparrow Density Continued

Sparrow Density = 14.8 - 10.5 ln(Field Strength)

[The slide repeats the regression output: S = 5.50641, R-Sq = 76.2%, R-Sq(adj) = 74.9%]

A residual plot from the least squares regression line fit to the transformed data has no apparent patterns or unusual features. It appears that the log model is a reasonable choice for describing the relationship between sparrow density and field strength.

The value of R² for this model is 0.762 and s_e = 5.5.

Slide68

Field Strength vs. Sparrow Density Continued

Sparrow Density = 14.8 - 10.5 ln(Field Strength)

[The slide repeats the regression output: Constant = 14.805, Ln (field strength) coefficient = -10.546]

This model can now be used to predict sparrow density from field strength. For example, if the field strength is 1.6 Volts per meter, what is the prediction for the sparrow density?
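
The prediction can be worked out from the reported coefficients. The sketch below assumes the log model with intercept 14.805 and slope -10.546 on ln(field strength), as given in the regression output; the function name is illustrative.

```python
from math import log

# Coefficients from the regression output for the log model.
INTERCEPT = 14.805
SLOPE = -10.546

def predict_sparrow_density(field_strength):
    """Predicted sparrows per hectare from field strength (V/m), log model."""
    x_prime = log(field_strength)      # transformation x' = ln(x)
    return INTERCEPT + SLOPE * x_prime

prediction = predict_sparrow_density(1.6)
print(f"predicted sparrow density at 1.6 V/m: {prediction:.2f}")
```

Note that the transformation is applied to x before the linear equation is used; 1.6 V/m lies inside the range of field strengths in the data set, so this is not an extrapolation.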

 

Slide69

Models that Involve Transforming y

Let’s consider the remaining nonlinear models, the exponential model and the power model.

Exponential Model

$\hat{y} = e^{a + bx}$

Using properties of logarithms, it follows that . . .

$\ln(\hat{y}) = a + bx$

Power Model

$\hat{y} = a x^{b}$

Using properties of logarithms, it follows that . . .

$\ln(\hat{y}) = \ln(a) + b\ln(x)$

Notice that using the transformations below, the exponential and power models are linearized.

Model         Transformation
Exponential   $y' = \ln(y)$
Power         $x' = \ln(x)$ and $y' = \ln(y)$

Slide70

In a study of factors that affect the survival of loon chicks in Wisconsin, a relationship between the pH of lake water and blood mercury level in loon chicks was observed. The researchers thought that it is possible that the pH of the lake could be related to the type of fish that the loons ate. A scatterplot of the data is shown below.

The curve appears to be exponential, therefore use $y' = \ln(y)$ to transform the data.

The scatterplot of ln(blood mercury level) on lake pH appears linear.

The linear model is:

Ln(blood mercury level) = 1.06 - 0.396 Lake pH

Predictor   Coef      SE Coef   T       P
Constant    1.0550    0.5535    1.91    0.065
Lake pH     -0.3956   0.0826    -4.79   0.000

S = 0.6056   R-Sq = 39.6%   R-Sq(adj) = 37.8%
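
Because the fitted line predicts ln(blood mercury level), a prediction on the original scale requires undoing the transformation with the exponential function. The sketch below uses the reported coefficients; the pH value of 7.0 is an arbitrary illustrative input (the slides do not give the range of pH values or the units of blood mercury level), and the function names are hypothetical.

```python
from math import exp

# Coefficients from the regression output for ln(blood mercury level).
INTERCEPT = 1.0550
SLOPE = -0.3956

def predict_ln_mercury(lake_ph):
    """Predicted ln(blood mercury level) from the fitted linear model."""
    return INTERCEPT + SLOPE * lake_ph

def predict_mercury(lake_ph):
    """Back-transform: exponentiate the prediction to return to the original scale."""
    return exp(predict_ln_mercury(lake_ph))

ph = 7.0  # hypothetical input, for illustration only
print(f"ln(mercury) at pH {ph}: {predict_ln_mercury(ph):.3f}")
print(f"mercury at pH {ph}:     {predict_mercury(ph):.3f}")
```

This is the exponential model written as y-hat = e^(a + bx): the linear equation lives on the log scale, and the exponential maps it back.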

Slide71

Choosing Among Different Possible Nonlinear Models

Often there is more than one reasonable model that could be used to describe a nonlinear relationship between two variables.

How do you choose a model?

1) Consider scientific theory. Does it suggest what model the relationship is?

2) In the absence of scientific theory, choose a model that has small residuals (small s_e) and accounts for a large proportion of the variability in y (large R²).

Slide72

Common Mistakes

Slide73

Avoid these Common Mistakes

Correlation does not imply causation. A strong correlation implies only that the two variables tend to vary together in a predictable way, but there are many possible explanations for why this is occurring other than one variable causing change in the other.

Don’t fall into this trap!

The number of fire trucks at a house that is on fire and the amount of damage from the fire have a strong, positive correlation.

So, to avoid a large amount of damage if your house is on fire – don’t allow several fire trucks to come to your house?

Slide74

Avoid these Common Mistakes

A correlation coefficient near 0 does not necessarily imply that there is no relationship between two variables. Although the variables may be unrelated, it is also possible that there is a strong but nonlinear relationship.

Be sure to look at a scatterplot!

Slide75

Avoid these Common Mistakes

The least squares regression line for predicting y from x is NOT the same line as the least squares regression line for predicting x from y.

The ages (x, in months) and heights (y, in inches) of seven children are given.

x   16   24   42   60   75   102   120
y   24   30   35   40   48    56    60

To predict height from age:

$\hat{y} = 20.4 + 0.34x$

To predict age from height:

$\hat{x} = -58.2 + 2.89y$
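
The difference between the two lines can be verified from the seven children's data. The sketch below fits both regressions with the same helper (a hypothetical `fit_line`) and compares the slope for predicting age with the reciprocal of the slope for predicting height.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line predicting ys from xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

age = [16, 24, 42, 60, 75, 102, 120]      # months
height = [24, 30, 35, 40, 48, 56, 60]     # inches

a_h, b_h = fit_line(age, height)   # height predicted from age
a_a, b_a = fit_line(height, age)   # age predicted from height

print(f"height-hat = {a_h:.1f} + {b_h:.3f} * age")
print(f"age-hat    = {a_a:.1f} + {b_a:.3f} * height")
print(f"reciprocal of the first slope: {1 / b_h:.3f}")
```

If the two lines were the same, the second slope would equal the reciprocal of the first. It does not, because each line minimizes squared deviations in a different direction.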

 

Slide76

Avoid these Common Mistakes

Beware of extrapolation. Using the least squares regression line to make predictions outside the range of x values in the data set often leads to poor predictions.

Predict the height of a child that is 15 years (180 months) old.

$\hat{y} = 20.4 + 0.34(180) = 81.6$

It is unreasonable that a 15-year-old would be 81.6 inches, or 6.8 feet, tall.

Slide77

Avoid these Common Mistakes

Be careful in interpreting the value of the intercept of the least squares regression line. In many instances interpreting the intercept as the value of y that would be predicted when x = 0 is equivalent to extrapolating way beyond the range of x values in the data set.

The ages (x, in months) and heights (y, in inches) of seven children are given.

x   16   24   42   60   75   102   120
y   24   30   35   40   48    56    60

Slide78

Avoid these Common Mistakes

Remember that the least squares regression line may be the “best” line, but that doesn’t necessarily mean that the line will produce good predictions.

This has a relatively large s_e, so we can’t accurately predict IQ from working memory capacity.

Slide79

Avoid these Common Mistakes

It is not enough to look at just r² or just s_e when evaluating the regression line. Remember to consider both values. In general, you would like to have both a small value for s_e and a large value for r².

A small value for s_e indicates that deviations from the line tend to be small.

A large value for r² indicates that the linear relationship explains a large proportion of the variability in the y values.

Slide80

Avoid these Common Mistakes

The value of the correlation coefficient, as well as the values for the intercept and slope of the least squares regression line, can be sensitive to influential observations in the data set, particularly if the sample size is small.

Be sure to always start with a plot to check for potential influential observations.