
1 - Introduction 2 - Exploratory Data Analysis - PowerPoint Presentation

giovanna-bartolotta
Uploaded On 2019-06-19



Presentation Transcript

Slide1

1 - Introduction
2 - Exploratory Data Analysis
3 - Probability Theory
4 - Classical Probability Distributions
5 - Sampling Distributions / Central Limit Theorem
6 - Statistical Inference
7 - Correlation and Regression
(8 - Survival Analysis)

Supplemental Lecture Notes

Slide2

POPULATION: composed of “units” (people, rocks, toasters, ...)

To make certain calculations simpler, we assume that populations are “arbitrarily large” (or indeed, infinite).

What do we want to know about this population? How is a “random variable” X (age, income level, …) distributed?

“Random Variable” X = any numerical value that can be assigned to each unit of a population.

“Random” refers to the notion that this value is unknown until actually observed (usually as part of an outcome of an experiment to test a specific hypothesis). Contrast this with the idea of a “nonrandom” variable with no empirical error, e.g., X = # cards in a deck = 52.

There are two general types: Quantitative and Qualitative.

Quantitative [measurement]: length, mass, temperature, pulse rate, # puppies, shoe size (e.g., 10, 10½, 11)

Slide3

Quantitative [measurement] length mass temperature pulse rate # puppies shoe size


CONTINUOUS (can take their values at any point in a continuous interval)

DISCRETE (only take their values in disconnected jumps)

Slide4

Qualitative [categorical]

ORDINAL, RANKED (ordered labels): video game levels (1, 2, 3, ...), income level (low, mid, high)

NOMINAL (unordered labels): zip code, PIN #, color (Red, Green, Blue)

IMPORTANT SPECIAL CASE: Binary (or Dichotomous), e.g., “Pregnant?” (Yes / No), coin toss (Heads / Tails), treatment (Drug / Placebo)

Another way… define X using “indicator variables.” Note that $I_1 + I_2 + I_3 = 1$.

Slide5

Example: Excel file of patient blood types

Note that each patient row sums to 1, i.e.,

O + A + B + AB = 1.

Therefore, even though there are 4 indicator variables, knowing the values of any 3 of them determines the value of the 4th. That is, we only have 3 “degrees of freedom.”
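The row-sum constraint can be checked directly in R. The sketch below uses six made-up blood types (the Excel file itself is not reproduced in this transcript), and the names `blood`, `indicators`, and `ab_recovered` are illustrative only.

```r
# Hypothetical blood types for six patients (illustrative data, not from the slides)
blood <- factor(c("O", "A", "B", "AB", "O", "A"),
                levels = c("O", "A", "B", "AB"))

# Build one 0/1 indicator column per blood type
indicators <- sapply(levels(blood), function(lv) as.integer(blood == lv))

# Each patient row sums to 1, i.e., O + A + B + AB = 1
row_sums <- rowSums(indicators)

# Knowing any three indicators determines the fourth:
# here AB is recovered from O, A, and B alone
ab_recovered <- 1 - rowSums(indicators[, c("O", "A", "B")])

print(indicators)
print(row_sums)
```

Because the fourth column is fully determined by the other three, only 3 of the 4 indicators are free to vary.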


Slide7

POPULATION: Random Variable, here a Discrete Random Variable (binary: Success / Failure)

Define a new parameter $\pi = P(\text{Success})$.

Suppose we intend to select a random sample of size n from this population of Successes and Failures, in such a way that the “Success or Failure” outcome of any selected individual conveys no information about the “Success or Failure” outcome of any other selected individual.

That is, the “Success or Failure” outcomes between any two individuals are independent. (Think of tossing a coin n times.)

Random sample of size n: Let X = “Number of Successes in the sample” (0, 1, 2, …, n).

Point estimator: Then a natural estimator for $\pi$ could be $\hat{\pi} = X / n$, the sample proportion of Success.

Ex: n = 500 tosses, X = 285 Heads, so $\hat{\pi} = 285/500 = 0.57$.
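As a small sketch of the estimator, the sample proportion of Success is the number of Successes divided by the sample size. The first part uses the coin-toss numbers from the slide; the simulation that follows assumes a fair coin (true probability 0.5), which the slide does not state. The variable names are chosen for illustration.

```r
# Numbers from the slide: n = 500 tosses, X = 285 Heads
n <- 500
x <- 285
pi_hat <- x / n          # sample proportion of Success
print(pi_hat)

# Illustrative simulation, ASSUMING a fair coin (true P(Success) = 0.5);
# each toss is independent of the others
set.seed(1)
tosses <- rbinom(n, size = 1, prob = 0.5)   # 1 = Success, 0 = Failure
pi_hat_sim <- mean(tosses)                  # proportion of Successes in this sample
print(pi_hat_sim)
```

The simulated proportion varies from sample to sample, which is why the estimator is itself a random variable.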

Slide8

“Classical Scientific Method”

Hypothesis: Define the study population... What’s the question?
Experiment: Designed to test hypothesis
Observations: Collect sample measurements
Analysis: Do the data formally tend to support or refute the hypothesis, and with what strength? (Lots of juicy formulas...)
Conclusion: Reject or retain hypothesis; is the result statistically significant?
Interpretation: Translate findings in context!

Statistics is implemented in each step of the classical scientific method!

Slide9

Analysis: Do the data formally tend to support or refute the hypothesis, and with what strength? (Lots of juicy formulas...)

To help answer this question, we should first try to obtain an informal “feel” for the sample data we have collected, and see if it suggests anything about the population distribution.

~ Exploratory Data Analysis ~

Visual Displays (charts, tables, graphs, etc.): “What do the data look like?”

“Descriptive Statistics” (measures of center, measures of spread, proportions, etc.): “How can the data be summarized?”

Slide10

Example: Sample exam scores, n = 20 (“sample size”)

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Because there are many duplicate values, we may construct a table of (absolute) frequencies and a corresponding dotplot.

Data values x_i    Frequencies f_i
60                 2
70                 8
80                 4
90                 6
Total              n = 20

R code:
x = c(60, 70, 80, 90)
freq = c(2, 8, 4, 6)
sample = rep(x, freq)
stripchart(sample, method = "stack", pch = 19, offset = 1, ylim = range(1, 8))

Slide11

Often, though, it is preferable to work with proportions, i.e., relative frequencies. Divide frequencies by n = 20:

Data values x_i    Frequencies f_i    Relative Frequencies
60                 2                  2/20 = 0.10
70                 8                  8/20 = 0.40
80                 4                  4/20 = 0.20
90                 6                  6/20 = 0.30
Total              n = 20             20/20 = 1.00

All relative frequencies are positive, and their sum = 1.

Slide12

In general…

Data x_i    Frequency f_i    Relative Frequency p(x_i) = f_i / n
Total       n                1

“Density” = Rel freq / width


Slide14

Example: Sample exam scores, n = 20 (“sample size”)

{60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values x_i    Frequency f_i    Relative Frequency
60                 2                2/20 = 0.10
70                 8                8/20 = 0.40
80                 4                4/20 = 0.20
90                 6                6/20 = 0.30
Total              n = 20           20/20 = 1.00

R code:
x = c(60, 70, 80, 90)
f = c(2, 8, 4, 6)
sample = rep(x, f)
hist(sample, freq = F, breaks = c(50, 55, 65, 75, 85, 95, 100), labels = T, col = "lightblue")

[Histogram: rectangle areas 0.10, 0.40, 0.20, 0.30]

Total Area = 1!
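The “Total Area = 1” claim can be checked numerically without drawing the plot. This is a minimal sketch using `hist(..., plot = FALSE)`, which returns the break points and densities; the name `scores` is used here instead of `sample` only to avoid shadowing R's built-in `sample()` function.

```r
x <- c(60, 70, 80, 90)
f <- c(2, 8, 4, 6)
scores <- rep(x, f)

# Compute the histogram object without plotting it
h <- hist(scores, breaks = c(50, 55, 65, 75, 85, 95, 100), plot = FALSE)

# Area of each rectangle = density (height) * class width
areas <- h$density * diff(h$breaks)
total_area <- sum(areas)

print(areas)
print(total_area)
```

The four non-empty rectangles carry the relative frequencies as their areas; the two narrow end classes are empty and contribute nothing.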

Slide15

Example: Suppose the random variable is X = Age (years) in a certain population of individuals, and we select the following random sample of n = 20 ages.

{10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

From these values, we can construct a table which consists of the frequencies of each age-interval in the dataset, i.e., a frequency table.

Class Interval    Frequency
[10, 20)          4
[20, 30)          8
[30, 40)          5
[40, 50)          2
[50, 60)          1
Total             n = 20

“Endpoint convention”: Here, the left endpoint is included, but not the right. Note! Stay away from “10-20,” “20-30,” “30-40,” etc.

In published journal articles, the original data are almost never shown, but displayed in tabular form as above. This summary is called “grouped data.”

Frequency Histogram (bar heights: 4, 8, 5, 2, 1 values): suggests the population may be skewed to the right (i.e., positively skewed).
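The frequency table can be rebuilt from the raw ages with `cut()`. A short sketch: `right = FALSE` implements the endpoint convention above (left endpoint included, right endpoint excluded). The variable names are illustrative.

```r
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)

# Assign each age to a class interval [10,20), [20,30), ..., [50,60)
classes <- cut(ages, breaks = seq(10, 60, by = 10), right = FALSE)

# Count how many ages fall in each class
freq_table <- table(classes)
print(freq_table)
```

With `right = TRUE` (the default) an age of exactly 20 would be counted in the first class instead of the second, which is why the convention has to be stated.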

Slide16

As before, it is often preferable to work with proportions, i.e., relative frequencies. Divide frequencies by n = 20.

Class Interval    Frequency    Relative Frequency
[10, 20)          4            4/20 = 0.20
[20, 30)          8            8/20 = 0.40
[30, 40)          5            5/20 = 0.25
[40, 50)          2            2/20 = 0.10
[50, 60)          1            1/20 = 0.05
Total             n = 20       20/20 = 1.00

Relative frequencies are always between 0 and 1, and sum to 1.

Relative Frequency Histogram (vertical axis from 0.0 to 0.4): bar heights .20, .40, .25, .10, .05

Slide17

Class Interval    Frequency    Relative Frequency    Cumulative
                                                     (0.00)
[10, 20)          4            4/20 = 0.20           0.20
[20, 30)          8            8/20 = 0.40           0.60
[30, 40)          5            5/20 = 0.25           0.85
[40, 50)          2            2/20 = 0.10           0.95
[50, 60)          1            1/20 = 0.05           1.00
Total             n = 20       20/20 = 1.00

“0.00 of the sample is under 10 yrs old”

“0.20 of the sample is under 20 yrs old”

“0.60 of the sample is under 30 yrs old”

“0.85 of the sample is under 40 yrs old”

“0.95 of the sample is under 50 yrs old”

“1.00 of the sample is under 60 yrs old”
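The cumulative column is a running total of the relative frequencies. A minimal sketch in R, using the class frequencies from the age example (variable names are illustrative):

```r
freq <- c(4, 8, 5, 2, 1)          # class frequencies for [10,20), ..., [50,60)
rel_freq <- freq / sum(freq)      # relative frequencies
cum_rel_freq <- cumsum(rel_freq)  # proportion of the sample below each right endpoint

print(rel_freq)
print(cum_rel_freq)
```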

Slide18

Cumulative relative frequencies always increase from 0 to 1.

Plotted against age, they form a “staircase graph” from 0 to 1. (Not a histogram!)

Slide19

But alas, there is a major problem….

Slide20

{10, 15, 15, 18, 20, 21, 21, 23, 24, 26, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}

Suppose that, for the purpose of the study, we are not primarily concerned with those 30 or older, and wish to “lump” them into a single class interval. What effect will this have on the histogram?

Original classes:

Class Interval    Frequency    Relative Frequency
[10, 20)          4            4/20 = 0.20
[20, 30)          8            8/20 = 0.40
[30, 40)          5            5/20 = 0.25
[40, 50)          2            2/20 = 0.10
[50, 60)          1            1/20 = 0.05
Total             n = 20       20/20 = 1.00

Lumped classes:

Class Interval    Relative Frequency
[10, 20)          4/20 = 0.20
[20, 30)          8/20 = 0.40
[30, 60)          8/20 = 0.40
Total             20/20 = 1.00

Relative Frequency Histogram: the bar over [30, 60) now has height .40, the same as the bar over [20, 30).

The skew no longer appears.

The histogram is distorted because of the presence of an outlier (59) in the data, creating the need for unequal class widths.

Slide21

Outliers (A Pain in the Tuches)

What are they? Informally, an outlier is a sample data value that is either “much” smaller or larger than the other values.

How do they arise?
experimental error
measurement error
recording error
not an error; genuine

What can we do about them?
double-check them if possible
delete them?
include them… somehow
perform analysis both ways

Slide22

IDEA: Instead of having height of each class rectangle = relative frequency, make area of each class rectangle = relative frequency.

Since area = height × width, this requires height = relative frequency / width, which is called the “Density.”

Class Interval    Relative Frequency    Width    Density (= height)
[10, 20)          0.20                  10       0.20/10 = 0.020
[20, 30)          0.40                  10       0.40/10 = 0.040
[30, 60)          0.40                  30       0.40/30 = 0.0133…
Total             1.00

Density Histogram: rectangle heights 0.02, 0.04, 0.0133…; rectangle areas 0.20, 0.40, 0.40.

Total Area = 1!

The outlier is included, and the overall skewed appearance is restored.

Exercise: What if the outlier were 99 instead of 59?
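The density heights for unequal class widths can be computed in a few lines of R. This is a sketch built from the age data and the lumped classes; the names `breaks`, `widths`, and `dens` are illustrative.

```r
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)

breaks <- c(10, 20, 30, 60)                    # unequal class widths
freq <- as.vector(table(cut(ages, breaks = breaks, right = FALSE)))
rel_freq <- freq / length(ages)                # 0.20, 0.40, 0.40
widths <- diff(breaks)                         # 10, 10, 30

dens <- rel_freq / widths                      # height of each rectangle
total_area <- sum(dens * widths)               # areas add back up to 1

print(dens)
print(total_area)
```

For the exercise, changing 59 to 99 and the last break to 100 would widen the final class to 70 years and lower its height accordingly.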

Slide23

Question: Approximately what proportion of the sample is between 18 and 24 yrs old (inclusive)?

Class Interval    Absolute Frequency    Relative Frequency    Density
[10, 20)          4                     0.20                  0.020
[20, 30)          8                     0.40                  0.040
[30, 60)          8                     0.40                  0.01333

Density Histogram: rectangle heights 0.02, 0.04, 0.0133…; rectangle areas 0.20, 0.40, 0.40.

Step 1. Identify the intervals & rectangles.

Step 2. Split the FIRST rectangle at 18 as shown.

Step 3. Observe that the interval [18, 20) has width = 2 years and the interval [10, 20) has width = 10 years. The ratio = 2/10 = 1/5.

Step 4. Therefore, the red area = 1/5 of .20 = .04.

Step 5. Repeat Steps 2-4 for the SECOND rectangle at 24. The red area = 2/5 of .40 = .16.

Step 6. ADD: .04 + .16 = .20, i.e., 20%.

Slide24

Question: Approximately what proportion of the sample is between 18 and 24 yrs old (inclusive)?

- OR -

Step 1. Identify the intervals & rectangles.

Step 2. Use “Density = Area / Width” (see page 2.3-5 of the posted Lecture Notes):

FIRST area = Width × Density = (20 - 18)(.02) = .04

SECOND area = Width × Density = (24 - 20)(.04) = .16

Step 3. ADD: .04 + .16 = .20, i.e., 20%.

Exercise: Confirm that the actual proportion = 30%.

Exercise: What if ages 23, 24 were both changed to 25?
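Both the histogram-based estimate and the actual proportion can be computed in R. In this sketch, `approx_prop` is a hypothetical helper (not from the lecture notes) that adds up width × density over the part of each class that overlaps the requested age range.

```r
breaks <- c(10, 20, 30, 60)
dens <- c(0.20 / 10, 0.40 / 10, 0.40 / 30)   # density of each class

# Estimated proportion between ages a and b, from the density histogram:
# for each class, overlap width times density, then sum
approx_prop <- function(a, b) {
  lower <- breaks[-length(breaks)]
  upper <- breaks[-1]
  overlap <- pmax(0, pmin(b, upper) - pmax(a, lower))
  sum(overlap * dens)
}

estimate <- approx_prop(18, 24)   # (20 - 18)(.02) + (24 - 20)(.04)

# Actual proportion from the raw data
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
actual <- mean(ages >= 18 & ages <= 24)

print(estimate)
print(actual)
```

The estimate differs from the actual value because the histogram treats ages as spread evenly within each class, while the raw ages cluster near 18 to 24.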


Slide26

“Measures of Center”

Example: Sample exam scores {60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values x_i    Frequencies f_i
60                 2
70                 8
80                 4
90                 6
Total              n = 20

sample mode: most frequent value = 70

sample median: “middle” value = (70 + 80) / 2 = 75

sample mean: average value = $\bar{x} = \frac{1}{n} \sum x_i f_i = \frac{1}{20}\,[(60)(2) + (70)(8) + (80)(4) + (90)(6)] = 77$

Quartiles are found similarly: Q1 = 70, Q2 = 75, Q3 = 90. Quintiles, deciles, and other percentiles (= quantiles) are similar. Useful when outliers are present, e.g., employee salaries + CEO.


Slide28

“Measures of Center”

Example: Sample exam scores {60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values x_i    Frequencies f_i    Relative Frequencies p(x_i) = f_i / n
60                 2                  2/20 = 0.1
70                 8                  8/20 = 0.4
80                 4                  4/20 = 0.2
90                 6                  6/20 = 0.3
Total              n = 20             20/20 = 1.0

sample mean: $\bar{x} = \frac{1}{n} \sum x_i f_i = \frac{1}{20}\,[(60)(2) + (70)(8) + (80)(4) + (90)(6)] = 77$

“Notation, notation, notation.” Distributing the factor 1/20:

$\bar{x} = (60)\left(\frac{2}{20}\right) + (70)\left(\frac{8}{20}\right) + (80)\left(\frac{4}{20}\right) + (90)\left(\frac{6}{20}\right) = \sum x_i \, p(x_i) = 77$

This is a “weighted” sample mean (with weights = rel freqs).
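The measures of center can be reproduced in R from the frequency table. This is a sketch; note that `quantile()` supports several definitions of a quantile, and its default happens to agree with the quartiles quoted for this data set, which need not hold for other data.

```r
x <- c(60, 70, 80, 90)
f <- c(2, 8, 4, 6)
n <- sum(f)
scores <- rep(x, f)

sample_mode <- x[which.max(f)]          # most frequent value
sample_median <- median(scores)         # middle value
mean_from_freq <- sum(x * f) / n        # (1/n) * sum of x_i * f_i
mean_weighted <- sum(x * (f / n))       # sum of x_i * p(x_i), weights = rel freqs

# Quartiles using R's default quantile definition
quartiles <- quantile(scores, c(0.25, 0.50, 0.75))

print(c(mode = sample_mode, median = sample_median, mean = mean_from_freq))
print(quartiles)
```

Both mean formulas are the same calculation; only the placement of the factor 1/n changes.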

Slide29

“Measures of Spread”

Example: Sample exam scores {60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values x_i    Frequencies f_i
60                 2
70                 8
80                 4
90                 6
Total              n = 20

… but how do we measure the “spread” of a set of values?

First attempt: sample range = x_n - x_1 = 90 - 60 = 30. Simple, but it ignores all of the data except the extreme points, and is thus far too sensitive to outliers to be of any practical value. Example: Company employee salaries, including CEO.

Can modify with… sample interquartile range (IQR) = Q3 - Q1 = 90 - 70 = 20.

We would still prefer a measure that uses all of the data.

Slide30

“Measures of Spread”

Example: Sample exam scores {60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

… but how do we measure the “spread” of a set of values?

Better attempt: Calculate the average of the “deviations from the mean.”

Data values x_i    Frequencies f_i    Deviations from mean x_i - x̄
60                 2                  60 - 77 = -17
70                 8                  70 - 77 = -7
80                 4                  80 - 77 = +3
90                 6                  90 - 77 = +13
Total              n = 20

$\frac{1}{n} \sum (x_i - \bar{x}) f_i = \frac{1}{20}\,[(-17)(2) + (-7)(8) + (3)(4) + (13)(6)] = 0$. ????????

This is not a coincidence: the deviations always sum to 0*, so their average is not a good measure of variability.

* The sample mean is a “balance point” for the data.

Question: Why wouldn’t the median 75 be the balance point? See Prob 2.5 / 11 in Lec Notes for a more obvious example.

Slide31

“Measures of Spread”

Example: Sample exam scores {60, 60, 70, 70, 70, 70, 70, 70, 70, 70, 80, 80, 80, 80, 90, 90, 90, 90, 90, 90}

Data values x_i    Frequencies f_i    Deviations from mean x_i - x̄
60                 2                  60 - 77 = -17
70                 8                  70 - 77 = -7
80                 4                  80 - 77 = +3
90                 6                  90 - 77 = +13
Total              n = 20

sample mean $\bar{x} = 77$: a “typical” sample value

Calculate the sample variance, a modified average of the “squared deviations from the mean”:

$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 f_i = \frac{1}{19}\,[(-17)^2 (2) + (-7)^2 (8) + (3)^2 (4) + (13)^2 (6)] = 106.316$

sample standard deviation: $s = \sqrt{s^2} = 10.311$, a “typical” distance from the mean
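A short R sketch of these calculations from the frequency table, checked against the built-in `var()` and `sd()` applied to the expanded data (variable names are illustrative):

```r
x <- c(60, 70, 80, 90)
f <- c(2, 8, 4, 6)
n <- sum(f)

xbar <- sum(x * f) / n             # sample mean
dev <- x - xbar                    # deviations from the mean

mean_dev <- sum(dev * f) / n       # average deviation: always 0
s2 <- sum(dev^2 * f) / (n - 1)     # sample variance, divisor n - 1
s <- sqrt(s2)                      # sample standard deviation

# Built-in functions on the full list of 20 scores give the same values
scores <- rep(x, f)
print(c(s2 = s2, var = var(scores)))
print(c(s = s, sd = sd(scores)))
```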


Slide33

Grouped Data - revisited

Class Interval    Absolute Frequency    Midpoint
[10, 20)          4                     15
[20, 30)          8                     25
[30, 60)          8                     45

Use the interval midpoints in place of the data values x_i.

Compare this “grouped mean” with the actual sample mean.
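The comparison can be carried out in R. This sketch assumes the grouped mean is the frequency-weighted average of the class midpoints, which is the usual reading of “use the interval midpoints”; the formula itself did not survive in the transcript.

```r
breaks <- c(10, 20, 30, 60)
freq <- c(4, 8, 8)

# Midpoint of each class interval
midpoints <- (breaks[-1] + breaks[-length(breaks)]) / 2

# Grouped mean: treat every value in a class as if it sat at the midpoint
grouped_mean <- sum(midpoints * freq) / sum(freq)

# Actual sample mean from the raw ages
ages <- c(10, 15, 15, 18, 20, 21, 21, 23, 24, 26,
          26, 27, 31, 35, 35, 37, 38, 42, 46, 59)
actual_mean <- mean(ages)

print(c(grouped = grouped_mean, actual = actual_mean))
```

The grouped mean comes out larger here because the wide [30, 60) class has midpoint 45, while most of its ages lie in the 30s.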

Slide34

Grouped Data - revisited

median Q2 = ?

Class Interval    Absolute Frequency    Relative Frequency    Density
[10, 20)          4                     0.20                  0.020
[20, 30)          8                     0.40                  0.040
[30, 60)          8                     0.40                  0.01333

Density Histogram: rectangle heights 0.02, 0.04, 0.0133…; rectangle areas 0.20, 0.40, 0.40.

Step 1. Identify the interval & rectangle.

Step 2. Split the rectangle so that 0.5 area lies above and below. Since 0.20 lies below age 20, the [20, 30) rectangle (area 0.40) splits into 0.3 below the median and 0.1 above.

Slide35

Grouped Data - revisited

median Q2 = ?

Step 1. Identify the interval & rectangle.

Step 2. Split the rectangle so that 0.5 area lies above and below.

Step 3. Observe that this rectangle can be split into 4 strips of 0.1 each.

Step 4. Thus, split the interval into 4 equal parts, each of width (30 - 20)/4 = 2.5 years, with cut points at 22.5, 25, and 27.5. Three strips (area 0.3) lie below the median, so Q2 = 27.5.

…OR…

Slide36

Grouped Data - revisited

median Q2 = ?

Step 1. Identify the interval & rectangle.

Step 2. Split the rectangle so that 0.5 area lies above and below (0.3 below and 0.1 above within the rectangle).

Step 3. Set up a proportion and solve for Q.

…OR…

Label as shown, and use the formula.

…OR…

Solve using the cumulative distribution, without the histogram… see posted Lecture Notes!

Other percentiles are done similarly.
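One way to sketch the cumulative-distribution approach in R is linear interpolation within the class that contains the desired percentile. The function `grouped_quantile` below is a hypothetical helper written for this illustration; it is not the formula from the posted Lecture Notes, which is not reproduced in this transcript.

```r
breaks <- c(10, 20, 30, 60)
rel_freq <- c(0.20, 0.40, 0.40)

# Quantile of grouped data by linear interpolation of the
# cumulative relative frequencies at the class boundaries
grouped_quantile <- function(p) {
  cum <- c(0, cumsum(rel_freq))                      # cumulative proportion at each break
  k <- findInterval(p, cum, rightmost.closed = TRUE) # class containing the quantile
  fraction <- (p - cum[k]) / rel_freq[k]             # share of that class lying below
  breaks[k] + fraction * (breaks[k + 1] - breaks[k])
}

q2 <- grouped_quantile(0.50)   # 20 + (0.3 / 0.4) * 10
print(q2)
```

For the median, 0.20 of the sample lies below age 20, so the remaining 0.30 must come from the [20, 30) class, i.e., three quarters of the way through it.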

Slide37

Comments

$\bar{x}$ is an unbiased estimator of the population mean $\mu$, and $s^2$ is an unbiased estimator of the population variance $\sigma^2$. (Their “expected values” are $\mu$ and $\sigma^2$, respectively.)

Beware of roundoff error!!! There is an alternate, more computationally stable formula for sample variance $s^2$.

The numerator of $s^2$ is called a sum of squares (SS); the denominator “n - 1” is the number of degrees of freedom (df) of the n deviations $x_i - \bar{x}$, because they must satisfy a constraint (sum = 0), hence 1 degree of freedom is “lost.”

A natural setting for these formulas and concepts is geometric, specifically, the Pythagorean Theorem: $a^2 + b^2 = c^2$ (a right triangle with legs a, b and hypotenuse c). See Lecture Notes Appendix…