
Presentation Transcript

Slide 1

Basic statistic inference in R

Shai Meiri

Slide 2

Everything differs!!!

"We expect to find differences between x and y" is a trivial statement.

The statistician within you asks: "Are the differences we found larger than expected by chance?"

The biologist within you asks: "Why are the differences I found in the direction, and of the magnitude, that they are?"

Slide 3

Moments of central tendency

A. Mean

1. Arithmetic mean: Σxi/n
2. Geometric mean: (x1*x2*…*xn)^(1/n)
3. Harmonic mean: n/Σ(1/xi)

Slide 4

Moments of central tendency in R

1. Arithmetic mean: Σxi/n – use the function "mean".

Example:
data<-c(2,3,4,5,6,7,8)
mean(data)
[1] 5

2. Geometric mean: (x1*x2*…*xn)^(1/n) – take the mean on a log scale and back-transform.

Example:
data<-c(2,3,4,5,6,7,8)
exp(mean(log(data)))
[1] 4.549163

You can also use the .csv file:
dat<-read.csv("island_type_final2.csv")
attach(dat)
mean(lat)
[1] 17.40439
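3. There is no built-in harmonic-mean function either; a one-line sketch of my own, using the reciprocal identity:

data<-c(2,3,4,5,6,7,8)
1/mean(1/data)  # = n/sum(1/x); about 4.0748 here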

Slide 5

Moments of central tendency

A. Mean
B. Median
C. Mode

General example:
data<-c(2,3,4,5,6,7,8)
median(data)
[1] 5

Example from the .csv:
median(mass)
[1] 0.69
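The deck lists the mode but gives no code; note that base R's mode() returns an object's storage type, not the statistical mode. A minimal helper of my own:

stat_mode<-function(x){
  ux<-unique(x)
  ux[which.max(tabulate(match(x,ux)))]  # most frequent value (first one on ties)
}
stat_mode(c(2,3,3,5,6,7,8))
[1] 3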

Slide 6

Moments of central tendency

A. Mean
B. Variance = Σ(xi-μ)^2 / n

(Note that R's var() computes the sample variance, dividing by n-1 rather than n.)

Example (see http://www.statmethods.net/management/functions.html):
data<-c(2,3,4,5,6,7,8)
var(data)
[1] 4.666667

var(lat)
[1] 89.20388

Is the mean a good measure of what is happening in the population when the variance is low?

Slide 7

Moments of central tendency

A. Mean
B. Variance

The second moment of central tendency measures how much the data are scattered around the first moment (the mean).

Examples of second-moment statistics are the variance, the standard deviation, the standard error, the coefficient of variation, and the confidence interval (90%, 95% or 99%) around something.

Slide 8

Moments of central tendency

# for:
data<-c(2,3,4,5,6,7,8)

Variance:
var(data)

Standard deviation:
sd(data)

Sample size:
length(data)

Standard error:
se<-(sd(data)/length(data)^0.5)
se
[1] 0.8164966

Coefficient of variation:
CV<-sd(data)/mean(data)
CV
[1] 0.4320494

Slide 9

Moments of central tendency

A. Mean
B. Variance
C. Skew

A skewed frequency distribution is not symmetric.

Do you think the arithmetic mean is a good measure of central tendency for a skewed frequency distribution?

What is the mean salary of the students here plus Bill Gates?

Slide 10

Moments of central tendency: skew

skew<-function(data){
  m3<-sum((data-mean(data))^3)/length(data)
  s3<-sqrt(var(data))^3
  m3/s3
}
skew(data)

The SE of skewness:
sdskew<-function(x) sqrt(6/length(x))

Slide 11

Moments of central tendency

A. Mean
B. Variance
C. Skew
D. Kurtosis

Slide 12

Moments of central tendency: kurtosis

kurtosis<-function(x){
  m4<-sum((x-mean(x))^4)/length(x)
  s4<-var(x)^2
  m4/s4-3
}
kurtosis(x)

SE of kurtosis:
sdkurtosis<-function(x) sqrt(24/length(x))

Slide 13

A normal distribution can take any value of mean and variance, but its skewness and kurtosis should equal zero.

Values of skew and kurtosis have their own variance – zero should fall outside their confidence interval for them to be significantly different from zero.
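A minimal sketch of that check for skewness, reusing skew() and sdskew() from slide 10; treating ±2 SE as a rough 95% confidence interval is my shortcut, not the deck's:

data<-c(2,3,4,5,6,7,8)
s<-skew(data)      # sample skewness (slide 10)
se<-sdskew(data)   # its standard error, sqrt(6/n)
c(s-2*se, s+2*se)  # skew differs significantly from zero only if 0 falls outside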

Slide 14

Residuals

When doing statistics we're creating models of reality. One of the simplest models is the mean:

The mean height of Israeli citizens is 173 cm.
The mean salary is 9271 ₪ (correct for April 2014).
The mean service in the IDF is 24 months (I guess).

[Figure: three Israelis these models miss – a man 2.06 m tall; a salary of 46,699 ₪ for a month (excluding the bottles; http://www.haaretz.co.il/1.2057452); Rab. Dov Lior, who served in the IDF for 1 month]

Slide 15

Residuals

When doing statistics we're creating models of reality. We can see here that our models – 24 months, 9271 ₪, and 173 cm – are not very successful:

Residual = 33 cm
Residual = 37,428 ₪
Residual = -23 months of IDF service

The residual is how far a given value is from the prediction of the model. Omri Caspi is 33 cm away from the model "Israeli = 173" and 29 cm from the more complicated model "Israeli man = 177, Israeli woman = 168".

Slide 16

Residuals

When doing statistics we're creating models of reality. Extracting the residuals of a model in R:

dat<-read.csv("island_type_final2.csv")
model<-lm(mass~iso+area+age+lat, data=dat)
out<-model$residuals
out
write.table(out, file="residuals.txt", sep="\t", col.names=F, row.names=F)
#note that residual values are in the order entered (i.e., not alphabetic, not by residual size – first in, first out)

Slide 17

Theoretical statistics and statistical inference

In statistical inference we test the behavior of our data against a certain hypothesis. We can present our hypothesis as a statistical model. For example:

The distribution of heights is normal.
The number of species increases with area.
The number of species increases with area as a power function with exponent 0.25.

When we have data it is best to describe them first: plot graphs, calculate the mean, and so on.

Slide 18

Frequency distribution* (*graphic form = "histogram")

Describes the distribution of all observations: how many observations are in each bin?

dat<-read.csv("island_type_final2.csv")
attach(dat)
names(dat)
hist(mass)

Slide 19

Frequency distribution

dat<-read.csv("island_type_final2.csv")
attach(dat)
hist(mass)

What did we learn?
There are no masses smaller than a tenth of a gram or larger than 100 kg.
Lizards with a mass between 1 and 10 g are very common – larger or smaller lizards are rare.
The distribution is unimodal and skewed to the right.

Slide 20

Frequency distribution

Histograms don't have to be so ugly:

dat<-read.csv("island_type_final2.csv")
attach(dat)
hist(mass, col="purple", breaks=25, xlab="log mass (g)",
     main="masses of island lizards - great data by Maria",
     cex.axis=1.2, cex.lab=1.5)

Slide 21

Presenting a categorical predictor with a continuous response variable

dat<-read.csv("island_type_final2.csv")
attach(dat)
plot(type,brood)

Always prefer a boxplot to a barplot. (When the x variable is a factor, plot() draws a boxplot.)
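Equivalently, and more explicitly, the same picture via boxplot() – a small sketch of my own:

boxplot(brood~type, data=dat)  # identical to plot(type,brood) when type is a factor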

Slide 22

Presenting a continuous variable against another continuous variable

dat<-read.csv("island_type_final2.csv")
attach(dat)
plot(mass,clutch)
plot(mass, clutch, pch=16, col="blue")

Slide 23

Which test should we choose?

It depends on the nature of our response variable (= the y variable), and mostly on the nature of our predictor variables.

If the response variable is "success or failure" and the null hypothesis is equality of the two, we'll use a binomial test.
If the response variable is counts, we'll usually use chi-square or G.
In many cases our response variable will be continuous (14 species, 78 individuals, 54 heartbeats per second, 7.3 eggs, 23 degrees).

Slide 24

Which test should we choose? What is your response variable?

Success or failure (found the cheese/idiot) → Binomial
Counts (frequency: 6 females, 4 males) → Chi-square or G (= log-likelihood)
Continuous (14 species, 78 individuals, 23 degrees, 7.3 eggs) → Soon…

Slide 25

Binomial test in R

You need to define the number of successes out of the whole sample size. For example, 19 out of 34 is not significant; 19 out of 20 is significant.

binom.test(19,34)

Exact binomial test
data: 19 and 34
number of successes = 19, number of trials = 34, p-value = 0.6076
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval: 0.3788576 0.7281498
sample estimates: probability of success 0.5588235

binom.test(19,20)

Exact binomial test
data: 19 and 20
number of successes = 19, number of trials = 20, p-value = 4.005e-05
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval: 0.7512672 0.9987349
sample estimates: probability of success 0.95

Slide 26

Chi-square test in R: chisq.test

Data: lizard insularity & diet:

habitat   diet       species#
island    carnivore  488
island    herbivore  43
island    omnivore   177
mainland  carnivore  1901
mainland  herbivore  101
mainland  omnivore   269

M<-as.table(rbind(c(1901,101,269),c(488,43,177)))
chisq.test(M)

data: M
χ² = 80.04, df = 2, p-value < 2.2e-16

Slide 27

Chi-square test in R: chisq.test

Now let's use our dataset:

dat<-read.csv("island_type_final2.csv")
install.packages("reshape")
library(reshape)
cast(dat, type ~ what, length)

type         anoles  else  gecko
Continental  7       45    45
Land_bridge  1       30    14
Oceanic      23      110   44

M<-as.table(rbind(c(7,45,45),c(1,30,14),c(23,110,44)))
chisq.test(M)

data: M
χ² = 17.568, df = 4, p-value = 0.0015

Slide 28

Which test should we choose?

If our response variable is continuous, we'll choose our test based on the predictor variables:

If our predictor variable is categorical (Area 1, Area 2, Area 3, or species A, species B, species C), we'll use ANOVA.
If our predictor variable is continuous (temperature, body mass, height), we'll use REGRESSION.

Slide 29

t-test in R: t.test(x,y)

An excerpt of ssd.csv:

Sex     size
female  79.7
male    85
male    120
female  133.0
male    118
female  126.0
female  105.8
male    112
male    106
female  121.0
male    95
female  111.0
male    86
female  93.0
male    65
female  75.0
male    230
female  240.0

dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
males<-size[Sex=="male"]
females<-size[Sex=="female"]
t.test(females,males)

Welch Two Sample t-test
data: females and males
t = -2.1541, df = 6866.57, p-value = 0.03127
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -7.5095545 -0.3536548
sample estimates: mean of x mean of y 88.17030 92.10191

Slide 30

t-test in R (2): lm(y~x)

(Same Sex/size data as on slide 29.)

dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
model<-lm(size~Sex,data=dimorphism)
summary(model)

             Estimate  standard error  t      p value
(Intercept)  88.17     1.291           68.32  <2e-16  ***
Sexmale      3.932     1.825           2.154  0.031   *

Slide 31

Paired t-test in R: t.test(x,y,paired=TRUE)

Species                   size   Sex
Xenagama_zonura           79.7   female
Xenagama_zonura           85     male
Xenosaurus_grandis        120    male
Xenosaurus_grandis        133.0  female
Xenosaurus_newmanorum     118    male
Xenosaurus_newmanorum     126.0  female
Xenosaurus_penai          105.8  female
Xenosaurus_penai          112    male
Xenosaurus_platyceps      106    male
Xenosaurus_platyceps      121.0  female
Xenosaurus_rectocollaris  95     male
Xenosaurus_rectocollaris  111.0  female
Zonosaurus_anelanelany    86     male
Zonosaurus_anelanelany    93.0   female
Zootoca_vivipara          65     male
Zootoca_vivipara          75.0   female
Zygaspis_nigra            230    male
Zygaspis_nigra            240.0  female
Zygaspis_quadrifrons      195    male
Zygaspis_quadrifrons      227.0  female

dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
males<-size[Sex=="male"]
females<-size[Sex=="female"]
t.test(females,males,paired=TRUE)

Paired t-test
data: females and males
t = -10.192, df = 3503, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -4.688 -3.175
sample estimates: mean of the differences -3.931

tapply(size,Sex,mean)
female  male
88.17   92.10

Slide 32

ANOVA in R: model<-aov(y~x)

island<-read.csv("island_type_final2.csv",header=T)
names(island)
[1] "species" "what" "family" "insular" "Archipelago" "largest_island"
[7] "area" "type" "age" "iso" "lat" "mass"
[13] "clutch" "brood" "hatchling" "productivity"

model<-aov(clutch~type,data=island)
summary(model)

species                    type         clutch
Trachylepis_sechellensis   Continental  0.6
Trachylepis_wrightii       Continental  0.65
Tropidoscincus_boreus      Continental  0.4
Tropidoscincus_variabilis  Continental  0.45
Urocotyledon_inexpectata   Continental  0.3
Varanus_beccarii           Continental  0.58
Algyroides_fitzingeri      Land_bridge  0.4
Anolis_wattsi              Land_bridge  0
Archaeolacerta_bedriagae   Land_bridge  0.65
Cnemaspis_affinis          Land_bridge  0.3
Cnemaspis_limi             Land_bridge  0.18
Cnemaspis_monachorum       Land_bridge  0
Amblyrhynchus_cristatus    Oceanic      0.35
Ameiva_erythrocephala      Oceanic      0.6
Ameiva_fuscata             Oceanic      0.6
Ameiva_plei                Oceanic      0.41
Anolis_acutus              Oceanic      0
Anolis_aeneus              Oceanic      0
Anolis_agassizi            Oceanic      0
Anolis_bimaculatus         Oceanic      0.18
Anolis_bonairensis         Oceanic      0

           Df   Sum sq  Mean sq  F value  Pr(>F)
type       2    0.466   0.23296  2.784    0.0635 .
Residuals  289  24.184  0.08368

Slide 33

Post-hoc test for ANOVA in R: TukeyHSD(model)

Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = clutch ~ type, data = island)

$type
                         diff     lwr      upr     p adj
Land_bridge-Continental  0.124    -0.0025  0.2505  0.0561
Oceanic-Continental      0.0218   -0.0671  0.1108  0.8318
Oceanic-Land_bridge      -0.102   -0.2206  0.0163  0.1066

The differences are not significant. Notice that zero is always inside the confidence interval. The difference between land-bridge islands and continental islands is very close to significance (p = 0.056).

Slide 34

Correlation in R: cor.test(x,y)

island<-read.csv("island_type_final2.csv",header=T)
names(island)
[1] "species" "what" "family" "insular" "Archipelago" "largest_island"
[7] "area" "type" "age" "iso" "lat" "mass"
[13] "clutch" "brood" "hatchling" "productivity"

attach(island)
cor.test(mass,lat)

Pearson's product-moment correlation
data: mass and lat
t = -1.138, df = 317, p-value = 0.256
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: -0.17239 0.04635
sample estimates: cor -0.06378

The sample estimate "cor" is the correlation coefficient r.

lat  mass
5    1.21
5    0.83
4    1.84
18   1.39
18   0.42
18   0.29
20   0.45
18   1.54
18   0.36
18   0.27
18   0.04
18   0.01
5    1.21
21   0.95
21   0.51
21   0.29
22   0.74
21   0.92

Slide 35

Regression in R: lm (= "linear model"): lm(y~x)

Same data as in the previous example.

model<-lm(mass~lat,data=island)
summary(model)

Call: lm(formula = mass ~ lat, data = island)

Residuals:
Min     1Q      Median  3Q     Max
-4.708  -1.774  0.470   1.465  3.725

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  0.958034  0.096444    9.934    <2e-16  ***
lat          -0.00554  0.004872    -1.138   0.256

Residual standard error: 0.8206 on 317 degrees of freedom
Multiple R-squared: 0.004069, Adjusted R-squared: 0.0009268
F-statistic: 1.295 on 1 and 317 DF, p-value: 0.256
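To see this fit on a scatterplot like the one on slide 22 – a small sketch of my own:

plot(lat, mass, pch=16)  # response (mass) on the y-axis
abline(model)            # adds the fitted regression line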

Slide 36

lm vs. aov

We can also use 'lm' with data that fit an ANOVA. In this case 'summary' gives us everything it gives for a regression 'lm': parameter estimates, SEs, differences between factor levels, and p-values for contrasts between the categories of our predictor variable.

Slide 37

aov vs. lm

island<-read.csv("island_type_final2.csv",header=T)
model<-aov(clutch~type,data=island)
model2<-lm(clutch~type,data=island)
summary(model)
summary(model2)

aov results:
           Df   Sum sq  Mean sq  F value  Pr(>F)
type       2    0.466   0.23296  2.784    0.0635 .
Residuals  289  24.184  0.08368

lm results:
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      0.33149   0.02984     11.11    <2e-16  ***
typeLand_bridge  0.12399   0.05369     2.309    0.0216  *
typeOceanic      0.02184   0.03777     0.578    0.5635

Residual standard error: 0.2893 on 289 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.0189, Adjusted R-squared: 0.01211
F-statistic: 2.784 on 2 and 289 DF, p-value: 0.06346

We can use 'lm' also on data that fit an ANOVA; 'summary' then gives parameter estimates, SEs, and p-values for contrasts between category pairs of our predictor variable. More later on.

Slide 38

Assumptions of statistical tests (all statistical tests)

Random sampling (an assumption of all tests, not only parametric ones)
Independence (spatial, phylogenetic, etc.)

[Figure: a non-random, non-independent sample of Israeli people]

Slide 39

Assumptions of parametric tests. A. ANOVA

In addition to the assumptions of all tests:
Homoscedasticity
Normal distribution of the residuals

Reading material: Sokal & Rohlf 1995. Biometry. 3rd edition. Pages 392-409 (especially 406-407 for normality).

"Comments on earlier drafts of this manuscript made it clear that for many readers who analyze data but who are not particularly interested in statistical questions, any discussion of statistical methods becomes uncomfortable when the term 'error variance' is introduced." – Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486. (Richard Smith & 3 friends)

Slide 40

Always look at your data: Anscombe's quartet

Anscombe 1973. Graphs in statistical analysis. The American Statistician 27: 17–21.

Summary statistics are the same for all four data sets:
n = 11
means of x & y (9, 7.5)
variance of y (4.12)
regression & residual SS
correlation r = 0.816
regression line (y = 3 + 0.5x)

Don't just rely on the statistics!

http://en.wikipedia.org/wiki/Anscombe%27s_quartet
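R ships these data as the built-in anscombe data frame, so the claim is easy to verify – a quick sketch of my own:

sapply(anscombe, mean)          # x1..x4 all 9; y1..y4 all about 7.5
sapply(anscombe, var)           # x1..x4 all 11; y1..y4 all about 4.12
cor(anscombe$x1, anscombe$y1)   # about 0.816, likewise for the other three pairs
par(mfrow=c(2,2))
plot(y1~x1, data=anscombe); plot(y2~x2, data=anscombe)
plot(y3~x3, data=anscombe); plot(y4~x4, data=anscombe)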

Slide 41

Assumptions of parametric tests. B. Regression

1. Homoscedasticity

Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486.

Slide 42

Assumptions of parametric tests. B. Regression

1. Homoscedasticity
2. The explanatory variable was sampled without error

Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486.

Slide 43

Assumptions of parametric tests. B. Regression

1. Homoscedasticity
2. The explanatory variable was sampled without error
3. Normal distribution of the residuals of each response variable

Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486.

Slide 44

Assumptions of parametric tests. B. Regression

1. Homoscedasticity
2. The explanatory variable was sampled without error
3. Normal distribution of the residuals of each response variable
4. Equality of variance between the values of the explanatory variables

Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486.

Slide 45

Assumptions of parametric tests. B. Regression

1. Homoscedasticity
2. The explanatory variable was sampled without error
3. Normal distribution of the residuals of each response variable
4. Equality of variance between the values of the explanatory variables
5. Linear relationship between the response and the predictor

Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486.

Slide 46

How will we test whether our model follows the assumptions?

R has very useful model-diagnostic functions that let us evaluate graphically how well our model meets the assumptions (especially in regression).

https://www.youtube.com/watch?v=eTZ4VUZHzxw
See also: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/plot.lm.html
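The plot.lm page linked above is what R calls when you plot() a fitted model; a minimal sketch, reusing the regression model from slide 35:

model<-lm(mass~lat,data=island)
par(mfrow=c(2,2))  # show the four diagnostic plots together
plot(model)        # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage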

Slide 47

What can we do when our data don't follow the assumptions?

1. Ignore it and hope that our test is robust enough to withstand the broken assumptions: this is not as unreasonable as it sounds.
2. Use non-parametric tests.
3. Use generalized linear models (glm), which means: transformation (in glm this means changing the link function), and/or changing the error distribution (in glm) to a non-normal one. (A sketch follows below.)
4. Use non-linear tests.
5. Use randomization (more about it in Roi's lessons).
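A minimal glm sketch on made-up data – choosing a Poisson error distribution with a log link for a count response is my example, not the deck's:

set.seed(1)
predictor<-runif(50, 0, 10)                        # hypothetical continuous predictor
counts<-rpois(50, lambda=exp(0.2+0.1*predictor))   # hypothetical count response
m<-glm(counts~predictor, family=poisson(link="log"))
summary(m)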

Slide 48

Non-parametric tests

Non-parametric tests do not assume equality of variance or a normal distribution; they are based on ranks.

Disadvantages:
There are no tests for models with multiple predictors.
Their statistical power is often much lower than that of the equivalent parametric test.
They do not give you parameter estimates (slopes, intercepts).

I think it is really wrong to have a presentation without any animal pictures in it.


Slide 50

A few useful non-parametric tests

The chi-square test is a non-parametric test.
Kolmogorov-Smirnov is a non-parametric test used to compare two frequency distributions (or to compare "our" distribution to a known distribution, for example a normal distribution).
Mann-Whitney U (= Wilcoxon rank sum) is a non-parametric test equivalent to Student's t-test.
The Wilcoxon two-sample (= Wilcoxon signed-rank) test replaces the paired t-test.
Kruskal-Wallis replaces one-way ANOVA.
The Spearman and Kendall's tau tests replace correlation tests.

[Photo: Orycteropus afer – the individual photographed is not related to the lectures]
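The corresponding base-R calls, sketched with variables used earlier in the deck (my pairing of tests to data, not the deck's code):

wilcox.test(females, males)              # Mann-Whitney U / Wilcoxon rank sum
wilcox.test(females, males, paired=TRUE) # Wilcoxon signed-rank, replaces the paired t-test
kruskal.test(clutch~type, data=island)   # Kruskal-Wallis, replaces one-way ANOVA
cor.test(mass, lat, method="spearman")   # Spearman rank correlation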

Slide 51

Non-parametric tests in R

Kolmogorov-Smirnov is a non-parametric test used to compare two frequency distributions (or to compare "our" distribution to a known distribution, for example a normal distribution).

We need to define the grouping variable and the response in R. Let's say we want to compare the frequency distributions of lizard body mass on oceanic and land-bridge islands:

island<-read.csv("island_type_final2.csv",header=T)
attach(island)
levels(type)
[1] "Continental" "Land_bridge" "Oceanic"
Land_bridge<-mass[type=="Land_bridge"]
Oceanic<-mass[type=="Oceanic"]
ks.test(Land_bridge, Oceanic)

Two-sample Kolmogorov-Smirnov test
data: Land_bridge and Oceanic
D = 0.1955, p-value = 0.1288
alternative hypothesis: two-sided

[Photo: Orycteropus afer – the individual photographed is not related to the lectures]
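For the one-sample case mentioned above – comparing a variable to a normal distribution – a minimal sketch; note that plugging in the sample mean and SD is my shortcut, and estimating the parameters from the same data makes the p-value only approximate:

ks.test(mass, "pnorm", mean=mean(mass, na.rm=TRUE), sd=sd(mass, na.rm=TRUE))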