Basic statistical inference in R
Shai Meiri
Everything differs!
"We expect to find differences between x and y" is a trivial statement.
The statistician within you asks: "Are the differences we found larger than expected by chance?"
The biologist within you asks: "Why are the differences I found in the direction, and of the magnitude, that they are?"
Moments of central tendency
A. Mean
1. Arithmetic mean: Σxi/n
2. Geometric mean: (x1*x2*…*xn)^(1/n)
3. Harmonic mean: n/Σ(1/xi)
Moments of central tendency in R
1. Arithmetic mean: Σxi/n. Use the function mean.
Example:
data<-c(2,3,4,5,6,7,8)
mean(data)
[1] 5
2. Geometric mean: (x1*x2*…*xn)^(1/n). Take the exponent of the mean of the logs.
Example:
data<-c(2,3,4,5,6,7,8)
exp(mean(log(data)))
[1] 4.549163
You can also use the .csv file:
dat<-read.csv("island_type_final2.csv")
attach(dat)
mean(lat)
[1] 17.40439
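The harmonic mean has no dedicated base-R function; a one-liner sketch for the same example vector (the course slides do not show this code):

```r
data <- c(2,3,4,5,6,7,8)
# 3. Harmonic mean: n / sum(1/x), i.e. the reciprocal of the mean of the reciprocals
length(data)/sum(1/data)   # about 4.07
# equivalently: 1/mean(1/data)
```

Note that for positive data, harmonic mean ≤ geometric mean ≤ arithmetic mean, which the three results (4.07, 4.55, 5) illustrate.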
Moments of central tendency
A. Mean
B. Median
C. Mode
General example:
data<-c(2,3,4,5,6,7,8)
median(data)
[1] 5
Example from the .csv:
median(mass)
[1] 0.69
Moments of central tendency
A. Mean
B. Variance = Σ(xi-μ)²/n
(Note that R's var() computes the sample variance, dividing by n-1.)
Example (see http://www.statmethods.net/management/functions.html):
data<-c(2,3,4,5,6,7,8)
var(data)
[1] 4.666667
var(lat)
[1] 89.20388
Is the mean a good measure of what is happening in the population when the variance is low?
Moments of central tendency
A. Mean
B. Variance
The second moment of central tendency measures how much the data are scattered around the first moment (the mean).
Examples of second-moment measures are the variance, the standard deviation, the standard error, the coefficient of variation, and the 90%, 95% and 99% confidence intervals.
Moments of central tendency
#for:
data<-c(2,3,4,5,6,7,8)
Sample size: length(data)
Variance: var(data)
Standard deviation: sd(data)
Standard error:
se<-(sd(data)/length(data)^0.5)
se
[1] 0.8164966
Coefficient of variation:
CV<-sd(data)/mean(data)
CV
[1] 0.4320494
Moments of central tendency
A. Mean
B. Variance
C. Skew
A skewed frequency distribution is not symmetric.
Do you think the arithmetic mean is a good measure of central tendency for a skewed frequency distribution?
What is the mean salary of the students here plus Bill Gates?
Moments of central tendency: Skew
skew<-function(data){
  m3<-sum((data-mean(data))^3)/length(data)
  s3<-sqrt(var(data))^3
  m3/s3
}
skew(data)
The SE of skewness:
sdskew<-function(x) sqrt(6/length(x))
Moments of central tendency
A. Mean
B. Variance
C. Skew
D. Kurtosis
Moments of central tendency: Kurtosis
kurtosis<-function(x){
  m4<-sum((x-mean(x))^4)/length(x)
  s4<-var(x)^2
  m4/s4-3
}
kurtosis(x)
SE of kurtosis:
sdkurtosis<-function(x) sqrt(24/length(x))
A normal distribution can take any value of mean and variance, but its skewness and kurtosis should equal zero.
Values of skew and kurtosis have their own variance: zero should fall outside their confidence interval for them to be significantly different from zero.
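That significance check can be sketched by dividing the skew estimate by its SE, reusing the functions from the earlier slides (the example vector here is made up, not course data):

```r
# skew() and sdskew() as defined on the earlier slides
skew <- function(data){
  m3 <- sum((data - mean(data))^3)/length(data)
  s3 <- sqrt(var(data))^3
  m3/s3
}
sdskew <- function(x) sqrt(6/length(x))

x <- c(1, 1, 2, 2, 3, 3, 4, 10, 20, 40)  # a clearly right-skewed toy sample
t_skew <- skew(x)/sdskew(x)               # approximate t statistic
t_skew
# |t_skew| greater than roughly 2 suggests the skew differs significantly from zero
```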
Residuals
When doing statistics we're creating models of reality.
One of the simplest models is the mean:
The mean height of Israeli citizens is 173 cm
The mean salary is 9,271 ₪ (correct for April 2014)
The mean service in the IDF is 24 months (a guess)
[Figure examples: a salary of 46,699 ₪ a month, excluding the bottles (http://www.haaretz.co.il/1.2057452); Rab. Dov Lior, who served in the IDF for 1 month; a height of 2.06 m]
Residuals
When doing statistics we're creating models of reality.
We can see that our models (24 months, 9,271 ₪, 173 cm) are not very successful:
Residual = 33 cm
Residual = ₪37,428
Residual = -23 months of IDF service
The residual is how far a certain value is from the prediction of the model.
Omri Caspi is 33 cm away from the model "Israeli = 173" and 29 cm from the more complicated model "Israeli man = 177, Israeli woman = 168".
Residuals
When doing statistics we're creating models of reality.
dat<-read.csv("island_type_final2.csv")
model<-lm(mass~iso+area+age+lat, data=dat)
out<-model$residuals
out
write.table(out, file="residuals.txt", sep="\t", col.names=F, row.names=F)
#note that residual values are in the order entered (i.e., not alphabetic, not by residual size – first in, first out)
Theoretical statistics and statistical inference
In statistical inference we test the behavior of our data against a certain hypothesis.
We can present our hypothesis as a statistical model. For example:
The distribution of heights is normal
The number of species increases with area
The number of species increases with area as a power function with exponent 0.25
When we have data it is best to first describe them: plot graphs, calculate the mean, and so on.
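The power-function hypothesis above can be examined with a log-log linear model; a sketch on simulated data (the variable names and simulated richness values are assumptions, not course data; only the 0.25 exponent comes from the slide):

```r
# Fit "S ~ c * area^z" by regressing log(S) on log(area): the slope estimates z.
set.seed(3)
area <- 10^runif(30, 0, 4)                            # areas over 4 orders of magnitude
S <- round(3 * area^0.25 * exp(rnorm(30, sd = 0.2)))  # simulated richness, exponent 0.25
m <- lm(log(S) ~ log(area))
coef(m)[2]   # slope should land near the true exponent of 0.25
```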
Frequency distribution*
*graphic form = "histogram"
Describes the distribution of all observations.
dat<-read.csv("island_type_final2.csv")
attach(dat)
names(dat)
hist(mass)
How many observations are in each bin?
Frequency distribution
What did we learn?
dat<-read.csv("island_type_final2.csv")
attach(dat)
hist(mass)
There are no masses smaller than one tenth of a gram or larger than 100 kg.
Lizards with a mass between 1 and 10 are very common; larger or smaller lizards are rare.
The distribution is unimodal and skewed to the right.
Frequency distribution
Histograms don't have to be so ugly:
dat<-read.csv("island_type_final2.csv")
attach(dat)
hist(mass, col="purple", breaks=25, xlab="log mass (g)", main="masses of island lizards - great data by Maria", cex.axis=1.2, cex.lab=1.5)
Presenting a categorical predictor with a continuous response variable
Always prefer a boxplot to a barplot.
dat<-read.csv("island_type_final2.csv")
attach(dat)
plot(type,brood)
Presenting a continuous variable against another continuous variable
dat<-read.csv("island_type_final2.csv")
attach(dat)
plot(mass,clutch)
plot(mass,clutch,pch=16,col="blue")
Which test should we choose?
It depends on the nature of our response (y) variable, and mostly on the nature of our predictor variables.
If the response variable is "success or failure" and the null hypothesis is that both are equally likely, we'll use a binomial test.
If the response variable is counts, we'll usually use chi-square or G.
In many cases our response variable will be continuous (14 species, 78 individuals, 54 heartbeats per second, 7.3 eggs, 23 degrees).
Which test should we choose?
What is your response variable?
Success or failure (found the cheese or not): binomial
Counts (frequency: 6 females, 4 males): chi-square or G (= log-likelihood)
Continuous (14 species, 78 individuals, 23 degrees, 7.3 eggs): soon…
Binomial test in R
binom.test(successes, trials): you need to give the number of successes out of the whole sample size.
For example, 19 out of 34 is not significant:
binom.test(19,34)
Exact binomial test
data: 19 and 34
number of successes = 19, number of trials = 34, p-value = 0.6076
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval: 0.3788576 0.7281498
sample estimates: probability of success 0.5588235
But 19 out of 20 is significant:
binom.test(19,20)
Exact binomial test
data: 19 and 20
number of successes = 19, number of trials = 20, p-value = 4.005e-05
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval: 0.7512672 0.9987349
sample estimates: probability of success 0.95
Chi-square test in R: chisq.test
Data: lizard insularity & diet

habitat   diet       species#
island    carnivore  488
island    herbivore  43
island    omnivore   177
mainland  carnivore  1901
mainland  herbivore  101
mainland  omnivore   269

M<-as.table(rbind(c(1901,101,269),c(488,43,177)))
chisq.test(M)
data: M
X-squared = 80.04, df = 2, p-value < 2.2e-16
Chi-square test in R: chisq.test
Now let's use our dataset:
dat<-read.csv("island_type_final2.csv")
install.packages("reshape")
library(reshape)
cast(dat, type ~ what, length)

type         anoles  else  gecko
Continental  7       45    45
Land_bridge  1       30    14
Oceanic      23      110   44

M<-as.table(rbind(c(7,45,45),c(1,30,14),c(23,110,44)))
chisq.test(M)
data: M
X-squared = 17.568, df = 4, p-value = 0.0015
Which test should we choose?
If our response variable is continuous, we choose the test based on the predictor variables:
If our predictor variable is categorical (Area 1, Area 2, Area 3; or species A, species B, species C) we'll use ANOVA.
If our predictor variable is continuous (temperature, body mass, height) we'll use REGRESSION.
t-test in R
t.test(x,y)

Sex     size
female  79.7
male    85
male    120
female  133.0
male    118
female  126.0
female  105.8
male    112
male    106
female  121.0
male    95
female  111.0
male    86
female  93.0
male    65
female  75.0
male    230
female  240.0

dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
males<-size[Sex=="male"]
females<-size[Sex=="female"]
t.test(females,males)
Welch Two Sample t-test
data: females and males
t = -2.1541, df = 6866.57, p-value = 0.03127
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -7.5095545 -0.3536548
sample estimates: mean of x 88.17030, mean of y 92.10191
t-test in R (2)
lm(y~x)
dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
model<-lm(size~Sex,data=dimorphism)
summary(model)

             Estimate  Std. Error  t value  p value
(Intercept)  88.17     1.291       68.32    <2e-16  ***
Sexmale      3.932     1.825       2.154    0.031   *

(The Sex/size data are the same as on the previous slide.)
Paired t-test in R
t.test(x,y,paired=TRUE)

Species                   size   Sex
Xenagama_zonura           79.7   female
Xenagama_zonura           85     male
Xenosaurus_grandis        120    male
Xenosaurus_grandis        133.0  female
Xenosaurus_newmanorum     118    male
Xenosaurus_newmanorum     126.0  female
Xenosaurus_penai          105.8  female
Xenosaurus_penai          112    male
Xenosaurus_platyceps      106    male
Xenosaurus_platyceps      121.0  female
Xenosaurus_rectocollaris  95     male
Xenosaurus_rectocollaris  111.0  female
Zonosaurus_anelanelany    86     male
Zonosaurus_anelanelany    93.0   female
Zootoca_vivipara          65     male
Zootoca_vivipara          75.0   female
Zygaspis_nigra            230    male
Zygaspis_nigra            240.0  female
Zygaspis_quadrifrons      195    male
Zygaspis_quadrifrons      227.0  female

dimorphism<-read.csv("ssd.csv",header=T)
attach(dimorphism)
names(dimorphism)
males<-size[Sex=="male"]
females<-size[Sex=="female"]
t.test(females,males,paired=TRUE)
Paired t-test
data: females and males
t = -10.192, df = 3503, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -4.688 -3.175
sample estimates: mean of the differences -3.931

tapply(size,Sex,mean)
female  male
88.17   92.10
ANOVA in R
model<-aov(y~x)
island<-read.csv("island_type_final2.csv",header=T)
names(island)
[1] "species" "what" "family" "insular" "Archipelago" "largest_island"
[7] "area" "type" "age" "iso" "lat" "mass"
[13] "clutch" "brood" "hatchling" "productivity"
model<-aov(clutch~type,data=island)
summary(model)

species                    type         clutch
Trachylepis_sechellensis   Continental  0.6
Trachylepis_wrightii       Continental  0.65
Tropidoscincus_boreus      Continental  0.4
Tropidoscincus_variabilis  Continental  0.45
Urocotyledon_inexpectata   Continental  0.3
Varanus_beccarii           Continental  0.58
Algyroides_fitzingeri      Land_bridge  0.4
Anolis_wattsi              Land_bridge  0
Archaeolacerta_bedriagae   Land_bridge  0.65
Cnemaspis_affinis          Land_bridge  0.3
Cnemaspis_limi             Land_bridge  0.18
Cnemaspis_monachorum       Land_bridge  0
Amblyrhynchus_cristatus    Oceanic      0.35
Ameiva_erythrocephala      Oceanic      0.6
Ameiva_fuscata             Oceanic      0.6
Ameiva_plei                Oceanic      0.41
Anolis_acutus              Oceanic      0
Anolis_aeneus              Oceanic      0
Anolis_agassizi            Oceanic      0
Anolis_bimaculatus         Oceanic      0.18
Anolis_bonairensis         Oceanic      0

           Df   Sum sq  Mean sq  F value  Pr(>F)
type       2    0.466   0.23296  2.784    0.0635 .
Residuals  289  24.184  0.08368
Post-hoc test for ANOVA in R
TukeyHSD(model)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = clutch ~ type, data = island)
$type
                         diff     lwr      upr     p adj
Land_bridge-Continental  0.124    -0.0025  0.2505  0.0561
Oceanic-Continental      0.0218   -0.0671  0.1108  0.8318
Oceanic-Land_bridge      -0.102   -0.2206  0.0163  0.1066

None of the differences is significant: notice that zero is always inside the confidence interval. The difference between land-bridge and continental islands is very close to significance (p = 0.056).
correlation in R
cor.test(x,y)
island<-read.csv("island_type_final2.csv",header=T)
names(island)
[1] "species" "what" "family" "insular" "Archipelago" "largest_island"
[7] "area" "type" "age" "iso" "lat" "mass"
[13] "clutch" "brood" "hatchling" "productivity"
attach(island)
cor.test(mass,lat)
Pearson's product-moment correlation
data: mass and lat
t = -1.138, df = 317, p-value = 0.256
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: -0.17239 0.04635
sample estimates: cor -0.06378
The value "cor" in the output is the correlation coefficient r.

lat  mass
5    1.21
5    0.83
4    1.84
18   1.39
18   0.42
18   0.29
20   0.45
18   1.54
18   0.36
18   0.27
18   0.04
18   0.01
5    1.21
21   0.95
21   0.51
21   0.29
22   0.74
21   0.92
regression in R
lm (= "linear model"): lm(y~x)
Same data as in the previous example.
model<-lm(mass~lat,data=island)
summary(model)
Call: lm(formula = mass ~ lat, data = island)
Residuals:
   Min      1Q  Median      3Q     Max
-4.708  -1.774   0.470   1.465   3.725
Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  0.958034   0.096444    9.934    <2e-16   ***
lat          -0.00554   0.004872    -1.138   0.256
Residual standard error: 0.8206 on 317 degrees of freedom
Multiple R-squared: 0.004069, Adjusted R-squared: 0.0009268
F-statistic: 1.295 on 1 and 317 DF, p-value: 0.256
lm vs. aov
We can also use 'lm' with data that fits ANOVA.
In this case 'summary' gives everything it gives for a regression 'lm': parameter estimates, SEs, differences between factor levels, and p-values for contrasts between the categories of our predictor variable.
aov vs. lm
island<-read.csv("island_type_final2.csv",header=T)
model<-aov(clutch~type,data=island)
model2<-lm(clutch~type,data=island)
summary(model)
summary(model2)

aov results:
           Df   Sum sq  Mean sq  F value  Pr(>F)
type       2    0.466   0.23296  2.784    0.0635 .
Residuals  289  24.184  0.08368

lm results:
                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      0.33149   0.02984     11.11    <2e-16   ***
typeLand_bridge  0.12399   0.05369     2.309    0.0216   *
typeOceanic      0.02184   0.03777     0.578    0.5635
Residual standard error: 0.2893 on 289 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.0189, Adjusted R-squared: 0.01211
F-statistic: 2.784 on 2 and 289 DF, p-value: 0.06346

More on this later.
Assumptions of statistical tests (all statistical tests)
1. Random sampling (an assumption of all tests, not only parametric ones)
2. Independence (spatial, phylogenetic, etc.)
[Figure: a non-random, non-independent sample of Israeli people]
Assumptions of parametric tests. A. ANOVA
In addition to the assumptions of all tests:
1. Homoscedasticity
2. Normal distribution of the residuals
Reading material: Sokal & Rohlf 1995. Biometry. 3rd edition. Pages 392-409 (especially 406-407 for normality).
"Comments on earlier drafts of this manuscript made it clear that for many readers who analyze data but who are not particularly interested in statistical questions, any discussion of statistical methods becomes uncomfortable when the term 'error variance' is introduced."
Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486. (Richard Smith & 3 friends)
Always look at your data: Anscombe's quartet
Anscombe 1973. Graphs in statistical analysis. The American Statistician 27: 17–21.
Summary statistics are the same for all four data sets:
n = 11
means of x & y (9, 7.5)
variance of y (4.12)
regression & residual SS
correlation between x and y (0.816)
regression line (y = 3 + 0.5x)
Don't just rely on the statistics!
http://en.wikipedia.org/wiki/Anscombe%27s_quartet
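Base R ships the quartet as the built-in data frame `anscombe`, so the identical summaries can be checked directly (this code is not from the slides):

```r
# anscombe has columns x1..x4 and y1..y4, one pair per data set
sapply(anscombe[, c("x1","x2","x3","x4")], mean)  # all means of x are 9
sapply(anscombe[, c("y1","y2","y3","y4")], var)   # all variances of y are ~4.13
cor(anscombe$x1, anscombe$y1)                     # ~0.816; same for each pair
coef(lm(y1 ~ x1, data = anscombe))                # intercept ~3, slope ~0.5
```

Plotting the four pairs (e.g. plot(y1 ~ x1, data = anscombe)) shows how different the data sets really are.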
Assumptions of parametric tests. B. Regression
1. Homoscedasticity
2. The explanatory variable was sampled without error
3. Normal distribution of the residuals of each response variable
4. Equality of variance between the values of the explanatory variable
5. Linear relationship between the response and the predictor
Smith, R. J. 2009. Use and misuse of the reduced major axis for line-fitting. American Journal of Physical Anthropology 140: 476-486.
How will we test whether our model follows the assumptions?
R has very useful model-diagnostic functions which allow us to evaluate graphically how well our model meets the assumptions (especially in regression).
See also:
http://stat.ethz.ch/R-manual/R-patched/library/stats/html/plot.lm.html
https://www.youtube.com/watch?v=eTZ4VUZHzxw
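A minimal sketch of those built-in diagnostics; simulated data stand in for the course csv here:

```r
# plot() on an lm object draws four diagnostic panels:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
set.seed(1)
x <- runif(50)
y <- 2*x + rnorm(50, sd = 0.3)   # simulated data, not island_type_final2.csv
model <- lm(y ~ x)
par(mfrow = c(2, 2))             # arrange all four panels in one window
plot(model)
```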
What can we do when our data don't follow the assumptions?
1. Ignore it and hope that our test is robust enough to violations of the assumptions: this is not as unreasonable as it sounds
2. Use non-parametric tests
3. Use generalized linear models (glm), which means:
   - transformation (in glm this means changing the link function)
   - changing the error distribution (in glm) to a non-normal distribution
4. Use non-linear tests
5. Use randomization (more about it in Roi's lessons)
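A sketch of option 3, a glm for count data with a Poisson error distribution and a log link (toy data; the variable names and simulated counts are illustrative, not course data):

```r
# glm() generalizes lm(): you choose an error distribution and a link function
set.seed(7)
area <- runif(40, 1, 100)
counts <- rpois(40, lambda = exp(0.5 + 0.3*log(area)))  # simulated species counts
m <- glm(counts ~ log(area), family = poisson(link = "log"))
summary(m)   # coefficients are on the log (link) scale
```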
Non-parametric tests
Non-parametric tests do not assume equality of variance or a normal distribution; they are based on ranks.
Disadvantages:
1. There are no tests for models with multiple predictors
2. Their statistical power is often low compared to the equivalent parametric test
3. They do not give parameter estimates (slopes, intercepts)
(I think it is really wrong to have a presentation without any animal pictures in it)
A few useful non-parametric tests
The chi-square test is a non-parametric test.
Kolmogorov-Smirnov is a non-parametric test used to compare two frequency distributions (or to compare "our" distribution to a known distribution, for example a normal distribution).
Mann-Whitney U (= Wilcoxon rank-sum) is the non-parametric equivalent of Student's t-test.
The Wilcoxon two-sample (= Wilcoxon signed-rank) test replaces the paired t-test.
Kruskal-Wallis replaces one-way ANOVA.
Spearman and Kendall's tau tests replace correlation tests.
(Orycteropus afer: the animal photographed is not related to the lectures)
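Quick sketches of the rank-based tests named above, on toy vectors (not the course data):

```r
a <- c(1.2, 2.3, 3.1, 4.8, 5.5)
b <- c(2.0, 3.9, 4.4, 6.1, 7.0)
wilcox.test(a, b)                    # Mann-Whitney U / Wilcoxon rank-sum
wilcox.test(a, b, paired = TRUE)     # Wilcoxon signed-rank (paired samples)
kruskal.test(list(a, b, a + 1))      # Kruskal-Wallis for three groups
cor.test(a, b, method = "spearman")  # Spearman rank correlation
```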
Non-parametric tests in R
Kolmogorov-Smirnov is a non-parametric test used to compare two frequency distributions (or to compare "our" distribution to a known distribution, for example a normal distribution).
We need to define in R the grouping variable and the response: let's say we want to compare the frequency distributions of lizard body mass on oceanic and land-bridge islands.
island<-read.csv("island_type_final2.csv",header=T)
attach(island)
levels(type)
[1] "Continental" "Land_bridge" "Oceanic"
Land_bridge<-mass[type=="Land_bridge"]
Oceanic<-mass[type=="Oceanic"]
ks.test(Land_bridge, Oceanic)
Two-sample Kolmogorov-Smirnov test
data: Land_bridge and Oceanic
D = 0.1955, p-value = 0.1288
alternative hypothesis: two-sided
(The animal photographed is not related to the lectures.)
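The one-sample form mentioned above, comparing a sample to a normal distribution, looks like this (toy data; note that estimating the mean and sd from the same sample makes the test conservative):

```r
set.seed(42)
z <- rnorm(100, mean = 2, sd = 1)     # simulated sample, not the course csv
ks.test(z, "pnorm", mean(z), sd(z))   # compare to a normal with the sample's mean and sd
```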