Anscombe’s Quartet
Why Statistics Can Be Wrong
Anscombe's quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. [1]
The Statistical Properties
Property                                   Value
Mean of x in each case                     9 (exact)
Sample variance of x in each case          11 (exact)
Mean of y in each case                     7.50 (to 2 decimal places)
Sample variance of y in each case          between 4.123 and 4.128 (to 3 decimal places)
Correlation between x and y in each case   0.816 (to 3 decimal places)
Linear regression line in each case        y = 3.00 + 0.500x (to 2 and 3 decimal places, respectively)
All four sets are nearly identical when examined using simple summary statistics, but vary considerably when graphed.
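To get a feel for how different the four sets look, a rough text rendering can stand in for a plotting library. The sketch below is only an illustration: the function name `ascii_scatter`, the grid size, and the shared axis ranges are assumptions chosen for this example, and the data values are the standard quartet values.

```python
# Minimal text-mode scatter plot of Anscombe's quartet (illustrative sketch).
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x values shared by sets I-III
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ascii_scatter(xs, ys, width=40, height=12, x_range=(3, 20), y_range=(2, 14)):
    """Return a text grid with '*' at each (x, y) point.

    The axis ranges are hypothetical defaults wide enough for all four sets,
    so that the panels share the same scale and can be compared directly.
    """
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((y_hi - y) / (y_hi - y_lo) * (height - 1))  # row 0 is the top
        grid[row][col] = "*"
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


for name, (xs, ys) in QUARTET.items():
    print(f"Dataset {name}")
    print(ascii_scatter(xs, ys))
    print()
```

Even at this coarse resolution, the curve in set II and the isolated points in sets III and IV should be visible, although nearby points may share a grid cell.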
The Datasets
      I               II              III              IV
   x       y       x       y       x       y       x       y
10.0    8.04    10.0    9.14    10.0    7.46     8.0    6.58
 8.0    6.95     8.0    8.14     8.0    6.77     8.0    5.76
13.0    7.58    13.0    8.74    13.0   12.74     8.0    7.71
 9.0    8.81     9.0    8.77     9.0    7.11     8.0    8.84
11.0    8.33    11.0    9.26    11.0    7.81     8.0    8.47
14.0    9.96    14.0    8.10    14.0    8.84     8.0    7.04
 6.0    7.24     6.0    6.13     6.0    6.08     8.0    5.25
 4.0    4.26     4.0    3.10     4.0    5.39    19.0   12.50
12.0   10.84    12.0    9.13    12.0    8.15     8.0    5.56
 7.0    4.82     7.0    7.26     7.0    6.42     8.0    7.91
 5.0    5.68     5.0    4.74     5.0    5.73     8.0    6.89
A procedure to generate similar data sets with identical statistics and dissimilar graphics has since been developed. [7]
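The summary statistics quoted earlier can be checked directly against the four datasets. The following is a minimal sketch using only the standard library; the helper name `summarize` and the dictionary layout are assumptions made for this illustration, and the formulas are the usual sample variance, Pearson correlation, and least-squares line.

```python
from math import sqrt

# Anscombe's four datasets (eleven (x, y) points each).
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x values shared by sets I-III
QUARTET = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Compute the summary statistics listed in the properties table."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # least-squares slope of y on x
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),  # sample variance (n - 1 denominator)
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "corr": sxy / sqrt(sxx * syy),  # Pearson correlation coefficient
        "intercept": mean_y - slope * mean_x,
        "slope": slope,
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Printing the rounded values side by side makes the point of the quartet concrete: the numbers agree to two or three decimal places even though the underlying point patterns differ.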
The Representation
The first scatter plot (top left) appears to be a simple linear relationship, corresponding to two correlated variables that follow the assumption of normality.

The second graph (top right) is not distributed normally; while an obvious relationship between the two variables can be observed, it is not linear, and the Pearson correlation coefficient is not relevant (a more general regression and the corresponding coefficient of determination would be more appropriate).

In the third graph (bottom left), the relationship is linear, but the fitted regression line is pulled away from it by a single outlier, which exerts enough influence to alter the line and lower the correlation coefficient from 1 to 0.816 (a robust regression would have been called for).

Finally, the fourth graph (bottom right) shows an example in which one outlier is enough to produce a high correlation coefficient, even though the other ten points, which all share the same x value, indicate no relationship between the two variables.

The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets. [2][3][4][5][6]
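The influence of the single outlier in the third and fourth datasets can be illustrated by recomputing the correlation with that point left out. This is a sketch under stated assumptions: the helper names `pearson` and `without` are hypothetical, and the outlier positions (the point (13, 12.74) in set III and the point (19, 12.50) in set IV) were picked by inspecting the data rather than by any formal outlier test.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation, or None when either variable has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant variable
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def without(values, index):
    """Return a copy of the list with the element at `index` removed."""
    return values[:index] + values[index + 1:]


X3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
Y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

OUTLIER_3 = 2  # assumed outlier in set III: the point (13, 12.74)
OUTLIER_4 = 7  # assumed outlier in set IV: the point (19, 12.50)

r3_full = pearson(X3, Y3)
r3_trimmed = pearson(without(X3, OUTLIER_3), without(Y3, OUTLIER_3))
r4_full = pearson(X4, Y4)
r4_trimmed = pearson(without(X4, OUTLIER_4), without(Y4, OUTLIER_4))

print("III: all points", r3_full, "| outlier removed", r3_trimmed)
print("IV:  all points", r4_full, "| outlier removed", r4_trimmed)
```

With the outlier removed, set III becomes almost perfectly linear, while set IV is left with a constant x, so its correlation is no longer defined at all. Dropping a point by hand is used here only to expose its influence; it is not the robust regression the text recommends.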
References
[1] Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician 27 (1): 17–21. JSTOR 2682899.
[2] Elert, Glenn. "Linear Regression". The Physics Hypertextbook.
[3] Janert, Philipp K. (2010). Data Analysis with Open Source Tools. O'Reilly Media, Inc. pp. 65–66. ISBN 0-596-80235-8.
[4] Chatterjee, Samprit; Hadi, Ali S. (2006). Regression analysis by example. John Wiley and Sons. p. 91. ISBN 0-471-74696-7.
[5] Saville, David J.; Wood, Graham R. (1991). Statistical methods: the geometric approach. Springer. p. 418. ISBN 0-387-97517-9.
[6] Tufte, Edward R. (2001). The Visual Display of Quantitative Information (2nd ed.). Cheshire, CT: Graphics Press. ISBN 0-9613921-4-2.
[7] Chatterjee, Sangit; Firat, Aykut (2007). "Generating Data with Identical Statistics but Dissimilar Graphics: A Follow up to the Anscombe Dataset". American Statistician 61 (3): 248–254. doi:10.1198/000313007X220057.