Slide1
Exploratory Data Analysis
Slide2
John Tukey
Developed these procedures to help one get a first look at distributions of scores.
What is the shape of the distribution?
Are there any suspicious scores?
Stem and Leaf Display
Box and Whiskers Plot
Slide3
Stem and Leaf Display
See the pulse rate data at Exploratory Data Analysis (EDA).
The scores range from 48 to 104.
We probably want to group them into 5 to 15 intervals.
I’ll use two intervals for the 40’s, two for the 50’s, etc.
Slide4
The Stem
The stem consists of a column of the leading (aka “most significant”) digits, the leftmost digits in the scores.
I’ll add to the stem the leaves, the trailing (rightmost, least significant) digits of each score.
Slide5
The Stem With Leaves
Next, I’ll arrange the leaves (within each row) from lowest to highest and add a “depth” column.
Slide6
Leaves Arranged in Order
Slide7
The Depth Column
This column tells you how many scores there are in that row and all rows between it and the closer tail of the distribution.
The row that contains the median has the row frequency in parentheses.
Slide8
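Here is a short Python sketch of the display described above: two rows per stem, leaves in order, and a depth column with the median row's frequency in parentheses. The function name `stem_and_leaf` and the eleven scores are made up for illustration; they are not the pulse rate data, and the sketch assumes non-negative whole-number scores.

```python
from collections import defaultdict


def stem_and_leaf(scores):
    """Return rows of (depth, stem, leaves) for non-negative whole-number scores."""
    ordered = sorted(scores)
    n = len(ordered)
    rows = defaultdict(list)
    for score in ordered:
        # Row key: (tens digit, 0 for leaves 0-4 or 1 for leaves 5-9).
        rows[(score // 10, (score % 10) // 5)].append(score % 10)

    # List every row from the lowest to the highest, including empty ones.
    keys = []
    key, last = min(rows), max(rows)
    while key <= last:
        keys.append(key)
        key = (key[0], 1) if key[1] == 0 else (key[0] + 1, 0)

    display = []
    below = 0  # number of scores in rows already listed
    for key in keys:
        leaves = rows.get(key, [])
        from_low = below + len(leaves)  # this row plus all rows toward the low tail
        from_high = n - below           # this row plus all rows toward the high tail
        if from_low > n / 2 and from_high > n / 2:
            depth = f"({len(leaves)})"  # the row that contains the median
        else:
            depth = str(min(from_low, from_high))
        display.append((depth, str(key[0]), "".join(str(d) for d in leaves)))
        below = from_low
    return display


# Hypothetical scores, not the pulse rate data.
example = [48, 52, 56, 57, 61, 63, 64, 66, 70, 72, 85]
for depth, stem, leaves in stem_and_leaf(example):
    print(f"{depth:>4} {stem} | {leaves}")
```

Empty rows are kept so that gaps in the distribution stay visible, which is what makes an odd score stand out.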
Rotated Display
It looks like a histogram, but the bars are made up of the scores.
From this display, can you identify any scores that are odd, compared to most of the other scores?
Slide9
Box and Whisker Plot
Median Location = (N + 1)/2 = 97/2 = 48.5.
The median will be located between the 48th and the 49th scores from either tail.
Slide10
There are 40 scores from 48 up to 68. Count up 8 more scores, starting with the first 70. The 48th score is a 70 and the 49th score is a 70, so the median is 70.
Slide11
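The median-location rule can be sketched in a few lines of Python. The function names are mine, N = 96 is inferred from the 97/2 on the slide, and the four-score list is hypothetical, chosen so that the two middle scores are both 70.

```python
import math


def median_location(n):
    """Position of the median, counted from either tail of the ordered scores."""
    return (n + 1) / 2


def median_by_location(scores):
    ordered = sorted(scores)
    loc = median_location(len(ordered))
    # A location ending in .5 falls between two scores; average them.
    below = ordered[math.floor(loc) - 1]
    above = ordered[math.ceil(loc) - 1]
    return (below + above) / 2


print(median_location(96))                   # 48.5
print(median_by_location([60, 70, 70, 80]))  # 70.0
```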
The Hinge Location
Hinge location = (Median Location + 1)/2.
Drop any decimal on the median location.
For our data, hinge location = (48 + 1)/2 = 24.5.
Now, the upper hinge is the 24.5th score from the upper end of the distribution.
Slide12
There are 24 scores from 80 up to 104, so the 24th score from the highest is an 80. Go in toward the median one more score. The 25th score from the highest is a 78. The upper hinge is (80 + 78)/2 = 79.
Slide13
The 26th score from the lowest score is a 64. Move towards the lower tail by one score and you see the 25th score is also a 64. One more, the 24th score is also a 64. The lower hinge is 64.
Slide14
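The hinge rule can be sketched the same way: drop the decimal on the median location, add 1, halve, then count in that far from each tail. The helper names (`hinge_location`, `value_at_depth`, `hinges`) and the eight-score example are hypothetical, not part of the original materials.

```python
import math


def hinge_location(n):
    """(median location with any decimal dropped + 1) / 2."""
    median_loc = (n + 1) / 2
    return (math.floor(median_loc) + 1) / 2


def value_at_depth(ordered, depth):
    """Score at a 1-indexed depth; a depth ending in .5 averages two neighbours."""
    first = ordered[math.floor(depth) - 1]
    second = ordered[math.ceil(depth) - 1]
    return (first + second) / 2


def hinges(scores):
    ordered = sorted(scores)
    loc = hinge_location(len(ordered))
    lower = value_at_depth(ordered, loc)        # count in from the low tail
    upper = value_at_depth(ordered[::-1], loc)  # count in from the high tail
    return lower, upper


print(hinge_location(96))                # 24.5
print(hinges([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 6.5)
```

Hinges found this way can differ slightly from the quartiles that statistical software reports, because programs use several different quartile definitions.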
The H-Spread
The H-spread is the difference between the upper hinge and the lower hinge. For our data, 79 - 64 = 15.
This is the range of the middle 50% of the scores.
You also know this as the interquartile range.
Slide15
The Inner Fences
The upper inner fence = the upper hinge plus 1.5 H-spreads. For our data, 79 + 1.5(15) = 101.5.
The lower inner fence is the lower hinge minus 1.5 H-spreads, 64 - 1.5(15) = 41.5.
These are invisible fences; they are not plotted.
Slide16
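The H-spread and inner-fence arithmetic can be checked with a minimal Python sketch. The function names are mine; the hinges 64 and 79 are the ones found for the pulse rate data.

```python
def h_spread(lower_hinge, upper_hinge):
    """Distance between the hinges."""
    return upper_hinge - lower_hinge


def fences(lower_hinge, upper_hinge, multiplier):
    """Fences placed `multiplier` H-spreads beyond each hinge."""
    step = multiplier * h_spread(lower_hinge, upper_hinge)
    return lower_hinge - step, upper_hinge + step


print(h_spread(64, 79))     # 15
print(fences(64, 79, 1.5))  # (41.5, 101.5)
```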
Adjacent Values
These are scores that are outside of the middle 50% of the scores but within the inner fences.
For our data, these will be scores that fall between 79 and 101.5 or between 41.5 and 64.
Slide17
Outliers
These are scores that are beyond the inner fences.
For our data, these are scores that are less than 41.5 or greater than 101.5.
Slide18
Outer Fences
These invisible fences are 3 H-spreads beyond the hinges.
For our data the lower outer fence is at 64 - 3(15) = 19 and the upper outer fence is at 79 + 3(15) = 124.
Scores that are beyond the outer fences are called way-outliers.
Slide19
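Putting the two sets of fences together, a score can be labelled with a small Python sketch like the one below. The function name `classify` and its labels are hypothetical. The hinges 64 and 79 and the score 104 come from the pulse rate data, while 130 and 15 are invented test values. The outer fences are placed 3 H-spreads beyond each hinge, which gives 19 and 124.

```python
def classify(score, lower_hinge, upper_hinge):
    """Label a score by where it falls relative to the inner and outer fences."""
    spread = upper_hinge - lower_hinge
    inner_low, inner_high = lower_hinge - 1.5 * spread, upper_hinge + 1.5 * spread
    outer_low, outer_high = lower_hinge - 3 * spread, upper_hinge + 3 * spread
    if score < outer_low or score > outer_high:
        return "way-outlier"
    if score < inner_low or score > inner_high:
        return "outlier"
    return "within inner fences"


print(classify(104, 64, 79))  # outlier: beyond 101.5 but not beyond 124
print(classify(130, 64, 79))  # way-outlier: beyond 124
```

A score that sits exactly on a fence is not beyond it, so the comparisons are strict.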
Drawing the Plot
Prepare a numerical scale.
Draw a box that extends from the lower hinge to the upper hinge.
Draw a line through the box at the median.
May also insert a symbol at the mean.
Draw whiskers out to the most extreme adjacent values on each side.
Slide20
Whiskers
For our data, the lowest adjacent value is the 48, so we draw the whisker on the lower end out to 48.
We do not go all the way out to the inner fence unless there is a score there.
The highest adjacent value is a 99, so we draw the whisker on the upper end out to 99.
Slide21
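The whisker rule amounts to finding the most extreme scores that are still inside the inner fences. A minimal Python sketch, assuming a hypothetical helper `whisker_ends` and a six-score stand-in for the full data set (48, 99, and 104 are real scores from the pulse rate data; the rest are placeholders):

```python
def whisker_ends(scores, lower_hinge, upper_hinge):
    """Most extreme scores on each side that are not beyond the inner fences."""
    spread = upper_hinge - lower_hinge
    low_fence = lower_hinge - 1.5 * spread
    high_fence = upper_hinge + 1.5 * spread
    inside = [s for s in scores if low_fence <= s <= high_fence]
    return min(inside), max(inside)


# Abbreviated, partly hypothetical stand-in for the pulse rate scores.
print(whisker_ends([48, 64, 70, 79, 99, 104], 64, 79))  # (48, 99)
```

The 104 is left out because it lies beyond the upper inner fence at 101.5, so the whisker stops at 99.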
Outliers
Every outlier is plotted with a special symbol, often an O for a regular outlier and an * for a way-outlier.
Some programs will also print the identification number next to every outlier.
These days, we use statistical software to make these displays and plots rather than doing them by hand.
Slide22
Plots Produced by SAS
Slide23
How tall, in inches, is your ideal mate?
Slide24
Eight Foot Tall Mate!
That is a WAY-OUTLIER for sure!
Investigation of the original data sheets revealed that the actual response was 69 inches, not 96 inches.
Slide25
Exploratory Data Analysis (EDA)
It is highly recommended that you read the document linked above.
It includes additional examples and a bit of silliness that might help you remember key concepts.
Do watch the video clip of the Id attempting to cross an outer fence on the Forbidden Planet.