Exploratory Data Analysis - PowerPoint Presentation

liane-varnes
Uploaded On 2016-05-05


Presentation Transcript

Slide1

Exploratory Data Analysis

Slide2

John Tukey

Developed these procedures to help one get a first look at distributions of scores.

What is the shape of the distribution?

Are there any suspicious scores?

Stem and Leaf Display

Box and Whiskers Plot

Slide3

Stem and Leaf Display

See the pulse rate data at Exploratory Data Analysis (EDA).

The scores range from 48 to 104.

We probably want to group them into 5 to 15 intervals.

I’ll use two intervals for the 40’s, two for the 50’s, etc.

Slide4

The Stem

Consists of a column of leading (aka “most significant”) digits, the leftmost digits in the scores.

I’ll add to the stem the leaves, the trailing (rightmost, least significant) digits of each score.

Slide5

The Stem With Leaves

Next, I’ll arrange the leaves (within each row) from lowest to highest and add a “depth” column.

Slide6

Leaves Arranged in Order

Slide7
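The stem, leaf, and ordering steps described above can be sketched in a few lines of Python. This is only an illustration: the function name `stem_and_leaf` and the `sample` scores are made up for the example (they are not the pulse rate data), and the sketch assumes whole-number scores with two rows per stem, leaves 0-4 on the first row and 5-9 on the second.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Return rows of (stem, ordered leaves), two rows per stem.

    Assumes non-negative whole-number scores. Leaves 0-4 go on the
    first row for a stem and leaves 5-9 on the second row.
    """
    rows = defaultdict(list)
    for score in sorted(scores):          # sorting keeps leaves in order
        stem, leaf = divmod(score, 10)    # leading digits, trailing digit
        half = 0 if leaf < 5 else 1
        rows[(stem, half)].append(leaf)

    first, last = min(rows), max(rows)
    display = []
    stem, half = first
    while (stem, half) <= last:           # keep empty rows so gaps show
        display.append((stem, rows.get((stem, half), [])))
        stem, half = (stem, 1) if half == 0 else (stem + 1, 0)
    return display

# Hypothetical scores, NOT the pulse rate data from the slides.
sample = [48, 52, 56, 57, 61, 64, 64, 68, 70, 70, 73, 78, 80, 85, 104]

for stem, leaves in stem_and_leaf(sample):
    print(f"{stem:>3} | {''.join(str(leaf) for leaf in leaves)}")
```

Empty rows are kept on purpose: a gap between the bulk of the scores and a stray high score is exactly what the display is meant to reveal.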

The Depth Column

This column tells you how many scores there are in that row and all rows between it and the closer tail of the distribution.

The row that contains the median has the row frequency in parentheses.

Slide8
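As a rough sketch of how the depth column could be computed from the row frequencies: the function name `depth_column` and the row counts below are hypothetical. One assumption goes beyond the slides: when the median falls between two rows, no row gets parentheses and both rows show ordinary depths.

```python
import math

def depth_column(row_counts):
    """Return depth labels for rows listed from the lowest stem up.

    Each depth counts the scores in that row plus all rows between it
    and the closer tail. The row holding the median shows its own
    frequency in parentheses instead.
    """
    n = sum(row_counts)
    median_location = (n + 1) / 2
    first_pos = math.floor(median_location)   # positions the median uses
    last_pos = math.ceil(median_location)

    depths = []
    below = 0                                  # scores in lower rows
    for count in row_counts:
        through = below + count                # scores up through this row
        if count and below + 1 <= first_pos and last_pos <= through:
            depths.append(f"({count})")
        else:
            # count from the low tail vs. count from the high tail
            depths.append(str(min(through, n - below)))
        below = through
    return depths

# Hypothetical row frequencies, not taken from the slides.
print(depth_column([2, 3, 5, 3, 2]))  # ['2', '5', '(5)', '5', '2']
```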

Rotated Display

It looks like a histogram, but the bars are made up of the scores.

From this display, can you identify any scores that are odd, compared to most of the other scores?

Slide9

Box and Whisker Plot

Median Location = (N + 1)/2 = 97/2 = 48.5.

The median will be located between the 48th and the 49th scores from either tail.

Slide10

There are 40 scores from 48 to 68. Count up 8 more scores, starting with the first 70. The 48th score is a 70, the 49th score is a 70, so the median is 70.

Slide11
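The median-location rule can be written out as a short Python sketch. The function name `median_by_location` and the six sample scores are made up for illustration. With N = 96 the same rule gives a location of 48.5, so the median is the average of the 48th and 49th ordered scores.

```python
import math

def median_by_location(scores):
    """Return (median location, median) using the (N + 1)/2 rule."""
    ordered = sorted(scores)
    location = (len(ordered) + 1) / 2
    # A location ending in .5 means: average the two neighboring scores.
    below = ordered[math.floor(location) - 1]   # positions are 1-based
    above = ordered[math.ceil(location) - 1]
    return location, (below + above) / 2

# Hypothetical scores, not the pulse rate data.
sample = [48, 64, 70, 70, 78, 104]
location, median = median_by_location(sample)
print(location, median)  # 3.5 70.0
```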

The Hinge Location = (Median Location + 1)/2.

Drop any decimal on the median location before using it here.

For our data, hinge location = (48 + 1)/2 = 24.5.

Now, the upper hinge is the 24.5th score from the upper end of the distribution, that is, midway between the 24th and 25th scores.

Slide12

There are 24 scores from 80 up to 104, so the 24th score from the highest is an 80. Go in toward the median one more score. The 25th score from the highest is a 78. The upper hinge is (78 + 80)/2 = 79.

Slide13

The 26th score from the lowest score is a 64. Move towards the lower tail by one score and you see the 25th score is also a 64. One more, the 24th score is also a 64. The lower hinge is 64.

Slide14
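Here is a minimal Python sketch of the hinge procedure: drop the decimal on the median location, compute the hinge location, then count in that far from each tail. The function name `hinges` and the eight sample scores are hypothetical. For N = 96 the hinge location comes out as (48 + 1)/2 = 24.5, as in the slides.

```python
import math

def hinges(scores):
    """Return (hinge location, lower hinge, upper hinge)."""
    ordered = sorted(scores)
    n = len(ordered)
    median_location = (n + 1) / 2
    # Drop any decimal on the median location before halving.
    hinge_location = (math.floor(median_location) + 1) / 2

    # 0-based offsets of the one or two scores that define a hinge.
    near = math.floor(hinge_location) - 1
    far = math.ceil(hinge_location) - 1

    lower_hinge = (ordered[near] + ordered[far]) / 2          # from low tail
    upper_hinge = (ordered[n - 1 - near] + ordered[n - 1 - far]) / 2
    return hinge_location, lower_hinge, upper_hinge

# Hypothetical scores, not the pulse rate data.
sample = [48, 60, 64, 70, 70, 78, 80, 104]
print(hinges(sample))  # (2.5, 62.0, 79.0)
```

Note that hinges found this way can differ slightly from the quartiles that statistical packages report, since packages use a variety of quartile definitions.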

The H-Spread = the difference between the upper hinge and the lower hinge. For our data, 79 - 64 = 15.

This is the range of the middle 50% of the scores.

You also know this as the interquartile range.

Slide15

The Inner Fences

The upper inner fence = the upper hinge plus 1.5 H-spreads. For our data, 79 + 1.5(15) = 101.5.

The lower inner fence is the lower hinge minus 1.5 H-spreads, 64 - 1.5(15) = 41.5.

These are invisible fences; they are not plotted.

Slide16
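The H-spread and inner fence arithmetic is easy to check in code. The sketch below uses the hinge values from the slides (64 and 79). The function name `inner_fences` and its `k` parameter are illustrative choices, not part of the original material.

```python
def inner_fences(lower_hinge, upper_hinge, k=1.5):
    """Return (lower inner fence, upper inner fence).

    k is the number of H-spreads beyond each hinge (1.5 for inner fences).
    """
    h_spread = upper_hinge - lower_hinge
    return lower_hinge - k * h_spread, upper_hinge + k * h_spread

# Hinges from the slides: lower hinge 64, upper hinge 79, H-spread 15.
print(inner_fences(64, 79))  # (41.5, 101.5)
```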

Adjacent Values

These are scores that are outside of the middle 50% of the scores but within the inner fences.

For our data, these will be scores that fall between 79 and 101.5 or between 41.5 and 64.

Slide17

Outliers

These are scores that are beyond the inner fences.

For our data, these are scores that are less than 41.5 or greater than 101.5.

Slide18

Outer Fences

These invisible fences are 3 H-spreads beyond the hinges.

For our data, the lower outer fence is at 64 - 3(15) = 19 and the upper outer fence is at 79 + 3(15) = 124.

Scores that are beyond the outer fences are called way-outliers.

Slide19
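Putting the two sets of fences together, a score can be sorted into one of three groups. In this sketch each outer fence is measured from its own hinge, so with hinges of 64 and 79 the outer fences sit at 64 - 3(15) = 19 and 79 + 3(15) = 124. The function name `classify` and the label strings are made up for the example.

```python
def classify(score, lower_hinge, upper_hinge):
    """Label a score relative to the inner and outer fences."""
    h_spread = upper_hinge - lower_hinge
    inner_low = lower_hinge - 1.5 * h_spread
    inner_high = upper_hinge + 1.5 * h_spread
    outer_low = lower_hinge - 3 * h_spread
    outer_high = upper_hinge + 3 * h_spread

    if score < outer_low or score > outer_high:
        return "way-outlier"
    if score < inner_low or score > inner_high:
        return "outlier"
    return "within inner fences"

# Hinges from the slides; 48, 99, and 104 are scores the slides mention,
# while 130 is a hypothetical score added to show a way-outlier.
for score in (48, 99, 104, 130):
    print(score, classify(score, 64, 79))
```

With these hinges, 104 lands beyond the upper inner fence (101.5) but inside the upper outer fence (124), so it is a regular outlier rather than a way-outlier.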

Drawing the Plot

Prepare a numerical scale.

Draw a box that extends from the lower hinge to the upper hinge.

Draw a line through the box at the median.

You may also insert a symbol at the mean.

Draw whiskers out to the most extreme adjacent values on each side.

Slide20

Whiskers

For our data, the lowest adjacent value is the 48, so we draw the whisker on the lower end out to 48.

We do not go all the way out to the inner fence unless there is a score there.

The highest adjacent value is a 99, so we draw the whisker on the upper end out to 99.

Slide21
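The whisker rule amounts to finding the most extreme scores that still fall inside the inner fences. A minimal sketch, assuming the inner fences of 41.5 and 101.5 from the slides: the function name `whisker_ends` is hypothetical, and the list below is only a handful of scores mentioned in the slides, not the full data set.

```python
def whisker_ends(scores, lower_fence, upper_fence):
    """Return the lowest and highest scores inside the inner fences."""
    inside = [s for s in scores if lower_fence <= s <= upper_fence]
    return min(inside), max(inside)

# A few scores mentioned in the slides, NOT the full pulse rate data.
some_scores = [48, 64, 70, 78, 80, 99, 104]
print(whisker_ends(some_scores, 41.5, 101.5))  # (48, 99)
```

The 104 is ignored when placing the upper whisker because it lies beyond the inner fence; it would be plotted separately as an outlier.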

Outliers

Every outlier is plotted with a special symbol, often an O for a regular outlier and an * for a way-outlier.

Some programs will also print the identification number next to every outlier.

These days, we use statistical software to make these displays and plots rather than doing them by hand.

Slide22

Plots Produced by SAS

Slide23

How tall, in inches, is your ideal mate?

Slide24

Eight Foot Tall Mate!

That is a WAY-OUTLIER for sure!

Investigation of the original data sheets revealed that the actual response was 69 inches, not 96 inches.

Slide25

Exploratory Data Analysis (EDA)

It is highly recommended that you read the document linked above.

It includes additional examples and a bit of silliness that might help you remember key concepts.

Do watch the video clip of the Id attempting to cross an outer fence on the Forbidden Planet.