Slide1
Feature Engineering Studio
September 23, 2013
Slide2
Welcome to Mucking Around Day
Slide3
Sort into pairs
Partner with the person next to you
One group of 3 is allowed
Slide4
Sort into pairs
Do we have a group of 3?
One of the 3 will work with me
Slide5
Sort into pairs
Go over your reports together
A maximum of 5 minutes apiece
Slide6
5 minutes for first person
Slide7
5 minutes for second person
Slide8
Re-assemble into one big group
Slide9
Who here found something really cool while mucking around?
Show us, tell us
Slide10
Who here found a histogram with a normal distribution?
Show us, tell us
Slide11
Who here found a histogram with a hypermode?
Show us, tell us
Slide12
Who here found a histogram with a flat distribution?
Show us, tell us
Slide13
Who here found a histogram with a skewed distribution?
Show us, tell us
Slide14
Who here found a histogram with a bimodal distribution?
Show us, tell us
Slide15
Who here found a histogram with something else interesting?
Show us, tell us
Slide16
Who here found something surprising with their min, max, average, stdev?
Slide17
Categorical variables
Who here found something curious, weird, or interesting in the distribution of their categorical variables?
Slide18
Who here hasn’t spoken yet?
(and analyzed data)
Tell us something interesting you found in your data
Slide19
Who here played with pivot tables?
What did you learn?
Slide20
My turn to play with pivot tables
Who wants to volunteer their data?
(I might request a 2nd or 3rd data set, depending on how the 1st one goes)
Slide21
Who here played with vlookup?
What did you learn?
Slide22
My turn to play with vlookup
Using the same volunteered data set(s)
Slide23
Other cool things you can create with a few simple formulas (plus demos!)
Slide24
Identifying specific cases of interest
Slide25
Did event of interest ever occur for student?
Slide26
Counts-so-far
(and total value for student)
Slide27
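A minimal sketch of the counts-so-far feature (with the per-student total), written in Python rather than the Excel formulas demoed in class. The log layout, the `log` data, and the function name `counts_so_far` are assumptions for illustration only; the choice to exclude the current row from its own count is also an assumption.

```python
from collections import defaultdict

# Hypothetical log rows: (student, action), already sorted by time.
log = [
    ("s1", "help"),
    ("s1", "attempt"),
    ("s1", "help"),
    ("s2", "help"),
    ("s1", "help"),
]

def counts_so_far(rows, event):
    """Return (per-row counts of earlier `event`s for the same student, per-student totals)."""
    seen = defaultdict(int)
    out = []
    for student, action in rows:
        out.append(seen[student])  # count of the event BEFORE this row (assumed convention)
        if action == event:
            seen[student] += 1
    return out, dict(seen)

so_far, totals = counts_so_far(log, "help")
print(so_far)   # [0, 1, 1, 0, 2]
print(totals)   # {'s1': 3, 's2': 1}
```

Including the current row in its own count is an equally reasonable convention; what matters is applying one convention consistently.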
Counts-last-N-actions
Slide28
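The counts-last-N-actions idea can be sketched in Python with a sliding window per student. This is an illustration under assumptions, not the in-class Excel demo: the row layout, the sample data, and the name `counts_last_n` are hypothetical, and the window is assumed to cover the N actions before the current one.

```python
from collections import defaultdict, deque

def counts_last_n(rows, event, n):
    """For each row, count `event` among the same student's previous n actions."""
    recent = defaultdict(lambda: deque(maxlen=n))  # one fixed-size window per student
    out = []
    for student, action in rows:
        window = recent[student]
        out.append(sum(1 for a in window if a == event))
        window.append(action)  # the oldest action drops out automatically
    return out

# Hypothetical time-ordered actions for one student.
actions = [("s1", a) for a in ["help", "help", "attempt", "attempt", "help"]]
last2 = counts_last_n(actions, "help", 2)
print(last2)  # [0, 1, 2, 1, 0]
```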
First attempts
Slide29
Ratios between events of interest
Slide30
How many students had 3 (or 4, 5, 2,…) of an event
Slide31
Times-so-far
Slide32
Cutoff-based features
Slide33
Unitized actions (such as unitized time)
Slide34
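One common reading of "unitized time" is time expressed in standard deviations above or below the mean for the same step across students. The slides do not define the term, so that reading is an assumption, as are the row layout, the sample data, and the name `unitize`. A minimal Python sketch:

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical rows: (student, step, seconds spent).
rows = [
    ("s1", "step1", 10.0),
    ("s2", "step1", 20.0),
    ("s3", "step1", 30.0),
    ("s1", "step2", 5.0),
    ("s2", "step2", 5.0),
]

def unitize(rows):
    """Return each row's time as (time - step mean) / step SD; 0.0 when the SD is zero."""
    by_step = defaultdict(list)
    for _, step, secs in rows:
        by_step[step].append(secs)
    stats = {step: (mean(v), pstdev(v)) for step, v in by_step.items()}
    out = []
    for _, step, secs in rows:
        mu, sd = stats[step]
        out.append(0.0 if sd == 0 else (secs - mu) / sd)
    return out

z = unitize(rows)
```

Here the population standard deviation (`pstdev`) is used; the sample standard deviation would be an equally defensible choice.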
Last 3 or 5 unitized
Slide35
Comparing earlier behaviors to later behaviors through caching
Slide36
Counts-if
Slide37
Percentages of action type
Slide38
Percentages of time spent per action/location/KC/etc.
Slide39
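A percentage-of-time feature can be sketched as each student's time per action type divided by that student's total time. The row layout, the sample data, and the name `pct_time_by_action` are hypothetical; grouping by location or KC instead of action type would follow the same pattern.

```python
from collections import defaultdict

def pct_time_by_action(rows):
    """rows: (student, action, seconds). Returns {student: {action: share of total time}}."""
    totals = defaultdict(float)
    per_action = defaultdict(lambda: defaultdict(float))
    for student, action, secs in rows:
        totals[student] += secs
        per_action[student][action] += secs
    return {
        s: {a: t / totals[s] for a, t in acts.items()}
        for s, acts in per_action.items()
        if totals[s] > 0  # skip students with no recorded time
    }

# Hypothetical sample data.
shares = pct_time_by_action([
    ("s1", "help", 30.0),
    ("s1", "attempt", 60.0),
    ("s1", "help", 10.0),
    ("s2", "attempt", 20.0),
])
```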
Questions? Comments?
Slide40
Other cool ideas?
Slide41
Assignment 3
Feature Engineering 1
“Bring Me a Rock”
Get your data set
Open it in Excel
Create as many features as you feel inspired to create
Features should be created with the goal of predicting your ground truth variable
At least 12 separate features that are not just variations on a theme (e.g. “time for last 3 actions” and “time for last 4 actions” are variations on a theme, but “time for last 3 actions” and “total time between help requests and next action” are two separate features)
For each feature, write a 1-3 sentence “just so story” for why it might work
Test how good each feature is
Slide42
Testing Feature Goodness
For this assignment, there are a bunch of ways to test feature goodness
Single-feature prediction models in data mining or stats package, giving correlation or kappa (special session this Wednesday)
Compute correlation in Excel (want to see?)
You can do this with binary variables too, although it’s not really optimal
Compute t-test in Excel (want to see?)
Compute kappa in Excel (if you don’t know how, easier to do in RapidMiner)
Slide43
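For readers who prefer to check a feature outside Excel or RapidMiner, two of the goodness measures named above (correlation and kappa) can be sketched in plain Python. The function names, the sample data, and the cutoff of 3 used to turn the feature into a prediction are all assumptions for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def cohens_kappa(pred, truth):
    """Cohen's kappa: agreement between two label lists, corrected for chance agreement."""
    n = len(pred)
    labels = set(pred) | set(truth)
    observed = sum(p == t for p, t in zip(pred, truth)) / n
    expected = sum((pred.count(l) / n) * (truth.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical feature values and binary ground truth.
feature = [1, 2, 3, 4, 5]
truth = [0, 0, 1, 1, 1]
r = pearson(feature, truth)

# A simple cutoff-based prediction from the feature (the cutoff 3 is arbitrary).
pred = [1 if x > 3 else 0 for x in feature]
k = cohens_kappa(pred, truth)
```

With this toy data, `r` comes out near 0.87 and `k` near 0.62.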
Were you right?
Which of your “just so stories” seem to be correct?
Did any of your features correlate in the opposite direction from what you expected?
Slide44
Assignment 3
Write a brief report for me
Email me an Excel sheet with your features
You don’t need to prepare a presentation
But be ready to discuss your features in class
Slide45
Next Classes
9/25 Special Session
Using RapidMiner to Produce Prediction Models
Come to this if you’ve never built a classifier or regressor in RapidMiner (or a similar tool)
Statistical significance tests using linear regression don’t count…
9/30 Advanced Feature Distillation in Excel
Assignment 3 due
Online Equation Solver Tutorials should be in your INBOX
Slide46
Upcoming Classes
10/2 Special session on prediction models
Come to this if you don’t know why student-level cross-validation is important, or if you don’t know what J48 is
10/7 Advanced Feature Distillation in Google Refine
10/9 Special session? TBD.