Slide1
2/25/13 - Union University
ADVENTURES IN DATA MINING
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
mhd@lyle.smu.edu
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841 and NIH Grant No. 1R21HG005912-01A1.
Some slides used by permission from Dr. Eamonn Keogh, University of California, Riverside; eamonn@cs.ucr.edu
ACM Distinguished Speakers Program
Slide2
The 2000 ozone hole over the Antarctic, seen by EPTOMS
http://jwocky.gsfc.nasa.gov/multi/multi.html#hole
Slide3
Data Mining Outline
Introduction
Techniques
Classification
Clustering
Association Rules
Examples
Explore some interesting data mining applications
Slide4
Introduction
Data is growing at a phenomenal rate
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Slide5
But it isn’t Magic
You must know what you are looking for
You must know how to look for it
Suppose you knew that a specific cave had gold:
What would you look for?
How would you look for it?
Might need an expert miner
Slide6
CLASSIFICATION
Assign data into predefined groups or classes.
Slide7
“If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
Description, Behavior, Associations
Classification, Clustering, Link Analysis
(Profiling), (Similarity)
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”
Slide8
Classification Ex: Grading
Decision tree on the grade x:
x >= 90: A
x < 90:
  x >= 80: B
  x < 80:
    x >= 70: C
    x < 70:
      x >= 60: D
      x < 60: F
Slide9
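The grading example above is a small decision tree, and one way to read it is as a chain of threshold tests. The sketch below is only an illustration: the function name `letter_grade` is hypothetical, and it assumes the conventional cut-offs of 90, 80, 70 and 60, with everything below 60 graded F.

```python
def letter_grade(x: float) -> str:
    """Walk the tree from the root, testing the highest threshold first."""
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:  # assumed final split; scores below it fall through to F
        return "D"
    return "F"


for score in (95, 83, 71, 60, 42):
    print(score, letter_grade(score))
```

Each test only runs when all earlier tests failed, which is why the lower bound of every interval is enough to define the class.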
Grasshoppers
Katydids
Given a collection of annotated data (in this case five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
(c) Eamonn Keogh, eamonn@cs.ucr.edu
Slide10
Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
11          5.1              7.0               ???????
The classification problem can now be expressed as: given a training database, predict the class label of a previously unseen instance.
Previously unseen instance = insect 11 (abdomen length 5.1, antennae length 7.0, class ???????)
(c) Eamonn Keogh, eamonn@cs.ucr.edu
Slide11
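The slides do not say which classifier is applied to the insect table, so the following is only a sketch of one possible approach: a k-nearest-neighbour vote using Euclidean distance on the two length features. The names `TRAINING` and `classify` and the choice `k=3` are assumptions made for illustration; the ten training rows are copied from the table.

```python
import math
from collections import Counter

# (abdomen length, antennae length, class) for insects 1-10 from the table
TRAINING = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def classify(abdomen: float, antennae: float, k: int = 3) -> str:
    """Return the majority class among the k closest training insects."""
    by_distance = sorted(
        TRAINING,
        key=lambda row: math.dist((abdomen, antennae), row[:2]),
    )
    votes = Counter(label for _, _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Insect 11 is the previously unseen instance
print(classify(5.1, 7.0))
```

Under these assumptions the three closest training insects to insect 11 are numbers 7, 5 and 1, so the vote comes out as Katydid. A different distance measure or value of k could of course give a different answer on other inputs.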
[Figure: scatter plot of Antenna Length against Abdomen Length, both axes running from 1 to 10, with Grasshoppers and Katydids marked as separate groups]
(c) Eamonn Keogh, eamonn@cs.ucr.edu
Slide12
How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facial-recognition1.htm
Slide13
Facial Recognition
(c) Eamonn Keogh, eamonn@cs.ucr.edu
Slide14
Handwriting Recognition
George Washington Manuscript
[Figure: manuscript sample with an accompanying plot; axis ticks run from 0 to 450 and from 0 to 1]
(c) Eamonn Keogh, eamonn@cs.ucr.edu
Slide15
Rare Event Detection
Slide16
Slide17
Dallas Morning News, October 7, 2005
Slide18
© Prentice Hall
Classification Performance
True Positive
True Negative
False Positive
False Negative
Slide19
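The four outcomes listed under Classification Performance can be counted by comparing predicted labels with actual labels. The sketch below is a minimal illustration: the function name `confusion_counts` and the tiny label lists are hypothetical and invented for this example only.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one class of interest."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct only if the actual label is positive too
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual label was positive
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical labels, invented purely for illustration
actual = ["fraud", "ok", "ok", "fraud", "ok", "ok"]
predicted = ["fraud", "ok", "fraud", "ok", "ok", "ok"]
print(confusion_counts(actual, predicted, positive="fraud"))
```

For rare events such as fraud, the false negative and false positive counts usually matter more than overall accuracy, because the negative class dominates the data.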
Behavior Based Classification/Prediction
Credit Card Fraud Detection
Credit Score
Home Mortgage Approval
Slide20
CLUSTERING
Partition data into previously undefined groups.
Slide21
http://149.170.199.144/multivar/ca.htm
Slide22
What is Similarity?
(c) Eamonn Keogh, eamonn@cs.ucr.edu
Slide23
Two Types of Clustering
Hierarchical
Partitional
(c) Eamonn Keogh, eamonn@cs.ucr.edu
Slide24
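As a hedged illustration of the partitional type, the sketch below uses k-means, a common partitional algorithm that the slides themselves do not name. The function `k_means`, the naive choice of the first k points as starting centroids, and the six sample points are all assumptions made for this example.

```python
import math


def k_means(points, k, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return clusters


# Hypothetical 2-D points forming two visibly separate groups
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
for cluster in k_means(points, k=2):
    print(cluster)
```

Unlike classification, no class labels are supplied: the groups emerge from the distances alone, and the number of groups k has to be chosen in advance.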
Hierarchical Clustering Example
Iris Data Set
Setosa
Versicolor
Virginica
The data originally appeared in Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics 7, 179-188.
Hierarchical Clustering Explorer Version 3.0, Human-Computer Interaction Lab, University of Maryland, http://www.cs.umd.edu/hcil/multi-cluster.
Slide25
ASSOCIATION RULES / LINK ANALYSIS
Find relationships between data
Slide26
ASSOCIATION RULES EXAMPLES
People who buy diapers also buy beer
If gene A is highly expressed in this disease then gene B is also expressed
Relationships between people
Book Stores
Department Stores
Advertising
Product Placement
http://www.amazon.com/Data-Mining-Introductory-Advanced-Topics/dp/0130888923/ref=sr_1_1?ie=UTF8&s=books&qid=1235564485&sr=1-1
Slide27
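A rule such as "people who buy diapers also buy beer" is usually judged by its support and confidence. The sketch below computes both for a handful of market baskets; the function names and the five transactions are hypothetical and invented only to make the arithmetic visible.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent alone."""
    both = set(antecedent) | set(consequent)
    # Raises ZeroDivisionError if the antecedent never occurs in the data
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]
print(support(transactions, {"diapers", "beer"}))
print(confidence(transactions, {"diapers"}, {"beer"}))
```

In this toy data two of five baskets contain both items (support 0.4), and two of the three diaper baskets also contain beer (confidence about 0.67).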
Data Mining Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.
Slide28
Data Mining Outline
Introduction
Techniques
Examples
Vision Mining
Law Enforcement (Cheating, Plagiarism, Fraud, Criminal Behavior, …)
Bioinformatics
Slide29
Vision Mining
License Plate Recognition
Red Light Cameras
Toll Booths
http://www.licenseplaterecognition.com/
Computer Vision
http://www.eecs.berkeley.edu/Research/Projects/CS/vision/shape/vid/
Slide30
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
Slide31
No/Little Cheating
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
Slide32
Rampant Cheating
Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.
Slide33
Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495, 2005, p. 287.
Slide34
ArnetMiner
http://arnetminer.org/
Slide35
DNA
Basic building blocks of organisms
Located in nucleus of cells
Composed of 4 nucleotides
Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63
Slide36
Central Dogma: DNA -> RNA -> Protein
DNA --transcription--> RNA --translation--> Protein
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCA ACU AUUGAUGAA
Protein: Amino Acid
www.bioalgorithms.info; chapter 6; Gene Prediction
Slide37
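The two steps of the central dogma can be sketched on the DNA string from the slide. This is a simplified illustration: it treats the given strand as the coding strand, so transcription reduces to replacing T with U, and the names `transcribe`, `translate` and `CODON_TO_AMINO_ACID` are hypothetical. The lookup table holds only the seven codons that occur in this example, using standard genetic-code assignments.

```python
# Only the codons that occur in the slide's example; a full table has 64 entries.
CODON_TO_AMINO_ACID = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamic acid
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartic acid
}


def transcribe(dna: str) -> str:
    """DNA -> RNA: same sequence with thymine (T) replaced by uracil (U)."""
    return dna.replace("T", "U")


def translate(rna: str) -> str:
    """RNA -> protein: read non-overlapping codons of three nucleotides."""
    usable = len(rna) - len(rna) % 3  # ignore an incomplete trailing codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TO_AMINO_ACID[c] for c in codons)


rna = transcribe("CCTGAGCCAACTATTGATGAA")
print(rna)
print(translate(rna))
```

Read in one-letter amino acid codes, the slide's sequence spells PEPTIDE, with the codon ACU supplying the T.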
Human Genome
Scientists originally thought there would be about 100,000 genes
Appear to be about 20,000
WHY?
Almost identical to that of Chimps. What makes the difference?
Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)
Slide38
RNAi – Nobel Prize in Medicine 2006
Double stranded RNA
Short Interfering RNA (~20-25 nt)
RNA-Induced Silencing Complex
Binds to mRNA
Cuts RNA
siRNA may be artificially added to cell!
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3
Slide39
miRNA
Short (20-25 nt) sequence of noncoding RNA
Known since 1993 but significance not widely appreciated until 2001
Impact / Prevent translation of mRNA
Generally reduce protein levels without impacting mRNA levels (animal cells)
Functions
Causes some cancers
Guide embryo development
Regulate cell Differentiation
Associated with HIV
…
Slide40
TCGR – Mature miRNA
(Window=5; Pattern=3)
All Mature
Mus Musculus
Homo Sapiens
C Elegans
ACG
CGC
GCG
UCG
Slide41
TCGRs for Xue Training Data
POSITIVE
NEGATIVE
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
Slide42
Affymetrix GeneChip® Array
http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx
Slide43
BIG BROTHER ?
Total Information Awareness
http://en.wikipedia.org/wiki/Information_Awareness_Office
Terror Watch List
http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
http://blog.wired.com/27bstroke6/2008/02/us-terror-watch.html
CAPPS
http://en.wikipedia.org/wiki/CAPPS
Slide44
http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236
Slide45
Slide46
My DM Toolbelt
C, C++
Perl, Ruby
Weka
R, SAS
Excel, XLMiner
Vi, word, …
Grep, sed, …
Slide47
Thanks!