Unsupervised Learning

Uploaded On 2019-11-23

Presentation Transcript

Unsupervised Learning
DSCI 415
Brant Deppa, Ph.D.
Professor of Statistics & Data Science
Winona State University
bdeppa@winona.edu

The Entire Course in One Day !?!?

Course Topics
- Introduction to Unsupervised Learning
- Review of basic and advanced graphics in R
- Measuring distance
- Multidimensional scaling (MDS)
- Dimension Reduction Methods (PCA & ICA)
- Cluster analysis and “cluster interpretation”
- Correspondence analysis
- Recommender Systems
- Text mining and sentiment analysis (slide callouts: VERY IMPORTANT!!!, “Baby NLP”)

Unsupervised Learning vs. Supervised Learning
Supervised Learning
- Response of interest
- Set of predictors
- Build models using terms based on the predictors to predict the response
Unsupervised Learning
- No response of interest
- Set of observations on variables
- Uncover structure in these data to gain insights
- Unstructured data – i.e. “”

Measuring Distance
- Distance between observations
  - numeric variables, scaling, metrics
  - non-numeric variables (ordinal/nominal)
  - mixture of variable types
- Dissimilarity/Similarity between variables
- Distance/Dissimilarity Matrices
  - MDS, PCA, Clustering, Recommender Systems, Correspondence Analysis
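The effect of scaling and the handling of mixed variable types can be sketched in a few lines of Python (the course itself works in R). The data values, the `gower` helper, and its arguments are illustrative assumptions, not material from the slides; the mixed-type measure is a simplified Gower-style dissimilarity.

```python
import math
import statistics

def standardize(rows):
    """Rescale each numeric column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def distance_matrix(rows, metric):
    return [[metric(a, b) for b in rows] for a in rows]

def gower(a, b, kinds, ranges):
    """Simplified Gower-style dissimilarity for mixed variable types.

    Numeric variables contribute |x - y| / range; nominal variables
    contribute 0 if the levels match and 1 otherwise. The result is the
    average over variables, so it lies in [0, 1].
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "numeric":
            total += abs(x - y) / rng
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)

# Hypothetical data: height (cm) and income (dollars) for three people.
people = [[170.0, 40000.0], [180.0, 41000.0], [171.0, 60000.0]]
raw = distance_matrix(people, euclidean)
scaled = distance_matrix(standardize(people), euclidean)

# On the raw scale income dominates, so person 0 looks closest to person 1;
# after standardizing, person 0 is (slightly) closer to person 2.
print(raw[0][1] < raw[0][2], scaled[0][2] < scaled[0][1])
```

Which metric and which scaling to use is a modelling choice; the point of the sketch is only that the choice can change which observations count as neighbors.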

Multidimensional Scaling (MDS)
- Start with a distance or dissimilarity matrix
- Metric MDS – find a lower dimensional representation that preserves the actual pairwise distances.
- Non-Metric MDS – find a lower dimensional representation that maximizes the rank correlation between the original distances and the distances in the lower dimensional representation.
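One common form of metric MDS is classical (Torgerson) scaling, which recovers coordinates from the eigen-decomposition of the double-centered squared distance matrix. The Python sketch below assumes NumPy is available; the function name `classical_mds` and the rectangle example are illustrative, not taken from the course.

```python
import numpy as np

def classical_mds(D, k=2):
    """Embed n points in k dimensions from an n x n distance matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]        # indices of the k largest
    vals = np.clip(vals[order], 0.0, None)    # guard against tiny negatives
    return vecs[:, order] * np.sqrt(vals)

# Hypothetical example: the four corners of a 3 x 4 rectangle.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

coords = classical_mds(D, k=2)
D_hat = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
print(np.allclose(D, D_hat))
```

The recovered coordinates match the original points only up to rotation, reflection, and translation, which is why the check compares distance matrices rather than coordinates.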

Example: Mushroom Characteristics (n = 4,062)

Each level of every variable is turned into a dummy variable, 1 if mushroom has trait (red), 0 if not (white).
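A minimal Python sketch of this dummy coding, followed by one possible distance on the resulting 0/1 rows. The three mushrooms and their variable names are hypothetical, and the Jaccard distance is one reasonable choice for binary traits, not necessarily the one used in the course.

```python
def dummy_encode(records):
    """Turn each (variable, level) pair into a 0/1 indicator column."""
    columns = sorted({(var, level) for rec in records for var, level in rec.items()})
    matrix = [
        [1 if rec.get(var) == level else 0 for var, level in columns]
        for rec in records
    ]
    return columns, matrix

def jaccard_distance(a, b):
    """1 - (traits shared by both rows) / (traits present in either row)."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return 1.0 - both / either if either else 0.0

# Hypothetical mushrooms described by two categorical variables.
mushrooms = [
    {"cap_color": "red", "odor": "none"},
    {"cap_color": "red", "odor": "foul"},
    {"cap_color": "brown", "odor": "foul"},
]
columns, X = dummy_encode(mushrooms)
for row in X:
    print(row)
print(jaccard_distance(X[0], X[1]), jaccard_distance(X[0], X[2]))
```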

Distance/Dissimilarity Matrix

Ecological Field Studies

Principal Components Analysis (PCA)
Dimension reduction method
- Eigenanalysis/spectral decomposition, or
- SVD of data matrix
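The two routes give the same components. The Python sketch below (NumPy assumed) uses simulated data and, as an assumption, takes the eigen-decomposition of the sample covariance matrix of the centered data; a correlation matrix would be used instead if the variables were standardized first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations on 3 strongly correlated variables.
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))
Xc = X - X.mean(axis=0)                      # center each column

# Route 1: eigen-decomposition of the sample covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s ** 2 / (len(X) - 1)             # variances of the components

scores = Xc @ Vt.T                           # principal component scores
explained = svd_vals / svd_vals.sum()        # proportion of variance explained
print(np.allclose(eigvals, svd_vals))
print(explained.round(3))
```

The loading vectors from the two routes agree up to sign, so comparisons should use absolute values of their inner products.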

Example: Image Compression
From top left to bottom right:
SVD of the pixel matrix allows for a lower dimensional representation of the image.
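A rank-k approximation keeps only the k largest singular values and their vectors. The Python sketch below (NumPy assumed) uses a synthetic matrix as a stand-in for a grayscale image, since the slide's image is not available; `low_rank` and the chosen ranks are illustrative.

```python
import numpy as np

def low_rank(image, k):
    """Best rank-k approximation of a matrix in the least-squares sense."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Synthetic stand-in for a 64 x 96 grayscale image: smooth pattern plus noise.
rows, cols = 64, 96
y, x = np.mgrid[0:rows, 0:cols]
rng = np.random.default_rng(1)
image = (np.sin(x / 10.0) * np.cos(y / 8.0)
         + 0.5 * (x > 48)
         + 0.05 * rng.normal(size=(rows, cols)))

ranks = (1, 2, 5, 20, 64)
errors = {
    k: np.linalg.norm(image - low_rank(image, k)) / np.linalg.norm(image)
    for k in ranks
}
for k in ranks:
    stored = k * (rows + cols + 1)           # numbers kept for a rank-k version
    print(k, stored, rows * cols, round(float(errors[k]), 4))
```

The relative error can only go down as k grows, while storage grows linearly in k; the trade-off between the two is what makes the SVD usable for compression.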

Genetic Analysis of Europeans
Genes mirror geography within Europe (Novembre et al., Nature 2008)
Sample of approximately 3,000 European individuals genotyped at over half a million variable DNA sites in the human genome (i.e. p > 500,000)!!!
(Love to have the raw data from this one, cannot find it…)

Activity: Liver Toxicity Study
The data come from a liver toxicity study (Bushel et al., 2007) in which male rats of the inbred strain Fischer 344 were exposed to non-toxic (50 or 150 mg/kg), moderately toxic (1500 mg/kg), or severely toxic (2000 mg/kg) doses of acetaminophen (paracetamol) in a controlled experiment. Necropsies were performed at 6, 18, 24 and 48 hours after exposure and the mRNA from the liver was extracted. Ten clinical chemistry measurements of variables containing markers for liver injury are available for each subject and the serum enzyme levels are measured numerically. The data were further normalized and pre-processed by Bushel et al. (2007).
Variables
- Animal ID – ID number for rat
- Treatment.Group – conveys both dose level and time animal was sacrificed
- Dose.Group – 50, 150, 1500, or 2000 mg/kg
- Time.Group – time sacrificed (6, 18, 24, or 48 hrs.)
- 10 clinical markers for liver injury and serum enzyme levels: BUN.mg.dL, Creat.mg.dL, TP.g.dL, ALB.g.dL, ALT.IU.L, SDH.IU.L, AST.IU.I, ALP.IU.L, TBA.umol.L, Cholesterol.mg.dL
- 3,116 genetic expression levels: A_43_P14555, A_45_P2220, … , A_42_P720241
Clearly these data are high dimensional and therefore a good candidate for employing dimension reduction methods.

Activity: Liver Toxicity Study
Use principal component analysis to see if the measurements are associated with dose received and time of exposure to acetaminophen.

Independent Component Analysis (ICA)
- Independent component analysis (ICA) is a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals.
- Finds independent sub-parts of a complex dataset
- ICA is superficially related to PCA and/or Factor Analysis
- Useful in separating mixed signals into independent sources.
http://research.ics.aalto.fi/ica/icademo/
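Separating mixed signals can be sketched with a small FastICA-style fixed-point iteration in Python (NumPy assumed). This is a teaching sketch under stated assumptions, not the course's implementation: the two source signals, the mixing matrix `A`, and the helper names `whiten` and `fast_ica` are all made up for illustration, and a maintained library routine would normally be used instead.

```python
import numpy as np

def whiten(X):
    """Center the columns and rotate/scale them to identity covariance."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ vecs / np.sqrt(vals)

def fast_ica(X, n_iter=200, seed=0):
    """Symmetric FastICA-style iteration with a tanh nonlinearity."""
    Z = whiten(X)                                   # n observations x p signals
    n, p = Z.shape
    rng = np.random.default_rng(seed)
    W = np.linalg.qr(rng.normal(size=(p, p)))[0]    # random orthogonal start
    for _ in range(n_iter):
        Y = Z @ W.T                                 # current source estimates
        G = np.tanh(Y)
        G_prime = 1.0 - G ** 2
        # Fixed-point update for every row of W at once.
        W_new = (G.T @ Z) / n - np.diag(G_prime.mean(axis=0)) @ W
        # Symmetric decorrelation keeps the rows of W orthonormal.
        U, _, Vt = np.linalg.svd(W_new)
        W = U @ Vt
    return Z @ W.T

# Hypothetical sources: a sine wave and a square wave, mixed linearly.
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
A = np.array([[1.0, 0.5], [0.5, 2.0]])              # assumed mixing matrix
X = S @ A.T                                          # observed mixtures

recovered = fast_ica(X)
corr = np.abs(np.corrcoef(S.T, recovered.T)[:2, 2:])
print(corr.round(2))
```

ICA recovers sources only up to order, sign, and scale, so the check looks at absolute correlations between each true source and its best-matching recovered component.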

Example: PCA of Brazilian Facial Images
First four eigenfaces

Example: ICA of Brazilian Facial Images
200 Independent Components
https://mediaspace.minnstate.edu/media/Face+28+ICA+2/0_zmbc0iid

Activity: Cocktail Party Problem with Images
The other night my TV was acting up and the entire screen turned to static. While the video was completely scrambled, I swore I could hear maniacal laughing amongst the accompanying audio noise, which was also pretty much static. I took several burst pictures (actually 1,000) of the screen with my iPhone and converted these pictures to pixel grayscale JPEG image files. Below is a sample of four of these images.

Activity: Cocktail Party Problem with Images
I suspect that multiple channels of input were coming through my cable box at the same time. Combine this with the fact that my HDMI cable was loose, which I am sure added considerable noise, and you have the static-filled images to the right. If I am correct about the multiple input sources, independent component analysis (ICA) should prove useful in detangling these source images and finding the underlying independent source images. (You will be solving this one on an assignment!!)

Cluster Analysis
Given a set of variables measured on each object/subject, cluster analysis seeks to find groups, or clusters, of similar observations.

Cluster Analysis
Hierarchical cluster analysis (HCA)
- distance matrix – many possible metrics
- observations are fused successively into clusters until all observations are in one cluster
- linkage is how the distance between clusters is determined.
HCA is an example of an agglomerative method: points start as individuals and then are joined to form clusters.
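The fusion process can be sketched naively in Python: start with every point in its own cluster and repeatedly merge the two closest clusters, where "closest" depends on the linkage. The points and the `agglomerate` function are illustrative assumptions; a real analysis would use a library routine that also draws the dendrogram.

```python
import math

def agglomerate(points, linkage="single"):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    def cluster_distance(a, b):
        pairwise = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(pairwise)                 # nearest members
        if linkage == "complete":
            return max(pairwise)                 # farthest members
        return sum(pairwise) / len(pairwise)     # average linkage

    clusters = [(i,) for i in range(len(points))]   # every point on its own
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        d, a, b = min(
            (cluster_distance(a, b), a, b)
            for idx, a in enumerate(clusters)
            for b in clusters[idx + 1:]
        )
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges

# Hypothetical points: two tight pairs and one distant point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
single = agglomerate(points, "single")
complete = agglomerate(points, "complete")
for a, b, d in single:
    print(a, b, round(d, 2))
```

The merge distances are the heights at which branches join in a dendrogram; changing the linkage changes those heights and can change the order of the merges.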

Partitioning Methods
K-means and K-medians clustering
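Both methods alternate between assigning each point to its nearest center and recomputing the centers. In the Python sketch below, k-means pairs Euclidean distance with the mean and k-medians pairs Manhattan distance with the coordinate-wise median; the `partition` function, its parameters, and the data are assumptions made for illustration.

```python
import math
import random
import statistics

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def partition(points, k, metric=math.dist, center=statistics.mean,
              n_iter=100, seed=0):
    """Lloyd-style partitioning; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # start from k random observations
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: metric(p, centers[c]))
                  for p in points]
        # Update step: recompute each center from its current members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(center(col) for col in zip(*members)))
            else:
                new_centers.append(centers[c])   # keep an empty cluster's center
        if new_centers == centers:
            break                                # converged
        centers = new_centers
    return labels, centers

# Hypothetical data: two well separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
km_labels, km_centers = partition(points, 2)
kmed_labels, kmed_centers = partition(points, 2, metric=manhattan,
                                      center=statistics.median)
print(km_labels, kmed_labels)
```

Results depend on the random start, so in practice the algorithm is run from several starts and the best partition is kept.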

Density-Based Clustering (DBSCAN)
Great visualization of this and related methods: https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
dbscan – package from CRAN; can be hard to tune, but can work well in cases where other methods produce poor results.
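The two tuning parameters are a neighborhood radius and a minimum neighbor count. The Python sketch below is a plain, quadratic-time illustration of the idea rather than the CRAN package's implementation; the parameter names `eps` and `min_pts` and the data are assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Neighborhoods include the point itself.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1                  # noise for now; may become a border point
            continue
        cluster += 1                        # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster         # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is also core: keep expanding
    return labels

# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Unlike k-means, the number of clusters is not specified in advance, and isolated points are left unassigned instead of being forced into a cluster.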

Two-way Clustering
As we can measure distance between both observations and variables, we can perform cluster analysis for each and display the results using a cell plot.

Activity: Sports Difficulty
Sports experts were asked to rate sports on a variety of attributes required to excel in them.
Athlete attributes: power, strength, agility, flexibility, hand-eye coordination, nerves, analytic aptitude, endurance, and durability.

Activity: Sports Difficulty
- Which sports are clustered together on the basis of the athlete attributes?
- Which attributes are most associated with each other?
- Which sports appear to require the most attributes?
- Do the attribute/sport associations make sense given your prior knowledge?

Correspondence Analysis (CA/MCA)
- Similar to PCA, but is generally applied to categorical or ordinal data.
- Simple CA is used to visualize the relationship between two categorical variables, each with more than two levels.
- Multiple CA extends to larger, multivariate datasets with numerous categorical/ordinal variables.
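Simple CA can be computed from the SVD of the standardized residuals of a two-way table, which is where the resemblance to PCA comes from. The Python sketch below (NumPy assumed) uses a made-up 3 x 3 table; the function name `simple_ca` is illustrative.

```python
import numpy as np

def simple_ca(table):
    """Row/column principal coordinates and inertias for a two-way table."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                          # correspondence matrix
    r = P.sum(axis=1)                        # row masses
    c = P.sum(axis=0)                        # column masses
    expected = np.outer(r, c)                # proportions under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * s) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]
    return row_coords, col_coords, s ** 2    # s ** 2 are the principal inertias

# Hypothetical contingency table: 3 row categories by 3 column categories.
table = [[30, 10, 5],
         [10, 25, 10],
         [5, 10, 35]]
row_coords, col_coords, inertia = simple_ca(table)
print(inertia.round(4))
print(row_coords[:, :2].round(3))
```

The total inertia equals the table's chi-square statistic divided by the grand total, so CA can be read as a decomposition of the departure from independence; the first two columns of the coordinates are what is usually plotted.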

Example 1: Suicides in former W. Germany (1974 – 1979)

Example 2: Bird Counts and Sampling Sites

Multiple Correspondence Analysis
Multiple correspondence analysis is commonly used to examine structure in the results of a survey.
- Survey of individuals – explain variation between individuals. Understand relationship between survey items and subject demographics.
- Ecological surveys – look at species abundance at several different sample sites. Understand variation between sites, relationship between species, and relationships to site characteristics.

Example 3: Hobby Survey (n )  

Recommender Systems
Content-based Recommenders
In content-based recommendation systems we determine what exactly a user likes about product X and recommend products that have similar properties to product X.
Kyle MacLachlan: IMDB and Amazon both recommend only David Lynch-directed films.
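A content-based recommender can be sketched in Python by describing each item with a feature vector and ranking the other items by similarity to the one the user liked. The films, their 0/1 features, and the use of cosine similarity are all assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_similar(liked, items, top_n=2):
    """Rank every other item by similarity to the liked item."""
    scores = {
        name: cosine(items[liked], features)
        for name, features in items.items()
        if name != liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical 0/1 features: [director X, mystery, drama, comedy, animated]
films = {
    "Film A": [1, 1, 1, 0, 0],
    "Film B": [1, 1, 0, 0, 0],
    "Film C": [0, 0, 1, 1, 0],
    "Film D": [0, 0, 0, 0, 1],
}
print(recommend_similar("Film A", films))
```

Because the ranking uses only item properties, the recommendations stay close to what the user already likes, which is the narrowness the slide's example points at.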

Recommender Systems
Collaborative-Filtering
In collaborative-filtering, we find other individuals with similar tastes and make recommendations based upon what these other individuals liked.
Diagram: my reviews are matched to John, Sally, and Ernesto based upon movies we have all reviewed; the recommendations are the highest rated movies by John, Sally, and Ernesto that I have not seen (or rated or purchased).
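A user-based version of this idea can be sketched in Python: measure how similar each other reviewer is to me on the movies we have both rated, then score my unseen movies by a similarity-weighted average of their ratings. The ratings below are invented; only the names John, Sally, and Ernesto come from the slide, and cosine similarity on co-rated items is one simple choice among several.

```python
import math

def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(me, others, top_n=2):
    """Predict ratings for items I have not rated and rank them."""
    totals, weights = {}, {}
    for ratings in others.values():
        w = similarity(me, ratings)
        if w <= 0:
            continue
        for item, rating in ratings.items():
            if item in me:
                continue                       # already seen or rated
            totals[item] = totals.get(item, 0.0) + w * rating
            weights[item] = weights.get(item, 0.0) + w
    predicted = {item: totals[item] / weights[item] for item in totals}
    ranked = sorted(predicted, key=predicted.get, reverse=True)[:top_n]
    return ranked, predicted

# Hypothetical ratings on a 1-5 scale.
me = {"M1": 5, "M2": 1, "M3": 4}
others = {
    "John":    {"M1": 5, "M2": 1, "M3": 5, "M4": 5, "M5": 2},
    "Sally":   {"M1": 1, "M2": 5, "M3": 2, "M4": 1, "M5": 5},
    "Ernesto": {"M1": 4, "M2": 2, "M4": 4},
}
ranked, predicted = recommend(me, others)
print(ranked)
```

Raw cosine similarity on positive ratings tends to be high for everyone, so practical systems usually subtract each user's mean rating first; that refinement is left out here to keep the sketch short.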

Activity: Bob Ross Painting Recommender
Suppose that I like the Bob Ross painting Mt. McKinley (Season 1, Episode 2). Can you recommend another painting/episode I might also like?

To do list
- More image stuff
- Improved NLP
- Social Network Analysis
- Others?