Finite Mixture Models and EM Algorithm


Slide1

Finite Mixture Models and EM Algorithm

McLachlan, G., & Peel, D. (2001). Finite mixture models. New York: Wiley.
Murphy, K. P. (2013). Machine learning: A probabilistic perspective. Cambridge, Mass.: MIT Press.
Bishop, C. M. (2013). Pattern recognition and machine learning. New York: Springer.

1

Presenter: Ege Beyazit

Slide2

Introduction

The EM algorithm was named and explained by Arthur Dempster, Nan Laird, and Donald Rubin in 1977.
The paper points out that the method had been proposed many times in special circumstances.
EM is typically used to compute maximum likelihood estimates from incomplete data samples.

2

Slide3

Outline

Motivation: Finite Mixture Models
Mixture of Gaussians
K-means Clustering: A Naïve Approach
Defining the Model for EM
Expectation-Maximization Algorithm
EM: A Broader Perspective
Example: Coin Flip

3

Slide4

Motivation: Finite Mixture Models

Convex combination of multiple density functions
Capable of approximating any arbitrary distribution
In many applications, their parameters are determined by maximum likelihood, typically using the EM algorithm
Widely used in:
Data Mining
Pattern Recognition
Machine Learning
Statistical Analysis

4
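The slide's formula is not preserved in this transcript; in standard notation, a finite mixture density with K components is

p(x) = \sum_{k=1}^{K} \pi_k \, f_k(x), \qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1,

where the f_k are component densities and the \pi_k are the mixing weights of the convex combination.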

Slide5

Mixture of Gaussians

Simple linear superposition of Gaussian components.
A single Gaussian distribution suffers from significant limitations when it comes to modelling real data sets.

5
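The superposition referred to on this slide is not preserved in the transcript; in Bishop's notation, a mixture of Gaussians has the density

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),

with each component having its own mean \mu_k and covariance \Sigma_k.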

Slide6

K-means Clustering

Data set S = {x_1, …, x_N}: N observations of a D-dimensional Euclidean variable x
Goal: partition S into K clusters
Minimize within-cluster distance
Maximize between-cluster distance
μ_k: center of the k-th cluster, the mean of the points in cluster k
r_nk ∈ {0,1}: binary indicator variable
If x_n is assigned to cluster k: r_nk = 1
Else: r_nk = 0
Find values for r_nk and μ_k minimizing J.

6
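The objective J (the distortion measure) is not shown in the transcript; in Bishop's notation it is

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2 .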

Slide7

K-means Clustering

0. Select initial random centers μ_k for each cluster.
Iteratively perform two successive optimizations until convergence:
E: Find r_nk using the current μ_k (assign each point to its nearest center).
M: Update μ_k using the r_nk computed in the E step (recompute each center as the mean of its assigned points).

7
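As an illustration of these two alternating steps, here is a minimal NumPy sketch (the function kmeans and its initialisation scheme are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Naive K-means: alternate an assignment (E-like) and a mean-update (M-like) step."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]        # 0) random initial centers
    for _ in range(n_iter):
        # E-step analogue: r_nk = 1 for the nearest center, 0 otherwise
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        r = d.argmin(axis=1)
        # M-step analogue: each center becomes the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, r
```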

Slide8

Defining the Model: Mixture

Let's introduce a latent variable z: z_k ∈ {0,1} and Σ_k z_k = 1 (1-of-K representation).
Now define the joint distribution p(x, z) = p(z) p(x|z).
Reformulate the mixture distribution of x using z:

8
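The equations on this slide are not preserved in the transcript; following Bishop, the latent-variable formulation is

p(z_k = 1) = \pi_k, \qquad p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k),

p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k).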

Slide9

Defining the Model: Posterior

Derive the posterior probability of z given an observation x, in terms of the mixture distribution that we defined:

9
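The resulting posterior, usually called the responsibility, has the standard form (Bishop's notation)

\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}.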

Slide10

Joint Distribution – Marginal Distribution – Responsibility

Joint: P(x, z)
Marginal: P(x)
Responsibility: P(z_k = 1 | x)

10

Slide11

Defining the Model: Log Likelihood

Given a data set of observations {x_1, …, x_N}, arranged in a matrix X ∈ ℝ^{N×D}, with i.i.d. data points {x_n} and corresponding latent points {z_n}.

11
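The log likelihood itself is not preserved in the transcript; in standard notation it is

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}.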

Slide12

EM: Algorithm

In the expectation step, we use the current values of the parameters to evaluate the posterior probabilities (responsibilities).
In the maximization step, we use these posterior probabilities to re-estimate the means, covariance matrices, and mixing coefficients.

12
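A minimal NumPy/SciPy sketch of this loop for a Gaussian mixture (em_gmm and its initialisation choices are illustrative, not from the slides; a small ridge is added to the covariances for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                          # mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]      # means initialised from data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # E step: evaluate responsibilities gamma(z_nk) with the current parameters
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])   # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances and mixing coefficients
        Nk = gamma.sum(axis=0)                        # effective number of points per component
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, Sigma, gamma
```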

Slide13

EM for Gaussian Mixtures: Mean

Set the derivative of the likelihood function with respect to the mean vector μ_k to zero. By rearranging, we obtain:

13
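The resulting update is the standard one:

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}).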

Slide14

EM for Gaussian Mixtures: Covariance Matrix

Set the derivative of the likelihood function with respect to Σ_k to zero.
*The result has the form of a single Gaussian fitted to the data set, but with each data point weighted by the corresponding posterior probability, and with the denominator given by the effective number of points associated with that component.

14
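The standard form of this update is

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)(x_n - \mu_k)^{\mathsf T}.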

Slide15

EM for Gaussian Mixtures: Mixing Coefficient

Set the derivative of the likelihood function with respect to π_k.
Constraint: the mixing coefficients must sum to one.
Use a Lagrange multiplier and maximize the resulting objective.
Take the derivative, multiply both sides by π_k, and sum over k:

15
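This yields the standard update

\pi_k = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk}).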

Slide16

EM & K-means

16

Slide17

EM: A Broader View

Finding maximum likelihood solutions for models with latent variables:
The complete data is of the form {X, Z}.
Our observed data X is incomplete, so we cannot directly use maximum likelihood.
Because we cannot use the complete-data likelihood, we instead use its expected value under the posterior distribution of the latent variables: E step.
Then we maximize this expectation with respect to the parameters: M step.

17

Slide18

The General EM Algorithm

18

Slide19

The General EM Algorithm

How do we know that EM will maximize the likelihood?
Direct optimization of p(X|ϴ) is hard because X is incomplete; it is easier to optimize the complete-data form p(X, Z|ϴ).
Introduce a distribution q(Z), defined over the latent variables.
For any choice of q(Z), the following decomposition holds:

19
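The decomposition itself is not preserved in the transcript; in Bishop's notation it is

\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p), \qquad \mathcal{L}(q, \theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}.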

Slide20

The General EM Algorithm

What does KL(q||p) really mean?
Reminder: Entropy
Kullback-Leibler Divergence (Relative Entropy)

20
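As a reminder, the entropy of a distribution q is H[q] = -\sum_{Z} q(Z) \ln q(Z), and the Kullback-Leibler divergence appearing in the decomposition above is

\mathrm{KL}(q \,\|\, p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid X, \theta)}{q(Z)} \;\ge\; 0,

with equality if and only if q(Z) = p(Z \mid X, \theta).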

Slide21

The General EM Algorithm

A useful property of the KL divergence: KL(q||p) ≥ 0, with equality if and only if q(Z) = p(Z|X, ϴ).
We already proved that ln p(X|ϴ) = L(q, ϴ) + KL(q||p).
Then it also holds that L(q, ϴ) ≤ ln p(X|ϴ).
So: L(q, ϴ) is a lower bound on ln p(X|ϴ).

21

Slide22

The General EM Algorithm

Let's use this decomposition to define EM:
E Step:
M Step:

22
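Written out (the slide's equations are not preserved in the transcript), the two steps are

\text{E step:} \quad q(Z) \leftarrow p(Z \mid X, \theta^{\mathrm{old}}),

\text{M step:} \quad \theta^{\mathrm{new}} = \arg\max_{\theta} \; \sum_{Z} p(Z \mid X, \theta^{\mathrm{old}}) \, \ln p(X, Z \mid \theta).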

Slide23

The General EM Algorithm

23

Slide24

Example: Coin Flip

A pair of coins A and B with unknown biases θ_A and θ_B.
The following procedure is repeated 5 times:
Randomly choose one of the two coins with equal probability.
Perform ten independent coin tosses with the selected coin.

24
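A minimal sketch of EM for this setup, assuming NumPy/SciPy (coin_em, the initial guesses, and the head counts in the usage line are illustrative, not the slide's data):

```python
import numpy as np
from scipy.stats import binom

def coin_em(heads, n_tosses, theta_a=0.6, theta_b=0.5, n_iter=50):
    """EM for two coins with unknown biases; which coin produced each set is latent."""
    heads = np.asarray(heads, dtype=float)
    for _ in range(n_iter):
        # E step: posterior probability that each set of tosses came from coin A
        like_a = binom.pmf(heads, n_tosses, theta_a)
        like_b = binom.pmf(heads, n_tosses, theta_b)
        w_a = like_a / (like_a + like_b)          # equal prior on the two coins
        w_b = 1.0 - w_a
        # M step: re-estimate the biases from the expected head counts
        theta_a = np.sum(w_a * heads) / np.sum(w_a * n_tosses)
        theta_b = np.sum(w_b * heads) / np.sum(w_b * n_tosses)
    return theta_a, theta_b

# e.g. 5 sets of 10 tosses, recording the number of heads in each set
print(coin_em(heads=[5, 9, 8, 4, 7], n_tosses=10))
```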

Slide25

Question 1

What is the difference between the EM algorithm and ML (Maximum Likelihood) estimation?
The EM algorithm tries to find an ML estimate for the parameters of a model with latent variables.
In each iteration of the algorithm, the latent variables are calculated and used to maximize the likelihood; they are like 'side effects' of the maximization process.
EM is not guaranteed to converge to the global maximum, but it is guaranteed to converge to a local maximum and to improve the likelihood of the model at every step.

25

Slide26

Question 2

What is a mixture model and what is the benefit of using it? What is the mathematical expression for a Gaussian mixture?
Finite mixture models are convex combinations, or weighted sums, of multiple density functions. With enough components, they are capable of approximating any arbitrary distribution.
The PDF of a Gaussian mixture has the form p(x) = Σ_k π_k N(x | μ_k, Σ_k).

26

Slide27

Question 3

How can we make use of EM to solve a clustering problem?
EM is used to maximize the likelihood for models with incomplete data or latent variables.
When we consider a clustering problem, we can model it by introducing the cluster labels as latent variables:

27

Slide28

Online Feature Selection with Streaming Features

Wu, X., Yu, K., Ding, W., Wang, H., & Zhu, X. (2013). Online Feature Selection with Streaming Features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(5), 1178-1192. doi:10.1109/tpami.2012.197

28

Presenter: Ege Beyazit

Slide29

Outline

Feature Selection
Streaming Features
Problem Definition
Feature Relevance and Redundancy
OSFS Algorithm and Fast-OSFS Algorithm
A Case Study

29

Slide30

Feature Selection

Effectively handle data with high dimensionality by searching for an optimal subset of features.
Why?
Decrease the computational cost
Easier to interpret
Enhanced generalization / increased accuracy (prevent overfitting)
Avoid the curse of dimensionality
How?
Filter Methods
Wrapper Methods
Embedded Methods

30

Slide31

Feature Selection: Filter Methods

Used as a preprocessing step.
Selection of features is independent of any classifier.
Features are selected based on their scores in various tests:
Distance
Information
Dependency
Consistency

31

Slide32

Feature Selection: Wrapper Methods

Use a subset of features and train a model with them.
Based on the inferences drawn from the previous model, decide to add or remove features.
Computationally expensive.

32

Slide33

Feature Selection: Embedded Methods

Integrate feature selection into model training.
Implemented by learning algorithms that have built-in feature selection.
Usually the fastest of the three.

33

Slide34

Streaming Features

In many real-world applications, not all features can be known in advance.
Image processing: features are expensive to generate and store, and may arrive in streaming format.
Example: Mars crater detection from high-resolution planetary images, using texture-based image segmentation.
Definition (Streaming Features): 'Streaming features involve a feature vector that flows in one by one over time while the number of training examples remains fixed.'

34

Slide35

Problem

What do the three methods have in common?
They assume that all features are known a priori.
What is different with streaming features?
The dynamic and uncertain nature of the feature space
The streaming nature of the feature space
Previous research efforts that address streaming features:
Require information about the global feature set; cannot deal with an unknown feature size
Only consider adding new features; never evaluate redundancy
Require prior knowledge about the structure of the feature space

35

Slide36

An Intuitive Idea

How to measure the 'goodness' of a subset of features?
Try to select the most relevant features for the task.
Try not to select redundant features: a feature can be relevant but also redundant.
∴ Minimize redundancy and maximize relevance.
How to do this without knowing all features a priori?

36

Slide37

Feature Relevance and Conditional Independence

Strong relevance
Weak relevance (if not strongly relevant)
Irrelevance (if neither strongly nor weakly relevant)

37

Slide38

Redundant Features and Markov Blanket

Weakly relevant features are divided, based on a Markov blanket criterion, into:
Redundant features
Non-redundant features

38

Slide39

Streaming Feature Selection Framework

39

Slide40

Question

How can we guarantee that after we remove a feature from our selected subset, we will never need it again?

[21] D. Koller and M. Sahami, “Toward Optimal Feature Selection,” Proc. Int’l Conf. Machine Learning, pp. 284-292, 1996.

40

Slide41

OSFS Algorithm (Online Streaming Feature Selection)

41

Slide42

Dependence and Independence Tests

Ind(C, X|S) and Dep(C, X|S) in OSFS.
Conditional independence/dependence tests identify irrelevant and redundant features.
The tests are implemented using the G² test.

42
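As a rough sketch of how these tests drive the selection loop (a paraphrase of the relevance-then-redundancy idea only; the paper's exact algorithm listing is not reproduced in this transcript, features are assumed to be identified by name or index, and ind_test is a placeholder standing in for the G² conditional independence test):

```python
from itertools import combinations

def osfs(feature_stream, ind_test, max_cond_size=3):
    """Schematic online streaming feature selection:
    relevance analysis on each arriving feature, then redundancy analysis on BCF."""
    bcf = []                                  # best candidate features selected so far
    for x in feature_stream:
        # Relevance analysis: skip x if it is (unconditionally) independent of the class C
        if ind_test(x, cond=()):
            continue
        bcf.append(x)
        # Redundancy analysis: drop features that become conditionally independent of C
        # given some subset of the other currently selected features
        kept = []
        for y in bcf:
            others = [f for f in bcf if f != y]
            redundant = any(
                ind_test(y, cond=subset)
                for size in range(1, min(max_cond_size, len(others)) + 1)
                for subset in combinations(others, size)
            )
            if not redundant:
                kept.append(y)
        bcf = kept
    return bcf
```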

Slide43

Complexity of OSFS

Time complexity of OSFS:
Depends on the number of tests
|M|: number of features arrived so far
k: maximum size to which a conditioning set may grow
k|BCF|: total number of subsets that need to be checked in BCF
Worst-case complexity: the time complexity is determined by the size of BCF
If only a small number of features in a large feature space are predictive and relevant to C (which is the case in many real-world applications), the algorithm is efficient.
Memory efficiency of OSFS: only needs to store a small number of relevant features at any moment.

43

Slide44

Fast-OSFS Algorithm

If the size of BCF is large, redundancy analysis will reduce the runtime performance of the algorithm.
To improve efficiency, divide the redundant-feature handling step into two parts:
1. Determine whether the incoming feature is redundant.
2. Identify which of the already-selected features may become redundant if the new feature is added to the selected set.

44

Slide45

Fast-OSFS Algorithm

45

Slide46

Time Complexity of Fast-OSFS

|M|: number of features arrived so far
|SF|: number of all relevant features in |M|
Best case:
If the second part is performed for |SF1|, worst case:
k|BCF|*: only considers subsets in BCF that contain the newly added feature, provided the size of those subsets does not exceed k.

46

Slide47

A Case Study: Automatic Impact Crater Detection

Impact craters are among the most studied geomorphic features in the solar system.
They yield information about past and present geological processes.
They are the only tool for remotely measuring the relative ages of geological formations.
Planetary probes deliver huge volumes of high-resolution images.
We need tools to scientifically utilize these images (automated analysis).
Tens of thousands of texture-based features at different scales and resolutions can be generated for crater detection.
Features are expensive to generate and store.
Efficient feature selection is needed.

47

Slide48

Crater Detection Framework

Crater candidates: regions that may potentially contain craters.
Key insight: a sub-kilometer crater can be recognized as a pair of crescent-like highlight and shadow regions in an image.
Experiments are evaluated on Mars.
A portion of a high-resolution stereo camera image (near-global coverage) is selected as the test set: 12.5 meters/pixel, 3000 x 4500 pixels (37,500 x 56,250 m²).
The image represents a significant challenge:
Spatially variable terrain morphology
Poor contrast

48

Slide49

49

Slide50

Streaming Feature Selection Framework

50

Slide51

Experimental Results

51

Slide52

Online Learning from Trapezoidal Data Streams

Zhang, Q., Zhang, P., Long, G., Ding, W., Zhang, C., & Wu, X. (2016). Online Learning from Trapezoidal Data Streams. IEEE Transactions on Knowledge and Data Engineering, 28(10), 2709-2723.

52

Presenter: Ege Beyazit

Slide53

Motivation

What if both the data volume and the data dimension increase over time?
Graph node classification:
The number of nodes changes dynamically
Node features change dynamically
Example: social network nodes
Text classification and clustering:
The number of documents increases over time
The text vocabulary increases over time
Example: infinite vocabulary topic model

53

Slide54

Trapezoidal Data Streams

Continuous learning from doubly-streaming data.
Both the data volume and the feature space increase over time: trapezoidal.
When a new training instance carrying new features arrives:
1. Update the existing features: passive-aggressive update rule.
2. Update the new features: structural risk minimization principle.
3. Enforce feature sparsity: feature projection and truncation.

54

Slide55

Problem Setting

Binary classification problem.
Input training data: {(x_t, y_t) | t = 1, …, T}, where x_t ∈ ℝ^{d_t} is a d_t-dimensional vector with d_{t-1} ≤ d_t, and y_t ∈ {-1, +1} for all t.
Linear classifier with a vector of weights w: w_t ∈ ℝ^{d_{t-1}} is the vector we aim to solve for at round t; w_t has the same dimension as the instance x_{t-1}.
Loss function: the hinge loss is used,
l(w, (x_t, y_t)) = max{0, 1 - y_t (w · x_t)}

55

Slide56

Online Learning with Trapezoidal Data Streams

Two challenges to be addressed:
1. Updating the classifier with an augmenting feature space:
The classifier update strategy should be able to learn from new features
Maximum-margin principle
2. Building a feature selection method to achieve an efficient model:
The dimension will increase over time, so feature selection should be used to prune redundant features
Projection + truncation strategy to obtain sparsity

56

Slide57

Algorithm 1: OLSF

57

Slide58

The Update Strategy

At round t, given the classifier w_t ∈ ℝ^{d_{t-1}}, obtain the new classifier:

58
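The optimization problem itself is not preserved in the transcript. As a rough sketch in the passive-aggressive style (my paraphrase under that assumption, not necessarily the paper's exact formulation), the existing coordinates of w_t are kept close to their old values while the new coordinates are filled in so that the hinge-loss constraint is satisfied:

w_{t+1} = \arg\min_{w \in \mathbb{R}^{d_t}} \; \tfrac{1}{2} \lVert w - [w_t; 0] \rVert^2 \quad \text{s.t.} \quad l\big(w, (x_t, y_t)\big) = 0,

where [w_t; 0] denotes w_t padded with zeros on the newly arrived features.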

Slide59

Update Strategy Summarized

If the dimension of the new classifier does not change: the problem is a standard online learning problem.
If the current classifier is able to classify the new instance: do not update the classifier.
If the current classifier cannot classify the new instance: solve the optimization problem and update the classifier.

59

Slide60

Update Strategy In Depth

How do we solve the constrained optimization problem?
Build the Lagrangian function and solve:

60

Slide61

Update Strategy In Depth

Problem: the model is sensitive to noise (especially label noise).
Introduce two update variants:

61

Slide62

The Sparsity Strategy

The dimension of the training instances increases rapidly, so select a relatively small number of features.
Two steps: project the updated classifier onto an L1 ball, then truncate it.

62
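A minimal NumPy sketch of the projection-then-truncation idea (project_l1 implements the standard Euclidean projection onto the L1 ball; the helper names and the keep-the-largest-weights truncation rule are illustrative rather than the paper's exact procedure):

```python
import numpy as np

def project_l1(w, radius=1.0):
    """Euclidean projection of w onto the L1 ball of the given radius."""
    if np.abs(w).sum() <= radius:
        return w
    u = np.sort(np.abs(w))[::-1]                      # magnitudes, sorted descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > (cssv - radius))[0][-1]
    theta = (cssv[rho] - radius) / (rho + 1.0)
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def sparsify(w, radius=1.0, keep=10):
    """Project the updated classifier onto an L1 ball, then truncate to the largest weights."""
    w = project_l1(np.asarray(w, dtype=float), radius)
    if np.count_nonzero(w) > keep:
        idx = np.argsort(np.abs(w))[:-keep]           # indices of the smallest-magnitude weights
        w[idx] = 0.0
    return w
```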

Slide63

Question 1

What is the difference between the EM algorithm and ML estimation?
The EM algorithm tries to find an ML estimate for the parameters of a model with latent variables.
In each iteration of the algorithm, the latent variables are calculated and used to maximize the likelihood; they are like 'side effects' of the maximization process.
EM is not guaranteed to converge to the global maximum, but it is guaranteed to converge to a local maximum and to improve the likelihood of the model at every step.

63

Slide64

Question 2

What is a Markov blanket, and why is it considered a good set for feature selection?
In a Bayesian network, the Markov blanket of a node contains all the variables that shield the node from the rest of the network. This means that the Markov blanket of a node is the only knowledge needed to predict the behavior of that node.

64

Slide65

Question 3

How do we model the learning-from-trapezoidal-streams process as an optimization problem?
Force the new classifier to be close to the old one, so as not to lose the information gained from previous data.
Keep ŵ small to avoid overfitting.
Use the hinge loss as a constraint.

65

Slide66

Questions?

66

Slide67

