# Finite Mixture Models and EM Algorithm


References:

- McLachlan, G., & Peel, D. (2001). *Finite Mixture Models*. New York: Wiley.
- Murphy, K. P. (2013). *Machine Learning: A Probabilistic Perspective*. Cambridge, MA: MIT Press.
- Bishop, C. M. (2013). *Pattern Recognition and Machine Learning*. New York: Springer.


Presenter: Ege Beyazit

## Slide 2: Introduction

- The EM algorithm was named and explained by Arthur Dempster, Nan Laird, and Donald Rubin in 1977.
- Their paper points out that the method had been proposed many times before in special circumstances.
- EM is typically used to compute maximum likelihood estimates from incomplete data samples.

## Slide 3: Outline

- Motivation: Finite Mixture Models
- Mixture of Gaussians
- K-means Clustering: A Naïve Approach
- Defining the Model for EM
- Expectation-Maximization Algorithm
- EM: A Broader Perspective
- Example: Coin Flip

## Slide 4: Motivation: Finite Mixture Models

- Convex combinations of multiple density functions
- Capable of approximating any arbitrary distribution
- In many applications, their parameters are determined by maximum likelihood, typically via the EM algorithm
- Widely used in:
  - Data Mining
  - Pattern Recognition
  - Machine Learning
  - Statistical Analysis

## Slide 5: Mixture of Gaussians

A simple linear superposition of Gaussian components:

p(x) = Σk πk N(x | μk, Σk)

A single Gaussian distribution suffers from significant limitations when it comes to modelling real data sets; a mixture is far more flexible.

## Slide 6: K-means Clustering

- Data set S = {x1, …, xN}: N observations of a D-dimensional Euclidean variable x
- Goal: partition S into K clusters
  - Minimize within-cluster distance
  - Maximize between-cluster distance
- μk: center of the kth cluster (the mean of the points in cluster k)
- rnk ∈ {0, 1}: binary indicator variable
  - If xn is assigned to cluster k: rnk = 1; else rnk = 0
- Distortion measure: J = Σn Σk rnk ‖xn − μk‖²
- Find the values of rnk and μk that minimize J.

## Slide 7: K-means Clustering

0. Select initial random points μk for each cluster.

Then iteratively perform two successive optimizations until convergence:

- E: find rnk using the current μk, assigning each point to its nearest center: rnk = 1 if k = argminj ‖xn − μj‖², otherwise rnk = 0.
- M: update μk using the rnk calculated in the E step: μk = Σn rnk xn / Σn rnk.
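The two alternating steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the presentation; the function name `kmeans` and its parameters are my own.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Naive K-means: alternate the E-like and M-like steps above."""
    rng = np.random.default_rng(seed)
    # 0- Select K random data points as initial cluster centers mu_k.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E- Assign each point to its nearest center (this is r_nk).
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = d.argmin(axis=1)
        # M- Move each center to the mean of its assigned points.
        new_mu = np.array([X[r == k].mean(axis=0) if (r == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged: assignments stable
            break
        mu = new_mu
    return mu, r
```

Each iteration can only decrease the distortion J, so the loop terminates at a (possibly local) minimum that depends on the random initialization.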

## Slide 8: Defining the Model: Mixture

Let's introduce a latent variable z with zk ∈ {0, 1} and Σk zk = 1 (a 1-of-K representation).

Now define the joint distribution p(x, z):

p(x, z) = p(z) p(x|z)

Reformulate the mixture distribution of x using z:

p(x) = Σz p(z) p(x|z) = Σk πk N(x | μk, Σk),  where p(zk = 1) = πk.

## Slide 9: Defining the Model: Posterior

Derive the posterior probability of z after observing x, in terms of the mixture distribution that we defined:

γ(zk) ≡ p(zk = 1 | x) = πk N(x | μk, Σk) / Σj πj N(x | μj, Σj)

## Slide 10: Joint Distribution – Marginal Distribution – Responsibility

- Joint distribution: p(x, z)
- Marginal distribution: p(x)
- Responsibility: p(zk = 1 | x)

## Slide 11: Defining the Model: Log Likelihood

Given a data set of observations {x1, …, xN}, stored in a matrix X ∈ ℝ^(N×D), with i.i.d. data points {xn} and corresponding latent points {zn}, the log likelihood is:

ln p(X | π, μ, Σ) = Σn ln { Σk πk N(xn | μk, Σk) }

## Slide 12: EM: Algorithm

- In the expectation step, we use the current values of the parameters to evaluate the posterior probabilities (responsibilities).
- In the maximization step, we use these posterior probabilities to re-estimate the means, covariance matrices, and mixing coefficients.

## Slide 13: EM for Gaussian Mixtures: Mean

Set the derivative of the log likelihood with respect to the vector μk to zero. By rearranging we obtain:

μk = (1/Nk) Σn γ(znk) xn,  where Nk = Σn γ(znk)

## Slide 14: EM for Gaussian Mixtures: Covariance Matrix

Set the derivative of the log likelihood with respect to Σk to zero:

Σk = (1/Nk) Σn γ(znk) (xn − μk)(xn − μk)ᵀ

This is a single Gaussian fitted to the data set, but with each data point weighted by the corresponding posterior probability and normalized by Nk, the effective number of points associated with that component.

## Slide 15: EM for Gaussian Mixtures: Mixing Coefficient

Set the derivative of the log likelihood with respect to πk to zero, under the constraint that the mixing coefficients sum to one. Use a Lagrange multiplier and maximize:

ln p(X | π, μ, Σ) + λ (Σk πk − 1)

Take the derivative, multiply both sides by πk, and sum over k to find λ = −N, which gives:

πk = Nk / N
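The E step (responsibilities) and the three M-step updates from the last three slides fit naturally into one loop. A minimal NumPy sketch, with my own function names and a small diagonal jitter added for numerical stability (an implementation detail, not part of the derivation):

```python
import numpy as np

def gaussian_pdf(X, mu, cov):
    """Multivariate normal density N(x | mu, cov), evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    expo = -0.5 * np.einsum('ni,ij,nj->n', diff, np.linalg.inv(cov), diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(expo) / norm

def em_gmm(X, K, n_iters=50, seed=0):
    """EM for a Gaussian mixture, following the update equations above."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    cov = np.array([np.eye(D)] * K)
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk).
        dens = np.stack([pi[k] * gaussian_pdf(X, mu[k], cov[k])
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate means, covariances, mixing coefficients.
        Nk = gamma.sum(axis=0)                # effective number of points N_k
        mu = (gamma.T @ X) / Nk[:, None]      # mu_k
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] \
                     + 1e-6 * np.eye(D)       # Sigma_k (+ jitter)
        pi = Nk / N                           # pi_k = N_k / N
    return pi, mu, cov
```

Unlike K-means, each point contributes to every component in proportion to its responsibility rather than via a hard assignment.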

## Slide 16: EM & K-means

## Slide 17: EM: A Broader View

Finding maximum likelihood solutions for models with latent variables:

- The complete data is of the form {X, Z}.
- Our observed data X is incomplete, so we cannot directly apply maximum likelihood.
- Because we cannot use the complete-data log likelihood, we instead use its expected value under the posterior distribution of the latent variables (E step):

Q(θ, θ_old) = ΣZ p(Z | X, θ_old) ln p(X, Z | θ)

- Then we maximize this expectation (M step):

θ_new = argmaxθ Q(θ, θ_old)

## Slide 18: The General EM Algorithm

## Slide 19: The General EM Algorithm

How do we know that EM will maximize the likelihood?

- Direct optimization of p(X|θ) is hard because X is incomplete; it is easier to optimize the complete-data form p(X, Z|θ).
- Introduce a distribution q(Z), defined over the latent variables. For any choice of q(Z), the following decomposition holds:

ln p(X|θ) = L(q, θ) + KL(q‖p)

where:

L(q, θ) = ΣZ q(Z) ln { p(X, Z|θ) / q(Z) }
KL(q‖p) = −ΣZ q(Z) ln { p(Z|X, θ) / q(Z) }

## Slide 20: The General EM Algorithm

What does KL(q‖p) really mean?

- Reminder, entropy: H(q) = −ΣZ q(Z) ln q(Z)
- Kullback-Leibler divergence (relative entropy): KL(q‖p) = ΣZ q(Z) ln { q(Z) / p(Z) }
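The two definitions above are easy to sanity-check numerically. A small sketch (function names are my own):

```python
import numpy as np

def entropy(q):
    """H(q) = -sum_z q(z) ln q(z); terms with q(z) = 0 contribute 0."""
    q = np.asarray(q, float)
    q = q[q > 0]
    return float(-np.sum(q * np.log(q)))

def kl(q, p):
    """KL(q || p) = sum_z q(z) ln(q(z) / p(z)); >= 0, and 0 iff q == p."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))
```

Note that KL is not symmetric: kl(q, p) and kl(p, q) generally differ, which is why the direction KL(q‖p) in the decomposition matters.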

## Slide 21: The General EM Algorithm

A useful property of the KL divergence is that KL(q‖p) ≥ 0, with equality if and only if q(Z) = p(Z|X, θ).

We already proved this is correct:

ln p(X|θ) = L(q, θ) + KL(q‖p)

Then, since KL(q‖p) ≥ 0, this is also correct:

ln p(X|θ) ≥ L(q, θ)

So L(q, θ) is a lower bound on ln p(X|θ).

## Slide 22: The General EM Algorithm

Let's use our decomposition to define EM:

- E step: holding θ_old fixed, maximize L(q, θ_old) with respect to q(Z); the maximum is reached at q(Z) = p(Z|X, θ_old), where the KL term vanishes.
- M step: holding q(Z) fixed, maximize L(q, θ) with respect to θ to obtain θ_new.

## Slide 23: The General EM Algorithm

## Slide 24: Example: Coin Flip

A pair of coins A and B with unknown biases θA and θB. The following procedure is repeated 5 times:

1. Randomly choose one of the two coins with equal probability.
2. Perform ten independent coin tosses with the selected coin.
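Since which coin was picked in each trial is the latent variable, this setting fits EM directly. A minimal sketch, using illustrative head counts and starting guesses of my own (the slide does not give the numbers):

```python
from math import comb

def coin_em(counts, n_flips=10, theta=(0.6, 0.5), n_iters=20):
    """EM for the two-coin problem: counts[i] = heads observed in trial i,
    theta = current estimates (theta_A, theta_B)."""
    theta_a, theta_b = theta
    for _ in range(n_iters):
        heads = [0.0, 0.0]
        tails = [0.0, 0.0]
        for h in counts:
            # E step: posterior probability that each coin produced this
            # trial, given the coins were chosen with equal probability.
            like_a = comb(n_flips, h) * theta_a**h * (1 - theta_a)**(n_flips - h)
            like_b = comb(n_flips, h) * theta_b**h * (1 - theta_b)**(n_flips - h)
            p_a = like_a / (like_a + like_b)
            p_b = 1 - p_a
            heads[0] += p_a * h; tails[0] += p_a * (n_flips - h)
            heads[1] += p_b * h; tails[1] += p_b * (n_flips - h)
        # M step: re-estimate the biases from the expected head/tail counts.
        theta_a = heads[0] / (heads[0] + tails[0])
        theta_b = heads[1] / (heads[1] + tails[1])
    return theta_a, theta_b
```

For example, `coin_em([5, 9, 8, 4, 7])` assigns the high-head trials mostly to coin A, ending with θA clearly larger than θB (roughly 0.8 versus 0.52 for these counts).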

## Slide 25: Question 1

What is the difference between the EM algorithm and ML (maximum likelihood) estimation?

- The EM algorithm tries to find an ML estimate for the parameters of a model with latent variables.
- In each iteration of the algorithm, the latent variables are computed and used to maximize the likelihood; they are like 'side effects' of the maximization process.
- EM is not guaranteed to converge to the global maximum, but it is guaranteed to converge to a local maximum and to improve the likelihood of the model at every step.

## Slide 26: Question 2

What is a mixture model, and what is the benefit of using it? What is the mathematical expression for a Gaussian mixture?

- Finite mixture models are convex combinations (weighted sums) of multiple density functions. With enough components, they are capable of approximating any arbitrary distribution.
- The PDF of a Gaussian mixture is of the form p(x) = Σk πk N(x | μk, Σk).

## Slide 27: Question 3

How can we make use of EM to solve a clustering problem?

- EM maximizes the likelihood for models with incomplete data / latent variables.
- When we consider a clustering problem, we can model it by introducing the cluster labels as latent variables.

# Online Feature Selection with Streaming Features

Wu, X., Yu, K., Ding, W., Wang, H., & Zhu, X. (2013). Online Feature Selection with Streaming Features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(5), 1178-1192. doi:10.1109/TPAMI.2012.197

Presenter: Ege Beyazit

## Slide 29: Outline

- Feature Selection
- Streaming Features
- Problem Definition
- Feature Relevance and Redundancy
- OSFS Algorithm and Fast-OSFS Algorithm
- A Case Study

## Slide 30: Feature Selection

Effectively handle data with high dimensionality by searching for an optimal subset of features.

Why?
- Decrease the computational cost
- Easier to interpret
- Enhanced generalization / increased accuracy (prevents overfitting)
- Avoid the curse of dimensionality

How?
- Filter Methods
- Wrapper Methods
- Embedded Methods

## Slide 31: Feature Selection: Filter Methods

- Used as a preprocessing step; the selection of features is independent of any classifier.
- Features are selected based on their scores in various tests:
  - Distance
  - Information
  - Dependency
  - Consistency

## Slide 32: Feature Selection: Wrapper Methods

- Use a subset of features and train a model with them.
- Based on the inferences drawn from the previous model, decide whether to add or remove features.
- Computationally expensive.

## Slide 33: Feature Selection: Embedded Methods

- Integrate feature selection into model training.
- Implemented by learning algorithms that have built-in feature selection.
- Usually the fastest of the three.

## Slide 34: Streaming Features

- In many real-world applications, not all features can be known in advance.
- Image processing: features are expensive to generate and store, and may arrive in streaming format.
- Example: Mars crater detection from high-resolution planetary images via texture-based image segmentation.

Definition (Streaming Features): 'Streaming features involve a feature vector that flows in one by one over time while the number of training examples remains fixed.'

## Slide 35: Problem

What do the three methods have in common? They assume that all features are known a priori.

What is different with streaming features?
- The dynamic and uncertain nature of the feature space
- The streaming nature of the feature space

Previous research efforts that address streaming features:
- Require information about the global feature set and cannot deal with an unknown feature space size
- Only consider adding new features and never evaluate redundancy
- Require prior knowledge about the structure of the feature space

## Slide 36: An Intuitive Idea

How do we measure the 'goodness' of a subset of features?
- Try to select the features most relevant to the task.
- Try not to select redundant features; obviously, a feature can be relevant but also redundant.
- ∴ Minimize redundancy and maximize relevance.

How can we do this without knowing all features a priori?

## Slide 37: Feature Relevance and Conditional Independence

- Strong relevance
- Weak relevance (if not strongly relevant)
- Irrelevance (if neither strongly nor weakly relevant)

## Slide 38: Redundant Features and the Markov Blanket

Weakly relevant features are divided into redundant and non-redundant features, based on a Markov blanket criterion.

## Slide 39: Streaming Feature Selection Framework

## Slide 40: Question

How can we guarantee that after we remove a feature from our selected subset, we will never need it again?

[21] D. Koller and M. Sahami, "Toward Optimal Feature Selection," Proc. Int'l Conf. Machine Learning, pp. 284-292, 1996.

## Slide 41: OSFS Algorithm (Online Streaming Feature Selection)

## Slide 42: Dependence and Independence Tests

- Ind(C, X|S) and Dep(C, X|S) in OSFS
- Conditional independence/dependence tests identify irrelevant and redundant features.
- The tests are implemented using the G² test.

## Slide 43: Complexity of OSFS

Time complexity of OSFS depends on the number of tests:
- |M|: number of features that have arrived so far
- k: maximum size to which a conditioning set may grow
- The number of subsets that need to be checked in BCF grows with k and |BCF|, so the worst-case time complexity is determined by the size of BCF.
- If only a small number of features in a large feature space are predictive and relevant to C (which is the case in many real-world applications), the algorithm is efficient.

Memory efficiency of OSFS: it only needs to store a small number of relevant features at any moment.

## Slide 44: Fast-OSFS Algorithm

If the size of BCF is large, redundancy analysis will reduce the runtime performance of the algorithm. To improve efficiency, divide the redundant-feature handling step into two parts:
1. Determine whether the incoming feature is redundant.
2. Identify which of the already-selected features may become redundant if the new feature is added to the selected set.

## Slide 45: Fast-OSFS Algorithm

## Slide 46: Time Complexity of Fast-OSFS

- |M|: number of features that have arrived so far
- |SF|: number of all relevant features among them
- The best and worst cases differ in whether the second part has to be performed for the relevant features.
- The second part only considers subsets of BCF that contain the newly added feature, provided the size of the subsets does not exceed k.

## Slide 47: A Case Study: Automatic Impact Crater Detection

- Impact craters are among the most studied geomorphic features in the solar system.
- They yield information about past and present geological processes.
- They are the only tool for remotely measuring the relative ages of geological formations.
- Planetary probes deliver huge volumes of high-resolution images, so we need tools to utilize these images scientifically (automated analysis).
- Tens of thousands of texture-based features at different scales and resolutions can be generated for crater detection.
- Features are expensive to generate and store, so efficient feature selection is needed.

## Slide 48: Crater Detection Framework

- Crater candidates: regions that may potentially contain craters.
- Key insight: a sub-kilometer crater can be recognized as a pair of crescent-like highlight and shadow regions in an image.
- Experiments were evaluated on Mars: a portion of a High Resolution Stereo Camera image (near-global coverage) was selected as the test set.
- 12.5 meters/pixel, 3000 × 4500 pixels (37,500 m × 56,250 m).
- The image represents a significant challenge: spatially variable terrain morphology and poor contrast.

## Slide 49

## Slide 50: Streaming Feature Selection Framework

## Slide 51: Experimental Results

Slide52Online Learning from Trapezoidal Data Streams

Zhang, Q., Zhang, P., Long, G., Ding, W., Zhang, C. And Wu, X. (2016). Online Learning From Trapezoidal Data Streams. IEEE Transactions On Knowledge And Data Engineering, 28(10), 2709-2723.52

Presenter: Ege Beyazit

## Slide 53: Motivation

What if both the data volume and the dimension increase over time?

- Graph node classification: the number of nodes changes dynamically, and the node features change dynamically (e.g., social network nodes).
- Text classification and clustering: the number of documents increases over time, and the text vocabulary grows over time (e.g., the infinite vocabulary topic model).

## Slide 54: Trapezoidal Data Streams

Continuous learning from doubly-streaming data, where both the data volume and the feature space increase over time: trapezoidal.

When a new training instance carrying new features arrives:
1. Update the existing features: passive-aggressive update rule.
2. Update the new features: structural risk minimization principle.
3. Enforce feature sparsity: feature projection and truncation.

## Slide 55: Problem Setting

Binary classification problem:
- Input training data: {(xt, yt) | t = 1, …, T}
- xt ∈ ℝ^(dt) is a dt-dimensional vector, where dt−1 ≤ dt
- yt ∈ {−1, +1} for all t

Linear classifier:
- Vector of weights: wt ∈ ℝ^(dt−1) is the vector we aim to solve for at round t
- wt has the same dimension as the instance xt−1

Loss function — the hinge loss is used:

l(w, (xt, yt)) = max{0, 1 − yt (w · xt)}
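The loss above is easy to state in code. A small sketch of my own (the zero-padding of the classifier when an instance carries new features is an assumption that matches the trapezoidal setting described above):

```python
import numpy as np

def hinge_loss(w, x, y):
    """l(w, (x_t, y_t)) = max{0, 1 - y_t (w . x_t)}.
    When x_t carries new features (len(x) > len(w)), the classifier is
    implicitly padded with zeros so the inner product is defined."""
    if len(x) > len(w):
        w = np.concatenate([w, np.zeros(len(x) - len(w))])
    return max(0.0, 1.0 - y * float(w @ x))
```

A loss of zero means the instance is classified correctly with a margin of at least one, which is the condition under which the classifier is left unchanged.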

## Slide 56: Online Learning with Trapezoidal Data Streams

Two challenges to be addressed:
1. Updating the classifier with an augmenting feature space: the update strategy should be able to learn from new features (margin-maximization principle).
2. Building a feature selection method to achieve an efficient model: since the dimension increases over time, feature selection should be used to prune redundant features (a projection + truncation strategy to obtain sparsity).

## Slide 57: Algorithm 1: OLSF

## Slide 58: The Update Strategy

At round t, given the classifier wt ∈ ℝ^(dt−1), obtain the new classifier by solving a constrained optimization problem.

## Slide 59: Update Strategy Summarized

- If the dimension of the new instance does not change, the problem reduces to a standard online learning problem.
- If the current classifier correctly classifies the new instance, do not update the classifier.
- If the current classifier cannot classify the new instance, solve the optimization problem and update the classifier.

## Slide 60: Update Strategy in Depth

How do we solve the constrained optimization problem? Build the Lagrangian function and solve.
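The step above follows the familiar passive-aggressive pattern. A sketch of the textbook derivation under the assumption that the objective keeps the new weights close to the old ones (writing w̃t for wt zero-padded to dimension dt; the paper's exact formulation may differ in details such as slack terms):

```latex
\begin{aligned}
&\min_{w}\ \tfrac{1}{2}\,\lVert w - \tilde{w}_t \rVert^2
  \quad \text{s.t.}\quad 1 - y_t\,(w \cdot x_t) \le 0, \\[4pt]
&L(w, \tau) = \tfrac{1}{2}\,\lVert w - \tilde{w}_t \rVert^2
  + \tau\,\bigl(1 - y_t\,(w \cdot x_t)\bigr), \qquad \tau \ge 0, \\[4pt]
&\nabla_w L = 0 \;\Rightarrow\; w = \tilde{w}_t + \tau\, y_t\, x_t, \\[4pt]
&\text{substituting back:}\quad
  \tau_t = \frac{\max\{0,\ 1 - y_t\,(\tilde{w}_t \cdot x_t)\}}{\lVert x_t \rVert^2},
  \qquad w_{t+1} = \tilde{w}_t + \tau_t\, y_t\, x_t .
\end{aligned}
```

The update is "passive" when the hinge loss is zero (τt = 0, no change) and "aggressive" otherwise, moving just far enough to satisfy the margin constraint.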

## Slide 61: Update Strategy in Depth

Problem: the model is sensitive to noise (especially label noise). Introduce two update variants.

## Slide 62: The Sparsity Strategy

The dimension of the training instances increases rapidly, so select a relatively small number of features. Two steps: project the updated classifier onto an L1 ball, then truncate it.
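The two-step idea can be sketched as follows. This is an illustrative sketch, not the paper's operator: `project_l1` is the standard sorting-based Euclidean projection onto the L1 ball, and `truncate` is a simple keep-the-largest-magnitudes rule of my own.

```python
import numpy as np

def project_l1(w, radius=1.0):
    """Euclidean projection of w onto the L1 ball of the given radius
    (standard sorting-based algorithm)."""
    if np.abs(w).sum() <= radius:
        return w.copy()          # already inside the ball
    u = np.sort(np.abs(w))[::-1]            # magnitudes, descending
    css = np.cumsum(u)
    # Largest index rho with u_rho > (css_rho - radius) / (rho + 1).
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    # Soft-threshold every coordinate by theta.
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0)

def truncate(w, n_keep):
    """Keep only the n_keep largest-magnitude weights, zeroing the rest."""
    out = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-n_keep:]
    out[idx] = w[idx]
    return out
```

The projection shrinks all weights toward zero (driving many exactly to zero), and the truncation then caps the number of surviving features, which is what keeps the model small as the dimension grows.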

## Slide 63: Question 1

What is the difference between the EM algorithm and ML estimation?

- The EM algorithm tries to find an ML estimate for the parameters of a model with latent variables.
- In each iteration of the algorithm, the latent variables are computed and used to maximize the likelihood; they are like 'side effects' of the maximization process.
- EM is not guaranteed to converge to the global maximum, but it is guaranteed to converge to a local maximum and to improve the likelihood of the model at every step.

## Slide 64: Question 2

What is a Markov blanket, and why is it considered a good set for feature selection?

In a Bayesian network, the Markov blanket of a node contains all the variables that shield the node from the rest of the network. This means that the Markov blanket of a node is the only knowledge needed to predict the behavior of that node.

## Slide 65: Question 3

How do we model the process of learning from trapezoidal streams as an optimization problem?

- Force the new classifier to be close to the old one, so as not to lose the information gained from previous data.
- Keep ŵ small to avoid overfitting.
- Use the hinge loss as a constraint.

## Slide 66: Questions?
