Representation and Learning in Directed Mixed Graph Models

Ricardo Silva
Statistical Science/CSML, University College London
ricardo@stats.ucl.ac.uk

Networks: Processes and Causality, Menorca 2012
Graphical Models

Graphs provide a language for describing independence constraints
Applications to causal and probabilistic processes
The corresponding probabilistic models should obey the constraints encoded in the graph
Example: P(X1, X2, X3) is Markov with respect to the graph below if X1 is independent of X3 given X2 in P(·)
[Figure: graph over X1, X2, X3]
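As a concrete reading of the Markov condition (my illustration, assuming the figure shows the chain X1 -> X2 -> X3):

    P(x_1, x_2, x_3) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_2)
    \;\Rightarrow\;
    P(x_3 \mid x_1, x_2)
      = \frac{P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_2)}{P(x_1)\,P(x_2 \mid x_1)}
      = P(x_3 \mid x_2),

that is, X1 ⊥ X3 | X2, so the distribution satisfies the constraint the graph encodes.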
Directed Graphical Models

[Figure: a DAG over X1, X2, X3, X4 with a latent variable U, annotated with (in)dependence queries: X2 ⊥ X4?; X2 ⊥ X4 | X3?; X2 ⊥ X4 | {X3, U}?; ...]
Marginalization

[Figure: the same DAG, with the latent U to be marginalized out, and the same (in)dependence queries: X2 ⊥ X4?; X2 ⊥ X4 | X3?; X2 ⊥ X4 | {X3, U}?; ...]
Marginalization

[Figure: three candidate DAGs over X1, X2, X3, X4, each marked with "?"]
No: X1 ⊥ X3 | X2
No: X2 ⊥ X4 | X3
OK, but not ideal: X2 ⊥ X4
The Acyclic Directed Mixed Graph (ADMG)

"Mixed" as in directed + bi-directed
"Directed" for obvious reasons
"Acyclic" for the usual reasons
Independence model is:
- closed under marginalization (generalizes DAGs)
- different from chain graphs / undirected graphs (see also: chain graphs)
Analogous inference as in DAGs: m-separation
[Figure: an ADMG over X1, X2, X3, X4]
(Richardson and Spirtes, 2002; Richardson, 2003)
Why do we care?

Difficulty in computing scores or tests
Identifiability: theoretical issues and implications for optimization
[Figure: two candidate latent structures over observed Y1-Y6; Candidate I with latent variables X1, X2; Candidate II with latent variables X1, X2, X3]
Why do we care?

Set of "target" latent variables X (possibly none), and observations Y
A second set of "nuisance" latent variables, with sparse structure implied over Y
[Figure: observed Y1-Y6 with target latents X1, X2 and nuisance latents X3, X4, X5]
Why do we care?

[Figure not recovered from the extracted slides]
(Bollen, 1989)
The talk in a nutshell

The challenge: how to specify families of distributions that respect the ADMG independence model and require no explicit latent variable formulation
How NOT to do it: make everybody independent!
Needed: rich families. How rich?
Main results: a new construction that is fairly general, easy to use, and complements the state of the art
Exploring this in structure learning problems
First, background: current parameterizations, their good and bad points
The Gaussian bi-directed model
The Gaussian bi-directed case
(Drton and Richardson, 2003)
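For context, a standard way to write the Gaussian bi-directed (covariance graph) model, here for an illustrative bi-directed chain X1 <-> X2 <-> X3 <-> X4 (the specific graph is my assumption, not taken from the slide):

    X \sim N(0, \Sigma), \qquad
    \Sigma =
    \begin{pmatrix}
    \sigma_{11} & \sigma_{12} & 0           & 0 \\
    \sigma_{12} & \sigma_{22} & \sigma_{23} & 0 \\
    0           & \sigma_{23} & \sigma_{33} & \sigma_{34} \\
    0           & 0           & \sigma_{34} & \sigma_{44}
    \end{pmatrix} \succ 0,

so that σij = 0, i.e. Xi and Xj are marginally independent, exactly when the edge Xi <-> Xj is absent.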
Binary bi-directed case: the constrained Moebius parameterization

(Drton and Richardson, 2008)
Binary bi-directed case: the constrained Moebius parameterization

Disconnected sets are marginally independent. Hence, define qA for connected sets only:
P(X1 = 0, X4 = 0) = P(X1 = 0) P(X4 = 0), i.e. q14 = q1 q4
However, notice there is a parameter q1234
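A minimal sketch (mine, not from the slides) of how the Moebius parameters qA = P(XA = 0) determine the joint mass function by inclusion-exclusion, assuming the full collection of q-parameters is available as a dictionary keyed by frozensets:

    from itertools import chain, combinations

    def powerset(s):
        s = list(s)
        return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

    def joint_prob(x, q, variables):
        """P(X = x) from Moebius parameters q[A] = P(X_A = 0).

        x: dict variable -> 0/1; q: dict frozenset -> probability, with
        q[frozenset()] = 1. For disconnected A, the constrained parameterization
        takes q[A] to be the product of the q's of its connected components.
        """
        zeros = frozenset(v for v in variables if x[v] == 0)
        ones = [v for v in variables if x[v] == 1]
        total = 0.0
        for c in powerset(ones):          # inclusion-exclusion over the "1" coordinates
            total += (-1) ** len(c) * q[zeros | frozenset(c)]
        return total

    # Two marginally independent binary variables: q1 = 0.6, q2 = 0.3, q12 = q1 * q2.
    q = {frozenset(): 1.0, frozenset({1}): 0.6, frozenset({2}): 0.3, frozenset({1, 2}): 0.18}
    print(joint_prob({1: 1, 2: 0}, q, [1, 2]))   # 0.12 = (1 - 0.6) * 0.3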
Binary bi-directed case: the constrained Moebius parameterization

The good: this parameterization is complete. Every binary bi-directed model can be represented with it.
The bad: the Moebius inverse is intractable, and the number of connected sets can grow exponentially even for trees of low connectivity
The Cumulative Distribution Network (CDN) approach

Parameterizing cumulative distribution functions (CDFs) by a product of functions defined over subsets
Sufficient condition: each factor is a CDF itself
Independence model: the “same” as the bi-directed graph... but with extra constraints
(Huang and Frey, 2008)
F(x1, x2, x3, x4) = F1(x1, x2) F2(x2, x4) F3(x3, x4) F4(x1, x3)
Queries such as X1 ⊥ X4? and X1 ⊥ X4 | X2? are read off the corresponding bi-directed graph
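A minimal sketch (my own illustration, with made-up factor choices) of the CDN idea for the four-variable example above: the joint CDF is a product of clique CDFs, and the joint PMF for binary data is recovered by finite differencing:

    import numpy as np
    from itertools import product

    # Hypothetical bivariate CDF factor on {0,1}^2: any valid binary CDF works;
    # here a simple lookup table with F(1, 1) = 1.
    def make_factor(p00, p01, p10):
        table = {(0, 0): p00, (0, 1): p01, (1, 0): p10, (1, 1): 1.0}
        return lambda a, b: table[(a, b)]

    factors = {               # cliques of the bi-directed 4-cycle X1-X2-X4-X3-X1
        (0, 1): make_factor(0.20, 0.45, 0.40),
        (1, 3): make_factor(0.25, 0.50, 0.45),
        (2, 3): make_factor(0.20, 0.40, 0.50),
        (0, 2): make_factor(0.30, 0.55, 0.50),
    }

    def cdf(y):
        """F(y) = product over cliques of F_clique(y restricted to the clique)."""
        return np.prod([f(y[i], y[j]) for (i, j), f in factors.items()])

    def pmf(y):
        """P(Y = y) by inclusion-exclusion (finite differences) over the 1-coordinates."""
        ones = [i for i, v in enumerate(y) if v == 1]
        total = 0.0
        for flips in product([0, 1], repeat=len(ones)):
            z = list(y)
            for idx, f in zip(ones, flips):
                z[idx] = f
            total += (-1) ** (len(ones) - sum(flips)) * cdf(z)
        return total

    # Sanity check: the PMF sums to one over {0,1}^4.
    print(sum(pmf(y) for y in product([0, 1], repeat=4)))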
The Cumulative Distribution Network (CDN) approach

Which extra constraints?
F(x1, x2, x3) = F1(x1, x2) F2(x2, x3)
Meaning: the event “X1 ≤ x1” is independent of “X3 ≤ x3” given “X2 ≤ x2”
Clearly not true in general distributions
If there is no natural order for variable values, the encoding does matter
[Figure: bi-directed chain X1 <-> X2 <-> X3]
Relationship

CDN: the resulting PMF (usual CDF-to-PMF transform)
Moebius: the resulting PMF is equivalent
Notice: qB = P(XB = 0) = P(X\B ≤ 1, XB ≤ 0)
However, in a CDN, parameters further factorize over cliques:
q1234 = q12 q13 q24 q34
Relationship

Calculating likelihoods can be easily reduced to inference in factor graphs under a “pseudo-distribution”
Example: find the joint distribution of X1, X2, X3 below
[Figure: a bi-directed graph over X1, X2, X3 reduces to a factor graph over X1, X2, X3 with auxiliary variables Z1, Z2, Z3]
P(X = x) = ...
Relationship

CDN models are a strict subset of marginal independence models
Binary case: Moebius should still be the approach of choice where only independence constraints are the target
E.g., jointly testing the implication of independence assumptions
But... CDN models have a reasonable number of parameters, for small tree-widths any fitting criterion is tractable, and learning is trivially tractable anyway by marginal composite likelihood estimation (more on that later)
Take-home message: a still flexible bi-directed graph model with no need for latent variables to make fitting tractable
The Mixed CDN model (MCDN)

How to construct a distribution Markov to this?
[Figure: an example ADMG]
The binary ADMG parameterization by Richardson (2009) is complete, but has the same computational difficulties
And how do we easily extend it to non-Gaussian, infinite discrete cases, etc.?
Step 1: The high-level factorization

A district is a maximal set of vertices connected by bi-directed edges
For an ADMG G with vertex set XV and districts {Di}, define
P(XV) = Πi Pi(XDi | Xpa_G(Di)\Di)
where each Pi() is a density/mass function and pa_G() are the parents of the given set in G
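A minimal sketch (an illustration of the definition, not code from the talk) of extracting districts as connected components of the bi-directed part of an ADMG:

    def districts(vertices, bidirected_edges):
        """Maximal sets of vertices connected by bi-directed edges."""
        adj = {v: set() for v in vertices}     # adjacency restricted to bi-directed edges
        for a, b in bidirected_edges:
            adj[a].add(b)
            adj[b].add(a)
        seen, result = set(), []
        for v in vertices:
            if v in seen:
                continue
            comp, stack = set(), [v]           # depth-first search from v
            while stack:
                u = stack.pop()
                if u in comp:
                    continue
                comp.add(u)
                stack.extend(adj[u] - comp)
            seen |= comp
            result.append(comp)
        return result

    # Consistent with the factorization example on the later slide:
    # bi-directed edges X1 <-> X2 and X3 <-> X4 (directed edges omitted here).
    print(districts([1, 2, 3, 4], [(1, 2), (3, 4)]))   # [{1, 2}, {3, 4}]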
Step 1: The high-level factorization

Also, assume that each Pi( | ) is Markov with respect to the subgraph Gi, the graph we obtain from the corresponding subset
We can show the resulting distribution is Markov with respect to the ADMG
[Figure: the subgraphs Gi induced by each district and its parents]
Step 1: The high-level factorization

Despite the seemingly "cyclic" appearance, this factorization always gives a valid P() for any choice of Pi( | ):

P(X134) = Σx2 P(X1, x2 | X4) P(X3, X4 | X1)
        = P(X1 | X4) P(X3, X4 | X1)

P(X13) = Σx4 P(X1 | x4) P(X3, x4 | X1)
       = Σx4 P(X1) P(X3, x4 | X1)
       = P(X1) P(X3 | X1)
Step 2: Parameterizing Pi (barren case)

Di is a "barren" district if there is no directed edge within it
[Figure: an example of a barren district vs. a NOT barren district]
Step 2: Parameterizing Pi (barren case)

For a district Di with clique set Ci (with respect to the bi-directed structure), start with a product of conditional CDFs
Each factor FS(xS | xP) is a conditional CDF, P(XS ≤ xS | XP = xP). (These have to be transformed back to PMFs/PDFs when writing the full likelihood function.)
On top of that, each FS(xS | xP) is defined to be Markov with respect to the corresponding Gi
We show that the corresponding product is Markov with respect to Gi
Step 2a: A copula formulation of Pi

Implementing the local factor restriction could be complicated, but the problem can be easily approached by adopting a copula formulation
A copula function is just a CDF with uniform [0, 1] marginals
Main point: to provide a parameterization of a joint distribution that unties the parameters of the marginals from the remaining parameters of the joint
Step 2a: A copula formulation of Pi

Gaussian latent variable analogy:
[Figure: U -> X1, U -> X2]
X1 = λ1 U + e1,  e1 ~ N(0, v1)
X2 = λ2 U + e2,  e2 ~ N(0, v2)
U ~ N(0, 1)
Marginal of X1: N(0, λ1^2 + v1)
Covariance of X1, X2: λ1 λ2
Parameter sharing
Step 2a: A copula formulation of Pi

Copula idea: start from
F(X1, X2) = F( F1^-1(F1(X1)), F2^-1(F2(X2)) )
then define H(Ya, Yb) accordingly, where 0 ≤ Y* ≤ 1:
H(Ya, Yb) ≡ F( F1^-1(Ya), F2^-1(Yb) )
H( , ) will be a CDF with uniform [0, 1] marginals
For any Fi() of choice, Ui ≡ Fi(Xi) gives a uniform [0, 1]
We mix-and-match any marginals we want with any copula function we want
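A minimal sketch (my own, using a bivariate Gaussian copula as the hypothetical choice of H, and illustrative exponential/gamma marginals) of mixing and matching marginals with a copula:

    import numpy as np
    from scipy.stats import norm, multivariate_normal, expon, gamma

    def gaussian_copula(u1, u2, rho):
        """H(u1, u2): bivariate Gaussian CDF evaluated at the normal quantiles."""
        cov = [[1.0, rho], [rho, 1.0]]
        return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf(
            [norm.ppf(u1), norm.ppf(u2)])

    def joint_cdf(x1, x2, rho):
        """F(x1, x2) = H(F1(x1), F2(x2)) for arbitrary marginals F1, F2."""
        u1 = expon(scale=2.0).cdf(x1)     # marginal of X1 (illustrative choice)
        u2 = gamma(a=3.0).cdf(x2)         # marginal of X2 (illustrative choice)
        return gaussian_copula(u1, u2, rho)

    print(joint_cdf(1.5, 2.0, rho=0.6))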
Step 2a: A copula formulation of Pi

The idea is to use a conditional marginal Fi(Xi | pa(Xi)) within a copula
Example:
[Figure: district {X2, X3} with parents X1 and X4]
U2(x1) ≡ P2(X2 ≤ x2 | x1)
U3(x4) ≡ P3(X3 ≤ x3 | x4)
P(X2 ≤ x2, X3 ≤ x3 | x1, x4) = H(U2(x1), U3(x4))
Check:
P(X2 ≤ x2 | x1, x4) = H(U2(x1), 1) = U2(x1) = P2(X2 ≤ x2 | x1)
Step 2a: A copula formulation of Pi

Not done yet! We need this:
A product of copulas is not a copula
However, results in the literature are helpful here. It can be shown that plugging in Ui^(1/d(i)) instead of Ui will turn the product into a copula, where d(i) is the number of bi-directed cliques containing Xi
Liebscher (2008)
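A minimal sketch (my own, with the Clayton copula as a hypothetical choice of bivariate factor, and the bi-directed cliques from the earlier CDN example) of the power trick: each argument Ui enters a clique factor as Ui^(1/d(i)), so that the product over cliques is again a copula:

    def clayton(u, v, theta=2.0):
        """Bivariate Clayton copula C(u, v; theta), theta > 0."""
        return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

    cliques = [(1, 2), (2, 4), (3, 4), (1, 3)]
    d = {i: sum(i in c for c in cliques) for i in (1, 2, 3, 4)}  # d(i): cliques containing Xi

    def product_copula(u):
        """Product over cliques of C(ui^(1/d(i)), uj^(1/d(j)))."""
        val = 1.0
        for i, j in cliques:
            val *= clayton(u[i] ** (1.0 / d[i]), u[j] ** (1.0 / d[j]))
        return val

    # Marginals stay uniform: with all other arguments at 1, the value is just u2.
    print(product_copula({1: 1.0, 2: 0.37, 3: 1.0, 4: 1.0}))  # 0.37 (up to rounding)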
Step 3: The non-barren case

What should we do in this case?
[Figure: an example of a barren district vs. a NOT barren district]
Parameter learning

For the purposes of illustration, assume a finite mixture of experts for the conditional marginals for continuous data
For discrete data, just use the standard CPT formulation found in Bayesian networks
Parameter learning

Copulas: we use a bivariate formulation only (so we take products "over edges" instead of "over cliques")
In the experiments: the Frank copula
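Since the Frank copula is the concrete choice named above, here is a minimal sketch (mine) of its CDF and of the density needed for the constant CDF-to-PDF transformations mentioned on the next slide:

    import numpy as np

    def frank_cdf(u, v, theta):
        """Frank copula C(u, v; theta), theta != 0."""
        g = lambda t: np.expm1(-theta * t)          # exp(-theta*t) - 1
        return -np.log1p(g(u) * g(v) / g(1.0)) / theta

    def frank_density(u, v, theta):
        """Copula density c(u, v; theta) = d^2 C / du dv (used in the likelihood)."""
        g = lambda t: np.expm1(-theta * t)
        denom = (g(1.0) + g(u) * g(v)) ** 2
        return -theta * g(1.0) * np.exp(-theta * (u + v)) / denom

    # Quick checks: C(u, 1) = u, and the density averages to ~1 over the unit square.
    print(frank_cdf(0.3, 1.0, theta=2.0))           # ~0.3
    grid = np.linspace(0.0005, 0.9995, 1000)
    uu, vv = np.meshgrid(grid, grid)
    print(frank_density(uu, vv, theta=2.0).mean())  # ~1.0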
Parameter learning

Suggestion: two-stage quasi-Bayesian learning
Analogous to other approaches in the copula literature
Fit marginal parameters using the posterior expected value of the parameter for each individual mixture of experts
Plug those into the model, then do MCMC on the copula parameters
Relatively efficient, decent mixing even with random-walk proposals
Nothing stopping you from using a fully Bayesian approach, but mixing might be bad without some smarter proposals
Notice: needs constant CDF-to-PDF/PMF transformations!
Experiments

[Figures/tables of experimental results, not recovered from the extracted slides]
The story so far

General toolbox for the construction of ADMG models
Alternative estimators would be welcome:
Bayesian inference is still "doubly-intractable" (Murray et al., 2006), but district size might be small enough even if one has many variables
Either way, composite likelihood is still simple. Combined with the Huang + Frey dynamic programming method, it could go a long way
Hybrid Moebius/CDN parameterizations to be exploited
Empirical applications in problems with extreme-value issues, exploring non-independence constraints, relations to effect models in the potential outcome framework, etc.
Back to: Learning Latent Structure

Difficulty in computing scores or tests
Identifiability: theoretical issues and implications for optimization
[Figure: the two candidate latent structures over observed Y1-Y6 again; Candidate I with latent variables X1, X2; Candidate II with latent variables X1, X2, X3]
Leveraging Domain Structure

Exploiting "main" factors
[Figure: questionnaire items Y7a-Y7e with parent factor X7, and items Y12a-Y12c with parent factor X12]
(NHS Staff Survey, 2009)
The "Structured Canonical Correlation" Structural Space

Set of pre-specified latent variables X (possibly none), observations Y
Each Y in Y has a pre-specified single parent in X
A set of unknown latent variables (the nuisance set), disjoint from X
Each Y in Y can have potentially infinitely many parents in the nuisance set
"Canonical correlation" in the sense of modeling dependencies within a partition of observed variables
[Figure: observed Y1-Y6 with pre-specified latents X1, X2 and unknown latents X3, X4, X5]
The "Structured Canonical Correlation": Learning Task

Assume a partition structure of Y according to X is known
Define the mixed graph projection of a graph over (X, Y) by a bi-directed edge Yi <-> Yj if they share a common ancestor in the nuisance set
Practical assumption: the bi-directed substructure is sparse
Goal: learn the bi-directed structure (and parameters) so that one can estimate functionals of P(X | Y)
[Figure: observed Y1-Y6 with latents X1, X2 and bi-directed edges among the Y's]
Parametric Formulation

X ~ N(0, Σ), Σ positive definite
Ignore the possibility of causal/sparse structure in X for simplicity
For a fixed graph G, parameterize the conditional cumulative distribution function (CDF) of Y given X according to the bi-directed structure:
F(y | x) ≡ P(Y ≤ y | X = x) = Πi Pi(Yi ≤ yi | X[i] = x[i])
Each set Yi forms a bi-directed clique in G, X[i] being the corresponding parents in X of the set Yi
We assume here each Y is binary for simplicity
Parametric Formulation

In order to calculate the likelihood function, one should convert from the (conditional) CDF to the probability mass function (PMF):
P(y, x) = {ΔF(y | x)} P(x)
ΔF(y | x) represents a difference operator. As we discussed before, for p-dimensional binary (unconditional) F(y), this boils down to the inclusion-exclusion sum below.
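The formula itself did not survive extraction; the standard inclusion-exclusion form (my reconstruction, consistent with the earlier binary CDN computation) is:

    P(Y = y) \;=\; \Delta F(y)
      \;=\; \sum_{B \subseteq S(y)} (-1)^{|S(y) \setminus B|}\, F(z_B),
      \qquad S(y) = \{i : y_i = 1\},

where z_B has ones exactly on the coordinates in B and zeros elsewhere.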
Learning with Marginal Likelihoods

For Xj a parent of Yi in X: let [definition not recovered from the extracted slides]
Marginal likelihood: [expression not recovered from the extracted slides]
Pick the graph Gm that maximizes the marginal likelihood (maximizing also with respect to Θ and Σ), where Θ parameterizes the local conditional CDFs Fi(yi | x[i])
Computational Considerations

Intractable, of course
Including possibly large tree-width of the bi-directed component
First option: marginal bivariate composite likelihood
Gm+/- is the space of graphs that differ from Gm by at most one bi-directed edge
Integrates the pairwise copula parameter θij and X(1:N) out with a crude quadrature method
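A minimal sketch (my own, with a stand-in pairwise marginal) of a bivariate composite log-likelihood of the kind referred to above: sum the log marginal likelihood of each pair of observed variables over the candidate bi-directed edges:

    import numpy as np

    def pairwise_composite_loglik(Y, pairs, pair_marginal_loglik, params):
        """Sum of bivariate log marginal likelihoods over the given pairs.

        Y: (N, p) data matrix; pairs: iterable of column index pairs (i, j);
        pair_marginal_loglik: function giving log P(Y[:, i], Y[:, j] | params),
        e.g. with the latent X and the pairwise parameter integrated out by
        quadrature (a stand-in here, since those details are model-specific).
        """
        return sum(pair_marginal_loglik(Y[:, i], Y[:, j], params) for i, j in pairs)

    # Hypothetical stand-in: independent Bernoulli(0.5) margins, just to show the call.
    def toy_pair_loglik(yi, yj, params):
        return np.log(0.5) * (len(yi) + len(yj))

    Y = np.random.binomial(1, 0.5, size=(100, 4))
    print(pairwise_composite_loglik(Y, [(0, 1), (1, 3), (2, 3), (0, 2)],
                                    toy_pair_loglik, params=None))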
Beyond Pairwise Models

Wanted: to include terms that account for more than pairwise interactions
Gets expensive really fast
An indirect compromise:
Still only pairwise terms, just like PCL
However, integrate θij not over the prior, but over some posterior that depends on more than Yi(1:N), Yj(1:N)
Key idea: collect evidence from p(θij | YS(1:N)), {i, j} ⊆ S, and plug it into the expected log marginal likelihood. This corresponds to bounding each term of the log-composite likelihood score with different distributions for θij.
Beyond Pairwise Models

New score function
Sk: the observed children of Xk in X
Notice: multiple copies of the likelihood for θij when Yi and Yj have the same latent parent
Use this function to optimize the parameters {Θ, Σ} (but not necessarily the structure)
Learning with Marginal Likelihoods

Illustration: for each pair of latents Xi, Xj, do:
qij(θij) ∝ p(θij | YS, Θ, Σ)
Compute by a Laplace approximation and dynamic programming
qij(θij) then defines an expected log-likelihood contribution for each pair of observed children Ya, Yb:
... + E_q(θij)[ log P(Ya, Yb, θij | Θ, Σ) ] + ...
Marginalize θij and add the term to the score
Algorithm 2

qmn comes from conditioning on all variables that share a parent with Yi and Yj
In practice, we use PCL when optimizing structure
EM issues with discrete optimization: the model without an edge has an advantage, sometimes a bad saddle point
Experiments: Synthetic Data

20 networks of 4 latent variables with 4 children per latent variable
Average number of bi-directed edges: ~18
Evaluation criteria:
- mean-squared error of the slope estimate for each observed variable
- edge omission error (false negatives)
- edge commission error (false positives)
Comparison against "single-shot" learning: fit the model without bi-directed edges, add edge Yi <-> Yj if the implied pairwise distribution P(Yi, Yj) doesn't fit the data
Essentially a single iteration of Algorithm 1
Experiments: Synthetic Data

Quantify results by taking the difference between the number of times Algorithm 2 does better than Algorithms 1 and 0 ("single-shot" learning)
Report the number of times where the difference is positive, with the corresponding p-values for a Wilcoxon signed-rank test (stars indicate p-values less than 0.05)
Experiments: NHS Data

Fit a model with 9 factors and 50 variables on the NHS data, using the questionnaire as the partition structure
100,000 points in the training set; about 40 edges discovered
Evaluation:
- test the contribution of bi-directed edge dependencies to P(X | Y): compare against the model without bi-directed edges
- comparison by predictive ability: find an embedding for each X(d) given Y(d) by maximizing [expression not recovered from the extracted slides]
- test on an independent 50,000 points by evaluating how well we can predict the other 11 answers based on the latent representation, using logistic regression
Experiments: NHS Data

MCCA: mixed graph structured canonical correlation model
SCCA: null model (without bi-directed edges)
[Table: AUC scores for each of the 11 binary prediction problems using the estimated X as covariates]
Conclusion

Marginal composite likelihood and mixed graph models are a good match
Still requires some choices of approximations for posteriors over parameters, and numerical methods for integration
Future work:
- theoretical properties of the alternative marginal composite likelihood estimator
- identifiability issues
- reducing the number of evaluations of qmn
- non-binary data
- which families could avoid multiple passes over the data?