Slide1
Scalable Statistical Relational Learning for NLP
William Y. Wang, William W. Cohen
Machine Learning Dept. and Language Technologies Inst.
joint work with: Kathryn Rivard Mazaitis
Slide2
Outline
- Motivation
- Background
  - Logic
  - Probability
  - Combining logic and probabilities: MLNs
- ProPPR
  - Key ideas
  - Learning method
  - Results for parameter learning
- Structure learning for ProPPR for KB completion
- Joint IE and KB completion
- Comparison to neural KBC models
- Beyond ProPPR…
Slide3
Motivation
Slide4

Slide5
KR & Reasoning
Queries → (Inference Methods, Inference Rules) → Answers
Challenges for KR:
- Robustness: noise, incompleteness, ambiguity (“Sunnybrook”), statistical information (“foundInRoom(bathtub, bathroom)”), …
- Complex queries: “which Canadian hockey teams have won the Stanley Cup?”
- Learning: how to acquire and maintain knowledge and inference rules, as well as how to use them
Current state of the art: “Expressive, probabilistic, efficient: pick any two”
What if the DB/KB or inference rules are imperfect?
Slide6
Large ML-based software system
Machine Learning (for complex tasks)
Relational, joint learning and inference
Slide7
Background
H/T: “Probabilistic Logic Programming”, De Raedt and Kersting
Slide8
Background: Logic Programs
A program with one definite clause (a Horn clause):
  grandparent(X,Y) :- parent(X,Z),parent(Z,Y)
Logical variables: X, Y, Z
Constant symbols: bob, alice, …
We’ll consider two types of clauses:
- Horn clauses A :- B1,…,Bk with no constants
- Unit clauses A :- with no variables (facts): parent(alice,bob) :- or just parent(alice,bob)
H/T: “Probabilistic Logic Programming”, De Raedt and Kersting
In a clause A :- B1,…,Bk, A is the head, B1,…,Bk is the body, and “:-” is the “neck”.
Rules give an intensional definition; the database of facts gives an extensional definition.
Slide9
Background: Logic Programs
A program with one definite clause:
  grandparent(X,Y) :- parent(X,Z),parent(Z,Y)
Logical variables: X, Y, Z
Constant symbols: bob, alice, …
Predicates: grandparent/2, parent/2
Alphabet: set of possible predicates and constants
Atomic formulae: parent(X,Y), parent(alice,bob)
Ground atomic formulae: parent(alice,bob), …
H/T: “Probabilistic Logic Programming”, De Raedt and Kersting
Slide10
Background: Logic Programs
The set of all ground atomic formulae (consistent with a fixed alphabet) is the Herbrand base of a program: {parent(alice,alice), parent(alice,bob), …, parent(zeke,zeke), grandparent(alice,alice), …}
An interpretation of a program is a subset of the Herbrand base.
An interpretation M is a model of a program if, for any A :- B1,…,Bk in the program and any mapping Theta from the variables in A,B1,…,Bk to constants: if Theta(B1) ∈ M and … and Theta(Bk) ∈ M, then Theta(A) ∈ M.
A program defines a unique least Herbrand model.
H/T: “Probabilistic Logic Programming”, De Raedt and Kersting
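The “model of a program” condition above can be sketched directly in code. This is a minimal illustration for the grandparent clause only; the function name, the constant list, and the tuple encoding of atoms are assumptions of this sketch, not anything from the talk.

```python
# Check the model condition for grandparent(X,Y) :- parent(X,Z), parent(Z,Y):
# M is a model (w.r.t. this clause) if every grounding Theta whose body
# atoms are in M also has its head atom in M. Atoms are encoded as tuples.

def is_model(M, constants):
    for x in constants:
        for y in constants:
            for z in constants:
                body_holds = ("parent", x, z) in M and ("parent", z, y) in M
                if body_holds and ("grandparent", x, y) not in M:
                    return False
    return True

M = {("parent", "alice", "bob"), ("parent", "bob", "chip"),
     ("grandparent", "alice", "chip")}
print(is_model(M, ["alice", "bob", "chip"]))  # True: the one grounding with a true body has its head in M
```

Dropping grandparent(alice,chip) from M would make the check fail, since the body parent(alice,bob), parent(bob,chip) would hold without the head.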
Slide11
Background: Logic Programs
A program defines a unique least Herbrand model.
Example program:
  grandparent(X,Y) :- parent(X,Z),parent(Z,Y).
  parent(alice,bob). parent(bob,chip). parent(bob,dana).
The least Herbrand model also includes grandparent(alice,chip) and grandparent(alice,dana).
Finding the least Herbrand model: theorem proving…
Usually we care about answering queries: what are the values of W in grandparent(alice,W)?
H/T: “Probabilistic Logic Programming”, De Raedt and Kersting
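A minimal sketch of computing the least Herbrand model for the example program above by naive forward chaining: start from the facts and apply the rule until a fixpoint. The function name and tuple encoding are illustrative choices of this sketch.

```python
# Naive forward chaining: close the parent facts under
# grandparent(X,Y) :- parent(X,Z), parent(Z,Y), iterating to a fixpoint.

def least_herbrand_model(parent_facts):
    model = {("parent", x, y) for x, y in parent_facts}
    changed = True
    while changed:
        changed = False
        for (p1, x, z1) in list(model):
            for (p2, z2, y) in list(model):
                if p1 == "parent" and p2 == "parent" and z1 == z2:
                    atom = ("grandparent", x, y)
                    if atom not in model:
                        model.add(atom)
                        changed = True
    return model

facts = [("alice", "bob"), ("bob", "chip"), ("bob", "dana")]
model = least_herbrand_model(facts)
# model now also contains grandparent(alice,chip) and grandparent(alice,dana)
```

Real theorem provers answer queries like grandparent(alice,W) without materializing the whole model; the fixpoint construction here is the semantic definition, feasible only for tiny programs.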
Slide12
Motivation
Queries → (Inference Methods, Inference Rules) → Answers
Challenges for KR:
- Robustness: noise, incompleteness, ambiguity (“Sunnybrook”), statistical information (“foundInRoom(bathtub, bathroom)”), …
- Complex queries: “which Canadian hockey teams have won the Stanley Cup?”
    query(T) :- play(T,hockey), hometown(T,C), country(C,canada)
    {T : query(T)} ?
- Learning: how to acquire and maintain knowledge and inference rules, as well as how to use them
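The conjunctive query above can be sketched as a join over a toy fact base. All the team and city names below are made-up illustrations, not data from the talk.

```python
# Answer query(T) :- play(T,hockey), hometown(T,C), country(C,canada)
# by joining three relations; facts are stored as sets of tuples.

facts = {
    "play":     {("maple_leafs", "hockey"), ("canadiens", "hockey"),
                 ("yankees", "baseball")},
    "hometown": {("maple_leafs", "toronto"), ("canadiens", "montreal"),
                 ("yankees", "new_york")},
    "country":  {("toronto", "canada"), ("montreal", "canada"),
                 ("new_york", "usa")},
}

def query(facts):
    """Return {T : play(T,hockey), hometown(T,C), country(C,canada)}."""
    answers = set()
    for t, sport in facts["play"]:
        if sport != "hockey":
            continue
        for t2, c in facts["hometown"]:
            if t2 == t and (c, "canada") in facts["country"]:
                answers.add(t)
    return answers

print(sorted(query(facts)))  # the two hockey teams with Canadian hometowns
```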
Slide13
Background
Random variables: burglary, earthquake, … Usually denoted with upper-case letters: B, E, A, J, M
Joint distribution: Pr(B,E,A,J,M)

  B E A J M | prob
  T T T T T | 0.00001
  F T T T T | 0.03723
  …

H/T: “Probabilistic Logic Programming”, De Raedt and Kersting
Slide14
Background: Bayes networks
Random variables: B, E, A, J, M
Joint distribution: Pr(B,E,A,J,M)
Directed graphical models give one way of defining a compact model of the joint distribution.
Queries: Pr(A=t | J=t, M=f) ?

  A J | Pr(J|A)
  F F | 0.95
  F T | 0.05
  T F | 0.25
  T T | 0.75

  A M | Pr(M|A)
  F F | 0.80
  …

H/T: “Probabilistic Logic Programming”, De Raedt and Kersting
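The conditional query on this slide can be answered by brute-force enumeration over the joint. The Pr(J|A) column and the Pr(M=f|A=f) = 0.80 entry come from the tables above; every other CPT number below (the B and E priors, the Pr(A|B,E) table, and Pr(M=t|A=t)) is an assumed illustrative value, not a parameter from the talk.

```python
from itertools import product

# CPTs keyed by parent values; each entry is the probability the child is True.
P_B = {True: 0.001, False: 0.999}                  # Pr(B) -- assumed
P_E = {True: 0.002, False: 0.998}                  # Pr(E) -- assumed
P_A = {(True, True): 0.95, (True, False): 0.94,    # Pr(A=t | B,E) -- assumed
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.75, False: 0.05}                    # Pr(J=t | A) -- from the slide
P_M = {True: 0.70, False: 0.20}                    # Pr(M=t | A); 0.70 is assumed

def joint(b, e, a, j, m):
    """Pr(B=b,E=e,A=a,J=j,M=m) as the product of the CPT entries."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Pr(A=t | J=t, M=f): sum out B and E, then normalize over A.
num = sum(joint(b, e, True, True, False) for b, e in product([True, False], repeat=2))
den = num + sum(joint(b, e, False, True, False) for b, e in product([True, False], repeat=2))
print(num / den)
```

Enumeration is exponential in the number of variables; it is shown here only because five binary variables give just 32 terms.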
Slide16
Background: Markov networks
Random variables: B, E, A, J, M
Joint distribution: Pr(B,E,A,J,M)
Undirected graphical models give another way of defining a compact model of the joint distribution, via potential functions.
ϕ(A=a, J=j) is a scalar measuring the “compatibility” of A=a and J=j.

  A J | ϕ(a,j)
  F F | 20
  F T | 1
  T F | 0.1
  T T | 0.4
Slide17
Background
ϕ(A=a, J=j) is a scalar measuring the “compatibility” of A=a and J=j; it is a clique potential.

  A J | ϕ(a,j)
  F F | 20
  F T | 1
  T F | 0.1
  T T | 0.4
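For a network with a single clique, the joint distribution is just the potential table normalized by Z; a minimal sketch using the ϕ(A,J) values from the table above:

```python
# Turn the clique potential phi(A,J) into a proper distribution:
# Pr(A=a, J=j) = phi(a,j) / Z, with Z summing phi over all configurations.

phi = {(False, False): 20.0, (False, True): 1.0,
       (True, False): 0.1, (True, True): 0.4}

Z = sum(phi.values())                      # Z = 20 + 1 + 0.1 + 0.4 = 21.5
P = {aj: v / Z for aj, v in phi.items()}   # Pr(A=a, J=j)

print(P[(False, False)])  # the most compatible configuration, 20/21.5
```

With several cliques, the joint is the normalized product of all the potentials rather than a single table.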
Slide18
Motivation
Queries → (Inference Methods, Inference Rules) → Answers
Challenges for KR:
- Robustness: noise, incompleteness, ambiguity (“Sunnybrook”), statistical information (“foundInRoom(bathtub, bathroom)”), …
- Complex queries: “which Canadian hockey teams have won the Stanley Cup?”
- Learning: how to acquire and maintain knowledge and inference rules, as well as how to use them
Graphical models: in a space of “flat” propositions corresponding to single random variables.
Slide19
Background
???
H/T: “Probabilistic Logic Programming”, De Raedt and Kersting
Slide20
Outline
- Motivation
- Background
  - Logic
  - Probability
  - Combining logic and probabilities: MLNs
- ProPPR
  - Key ideas
  - Learning method
  - Results for parameter learning
- Structure learning for ProPPR for KB completion
- Joint IE and KB completion
- Comparison to neural KBC models
- Beyond ProPPR…
Slide21
Markov Networks: [Review]
Undirected graphical models: Cancer, Cough, Asthma, Smoking

  Smoking Cancer | Ф(S,C)
  False   False  | 4.5
  False   True   | 4.5
  True    False  | 2.7
  True    True   | 4.5

  P(x) = (1/Z) ∏_c Ф_c(x_c)

where x is the vector of all variables and x_c is the short vector of variables in clique c.

H/T: Pedro Domingos
Slide22
Markov Logic: Intuition
A logical KB is a set of hard constraints on the set of possible worlds.
Let’s make them soft constraints: when a world violates a formula, it becomes less probable, not impossible.
Give each formula a weight (higher weight ⇒ stronger constraint).
H/T: Pedro Domingos
Slide23
Markov Logic: Definition
A Markov Logic Network (MLN) is a set of pairs (F, w) where
- F is a formula in first-order logic
- w is a real number
Together with a set of constants, it defines a Markov network with
- one node for each grounding of each predicate in the MLN
- one feature for each grounding of each formula F in the MLN, with the corresponding weight w
H/T: Pedro Domingos
Slide24
Example: Friends & Smokers
H/T: Pedro Domingos
Slide27
Example: Friends & Smokers
Two constants: Anna (A) and Bob (B)
H/T: Pedro Domingos
Slide28
Example: Friends & Smokers
Two constants: Anna (A) and Bob (B)
Nodes: Smokes(A), Cancer(A), Smokes(B), Cancer(B)
H/T: Pedro Domingos
Slide29
Example: Friends & Smokers
Two constants: Anna (A) and Bob (B)
Nodes: Smokes(A), Cancer(A), Smokes(B), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
H/T: Pedro Domingos
Slide32
Markov Logic Networks
An MLN is a template for ground Markov nets.
Probability of a world x:

  P(x) = (1/Z) exp( Σ_i w_i n_i(x) )

where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x.
H/T: Pedro Domingos
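The world-probability formula can be sketched end-to-end for a two-constant domain. The two formulas and their weights below are the standard Friends & Smokers illustration from Domingos' tutorials, assumed here rather than read off these slides: 1.5: Smokes(x) ⇒ Cancer(x), and 1.1: Friends(x,y) ⇒ (Smokes(x) ⇔ Smokes(y)).

```python
from itertools import product
from math import exp

# P(x) = (1/Z) exp(sum_i w_i n_i(x)), grounded for constants A and B.
CONSTS = ["A", "B"]
W_SMOKE_CANCER, W_FRIENDS = 1.5, 1.1   # assumed weights

ATOMS = ([("Smokes", x) for x in CONSTS] + [("Cancer", x) for x in CONSTS] +
         [("Friends", x, y) for x in CONSTS for y in CONSTS])

def n_true_groundings(world):
    """n_i(x): count true groundings of each formula; world maps atoms to bools."""
    n1 = sum((not world[("Smokes", x)]) or world[("Cancer", x)] for x in CONSTS)
    n2 = sum((not world[("Friends", x, y)]) or
             (world[("Smokes", x)] == world[("Smokes", y)])
             for x in CONSTS for y in CONSTS)
    return n1, n2

def score(world):
    n1, n2 = n_true_groundings(world)
    return exp(W_SMOKE_CANCER * n1 + W_FRIENDS * n2)

# Z enumerates all 2^8 = 256 worlds -- feasible only for tiny domains.
worlds = [dict(zip(ATOMS, vals)) for vals in product([False, True], repeat=len(ATOMS))]
Z = sum(score(w) for w in worlds)

def prob(world):
    return score(world) / Z
```

A world that violates a formula grounding (say, Smokes(A) true but Cancer(A) false) gets a strictly lower score, not zero: the soft-constraint intuition from Slide22.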
Slide33
MLNs generalize many statistical models
Special cases:
- Markov networks
- Markov random fields
- Bayesian networks
- Log-linear models
- Exponential models
- Max. entropy models
- Gibbs distributions
- Boltzmann machines
- Logistic regression
- Hidden Markov models
- Conditional random fields
These are obtained by making all predicates zero-arity. Markov logic allows objects to be interdependent (non-i.i.d.).
Slide34
MLNs generalize logic programs
- Subsets of the Herbrand base ↔ the domain of the joint distribution
- An interpretation ↔ an element of the joint
- Consistency with all clauses A :- B1,…,Bk (“model of the program”) ↔ compatibility with the program as determined by clique potentials
Markov logic reaches pure logic in the limit, when the potentials are infinite.
Slide35
MLNs are expensive
- Inference is done by explicitly building a ground MLN.
- The Herbrand base is huge for reasonable programs: it grows faster than the size of the DB of facts.
- You’d like to be able to use a huge DB; NELL is O(10M) facts.
- Inference on an arbitrary MLN is expensive: #P-complete.
- It’s not obvious how to restrict the template so the MLNs will be tractable.
Slide36
What’s the alternative?
There are many probabilistic LPs:
- Compile to other 0th-order formats (Bayesian LPs, ProbLog, …)
- Impose a distribution over proofs, not interpretations (Probabilistic Constraint LPs, Stochastic LPs, …): requires generating all proofs to answer queries, also a large space
- Sample from the space of proofs (PRISM, Blog)
- Limited relational extensions to 0th-order models (PRMs, RDTs, MEBNs, …)
- Probabilistic programming languages (Church, …)
- Imperative languages for defining complex probabilistic models (related LP work: PRISM)
Slide37
Outline
- Motivation
- Background
  - Logic
  - Probability
  - Combining logic and probabilities: MLNs
- ProPPR
  - Key ideas
  - Learning method
  - Results for parameter learning
- Structure learning for ProPPR for KB completion
- Joint IE and KB completion
- Comparison to neural KBC models
- Beyond ProPPR…