Scalable Statistical Relational Learning for NLP - PowerPoint Presentation




Presentation Transcript

Slide1

Scalable Statistical Relational Learning for NLP

William Y. Wang, William W. Cohen
Machine Learning Dept and Language Technologies Inst.
Joint work with: Kathryn Rivard Mazaitis

Slide2

Outline
- Motivation
- Background
  - Logic
  - Probability
  - Combining logic and probabilities: MLNs
- ProPPR
  - Key ideas
  - Learning method
  - Results for parameter learning
- Structure learning for ProPPR for KB completion
- Joint IE and KB completion
- Comparison to neural KBC models
- Beyond ProPPR…

Slide3

Motivation

Slide4

Slide5

KR & Reasoning: Inference Methods, Inference Rules; Queries → Answers

Challenges for KR:
- Robustness: noise, incompleteness, ambiguity (“Sunnybrook”), statistical information (“foundInRoom(bathtub, bathroom)”), …
- Complex queries: “which Canadian hockey teams have won the Stanley Cup?”
- Learning: how to acquire and maintain knowledge and inference rules, as well as how to use them

“Expressive, probabilistic, efficient: pick any two” (current state of the art)

What if the DB/KB or inference rules are imperfect?

Slide6

Large ML-based software system

Machine Learning (for complex tasks)

Relational, joint learning and inference

Slide7

Background

H/T: “Probabilistic Logic Programming,” De Raedt and Kersting

Slide8

Background: Logic Programs
A program with one definite clause (a Horn clause):
grandparent(X,Y) :- parent(X,Z),parent(Z,Y)
- Logical variables: X, Y, Z
- Constant symbols: bob, alice, …

We’ll consider two types of clauses:
- Horn clauses A:-B1,…,Bk with no constants
- Unit clauses A:- with no variables (facts): parent(alice,bob):- or just parent(alice,bob)

H/T: “Probabilistic Logic Programming,” De Raedt and Kersting

(Clause anatomy: grandparent(X,Y) is the head, parent(X,Z),parent(Z,Y) is the body, and “:-” is the “neck”. Rules give an intensional definition; the database of facts gives an extensional definition.)

Slide9

Background: Logic Programs
A program with one definite clause:
grandparent(X,Y) :- parent(X,Z),parent(Z,Y)
- Logical variables: X, Y, Z
- Constant symbols: bob, alice, …
- Predicates: grandparent/2, parent/2
- Alphabet: set of possible predicates and constants
- Atomic formulae: parent(X,Y), parent(alice,bob)
- Ground atomic formulae: parent(alice,bob), …

H/T: “Probabilistic Logic Programming,” De Raedt and Kersting

Slide10

Background: Logic Programs
- The set of all ground atomic formulae (consistent with a fixed alphabet) is the Herbrand base of a program: {parent(alice,alice), parent(alice,bob), …, parent(zeke,zeke), grandparent(alice,alice), …}
- An interpretation of a program is a subset of the Herbrand base.
- An interpretation M is a model of a program if, for any A:-B1,…,Bk in the program and any mapping Theta from the variables in A,B1,…,Bk to constants: if Theta(B1) ∈ M and … and Theta(Bk) ∈ M, then Theta(A) ∈ M.
- A program defines a unique least Herbrand model.

H/T: “Probabilistic Logic Programming,” De Raedt and Kersting

Slide11

Background: Logic Programs
A program defines a unique least Herbrand model. Example program:
grandparent(X,Y) :- parent(X,Z),parent(Z,Y).
parent(alice,bob). parent(bob,chip). parent(bob,dana).
The least Herbrand model also includes grandparent(alice,chip) and grandparent(alice,dana).
Finding the least Herbrand model: theorem proving…
Usually we care about answering queries: what are the values of W in grandparent(alice,W)?

H/T: “Probabilistic Logic Programming,” De Raedt and Kersting
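To make the least-model computation concrete, here is a minimal bottom-up (forward-chaining) sketch in Python; it illustrates the idea and is not code from the talk:

# Compute the least Herbrand model of the example program by forward
# chaining to a fixed point, then answer grandparent(alice,W) by lookup.
facts = {("parent", "alice", "bob"),
         ("parent", "bob", "chip"),
         ("parent", "bob", "dana")}

changed = True
while changed:  # apply grandparent(X,Y) :- parent(X,Z), parent(Z,Y) until nothing new
    changed = False
    derived = {("grandparent", x, y)
               for (p1, x, z) in facts if p1 == "parent"
               for (p2, z2, y) in facts if p2 == "parent" and z2 == z}
    if not derived <= facts:
        facts |= derived
        changed = True

# Query: what are the values of W in grandparent(alice, W)?
print(sorted(w for (p, a, w) in facts if p == "grandparent" and a == "alice"))
# -> ['chip', 'dana']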

Slide12

Motivation
Inference Methods, Inference Rules; Queries → Answers

Challenges for KR:
- Robustness: noise, incompleteness, ambiguity (“Sunnybrook”), statistical information (“foundInRoom(bathtub, bathroom)”), …
- Complex queries: “which Canadian hockey teams have won the Stanley Cup?”
  query(T) :- play(T,hockey), hometown(T,C), country(C,canada)
  {T : query(T)} ?
- Learning: how to acquire and maintain knowledge and inference rules, as well as how to use them
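As a small illustration of evaluating such a conjunctive query (the team and city facts below are made up for the example, not from the talk):

# Evaluate query(T) :- play(T,hockey), hometown(T,C), country(C,canada)
# against a toy fact DB; the team/city facts here are illustrative only.
facts = {("play", "canadiens", "hockey"),
         ("play", "bruins", "hockey"),
         ("hometown", "canadiens", "montreal"),
         ("hometown", "bruins", "boston"),
         ("country", "montreal", "canada"),
         ("country", "boston", "usa")}

answers = {t
           for (p1, t, s) in facts if p1 == "play" and s == "hockey"
           for (p2, t2, c) in facts if p2 == "hometown" and t2 == t
           if ("country", c, "canada") in facts}
print(answers)  # {T : query(T)} -> {'canadiens'}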

Slide13

Background
- Random variables: burglary, earthquake, … Usually denoted with upper-case letters: B, E, A, J, M
- Joint distribution: Pr(B,E,A,J,M)

H/T: “Probabilistic Logic Programming,” De Raedt and Kersting

B E A J M | prob
T T T T T | 0.00001
F T T T T | 0.03723
…

Slide14

Background: Bayes networks
- Random variables: B, E, A, J, M
- Joint distribution: Pr(B,E,A,J,M)
- Directed graphical models give one way of defining a compact model of the joint distribution
- Queries: Pr(A=t | J=t, M=f) ?

H/T: “Probabilistic Logic Programming,” De Raedt and Kersting

A J | Prob(J|A)
F F | 0.95
F T | 0.05
T F | 0.25
T T | 0.75

A M | Prob(M|A)
F F | 0.80
…
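To show what answering such a query involves, here is a brute-force enumeration sketch. The Prob(J|A) numbers come from the table above, and Pr(M=t|A=f)=0.20 follows from the slide’s 0.80 entry; the priors on B and E, the alarm CPT, and Pr(M=t|A=t) are illustrative placeholders, not values from the talk:

import itertools

pB = {True: 0.001, False: 0.999}   # assumed prior Pr(B): placeholder values
pE = {True: 0.002, False: 0.998}   # assumed prior Pr(E): placeholder values
pA = {(True, True): 0.95, (True, False): 0.94,    # assumed Pr(A=t|B,E): placeholders
      (False, True): 0.29, (False, False): 0.001}
pJ = {True: 0.75, False: 0.05}     # Pr(J=t|A) from the slide's CPT
pM = {True: 0.70, False: 0.20}     # Pr(M=t|A=f)=0.20 from the slide; Pr(M=t|A=t) assumed

def joint(b, e, a, j, m):
    """Pr(B,E,A,J,M) factored along the directed graph B,E -> A -> J,M."""
    pa = pA[(b, e)] if a else 1 - pA[(b, e)]
    pj = pJ[a] if j else 1 - pJ[a]
    pm = pM[a] if m else 1 - pM[a]
    return pB[b] * pE[e] * pa * pj * pm

# Pr(A=t | J=t, M=f) = Pr(A=t, J=t, M=f) / Pr(J=t, M=f),
# summing out B and E (and also A in the denominator).
num = sum(joint(b, e, True, True, False)
          for b, e in itertools.product([True, False], repeat=2))
den = sum(joint(b, e, a, True, False)
          for b, e, a in itertools.product([True, False], repeat=3))
print(f"Pr(A=t | J=t, M=f) = {num/den:.4f}")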

Slide15

Background
- Random variables: B, E, A, J, M
- Joint distribution: Pr(B,E,A,J,M)
- Directed graphical models give one way of defining a compact model of the joint distribution
- Queries: Pr(A=t | J=t, M=f) ?

H/T: “Probabilistic Logic Programming,” De Raedt and Kersting

(Same Prob(J|A) table as on the previous slide.)

Slide16

Background: Markov networks
- Random variables: B, E, A, J, M
- Joint distribution: Pr(B,E,A,J,M)
- Undirected graphical models give another way of defining a compact model of the joint distribution, via potential functions.
- ϕ(A=a, J=j) is a scalar measuring the “compatibility” of A=a and J=j; the joint is a normalized product of such potentials: Pr(b,e,a,j,m) ∝ ϕ(…) × ϕ(…) × …

A J | ϕ(a,j)
F F | 20
F T | 1
T F | 0.1
T T | 0.4

Slide17

Background
ϕ(A=a, J=j) is a scalar measuring the “compatibility” of A=a and J=j; each such table of ϕ values is a clique potential.

(Same ϕ(a,j) table as on the previous slide.)
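A minimal sketch of how potentials define probabilities: with a single clique over A and J, Pr(a,j) is just the normalized potential, using the ϕ(a,j) values from the slide’s table:

# Pr(x) = (1/Z) * prod_c phi_c(x_c); with one clique {A, J} this is phi(a,j)/Z.
phi = {(False, False): 20.0,   # phi(A=a, J=j), values from the slide's table
       (False, True): 1.0,
       (True, False): 0.1,
       (True, True): 0.4}

Z = sum(phi.values())          # partition function (normalizer)
for (a, j), v in sorted(phi.items()):
    print(f"Pr(A={a}, J={j}) = {v / Z:.3f}")
# The large phi(F,F) entry makes A=F, J=F by far the most probable configuration.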

Slide18

Motivation
Inference Methods, Inference Rules; Queries → Answers

Challenges for KR:
- Robustness: noise, incompleteness, ambiguity (“Sunnybrook”), statistical information (“foundInRoom(bathtub, bathroom)”), …
- Complex queries: “which Canadian hockey teams have won the Stanley Cup?”
- Learning: how to acquire and maintain knowledge and inference rules, as well as how to use them

…in a space of “flat” propositions corresponding to single random variables

Slide19

Background

H/T: “Probabilistic Logic Programming,” De Raedt and Kersting

Slide20

Outline
- Motivation
- Background
  - Logic
  - Probability
  - Combining logic and probabilities: MLNs
- ProPPR
  - Key ideas
  - Learning method
  - Results for parameter learning
- Structure learning for ProPPR for KB completion
- Joint IE and KB completion
- Comparison to neural KBC models
- Beyond ProPPR…

Slide21

Markov Networks: [Review]
Undirected graphical models. Example: variables Smoking, Cancer, Asthma, Cough.

Smoking Cancer | Ф(S,C)
False   False  | 4.5
False   True   | 4.5
True    False  | 2.7
True    True   | 4.5

P(x) = (1/Z) Π_c Ф_c(x_c), where x is the vector of all variable values and x_c is the short vector of values of the variables in clique c.

H/T: Pedro Domingos

Slide22

Markov Logic: Intuition
- A logical KB is a set of hard constraints on the set of possible worlds
- Let’s make them soft constraints: when a world violates a formula, it becomes less probable, not impossible
- Give each formula a weight (higher weight ⇒ stronger constraint)

H/T: Pedro Domingos

Slide23

Markov Logic: Definition
A Markov Logic Network (MLN) is a set of pairs (F, w) where:
- F is a formula in first-order logic
- w is a real number
Together with a set of constants, it defines a Markov network with:
- One node for each grounding of each predicate in the MLN
- One feature for each grounding of each formula F in the MLN, with the corresponding weight w

H/T: Pedro Domingos
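To make the grounding step concrete, a small sketch (illustrative, not the authors’ code) that enumerates the ground-atom nodes an MLN template induces for a given set of predicates and constants; the Friends & Smokers signature below anticipates the example that follows:

from itertools import product

predicates = {"Smokes": 1, "Cancer": 1, "Friends": 2}   # predicate name -> arity
constants = ["Anna", "Bob"]

# One node per grounding of each predicate: an arity-k predicate
# yields |constants|^k ground atoms.
ground_atoms = [f"{p}({','.join(args)})"
                for p, arity in predicates.items()
                for args in product(constants, repeat=arity)]
print(ground_atoms)
# ['Smokes(Anna)', 'Smokes(Bob)', 'Cancer(Anna)', 'Cancer(Bob)',
#  'Friends(Anna,Anna)', 'Friends(Anna,Bob)', 'Friends(Bob,Anna)', 'Friends(Bob,Bob)']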

Slide24

Example: Friends & Smokers

H/T: Pedro Domingos


Slide27

Example: Friends & Smokers

Two constants: Anna (A) and Bob (B)

H/T: Pedro Domingos

Slide28

Example: Friends & Smokers

Ground predicate nodes: Cancer(A), Smokes(A), Smokes(B), Cancer(B)

Two constants: Anna (A) and Bob (B)

H/T: Pedro Domingos

Slide29

Example: Friends & Smokers

Ground predicate nodes: Cancer(A), Smokes(A), Smokes(B), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)

Two constants: Anna (A) and Bob (B)

H/T: Pedro Domingos


Slide32

Markov Logic Networks
- An MLN is a template for ground Markov nets
- Probability of a world x:

  P(X=x) = (1/Z) exp( Σ_i w_i n_i(x) )

  where w_i is the weight of formula i and n_i(x) is the number of true groundings of formula i in x

H/T: Pedro Domingos
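A brute-force sketch of this formula on the two-constant Friends & Smokers network. The slide images with the formulas did not survive extraction, so the two formulas and their weights (1.5, 1.1) below are assumed illustrative choices, not values taken from these slides:

import itertools, math

CONSTS = ["A", "B"]   # Anna and Bob
ATOMS = ([("Smokes", x) for x in CONSTS] + [("Cancer", x) for x in CONSTS]
         + [("Friends", x, y) for x in CONSTS for y in CONSTS])
WEIGHTS = [1.5, 1.1]  # illustrative formula weights, not from these slides

def n_true_groundings(world):
    """n_i(x): true groundings of each formula in world x (atom -> bool)."""
    n1 = sum((not world[("Smokes", x)]) or world[("Cancer", x)]   # Smokes(x) => Cancer(x)
             for x in CONSTS)
    n2 = sum((not world[("Friends", x, y)])                       # Friends(x,y) =>
             or (world[("Smokes", x)] == world[("Smokes", y)])    #  (Smokes(x) <=> Smokes(y))
             for x in CONSTS for y in CONSTS)
    return [n1, n2]

def score(world):  # unnormalized exp(sum_i w_i * n_i(x))
    return math.exp(sum(w * n for w, n in zip(WEIGHTS, n_true_groundings(world))))

worlds = [dict(zip(ATOMS, vals))
          for vals in itertools.product([False, True], repeat=len(ATOMS))]
Z = sum(score(w) for w in worlds)   # partition function over all 2^8 worlds

p = sum(score(w) for w in worlds if w[("Cancer", "A")]) / Z
print(f"P(Cancer(A)) = {p:.3f}")    # a marginal query against the MLN distribution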

Slide33

MLNs generalize many statistical models
Special cases:
- Markov networks
- Markov random fields
- Bayesian networks
- Log-linear models
- Exponential models
- Max. entropy models
- Gibbs distributions
- Boltzmann machines
- Logistic regression
- Hidden Markov models
- Conditional random fields
These are obtained by making all predicates zero-arity. Markov logic allows objects to be interdependent (non-i.i.d.).

Slide34

MLNs generalize logic programs
- Subsets of the Herbrand base → domain of the joint distribution
- Interpretation → element of the joint
- Consistency with all clauses A:-B1,…,Bk (“model of program”) → compatibility with the program as determined by clique potentials
- Reaches logic in the limit when potentials are infinite

Slide35

MLNs are expensive
- Inference is done by explicitly building a ground MLN
- The Herbrand base is huge for reasonable programs: it grows faster than the size of the DB of facts
- You’d like to be able to use a huge DB: NELL is O(10M) facts
- Inference on an arbitrary MLN is expensive: #P-complete
- It’s not obvious how to restrict the template so the MLNs will be tractable

Slide36

What’s the alternative? There are many probabilistic LPs:
- Compile to other 0th-order formats (Bayesian LPs, ProbLog, …)
- Impose a distribution over proofs, not interpretations (Probabilistic Constraint LPs, Stochastic LPs, …): requires generating all proofs to answer queries, also a large space
- Sample from the space of proofs (PRISM, BLOG)
- Limited relational extensions to 0th-order models (PRMs, RDTs, MEBNs, …)
- Probabilistic programming languages (Church, …)
- Imperative languages for defining complex probabilistic models (related LP work: PRISM)

Slide37

Outline
- Motivation
- Background
  - Logic
  - Probability
  - Combining logic and probabilities: MLNs
- ProPPR
  - Key ideas
  - Learning method
  - Results for parameter learning
- Structure learning for ProPPR for KB completion
- Joint IE and KB completion
- Comparison to neural KBC models
- Beyond ProPPR…