Slide 1

Inside-Outside & Forward-Backward Algorithms are just Backprop
Jason Eisner
(tutorial paper)

Slide 2

"The inside-outside algorithm is the hardest algorithm I know." – a senior NLP researcher, in the 1990s

What does inside-outside do? It computes expected counts of substructures ... that appear in the true parse of some sentence w.
Slide 3

Where are the constituents?
[figure: one bracketing of "coal energy expert witness", p = 0.5]

Slide 4

Where are the constituents?
[figure: another bracketing of "coal energy expert witness", p = 0.1]

Slide 5

Where are the constituents?
[figure: another bracketing of "coal energy expert witness", p = 0.1]

Slide 6

Where are the constituents?
[figure: another bracketing of "coal energy expert witness", p = 0.1]

Slide 7

Where are the constituents?
[figure: another bracketing of "coal energy expert witness", p = 0.2]

Slide 8

Where are the constituents?
[figure: all five bracketings of "coal energy expert witness", with probabilities 0.5 + 0.1 + 0.1 + 0.1 + 0.2 = 1]
Slide 9

Where are NPs, VPs, … ?
[figure: the sentence "Time flies like an arrow" shown twice, once marking NP locations and once marking VP locations, under a parse using the labels S, NP, VP, PP, V, P, Det, N]
Slide 10

Where are NPs, VPs, … ?
[figure: NP locations and VP locations in "Time flies like an arrow" for this parse]
p = 0.5
(S (NP Time) (VP flies (PP like (NP an arrow))))
Slide 11

Where are NPs, VPs, … ?
[figure: NP locations and VP locations in "Time flies like an arrow" for this parse]
p = 0.3
(S (NP Time flies) (VP like (NP an arrow)))
Slide 12

Where are NPs, VPs, … ?
[figure: NP locations and VP locations in "Time flies like an arrow" for this parse]
p = 0.1
(S (VP Time (NP (NP flies) (PP like (NP an arrow)))))
Slide 13

Where are NPs, VPs, … ?
[figure: NP locations and VP locations in "Time flies like an arrow" for this parse]
p = 0.1
(S (VP (VP Time (NP flies)) (PP like (NP an arrow))))
Slide 14

Where are NPs, VPs, … ?
[figure: NP locations and VP locations accumulated over all parses, with probabilities 0.5 + 0.3 + 0.1 + 0.1 = 1]
Slide 15

How many NPs, VPs, … in all?
[figure: the same NP and VP locations, with probabilities 0.5 + 0.3 + 0.1 + 0.1 = 1]
Slide 16

How many NPs, VPs, … in all?
[figure: NP locations and VP locations in "Time flies like an arrow", each weighted by parse probability]
2.1 NPs (expected)
1.1 VPs (expected)
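These totals can be checked against the four parses on slides 10–13, which contain 2, 2, 3, and 2 NPs and 1, 1, 1, and 2 VPs respectively:

Expected # of NPs = 0.5·2 + 0.3·2 + 0.1·3 + 0.1·2 = 2.1
Expected # of VPs = 0.5·1 + 0.3·1 + 0.1·1 + 0.1·2 = 1.1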
Slide 17

Where did the rules apply?
[figure: "Time flies like an arrow" shown twice, once marking S → NP VP locations and once marking NP → Det N locations]
Slide 18

Where did the rules apply?
[figure: S → NP VP locations and NP → Det N locations for this parse]
p = 0.5
(S (NP Time) (VP flies (PP like (NP an arrow))))
Slide 19

Where is S → NP VP substructure?
[figure: S → NP VP locations and NP → Det N locations for this parse]
p = 0.3
(S (NP Time flies) (VP like (NP an arrow)))
Slide 20

Where is S → NP VP substructure?
[figure: S → NP VP locations and NP → Det N locations for this parse]
p = 0.1
(S (VP Time (NP (NP flies) (PP like (NP an arrow)))))
Slide 21

Where is S → NP VP substructure?
[figure: S → NP VP locations and NP → Det N locations for this parse]
p = 0.1
(S (VP (VP Time (NP flies)) (PP like (NP an arrow))))
Slide 22

Why do we want this info?

Grammar reestimation by the EM method:
- The E step collects those expected counts
- The M step renormalizes them into new rule probabilities (see the update below)

Minimum Bayes Risk decoding:
- Find a tree that maximizes expected reward, e.g., the expected total # of correct constituents
- A CKY-like dynamic programming algorithm; the input specifies the probability of correctness for each possible constituent (e.g., VP from 1 to 5)
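The M-step update on the original slide was an image; shown here for concreteness is the standard EM re-estimation formula for a PCFG rule, which is presumably what the slide displayed:

$$p_{\text{new}}(A \to B\,C) \;=\; \frac{E[\mathrm{count}(A \to B\,C)]}{\sum_{A \to \gamma} E[\mathrm{count}(A \to \gamma)]}$$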
Slide 23

Why do we want this info?

Soft features of a sentence for other tasks:
- An NER system asks: "Is there an NP from 0 to 2?" The true answer is 1 (true) or 0 (false), but we return 0.3, averaging over all parses. That's a perfectly good feature value: it can be fed to a CRF or a neural network as an input feature.
- A writing tutor system asks: "How many times did the student use S → NP[singular] VP[plural]?" The true answer is in {0, 1, 2, …}, but we return 1.8, averaging over all parses.
Slide 24

How do we compute it all? How I usually teach it (historically accurate):

CKY recognizer → (add weights) → inside algorithm → (develop an analogy to forward-backward) → inside-outside algorithm

But the inside-outside algorithm is confusing, and I wave my hands instead of giving a real proof.
Slide 25

Inside & Outside Probabilities
[figure: parse tree of "time flies like an arrow today", with a VP spanning words 1–5]

β_VP(1,5) = p(flies like an arrow | VP)    ("inside" the VP)
α_VP(1,5) = p(time VP today | S)           ("outside" the VP)

α_VP(1,5) · β_VP(1,5) = p(time [VP flies like an arrow] today | S)
Slide 26

Inside & Outside Probabilities
[figure: the same parse tree of "time flies like an arrow today"]

β_VP(1,5) = p(flies like an arrow | VP)
α_VP(1,5) = p(time VP today | S)

α_VP(1,5) · β_VP(1,5) = p(time flies like an arrow today & VP(1,5) | S)

Dividing by β_S(0,6) = p(time flies like an arrow today | S) gives

α_VP(1,5) · β_VP(1,5) / β_S(0,6) = p(VP(1,5) | time flies like an arrow today, S)
Slide 27

Inside & Outside Probabilities
[figure: the same parse tree, alongside an HMM trellis (Start, with C and H states) for the finite-state analogy]

So α_VP(1,5) · β_VP(1,5) / β_S(0,6) is the probability that there is a VP here, given all of the observed data (words).
This is analogous to forward-backward in the finite-state case!
Slide 28

How do we compute it all? How I usually teach it (historically accurate):

CKY recognizer → (add weights) → inside algorithm → (develop an analogy to forward-backward) → inside-outside algorithm

But the inside-outside algorithm is confusing, and I wave my hands instead of giving a real proof. This talk: we get that last step for free, as backprop!
Slide 29

Back-Propagation
[figure: an arithmetic circuit (computation graph)]
Slide 30

Energy-Based Models
Define … Then … Log-linear case: …
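The slide's equations were images; what follows is a standard reconstruction of the energy-based-model definitions, consistent with the log-linear case used on the next slide:

$$p(x) \;=\; \frac{\exp(-E(x))}{Z}, \qquad Z \;=\; \sum_{x} \exp(-E(x))$$

Log-linear case: $E(x) = -\theta^\top f(x)$ for a feature vector $f$, so $p(x) \propto \exp(\theta^\top f(x))$ and $Z = \sum_x \exp(\theta^\top f(x))$.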
Slide 31

p(tree T | sentence w) is log-linear
Here Z is the total prob of all parses of w. Therefore, ∇ log Z will give expected rule counts.
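Spelling out that claim in one line, with f(T) the vector of rule counts of tree T so that the log-linear form above applies:

$$\nabla_\theta \log Z \;=\; \frac{1}{Z} \sum_T \exp(\theta^\top f(T))\, f(T) \;=\; \sum_T p(T \mid w)\, f(T) \;=\; E_{p(T \mid w)}[f(T)]$$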
Slide 32

∇ log Z = Expected Counts
[diagram: the optimization world computes the Gradient; we want Expected counts, seemingly requiring a novel alg.]
Slide 33

∇ log Z = Expected Counts
[diagram: the Gradient is computed by backprop; the Expected counts feed (EM) & MBR decoding & soft features]
Not only does this get the same answer, it actually gives the identical algorithm!
Slide 34

Terminology
Automatic differentiation: Given a circuit that computes a function f, construct a circuit to compute (f, ∇f)
Slide 35

Terminology
Automatic differentiation: Given a circuit that computes a function f, construct a circuit to compute (f, ∇f)
Algorithmic differentiation: Given code that computes a function f, construct code to compute (f, ∇f)
Back-propagation: One computational strategy that the new circuit or code can use
Slide 36

Inside algorithm computes Z
- Essentially creates a circuit "on the fly", like other dynamic programming algorithms
- Circuit structure gives the parse forest hypergraph
- O(V n²) nodes in the circuit, with names like (VP,1,5)
- Each node sums O(V² n) 3-way products
- The 3-way products and the intermediate sums aren't nodes of the formal circuit: no names, not stored long-term
Slide 37

Inside algorithm computes Z
[figure: the inside algorithm as pseudocode, visiting nodes bottom-up (toposort)]
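To make the circuit concrete, here is a minimal Python sketch of the inside algorithm for a grammar in CNF. It is not from the paper; the dict-based grammar encoding, the function name, and the start symbol "S" are illustrative assumptions.

```python
from collections import defaultdict

def inside(words, unary, binary):
    """Inside algorithm: beta[(A, i, k)] = total weight of all ways
    to derive words[i:k] from nonterminal A.
    unary:  {(A, word): weight of rule A -> word}
    binary: {(A, B, C): weight of rule A -> B C}
    """
    n = len(words)
    beta = defaultdict(float)
    for i, w in enumerate(words):              # width-1 spans
        for (A, word), g in unary.items():
            if word == w:
                beta[(A, i, i + 1)] += g
    for width in range(2, n + 1):              # visit nodes bottom-up (toposort)
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):          # split point
                for (A, B, C), g in binary.items():
                    # one 3-way product per (rule, split):
                    beta[(A, i, k)] += g * beta[(B, i, j)] * beta[(C, j, k)]
    return beta

# Z = total probability of all parses of the sentence:
#   Z = inside(words, unary, binary)[("S", 0, len(words))]
```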
Slide 38

How to differentiate
Each forward assignment x += y1 * y2 gives rise to two backward assignments:
ðy1 += ðx * y2   ("swap" x, y1)
ðy2 += ðx * y1   ("swap" x, y2)
Slide 39

Inside algorithm has 3-way products
The forward assignment A += G*B*C gives rise to three backward assignments:
ðG += ðA * B * C   (swap A, G)
ðB += ðA * G * C   (swap A, B)
ðC += ðA * G * B   (swap A, C)
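Continuing the sketch above, here is what those swapped assignments look like as a backward pass over the same loops (again illustrative, not the paper's code). The adjoints ðβ computed this way are exactly the traditional outside weights:

```python
from collections import defaultdict

def backprop(words, unary, binary, beta, start="S"):
    """Backward pass over the inside circuit.
    Returns alpha with alpha[(A, i, k)] = dZ/d beta[(A, i, k)]
    (the traditional outside weights), and alpha_rule with
    alpha_rule[r] = dZ/d G(r) for each rule r.
    """
    n = len(words)
    alpha = defaultdict(float)
    alpha_rule = defaultdict(float)
    alpha[(start, 0, n)] = 1.0                 # dZ/dZ = 1
    for width in range(n, 1, -1):              # reverse of forward order (top-down)
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (A, B, C), g in binary.items():
                    a = alpha[(A, i, k)]
                    if a == 0.0:
                        continue
                    # adjoints of  A += G * B * C,  by "swapping":
                    alpha_rule[(A, B, C)] += a * beta[(B, i, j)] * beta[(C, j, k)]
                    alpha[(B, i, j)] += a * g * beta[(C, j, k)]
                    alpha[(C, j, k)] += a * g * beta[(B, i, j)]
    for i, w in enumerate(words):              # width-1 spans: rules A -> word
        for (A, word), g in unary.items():
            if word == w:
                alpha_rule[(A, word)] += alpha[(A, i, i + 1)]
    return alpha, alpha_rule
```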
Slide 40

Inside algorithm has 3-way products
ðβ[…] is traditionally called α[…] (the outside weight).
We'll similarly write ðG[A → B C] as α[A → B C].
Slide 41

Differentiate to compute (Z, ∇Z)
[figure: the backward-pass pseudocode, visiting ð nodes in reverse order of the forward pass (so top-down), with a callout: "wait, what's this line?"]
Slide 42

The final step: / Z
We want expected counts, i.e., ∇ log Z. But oops, we used backprop to get ∇Z, so fix it by dividing by Z.
(We could reasonably rename ðG(R) to α(R).)
Slide 43

The final step: / Z
This final step is annoying. It's more chain rule; shouldn't that be part of backprop?
We won't need it if we actually compute ∇ log Z in the first place (straightforwardly). We only did ∇Z for pedagogical reasons: it's closer to the original inside-outside algorithm, and it ensures that ðβ equals the traditional α. But the "straightforward" way is slightly more efficient.
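Putting the two sketches together on a tiny hypothetical grammar, with the final normalization made explicit. Note that an expected rule count is a derivative with respect to the log rule weight, hence the extra factor of G(r) alongside the /Z step:

```python
words = ["Time", "flies"]
unary = {("NP", "Time"): 1.0, ("VP", "flies"): 0.5}
binary = {("S", "NP", "VP"): 1.0}

beta = inside(words, unary, binary)
Z = beta[("S", 0, len(words))]                 # 0.5 for this toy grammar
alpha, alpha_rule = backprop(words, unary, binary, beta)

# Expected count of rule r is d log Z / d log G(r) = G(r) * (dZ/dG(r)) / Z:
weight = {**unary, **binary}
expected = {r: weight[r] * v / Z for r, v in alpha_rule.items()}
# -> each of the three rules has expected count 1.0 (used once in the only parse)
```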
Slide 44

Inside algorithm has 3-way products
ðβ[…] is traditionally called α[…].
The paper spells out why the traditional "outer weight" interpretation of α is actually a partial derivative.
Slide 45

Other algorithms! E.g.:

Formalism                | How to compute Z (total weight of all derivations) | Expected counts of substructures = ∇ log Z
HMM                      | backward algorithm                  | backprop
PCFG in CNF              | inside algorithm                    | backprop
CRF-CFG in CNF           | inside algorithm                    | backprop
semi-Markov CRF          | Sarawagi & Cohen 2004               | backprop
arbitrary PCFG           | Earley's algorithm                  | backprop
CCG                      | Vijay-Shanker & Weir 1990           | backprop
TAG                      | Vijay-Shanker & Joshi 1987          | backprop
dependency grammar       | Eisner & Satta                      | backprop
transition-based parsing | PDA with graph-structured stack     | backprop
synchronous grammar      | synchronous inside algorithm        | backprop
graphical model          | junction tree                       | backprop
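For instance, the table's first row as a minimal Python sketch in the same style as the earlier PCFG code (the dict-of-dicts parameter encoding is an illustrative assumption). Backprop through this function yields the forward probabilities as adjoints, and ∇ log Z gives expected transition and emission counts:

```python
def hmm_Z(obs, states, init, trans, emit):
    """Backward algorithm: Z = p(obs), the total weight of all state paths.
    init[s]     = p(first state is s)
    trans[s][u] = p(next state u | current state s)
    emit[s][o]  = p(emit o | state s)
    """
    # beta[s] = p(remaining observations | current state s), starting at the
    # end of the sequence where nothing remains to be generated
    beta = {s: 1.0 for s in states}
    for t in range(len(obs) - 1, 0, -1):       # visit positions right-to-left
        beta = {s: sum(trans[s][u] * emit[u][obs[t]] * beta[u] for u in states)
                for s in states}
    return sum(init[s] * emit[s][obs[0]] * beta[s] for s in states)
```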
Slide 46

Other tips in the paper
An inside-outside algorithm should only be a few times as slow as its inside algorithm. People sometimes publish algorithms that are n times slower; this is a good reason to formally go through algorithmic differentiation.
Slide 47

Other tips in the paper
An inside-outside algorithm should only be a few times as slow as its inside algorithm.
HMMs can be treated as a special case of PCFGs.
Slide 48

Other tips in the paper
An inside-outside algorithm should only be a few times as slow as its inside algorithm.
HMMs can be treated as a special case of PCFGs.
There are 2 other nice derivations of inside-outside. All yield essentially the same code, and the other perspectives are useful too.
The Viterbi variant is useful for pruning (max-marginal counts) and parsing (Viterbi outside scores). It can be obtained as the 0-temperature limit of the gradient-based algorithm.
The gradient view makes it easy to do end-to-end training if the parser talks to / listens to neural nets.
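The 0-temperature limit mentioned above rests on the standard identity

$$\max_i x_i \;=\; \lim_{T \to 0^+} T \log \sum_i e^{x_i / T}$$

so, roughly, running the gradient-based algorithm with all weights raised to the power 1/T and letting T → 0 recovers the max-product (Viterbi) quantities.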