Slide1
Joint Parsing and Alignment with Weakly Synchronized Grammars
David Burkett, John Blitzer, & Dan Klein
Slide2
Statistical MT Training Pipeline
1) Align sentence pairs (GIZA++)
2) Parse English sentences (Berkeley parser)
   Parse foreign sentences
3) Extract rules (Galley et al. 2006)
4) Tune discriminative parameters
Example: 在 (at) 办公室 (office) 里 (in) 读了 (read) 书 (book) / read the book in the office
Joint model for (1) & (2)
Slide3
Data Setting for Joint Models
English WSJ: (EN; [figure]) (EN; [figure]) (EN; [figure]) ...
Chinese CTB: (中文; [figure]) (中文; [figure]) (中文; [figure]) ...
Parallel, Aligned CTB: (EN, 中文; [figure]) (EN, 中文; [figure]) (EN, 中文; [figure]) ...
Unlabeled parallel text: (EN; 中文) (EN; 中文) (EN; 中文) ...
Slide4
Word alignment grids
在 (at) 办公室 (office) 里 (in) 读了 (read) 书 (book)
read the book in the office
Slide5
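Looking back at the word alignment grid for the example sentence pair, one way to represent such a grid is sketched below. This is only an illustration: the link set is an assumption derived from the word glosses on the slide (办公室/office, 里/in, 读了/read, 书/book), with 在 and both instances of "the" left unaligned, and the helper names `alignment_grid` and `render` are hypothetical.

```python
# Toy data from the example sentence pair on the slide.
zh = ["在", "办公室", "里", "读了", "书"]
en = ["read", "the", "book", "in", "the", "office"]

# Assumed links as (zh_index, en_index) pairs, following the glosses only.
links = {(1, 5), (2, 3), (3, 0), (4, 2)}


def alignment_grid(zh_words, en_words, link_set):
    """Return a boolean matrix: rows are Chinese words, columns are English words."""
    return [
        [(i, j) in link_set for j in range(len(en_words))]
        for i in range(len(zh_words))
    ]


def render(zh_words, en_words, link_set):
    """Draw the grid as text, one row per Chinese word ('#' marks a link)."""
    grid = alignment_grid(zh_words, en_words, link_set)
    rows = []
    for word, row in zip(zh_words, grid):
        cells = "".join("#" if cell else "." for cell in row)
        rows.append(f"{cells} {word}")
    return "\n".join(rows)


grid = alignment_grid(zh, en, links)
print(render(zh, en, links))
```

Each row has at most one marked cell here, but nothing in the representation prevents one-to-many links.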
Syntactic Correspondences
EN / 中文
Build a model
Slide6
Correspondence via Synchronous Grammars
Slide7
Synchronous Derivation
Slide8
Synchronous Derivation
Slide9
Weakly Synchronized Example
Slide10
Weakly Synchronized Example
Separate PCFGs
Slide11
Weakly Synchronized Example
ITG alignment
Slide12
Weakly Synchronized Example
Points for synchronization, but not required
Slide13
Correspondence Model & Feature Types
Feature type 1: Word Alignment (example: 办公室 / office)
Feature type 2: Monolingual Parser (example: EN PP "in the office")
Feature type 3: Correspondence (example: EN PP / 中文 PP)
[HBDK09]
Slide14
Estimating the Model Parameters
Set the parameters to maximize the log-likelihood of the correct parses & alignments for each (EN, 中文) sentence pair.
A normalization term makes the model's probabilities sum to 1.
Slide15
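The estimation step on the preceding slide (maximize the log-likelihood of the correct parses and alignments, with a term that normalizes to 1) can be written out as follows. The symbols did not survive in the transcript, so all notation here is an assumption, as is the log-linear form, which the slides' feature types suggest but do not state.

```latex
% Assumed notation (not from the slides):
%   s, s'   : the EN and 中文 sentences
%   t, t'   : their parse trees;  a : the word alignment
%   \phi    : feature vector (alignment, monolingual parser, correspondence)
%   \theta  : model parameters
\[
P_\theta(t, a, t' \mid s, s')
  = \frac{\exp\bigl(\theta^\top \phi(t, a, t', s, s')\bigr)}{Z_\theta(s, s')},
\qquad
Z_\theta(s, s')
  = \sum_{\bar t,\, \bar a,\, \bar t'}
    \exp\bigl(\theta^\top \phi(\bar t, \bar a, \bar t', s, s')\bigr)
\]
\[
\theta^{*}
  = \arg\max_{\theta} \sum_{i} \log P_\theta(t_i, a_i, t'_i \mid s_i, s'_i)
\]
```

Under this reading, the sum in Z over all trees and alignments is what makes exact computation expensive once correspondence features couple the three structures.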
Computing the Joint Model
Correspondence features tie pieces together (e.g., an EN PP and a 中文 PP)
Computing the joint model exactly is intractable
The individual models (EN parse, 中文 parse, word alignment) have polynomial-time dynamic programming algorithms
Slide16
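Approximating the Joint Model: Mean Field
Exploit tractability in individual models
Factored approximation: one factor each for the EN parse, the 中文 parse, and the alignment
Algorithm:
Initialize the factors separately
Iterate: set each factor in turn to minimize the KL divergence from the joint model
Slide17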
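The mean-field recipe from the preceding slide (initialize the factors separately, then repeatedly reset one factor while holding the others fixed) can be illustrated on a toy problem. The sketch below is not the authors' implementation: it replaces parse trees and alignments with three small discrete variables, and the names `mean_field`, `normalize`, and `toy_log_score` are hypothetical. The update used is the standard coordinate-ascent step, which sets each factor proportional to the exponentiated expected log-score under the other factors.

```python
import math
from itertools import product


def normalize(log_weights):
    """Turn unnormalized log-weights into a probability distribution."""
    peak = max(log_weights)
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def mean_field(log_score, sizes, iterations=50):
    """Coordinate-ascent mean field over three discrete variables.

    log_score(x, y, z) is the unnormalized joint log-probability.
    Returns factors [q_x, q_y, q_z] whose product approximates the joint.
    """
    # Initialize each factor separately (here: uniform).
    q = [[1.0 / n] * n for n in sizes]
    for _ in range(iterations):
        for k in range(3):
            first, second = [i for i in range(3) if i != k]
            new_log = []
            for value in range(sizes[k]):
                # Expected log-score with variable k clamped to `value`.
                expected = 0.0
                for a, b in product(range(sizes[first]), range(sizes[second])):
                    assignment = [0, 0, 0]
                    assignment[k] = value
                    assignment[first] = a
                    assignment[second] = b
                    expected += q[first][a] * q[second][b] * log_score(*assignment)
                new_log.append(expected)
            q[k] = normalize(new_log)
    return q


def toy_log_score(x, y, z):
    """Per-variable scores plus a bonus when all three agree (the coupling)."""
    unary = [0.5, 0.0][x] + [0.2, 0.0][y] + [0.0, 0.1][z]
    agreement = 2.0 if x == y == z else 0.0
    return unary + agreement


q_x, q_y, q_z = mean_field(toy_log_score, sizes=(2, 2, 2))
print(q_x, q_y, q_z)
```

When the score has no coupling term, each factor reduces to the softmax of its own per-variable scores, so the approximation is exact in that case.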
Large scale inference
We can approximate the joint model in polynomial time, but . . .
Sum over possible alignments is an O(n^6) algorithm.
But computers are fast, right?
Medium-length sentences are 50 words long
Small translation data sets are 250,000 sentences
~4 quadrillion operations
(See [BBK10, HBDK09] for speedup details)
Slide18
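The "~4 quadrillion operations" estimate on the preceding slide can be checked with a few lines of arithmetic. The sixth-power cost per sentence pair is an assumption here (the complexity symbol did not survive in the transcript), though it is the value that reproduces the slide's total from its other two numbers; `total_operations` is a hypothetical helper.

```python
def total_operations(sentence_length, num_sentences, exponent=6):
    """Rough operation count: sentence_length**exponent per sentence pair."""
    return sentence_length ** exponent * num_sentences


ops = total_operations(sentence_length=50, num_sentences=250_000)
print(f"{ops:.2e}")  # 3.91e+15, i.e. roughly 4 quadrillion
```

Constant factors are ignored, so this is an order-of-magnitude estimate only.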
Quantitative Results: Parsing
Slide19
Quantitative Results: Parsing
85.7%
83.6%
Slide20
Quantitative Results: Parsing
81.2%
84.5%
Slide21
Incorrect English PP Attachment
Slide22
Corrected English PP Attachment
Slide23
Quantitative Results: Translation
69.5%
85.0%
79.5%
BLEU improvement from 29.4 to 30.6
Slide24
Better Translations with Bilingual Adaptation
Source
目前导致飞机相撞的原因尚不清楚,当地民航部门将对此展开调查
Gloss: currently cause plane crash DE reason yet not clear, local civil aeronautics bureau will toward open investigations
Reference
At this point the cause of the plane collision is still unclear. The local caa will launch an investigation into this.
Baseline (GIZA++)
The cause of planes is still not clear yet, local civil aviation department will investigate this.
Bilingual Adaptation Model
The cause of plane collision remained unclear, local civil aviation departments will launch an investigation.
Slide25
Thanks