
Slide1

Stat-XFER: A General Framework for Search-based Syntax-driven MT

Alon Lavie

Language Technologies Institute

Carnegie Mellon University

Joint work with:

Erik Peterson, Alok Parlikar, Vamshi Ambati, Abhaya Agarwal, Greg Hanneman, Kevin Gimpel, Edmund Huber

NIST MT-08: CMU Stat-XFER — March 28, 2008

Slide2


Outline

Context and Rationale

CMU Statistical Transfer MT Framework

Automatic Acquisition of Syntax-based MT Resources

Chinese-to-English System

Urdu-to-English System

Open Research Challenges

Conclusions

Slide3


Rule-based vs. Statistical MT

Traditional Rule-based MT:

Expressive and linguistically-rich formalisms capable of describing complex mappings between the two languages

Accurate “clean” resources

Everything constructed manually by experts

Main challenge: obtaining broad coverage

Phrase-based Statistical MT:

Learn word and phrase correspondences automatically from large volumes of parallel data

Search-based “decoding” framework:

Models propose many alternative translations

Effective search algorithms find the “best” translation

Main challenge: obtaining high translation accuracy

Slide4

Research Goals

Long-term research agenda (since 2000) focused on developing a unified framework for MT that addresses the core fundamental weaknesses of previous approaches:

Representation – explore richer formalisms that can capture complex divergences between languages

Ability to handle morphologically complex languages

Methods for automatically acquiring MT resources from available data and combining them with manual resources

Ability to address both rich and poor resource scenarios

Focus has been on low-resource scenarios, scaling up to resource-rich scenarios in the past year

Main research funding sources: NSF (AVENUE and LETRAS projects) and DARPA (GALE)


Slide5


CMU Statistical Transfer (Stat-XFER) MT Approach

Integrate the major strengths of rule-based and statistical MT within a common framework:

Linguistically rich formalism that can express complex and abstract compositional transfer rules

Rules can be written by human experts and also acquired automatically from data

Easy integration of morphological analyzers and generators

Word and syntactic-phrase correspondences can be automatically acquired from parallel text

Search-based decoding from statistical MT adapted to find the best translation within the search space: multi-feature scoring, beam-search, parameter optimization, etc.

Framework suitable for both resource-rich and resource-poor language scenarios

Slide6


Stat-XFER MT Approach

[Figure: the MT pyramid. From Source (e.g. Quechua) to Target (e.g. English): Direct approaches (SMT, EBMT) at the base; Statistical-XFER operating at the Transfer Rules level, between Syntactic Parsing / Semantic Analysis on the analysis side and Sentence Planning / Text Generation on the generation side; Interlingua at the apex.]

Slide7

Transfer Engine

[Figure: transfer engine architecture. The Source Input passes through Preprocessing and Morphology into the Transfer Engine, which applies the Transfer Rules and the Translation Lexicon to build a Translation Output Lattice; the Decoder, using a Language Model and additional features, then selects the English Output.]

Transfer Rules:

{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Translation Lexicon:

N::N |: ["$WR"] -> ["BULL"]
((X1::Y1)
 ((X0 NUM) = s)
 ((Y0 lex) = "BULL"))

N::N |: ["$WRH"] -> ["LINE"]
((X1::Y1)
 ((X0 NUM) = s)
 ((Y0 lex) = "LINE"))

Source Input: בשורה הבאה ("in the next line")

Translation Output Lattice:

(0 1 "IN" @PREP)
(1 1 "THE" @DET)
(2 2 "LINE" @N)
(1 2 "THE LINE" @NP)
(0 2 "IN LINE" @PP)
(0 4 "IN THE NEXT LINE" @PP)

English Output: in the next line

Slide8


Transfer Rule Formalism

Type information

Part-of-speech/constituent information

Alignments

x-side constraints

y-side constraints

xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
((X1::Y1)
 (X1::Y3)
 (X2::Y4)
 (X3::Y2)
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF)
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER)))
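To make the constraint mechanics concrete, here is a minimal Python sketch of testing the constant-valued constraints in a rule like the one above; the names and flat-dict feature structures are my assumptions for illustration, not the actual Stat-XFER code, and real unification also propagates shared values such as ((Y2 GENDER) = (Y4 GENDER)) rather than only testing constants.

# Sketch only: feature structures as flat dicts, e.g. {"AGR": "3-SING"}.
def check_constraints(constraints, x_feats, y_feats):
    """Test constant constraints like ((X1 AGR) = *3-SING).

    constraints: list of (side, index, feature, value), 1-based index.
    x_feats / y_feats: one feature dict per matched source / target constituent.
    """
    for side, idx, feat, value in constraints:
        feats = x_feats if side == "x" else y_feats
        actual = feats[idx - 1].get(feat)
        if actual is not None and actual != value:
            return False  # constraint violated: the rule does not fire
    return True

# The NP rule above, restricted to its x-side constant constraints:
np_constraints = [("x", 1, "AGR", "3-SING"), ("x", 1, "DEF", "DEF"),
                  ("x", 3, "AGR", "3-SING"), ("x", 3, "COUNT", "+")]
the_old_man = [{"AGR": "3-SING", "DEF": "DEF"},  # DET "the"
               {},                               # ADJ "old"
               {"AGR": "3-SING", "COUNT": "+"}]  # N "man"
print(check_constraints(np_constraints, the_old_man, []))  # -> True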

Slide9


Translation Lexicon: Examples

PRO::PRO |: ["ANI"] -> ["I"]
((X1::Y1)
 ((X0 per) = 1)
 ((X0 num) = s)
 ((X0 case) = nom))

PRO::PRO |: ["ATH"] -> ["you"]
((X1::Y1)
 ((X0 per) = 2)
 ((X0 num) = s)
 ((X0 gen) = m)
 ((X0 case) = nom))

N::N |: ["$&H"] -> ["HOUR"]
((X1::Y1)
 ((X0 NUM) = s)
 ((Y0 NUM) = s)
 ((Y0 lex) = "HOUR"))

N::N |: ["$&H"] -> ["hours"]
((X1::Y1)
 ((Y0 NUM) = p)
 ((X0 NUM) = p)
 ((Y0 lex) = "HOUR"))

Slide10


Hebrew Transfer Grammar: Example Rules

{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}
;;SL: H $MLWT H ADWMWT
;;TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Slide11


The Transfer Engine

Input:

source-language input sentence, or source-language confusion network

Output:

lattice representing collection of translation fragments at all levels supported by transfer rules

Basic Algorithm:

“bottom-up” integrated “parsing-transfer-generation” guided by the transfer rules

Start with translations of individual words and phrases from translation lexicon

Create translations of larger constituents by applying applicable transfer rules to previously created lattice entries

Beam-search controls the exponential combinatorics of the search-space, using multiple scoring features
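The following self-contained Python sketch illustrates this bottom-up loop on a toy version of the Hebrew example from earlier slides. The rule encoding, the LIT: convention for source literals, and the data are assumptions made for this example; the real engine additionally enforces unification constraints, interfaces with morphology, and prunes with a beam.

from collections import defaultdict

def populate_lattice(words, lexicon, rules, max_passes=4):
    """arcs[(i, j)] holds (label, target string) translations of words[i:j]."""
    arcs = defaultdict(list)
    for i, w in enumerate(words):
        arcs[(i, i + 1)].append(("LIT:" + w, w))       # source word as literal
        for label, trans in lexicon.get(w, []):        # lexical translations
            arcs[(i, i + 1)].append((label, trans))

    def matches(start, labels):
        """Yield (end, used_arcs) for adjacent arcs realizing the RHS labels."""
        if not labels:
            yield start, []
            return
        for (i, j), entries in list(arcs.items()):
            if i != start:
                continue
            for label, trans in entries:
                if label == labels[0]:
                    for end, rest in matches(j, labels[1:]):
                        yield end, [(label, trans)] + rest

    for _ in range(max_passes):                        # iterate toward a fixed point
        new = []
        for lhs, rhs, target_side in rules:
            for i in range(len(words)):
                for j, used in matches(i, rhs):
                    # target side: ints pick source constituents, strings are literals
                    out = " ".join(used[k - 1][1] if isinstance(k, int) else k
                                   for k in target_side)
                    if (lhs, out) not in arcs[(i, j)]:
                        new.append(((i, j), (lhs, out)))
        if not new:
            break
        for span, entry in new:
            if entry not in arcs[span]:
                arcs[span].append(entry)
    return arcs

# Toy run in the spirit of the Hebrew example (transliterated source):
words = ["B", "H", "$WRH", "H", "BAH"]
lexicon = {"B": [("PREP", "in")], "$WRH": [("N", "line")], "BAH": [("ADJ", "next")]}
rules = [
    ("NP1", ["N"], (1,)),                       # promote N to NP1
    ("NP1", ["NP1", "LIT:H", "ADJ"], (3, 1)),   # [NP1 "H" ADJ] -> [ADJ NP1]
    ("NP", ["LIT:H", "NP1"], ("the", 2)),       # definiteness inserts "the"
    ("PP", ["PREP", "NP"], (1, 2)),
]
print(populate_lattice(words, lexicon, rules)[(0, 5)])  # [('PP', 'in the next line')]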

Slide12


The Transfer Engine

Some Unique Features:

Works with either learned or manually-developed transfer grammars

Handles rules with or without unification constraints

Supports interfacing with servers for morphological analysis and generation

Can handle ambiguous source-word analyses and/or SL segmentations represented in the form of lattice structures

Slide13


XFER Output Lattice

(28 28 "AND" -5.6988 "W" "(CONJ,0 'AND')")

(29 29 "SINCE" -8.20817 "MAZ " "(ADVP,0 (ADV,5 'SINCE')) ")

(29 29 "SINCE THEN" -12.0165 "MAZ " "(ADVP,0 (ADV,6 'SINCE THEN')) ")

(29 29 "EVER SINCE" -12.5564 "MAZ " "(ADVP,0 (ADV,4 'EVER SINCE')) ")

(30 30 "WORKED" -10.9913 "&BD " "(VERB,0 (V,11 'WORKED')) ")

(30 30 "FUNCTIONED" -16.0023 "&BD " "(VERB,0 (V,10 'FUNCTIONED')) ")

(30 30 "WORSHIPPED" -17.3393 "&BD " "(VERB,0 (V,12 'WORSHIPPED')) ")

(30 30 "SERVED" -11.5161 "&BD " "(VERB,0 (V,14 'SERVED')) ")

(30 30 "SLAVE" -13.9523 "&BD " "(NP0,0 (N,34 'SLAVE')) ")

(30 30 "BONDSMAN" -18.0325 "&BD " "(NP0,0 (N,36 'BONDSMAN')) ")

(30 30 "A SLAVE" -16.8671 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,34 'SLAVE')) ) ) ) ")

(30 30 "A BONDSMAN" -21.0649 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,36 'BONDSMAN')) ) ) ) ")

Slide14


The Lattice Decoder

Simple Stack Decoder, similar in principle to simple Statistical MT decoders

Searches for best-scoring path of non-overlapping lattice arcs

No reordering during decoding

Scoring based on log-linear combination of scoring features, with weights trained using Minimum Error Rate Training (MERT)

Scoring components:

Statistical Language Model

Rule Scores (currently: freq-based relative likelihood)

Lexical Probabilities

Fragmentation: how many arcs to cover the entire translation?

Length Penalty: how far from expected target length?
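As a stripped-down illustration of this search (with hypothetical feature names and weights), the sketch below exploits the fact that decoding is monotone: the best-scoring sequence of non-overlapping arcs can be found left-to-right by dynamic programming. The real decoder is a stack decoder largely because the n-gram LM scores across arc boundaries, which this toy version ignores.

def decode(arcs, n, weights):
    """Best path of non-overlapping arcs covering positions 0..n (monotone).

    arcs: list of (i, j, translation, features), features a dict of raw
    feature values; the score is the weighted (log-linear) feature sum plus
    a per-arc term, so weights["frag"] penalizes fragmented translations.
    """
    NEG = float("-inf")
    best = [(NEG, "")] * (n + 1)
    best[0] = (0.0, "")
    for j in range(1, n + 1):
        for i, k, trans, feats in arcs:
            if k != j or best[i][0] == NEG:
                continue
            score = best[i][0] + weights["frag"]
            score += sum(weights[f] * v for f, v in feats.items())
            if score > best[j][0]:
                best[j] = (score, (best[i][1] + " " + trans).strip())
    return best[n]

arcs = [(0, 1, "in", {"lm": -2.0, "rule": -0.1}),
        (0, 2, "in line", {"lm": -9.1, "rule": -0.4}),
        (1, 4, "the next line", {"lm": -6.5, "rule": -0.6})]
weights = {"lm": 1.0, "rule": 0.5, "frag": -0.5}
print(decode(arcs, 4, weights))  # -> (-9.85..., 'in the next line')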

Slide15


XFER Lattice Decoder

0 0 ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL

Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0, Words: 13,13

235 < 0 8 -19.7602: B H IWM RBI&I (PP,0 (PREP,3 'ON') (NP,2 (LITERAL 'THE') (NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH')) (NP1,0 (NP0,1 (N,6 'DAY')))))))>

918 < 8 14 -46.2973: H ARIH AKL AT H $PN (S,2 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,17 'LION'))))) (VERB,0 (V,0 'ATE')) (NP,100 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,24 'RABBIT')))))))>

584 < 14 17 -30.6607: L ARWXH BWQR (PP,0 (PREP,6 'TO') (NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NNP,3 (NP0,0 (N,32 'MORNING')) (NP0,0 (N,27 'MEAL')))))))>

Slide16


Stat-XFER MT Systems

General Stat-XFER framework under development for past seven years

Systems so far:

Chinese-to-English

Hebrew-to-English

Urdu-to-English

Hindi-to-English

Dutch-to-English

Mapudungun-to-Spanish

In progress or planned:

Arabic-to-English

Brazilian Portuguese-to-English

Native-Brazilian languages to Brazilian Portuguese

Hebrew-to-Arabic

Quechua-to-Spanish

Turkish-to-English

Slide17

MT Resource Acquisition in Resource-rich Scenarios

Scenario: significant amounts of parallel text at the sentence level are available

Parallel sentences can be word-aligned and parsed (at least on one side, ideally on both sides)

Goal: acquire syntax-based broad-coverage translation lexicons and transfer-rule grammars automatically from the data

Syntax-based translation lexicons:

Broad-coverage constituent-level translation equivalents at all levels of granularity

Can serve as the elementary building blocks for transfer trees constructed at runtime using the transfer rules


Slide18

Acquisition Process

Automatic process for extracting syntax-driven rules and lexicons from sentence-parallel data:

1. Word-align the parallel corpus (GIZA++)

2. Parse the sentences independently for both languages

3. Run our new PFA Constituent Aligner over the parsed sentence pairs

4. Extract all aligned constituents from the parallel trees

5. Extract all derived synchronous transfer rules from the constituent-aligned parallel trees

6. Construct a database of all extracted parallel constituents and synchronous rules with their frequencies, and model them statistically (assign them relative-likelihood probabilities)


Slide19

PFA Constituent Node Aligner

Input: a bilingual pair of parsed and word-aligned sentences

Goal: find all sub-sentential constituent alignments between the two trees which are translation equivalents of each other

Equivalence constraint: a pair of constituents <S,T> are considered translation equivalents if:

All words in the yield of <S> are aligned only to words in the yield of <T> (and vice-versa)

If <S> has a sub-constituent <S1> that is aligned to <T1>, then <T1> must be a sub-constituent of <T> (and vice-versa)

The algorithm is a bottom-up process starting from the word level, marking nodes that satisfy the constraints (see the sketch below)
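The equivalence constraint translates almost directly into code. In the Python sketch below (the tree encoding and function names are illustrative, not the PFA aligner itself), trees are nested (label, children) tuples with integer word positions at the leaves, and the word alignment is a set of (source, target) index pairs; for brevity it checks all node pairs rather than working bottom-up.

from itertools import product

def span(node):
    """Word positions in a node's yield; leaves are ints, nodes (label, kids)."""
    if isinstance(node, int):
        return {node}
    return set().union(*(span(c) for c in node[1]))

def nodes(tree):
    """All internal nodes of a tree, top-down."""
    if isinstance(tree, int):
        return
    yield tree
    for child in tree[1]:
        yield from nodes(child)

def equivalent(s_node, t_node, links):
    """<S,T> are equivalent if each yield aligns only into the other yield."""
    s_span, t_span = span(s_node), span(t_node)
    inside = any(si in s_span and ti in t_span for si, ti in links)
    leak_s = any(si in s_span and ti not in t_span for si, ti in links)
    leak_t = any(ti in t_span and si not in s_span for si, ti in links)
    return inside and not leak_s and not leak_t

def align_constituents(src_tree, tgt_tree, links):
    return [(s, t) for s, t in product(nodes(src_tree), nodes(tgt_tree))
            if equivalent(s, t, links)]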


Slide20

PFA Node Alignment Algorithm Example

Words don’t have to align one-to-one

Constituent labels can be different in each language

Tree structures can be highly divergent


Slide21

PFA Node Alignment Algorithm Example

The aligner uses a clever arithmetic manipulation to enforce the equivalence constraints

The resulting aligned nodes are highlighted in the figure


Slide22

PFA Node Alignment Algorithm Example

Extraction of Phrases:

Get the yields of the aligned nodes and add them to a phrase table, tagged with syntactic categories on both the source and target sides

Example: NP # NP :: 澳洲 # Australia
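Continuing the aligner sketch from the PFA slide (this reuses its span and align_constituents functions), the extraction step is only a few lines: each aligned node pair becomes a phrase-table entry tagged with its two constituent labels, in the NP # NP :: ... format shown above.

def phrase_entries(src_tree, tgt_tree, links, src_words, tgt_words):
    """Phrase-table entries like 'NP # NP :: 澳洲 # Australia'."""
    entries = []
    for s, t in align_constituents(src_tree, tgt_tree, links):
        s_yield = " ".join(src_words[i] for i in sorted(span(s)))
        t_yield = " ".join(tgt_words[i] for i in sorted(span(t)))
        entries.append(f"{s[0]} # {t[0]} :: {s_yield} # {t_yield}")
    return entries

# Toy fragment of the running example: 澳洲 / Australia aligned under NP.
src = ("IP", (("NP", (0,)), ("VP", (1,))))
tgt = ("S", (("NP", (0,)), ("VP", (1,))))
print(phrase_entries(src, tgt, {(0, 0), (1, 1)},
                     ["澳洲", "是"], ["Australia", "is"]))
# ['IP # S :: 澳洲 是 # Australia is', 'NP # NP :: 澳洲 # Australia',
#  'VP # VP :: 是 # is']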


Slide23

PFA Node Alignment Algorithm Example

All phrases from this tree pair:

IP # S :: 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 # Australia is one of the few countries that have diplomatic relations with North Korea .

VP # VP :: 是 与 北韩 有 邦交 的 少数 国家 之一 # is one of the few countries that have diplomatic relations with North Korea

NP # NP :: 与 北韩 有 邦交 的 少数 国家 之一 # one of the few countries that have diplomatic relations with North Korea

VP # VP :: 与 北韩 有 邦交 # have diplomatic relations with North Korea

NP # NP :: 邦交 # diplomatic relations

NP # NP :: 北韩 # North Korea

NP # NP :: 澳洲 # Australia

Slide24

PFA Constituent Node Alignment Performance

Evaluation data: parallel Chinese-English Treebank with manual word alignments (3,342 sentence pairs)

Created a "gold standard" of constituent alignments using the manual word alignments and the treebank trees

Node alignments: 39,874 (about 12 per tree pair)

NP-to-NP alignments: 5,427

Manual inspection confirmed that the constituent alignments are quite accurate (P > 0.80, R > 0.70)

Evaluation: run the PFA Aligner with automatic word alignments on the same data and compare with the "gold standard" alignments


Slide25

PFA Constituent Node Alignment Performance

Viterbi Combination     Precision   Recall   F-Measure
Intersection            0.6382      0.5395   0.5847
Union                   0.8114      0.2915   0.4289
Sym-1 (Thot Toolkit)    0.7142      0.4534   0.5547
Sym-2 (Thot Toolkit)    0.7135      0.4631   0.5617
Grow-Diag-Final         0.7777      0.3462   0.4791
Grow-Diag-Final-and     0.6988      0.4700   0.5620

Viterbi word alignments from the Chinese-English and reverse directions were merged using different algorithms (sketched below)

Tested the performance of node alignment with each resulting merged alignment
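For reference, the two simplest of these symmetrization heuristics are set operations over the two Viterbi link sets, as in the generic sketch below (not the Thot toolkit's code); grow-diag-final, not shown, starts from the intersection and iteratively adds neighboring links drawn from the union.

def symmetrize(st_links, ts_links, how="intersection"):
    """st_links: {(src, tgt)} from the source->target model;
    ts_links: {(tgt, src)} from the reverse model."""
    flipped = {(s, t) for t, s in ts_links}
    return st_links & flipped if how == "intersection" else st_links | flipped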


Slide26

Transfer Rule Learning

Input: constituent-aligned parallel trees

Idea: aligned nodes act as possible decomposition points of the parallel trees

The sub-trees of any aligned pair of nodes can be further decomposed at lower-level aligned nodes, creating an inventory of synchronous "TIG" correspondences

We decompose only at the "highest" level possible

Synchronous "TIGs" can be converted into synchronous rules

Algorithm: find and extract all possible synchronous TIG decompositions from the node-aligned trees, then "flatten" the TIGs into synchronous CFG rules (see the sketch below)
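A toy Python version of this decompose-and-flatten step, reusing the tuple-encoded trees from the PFA aligner sketch (names and encoding are again illustrative): each aligned node pair yields one flat rule whose right-hand sides keep unaligned words lexicalized and replace the highest aligned descendants with their constituent labels. The real extractor also records variable alignments (e.g. X1::Y7) and rule frequencies.

def flatten(node, aligned, words, root=True):
    """One side's flat RHS: variables at the highest aligned descendants."""
    if isinstance(node, int):
        return [words[node]]              # unaligned word stays lexicalized
    if not root and node in aligned:
        return [node[0]]                  # decomposition point: emit its label
    out = []
    for child in node[1]:
        out += flatten(child, aligned, words, root=False)
    return out

def extract_flat_rules(pairs, src_words, tgt_words):
    """pairs: aligned (src_node, tgt_node) pairs from the constituent aligner."""
    src_aligned = {s for s, _ in pairs}
    tgt_aligned = {t for _, t in pairs}
    return [f"{s[0]}::{t[0]} "
            f"[{' '.join(flatten(s, src_aligned, src_words))}] -> "
            f"[{' '.join(flatten(t, tgt_aligned, tgt_words))}]"
            for s, t in pairs]

# On the toy pair from the extraction sketch this produces, e.g.:
#   IP::S [NP VP] -> [NP VP]
#   NP::NP [澳洲] -> [Australia]
#   VP::VP [是] -> [is]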


Slide27

Rule Extraction Algorithm

Flat rule creation. Sample rule:

NP::NP [VP 的 CD 国家 之一] -> [one of the CD countries that VP]
(;; Alignments
 (X1::Y7)
 (X3::Y4))

Note: any one-to-one aligned words are elevated to part-of-speech in the flat rule. Any non-aligned words on either the source or target side remain lexicalized.


Slide28

Rule Extraction Algorithm

All rules extracted:

VP::VP [VC NP] -> [VBZ NP]
((*score* 0.5)
 ;; Alignments
 (X1::Y1)
 (X2::Y2))

NP::NP [NR] -> [NNP]
((*score* 0.5)
 ;; Alignments
 (X1::Y1))

VP::VP [与 NP VE NP] -> [VBP NP with NP]
((*score* 0.5)
 ;; Alignments
 (X2::Y4)
 (X3::Y1)
 (X4::Y2))

NP::NP [VP 的 CD 国家 之一] -> [one of the CD countries that VP]
((*score* 0.5)
 ;; Alignments
 (X1::Y7)
 (X3::Y4))

IP::S [NP VP] -> [NP VP]
((*score* 0.5)
 ;; Alignments
 (X1::Y1)
 (X2::Y2))

NP::NP ["北韩"] -> ["North" "Korea"]
(;; Many-to-one alignment is a phrase
)


Slide29

Chinese-English System

Developed over the past year under DARPA/GALE funding (within the IBM-led "Rosetta" team)

Participated in the recent NIST MT-08 evaluation

Large-scale, broad-coverage system

Integrates large manual resources with automatically extracted resources

Current performance level is still inferior to state-of-the-art phrase-based systems


Slide30

Chinese-English System

Lexical resources:

Manual lexicons (base forms): LDC, ADSO, Wiki; total number of entries: 1.07 million

Automatically acquired from parallel data: approx. 5 million sentences of LDC/GALE data, filtered down to phrases < 10 words in length, full-formed; total number of entries: 2.67 million


Slide31

Chinese-English System

Transfer rules:

61 manually developed transfer rules

High-accuracy rules extracted from manually word-aligned parallel data

Corpus                   Size (sents)   Rules with Structure   Rules (count >= 2)   Complete Lexical Rules
Parallel Treebank (3K)   3,343          45,266                 1,962                11,521
993 sentences            993            12,661                 331                  2,199
Parallel Treebank (7K)   6,541          41,998                 1,756                16,081
Merged Corpus set        10K            94,117                 3,160                29,340


Slide32


Translation Example

SrcSent 3 澳洲是与北韩有邦交的少数国家之一。

Gloss:

Australia is with north korea have diplomatic relations DE few country world

Reference:

Australia is one of the few countries that have diplomatic relations with North Korea.

Translation:

Australia is one of the few countries that has diplomatic relations with north korea .

Overall: -5.77439, Prob: -2.58631, Rules: -0.66874, TransSGT: -2.58646, TransTGS: -1.52858, Frag: -0.0413927, Length: -0.127525, Words: 11,15

( 0 10 "Australia is one of the few countries that has diplomatic relations with north korea" -5.66505 "澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 " "(S1,1124731 (S,1157857 (NP,2 (NB,1 (LDC_N,1267 'Australia') ) ) (VP,1046077 (MISC_V,1 'is') (NP,1077875 (LITERAL 'one') (LITERAL 'of') (NP,1045537 (NP,1017929 (NP,1 (LITERAL 'the') (NUMNB,2 (LDC_NUM,420 'few') (NB,1 (WIKI_N,62230 'countries') ) ) ) (LITERAL 'that') (VP,1021811 (LITERAL 'has') (FBIS_NP,11916 'diplomatic relations') ) ) (FBIS_PP,84791 'with north korea') ) ) ) ) ) ")

( 10 11 "." -11.9549 "。" "(MISC_PUNC,20 '.')")

Slide33


Example: Syntactic Lexical Phrases

(LDC_N,1267 'Australia')

(WIKI_N,62230 'countries')

(FBIS_NP,11916 'diplomatic relations')

(FBIS_PP,84791 'with north korea')

Slide34


Example: Learned XFER Rules

;;SL::(2,4) 对 台 贸易

;;TL::(3,5) trade to taiwan

;;Score::22

{NP,1045537}

NP::NP [PP NP ] -> [NP PP ]

((*score* 0.916666666666667)

(X2::Y1)

(X1::Y2))

;;SL::(2,7) 直接 提到 伟 哥 的 广告

;;TL::(1,7) commercials that directly mention the name viagra

;;Score::5

{NP,1017929}

NP::NP [VP "的" NP ] -> [NP "that" VP ]

((*score* 0.111111111111111)

(X3::Y1)

(X1::Y3))

;;SL::(4,14) 有 一 至 多 个 高 新 技术 项目 或 产品

;;TL::(3,14) has one or more new , high level technology projects or products

;;Score::4

{VP,1021811}

VP::VP ["有" NP ] -> ["has" NP ]

((*score* 0.1)

(X2::Y2))

Slide35

Current Performance

Test Set           BLEU     METEOR
MT-03 (Dev-test)   0.2227   0.4998
MT-06 (Dev-test)   0.2083   0.4713
MT-08 (Test)       0.1309   0.4614


Slide36

Urdu-to-English System

Primary condition did not allow parsing of parallel data → low-resource scenario:

Lexical resources: used the provided LDC lexicon (tagged with POS), plus lexical entries acquired from word-aligning the parallel data

XFER rules: manually developed (48 rules)

Language model built from the English side of the parallel data

Primary and contrastive systems:

Cont-1: built a phrase-based system using Moses

Primary: used our multi-engine MT system to combine our XFER system with our Moses system

Cont-2: our constrained Stat-XFER system only

Cont-3: unconstrained version of Stat-XFER with just a large LM

Cont-4: MEMT of Stat-XFER (Cont-2) and Columbia's constrained phrase-based system

Cont-5: similar to Cont-4, but with the unconstrained systems


Slide37

Urdu-to-English System

Our official reported scores are incorrect due to an off-by-one bug halfway through our output

Corrected scores, as reported by NIST:


System                             IBM BLEU   METEOR
Primary: MEMT (XFER+Moses)         0.1500     0.5087
Cont-1: Moses only                 0.1820     0.5069
Cont-2: Stat-XFER only             0.1158     0.4528
Cont-3: Stat-XFER unconstrained    0.1443     0.4740
Cont-4: MEMT (XFER+Columbia)       0.1526     0.5142
Cont-5: MEMT (XFER+CU unconst)     0.1623     0.5127

Slide38

Open Research Questions

Our large-scale Chinese-English system is still significantly behind phrase-based SMT. Why?

Feature set is not sufficiently discriminant?

Problems with the parsers for the two sides?

Weaker decoder?

Syntactic constituents don't provide sufficient coverage?

Bugs and deficiencies in the underlying algorithms?

The ISI experience indicates that it may take a couple of years to catch up with and surpass the phrase-based systems

Significant engineering issues remain in improving speed, efficient runtime processing, and search


Slide39

Open Research Questions

Immediate research issues:

Rule learning:

Study effects of learning rules from manually vs. automatically word-aligned data

Study effects of parser accuracy on learned rules

Effective discriminant methods for modeling rule scores

Rule filtering strategies

Syntax-based LMs: our translations come out with a syntax tree attached to them

Add a syntax-based LM feature that can discriminate between good and bad trees


Slide40

Conclusions

Stat-XFER is a promising general MT framework, suitable for a variety of MT scenarios and languages

Provides a complete solution for building end-to-end MT systems from parallel data, akin to phrase-based SMT systems (training, tuning, runtime system)

No open-source publicly available toolkits (yet), but we welcome further collaboration activities

A complex but highly interesting set of open research issues


Slide41


Questions?