
Slide1

Stat-XFER: A General Framework for Search-based Syntax-driven MT

Alon Lavie

Language Technologies Institute

Carnegie Mellon University

Joint work with:

Erik Peterson, Alok Parlikar, Vamshi Ambati, Christian Monson, Ari Font Llitjos, Lori Levin, Jaime Carbonell – Carnegie Mellon University

Shuly Wintner, Danny Shacham, Nurit Melnik - University of Haifa

Roberto Aranovich – University of Pittsburgh

Slide2

February 18, 2008

CICLing-2008


Outline

Context and Rationale

CMU Statistical Transfer MT Framework

Broad Resource Scenario: Chinese-to-English

Low Resource Scenario: Hebrew-to-English

Open Research Challenges

Conclusions

Slide3


Current State-of-the-Art in Machine Translation

MT underwent a major paradigm shift over the past 15 years:

From

manually crafted

rule-based systems

with manually designed knowledge resources

To

search-based approaches

founded on automatic extraction of translation models/units from large sentence-parallel corpora

Current Dominant Approach: Phrase-based Statistical MT:

Extract and statistically model large volumes of phrase-to-phrase correspondences from automatically word-aligned parallel corpora

“Decode” new input by searching for the most likely sequence of phrase matches, using a combination of features, including a statistical Language Model for the target language
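To make the log-linear "combination of features" concrete, here is a minimal illustrative sketch in Python of scoring one phrase-segmented hypothesis. The phrase pairs, bigram LM entries, and weights are all hypothetical, and real decoders also search over segmentations and reorderings:

    def score_hypothesis(phrase_pairs, bigram_lm, weights):
        # phrase_pairs: list of (source_phrase, target_phrase, tm_logprob)
        tm = sum(logp for _, _, logp in phrase_pairs)
        tokens = " ".join(tgt for _, tgt, _ in phrase_pairs).split()
        lm = sum(bigram_lm.get((tokens[i - 1], tokens[i]), -10.0)  # -10.0: unseen-bigram penalty
                 for i in range(1, len(tokens)))
        return weights["tm"] * tm + weights["lm"] * lm

    # Hypothetical values, for illustration only:
    lm = {("in", "the"): -0.3, ("the", "next"): -0.5, ("next", "line"): -0.4}
    pairs = [("B", "in the", -0.6), ("$WRH H BAH", "next line", -0.9)]
    print(score_hypothesis(pairs, lm, {"tm": 1.0, "lm": 0.8}))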

Slide4


Current State-of-the-art in Machine Translation

Phrase-based MT State-of-the-art:

Requires minimally several million words of parallel text for adequate training

Mostly limited to language-pairs for which such data exists: major European languages, Arabic, Chinese, Japanese, a few others…

Linguistically shallow and highly lexicalized models result in weak generalization

Best performance levels (BLEU=~0.6) on Arabic-to-English provide understandable but often still ungrammatical or somewhat disfluent translations

Ill-suited for Hebrew and most of the world’s minor and resource-poor languages

Slide5


Rule-based vs. Statistical MT

Traditional Rule-based MT:

Expressive and linguistically-rich formalisms capable of describing complex mappings between the two languages

Accurate “clean” resources

Everything constructed manually by experts

Main challenge: obtaining broad coverage

Phrase-based Statistical MT:

Learn word and phrase correspondences automatically from large volumes of parallel data

Search-based “decoding” framework:

Models propose many alternative translations

Effective search algorithms find the “best” translation

Main challenge: obtaining high translation accuracy

Slide6

Research Goals

Long-term research agenda (since 2000) focused on developing a unified framework for MT that addresses the core fundamental weaknesses of previous approaches:

Representation – explore richer formalisms that can capture complex divergences between languages

Ability to handle morphologically complex languages

Methods for automatically acquiring MT resources from available data and combining them with manual resources

Ability to address both rich and poor resource scenarios

Main research funding sources: NSF (AVENUE and LETRAS projects) and DARPA (GALE)


Slide7


CMU Statistical Transfer (Stat-XFER) MT Approach

Integrate the major strengths of rule-based and statistical MT within a common framework:

Linguistically rich formalism that can express complex and abstract compositional transfer rules

Rules can be written by human experts and also acquired automatically from data

Easy integration of morphological analyzers and generators

Word and syntactic-phrase correspondences can be automatically acquired from parallel text

Search-based decoding from statistical MT adapted to find the best translation within the search space: multi-feature scoring, beam-search, parameter optimization, etc.

Framework suitable for both resource-rich and resource-poor language scenarios

Slide8


Stat-XFER Main Principles

Framework:

Statistical search-based approach with syntactic translation transfer rules that can be acquired from data but also developed and extended by experts

Automatic Word and Phrase translation lexicon acquisition from parallel data

Transfer-rule Learning:

apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages

Elicitation:

use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences

Rule Refinement:

refine the acquired rules via a process of interaction with bilingual informants

XFER + Decoder:

XFER engine produces a lattice of possible transferred structures at all levels

Decoder searches and selects the best scoring combination

Slide9


Stat-XFER MT Approach

[Figure: the classic MT pyramid, from Source (e.g. Quechua) to Target (e.g. English). Direct approaches (SMT, EBMT) sit at the base, Transfer Rules (Statistical-XFER) in the middle, and Interlingua at the apex; Syntactic Parsing and Semantic Analysis form the analysis side, Sentence Planning and Text Generation the generation side.]

Slide10

Transfer Engine

Language Model + Additional Features

Transfer Rules:

{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
 (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1))

Translation Lexicon:

N::N |: ["$WR"] -> ["BULL"]
((X1::Y1)
 ((X0 NUM) = s)
 ((Y0 lex) = "BULL"))

N::N |: ["$WRH"] -> ["LINE"]
((X1::Y1)
 ((X0 NUM) = s)
 ((Y0 lex) = "LINE"))

Source Input

בשורה הבאה

Decoder

English Output

in the next line

Translation Output Lattice

(0 1 "IN" @PREP)

(1 1 "THE" @DET)

(2 2 "LINE" @N)

(1 2 "THE LINE" @NP)

(0 2 "IN LINE" @PP)

(0 4 "IN THE NEXT LINE" @PP)

Preprocessing

Morphology

Slide11


Transfer Rule Formalism

Type information

Part-of-speech/constituent information

Alignments

x-side constraints

y-side constraints

xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
((X1::Y1)
 (X1::Y3)
 (X2::Y4)
 (X3::Y2)
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF)
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER)))
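As a reading aid, the parts listed above can be mirrored in a small data structure. This is a hypothetical sketch of one way to hold such a rule in memory, not the actual Stat-XFER implementation; the field names are invented and constraints are kept as raw strings:

    from dataclasses import dataclass

    @dataclass
    class TransferRule:
        src_type: str     # type information, source side (e.g. "NP")
        tgt_type: str     # type information, target side (e.g. "NP")
        src_seq: list     # POS/constituent sequence, x side
        tgt_seq: list     # POS/constituent sequence, y side
        alignments: list  # (x_index, y_index) pairs, 1-based as on the slide
        constraints: list # x-side, y-side, and xy constraints as raw strings

    # The rule above, transcribed:
    old_man = TransferRule(
        "NP", "NP",
        ["DET", "ADJ", "N"], ["DET", "N", "DET", "ADJ"],
        alignments=[(1, 1), (1, 3), (2, 4), (3, 2)],
        constraints=["((X1 AGR) = *3-SING)", "((X1 DEF) = *DEF)",
                     "((X3 AGR) = *3-SING)", "((X3 COUNT) = +)",
                     "((Y1 DEF) = *DEF)", "((Y3 DEF) = *DEF)",
                     "((Y2 AGR) = *3-SING)", "((Y2 GENDER) = (Y4 GENDER))"],
    )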

Slide12


Transfer Rule Formalism

Value constraints

Agreement constraints

;SL: the old man, TL: ha-ish ha-zaqen

NP::NP [DET ADJ N] -> [DET N DET ADJ]
((X1::Y1)
 (X1::Y3)
 (X2::Y4)
 (X3::Y2)
 ((X1 AGR) = *3-SING)
 ((X1 DEF) = *DEF)
 ((X3 AGR) = *3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = *DEF)
 ((Y3 DEF) = *DEF)
 ((Y2 AGR) = *3-SING)
 ((Y2 GENDER) = (Y4 GENDER)))

Slide13


Translation Lexicon: Examples

PRO::PRO |: ["ANI"] -> ["I"]
((X1::Y1)
 ((X0 per) = 1)
 ((X0 num) = s)
 ((X0 case) = nom))

PRO::PRO |: ["ATH"] -> ["you"]
((X1::Y1)
 ((X0 per) = 2)
 ((X0 num) = s)
 ((X0 gen) = m)
 ((X0 case) = nom))

N::N |: ["$&H"] -> ["HOUR"]

(

(X1::Y1)

((X0 NUM) = s)

((Y0 NUM) = s)

((Y0 lex) = "HOUR")

)

N::N |: ["$&H"] -> ["hours"]

(

(X1::Y1)

((Y0 NUM) = p)

((X0 NUM) = p)

((Y0 lex) = "HOUR")

)

Slide14


Hebrew Transfer Grammar: Example Rules

{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}

;;SL: H $MLWT H ADWMWT

;;TL: THE RED DRESSES

NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]

(

(X3::Y1)

(X1::Y2)

((X1 def) = +)

((X1 status) =c absolute)

((X1 num) = (X3 num))

((X1 gen) = (X3 gen))

(X0 = X1)

)

Slide15


The Transfer Engine

Input:

source-language input sentence, or source-language confusion network

Output:

lattice representing collection of translation fragments at all levels supported by transfer rules

Basic Algorithm:

“bottom-up” integrated “parsing-transfer-generation” guided by the transfer rules

Start with translations of individual words and phrases from translation lexicon

Create translations of larger constituents by applying applicable transfer rules to previously created lattice entries

Beam-search controls the exponential combinatorics of the search-space, using multiple scoring features
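A minimal sketch of this bottom-up loop, under strong simplifying assumptions: binary flat rules only, no unification constraints, plain string concatenation for output. All lexicon entries, rule scores, and names here are hypothetical:

    from collections import defaultdict

    def transfer(words, lexicon, rules, beam=5):
        # chart[(i, j)] holds up to `beam` entries (category, translation, score)
        chart = defaultdict(list)
        for i, w in enumerate(words):                        # seed with lexicon entries
            chart[(i, i + 1)] = sorted(lexicon.get(w, []), key=lambda e: -e[2])[:beam]
        n = len(words)
        for span in range(2, n + 1):                         # grow larger constituents
            for i in range(n - span + 1):
                j, entries = i + span, []
                for k in range(i + 1, j):                    # every split point
                    for lcat, ltr, lsc in chart[(i, k)]:
                        for rcat, rtr, rsc in chart[(k, j)]:
                            for (a, b), swap, cat, rulesc in rules:
                                if (a, b) == (lcat, rcat):   # rule RHS matches
                                    out = f"{rtr} {ltr}" if swap else f"{ltr} {rtr}"
                                    entries.append((cat, out, lsc + rsc + rulesc))
                chart[(i, j)] = sorted(entries, key=lambda e: -e[2])[:beam]  # beam pruning
        return chart

    # Hypothetical Hebrew-like example: noun before adjective, swapped in English
    lexicon = {"$MLH": [("NP", "dress", -0.2)], "ADWMH": [("ADJ", "red", -0.3)]}
    rules = [(("NP", "ADJ"), True, "NP", -0.1)]              # NP::NP [NP ADJ] -> [ADJ NP]
    print(transfer(["$MLH", "ADWMH"], lexicon, rules)[(0, 2)])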

Slide16


The Transfer Engine

Some Unique Features:

Works with either learned or manually-developed transfer grammars

Handles rules with or without unification constraints

Supports interfacing with servers for morphological analysis and generation

Can handle ambiguous source-word analyses and/or SL segmentations represented in the form of lattice structures

Slide17


XFER Output Lattice

(28 28 "AND" -5.6988 "W" "(CONJ,0 'AND')")

(29 29 "SINCE" -8.20817 "MAZ " "(ADVP,0 (ADV,5 'SINCE')) ")

(29 29 "SINCE THEN" -12.0165 "MAZ " "(ADVP,0 (ADV,6 'SINCE THEN')) ")

(29 29 "EVER SINCE" -12.5564 "MAZ " "(ADVP,0 (ADV,4 'EVER SINCE')) ")

(30 30 "WORKED" -10.9913 "&BD " "(VERB,0 (V,11 'WORKED')) ")

(30 30 "FUNCTIONED" -16.0023 "&BD " "(VERB,0 (V,10 'FUNCTIONED')) ")

(30 30 "WORSHIPPED" -17.3393 "&BD " "(VERB,0 (V,12 'WORSHIPPED')) ")

(30 30 "SERVED" -11.5161 "&BD " "(VERB,0 (V,14 'SERVED')) ")

(30 30 "SLAVE" -13.9523 "&BD " "(NP0,0 (N,34 'SLAVE')) ")

(30 30 "BONDSMAN" -18.0325 "&BD " "(NP0,0 (N,36 'BONDSMAN')) ")

(30 30 "A SLAVE" -16.8671 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,34 'SLAVE')) ) ) ) ")

(30 30 "A BONDSMAN" -21.0649 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,36 'BONDSMAN')) ) ) ) ")

Slide18


The Lattice Decoder

Simple Stack Decoder, similar in principle to simple Statistical MT decoders

Searches for best-scoring path of non-overlapping lattice arcs

No reordering during decoding

Scoring based on log-linear combination of scoring features, with weights trained using Minimum Error Rate Training (MERT)

Scoring components:

Statistical Language Model

Rule Scores

Lexical Probabilities

Fragmentation: how many arcs to cover the entire translation?

Length Penalty: how far from expected target length?
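A minimal sketch of the monotone best-path search over lattice arcs, assuming each arc carries pre-computed feature values and no LM state is carried across arcs (the real decoder does track LM context); arc data and weights are hypothetical. Giving every arc a "frag" feature of 1 with a negative weight implements the fragmentation penalty described above:

    def decode(n, arcs, weights):
        # arcs: (start, end, text, features-dict); best[pos] = (score, partial output)
        best = {0: (0.0, [])}
        for pos in range(n):                                  # positions left to right
            if pos not in best:
                continue
            base, path = best[pos]
            for start, end, text, feats in arcs:
                if start != pos:
                    continue
                score = base + sum(weights.get(f, 0.0) * v for f, v in feats.items())
                if end not in best or score > best[end][0]:   # keep best path to `end`
                    best[end] = (score, path + [text])
        return best.get(n)                                    # best full cover, if any

    # Hypothetical arcs over a 2-word input:
    arcs = [(0, 1, "IN", {"lex": -1.0, "frag": 1}),
            (1, 2, "THE LINE", {"lex": -2.0, "frag": 1}),
            (0, 2, "IN THE NEXT LINE", {"lex": -2.5, "frag": 1})]
    print(decode(2, arcs, {"lex": 1.0, "frag": -0.5}))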

Slide19


XFER Lattice Decoder

0 0 ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL

Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0,

Words: 13,13

235 < 0 8 -19.7602: B H IWM RBI&I (PP,0 (PREP,3 'ON')(NP,2 (LITERAL 'THE')

(NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH'))(NP1,0 (NP0,1 (N,6 'DAY')))))))>

918 < 8 14 -46.2973: H ARIH AKL AT H $PN (S,2 (NP,2 (LITERAL 'THE') (NP2,0

(NP1,0 (NP0,1 (N,17 'LION')))))(VERB,0 (V,0 'ATE'))(NP,100

(NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,24 'RABBIT')))))))>

584 < 14 17 -30.6607: L ARWXH BWQR (PP,0 (PREP,6 'TO')(NP,1 (LITERAL 'A')

(NP2,0 (NP1,0 (NNP,3 (NP0,0 (N,32 'MORNING'))(NP0,0 (N,27 'MEAL')))))))>

Slide20


Stat-XFER MT Systems

General Stat-XFER framework under development for the past seven years

Systems so far:

Chinese-to-English

Hebrew-to-English

Urdu-to-English

Hindi-to-English

Dutch-to-English

Mapudungun-to-Spanish

In progress or planned:

Brazilian Portuguese-to-English

Native-Brazilian languages to Brazilian Portuguese

Hebrew-to-Arabic

Quechua-to-Spanish

Turkish-to-English

Slide21

MT Resource Acquisition in Resource-rich Scenarios

Scenario: Significant amounts of parallel text at the sentence level are available

Parallel sentences can be word-aligned and parsed (at least on one side, ideally on both sides)

Goal: Acquire both broad-coverage translation lexicons and transfer-rule grammars automatically from the data

Syntax-based translation lexicons:

Broad-coverage constituent-level translation equivalents at all levels of granularity

Can serve as the elementary building blocks for transfer trees constructed at runtime using the transfer rules


Slide22

Acquisition Process

Automatic process for extracting syntax-driven rules and lexicons from sentence-parallel data:

1. Word-align the parallel corpus (GIZA++)

2. Parse the sentences independently for both languages

3. Run our new PFA Constituent Aligner over the parsed sentence pairs

4. Extract all aligned constituents from the parallel trees

5. Extract all derived synchronous transfer rules from the constituent-aligned parallel trees

6. Construct a "database" of all extracted parallel constituents and synchronous rules with their frequencies, and model them statistically (assign them relative-likelihood probabilities; see the sketch below)
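Step 6's "relative-likelihood probabilities" can plausibly be read as relative frequencies of a target pattern given its source pattern; a hypothetical sketch of that final modeling step (the actual Stat-XFER scoring may differ):

    from collections import Counter

    def relative_likelihood(occurrences):
        # occurrences: one (source_pattern, target_pattern) pair per extracted instance
        pair_counts = Counter(occurrences)
        src_counts = Counter(src for src, _ in occurrences)
        return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}

    # Hypothetical extraction counts:
    probs = relative_likelihood([("NP [NR]", "NP [NNP]")] * 3 + [("NP [NR]", "NP [NN]")])
    print(probs)  # {('NP [NR]', 'NP [NNP]'): 0.75, ('NP [NR]', 'NP [NN]'): 0.25}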


Slide23

PFA Constituent Node Aligner

Input: a bilingual pair of parsed and word-aligned sentences

Goal: find all sub-sentential constituent alignments between the two trees which are translation equivalents of each other

Equivalence Constraint: a pair of constituents <S,T> are considered translation equivalents if:

All words in the yield of <S> are aligned only to words in the yield of <T> (and vice-versa)

If <S> has a sub-constituent <S1> that is aligned to <T1>, then <T1> must be a sub-constituent of <T> (and vice-versa)

The algorithm is a bottom-up process starting from the word level, marking nodes that satisfy the constraints (see the sketch below)
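The first equivalence constraint reduces to a compact check over word-alignment links, with the second (nesting) constraint enforced by the bottom-up marking order; an illustrative sketch, not the aligner's actual implementation:

    def yields_equivalent(src_yield, tgt_yield, links):
        """True iff every alignment link that touches the source yield lands
        inside the target yield, and vice-versa (first constraint above)."""
        return all((s in src_yield) == (t in tgt_yield) for s, t in links)

    # Hypothetical word indices: source NP covers words {3, 4}, target NP {5, 6}
    links = {(3, 6), (4, 5), (0, 0)}
    print(yields_equivalent({3, 4}, {5, 6}, links))  # True: links stay inside
    print(yields_equivalent({3, 4}, {5}, links))     # False: link (3, 6) escapes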


Slide24

PFA Node Alignment Algorithm Example

Words don’t have to align one-to-one

Constituent labels can be different in each language

Tree Structures can be highly divergent

Slide25

PFA Node Alignment Algorithm Example

Aligner uses a clever arithmetic manipulation to enforce equivalence constraints

Resulting aligned nodes are highlighted in the figure

Slide26

PFA Node Alignment Algorithm Example

Extraction of Phrases:

Get the Yields of the aligned nodes and add them to a phrase table tagged with syntactic categories on both source and target sides

Example:

NP # NP :: 澳洲 # Australia
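Emitting phrase-table entries in the "cat # cat :: source # target" notation used on these slides is then mechanical; a hypothetical sketch:

    def extract_phrases(aligned_pairs):
        # aligned_pairs: (src_label, src_yield_tokens, tgt_label, tgt_yield_tokens)
        return [f"{sl} # {tl} :: {' '.join(sy)} # {' '.join(ty)}"
                for sl, sy, tl, ty in aligned_pairs]

    print(extract_phrases([("NP", ["澳洲"], "NP", ["Australia"])]))
    # ['NP # NP :: 澳洲 # Australia']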

Slide27

PFA Node Alignment Algorithm Example

All phrases from this tree pair:

IP # S :: 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 # Australia is one of the few countries that have diplomatic relations with North Korea .

VP # VP :: 是 与 北韩 有 邦交 的 少数 国家 之一 # is one of the few countries that have diplomatic relations with North Korea

NP # NP :: 与 北韩 有 邦交 的 少数 国家 之一 # one of the few countries that have diplomatic relations with North Korea

VP # VP :: 与 北韩 有 邦交 # have diplomatic relations with North Korea

NP # NP :: 邦交 # diplomatic relations

NP # NP :: 北韩 # North Korea

NP # NP :: 澳洲 # Australia

Slide28

PFA Constituent Node Alignment Performance

Evaluation Data: Chinese-English Treebank

Parallel Chinese-English Treebank with manual word-alignments

3342 Sentence Pairs

Created a "Gold Standard" of constituent alignments using the manual word-alignments and treebank trees

Node Alignments: 39874 (about 12 per tree pair)

NP-to-NP Alignments: 5427

Manual inspection confirmed that the constituent alignments are extremely accurate (>95%)

Evaluation: Run the PFA Aligner with automatic word alignments on the same data and compare with the "Gold Standard" alignments


Slide29

PFA Constituent Node Alignment Performance

Viterbi Combination       Precision   Recall   F-Measure
Intersection              0.6278      0.5525   0.5877
Union                     0.8054      0.2778   0.4131
Sym-1 (Thot Toolkit)      0.7182      0.4525   0.5552
Sym-2 (Thot Toolkit)      0.7170      0.4602   0.5606
Grow-Diag-Final           0.4040      0.2500   0.3089

Viterbi word alignments from Chinese-English and reverse directions were merged using different algorithms

Tested the performance of Node-Alignment with each resulting alignment
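For reference, the table's scores are standard set-overlap precision, recall, and F-measure between hypothesized and gold constituent alignments; a minimal sketch with hypothetical alignment sets:

    def prf(gold, hyp):
        tp = len(gold & hyp)                       # alignments found in both sets
        p = tp / len(hyp) if hyp else 0.0          # precision
        r = tp / len(gold) if gold else 0.0        # recall
        f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean
        return p, r, f

    # Hypothetical node-alignment sets, as (src_node_id, tgt_node_id) pairs:
    print(prf(gold={(1, 1), (2, 3), (4, 4)}, hyp={(1, 1), (2, 3), (5, 6)}))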

Slide30

Transfer Rule Learning

Input: Constituent-aligned parallel trees

Idea: Aligned nodes act as possible decomposition points of the parallel trees

The sub-trees of any aligned pair of nodes can be broken apart at any lower-level aligned nodes, creating an inventory of "treelet" correspondences

Synchronous "treelets" can be converted into synchronous rules

Algorithm:

Find all possible treelet decompositions from the node-aligned trees

"Flatten" the treelets into synchronous CFG rules (see the sketch below)
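The "flattening" step effectively reads off the frontier of each treelet pair and records the surviving alignments; a hypothetical sketch that reproduces the sample rule format shown on the following slides:

    def flatten_rule(src_cat, tgt_cat, src_seq, tgt_seq, alignments, score=0.5):
        # src_seq/tgt_seq: frontier items (constituent labels or lexical tokens)
        head = f"{src_cat}::{tgt_cat} [{' '.join(src_seq)}] -> [{' '.join(tgt_seq)}]"
        aligns = "".join(f"(X{x}::Y{y})" for x, y in alignments)
        return f"{head}\n((*score* {score});; Alignments\n{aligns})"

    print(flatten_rule("IP", "S", ["NP", "VP", "."], ["NP", "VP", "."],
                       [(1, 1), (2, 2)]))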


Slide31

Rule Extraction Algorithm

Sub-Treelet extraction: Extract sub-tree segments, including synchronous alignment information in the target tree. All the sub-trees and the super-tree are extracted.

Slide32

Rule Extraction Algorithm

Flat Rule Creation:

Each of the treelet pairs is flattened to create a rule in the ‘Avenue Formalism’.

Four major parts to the rule:

1. Type of the rule: Source and Target side type information

2. Constituent sequence of the synchronous flat rule

3. Alignment information of the constituents

4. Constraints in the rule

(Currently not extracted)

Slide33

Rule Extraction Algorithm

Flat Rule Creation:

Sample rule:

IP::S [NP VP .] -> [NP VP .]
(;; Alignments
 (X1::Y1)
 (X2::Y2)
 ;; Constraints
)

Slide34

Rule Extraction Algorithm

Flat Rule Creation:

Sample rule:

NP::NP [VP 北 CD 有 邦交] -> [one of the CD countries that VP]
(;; Alignments
 (X1::Y7)
 (X3::Y4)
)

Note: Any one-to-one aligned words are elevated to Part-of-Speech in the flat rule. Any non-aligned words on either the source or target side remain lexicalized.

Slide35

Rule Extraction Algorithm

All rules extracted:

VP::VP [VC NP] -> [VBZ NP]
((*score* 0.5)
 ;; Alignments
 (X1::Y1)
 (X2::Y2))

VP::VP [VC NP] -> [VBZ NP]
((*score* 0.5)
 ;; Alignments
 (X1::Y1)
 (X2::Y2))

NP::NP [NR] -> [NNP]
((*score* 0.5)
 ;; Alignments
 (X1::Y1)
 (X2::Y2))

VP::VP [北 NP VE NP] -> [VBP NP with NP]
((*score* 0.5)
 ;; Alignments
 (X2::Y4)
 (X3::Y1)
 (X4::Y2))

NP::NP [VP 北 CD 有 邦交] -> [one of the CD countries that VP]
((*score* 0.5)
 ;; Alignments
 (X1::Y7)
 (X3::Y4))

IP::S [NP VP] -> [NP VP]
((*score* 0.5)
 ;; Alignments
 (X1::Y1)
 (X2::Y2))

NP::NP ["北韩"] -> ["North" "Korea"]
(;; Many-to-one alignment is a phrase
)

Slide36

Chinese-English System

Developed over the past year under DARPA/GALE funding (within the IBM-led "Rosetta" team)

Participated in the recent NIST MT-08 Evaluation

Large-scale broad-coverage system

Integrates large manual resources with automatically extracted resources

Current performance level is still inferior to state-of-the-art phrase-based systems


Slide37

Chinese-English System

Lexical Resources:

Manual lexicons (base forms): LDC, ADSO, Wiki

Total number of entries: 1.07 million

Automatically acquired from parallel data:

Approx. 5 million sentences of LDC/GALE data

Filtered down to phrases < 10 words in length

Full (inflected) word forms

Total number of entries: 2.67 million


Slide38

Chinese-English System

Transfer Rules:

61 manually developed transfer rules

High-accuracy rules extracted from manually word-aligned parallel data

Corpus                    Size (sens)   Rules with Structure   Rules (count>=2)   Complete Lexical Rules
Parallel Treebank (3K)    3,343         45,266                 1,962              11,521
993 sentences             993           12,661                 331                2,199
Parallel Treebank (7K)    6,541         41,998                 1,756              16,081
Merged Corpus set         10K           94,117                 3,160              29,340


Slide39


Translation Example

SrcSent 3 澳洲是与北韩有邦交的少数国家之一。

Gloss:

Australia is with north korea have diplomatic relations DE few country one-of

Reference:

Australia is one of the few countries that have diplomatic relations with North Korea.

Translation:

Australia is one of the few countries that has diplomatic relations with north korea .

Overall: -5.77439, Prob: -2.58631, Rules: -0.66874, TransSGT: -2.58646, TransTGS: -1.52858, Frag: -0.0413927, Length: -0.127525, Words: 11,15

( 0 10 "Australia is one of the few countries that has diplomatic relations with north korea" -5.66505 "澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 " "(S1,1124731 (S,1157857 (NP,2 (NB,1 (LDC_N,1267 'Australia') ) ) (VP,1046077 (MISC_V,1 'is') (NP,1077875 (LITERAL 'one') (LITERAL 'of') (NP,1045537 (NP,1017929 (NP,1 (LITERAL 'the') (NUMNB,2 (LDC_NUM,420 'few') (NB,1 (WIKI_N,62230 'countries') ) ) ) (LITERAL 'that') (VP,1021811 (LITERAL 'has') (FBIS_NP,11916 'diplomatic relations') ) ) (FBIS_PP,84791 'with north korea') ) ) ) ) ) ")

( 10 11 "." -11.9549 "。" "(MISC_PUNC,20 '.')")

Slide40


Example: Syntactic Lexical Phrases

(LDC_N,1267 'Australia')

(WIKI_N,62230 'countries')

(FBIS_NP,11916 'diplomatic relations')

(FBIS_PP,84791 'with north korea')

Slide41


Example: XFER Rules

;;SL::(2,4) 对 台 贸易

;;TL::(3,5) trade to taiwan

;;Score::22

{NP,1045537}

NP::NP [PP NP ] -> [NP PP ]

((*score* 0.916666666666667)

(X2::Y1)

(X1::Y2))

;;SL::(2,7) 直接 提到 伟 哥 的 广告

;;TL::(1,7) commercials that directly mention the name viagra

;;Score::5

{NP,1017929}

NP::NP [VP "的" NP ] -> [NP "that" VP ]

((*score* 0.111111111111111)

(X3::Y1)

(X1::Y3))

;;SL::(4,14) 有 一 至 多 个 高 新 技术 项目 或 产品

;;TL::(3,14) has one or more new , high level technology projects or products

;;Score::4

{VP,1021811}

VP::VP ["有" NP ] -> ["has" NP ]

((*score* 0.1)

(X2::Y2))

Slide42

MT Resource Acquisition in Resource-poor Scenarios

Scenario: Very limited amounts of parallel text at the sentence level are available

Significant amounts of monolingual text available for one of the two languages (e.g. English, Spanish)

Approach: Manually acquire and/or construct translation lexicons

Transfer-rule grammars can be manually developed and/or automatically acquired from an elicitation corpus

Strategy:

Learn transfer rules by syntax projection from the major language to the minor language

Build an MT system to translate from the minor language to the major language


Slide43


Learning Transfer-Rules for Languages with Limited Resources

Rationale:

Large bilingual corpora not available

Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using an elicitation tool

Elicitation corpus designed to be typologically comprehensive and compositional

Transfer-rule engine and new learning approach support acquisition of generalized transfer rules from the data

Slide44


Elicitation Tool: English-Hindi Example

Slide45


Elicitation Tool: English-Arabic Example

Slide46


Elicitation Tool: Spanish-Mapudungun Example

Slide47


Hebrew-to-English MT Prototype

Initial prototype developed within a two-month intensive effort

Accomplished:

Adapted available morphological analyzer

Constructed a preliminary translation lexicon

Translated and aligned Elicitation Corpus

Learned XFER rules

Developed (small) manual XFER grammar

System debugging and development

Evaluated performance on unseen test data using automatic evaluation metrics

Slide48


Challenges for Hebrew MT

Paucity of existing language resources for Hebrew

No publicly available broad coverage morphological analyzer

No publicly available bilingual lexicons or dictionaries

No POS-tagged corpus or parse tree-bank corpus for Hebrew

No large Hebrew/English parallel corpus

Scenario well suited for Stat-XFER framework for languages with limited resources

Slide49


Modern Hebrew Spelling

Two main spelling variants

"KTIV XASER" (deficient): spelling in which vowels are marked with diacritics, leaving bare consonant sequences when the diacritics are removed

"KTIV MALEH" (full): words with I/O/U vowels are written with long vowels, which include a letter

KTIV MALEH is predominant, but not strictly adhered to, even in newspapers and official publications

→ inconsistent spelling

Example:

niqud (spelling): NIQWD, NQWD, NQD

When written as NQD, it could also be niqed, naqed, or nuqad

Slide50


Morphological Analyzer

We use a publicly available morphological analyzer distributed by the Technion’s Knowledge Center, adapted for our system

Coverage is reasonable (for nouns, verbs and adjectives)

Produces all analyses or a disambiguated analysis for each word

Output format includes lexeme (base form), POS, morphological features

Output was adapted to our representation needs (POS and feature mappings)

Slide51


Morphology Example

Input word: B$WRH

0 1 2 3 4

|--------B$WRH--------|

|-----B-----|$WR|--H--|

|--B--|-H--|--$WRH---|

Slide52


Morphology Example

Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))

Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))

Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))

Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))

Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))

Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))

Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))

Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
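The analyses above map naturally onto arcs of a small lattice keyed by (SPANSTART, SPANEND); a hypothetical sketch that enumerates the complete segmentations of B$WRH from those spans (assuming every arc advances, i.e. SPANEND > SPANSTART):

    def segmentations(arcs, start, end):
        """Enumerate complete paths through the analysis lattice."""
        if start == end:
            return [[]]
        paths = []
        for s, e, lex, pos in arcs:
            if s == start:
                paths += [[(lex, pos)] + rest for rest in segmentations(arcs, e, end)]
        return paths

    # Arcs transcribed from Y0-Y7 above (LEX and POS only):
    arcs = [(0, 4, "B$WRH", "N"), (0, 2, "B", "PREP"), (1, 3, "$WR", "N"),
            (3, 4, "$LH", "POSS"), (0, 1, "B", "PREP"), (1, 2, "H", "DET"),
            (2, 4, "$WRH", "N"), (0, 4, "B$WRH", "LEX")]
    for path in segmentations(arcs, 0, 4):
        print(path)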

Slide53


Translation Lexicon

Constructed our own Hebrew-to-English lexicon, based primarily on existing “Dahan” H-to-E and E-to-H dictionary made available to us, augmented by other public sources

Coverage is not great but not bad as a start

Dahan H-to-E is about 15K translation pairs

Dahan E-to-H is about 7K translation pairs

Base forms, POS information on both sides

Converted Dahan into our representation, added entries for missing closed-class entries (pronouns, prepositions, etc.)

Had to deal with spelling conventions

Recently augmented with ~50K translation pairs extracted from Wikipedia (mostly proper names and named entities)

Slide54


Manual Transfer Grammar (human-developed)

Initially developed by Alon in a couple of days, extended and revised by Nurit over time

Current grammar has 36 rules:

21 NP rules

one PP rule

6 verb complexes and VP rules

8 higher-phrase and sentence-level rules

Captures the most common (mostly local) structural differences between Hebrew and English

Slide55


Transfer Grammar: Example Rules

{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
((X2::Y1)
 (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1))

{NP1,3}

;;SL: H $MLWT H ADWMWT

;;TL: THE RED DRESSES

NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]

(

(X3::Y1)

(X1::Y2)

((X1 def) = +)

((X1 status) =c absolute)

((X1 num) = (X3 num))

((X1 gen) = (X3 gen))

(X0 = X1)

)

Slide56


Example Translation

Input:

לאחר דיונים רבים החליטה הממשלה לערוך משאל עם בנושא הנסיגה

Gloss: After debates many decided the government to hold referendum in issue the withdrawal

Output:

AFTER MANY DEBATES THE GOVERNMENT DECIDED TO HOLD A REFERENDUM ON THE ISSUE OF THE WITHDRAWAL

Slide57


Noun Phrases – Construct State

החלטת הנשיא הראשון
HXL@T [HNSIA HRA$WN]
decision.3SF-CS the-president.3SM the-first.3SM
THE DECISION OF THE FIRST PRESIDENT

החלטת הנשיא הראשונה
[HXL@T HNSIA] HRA$WNH
decision.3SF-CS the-president.3SM the-first.3SF
THE FIRST DECISION OF THE PRESIDENT

Slide58


Noun Phrases - Possessives

הנשיא הכריז שהמשימה הראשונה שלו תהיה למצוא פתרון לסכסוך באזורנו

HNSIA HKRIZ $HM$IMH HRA$WNH $LW THIH
the-president announced that-the-task.3SF the-first.3SF of-him will.3SF
LMCWA PTRWN LSKSWK BAZWRNW
to-find solution to-the-conflict in-region-POSS.1P

Without transfer grammar:
THE PRESIDENT ANNOUNCED THAT THE TASK THE BEST OF HIM WILL BE TO FIND SOLUTION TO THE CONFLICT IN REGION OUR

With transfer grammar:
THE PRESIDENT ANNOUNCED THAT HIS FIRST TASK WILL BE TO FIND A SOLUTION TO THE CONFLICT IN OUR REGION

Slide59


Subject-Verb Inversion

אתמול הודיעה הממשלה שתערכנה בחירות בחודש הבא

ATMWL HWDI&H HMM$LH
yesterday announced.3SF the-government.3SF
$T&RKNH BXIRWT BXWD$ HBA
that-will-be-held.3PF elections.3PF in-the-month the-next

Without transfer grammar:
YESTERDAY ANNOUNCED THE GOVERNMENT THAT WILL RESPECT OF THE FREEDOM OF THE MONTH THE NEXT

With transfer grammar:
YESTERDAY THE GOVERNMENT ANNOUNCED THAT ELECTIONS WILL ASSUME IN THE NEXT MONTH

Slide60


Subject-Verb Inversion

לפני כמה שבועות הודיעה הנהלת המלון שהמלון יסגר בסוף השנה

LPNI KMH $BW&WT HWDI&H HNHLT HMLWN
before several weeks announced.3SF management.3SF.CS the-hotel
$HMLWN ISGR BSWF H$NH
that-the-hotel.3SM will-be-closed.3SM at-end.3SM.CS the-year

Without transfer grammar:
IN FRONT OF A FEW WEEKS ANNOUNCED ADMINISTRATION THE HOTEL THAT THE HOTEL WILL CLOSE AT THE END THIS YEAR

With transfer grammar:
SEVERAL WEEKS AGO THE MANAGEMENT OF THE HOTEL ANNOUNCED THAT THE HOTEL WILL CLOSE AT THE END OF THE YEAR

Slide61


Evaluation Results

Test set of 62 sentences from Haaretz newspaper, 2 reference translations

System

BLEU

NIST

P

R

METEOR

No Gram

0.0616

3.4109

0.4090

0.4427

0.3298

Learned

0.0774

3.5451

0.4189

0.4488

0.3478

Manual

0.1026

3.7789

0.4334

0.4474

0.3617

Slide62

Open Research Questions

Our large-scale Chinese-English system is still significantly behind phrase-based SMT. Why?

Weaker decoder?

Feature set is not sufficiently discriminant?

Problems with the parsers for the two sides?

Syntactic constituents don't provide sufficient coverage?

Bugs and deficiencies in the underlying algorithms?

The ISI experience indicates that it may take a couple of years to catch up with and surpass the phrase-based systems

Significant engineering issues remain: improving speed, runtime efficiency, and search


Slide63

Open Research Questions

Immediate Research Issues:

Rule Learning:

Study effects of learning rules from manually vs. automatically word-aligned data

Study effects of parser accuracy on learned rules

Effective discriminant methods for modeling rule scores

Rule filtering strategies

Syntax-based LMs:

Our translations come out with a syntax-tree attached to them

Add a syntax-based LM feature that can discriminate between good and bad trees


Slide64

Conclusions

Stat-XFER is a promising general MT framework, suitable for a variety of MT scenarios and languages

Provides a complete solution for building end-to-end MT systems from parallel data, akin to phrase-based SMT systems (training, tuning, runtime system)

No open-source publicly available toolkits (yet), but we welcome further collaboration activities

Complex but highly interesting set of open research issues

Prediction: this is the future direction of MT!


Slide65


Questions?