Slide1
Stat-XFER: A General Framework for Search-based Syntax-driven MT
Alon Lavie
Language Technologies Institute
Carnegie Mellon University
Joint work with:
Erik Peterson, Alok Parlikar, Vamshi Ambati, Christian Monson, Ari Font Llitjos, Lori Levin, Jaime Carbonell – Carnegie Mellon University
Shuly Wintner, Danny Shacham, Nurit Melnik - University of Haifa
Roberto Aranovitch – University of Pittsburgh
Slide2
February 18, 2008
CICLing-2008
Outline
Context and Rationale
CMU Statistical Transfer MT Framework
Broad Resource Scenario: Chinese-to-English
Low Resource Scenario: Hebrew-to-English
Open Research Challenges
Conclusions
Slide3
Current State-of-the-Art in Machine Translation
MT underwent a major paradigm shift over the past 15 years:
From manually crafted rule-based systems with manually designed knowledge resources
To search-based approaches founded on automatic extraction of translation models/units from large sentence-parallel corpora
Current Dominant Approach: Phrase-based Statistical MT:
Extract and statistically model large volumes of phrase-to-phrase correspondences from automatically word-aligned parallel corpora
“Decode” new input by searching for the most likely sequence of phrase matches, using a combination of features, including a statistical Language Model for the target language
Slide4
Current State-of-the-art in Machine Translation
Phrase-based MT State-of-the-art:
Requires at least several million words of parallel text for adequate training
Mostly limited to language-pairs for which such data exists: major European languages, Arabic, Chinese, Japanese, a few others…
Linguistically shallow and highly lexicalized models result in weak generalization
Best performance levels (BLEU=~0.6) on Arabic-to-English provide understandable but often still ungrammatical or somewhat disfluent translations
Ill suited for Hebrew and most of the world’s minor and resource-poor languages
Slide5
Rule-based vs. Statistical MT
Traditional Rule-based MT:
Expressive and linguistically-rich formalisms capable of describing complex mappings between the two languages
Accurate “clean” resources
Everything constructed manually by experts
Main challenge: obtaining broad coverage
Phrase-based Statistical MT:
Learn word and phrase correspondences automatically from large volumes of parallel data
Search-based “decoding” framework:
Models propose many alternative translations
Effective search algorithms find the “best” translation
Main challenge: obtaining high translation accuracy
Slide6
Research Goals
Long-term research agenda (since 2000) focused on developing a unified framework for MT that addresses the core fundamental weaknesses of previous approaches:
Representation – explore richer formalisms that can capture complex divergences between languages
Ability to handle morphologically complex languages
Methods for automatically acquiring MT resources from available data and combining them with manual resources
Ability to address both rich and poor resource scenarios
Main research funding sources: NSF (AVENUE and LETRAS projects) and DARPA (GALE)
Slide7
CMU Statistical Transfer (Stat-XFER) MT Approach
Integrate the major strengths of rule-based and statistical MT within a common framework:
Linguistically rich formalism that can express complex and abstract compositional transfer rules
Rules can be written by human experts and also acquired automatically from data
Easy integration of morphological analyzers and generators
Word and syntactic-phrase correspondences can be automatically acquired from parallel text
Search-based decoding from statistical MT adapted to find the best translation within the search space: multi-feature scoring, beam-search, parameter optimization, etc.
Framework suitable for both resource-rich and resource-poor language scenarios
Slide8
Stat-XFER Main Principles
Framework:
Statistical search-based approach with syntactic translation transfer rules that can be acquired from data but also developed and extended by experts
Automatic Word and Phrase translation lexicon acquisition from parallel data
Transfer-rule Learning:
apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
Elicitation:
use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences
Rule Refinement:
refine the acquired rules via a process of interaction with bilingual informants
XFER + Decoder:
XFER engine produces a lattice of possible transferred structures at all levels
Decoder searches and selects the best scoring combination
Slide9
Stat-XFER MT Approach
[Figure: translation pyramid between Source (e.g. Quechua) and Target (e.g. English). Labels: Interlingua; Syntactic Parsing; Semantic Analysis; Sentence Planning; Text Generation; Transfer Rules; Direct: SMT, EBMT; Statistical-XFER]
Slide10
Transfer Engine
Language Model + Additional Features
Transfer Rules
{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1))
Translation Lexicon
N::N |: ["$WR"] -> ["BULL"]
((X1::Y1)
((X0 NUM) = s)
((Y0 lex) = "BULL"))
N::N |: ["$WRH"] -> ["LINE"]
((X1::Y1)
((X0 NUM) = s)
((Y0 lex) = "LINE"))
Source Input
בשורה הבאה
Decoder
English Output
in the next line
Translation Output Lattice
(0 1 "IN" @PREP)
(1 1 "THE" @DET)
(2 2 "LINE" @N)
(1 2 "THE LINE" @NP)
(0 2 "IN LINE" @PP)
(0 4 "IN THE NEXT LINE" @PP)
Preprocessing
Morphology
Slide11
Transfer Rule Formalism
Components of a rule: type information; part-of-speech/constituent information; alignments; x-side constraints; y-side constraints; xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
Slide12
Transfer Rule Formalism
Constraint types: value constraints, e.g. ((X1 AGR) = *3-SING); agreement constraints, e.g. ((Y2 GENDER) = (Y4 GENDER))
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1)
(X1::Y3)
(X2::Y4)
(X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
Slide13
Translation Lexicon: Examples
PRO::PRO |: ["ANI"] -> ["I"]
(
(X1::Y1)
((X0 per) = 1)
((X0 num) = s)
((X0 case) = nom)
)
PRO::PRO |: ["ATH"] -> ["you"]
(
(X1::Y1)
((X0 per) = 2)
((X0 num) = s)
((X0 gen) = m)
((X0 case) = nom)
)
N::N |: ["$&H"] -> ["HOUR"]
(
(X1::Y1)
((X0 NUM) = s)
((Y0 NUM) = s)
((Y0 lex) = "HOUR")
)
N::N |: ["$&H"] -> ["hours"]
(
(X1::Y1)
((Y0 NUM) = p)
((X0 NUM) = p)
((Y0 lex) = "HOUR")
)
Slide14
Hebrew Transfer Grammar: Example Rules
{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
(
(X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1)
)
{NP1,3}
;;SL: H $MLWT H ADWMWT
;;TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
(X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1)
)
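The reordering encoded by the alignment lines of rule {NP1,3} can be illustrated with a minimal Python sketch. This is not the Stat-XFER engine: the function name apply_rule and its arguments are hypothetical, and the unification constraints (def, status, num, gen) are ignored entirely. Only the placement of translated source constituents into target slots is shown.

```python
def apply_rule(alignments, target_len, child_translations):
    """Place translated source constituents into their aligned target slots.

    alignments: (x, y) pairs, 1-indexed, read as "source constituent Xx
    fills target slot Yy" (hypothetical encoding of the X::Y lines).
    child_translations: one entry per source constituent; None for a
    constituent that has no translation of its own.
    """
    target = [None] * target_len
    for x, y in alignments:
        target[y - 1] = child_translations[x - 1]
    # Unfilled target slots are simply skipped in this sketch.
    return " ".join(t for t in target if t is not None)


# Rule {NP1,3}: NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1] with (X3::Y1) (X1::Y2).
# X2, the literal "H", is unaligned and contributes nothing to the target.
children = ["DRESSES", None, "RED"]
result = apply_rule([(3, 1), (1, 2)], 2, children)
print(result)
```

With the toy input above, the noun and adjective swap places, matching the adjective-noun order on the English side of the rule.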
Slide15
The Transfer Engine
Input:
source-language input sentence, or source-language confusion network
Output:
lattice representing collection of translation fragments at all levels supported by transfer rules
Basic Algorithm:
“bottom-up” integrated “parsing-transfer-generation” guided by the transfer rules
Start with translations of individual words and phrases from translation lexicon
Create translations of larger constituents by applying applicable transfer rules to previously created lattice entries
Beam-search controls the exponential combinatorics of the search-space, using multiple scoring features
Slide16
The Transfer Engine
Some Unique Features:
Works with either learned or manually-developed transfer grammars
Handles rules with or without unification constraints
Supports interfacing with servers for morphological analysis and generation
Can handle ambiguous source-word analyses and/or SL segmentations represented in the form of lattice structures
Slide17
XFER Output Lattice
(28 28 "AND" -5.6988 "W" "(CONJ,0 'AND')")
(29 29 "SINCE" -8.20817 "MAZ " "(ADVP,0 (ADV,5 'SINCE')) ")
(29 29 "SINCE THEN" -12.0165 "MAZ " "(ADVP,0 (ADV,6 'SINCE THEN')) ")
(29 29 "EVER SINCE" -12.5564 "MAZ " "(ADVP,0 (ADV,4 'EVER SINCE')) ")
(30 30 "WORKED" -10.9913 "&BD " "(VERB,0 (V,11 'WORKED')) ")
(30 30 "FUNCTIONED" -16.0023 "&BD " "(VERB,0 (V,10 'FUNCTIONED')) ")
(30 30 "WORSHIPPED" -17.3393 "&BD " "(VERB,0 (V,12 'WORSHIPPED')) ")
(30 30 "SERVED" -11.5161 "&BD " "(VERB,0 (V,14 'SERVED')) ")
(30 30 "SLAVE" -13.9523 "&BD " "(NP0,0 (N,34 'SLAVE')) ")
(30 30 "BONDSMAN" -18.0325 "&BD " "(NP0,0 (N,36 'BONDSMAN')) ")
(30 30 "A SLAVE" -16.8671 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,34 'SLAVE')) ) ) ) ")
(30 30 "A BONDSMAN" -21.0649 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,36 'BONDSMAN')) ) ) ) ")
Slide18
The Lattice Decoder
Simple Stack Decoder, similar in principle to simple Statistical MT decoders
Searches for best-scoring path of non-overlapping lattice arcs
No reordering during decoding
Scoring based on log-linear combination of scoring features, with weights trained using Minimum Error Rate Training (MERT)
Scoring components:
Statistical Language Model
Rule Scores
Lexical Probabilities
Fragmentation: how many arcs to cover the entire translation?
Length Penalty: how far from expected target length?
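The search for a best-scoring path of non-overlapping arcs can be sketched as follows. This is a simplified dynamic program, not the stack decoder described above: it assumes every feature is local to a single arc (a real language model scores across arc boundaries), uses half-open spans, and treats fragmentation as a fixed per-arc penalty. The names Arc, arc_score, best_path and all scores and weights are hypothetical.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    start: int      # first source position covered (inclusive)
    end: int        # one past the last source position covered
    text: str       # target-language fragment
    features: dict  # feature name -> log-domain value


def arc_score(arc, weights):
    # Log-linear model: weighted sum of feature values.
    return sum(weights[name] * value for name, value in arc.features.items())


def best_path(arcs, n, weights, frag_weight):
    """Best monotone cover of source positions [0, n) by non-overlapping arcs."""
    best = [None] * (n + 1)  # best[i] = (score, fragments) covering [0, i)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for arc in arcs:
            if arc.end != i or best[arc.start] is None:
                continue
            prev_score, prev_texts = best[arc.start]
            # frag_weight is charged once per arc, so fewer arcs are preferred.
            score = prev_score + arc_score(arc, weights) + frag_weight
            if best[i] is None or score > best[i][0]:
                best[i] = (score, prev_texts + [arc.text])
    return best[n]


# Toy lattice with made-up scores.
arcs = [
    Arc(0, 1, "IN", {"lm": -1.0}),
    Arc(1, 2, "THE", {"lm": -1.0}),
    Arc(2, 3, "LINE", {"lm": -2.0}),
    Arc(1, 3, "THE LINE", {"lm": -2.5}),
    Arc(0, 3, "IN THE LINE", {"lm": -3.0}),
]
score, fragments = best_path(arcs, 3, {"lm": 1.0}, frag_weight=-0.5)
print(score, fragments)
```

In this toy setting the single arc spanning the whole input wins, because it has the best feature score and pays the fragmentation penalty only once. In practice the weights would come from MERT rather than being set by hand.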
Slide19
XFER Lattice Decoder
0 0 ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL
Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0,
Words: 13,13
235 < 0 8 -19.7602: B H IWM RBI&I (PP,0 (PREP,3 'ON')(NP,2 (LITERAL 'THE')
(NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH'))(NP1,0 (NP0,1 (N,6 'DAY')))))))>
918 < 8 14 -46.2973: H ARIH AKL AT H $PN (S,2 (NP,2 (LITERAL 'THE') (NP2,0
(NP1,0 (NP0,1 (N,17 'LION')))))(VERB,0 (V,0 'ATE'))(NP,100
(NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,24 'RABBIT')))))))>
584 < 14 17 -30.6607: L ARWXH BWQR (PP,0 (PREP,6 'TO')(NP,1 (LITERAL 'A')
(NP2,0 (NP1,0 (NNP,3 (NP0,0 (N,32 'MORNING'))(NP0,0 (N,27 'MEAL')))))))>
Slide20
Stat-XFER MT Systems
General Stat-XFER framework under development for past seven years
Systems so far:
Chinese-to-English
Hebrew-to-English
Urdu-to-English
Hindi-to-English
Dutch-to-English
Mapudungun-to-Spanish
In progress or planned:
Brazilian Portuguese-to-English
Native-Brazilian languages to Brazilian Portuguese
Hebrew-to-Arabic
Quechua-to-Spanish
Turkish-to-English
Slide21
MT Resource Acquisition in Resource-rich Scenarios
Scenario: Significant amounts of parallel-text at sentence-level are available
Parallel sentences can be word-aligned and parsed (at least on one side, ideally on both sides)
Goal: Acquire both broad-coverage translation lexicons and transfer rule grammars automatically from the data
Syntax-based translation lexicons:
Broad-coverage constituent-level translation equivalents at all levels of granularity
Can serve as the elementary building blocks for transfer trees constructed at runtime using the transfer rules
Slide22
Acquisition Process
Automatic Process for Extracting Syntax-driven Rules and Lexicons from sentence-parallel data:
Word-align the parallel corpus (GIZA++)
Parse the sentences independently for both languages
Run our new PFA Constituent Aligner over the parsed sentence pairs
Extract all aligned constituents from the parallel trees
Extract all derived synchronous transfer rules from the constituent-aligned parallel trees
Construct a “data-base” of all extracted parallel constituents and synchronous rules with their frequencies and model them statistically (assign them relative-likelihood probabilities)
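The final step, turning frequencies into relative-likelihood probabilities, might look like the sketch below. The slides do not say which side the estimate is conditioned on; conditioning the target on the source is an assumption here, and the function name relative_frequencies and the counts are hypothetical.

```python
from collections import Counter


def relative_frequencies(pairs):
    """Estimate P(target | source) as count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    return {
        (src, tgt): count / source_counts[src]
        for (src, tgt), count in pair_counts.items()
    }


# Made-up extraction counts for one source phrase.
extracted = [("邦交", "diplomatic relations")] * 3 + [("邦交", "relations")]
probs = relative_frequencies(extracted)
print(probs)
```

The same counting applies to extracted rules if each rule is keyed by its source side and target side.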
Slide23
PFA Constituent Node Aligner
Input: a bilingual pair of parsed and word-aligned sentences
Goal: find all sub-sentential constituent alignments between the two trees which are translation equivalents of each other
Equivalence Constraint: a pair of constituents <S,T> are considered translation equivalents if:
All words in yield of <S> are aligned only to words in yield of <T> (and vice-versa)
If <S> has a sub-constituent <S1> that is aligned to <T1>, then <T1> must be a sub-constituent of <T> (and vice-versa)
Algorithm is a bottom-up process starting from word-level, marking nodes that satisfy the constraints
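The first equivalence constraint can be checked directly with sets of word positions, as in the sketch below. This is a plain set-based illustration, not the PFA aligner's own method, and it does not cover the second (sub-constituent nesting) constraint. The function name yields_consistent is hypothetical, and requiring at least one alignment link between the two yields is an added assumption.

```python
def yields_consistent(src_span, tgt_span, links):
    """Check that the words of S align only into T, and vice versa.

    src_span, tgt_span: sets of word positions in the yields of S and T.
    links: set of (source_position, target_position) word-alignment pairs.
    """
    # Every link that touches either yield must lie entirely inside both.
    touched = [(s, t) for s, t in links if s in src_span or t in tgt_span]
    return bool(touched) and all(
        s in src_span and t in tgt_span for s, t in touched
    )


# Toy alignment: source word 1 aligns to target words 2 and 3.
links = {(0, 0), (1, 2), (1, 3), (2, 1)}
print(yields_consistent({1}, {2, 3}, links))     # consistent pair
print(yields_consistent({1, 2}, {2, 3}, links))  # source word 2 aligns outside
```

A bottom-up aligner would run a check of this kind on candidate node pairs, starting from the word level and moving up the two trees.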
Slide24
PFA Node Alignment Algorithm Example
Words don’t have to align one-to-one
Constituent labels can be different in each language
Tree Structures can be highly divergent
Slide25
PFA Node Alignment Algorithm Example
Aligner uses a clever arithmetic manipulation to enforce equivalence constraints
Resulting aligned nodes are highlighted in figure
Slide26
PFA Node Alignment Algorithm Example
Extraction of Phrases:
Get the Yields of the aligned nodes and add them to a phrase table tagged with syntactic categories on both source and target sides
Example:
NP # NP :: 澳洲 # Australia
Slide27
PFA Node Alignment Algorithm Example
All Phrases from this tree pair:
IP # S :: 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 # Australia is one of the few countries that have diplomatic relations with North Korea .
VP # VP :: 是 与 北韩 有 邦交 的 少数 国家 之一 # is one of the few countries that have diplomatic relations with North Korea
NP # NP :: 与 北韩 有 邦交 的 少数 国家 之一 # one of the few countries that have diplomatic relations with North Korea
VP # VP :: 与 北韩 有 邦交 # have diplomatic relations with North Korea
NP # NP :: 邦交 # diplomatic relations
NP # NP :: 北韩 # North Korea
NP # NP :: 澳洲 # Australia
Slide28
PFA Constituent Node Alignment Performance
Evaluation Data: Chinese-English Treebank
Parallel Chinese-English Treebank with manual word-alignments
3342 Sentence Pairs
Created “Gold Standard” constituent alignments using the manual word-alignments and treebank trees
Node Alignments: 39874 (About 12/tree pair)
NP to NP Alignments: 5427
Manual inspection confirmed that the constituent alignments are extremely accurate (>95%)
Evaluation: Run PFA Aligner with automatic word alignments on same data and compare with the “Gold Standard” alignments
Slide29
PFA Constituent Node Alignment Performance
Viterbi Combination    Precision  Recall  F-Measure
Intersection           0.6278     0.5525  0.5877
Union                  0.8054     0.2778  0.4131
Sym-1 (Thot Toolkit)   0.7182     0.4525  0.5552
Sym-2 (Thot Toolkit)   0.7170     0.4602  0.5606
Grow-Diag-Final        0.4040     0.2500  0.3089
Viterbi word alignments from Chinese-English and reverse directions were merged using different algorithms
Tested the performance of Node-Alignment with each resulting alignment
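The F-Measure values reported for the word-alignment combinations are consistent with the balanced harmonic mean of precision and recall. A short Python check, with f_measure as an illustrative helper name:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the Intersection and Union rows.
f_intersection = f_measure(0.6278, 0.5525)
f_union = f_measure(0.8054, 0.2778)
print(round(f_intersection, 4), round(f_union, 4))
```

Intersection gives the best balance of the five combinations, while Union trades a large drop in recall for higher precision.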
Slide30
Transfer Rule Learning
Input: Constituent-aligned parallel trees
Idea: Aligned nodes act as possible decomposition points of the parallel trees
The sub-trees of any aligned pair of nodes can be broken apart at any lower-level aligned nodes, creating an inventory of “treelet” correspondences
Synchronous “treelets” can be converted into synchronous rules
Algorithm:
Find all possible treelet decompositions from the node aligned trees
“Flatten” the treelets into synchronous CFG rules
Slide31
Rule Extraction Algorithm
Sub-Treelet extraction:
Extract Sub-tree segments including synchronous alignment information in the target tree. All the sub-trees and the super-tree are extracted.
Slide32
Rule Extraction Algorithm
Flat Rule Creation:
Each treelet pair is flattened to create a rule in the ‘Avenue Formalism’
Four major parts to the rule:
1. Type of the rule: Source and Target side type information
2. Constituent sequence of the synchronous flat rule
3. Alignment information of the constituents
4. Constraints in the rule (currently not extracted)
Slide33
Rule Extraction Algorithm
Flat Rule Creation:
Sample rule:
IP::S [ NP VP .] -> [NP VP .]
(
;; Alignments
(X1::Y1)
(X2::Y2)
;;Constraints
)
Slide34
Rule Extraction Algorithm
Flat Rule Creation:
Sample rule:
NP::NP [VP 北 CD 有 邦交 ] -> [one of the CD countries that VP]
(
;; Alignments
(X1::Y7)
(X3::Y4)
)
Note: Any one-to-one aligned words are elevated to Part-Of-Speech in flat rule. Any non-aligned words on either source or target side remain lexicalized
Slide35
Rule Extraction Algorithm
All rules extracted:
VP::VP [VC NP] -> [VBZ NP]
(
(*score* 0.5)
;; Alignments
(X1::Y1)
(X2::Y2)
)
VP::VP [VC NP] -> [VBZ NP]
(
(*score* 0.5)
;; Alignments
(X1::Y1)
(X2::Y2)
)
NP::NP [NR] -> [NNP]
(
(*score* 0.5)
;; Alignments
(X1::Y1)
(X2::Y2)
)
VP::VP [北 NP VE NP] -> [ VBP NP with NP]
(
(*score* 0.5)
;; Alignments
(X2::Y4)
(X3::Y1)
(X4::Y2)
)
NP::NP [VP 北 CD 有 邦交 ] -> [one of the CD countries that VP]
(
(*score* 0.5)
;; Alignments
(X1::Y7)
(X3::Y4)
)
IP::S [ NP VP ] -> [NP VP ]
(
(*score* 0.5)
;; Alignments
(X1::Y1)
(X2::Y2)
)
NP::NP [ “北韩”] -> [“North” “Korea”]
(
;Many to one alignment is a phrase
)
Slide36
Chinese-English System
Developed over past year under DARPA/GALE funding (within IBM-led “Rosetta” team)
Participated in recent NIST MT-08 Evaluation
Large-scale broad-coverage system
Integrates large manual resources with automatically extracted resources
Current performance-level is still inferior to state-of-the-art phrase-based systems
Slide37
Chinese-English System
Lexical Resources:
Manual Lexicons (base forms): LDC, ADSO, Wiki
Total number of entries: 1.07 million
Automatically acquired from parallel data:
Approx 5 million sentences LDC/GALE data
Filtered down to phrases < 10 words in length
Full formed
Total number of entries: 2.67 million
Slide38
Chinese-English System
Transfer Rules:
61 manually developed transfer rules
High-accuracy rules extracted from manually word-aligned parallel data
Corpus                  Size (sens)  Rules with Structure  Rules (count>=2)  Complete Lexical rules
Parallel Treebank (3K)  3,343        45,266                1,962             11,521
993 sentences           993          12,661                331               2,199
Parallel Treebank (7K)  6,541        41,998                1,756             16,081
Merged Corpus set       10K          94,117                3160              29,340
Slide39
February 5, 2008
CMU MT Update for Joe Olive
Translation Example
SrcSent 3 澳洲是与北韩有邦交的少数国家之一。
Gloss:
Australia is with north korea have diplomatic relations DE few country world
Reference:
Australia is one of the few countries that have diplomatic relations with North Korea.
Translation:
Australia is one of the few countries that has diplomatic relations with north korea .
Overall: -5.77439, Prob: -2.58631, Rules: -0.66874, TransSGT: -2.58646, TransTGS: -1.52858, Frag: -0.0413927, Length: -0.127525, Words: 11,15
( 0 10 "Australia is one of the few countries that has diplomatic relations with north korea" -5.66505 "澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 " "(S1,1124731 (S,1157857 (NP,2 (NB,1 (LDC_N,1267 'Australia') ) ) (VP,1046077 (MISC_V,1 'is') (NP,1077875 (LITERAL 'one') (LITERAL 'of') (NP,1045537 (NP,1017929 (NP,1 (LITERAL 'the') (NUMNB,2 (LDC_NUM,420 'few') (NB,1 (WIKI_N,62230 'countries') ) ) ) (LITERAL 'that') (VP,1021811 (LITERAL 'has') (FBIS_NP,11916 'diplomatic relations') ) ) (FBIS_PP,84791 'with north korea') ) ) ) ) ) ")
( 10 11 "." -11.9549 "。" "(MISC_PUNC,20 '.')")
Slide40
Example: Syntactic Lexical Phrases
(LDC_N,1267 'Australia')
(WIKI_N,62230 'countries')
(FBIS_NP,11916 'diplomatic relations')
(FBIS_PP,84791 'with north korea')
Slide41
Example: XFER Rules
;;SL::(2,4) 对 台 贸易
;;TL::(3,5) trade to taiwan
;;Score::22
{NP,1045537}
NP::NP [PP NP ] -> [NP PP ]
((*score* 0.916666666666667)
(X2::Y1)
(X1::Y2))
;;SL::(2,7) 直接 提到 伟 哥 的 广告
;;TL::(1,7) commercials that directly mention the name viagra
;;Score::5
{NP,1017929}
NP::NP [VP "的" NP ] -> [NP "that" VP ]
((*score* 0.111111111111111)
(X3::Y1)
(X1::Y3))
;;SL::(4,14) 有 一 至 多 个 高 新 技术 项目 或 产品
;;TL::(3,14) has one or more new , high level technology projects or products
;;Score::4
{VP,1021811}
VP::VP ["有" NP ] -> ["has" NP ]
((*score* 0.1)
(X2::Y2))
Slide42
MT Resource Acquisition in Resource-poor Scenarios
Scenario: Very limited amounts of parallel-text at sentence-level are available
Significant amounts of monolingual text available for one of the two languages (e.g. English, Spanish)
Approach: Manually acquire and/or construct translation lexicons
Transfer rule grammars can be manually developed and/or automatically acquired from an elicitation corpus
Strategy:
Learn transfer rules by syntax projection from major language to minor language
Build MT system to translate from minor language to major language
Slide43
Learning Transfer-Rules for Languages with Limited Resources
Rationale:
Large bilingual corpora not available
Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using elicitation tool
Elicitation corpus designed to be typologically comprehensive and compositional
Transfer-rule engine and new learning approach support acquisition of generalized transfer-rules from the data
Slide44
Elicitation Tool: English-Hindi Example
Slide45
Elicitation Tool: English-Arabic Example
Slide46
Elicitation Tool: Spanish-Mapudungun Example
Slide47
Hebrew-to-English MT Prototype
Initial prototype developed within a two month intensive effort
Accomplished:
Adapted available morphological analyzer
Constructed a preliminary translation lexicon
Translated and aligned Elicitation Corpus
Learned XFER rules
Developed (small) manual XFER grammar
System debugging and development
Evaluated performance on unseen test data using automatic evaluation metrics
Slide48
Challenges for Hebrew MT
Paucity of existing language resources for Hebrew
No publicly available broad coverage morphological analyzer
No publicly available bilingual lexicons or dictionaries
No POS-tagged corpus or parse tree-bank corpus for Hebrew
No large Hebrew/English parallel corpus
Scenario well suited for Stat-XFER framework for languages with limited resources
Slide49
Modern Hebrew Spelling
Two main spelling variants
“KTIV XASER” (deficient): spelling used with the vowel diacritics; words reduce to consonants when the diacritics are removed
“KTIV MALEH” (full): words with I/O/U vowels are written with long vowels which include a letter
KTIV MALEH is predominant, but not strictly adhered to even in newspapers and official publications, which results in inconsistent spelling
Example: niqud (spelling): NIQWD, NQWD, NQD
When written as NQD, could also be niqed, naqed, nuqad
Slide50
Morphological Analyzer
We use a publicly available morphological analyzer distributed by the Technion’s Knowledge Center, adapted for our system
Coverage is reasonable (for nouns, verbs and adjectives)
Produces all analyses or a disambiguated analysis for each word
Output format includes lexeme (base form), POS, morphological features
Output was adapted to our representation needs (POS and feature mappings)
Slide51
Morphology Example
Input word: B$WRH
0 1 2 3 4
|--------B$WRH--------|
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|
Slide52
Morphology Example
Y0: ((SPANSTART 0)
(SPANEND 4)
(LEX B$WRH)
(POS N)
(GEN F)
(NUM S)
(STATUS ABSOLUTE))
Y1: ((SPANSTART 0)
(SPANEND 2)
(LEX B)
(POS PREP))
Y2: ((SPANSTART 1)
(SPANEND 3)
(LEX $WR)
(POS N)
(GEN M)
(NUM S)
(STATUS ABSOLUTE))
Y3: ((SPANSTART 3)
(SPANEND 4)
(LEX $LH)
(POS POSS))
Y4: ((SPANSTART 0)
(SPANEND 1)
(LEX B)
(POS PREP))
Y5: ((SPANSTART 1)
(SPANEND 2)
(LEX H)
(POS DET))
Y6: ((SPANSTART 2)
(SPANEND 4)
(LEX $WRH)
(POS N)
(GEN F)
(NUM S)
(STATUS ABSOLUTE))
Y7: ((SPANSTART 0)
(SPANEND 4)
(LEX B$WRH)
(POS LEX))
Slide53
Translation Lexicon
Constructed our own Hebrew-to-English lexicon, based primarily on existing “Dahan” H-to-E and E-to-H dictionary made available to us, augmented by other public sources
Coverage is not great but not bad as a start
Dahan H-to-E is about 15K translation pairs
Dahan E-to-H is about 7K translation pairs
Base forms, POS information on both sides
Converted Dahan into our representation, added entries for missing closed-class entries (pronouns, prepositions, etc.)
Had to deal with spelling conventions
Recently augmented with ~50K translation pairs extracted from Wikipedia (mostly proper names and named entities)
Slide54
Manual Transfer Grammar (human-developed)
Initially developed by Alon in a couple of days, extended and revised by Nurit over time
Current grammar has 36 rules:
21 NP rules
one PP rule
6 verb complexes and VP rules
8 higher-phrase and sentence-level rules
Captures the most common (mostly local) structural differences between Hebrew and English
Slide55
Transfer Grammar: Example Rules
{NP1,2}
;;SL: $MLH ADWMH
;;TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
(
(X2::Y1)
(X1::Y2)
((X1 def) = -)
((X1 status) =c absolute)
((X1 num) = (X2 num))
((X1 gen) = (X2 gen))
(X0 = X1)
)
{NP1,3}
;;SL: H $MLWT H ADWMWT
;;TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
(X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1)
)
Slide56
Example Translation
Input:
לאחר דיונים רבים החליטה הממשלה לערוך משאל עם בנושא הנסיגה
Gloss: After debates many decided the government to hold referendum in issue the withdrawal
Output:
AFTER MANY DEBATES THE GOVERNMENT DECIDED TO HOLD A REFERENDUM ON THE ISSUE OF THE WITHDRAWAL
Slide57
Noun Phrases – Construct State
החלטת הנשיא הראשון
HXL@T [HNSIA HRA$WN]
decision.3SF-CS the-president.3SM the-first.3SM
THE DECISION OF THE FIRST PRESIDENT
החלטת הנשיא הראשונה
[HXL@T HNSIA] HRA$WNH
decision.3SF-CS the-president.3SM the-first.3SF
THE FIRST DECISION OF THE PRESIDENT
Slide58
Noun Phrases - Possessives
הנשיא הכריז שהמשימה הראשונה שלו תהיה למצוא פתרון לסכסוך באזורנו
HNSIA HKRIZ $HM$IMH HRA$WNH $LW THIH LMCWA PTRWN LSKSWK BAZWRNW
the-president announced that-the-task.3SF the-first.3SF of-him will.3SF to-find solution to-the-conflict in-region-POSS.1P
Without transfer grammar: THE PRESIDENT ANNOUNCED THAT THE TASK THE BEST OF HIM WILL BE TO FIND SOLUTION TO THE CONFLICT IN REGION OUR
With transfer grammar: THE PRESIDENT ANNOUNCED THAT HIS FIRST TASK WILL BE TO FIND A SOLUTION TO THE CONFLICT IN OUR REGION
Slide59
Subject-Verb Inversion
אתמול הודיעה הממשלה שתערכנה בחירות בחודש הבא
ATMWL HWDI&H HMM$LH $T&RKNH BXIRWT BXWD$ HBA
yesterday announced.3SF the-government.3SF that-will-be-held.3PF elections.3PF in-the-month the-next
Without transfer grammar: YESTERDAY ANNOUNCED THE GOVERNMENT THAT WILL RESPECT OF THE FREEDOM OF THE MONTH THE NEXT
With transfer grammar: YESTERDAY THE GOVERNMENT ANNOUNCED THAT ELECTIONS WILL ASSUME IN THE NEXT MONTH
Slide60
Subject-Verb Inversion
לפני כמה שבועות הודיעה הנהלת המלון שהמלון יסגר בסוף השנה
LPNI KMH $BW&WT HWDI&H HNHLT HMLWN $HMLWN ISGR BSWF H$NH
before several weeks announced.3SF management.3SF.CS the-hotel that-the-hotel.3SM will-be-closed.3SM at-end.3SM.CS the-year
Without transfer grammar: IN FRONT OF A FEW WEEKS ANNOUNCED ADMINISTRATION THE HOTEL THAT THE HOTEL WILL CLOSE AT THE END THIS YEAR
With transfer grammar: SEVERAL WEEKS AGO THE MANAGEMENT OF THE HOTEL ANNOUNCED THAT THE HOTEL WILL CLOSE AT THE END OF THE YEAR
Slide61
Evaluation Results
Test set of 62 sentences from Haaretz newspaper, 2 reference translations
System   BLEU    NIST    P       R       METEOR
No Gram  0.0616  3.4109  0.4090  0.4427  0.3298
Learned  0.0774  3.5451  0.4189  0.4488  0.3478
Manual   0.1026  3.7789  0.4334  0.4474  0.3617
Slide62
Open Research Questions
Our large-scale Chinese-English system is still significantly behind phrase-based SMT. Why?
Weaker decoder?
Feature set is not sufficiently discriminant?
Problems with the parsers for the two sides?
Syntactic constituents don’t provide sufficient coverage?
Bugs and deficiencies in the underlying algorithms?
The ISI experience indicates that it may take a couple of years to catch up with and surpass the phrase-based systems
Significant engineering issues to improve speed and efficient runtime processing and improved search
Slide63
Open Research Questions
Immediate Research Issues:
Rule Learning:
Study effects of learning rules from manually vs automatically word aligned data
Study effects of parser accuracy on learned rules
Effective discriminant methods for modeling rule scores
Rule filtering strategies
Syntax-based LMs:
Our translations come out with a syntax-tree attached to them
Add a syntax-based LM feature that can discriminate between good and bad trees
Slide64
Conclusions
Stat-XFER is a promising general MT framework, suitable to a variety of MT scenarios and languages
Provides a complete solution for building end-to-end MT systems from parallel data, akin to phrase-based SMT systems (training, tuning, runtime system)
No open-source publicly available toolkits (yet), but we welcome further collaboration activities
Complex but highly interesting set of open research issues
Prediction: this is the future direction of MT!
Slide65
Questions?