Presentation Transcript

Resource Acquisition for Syntax-based MT from Parsed Parallel Data
Alon Lavie, Alok Parlikar and Vamshi Ambati
Language Technologies Institute
Carnegie Mellon University

Research Goals

Long-term research agenda (since 2000) focused on developing a unified framework for MT that addresses the core weaknesses of previous approaches:
- Representation: explore richer formalisms that can capture complex divergences between languages
- Ability to handle morphologically complex languages
- Methods for automatically acquiring MT resources from available data and combining them with manual resources
- Ability to address both rich and poor resource scenarios

Main research funding sources: NSF (AVENUE and LETRAS projects) and DARPA (GALE)

CMU Statistical Transfer (Stat-XFER) MT Approach

Integrate the major strengths of rule-based and statistical MT within a common framework:
- Linguistically rich formalism that can express complex and abstract compositional transfer rules
- Rules can be written by human experts and also acquired automatically from data
- Easy integration of morphological analyzers and generators
- Word and syntactic-phrase correspondences can be automatically acquired from parallel text
- Search-based decoding from statistical MT adapted to find the best translation within the search space: multi-feature scoring, beam search, parameter optimization, etc.
- Framework suitable for both resource-rich and resource-poor language scenarios

Stat-XFER MT Systems

General Stat-XFER framework under development for the past seven years.

Systems so far:
- Chinese-to-English
- Hebrew-to-English
- Urdu-to-English
- German-to-English
- French-to-English
- Hindi-to-English
- Dutch-to-English
- Mapudungun-to-Spanish

In progress or planned:
- Arabic-to-English
- Brazilian Portuguese-to-English
- Inupiaq-to-English
- Hebrew-to-Arabic
- Quechua-to-Spanish
- Turkish-to-English

Stat-XFER Framework

Pipeline: Source Input -> Preprocessing -> Morphology -> Transfer Engine (applying Transfer Rules and a Bilingual Lexicon) -> Translation Lattice -> Second-Stage Decoder (using a Language Model and Weighted Features) -> Target Output

Transfer Engine

Source Input (after preprocessing and morphology): בשורה הבאה

Transfer Rules:
{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1) (X1::Y2)
((X1 def) = +) ((X1 status) =c absolute)
((X1 num) = (X3 num)) ((X1 gen) = (X3 gen))
(X0 = X1))

Translation Lexicon:
N::N |: ["$WR"] -> ["BULL"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
N::N |: ["$WRH"] -> ["LINE"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "LINE"))

Translation Output Lattice:
(0 1 "IN" @PREP)
(1 1 "THE" @DET)
(2 2 "LINE" @N)
(1 2 "THE LINE" @NP)
(0 2 "IN LINE" @PP)
(0 4 "IN THE NEXT LINE" @PP)

Decoder (Language Model + Additional Features) selects the English output: in the next line

Transfer Rule Formalism

- Type information
- Part-of-speech/constituent information
- Alignments
- x-side constraints
- y-side constraints
- xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

; SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)

MT Resource Acquisition in Resource-rich Scenarios

Scenario: significant amounts of sentence-level parallel text are available. The parallel sentences can be word-aligned and parsed (at least on one side, ideally on both sides).

Goal: acquire both broad-coverage translation lexicons and transfer-rule grammars automatically from the data.

Syntax-based translation lexicons:
- Broad-coverage constituent-level translation equivalents at all levels of syntactic granularity
- Can serve as the elementary building blocks for transfer trees constructed at runtime using the transfer rules

Acquisition Process

Automatic process for extracting syntax-driven rules and lexicons from sentence-parallel data:
1. Word-align the parallel corpus (GIZA++)
2. Parse the sentences independently for both languages
3. Run our new PFA Constituent Aligner over the parsed sentence pairs
4. Extract all aligned constituents from the parallel trees
5. Extract all derived synchronous transfer rules from the constituent-aligned parallel trees
6. Construct a database of all extracted parallel constituents and synchronous rules with their frequencies, and model them statistically (assign them maximum-likelihood probabilities)
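Step 6 amounts to simple count normalization. A minimal sketch in Python, assuming extracted rules are collected as (source-side, rule) pairs, one per occurrence; the function name and data layout are illustrative, not the actual Stat-XFER implementation:

from collections import Counter

def score_rules(rule_instances):
    """rule_instances: one (source_side, rule) pair per occurrence in
    the constituent-aligned corpus. Returns maximum-likelihood
    probabilities P(rule | source_side) as relative frequencies."""
    rule_counts = Counter(rule_instances)
    source_counts = Counter(src for src, _ in rule_instances)
    return {(src, rule): n / source_counts[src]
            for (src, rule), n in rule_counts.items()}

# A rule seen 3 times against 4 total occurrences of its source side
# receives probability 0.75.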

PFA Constituent Node Aligner

Input: a bilingual pair of parsed and word-aligned sentences.

Goal: find all sub-sentential constituent alignments between the two trees which are translation equivalents of each other.

Equivalence constraint: a pair of constituents <S,T> are considered translation equivalents if:
- All words in the yield of <S> are aligned only to words in the yield of <T> (and vice versa)
- If <S> has a sub-constituent <S1> that is aligned to <T1>, then <T1> must be a sub-constituent of <T> (and vice versa)

The algorithm is a bottom-up process starting from the word level, marking nodes that satisfy the constraints.
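The first condition can be sketched directly on word spans. A minimal illustration, assuming each constituent's yield is a set of word positions and the word alignment is a set of (source index, target index) links; the sub-constituent condition is then enforced by applying this test bottom-up over tree node pairs. All names are illustrative:

def yields_are_equivalent(src_yield, tgt_yield, alignment):
    """src_yield, tgt_yield: sets of word indices for <S> and <T>;
    alignment: set of (source_index, target_index) links.
    Every link must fall inside both yields or outside both, i.e.
    no word in <S> is aligned outside <T> and vice versa."""
    for s, t in alignment:
        if (s in src_yield) != (t in tgt_yield):
            return False
    return True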

PFA Node Alignment Algorithm

Each node stores a value, and all nodes are initialized with the value 1. Each word-to-word alignment is assigned a unique prime number.

PFA Node Alignment Algorithm

For every word-to-word alignment, we do the following:
- Let p be the unique prime value assigned to the alignment
- Let ws and wt be the aligned words on the source and target side
- Assign the value p to the POS nodes corresponding to the words ws and wt

Example: "Australia" gets value 2, "is" gets value 3.

PFA Node Alignment Algorithm

One-to-many alignments are treated as multiple one-to-one alignments, and all of these alignments are given the same prime value.

Example: "North Korea" is a single word on the Chinese side. That word is assigned the value 25, which is the product 5*5.
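A sketch of this initialization step, assuming the links of each alignment are grouped (so a one-to-many alignment is one group that shares a prime) and leaf values live in dicts keyed by word position; the data structures are assumptions for illustration:

def primes():
    """Generate 2, 3, 5, 7, ... by trial division (fine at this scale)."""
    found = []
    n = 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1

def assign_primes(alignment_groups, src_values, tgt_values):
    """alignment_groups: one list of (src_pos, tgt_pos) links per
    alignment; a one-to-many alignment is a single group whose links
    all share the same prime. src_values/tgt_values: dicts mapping
    word positions to POS-node values, all initialized to 1."""
    gen = primes()
    for group in alignment_groups:
        p = next(gen)
        for s, t in group:
            src_values[s] *= p
            tgt_values[t] *= p

# e.g. 澳洲/"Australia" gets 2; 北韩, aligned to both "North" and
# "Korea" with prime 5, accumulates 5*5 = 25.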

PFA Node Alignment Algorithm

Once all the lexical items have values, we propagate the values up the tree as follows:
- Work bottom-up
- A node updates its value to the product of the values of its children

Note that the values can become very large.
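The propagation is a short recursion. A minimal sketch, assuming each node has .children and a .value field (leaves already hold their prime products from the previous step); Python's arbitrary-precision integers absorb the size concern without extra work:

def propagate(node):
    """Bottom-up pass: an inner node's value becomes the product of
    its children's values; leaves keep their assigned primes."""
    if node.children:
        value = 1
        for child in node.children:
            value *= propagate(child)
        node.value = value
    return node.value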

PFA Node Alignment Algorithm

Once all nodes have values, they can be aligned as follows:
- If a node on the Chinese side has the same value as a node on the English side, align them
- If two nodes on the same side have equal values, take the node at the lowest level in the tree
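A sketch of the matching step under the same assumed node representation, with a .depth field counted from the root (larger depth = lower in the tree) to implement the tie-break:

def align_nodes(src_nodes, tgt_nodes):
    """Pair up nodes whose values match across the two trees."""
    def lowest_per_value(nodes):
        best = {}
        for n in nodes:
            if n.value > 1:  # value 1 means no aligned words in the yield
                cur = best.get(n.value)
                if cur is None or n.depth > cur.depth:
                    best[n.value] = n  # keep the lowest node per value
        return best
    src_best = lowest_per_value(src_nodes)
    tgt_best = lowest_per_value(tgt_nodes)
    return [(src_best[v], tgt_best[v]) for v in src_best if v in tgt_best]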

PFA Node Alignment Algorithm

Features of the algorithm:
- Aligned constituents can have different labels
- The order of the sub-constituents does not matter in node alignment
- Unaligned words in constituents are allowed, but we are conservative (attach low)

PFA Node Alignment Algorithm

Extraction of phrases: get the yields of the aligned nodes and add them to a phrase table tagged with syntactic categories on the source and target sides.

Example: NP # NP :: 澳洲 # Australia
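A sketch of this extraction step, assuming aligned node pairs carry a .label and a .yield_words() accessor (both illustrative names):

def extract_phrases(aligned_pairs):
    """Turn each aligned node pair into a tagged phrase-table entry,
    e.g. ('NP', 'NP', '澳洲', 'Australia')."""
    return [(s.label, t.label,
             " ".join(s.yield_words()), " ".join(t.yield_words()))
            for s, t in aligned_pairs]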

PFA Node Alignment Algorithm

All phrases from this tree:
IP # S :: 澳洲 是 与 北韩 有 邦交 的 少数 国家 之一 。 # Australia is one of the few countries that have diplomatic relations with North Korea .
VP # VP :: 是 与 北韩 有 邦交 的 少数 国家 之一 # is one of the few countries that have diplomatic relations with North Korea
NP # NP :: 与 北韩 有 邦交 的 少数 国家 之一 # one of the few countries that have diplomatic relations with North Korea
VP # VP :: 与 北韩 有 邦交 # have diplomatic relations with North Korea
NP # NP :: 邦交 # diplomatic relations
NP # NP :: 北韩 # North Korea
NP # NP :: 澳洲 # Australia

PFA Constituent Node Alignment Performance

Comparison with manually-aligned constituent nodes:
- Selected 30 sentences from a Chinese-English parallel treebank
- A bilingual expert manually aligned the nodes in the trees

Precision  Recall  F-1     F-0.5
0.8129     0.7325  0.7705  0.7841

Main sources of disagreement:
- 1-to-many and many-to-many word alignments
- Errors or inconsistencies in the manual word alignments

PFA Constituent Node Alignment Performance

Evaluation data: Chinese-English treebank
- Parallel Chinese-English treebank with manual word alignments
- 3342 sentence pairs
- Created "Gold Standard" constituent alignments using the manual word alignments and treebank trees
- Node alignments: 39874 (about 12 per tree pair)
- NP-to-NP alignments: 5427

Evaluation: run the PFA Aligner with automatic word alignments on the same data and compare with the "Gold Standard" alignments.

PFA Constituent Node Alignment Performance

Viterbi word alignments from the Chinese-to-English and reverse directions were merged using different algorithms; we tested the performance of node alignment with each resulting alignment.

Viterbi Combination       Precision  Recall  F-1
Intersection              0.6382     0.5395  0.5847
Union                     0.8114     0.2915  0.4289
Sym-1 (Thot Toolkit)      0.7142     0.4534  0.5547
Sym-2 (Thot Toolkit)      0.7135     0.4631  0.5617
Grow-Diag-Final           0.7777     0.3462  0.4791
Grow-Diag-Final-and       0.6988     0.4700  0.5620

Transfer Rule Learning

Input: constituent-aligned parallel trees.

Idea: aligned nodes act as possible decomposition points of the parallel trees:
- The sub-trees of any aligned pair of nodes can be broken apart at lower-level aligned nodes, creating an inventory of "tree-fragment" correspondences
- Synchronous "tree-frags" can be converted into synchronous rules
- Similar in nature to [Galley et al., 2004, 2006]

Algorithm:
1. Find all possible minimal tree-fragment decompositions from the node-aligned trees
2. "Flatten" the tree fragments into Stat-XFER style synchronous CFG rules

Rule Extraction Algorithm

Tree-fragment extraction: extract sub-tree segments, including the synchronous alignment information in the target tree. All the sub-trees and the super-tree are extracted.
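A minimal sketch of this extraction, reusing the node representation assumed in the earlier sketches plus a hypothetical is_aligned predicate: starting from an aligned node, each downward path is cut at the first aligned node below it, and unaligned lexical leaves join the fragment frontier.

def extract_fragment(root, is_aligned):
    """Return the frontier of the minimal fragment rooted at an
    aligned node: the highest aligned nodes strictly below `root`,
    plus any lexical leaves reached before an aligned node."""
    frontier = []
    def walk(n):
        if n is not root and (is_aligned(n) or not n.children):
            frontier.append(n)        # cut here: frontier element
        else:
            for child in n.children:
                walk(child)
    walk(root)
    return frontier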

Rule Extraction Algorithm

Flat rule creation: each of the tree-fragment pairs is flattened to create a rule in the Stat-XFER formalism. There are four major parts to the rule:
1. Type of the rule: source- and target-side type information
2. Constituent sequence of the synchronous flat rule
3. Alignment information for the constituents
4. Constraints in the rule (currently not extracted)

Rule Extraction Algorithm

Flat rule creation, sample rule:
IP::S [NP VP .] -> [NP VP .]
(
;; Alignments
(X1::Y1)
(X2::Y2)
;; Constraints
)

Rule Extraction Algorithm

Flat rule creation, sample rule:
NP::NP [VP 的 CD 国家 之一] -> [one of the CD countries that VP]
(
;; Alignments
(X1::Y7)
(X3::Y4)
)

Note: any one-to-one aligned words are elevated to part-of-speech level in the flat rule; any non-aligned words on either the source or target side remain lexicalized.
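A sketch of the flattening itself, tying the earlier pieces together: aligned frontier nodes surface as POS/constituent nonterminals, unaligned words stay lexicalized, and the X/Y indices record the alignment part of the rule. The rule-string layout mimics the samples above; all function and field names are assumptions, not the actual Stat-XFER code.

def flatten(src_root, tgt_root, src_frontier, tgt_frontier, node_pairs):
    """node_pairs: dict mapping each aligned source frontier node to
    its aligned target frontier node (from the PFA node alignment)."""
    aligned_targets = set(node_pairs.values())

    def symbol(node):
        # Aligned nodes become nonterminals (their constituent or POS
        # label); unaligned words remain as quoted lexical items.
        if node in node_pairs or node in aligned_targets:
            return node.label
        return '"%s"' % node.word

    lhs = " ".join(symbol(n) for n in src_frontier)
    rhs = " ".join(symbol(n) for n in tgt_frontier)
    links = " ".join("(X%d::Y%d)" % (src_frontier.index(s) + 1,
                                     tgt_frontier.index(t) + 1)
                     for s, t in node_pairs.items())
    return "%s::%s [%s] -> [%s] ( %s )" % (
        src_root.label, tgt_root.label, lhs, rhs, links)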

Rule Extraction Algorithm

All rules extracted:
VP::VP [VC NP] -> [VBZ NP]
( ;; Alignments
(X1::Y1) (X2::Y2) )

NP::NP [NR] -> [NNP]
( ;; Alignments
(X1::Y1) )

VP::VP [与 NP VE NP] -> [VBP NP with NP]
( ;; Alignments
(X2::Y4) (X3::Y1) (X4::Y2) )

NP::NP [VP 的 CD 国家 之一] -> [one of the CD countries that VP]
( ;; Alignments
(X1::Y7) (X3::Y4) )

IP::S [NP VP] -> [NP VP]
( ;; Alignments
(X1::Y1) (X2::Y2) )

NP::NP ["北韩"] -> ["North" "Korea"]
( ;; Many-to-one alignment becomes a phrase )

Chinese-English Rule Learning

Transfer rules:
- 61 manually developed transfer rules
- High-accuracy rules extracted from manually word-aligned parallel data

Translation Example

SrcSent 3: 澳洲是与北韩有邦交的少数国家之一。
Gloss: Australia is with north korea have diplomatic relations DE few country world
Reference: Australia is one of the few countries that have diplomatic relations with North Korea.
Translation: Australia is one of the few countries that has diplomatic relations with north korea .

Overall: -5.77439, Prob: -2.58631, Rules: -0.66874, TransSGT: -2.58646, TransTGS: -1.52858, Frag: -0.0413927, Length: -0.127525, Words: 11,15

(0 10 "Australia is one of the few countries that has diplomatic relations with north korea" -5.66505 "澳洲 是 与 北韩 有 邦交 的 少数 国家 之一"
"(S1,1124731 (S,1157857 (NP,2 (NB,1 (LDC_N,1267 'Australia'))) (VP,1046077 (MISC_V,1 'is') (NP,1077875 (LITERAL 'one') (LITERAL 'of') (NP,1045537 (NP,1017929 (NP,1 (LITERAL 'the') (NUMNB,2 (LDC_NUM,420 'few') (NB,1 (WIKI_N,62230 'countries')))) (LITERAL 'that') (VP,1021811 (LITERAL 'has') (FBIS_NP,11916 'diplomatic relations'))) (FBIS_PP,84791 'with north korea')))))")
(10 11 "." -11.9549 "。" "(MISC_PUNC,20 '.')")

Example: XFER Rules

;;SL::(2,4) 对 台 贸易
;;TL::(3,5) trade to taiwan
;;Score::22
{NP,1045537}
NP::NP [PP NP] -> [NP PP]
((*score* 0.916666666666667) (X2::Y1) (X1::Y2))

;;SL::(2,7) 直接 提到 伟 哥 的 广告
;;TL::(1,7) commercials that directly mention the name viagra
;;Score::5
{NP,1017929}
NP::NP [VP "的" NP] -> [NP "that" VP]
((*score* 0.111111111111111) (X3::Y1) (X1::Y3))

;;SL::(4,14) 有 一 至 多 个 高 新 技术 项目 或 产品
;;TL::(3,14) has one or more new , high level technology projects or products
;;Score::4
{VP,1021811}
VP::VP ["有" NP] -> ["has" NP]
((*score* 0.1) (X2::Y2))

Current and Future Work

- Extraction based on trees on both sides, or trees on one side (with projection)? Trees on both sides provide accurate constituent boundaries, but divergent parser representations result in large coverage gaps. A compromise: trees on one side plus low-level constituents (chunks) on the other side.
- Exploring the space of extracted rules:
  - Binarize the rules or not?
  - Collapse constituent categories (or refine some of them)?
  - Rule filtering strategies (keep only count > 1?)
  - Rule scoring strategies (currently only maximum-likelihood scores)
- Refining word-alignment errors
- Merging of resources acquired from data with manual lexicons and transfer rules

Conclusions

- Stat-XFER is a promising general MT framework, suitable for a variety of MT scenarios and languages
- Provides a complete solution for building end-to-end MT systems from parallel data, akin to phrase-based SMT systems (training, tuning, runtime system)
- Syntactic resources acquired from parallel corpora may be useful for other types of MT systems (e.g., high-quality phrase tables)
- Complex but highly interesting set of open research issues