Slide1
The Linguistic-Core Approach to Structured Translation and Analysis of Low-Resource Languages
November 2014
Project Review at ARL
Jaime
Carbonell
(CMU) & Team
MURI via ARO (PM: Joseph Myers)
Slide2
The Faculty
CMU: Jaime Carbonell, Noah Smith, Lori Levin, Chris Dyer
USC-ISI: Kevin Knight, David Chiang (Notre Dame)
MIT: Regina Barzilay
UT Austin: Jason Baldridge
Supporting roles: 2 other PhDs, 8 Grad Students, 3 Postdocs, N UGs
Slide3
LCMT: The Elevator Pitch
The fundamental challenge:
- "Modern" MT requires massive parallel data
- There are 7,000+ languages with scant parallel data
- Rule-based MT requires extensive trained-linguist effort
The linguistic-core approach:
- Goal: 90% of the linguistic benefit with 10% of the linguist effort
- Annotation deep and light; linguistics from "lay" bilinguals
- Augmented with machine learning from bilingual & monolingual text
Accomplishments to date:
- Theory: GFL, graph semantics, AMR & other parsers, sparse ML training, linguistically-anchored models, ... 40+ papers
- Tool suites: GFL, TurboParser, MT-in-works, Morph, SuperTag, ...
- Languages: Kinyarwanda, Malagasy, Swahili, Yoruba
Slide4
The Setting
MURI Languages:
- Kinyarwanda: Bantu (7.5M speakers)
- Malagasy: Malayo-Polynesian (14.5M)
- Swahili: Bantu (5M native, 150M 2nd/3rd language)
- Yoruba: Niger-Congo (22+M)
Morpho-syntactics example (Swahili): Anamwona = "he is seeing him/her"
Slide5
Which MT Paradigms are Best? Towards Filling the Table
(S = source-language data size, T = target-language data size)

         Large T   Med T   Small T
Large S  SMT       LCMT    LCMT
Med S    LCMT      ???     ???
Small S  LCMT      ???     ???

"Old" DARPA MT: Large S, Large T (Arabic -> English; Chinese -> English)
Slide6
Evolutionary Tree of MT Paradigms, Leading up to LCMT (1950-2014)
Paradigms on the tree: Decoding MT, Transfer MT, Interlingua MT, Analogy MT, Example-based MT, Large-scale TMT, Context-Based MT, Statistical MT, Phrasal SMT, Transfer MT w/ stat phrases, SMT with syntax, LCMT
Slide7
Linguistically omnivorous parsing
Built from linguistic universals, a GFL annotated corpus, unannotated corpora, and a small CCG lexicon (CMU, Texas, MIT).
Outputs:
- Dependencies, e.g., for "He has been writing a letter."
- Abstract Meaning Representations (ISI, CMU), e.g.:
(j / join-01
  :ARG0 (p / person
    :name (p2 / name :op1 "Pierre" :op2 "Vinken")
    :age (t / temporal-quantity :quant 61 :unit (y / year)))
  :ARG1 (b / board)
  :prep-as (d2 / director :mod (e / executive :polarity -))
  :time (d / date-entity :month 11 :day 29))
Slide8
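The AMR examples above are written in PENMAN notation: nested `(variable / concept :role value ...)` terms. As a minimal illustrative sketch (not the project's actual AMR tooling), such a string can be read into a nested Python structure like this:

```python
import re

def parse_amr(s):
    """Parse a PENMAN-notation AMR string into a nested
    (variable, concept, [(role, value), ...]) structure.
    A minimal illustrative reader, not the project's parser."""
    tokens = re.findall(r'\(|\)|"[^"]*"|[^\s()]+', s)
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == '('; pos += 1
        var = tokens[pos]; pos += 1
        assert tokens[pos] == '/'; pos += 1
        concept = tokens[pos]; pos += 1
        relations = []
        while tokens[pos] != ')':
            role = tokens[pos]; pos += 1          # e.g. :ARG0, :quant
            if tokens[pos] == '(':
                value = parse_node()              # nested AMR node
            else:
                value = tokens[pos]; pos += 1     # constant, string, or variable
            relations.append((role, value))
        pos += 1  # consume ')'
        return (var, concept, relations)

    return parse_node()

amr = '(b / border :quant (d / distance-quantity :unit (k / kilometer) :quant 1200))'
node = parse_amr(amr)
print(node[1])  # border
```

A real AMR reader also resolves re-entrant variables (the same variable used twice makes the structure a graph, not a tree); this sketch keeps only the tree skeleton.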
Original Vision
Teams: Linguistic Core Team (LL, JB, SV, JC); Linguistic Analyzers Team (NS, RB, JB); MT Systems Team (KK, DC, SV, JC)
Artifacts exchanged: hand-built linguistic core; triple gold data; triple ungold data; parsers, taggers, morphological analyzers; MT features; MT visualizations and logs; MT error analysis; MT systems; inference algorithms
Data: parallel, monolingual, elicited, related-language, multi-parallel, comparable; elicitation corpus; data selection for annotation
Slide9
Current Vision
Teams: Linguistic Core Team (LL, JB, CD, JC); Linguistic Analyzers Team (NS, RB, JB); MT Systems Team (KK, DC, CD, JC)
Artifacts exchanged: hand-built linguistic core; triple gold and GFL annotated data; string/tree/graph transducers; complex morphological analyzers; parsers, taggers, semantic analyzers; dependency parses; MT/TA error analysis; MT systems and TA modules; semantic parsing algorithms; definiteness/discourse
Data: parallel, monolingual, elicited, related-language, multi-parallel, comparable; elicitation corpus; data selection for annotation
Slide10
PFA Node Alignment Algorithm Example
The tree-tree aligner enforces equivalence constraints and optimizes over terminal alignment scores (words/phrases).
Resulting aligned nodes are highlighted in the figure.
Transfer rules are partially lexicalized and read off the tree.
Slide11
LCMT: NLP Workflow and Tools
From annotated and unannotated data to toolsuite software (more to come):
- Supervised POS Taggers
- Semisupervised POS Taggers
- Unsupervised Dependency Parsers
- GFL annotator Framework (current)
- Supervised Dependency & AMR Parsers
- Semisupervised Dependency Parsers
- Tree-Graph Syn/Sem transducers
Sites contributing: CMU, Texas, MIT, ISI (individually or jointly, e.g., CMU + Texas)
Slide12
Machine Translation Paradigms
- Phrase-based MT (LCMT 20+% of effort): source string -> target string
- Morph-Syntax-based MT (LCMT 30+%): source string -> source tree -> target tree -> target string
- Meaning-based MT (LCMT 40+%): source string -> meaning representation -> target string
(NIST 2009 c2e)
Slide13
Some Key Results to Date
- Theory of transducers (string, tree, graph)
- Massive lexical borrowing across diverse languages
- Linguistic universals: dependencies, semantic roles, conservation, AMR, discourse, ...
- Statistical learning over strings, trees, graphs: Bayesian, HMM/CRF, active sampling of model parameters
- Parsing into deep semantics (AMR)
- MT demonstrations: focus on M, K, S, Y, but also across ~20 languages (WMT honors, synthetic phrases)
- A suite of 11 serious software modules and tools (morphology, variable-depth linguistic annotation, dependency parsing, MT, ...)
Current scientific challenges:
- Is general graph topology induction possible?
- Bridging structural divergences via semi-universals?
- Semantic invariance: lexical, structural, non-propositional?
Slide14
List of "Firsts" for the Linguistic Core
- First use of models incorporating linguistic knowledge in the form of hand-written morpho-grammatical rules combined with limited-volume corpus statistics.
- First use of models of lexical "borrowing" from other (major) languages to improve translation and analysis of low-resource languages (publication in prep).
- First efficient and exact probabilistic model for structured prediction with arbitrary syntactic and semantic dependencies derived from the input language.
- First exploitation of large monolingual foreign text collections (vs. bilingually translated collections) to improve low-density MT, by treating foreign text as a mapped/encoded version of English.
- First application of formal graph transduction theory to natural language analysis; earlier efforts applied only string and tree transduction theory.
- First substantial corpora annotated cheaply by novices used to build effective NLP tools.
- First statistical parser to map language into an abstract meaning representation of semantics.
- First to show that, for resource-impoverished languages, a multilingual parser based on language universals outperforms a target-language parser.
- First analyses to prove formally and demonstrate empirically that inference in dependency parsing is computationally easy in the average case (despite being NP-hard in the worst case).
Slide15
External Honors for the LC Project
- Best human judgments of English-Russian translations at WMT 2013
- Best BLEU on Hindi-English translation at WMT 2014
- Best student paper, ACL 2014: "Low-Rank Tensors for Scoring Dependency Structures." Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay and Tommi Jaakkola. http://people.csail.mit.edu/taolei/papers/acl2014.pdf
- Best paper, honorable mention, ACL 2014: "A Discriminative Graph-Based Parser for the Abstract Meaning Representation." Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer and Noah A. Smith. http://www.cs.cmu.edu/~jmflanig/flanigan+etal.acl2014.pdf
- Best paper, runner-up, EMNLP 2014: "Language Modeling with Power Low Rank Ensembles." http://www.aclweb.org/anthology/D/D14/D14-1158.pdf
- Best paper (one of four), NIPS 2014: "Conditional Random Field Autoencoders for Unsupervised Structured Prediction." Waleed Ammar, Chris Dyer, and Noah A. Smith
Slide16
Lexical Borrowing of Common Words
Slide17
Swahili morphology using a crowdsourced lexicon
Patrick Littell, Lori Levin, Chris Dyer
- No provenance: the root of the word was collected by hand.
- [GUESS1]: the root is inferred from the Kamusi lexicon part-of-speech tag, including noun class.
- [GUESS2]: the root is from Kamusi, but no noun class is given.
- [GUESS3]: possible English loan word.
- [GUESS4]: complete guess.
FST written by Patrick Littell; lexicon extracted from dictionaries and textbooks.
Slide18
Parsing Progress (F1)
On CoNLL dataset: 88.72 (CMU), 89.44 (MIT)
Slide19
CCG Supertagging
Example lattice for "the lazy dogs wander": candidate CCG categories per word, e.g. the -> np/n, lazy -> n/n, dogs -> n, np, wander -> (s\np)/np, s\np, ...
Tagging with an HMM plus linguistically-motivated priors.
Slide20
Parsing into AMR
(ACL 2014 honorable mention for best paper)
"Approximately 11000 guards patrol the 1200-kilometre border between Russia and Afghanistan."
(p / patrol-01
  :ARG0 (g / guard
    :quant (a2 / approximately :op1 11000))
  :ARG1 (b / border
    :quant (d4 / distance-quantity :unit (k2 / kilometer) :quant 1200)
    :location (b2 / between
      :op1 (c / country :name (n / name :op1 "Russia"))
      :op2 (c2 / country :name (n2 / name :op1 "Afghanistan")))))
Pipeline: first identify the concepts (patrol-01, guard, approximately, border, distance-quantity, between, the country names), then add relations.
New Results: 61% F1
CMU and ISI
Slide21
Unsupervised Part-of-Speech Tagging
Chart: V-measure (higher is better) on Arabic, Basque, Danish, Greek, Hungarian, Italian, Kinyarwanda, Malagasy, Turkish, and the average, comparing the conditional random field autoencoder against the classic hidden Markov model and the featurized hidden Markov model.
Slide22
Automatic Classification of the Communicative Functions of Definiteness
Pipeline: annotated corpus (semantics of definiteness) + syntactic features extracted from a dependency parser -> logistic regression classifier -> predicted semantic functions of definiteness: 78.2% accuracy.
Why definiteness?
- One instance of non-propositional semantics
- Major determinant of word order
- Wildly divergent in morpho-syntactic expression
- Problems in word alignment and language models
Slide23
Integrating Alignment and Decipherment for Better Low-Density MT
- Small bilingual Malagasy/English text (need to align words [Brown et al. 93])
- Large Malagasy monolingual text (need to decipher [Dou & Knight 13])
Decipherment helps word alignment; decipherment helps machine translation (joint BLEU gains).
ISI jointly with CMU/Texas/MIT
Slide24
Graph Formalisms for
Language Understanding and Generation
String, tree, and graph automata algorithms:
- N-best answer extraction: paths through a WFSA (Viterbi, 1967; Eppstein, 1998); trees in a weighted forest (Jiménez & Marzal, 2000; Huang & Chiang, 2005)
- Unsupervised EM training: forward-backward EM (Baum/Welch, 1971; Eisner, 2003); tree transducer EM training (Graehl & Knight, 2004)
- Determinization, minimization: of weighted string acceptors (Mohri, 1997); of weighted tree acceptors (Borchardt & Vogler, 2003; May & Knight, 2005)
- Intersection: WFSA intersection; tree acceptor intersection
- Application of transducers: string -> WFST -> WFSA; tree -> TT -> weighted tree acceptor
- Composition of transducers: WFST composition (Pereira & Riley, 1996); many tree transducers not closed under composition (Maletti et al., 2009)
- Software tools: Carmel, OpenFST; Tiburon (May & Knight, 2010)
Investigating: linguistically adequate representations; efficient algorithms.
Using them in: Text -> Meaning (NLU); Meaning -> Text (NLG); meaning-based MT.
ISI jointly with CMU
Slide25
Tajik named-entity recognition pipeline: components and contributors
- NE Gold Standard, native speaker (Azim); NE Pyrite Standard, linguist (Alexa, Lori)
- Morphology lists for NE (Alexa, David); morphological analyzers (Swabha, Chris)
- Gazetteers (Pat, Chris); Brown clusters (Kartik); IPA converter (Kartik, Pat, Chris)
- Resources: Tajik corpus from the Leipzig archive; Tajik and Persian Wikipedias; Tajik Reference Grammar (Perry, 2005); PerLex Persian Lexicon (Sagot and Walther, 2010); Persian Treebank (Rasooli et al., 2013)
- Tajik POS tagger (Chris); Persian-Tajik converter (Chris); supervision (Azim, David)
- Named Entity Recognizer (Kartik, Chris)
Slide26
New Results for Graph Automata for Mapping Between Text and Meaning

                              FSA        CFG        DAG acceptor  HRG
                              (strings)  (strings)  (graphs)      (graphs)
probabilistic                 yes        yes        yes           yes
intersects with finite-state  yes        yes        yes           yes
EM training                   yes        yes        yes           yes
transduction                  O(n)       O(n^3)     O(|Q|^(T+1) n)  O((3dn)^(T+1))
implemented                   yes        yes        yes           yes

d = graph degree for AMR, high in practice
T = treewidth complexity for AMR, low in practice (2-3)
Slide27
Next Steps (high level overview)
- Finalize MT systems: K, M, S, Y
- Package and make available externally
- Possibly integrate with government translator workbench
- Compare with Govt systems when available and appropriate (e.g., Malagasy with Carl Rubino)
- Complete scientific investigations (graph transduction, MT with AMR, supertagging parsing, borrowing++, ...)
- Document and distribute tool suites (rapid annotation, morphology, CCG supertagging, dependency parsing, AMR parsing, generation, end-to-end MT, lexicon borrowing, ML modules, ...): 15 +/-
- Publish, publish, publish (40+ papers and counting)
Detailed next steps at the end of each major presentation.
Slide28
Jaime Carbonell, CMU
THANK YOU!
Slide29
Supplementary Slides
Select/show as needed for discussion period
Slide30
Tag Dictionary Generalization
Raw corpus example: "the dog walks ... the thug ..." with tags the/DT dog/NN walks/VBZ, the/DT thug/NN.
Token annotations: TOK_the_1, TOK_dog_2, TOK_the_4, TOK_thug_5, NEXT_walks, PREV_<b>, PREV_the
Type annotations: PRE1_t, PRE2_th, SUF1_g, TYPE_the, TYPE_thug, TYPE_dog
Any arbitrary features could be added.
Slide31
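The feature templates on this slide (token identity and position, neighboring words, character prefixes/suffixes, word type) can be sketched as a small extractor. This is an illustrative reconstruction of the slide's naming scheme, not the project's actual feature code; the 0-based position index is an assumption.

```python
def token_features(tokens, i):
    """Feature templates in the spirit of the slide: token identity and
    position, neighboring words, character prefixes/suffixes, and word
    type. Illustrative only; indexing convention is assumed."""
    w = tokens[i]
    feats = [f"TOK_{w}_{i}", f"TYPE_{w}"]
    # Neighbor features; '<b>' marks a sentence boundary as on the slide.
    feats.append(f"PREV_{tokens[i-1] if i > 0 else '<b>'}")
    if i + 1 < len(tokens):
        feats.append(f"NEXT_{tokens[i+1]}")
    # Character-level prefix/suffix features.
    feats += [f"PRE1_{w[:1]}", f"PRE2_{w[:2]}", f"SUF1_{w[-1:]}"]
    return feats

print(token_features(["the", "thug", "walks"], 1))
# ['TOK_thug_1', 'TYPE_thug', 'PREV_the', 'NEXT_walks', 'PRE1_t', 'PRE2_th', 'SUF1_g']
```

Because features are plain strings, "any arbitrary features could be added" just means appending more template outputs to the list.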
Transfer rules example (English -> Chinese)
"These 7 people include astronauts coming from France and Russia"
-> 这 7人 中包括 来自 法国 和 俄罗斯 的 宇航 员 .
Lexical rules:
RULE 1: DT(these) -> 这
RULE 2: VBP(include) -> 中包括
RULE 4: NNP(France) -> 法国
RULE 5: CC(and) -> 和
RULE 6: NNP(Russia) -> 俄罗斯
RULE 8: NP(NNS(astronauts)) -> 宇航 , 员
RULE 9: PUNC(.) -> .
Structural rules:
RULE 10: NP(x0:DT, CD(7), NNS(people)) -> x0 , 7人
RULE 11: VP(VBG(coming), PP(IN(from), x0:NP)) -> 来自 , x0
RULE 13: NP(x0:NNP, x1:CC, x2:NNP) -> x0 , x1 , x2
RULE 14: VP(x0:VBP, x1:NP) -> x0 , x1
RULE 15: S(x0:NP, x1:VP, x2:PUNC) -> x0 , x1 , x2
RULE 16: NP(x0:NP, x1:VP) -> x1 , 的 , x0
Covered spans: "these", "Russia", "France", "&", "astronauts", "include", ".", "France and Russia", "coming from France and Russia", "astronauts coming from France and Russia", "these 7 people", "include astronauts coming from France and Russia"
Slide32
Model Minimization
Figure: a weighted FSA over "<b> The man saw the saw <b>" with states 0-6, tags DT, NN, VBD, and transition probabilities (1.0, 0.8, 0.2, 0.4, 0.7, 0.3, 0.6) on the arcs.
Slide33
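The tagging model behind figures like this is a standard HMM decoded with the Viterbi algorithm. Below is a minimal sketch: the DT/NN/VBD tag set matches the slide, but the transition and emission probabilities are toy values I made up, not the ones in the figure.

```python
def viterbi(words, tags, start, trans, emit):
    """Standard Viterbi decoding over an HMM.
    start/trans/emit are dicts of probabilities; missing entries are 0."""
    V = [{t: start.get(t, 0.0) * emit[t].get(words[0], 0.0) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: V[-1][p] * trans[p].get(t, 0.0))
            col[t] = V[-1][best_prev] * trans[best_prev].get(t, 0.0) * emit[t].get(w, 0.0)
            ptr[t] = best_prev
        V.append(col); back.append(ptr)
    # Backtrace from the best final tag.
    path = [max(tags, key=lambda t: V[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy parameters (assumed, not the slide's): "saw" is ambiguous NN/VBD.
tags = ["DT", "NN", "VBD"]
start = {"DT": 1.0}
trans = {"DT": {"NN": 1.0}, "NN": {"VBD": 0.6, "DT": 0.4}, "VBD": {"DT": 0.8, "NN": 0.2}}
emit = {"DT": {"the": 1.0}, "NN": {"man": 0.7, "saw": 0.3}, "VBD": {"saw": 1.0}}
print(viterbi("the man saw the saw".split(), tags, start, trans, emit))
# ['DT', 'NN', 'VBD', 'DT', 'NN']
```

Note how the two occurrences of "saw" get different tags purely from context, which is the point of the "The man saw the saw" example.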
Linguistically opportunistic parsing
(Same architecture as the "linguistically omnivorous parsing" slide: linguistic universals, a GFL annotated corpus, unannotated corpora, and a small CCG lexicon feed the parsers (CMU, Texas, MIT); outputs are dependencies, e.g. for "He has been writing a letter.", and Abstract Meaning Representations (ISI, CMU).)
Slide34
Fragmentary Unlabeled Dependency Grammar
(Schneider, O'Connor, Saphra, Bamman, Faruqui, Smith, Dyer, and Baldridge, 2013)
- Represents unlabeled dependencies
- Special handling for: multiword expressions, coordination, anaphora
- Allows underspecification
- Graph fragment language for easy annotation
Slide35
Graph Fragment Language (GFL)
Detailed analysis of coordination:
{Our three} > weapons > are < $a
$a :: {fear surprise efficiency} :: {and~1 and~2}
ruthless > efficiency
Or focus just on the high level:
(Our three weapons*) > are < (fear surprise and ruthless efficiency)
Detailed syntactic dependency structure:
(((Ataon' < (ny > mpanao < fihetsiketsehana)) < hoe < mpikiky < manko) < (i > Gaddafi))
Ataon' < noho < (ny_1 > kabariny < lavareny)
Or focus on predicate/arguments:
Ataon' < (ny_1 mpanao* fihetsiketsehana)
Ataon' < (hoe* mpikiky manko)
Ataon' < (i Gaddafi*)
Ataon' < noho < (ny_2 kabariny lavareny)
"Gaddafi has referred to protesters as rodents in his rambling speeches."
Slide36
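In GFL, `X > Y` attaches X under head Y and `X < Y` attaches Y under head X. A toy reader for the simplest one-line case (plain tokens only, ignoring GFL's brace groups, `$`-variables, and parentheses) can illustrate the convention; this is a sketch of the notation, not the project's GFL toolchain:

```python
def gfl_arcs(annotation):
    """Read a flat GFL-style chain like 'weapons > are < fear' into
    (dependent, head) arcs. 'X > Y' makes Y the head of X; 'X < Y'
    makes X the head of Y. Plain whitespace-separated tokens only."""
    arcs, prev, op = [], None, None
    for p in annotation.split():
        if p in ('>', '<'):
            op = p
        else:
            if op == '>':
                arcs.append((prev, p))   # prev depends on p
            elif op == '<':
                arcs.append((p, prev))   # p depends on prev
            prev, op = p, None
    return arcs

print(gfl_arcs("weapons > are < fear"))
# [('weapons', 'are'), ('fear', 'are')]
```

The real annotation framework additionally supports underspecified fragments, multiword bracketing, and coordination variables like `$a` above.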
GFL (CMU/Texas) & AMR (ISI)
The classic: "Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 ."
AMR:
(j / join-01
  :ARG0 (p / person
    :name (p2 / name :op1 "Pierre" :op2 "Vinken")
    :age (t / temporal-quantity :quant 61 :unit (y / year)))
  :ARG1 (b / board)
  :prep-as (d2 / director :mod (e / executive :polarity -))
  :time (d / date-entity :month 11 :day 29))
GFL:
join < [ Pierre Vinken ]
join < board
join < as < director
join < [Nov. 29]
nonexecutive > director
61 > years > old > [ Pierre Vinken ]
Penn Treebank:
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))
Slide37
We can derive and transform semantic graphs: Probabilistic Graph Grammars
Figure: basic rules with instance, ARG0, and ARG1 edges over WANT, BELIEVE, BOY, and GIRL nodes rewrite the partial graph for "the boy wants something involving himself" into the full graph for "the boy wants the girl to believe he is wanted".
Slide38
Example Parsing into AMR
"Approximately 11000 guards patrol the 1200-kilometre border between Russia and Afghanistan."
(p / patrol-01
  :ARG0 (g / guard
    :quant (a2 / approximately :op1 11000))
  :ARG1 (b / border
    :quant (d4 / distance-quantity
      :unit (k2 / kilometer)
      :quant 1200)
    :location (b2 / between
      :op1 (c / country :name (n / name :op1 "Russia"))
      :op2 (c2 / country :name (n2 / name :op1 "Afghanistan")))))
First identify the concepts, then add relations.
Slide39
Slide40
Deciphering Foreign Language (Dou & Knight 2013)
Chart: accuracy of the learned bilingual dictionary vs. how much foreign text (running words); the dependency-based approach beats the ngram-based one (Dou & Knight 2012) — linguistic analysis helps substantially. Tested on Spanish.
Setup: English text and foreign text that are not translations of each other -> deciphering engine -> bilingual word-for-word dictionary.
Slide41
Constituent Structure Trees
Strength: you can use tests for constituency (movement, deletion, substitution, coordination) to get reproducible results for corpus annotation.
Weaknesses:
1. Tests for constituency sometimes fail to provide reproducible results. The five trees (based on an exercise in Radford, 1988) have each been proposed in a published paper and can each be defended by tests for constituency.
2. People do not have uniform intuitions about which tree is "correct".
Slide42
Morpho-syntactics
Iñupiaq (North Slope Alaska): Tauqsiġñiaġviŋmuŋniaŋitchugut. 'We won't go to the store.'
Slide43
Mathematical Foundations for Semantics-Based Machine Translation
Previous MT systems have been based on clean string automata and tree automata:
- General-purpose algorithms have been worked out (in part by MT scientists), with wide applicability
- Software toolkits even implement those algorithms
But new models of meaning-based MT deal in semantic graph structures:
Foreign string -> Meaning graph -> English string
QUESTION: Do efficient, general-purpose algorithms for graph automata exist to support these linguistic models?
Slide44
General-Purpose Algorithms for Manipulating Linguistic Structures: Acceptors
String acceptors (successfully applied to speech recognition); tree acceptors (successfully applied to syntax-based MT); graph acceptors (now being applied to semantics-based MT).

Membership checking:
- Of a string (length n) in a WFSA: O(n) if the WFSA is determinized
- Of a tree in a forest: O(n) if determinized
- Of a graph in a hyperedge-replacement grammar (HRG) (Drewes 97); new algorithm: Chiang (forthcoming), O((2dn)^(k+1)), where d and n are properties of the individual grammar

k-best:
- Best k paths through a WFSA with n states and e edges (Viterbi 67; Eppstein 98): O(e + n log n + k log k)
- Trees in a weighted forest (Jiménez & Marzal 00; Huang & Chiang 05): O(e + n k log k)
- Graphs in a weighted HRG: efficient; the Huang & Chiang results carry over

EM training of probabilistic weights:
- Forward-backward EM (Baum/Welch 71; Eisner 03): O(n)
- Tree acceptor training (Graehl & Knight 04): O(n)
- Graphs: efficient; the Graehl & Knight results carry over

Intersection:
- WFSA intersection: O(n^2), classical
- Tree acceptor intersection: O(n^2), classical
- Graph acceptor intersection: NOT CLOSED (in general)

(co-PI supported under MURI project)
Slide45
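The classical O(n^2) WFSA intersection in the table is the product construction: states are pairs of states, arcs match on symbol, and weights multiply. A minimal sketch under an assumed dict-based automaton encoding (not Carmel's or OpenFST's API):

```python
def wfsa_intersect(A, B):
    """Product construction for two weighted FSAs over the same alphabet.
    Each automaton is (start, finals, arcs) with arcs encoded as
    {(state, symbol): [(next_state, weight), ...]}. Illustrative sketch."""
    startA, finalsA, arcsA = A
    startB, finalsB, arcsB = B
    start = (startA, startB)
    arcs, agenda, seen = {}, [start], {start}
    while agenda:
        p, q = agenda.pop()
        for (s, sym), outs in arcsA.items():
            if s != p or (q, sym) not in arcsB:
                continue  # arcs must leave the current pair on the same symbol
            for (p2, w1) in outs:
                for (q2, w2) in arcsB[(q, sym)]:
                    arcs.setdefault(((p, q), sym), []).append(((p2, q2), w1 * w2))
                    if (p2, q2) not in seen:
                        seen.add((p2, q2)); agenda.append((p2, q2))
    finals = {(f, g) for f in finalsA for g in finalsB if (f, g) in seen}
    return start, finals, arcs

# Both automata accept "ab"; the product accepts it with weight 0.5 * 0.8 * 1.0 * 0.5.
A = (0, {2}, {(0, 'a'): [(1, 0.5)], (1, 'b'): [(2, 1.0)]})
B = (0, {2}, {(0, 'a'): [(1, 0.8)], (1, 'b'): [(2, 0.5)]})
start, finals, arcs = wfsa_intersect(A, B)
print(start, finals)  # (0, 0) {(2, 2)}
```

The same pairing idea works for tree acceptors, which is why both rows show O(n^2); it is exactly this construction that fails to close for general graph acceptors.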
General-Purpose Algorithms for Feature Structures (Graphs)
Columns: String World / Tree World / Graph World
- Acceptor: finite-state acceptors / tree automata / HRG
- Transducer: finite-state transducers / tree transducers / synchronous HRG
- Membership checking: O(n) / O(n) for trees, O(n^3) for strings / O(n^(k+1)) for graphs
- N-best: paths through a WFSA (Viterbi, 1967; Eppstein, 1998) / trees in a weighted forest (Jiménez & Marzal, 2000; Huang & Chiang, 2005) / graphs in a weighted forest
- EM training: forward-backward EM (Baum/Welch, 1971; Eisner, 2003) / tree transducer EM training (Graehl & Knight, 2004) / EM on forests of graphs
- Intersection: WFSA intersection / tree acceptor intersection / not closed
- Transducer composition: WFST composition (Pereira & Riley, 1996) / many tree transducers not closed under composition (Maletti et al., 2009) / not closed
- General tools: Carmel, OpenFST / Tiburon (May & Knight, 2010) / Bolinas
Slide46
Functional Collaboration
Same diagram as the Original Vision slide: the Linguistic Core Team (LL, JB, SV, JC), Linguistic Analyzers Team (NS, RB, JB), and MT Systems Team (KK, DC, SV, JC) exchange the hand-built linguistic core, triple gold and ungold data, parsers/taggers/morphological analyzers, MT features, MT visualizations and logs, MT error analysis, MT systems, and inference algorithms, drawing on parallel, monolingual, elicited, related-language, multi-parallel, and comparable data, the elicitation corpus, and data selection for annotation.
Slide47
Malagasy Resources (monolingual)

Corpus                      Tokens     Types    Hapax
Bible (Year 1)              579,578    19,460   8,401
Leipzig corpus (Year 2)     618,282    41,462   23,659
CMU Global Voices (Year 2)  2,148,976  84,744   46,627
Total                       3,346,836  115,172  62,517

Malagasy-English Resources

Corpus                      eng-Tokens  eng-Types  mlg-Tokens  mlg-Types
Bible (Year 1)              584,872     13,084     579,578     19,460
CMU Global Voices (Year 2)  1,785,472   63,357     2,148,976   84,744
Total                       2,370,344   67,790     3,346,836   115,172
Slide48
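The token/type/hapax columns in these tables come from simple frequency counting: tokens are running words, types are distinct words, and hapax legomena are words occurring exactly once. A quick sketch (whitespace tokenization here; the reported counts used proper tokenizers):

```python
from collections import Counter

def corpus_stats(text):
    """Token, type, and hapax-legomenon counts for a corpus string.
    Whitespace tokenization only; illustrative of how the resource
    tables are computed, not the project's exact tokenization."""
    counts = Counter(text.split())
    tokens = sum(counts.values())                       # running words
    types = len(counts)                                 # distinct words
    hapax = sum(1 for c in counts.values() if c == 1)   # frequency-1 words
    return tokens, types, hapax

print(corpus_stats("ny ny tany tany tany fihetsiketsehana"))  # (6, 3, 1)
```

The high hapax proportions in the tables (e.g., 62,517 of 115,172 types) are typical of morphologically rich languages like Malagasy, which is one motivation for the morphological analyzers.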
Evolutionary Tree of MT Paradigms, Prior to LCMT (1950-2012)
Paradigms on the tree: Decoding MT, Transfer MT, Interlingua MT, Analogy MT, Example-based MT, Large-scale TMT, Context-Based MT, Statistical MT, Phrasal SMT, Transfer MT w/ stat phrases, SMT on syntax struct., LCMT
Slide49
Model Parameters
- Distribution over number of arguments given the parent tag
- Weights for selection features, shared across all set sizes
- Weights for ordering features
All parameters are shared across languages.
Slide50
Malagasy Language Modeling

Model         Data   Seq. X-ent  Word X-ent  Total X-ent  Perplexity  OOVs
3-gram+char   Bible  10.35       7.66        18.01        264,323     23.94%
3-gram+char   GV     7.02        1.14        8.16         286.0       3.30%
3-gram+morph  GV     7.02        0.90        7.92         241.4       3.30%

Successes: the Malagasy analyzer has << 100% coverage, but we still get substantial gains.
Year 3 goals:
- Improve the word sequence model with morphosyntactic information
- Improve coverage of Malagasy morphological phenomena
- Incorporation in the MT system
- Kinyarwanda analyzer/generator under development
Slide51
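The perplexity column in the table above is just two raised to the total cross-entropy in bits, PPL = 2^H, which is why 8.16 bits corresponds to a perplexity near 286 and 7.92 bits to a perplexity near 241:

```python
def perplexity(cross_entropy_bits):
    """Perplexity from cross-entropy measured in bits: PPL = 2^H.
    Matches the table: 2^8.16 is about 286; 2^7.92 is about 242
    (the table's 241.4 reflects the unrounded cross-entropy)."""
    return 2 ** cross_entropy_bits

print(round(perplexity(8.16)))  # 286
print(round(perplexity(7.92)))  # 242
```

This is why the 0.24-bit drop from the character model to the morphology model (8.16 to 7.92) shows up as a roughly 16% perplexity reduction.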
How CMU, ISI, UT and MIT collaborate
- Monthly teleconference calls: focused on management and project coordination; technical topics follow when appropriate
- Semi-annual face-to-face meetings: last ones in Nov 2012 and March 2013; include students/postdocs, etc.; focused on science
- Much more frequent focused calls/chats/etc.: data collection, annotations, SW APIs, brainstorming new algorithms, ...
- Sharing/reviewing results and papers
- Website/repository + shared SW/data sets + papers + more goodies: www.linguisticcore.info
- Student exchanges (e.g., week, month, summer)
- Occasional individual faculty trips
- Combined research (GFL, AMR parsing, CCG parsing, decipherment, ...)