The Linguistic-Core Approach


Presentation Transcript

Slide1

The Linguistic-Core Approach to Structured Translation and Analysis of Low-Resource Languages

November 2014

Project Review at ARL

Jaime Carbonell (CMU) & Team

MURI via ARO (PM: Joseph Myers)

Slide2

The Faculty

CMU: Jaime Carbonell, Noah Smith, Lori Levin, Chris Dyer
USC-ISI: Kevin Knight, David Chiang (Notre Dame)
MIT: Regina Barzilay
UT Austin: Jason Baldridge

Supporting roles: 2 other PhDs, 8 Grad Students, 3 Postdocs, N UGs

Slide3

LCMT: The Elevator Pitch

The fundamental challenge
"Modern" MT requires massive parallel data
There are 7,000+ languages with scant parallel data
Rule-based MT requires extensive trained-linguist effort

The linguistic-core approach
Goal: 90% of the linguistic benefit with 10% of the linguist effort
Annotation deep and light; linguistics via "lay" bilinguals
Augmented with machine learning from bilingual & monolingual text

Accomplishments to date
Theory: GFL, graph semantics, AMR & other parsers, sparse ML training, linguistically-anchored models, ... (40+ papers)
Tool suites: GFL, TurboParser, MT-in-works, Morph, SuperTag, ...
Languages: Kinyarwanda, Malagasy, Swahili, Yoruba

Slide4

The Setting

MURI Languages

Kinyarwanda: Bantu (7.5M speakers)
Malagasy: Malayo-Polynesian (14.5M)
Swahili: Bantu (5M native, 150M 2nd/3rd)
Yoruba: Niger-Congo (22+M)

Swahili Anamwona

“he is seeing him/her”

Morpho-syntactics

Slide5

Which MT Paradigms are Best?

Towards Filling the Table

            Large T    Med T    Small T
Large S     SMT        LCMT     LCMT
Med S       LCMT       ???      ???
Small S     LCMT       ???      ???

(S = source-language data, T = target-language data)

"Old" DARPA MT: Large S → Large T (Arabic → English; Chinese → English)

Slide6

Evolutionary Tree of MT Paradigms, Leading up to LCMT

(Timeline, 1950 to 2014: Decoding MT, Transfer MT, Interlingua MT, Analogy MT, Example-based MT, Context-Based MT, Statistical MT, Phrasal SMT, Large-scale TMT, Transfer MT with statistical phrases, SMT with syntax, LCMT)

Slide7

Linguistically omnivorous parsing

(Diagram: linguistic universals, a GFL-annotated corpus, an unannotated corpus, and a small CCG lexicon feed the parsers; contributions from CMU, Texas, and MIT.)

He has been writing a letter.

Dependencies

(j / join-01 :ARG0 (p / person :name (p2 / name :op1 "Pierre" :op2 "Vinken") :age (t / temporal-quantity :quant 61 :unit (y / year))) :ARG1 (b / board) :prep-as (d2 / director :mod (e / executive :polarity -)) :time (d / date-entity :month 11 :day 29))

Abstract Meaning Reps
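To make the notation above concrete, here is a minimal Python sketch (not part of the project's tool suite) that holds an AMR like the Pierre Vinken example as nested (variable, concept, relations) triples and lists its edges; the helper names are illustrative only, and the graph is abbreviated.

```python
# Minimal sketch (not the project's tooling): an AMR such as the
# "Pierre Vinken" example above can be held as (variable, concept, relations)
# triples and traversed to list its relation edges.

def amr(var, concept, *relations):
    """Build a node: relations are (role, child) pairs; children may be
    nested nodes or literal values such as strings and numbers."""
    return (var, concept, list(relations))

vinken = amr("j", "join-01",
    (":ARG0", amr("p", "person",
        (":name", amr("p2", "name", (":op1", '"Pierre"'), (":op2", '"Vinken"'))),
        (":age", amr("t", "temporal-quantity",
            (":quant", 61), (":unit", amr("y", "year")))))),
    (":ARG1", amr("b", "board")),
    (":time", amr("d", "date-entity", (":month", 11), (":day", 29))))

def edges(node):
    """Yield (parent_var, role, child) edges in depth-first order."""
    var, _, relations = node
    for role, child in relations:
        if isinstance(child, tuple):
            yield (var, role, child[0])
            yield from edges(child)
        else:
            yield (var, role, child)

for edge in edges(vinken):
    print(edge)   # e.g. ('j', ':ARG0', 'p'), ('p', ':name', 'p2'), ...
```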

(Contributors: ISI, CMU, MIT)

Slide8

Original Vision

Teams: Linguistic Core Team (LL, JB, SV, JC); Linguistic Analyzers Team (NS, RB, JB); MT Systems Team (KK, DC, SV, JC)
Components: Hand-built Linguistic Core; Parsers, Taggers, Morphological Analyzers; Triple Gold Data; Triple Ungold Data; Inference Algorithms; MT Features; MT Systems; MT Error Analysis; MT Visualizations and logs; Elicitation corpus; Data selection for annotation
Data: Parallel, Monolingual, Elicited, Related language, Multi-parallel, Comparable

Slide9

Current Vision

Teams: Linguistic Core Team (LL, JB, CD, JC); Linguistic Analyzers Team (NS, RB, JB); MT Systems Team (KK, DC, CD, JC)
Components: Hand-built Linguistic Core; Parsers, Taggers, Semantic analyzers; Triple Gold and GFL-annotated data; String/tree/graph transducers; Complex morphological analyzers; Dependency parses; Semantic parsing algorithms; MT systems and TA modules; MT/TA error analysis; Definiteness/Discourse; Elicitation corpus; Data selection for annotation
Data: Parallel, Monolingual, Elicited, Related language, Multi-parallel, Comparable

Slide10

PFA Node Alignment Algorithm Example

Tree-tree aligner enforces equivalence constraints and optimizes over terminal alignment scores (words/phrases)

Resulting aligned nodes are highlighted in figure

Transfer rules are partially lexicalized and read off the tree.

Slide11

LCMT: NLP Workflow and Tools

(Workflow diagram: annotated and unannotated data feed the tools below; a box marks tool-suite software, with more to come.)

Tools: supervised and semi-supervised POS taggers; unsupervised, semi-supervised, and supervised dependency & AMR parsers; the GFL annotator framework (current); tree-graph syntactic/semantic transducers. Contributors: CMU, Texas, MIT, ISI.

Slide12

Machine Translation Paradigms

Phrase-based MT (LCMT 20+% of effort)

Morph-Syntax-based MT (LCMT 30+%)

Meaning-based MT (LCMT 40+%)

Phrase-based: source string → target string
Morph-syntax-based: source string → source tree → target tree → target string
Meaning-based: source string → meaning representation → target string

(NIST 2009 c2e)

Slide13

Some Key Results to Date

Theory of transducers (string, tree, graph)

Massive Lexical borrowing across diverse languages

Linguistic universals

Dependencies, semantic roles, conservation, AMR, discourse, ...
Statistical learning over strings, trees, graphs
Bayesian, HMM/CRF, active sampling → model parameters
Parsing into deep semantics (AMR)
MT demonstrations: focus on M, K, S, Y, but also across ~20 languages (WMT honors, synthetic phrases)
A suite of 11 serious software modules and tools (morphology, variable-depth linguistic annotation, dependency parsing, MT, ...)
Current scientific challenges:
Is general graph topology induction possible?
Bridging structural divergences via semi-universals?
Semantic invariance: lexical, structural, non-propositional?

Slide14

List of "Firsts" for the Linguistic Core

First use of models incorporating linguistic knowledge in the form of hand-written morpho-grammatical rules combined with limited-volume corpus statistics.
First use of models of lexical "borrowing" from other (major) languages to improve translation and analysis of low-resource languages (publication in prep).
First efficient and exact probabilistic model for structured prediction with arbitrary syntactic and semantic dependencies derived from the input language.
First exploitation of large monolingual foreign text collections (vs. bilingually translated collections) to improve low-density MT, by treating foreign text as a mapped/encoded version of English.
First application of formal graph transduction theory to natural language analysis; earlier efforts applied string and tree transduction theory only.
First substantial corpora annotated cheaply by novices used to build effective NLP tools.
First statistical parser to map language into an abstract meaning representation of semantics.
First to show that, for resource-impoverished languages, a multilingual parser based on language universals outperforms a parser trained only on the target language.
First analyses to formally prove and empirically demonstrate that inference in dependency parsing is computationally easy in the average case (despite being NP-hard in the worst case).

Slide15

External Honors for the LC Project

Best human judgments of English-Russian translations at WMT 2013

Best BLEU on Hindi-English translation at WMT2014

Best student paper, ACL 2014: Low-Rank Tensors for Scoring Dependency Structures. Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay and Tommi Jaakkola. http://people.csail.mit.edu/taolei/papers/acl2014.pdf
Best paper, honorable mention, ACL 2014: A Discriminative Graph-Based Parser for the Abstract Meaning Representation. Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer and Noah A. Smith. http://www.cs.cmu.edu/~jmflanig/flanigan+etal.acl2014.pdf
Best paper, runner-up, EMNLP 2014: Language Modeling with Power Low Rank Ensembles. http://www.aclweb.org/anthology/D/D14/D14-1158.pdf
Best paper (one of four), NIPS 2014: Conditional Random Field Autoencoders for Unsupervised Structured Prediction. Waleed Ammar, Chris Dyer, and Noah A. Smith

Slide16

Lexical Borrowing of Common Words

Slide17

Swahili morphology using a crowdsourced lexicon

Patrick Littell, Lori Levin, Chris Dyer
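As a rough illustration of what such an analyzer does (this is a toy sketch, not Patrick Littell's FST or the Kamusi-derived lexicon), the following Python strips a few Swahili verb prefixes from a form like "anamwona" and falls back to a guess when the root is not in the lexicon; the affix tables and lexicon here are deliberately tiny and hypothetical.

```python
# Toy illustration only -- not the crowdsourced-lexicon FST described on this slide.
# It greedily strips a few Swahili verb prefixes to expose a candidate root,
# which a real analyzer would then check against the Kamusi-derived lexicon.

SUBJECT = {"a": "3SG-SUBJ", "ni": "1SG-SUBJ", "tu": "1PL-SUBJ"}
TENSE = {"na": "PRESENT", "li": "PAST", "ta": "FUTURE"}
OBJECT = {"mw": "3SG-OBJ", "ni": "1SG-OBJ", "tu": "1PL-OBJ"}
LEXICON = {"ona": "see"}          # stand-in for the crowdsourced root list

def segment(word):
    glosses = []
    for slot in (SUBJECT, TENSE, OBJECT):
        for prefix, gloss in slot.items():
            if word.startswith(prefix):
                glosses.append((prefix, gloss))
                word = word[len(prefix):]
                break
    root_gloss = LEXICON.get(word, "[unknown root: complete guess]")
    glosses.append((word, root_gloss))
    return glosses

print(segment("anamwona"))
# [('a', '3SG-SUBJ'), ('na', 'PRESENT'), ('mw', '3SG-OBJ'), ('ona', 'see')]
```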

No provenance: the root of the word was collected by hand.

[GUESS1]: the root is inferred from the Kamusi lexicon part-of-speech tag, including noun class.
[GUESS2]: the root is from Kamusi, but no noun class is given.
[GUESS3]: possible English loan word.
[GUESS4]: complete guess.

FST written by Patrick Littell
Lexicon extracted from dictionaries and textbooks

Slide18

Parsing Progress (F1)

On the CoNLL dataset: 88.72 (CMU), 89.44 (MIT)

Slide19

CCG Supertagging

Example: "the lazy dogs wander"
Supertags: the → np/n, lazy → n/n, dogs → n, wander → (s\np)/np
(The figure also shows other candidate supertags per word, e.g. np, n, n/n, s\np.)

HMM with linguistically-motivated priors
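A minimal sketch of the setup, assuming made-up probabilities rather than the project's linguistically-motivated priors: Viterbi decoding of supertags for "the lazy dogs wander" under a small HMM.

```python
# Minimal Viterbi sketch for HMM supertagging (illustrative numbers only).
import math
from collections import defaultdict

TAGS = ["np/n", "n/n", "n", r"(s\np)/np", r"s\np"]
# Made-up transition and emission log-probabilities.
trans = defaultdict(lambda: math.log(0.05), {
    ("<s>", "np/n"): math.log(0.6),
    ("np/n", "n/n"): math.log(0.4),
    ("np/n", "n"): math.log(0.5),
    ("n/n", "n"): math.log(0.8),
    ("n", r"s\np"): math.log(0.5),
    ("n", r"(s\np)/np"): math.log(0.3),
})
emit = defaultdict(lambda: math.log(0.01), {
    ("np/n", "the"): math.log(0.9),
    ("n/n", "lazy"): math.log(0.7),
    ("n", "dogs"): math.log(0.6),
    (r"s\np", "wander"): math.log(0.6),
})

def viterbi(words):
    best = {"<s>": (0.0, [])}                 # tag -> (log score, tag path)
    for w in words:
        new = {}
        for tag in TAGS:
            score, path = max(
                (s + trans[(prev, tag)] + emit[(tag, w)], p + [tag])
                for prev, (s, p) in best.items())
            new[tag] = (score, path)
        best = new
    return max(best.values())[1]

print(viterbi(["the", "lazy", "dogs", "wander"]))
# -> ['np/n', 'n/n', 'n', 's\\np']
```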

Slide20

Parsing into AMR (ACL 2014 honorable mention for best paper)

Approximately 11000 guards patrol the 1200-kilometre border between Russia and Afghanistan.

(p / patrol-01
  :ARG0 (g / guard :quant (a2 / approximately :op1 11000))
  :ARG1 (b / border
    :quant (d4 / distance-quantity :unit (k2 / kilometer) :quant 1200)
    :location (b2 / between
      :op1 (c / country :name (n / name :op1 "Russia"))
      :op2 (c2 / country :name (n2 / name :op1 "Afghanistan")))))

(Figure: the identified concept nodes of the AMR above, to which relations are then added.)

New results: 61% F1

CMU and ISI

Slide21

Unsupervised Part-of-Speech Tagging

(Chart: V-measure, higher is better, across Arabic, Basque, Danish, Greek, Hungarian, Italian, Kinyarwanda, Malagasy, Turkish, and the average, comparing the conditional random field autoencoder, the classic hidden Markov model, and the featurized hidden Markov model.)

Slide22

Automatic Classification of the Communicative Functions of Definiteness

Annotated Corpus

Semantics of Definiteness

Syntactic features extracted from a dependency parser
Logistic regression classifier
Predicted semantic functions of definiteness: 78.2% accuracy
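A sketch of this classification setup using scikit-learn, with toy features and an illustrative label set rather than the project's actual feature templates or annotation scheme.

```python
# Sketch only: dependency-based features of a noun phrase feed a logistic
# regression that predicts the communicative function of its definiteness.
# Feature names and labels here are hypothetical.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Each NP is a bag of syntactic features read off a dependency parse.
train_feats = [
    {"head_pos": "NN",  "dep_rel": "nsubj", "has_det": 1, "prev_mention": 1},
    {"head_pos": "NN",  "dep_rel": "dobj",  "has_det": 1, "prev_mention": 0},
    {"head_pos": "NNS", "dep_rel": "nsubj", "has_det": 0, "prev_mention": 0},
]
train_labels = ["anaphoric", "unique", "generic"]   # illustrative label set

vec = DictVectorizer()
X = vec.fit_transform(train_feats)
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)

test = vec.transform([{"head_pos": "NN", "dep_rel": "nsubj",
                       "has_det": 1, "prev_mention": 1}])
print(clf.predict(test))   # e.g. ['anaphoric']
```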

Why Definiteness:

One instance of non-propositional semantics

Major determinant of word order

Wildly divergent in morpho-syntactic expression
Problems in word alignment and language models

Slide23

Integrating Alignment and Decipherment for Better Low-Density MT

Small bilingual Malagasy/English text (need to align words [Brown et al. 93])
Large Malagasy monolingual text (need to decipher [Dou & Knight 13])
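For the word-alignment half, here is a minimal IBM Model 1 EM sketch on tiny illustrative Malagasy/English pairs (this is standard Model 1, not the project's integrated alignment-plus-decipherment system).

```python
# Minimal IBM Model 1 sketch (toy data): EM over a tiny parallel corpus to
# learn word-for-word translation probabilities, i.e. the "align words
# [Brown et al. 93]" step mentioned above.
from collections import defaultdict

corpus = [
    ("ny alika".split(), "the dog".split()),   # illustrative Malagasy/English pairs
    ("ny saka".split(),  "the cat".split()),
]

e_vocab = {e for _, es in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))    # t(f|e), uniform start

for _ in range(10):
    count = defaultdict(float)
    total = defaultdict(float)
    for fs, es in corpus:                      # E-step: expected alignment counts
        for f in fs:
            z = sum(t[(f, e)] for e in es)
            for e in es:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():            # M-step: renormalize t(f|e)
        t[(f, e)] = c / total[e]

for f in ["ny", "alika", "saka"]:
    best = max(e_vocab, key=lambda e: t[(f, e)])
    print(f, "->", best, round(t[(f, best)], 2))
```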

(Charts: decipherment helps word alignment; decipherment helps machine translation; the joint model improves BLEU.)

ISI jointly with CMU/Texas/MIT

Slide24

Graph Formalisms for Language Understanding and Generation

String Automata Algorithms / Tree Automata Algorithms / Graph Automata Algorithms

N-best answer extraction: ... paths through a WFSA (Viterbi, 1967; Eppstein, 1998); ... trees in a weighted forest (Jiménez & Marzal, 2000; Huang & Chiang, 2005)
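As a concrete, deliberately tiny illustration of the string-automaton case, the sketch below extracts the single best path through a small acyclic weighted FSA; it is the 1-best analogue of the N-best algorithms cited above, not an implementation of them.

```python
# Minimal sketch (illustrative only) of best-path extraction through a small
# acyclic weighted FSA.  States are processed in topological order.

# arcs: state -> list of (next_state, label, weight); weights are log-probs.
ARCS = {
    0: [(1, "he", -0.2), (1, "she", -1.6)],
    1: [(2, "is", -0.1)],
    2: [(3, "seeing", -0.7), (3, "watching", -0.9)],
    3: [],
}
START, FINAL = 0, 3

def best_path(arcs, start, final):
    best = {start: (0.0, [])}                 # state -> (score, labels so far)
    for state in sorted(arcs):                # topological order for this toy FSA
        if state not in best:
            continue
        score, labels = best[state]
        for nxt, label, w in arcs[state]:
            cand = (score + w, labels + [label])
            if nxt not in best or cand[0] > best[nxt][0]:
                best[nxt] = cand
    return best[final]

print(best_path(ARCS, START, FINAL))   # (-1.0, ['he', 'is', 'seeing'])
```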

Investigating: linguistically adequate representations; efficient algorithms. Using them in: Text → Meaning (NLU), Meaning → Text (NLG), meaning-based MT

Unsupervised EM training: forward-backward EM (Baum/Welch, 1971; Eisner 2003); tree transducer EM training (Graehl & Knight, 2004)

Determinization, minimization: ... of weighted string acceptors (Mohri, 1997); ... of weighted tree acceptors (Borchardt & Vogler, 2003; May & Knight, 2005)

Intersection: WFSA intersection; tree acceptor intersection

Application of transducers: string → WFST → WFSA; tree → TT → weighted tree acceptor

Composition of transducers: WFST composition (Pereira & Riley, 1996); many tree transducers not closed under composition (Maletti et al 09)

Software tools: Carmel, OpenFST; Tiburon (May & Knight 10)

ISI jointly with CMU

Slide25

(Diagram: resources and components for Tajik named-entity recognition, with contributors.)

NE Gold Standard, native speaker (Azim); NE Pyrite Standard, linguist (Alexa, Lori); Morphology Lists for NE (Alexa, David); Morphological Analyzers (Swabha, Chris); Gazetteers (Pat, Chris); Brown Clusters (Kartik); Tajik Corpus from Leipzig Archive; Supervision (Azim; David); Tajik and Persian Wikipedias; Tajik Reference Grammar (Perry, 2005); PerLex Persian Lexicon (Sagot and Walther, 2010); IPA Converter (Kartik, Pat, Chris); Named Entity Recognizer (Kartik, Chris); Persian Treebank (Rasooli et al., 2013); Tajik POS Tagger (Chris); Persian-Tajik Converter (Chris)

Slide26

New Results for Graph Automata for Mapping Between Text and Meaning

(Strings: FSA, CFG. Graphs: DAG acceptor, HRG.)

                               FSA     CFG      DAG acceptor      HRG
probabilistic                  yes     yes      yes               yes
intersects with finite-state   yes     yes      yes               yes
EM training                    yes     yes      yes               yes
transduction                   O(n)    O(n^3)   O(|Q|^(T+1) n)    O((3dn)^(T+1))
implemented                    yes     yes      yes               yes

d = graph degree for AMR, high in practice
T = treewidth for AMR, low in practice (2-3)

Slide27

Next Steps (high-level overview)

Finalize MT systems: K, M, S, Y

Package and make available externally

Possibly integrate with government translator workbench

Compare with Government systems when available and appropriate (e.g. Malagasy with Carl Rubino)
Complete scientific investigations (graph transduction, MT with AMR, supertagging → parsing, borrowing++, ...)
Document and distribute tool suites (rapid annotation, morphology, CCG supertagging, dependency parsing, AMR parsing, generation, end-to-end MT, lexicon borrowing, ML modules, ...): 15 +/-
Publish, publish, publish (40+ papers and counting)
Detailed next steps at the end of each major presentation

Slide28

Jaime Carbonell, CMU

THANK YOU!

Slide29

Supplementary Slides
Select/show as needed for discussion period

Slide30

Tag Dictionary Generalization

(Figure: a raw corpus sentence "the dog walks", tagged DT NN VBZ, with token-level annotations such as TOK_the_1, TOK_dog_2, PREV_<b>, PREV_the, NEXT_walks, and type-level annotations such as TYPE_the, TYPE_dog, TYPE_thug, PRE1_t, PRE2_th, SUF1_g.)

Any arbitrary features could be added.
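A small sketch of the token-level vs. type-level features shown in the figure; the feature names follow the slide, while everything else is illustrative.

```python
# Illustrative sketch of token- vs. type-level feature extraction for tag
# dictionary generalization (feature names follow the slide above).

def token_features(tokens, i):
    feats = [f"TOK_{tokens[i]}_{i + 1}"]
    feats.append(f"PREV_{tokens[i - 1]}" if i > 0 else "PREV_<b>")
    if i + 1 < len(tokens):
        feats.append(f"NEXT_{tokens[i + 1]}")
    return feats

def type_features(word):
    return [f"TYPE_{word}", f"PRE1_{word[:1]}", f"PRE2_{word[:2]}", f"SUF1_{word[-1]}"]

tokens = "the dog walks".split()
for i, w in enumerate(tokens):
    print(w, token_features(tokens, i) + type_features(w))
# the ['TOK_the_1', 'PREV_<b>', 'NEXT_dog', 'TYPE_the', 'PRE1_t', 'PRE2_th', 'SUF1_e']
# ...
```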

Slide31

RULE 1: DT(these) → 这
RULE 2: VBP(include) → 中包括
RULE 4: NNP(France) → 法国
RULE 5: CC(and) → 和
RULE 6: NNP(Russia) → 俄罗斯
RULE 8: NP(NNS(astronauts)) → 宇航 , 员
RULE 9: PUNC(.) → .
RULE 10: NP(x0:DT, CD(7), NNS(people)) → x0 , 7人
RULE 11: VP(VBG(coming), PP(IN(from), x0:NP)) → 来自 , x0
RULE 13: NP(x0:NNP, x1:CC, x2:NNP) → x0 , x1 , x2
RULE 14: VP(x0:VBP, x1:NP) → x0 , x1
RULE 15: S(x0:NP, x1:VP, x2:PUNC) → x0 , x1 , x2
RULE 16: NP(x0:NP, x1:VP) → x1 , 的 , x0

Target: 7人 中包括 来自 法国 和 俄罗斯 的 宇航 员 .

“These 7 people include astronauts coming from France and Russia”
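A minimal sketch of how rules like RULE 1 and RULE 10 above apply to an English parse fragment to yield target-language tokens; the rule encoding and pattern matching are simplified for illustration and are not the actual transducer machinery used in the project.

```python
# Minimal sketch of tree-to-string rule application in the style of RULE 1
# and RULE 10 above; real systems learn weighted transducer rules.

# A tree node is (label, children); leaves carry the surface word.
tree = ("NP", [("DT", ["these"]), ("CD", ["7"]), ("NNS", ["people"])])

# Lexical rules: (source label, source word) -> target tokens.
LEX_RULES = {("DT", "these"): ["这"]}                                   # RULE 1

# Structural rules: source root + child specs -> target template, where
# ("x0", LABEL) marks a variable child that is translated recursively.
STRUCT_RULES = [
    ("NP", [("x0", "DT"), ("CD", "7"), ("NNS", "people")], ["x0", "7人"]),  # RULE 10
]

def translate(node):
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):   # leaf node
        return LEX_RULES.get((label, children[0]), [children[0]])
    for root, specs, template in STRUCT_RULES:
        if root != label or len(specs) != len(children):
            continue
        bindings = {}
        for (name, want), child in zip(specs, children):
            if name.startswith("x"):                # variable: match child label only
                if child[0] != want:
                    break
                bindings[name] = child
            elif (name, want) != (child[0], child[1][0]):   # fixed leaf child
                break
        else:
            return [tok for t in template
                        for tok in (translate(bindings[t]) if t in bindings else [t])]
    raise ValueError(f"no rule for {label}")

print(" ".join(translate(tree)))   # 这 7人
```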

(Figure: the source constituents covered by the derivation, from single words such as "these", "France", "Russia", "astronauts", "include", "&", "." up to "these 7 people" and "include astronauts coming from France and Russia".)

Slide32

Model Minimization

(Figure: an HMM lattice over "<b> The man saw the saw <b>" with tags <b>, DT, NN, VBD and example transition/emission probabilities.)

Slide33

Linguistically opportunistic parsing

(Diagram: linguistic universals, a GFL-annotated corpus, an unannotated corpus, and a small CCG lexicon feed the parsers; contributions from CMU, Texas, and MIT.)

He has been writing a letter.

Dependencies

(j / join-01 :ARG0 (p / person :name (p2 / name :op1 "Pierre" :op2 "Vinken") :age (t / temporal-quantity :quant 61 :unit (y / year))) :ARG1 (b / board) :prep-as (d2 / director :mod (e / executive :polarity -)) :time (d / date-entity :month 11 :day 29))

Abstract Meaning Reps

(Contributors: ISI, CMU, MIT)

Slide34

Fragmentary Unlabeled Dependency Grammar (Schneider, O'Connor, Saphra, Bamman, Faruqui, Smith, Dyer, and Baldridge, 2013)

Represents unlabeled dependencies

Special handling for: multiword expressions, coordination, anaphora
Allows underspecification

Graph fragment language for easy annotation

Slide35

Graph Fragment Language (GFL)

{Our three} > weapons > are < $a

$a :: {fear surprise efficiency} :: {and~1 and~2}

ruthless > efficiency

Provide a detailed analysis of coordination…

(Our three weapons*) > are <

(fear surprise and ruthless efficiency)

Or focus just on the high level…
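A minimal sketch of reading unlabeled dependency arcs off simple GFL annotations like the ones above; full GFL (brackets, coordination nodes such as $a, and underspecification) is ignored here.

```python
# Minimal sketch of reading simple GFL arcs.
# In "a > b" the head is b; in "a < b" the head is a.

def gfl_arcs(annotation):
    tokens = annotation.split()
    arcs = []
    for left, op, right in zip(tokens, tokens[1:], tokens[2:]):
        if op == ">":
            arcs.append((left, right))      # left is a dependent of right
        elif op == "<":
            arcs.append((right, left))      # right is a dependent of left
    return arcs

print(gfl_arcs("ruthless > efficiency"))
# [('ruthless', 'efficiency')]
print(gfl_arcs("weapons > are < fear"))
# [('weapons', 'are'), ('fear', 'are')]
```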

(((Ataon' < (ny > mpanao < fihetsiketsehana)) < hoe < mpikiky < manko) < (i > Gaddafi))

Atoan’ < noho < (ny_1 > kabariny < lavareny)

Provide detailed syntactic dependency structure

Ataon’ < (ny_1 mpanao* fihetsiketsehana)

Atoan’ < (hoe* mpikiky manko)

Atoan’ < (i Gaddafi*)

Atoan’ < noho < (ny_2 kabariny lavareny)

Or focus on predicate/arguments

"Gaddafi has referred to protesters as rodents in his rambling speeches."

Slide36

GFL (CMU/Texas) & AMR (ISI)

The classic: "Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29."

AMR:
(j / join-01 :ARG0 (p / person :name (p2 / name :op1 "Pierre" :op2 "Vinken") :age (t / temporal-quantity :quant 61 :unit (y / year))) :ARG1 (b / board) :prep-as (d2 / director :mod (e / executive :polarity -)) :time (d / date-entity :month 11 :day 29))

GFL:
join < [ Pierre Vinken ]
join < board
join < as < director
join < [Nov. 29]
nonexecutive > director
61 > years > old > [ Pierre Vinken ]

Penn Treebank:
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))

Slide37

We can derive and transform semantic graphs: Probabilistic Graph Grammars

(Figure: graph-grammar rules over concepts such as WANT, BELIEVE, BOY, GIRL with ARG0/ARG1 roles; a basic rule introduces a "the boy wants something involving himself" fragment, and derivations yield graphs such as "the boy wants the girl to believe he is wanted".)

Slide38

Example Parsing into AMR

Approximately 11000 guards patrol the 1200-kilometre border between Russia and Afghanistan.

(p / patrol-01
  :ARG0 (g / guard
    :quant (a2 / approximately :op1 11000))
  :ARG1 (b / border
    :quant (d4 / distance-quantity
      :unit (k2 / kilometer)
      :quant 1200)
    :location (b2 / between
      :op1 (c / country :name (n / name :op1 "Russia"))
      :op2 (c2 / country :name (n2 / name :op1 "Afghanistan")))))

(Figure: the identified concept nodes, to which relations are then added.)

Slide39
Slide40

Deciphering Foreign Language (Dou & Knight 2013)

(Chart: accuracy of the learned bilingual dictionary vs. amount of foreign text in running words; the dependency-based model (Dou & Knight 2013) outperforms the n-gram-based model (Dou & Knight 2012). Linguistic analysis helps substantially! Test on Spanish.)

(Diagram: English text and foreign text that are not translations of each other feed a deciphering engine, which outputs a bilingual word-for-word dictionary.)

Slide41

Constituent Structure Trees

Strength:

You can use tests for constituency (movement, deletion, substitution, coordination) to get reproducible results for corpus annotation.

Weakness:
1. Tests for constituency sometimes fail to provide reproducible results. The five trees (based on an exercise in Radford, 1988) have each been proposed in a published paper and can each be defended by tests for constituency.
2. People do not have uniform intuitions about which tree is "correct".

Slide42

Morpho-syntactics

Iñupiaq (North Slope Alaska): Tauqsiġñiaġviŋmuŋniaŋitchugut. 'We won't go to the store.'

Slide43

Mathematical Foundations for Semantics-Based Machine Translation

Previous MT systems have been based on clean string automata and tree automata
General-purpose algorithms have been worked out (in part by MT scientists), with wide applicability; software toolkits even implement those algorithms
But new models of meaning-based MT deal in semantic graph structures: foreign string → meaning graph → English string
QUESTION: Do efficient, general-purpose algorithms for graph automata exist to support these linguistic models?

Slide44

General-Purpose Algorithms for Manipulating Linguistic Structures: Acceptors

String Acceptors: successfully applied to speech recognition
Tree Acceptors: successfully applied to syntax-based MT
Graph Acceptors: now being applied to semantics-based MT

Membership checking:
... of a string (length n) in a WFSA: O(n) if the WFSA is determinized
... of a tree in a forest: O(n) if determinized
... of a graph in a hyperedge-replacement grammar (HERG) (Drewes 97); new algorithm: Chiang (forthcoming), O((2dn)^(k+1)), with d and n properties of the individual grammar

k-best:
... best k paths through a WFSA with n states and e edges (Viterbi 67; Eppstein 98): O(e + n log n + k log k)
... trees in a weighted forest (Jiménez & Marzal 00; Huang & Chiang 05): O(e + n k log k)
... graphs in a weighted HERG: efficient; Huang & Chiang results carry over

EM training of probabilistic weights:
Forward-backward EM (Baum/Welch 71; Eisner 03): O(n)
Tree acceptor training (Graehl & Knight 04): O(n)
Graph acceptor training: efficient; Graehl & Knight results carry over

Intersection:
WFSA intersection: O(n^2), classical
Tree acceptor intersection: O(n^2), classical
Graph acceptor intersection: NOT CLOSED (in general)

(co-PI supported under MURI project)

Slide45

General-Purpose Algorithms for Feature Structures (Graphs)

(String World / Tree World / Graph World)
Acceptor: finite-state acceptors / tree automata / HRG
Transducer: finite-state transducers / tree transducers / synchronous HRG
Membership checking: O(n) / O(n) for trees, O(n^3) for strings / O(n^(k+1)) for graphs
N-best: ... paths through a WFSA (Viterbi, 1967; Eppstein, 1998) / ... trees in a weighted forest (Jiménez & Marzal, 2000; Huang & Chiang, 2005) / ... graphs in a weighted forest
EM training: forward-backward EM (Baum/Welch, 1971; Eisner 2003) / tree transducer EM training (Graehl & Knight, 2004) / EM on forests of graphs
Intersection: WFSA intersection / tree acceptor intersection / not closed
Transducer composition: WFST composition (Pereira & Riley, 1996) / many tree transducers not closed under composition (Maletti et al 09) / not closed
General tools: Carmel, OpenFST / Tiburon (May & Knight 10) / Bolinas

Slide46

Functional Collaboration

Teams: Linguistic Core Team (LL, JB, SV, JC); Linguistic Analyzers Team (NS, RB, JB); MT Systems Team (KK, DC, SV, JC)
Components: Hand-built Linguistic Core; Parsers, Taggers, Morphological Analyzers; Triple Gold Data; Triple Ungold Data; Inference Algorithms; MT Features; MT Systems; MT Error Analysis; MT Visualizations and logs; Elicitation corpus; Data selection for annotation
Data: Parallel, Monolingual, Elicited, Related language, Multi-parallel, Comparable

Slide47

Malagasy Resources

                              Tokens      Types     Hapax
Bible (Year 1)                579,578     19,460    8,401
Leipzig corpus (Year 2)       618,282     41,462    23,659
CMU Global Voices (Year 2)    2,148,976   84,744    46,627
Total                         3,346,836   115,172   62,517

Malagasy - English Resources

                              eng-Tokens  eng-Types  mlg-Tokens  mlg-Types
Bible (Year 1)                584,872     13,084     579,578     19,460
CMU Global Voices (Year 2)    1,785,472   63,357     2,148,976   84,744
Total                         2,370,344   67,790     3,346,836   115,172

Slide48

Evolutionary Tree of MT Paradigms, Prior to LCMT

(Timeline, 1950 to 2012: Decoding MT, Transfer MT, Interlingua MT, Analogy MT, Example-based MT, Context-Based MT, Statistical MT, Phrasal SMT, Large-scale TMT, Transfer MT with statistical phrases, SMT on syntactic structure, LCMT)

Slide49

Model Parameters

Distribution over the number of arguments given the parent tag
Weights for selection features, shared across all set sizes

Weights for ordering features

All parameters are shared across languages

Slide50

Malagasy Language Modeling

Model          Data    Seq. X-ent   Word X-ent   Total X-ent   Perplexity   OOVs
3-gram+char    Bible   10.35        7.66         18.01         264,323      23.94%
3-gram+char    GV      7.02         1.14         8.16          286.0        3.30%
3-gram+morph   GV      7.02         0.90         7.92          241.4        3.30%

Successes

Malagasy analyzer has << 100% coverage, but we still get substantial gains

Year 3 Goals
Improve the word sequence model with morphosyntactic information
Improve coverage of Malagasy morphological phenomena
Incorporation in the MT system
Kinyarwanda analyzer/generator under development

Slide51

How CMU, ISI, UT, and MIT Collaborate

Monthly teleconference calls

Focused on management and project coordination

Technical topics follow when appropriate

Semi-annual face-to-face meetings
Last ones in Nov 2012 and March 2013
Include students/postdocs, etc.
Focused on science
Much more frequent focused calls/chats/etc.
Data collection, annotations, SW APIs, brainstorming new algorithms, ...
Sharing/reviewing results and papers
Website/repository + shared SW/data sets + papers + more goodies: www.linguisticcore.info
Student exchanges (e.g. week, month, summer)
Occasional individual faculty trips
Combined research (GFL, AMR parsing, CCG parsing, decipherment, ...)