Slide 1: An Investigation of Machine Translation Evaluation Metrics in Cross-lingual Question Answering
Kyoshiro Sugiyama, Masahiro Mizukami, Graham Neubig, Koichiro Yoshino, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura
NAIST, Japan
Slide 2: Question answering (QA)
One of the techniques for information retrieval.
Input: question. Output: answer.
Example: "Where is the capital of Japan?" → retrieval from an information source → "Tokyo."
Slide 3: QA using knowledge bases
Convert the question sentence into a query.
Low ambiguity, but the knowledge base is linguistically restricted, so cross-lingual QA is necessary.
Example: "Where is the capital of Japan?"
Query: Type.Location, Country.Japan.CapitalCity
Response: Location.City.Tokyo → "Tokyo."
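The query-and-response flow on this slide can be sketched as a lookup against a small in-memory knowledge base. This is only a toy illustration, not the actual system: the KB contents and the `run_query` helper are invented, and only the logical-form identifiers come from the slide.

```python
# Toy knowledge base keyed by (entity, predicate) pairs; the entries
# mirror the slide's example, the second one is invented for contrast.
KB = {
    ("Country.Japan", "CapitalCity"): "Location.City.Tokyo",
    ("Country.France", "CapitalCity"): "Location.City.Paris",
}

def run_query(entity, predicate):
    """Return the object of (entity, predicate) in the toy KB, or None."""
    return KB.get((entity, predicate))

# "Where is the capital of Japan?" -> query -> response
response = run_query("Country.Japan", "CapitalCity")
print(response)  # Location.City.Tokyo
```

The point the slide makes is that this lookup step is unambiguous once the query exists; the hard, language-dependent part is producing the query from the question sentence.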
Slide 4: Cross-lingual QA (CLQA)
The question sentence (in any language) differs linguistically from the information source.
Example: 「日本の首都はどこ？」 ("Where is the capital of Japan?")
Query: Type.Location, Country.Japan.CapitalCity
Response: Location.City.Tokyo → 東京 ("Tokyo")
Creating a direct mapping from each language to the knowledge base has high cost and is not re-usable in other languages.
Slide 5: CLQA using machine translation
Machine translation (MT) can be used to perform CLQA: easy, low cost, and usable in many languages.
However, QA accuracy depends on MT quality.
Example: 「日本の首都はどこ？」 is machine-translated into "Where is the capital of Japan?", the existing QA system answers "Tokyo", and the answer is machine-translated back into 東京.
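The MT-based pipeline above can be sketched as three stages: translate the question, answer it with the existing QA system, translate the answer back. This is a stand-in sketch only: the dictionary-based `translate` and `answer` functions below are placeholders for a real MT engine and a real KB-backed QA system.

```python
# Placeholder translation tables and QA lookup for the slide's example.
JA_EN = {"日本の首都はどこ？": "Where is the capital of Japan?"}
EN_JA = {"Tokyo": "東京"}
QA = {"Where is the capital of Japan?": "Tokyo"}

def translate(text, table):
    """Stand-in for an MT system: dictionary lookup, identity on miss."""
    return table.get(text, text)

def answer(question_en):
    """Stand-in for the existing English QA system."""
    return QA.get(question_en)

def clqa(question_ja):
    question_en = translate(question_ja, JA_EN)  # MT: source language -> English
    answer_en = answer(question_en)              # existing English QA system
    return translate(answer_en, EN_JA)           # MT: English -> source language

print(clqa("日本の首都はどこ？"))  # 東京
```

In the real pipeline every stage can introduce errors, which is exactly why the talk asks which MT metric best predicts downstream QA accuracy.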
Slide 6: Purpose of our work
To clarify how translation affects QA accuracy:
Which MT metrics are suitable for the CLQA task?
- Creation of QA data sets using various translation systems
- Evaluation of translation quality and QA accuracy
What kind of translation results influence QA accuracy?
- Case study (manual analysis of the QA results)
Slide 7: QA system
SEMPRE framework [Berant et al., 13].
Three steps of query generation:
1. Alignment: convert entities in the question sentence into "logical forms"
2. Bridging: generate predicates compatible with neighboring predicates
3. Scoring: evaluate candidates using a scoring function
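The three steps above can be illustrated with a deliberately tiny sketch. This is not SEMPRE's actual implementation: the lexicon, the entity/predicate split in `bridging`, and the trivial `scoring` function are all toy assumptions chosen only to show the shape of the pipeline.

```python
# Toy lexicon mapping question words to logical-form fragments.
LEXICON = {"japan": "Country.Japan", "capital": "CapitalCity"}

def alignment(question):
    """Map question words to logical-form fragments via the lexicon."""
    words = question.lower().strip("?").split()
    return [LEXICON[w] for w in words if w in LEXICON]

def bridging(fragments):
    """Combine fragments into candidate (entity, predicate) queries."""
    entities = [f for f in fragments if f.startswith("Country.")]
    predicates = [f for f in fragments if not f.startswith("Country.")]
    return [(e, p) for e in entities for p in predicates]

def scoring(candidates):
    """Pick the best candidate; here, trivially the first one."""
    return candidates[0] if candidates else None

query = scoring(bridging(alignment("Where is the capital of Japan?")))
print(query)  # ('Country.Japan', 'CapitalCity')
```

Note how translation errors propagate: if MT drops or mangles a content word such as "capital", `alignment` produces no predicate fragment and no valid query can be built, which previews the talk's later finding about content words.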
Slide 8: Data set creation
Free917: Training (512 pairs), Dev. (129 pairs), Test (276 pairs).
The test set ("OR" set) is manually translated into Japanese ("JA" set), then translated back into English by five methods, producing the HT, GT, YT, Mo, and Tra sets.
Slide 9: Translation methods
Manual translation ("HT" set): professional human translators.
Commercial MT systems: Google Translate ("GT" set), Yahoo! Translate ("YT" set).
Moses ("Mo" set): phrase-based MT system.
Travatar ("Tra" set): tree-to-string MT system.
Slide 10: Experiments
Evaluation of the translation quality of the created data sets, using the questions in the OR set as the reference.
QA accuracy evaluation using the created data sets, with the same QA model.
Investigation of the correlation between the two.
Slide 11: Metrics for evaluation of translation quality
BLEU+1: evaluates local n-grams.
1-WER: evaluates whole word order strictly.
RIBES: evaluates rank correlation of word order.
NIST: evaluates local word order and the correctness of infrequent words.
Acceptability: human evaluation.
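Of these metrics, 1-WER is the simplest to make concrete. A minimal sketch, assuming whitespace tokenization and normalization by reference length: word error rate is the word-level edit distance to the reference, and subtracting it from 1 makes higher scores better, in line with the other metrics.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def one_minus_wer(hyp_sentence, ref_sentence):
    """1 - WER, with WER normalized by the reference length."""
    hyp, ref = hyp_sentence.split(), ref_sentence.split()
    return 1.0 - edit_distance(hyp, ref) / len(ref)

score = one_minus_wer("where is capital of japan",
                      "where is the capital of japan")
```

Because every misplaced word is penalized, 1-WER punishes reorderings harshly even when the meaning survives, which is the "strictly" on the slide.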
Slide 12: Translation quality
Slide 13: QA accuracy
Slide 14: Translation quality and QA accuracy
Slide 15: Translation quality and QA accuracy
Slide 16: Sentence-level analysis
47% of the questions in the OR set are not answered correctly; these questions might be difficult to answer even with a correct translation result.
Dividing the questions into two groups:
Correct group (141 × 5 = 705 questions): translated from the 141 questions answered correctly in the OR set.
Incorrect group (123 × 5 = 615 questions): translated from the remaining 123 questions in the OR set.
Slide 17: Sentence-level correlation

Metric         Correct group   Incorrect group
BLEU+1         0.900           0.007
1-WER          0.690           0.092
RIBES          0.418           0.311
NIST           0.942           0.210
Acceptability  0.890           0.547
Slide 18: Sentence-level correlation

Metric         Correct group   Incorrect group
BLEU+1         0.900           0.007
1-WER          0.690           0.092
RIBES          0.418           0.311
NIST           0.942           0.210
Acceptability  0.890           0.547

In the incorrect group there is very little correlation: if the reference cannot be answered correctly, the sentences are not suitable, even as negative samples.
NIST has the highest correlation, showing the importance of content words.
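The correlations in the table above are computed between each metric's sentence-level scores and QA outcomes. As a minimal sketch of that computation (the data below are invented for illustration, not the paper's actual scores), the Pearson coefficient can be written in a few lines of pure Python:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-sentence data: an MT metric score for each translated
# question, and whether the QA system answered it correctly (1) or not (0).
metric_scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.1]
qa_correct = [1, 1, 0, 0, 1, 0]
r = pearson(metric_scores, qa_correct)
```

The table's contrast between the two groups then corresponds to running this computation separately over the correct-group and incorrect-group questions.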
Slide 19: Sample 1
Slide 20: Sample 2
Lack of the question type word.
Slide 21: Sample 3
All questions were answered correctly even though they are grammatically incorrect.
Slide 22: Conclusion
The NIST score has the highest correlation; NIST is sensitive to changes in content words.
If the reference cannot be answered correctly, there is very little correlation between translation quality and QA accuracy, so answerable references should be used.
Three factors cause changes in QA results: content words, question types, and syntax.
Slide 23: Sentence-level correlation: BLEU+1
Slide 24: Sentence-level correlation: 1-WER
Slide 25: Sentence-level correlation: RIBES
Slide 26: Sentence-level correlation: NIST
Slide 27: Sentence-level correlation: Acceptability
Slide 28: Sample 2
Slide 29: Sample 4
Slide 30: Sample 4
These questions were answered incorrectly even though they are grammatically correct.
Correct grammar is not so important.