Slide 1: An Investigation of Machine Translation Evaluation Metrics in Cross-lingual Question Answering
Kyoshiro Sugiyama, Masahiro Mizukami, Graham Neubig, Koichiro Yoshino, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura
NAIST, Japan
Slide 2: Question answering (QA)
One of the techniques for information retrieval.
Input: question. Output: answer.
Example: "Where is the capital of Japan?" → retrieval from an information source → "Tokyo."
Slide 3: QA using knowledge bases
Convert the question sentence into a query.
Low ambiguity, but the knowledge base is linguistically restricted, so cross-lingual QA is necessary.
Example: "Where is the capital of Japan?"
Query: Type.Location, Country.Japan.CapitalCity
Response: Location.City.Tokyo → "Tokyo."
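The query-and-response flow on this slide can be sketched as a lookup against a small in-memory knowledge base. This is only a toy illustration, not the actual system: the KB contents and the `run_query` helper are invented, and only the logical-form identifiers come from the slide.

```python
# Toy knowledge base keyed by (entity, predicate) pairs; the entries
# mirror the slide's example, the second one is invented for contrast.
KB = {
    ("Country.Japan", "CapitalCity"): "Location.City.Tokyo",
    ("Country.France", "CapitalCity"): "Location.City.Paris",
}

def run_query(entity, predicate):
    """Return the object of (entity, predicate) in the toy KB, or None."""
    return KB.get((entity, predicate))

# "Where is the capital of Japan?" -> query -> response
response = run_query("Country.Japan", "CapitalCity")
print(response)  # Location.City.Tokyo
```

The point the slide makes is that this lookup step is unambiguous once the query exists; the hard, language-dependent part is producing the query from the question sentence.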
Slide 4: Cross-lingual QA (CLQA)
The question sentence (in any language) differs linguistically from the information source.
Example: 「日本の首都はどこ？」 ("Where is the capital of Japan?")
Query: Type.Location, Country.Japan.CapitalCity
Response: Location.City.Tokyo → 東京 ("Tokyo")
Creating a direct mapping from each language to the knowledge base has high cost and is not re-usable in other languages.
Slide 5: CLQA using machine translation
Machine translation (MT) can be used to perform CLQA: easy, low cost, and usable in many languages.
However, QA accuracy depends on MT quality.
Example: 「日本の首都はどこ？」 is machine-translated into "Where is the capital of Japan?", the existing QA system answers "Tokyo", and the answer is machine-translated back into 東京.
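The MT-based pipeline above can be sketched as three stages: translate the question, answer it with the existing QA system, translate the answer back. This is a stand-in sketch only: the dictionary-based `translate` and `answer` functions below are placeholders for a real MT engine and a real KB-backed QA system.

```python
# Placeholder translation tables and QA lookup for the slide's example.
JA_EN = {"日本の首都はどこ？": "Where is the capital of Japan?"}
EN_JA = {"Tokyo": "東京"}
QA = {"Where is the capital of Japan?": "Tokyo"}

def translate(text, table):
    """Stand-in for an MT system: dictionary lookup, identity on miss."""
    return table.get(text, text)

def answer(question_en):
    """Stand-in for the existing English QA system."""
    return QA.get(question_en)

def clqa(question_ja):
    question_en = translate(question_ja, JA_EN)  # MT: source language -> English
    answer_en = answer(question_en)              # existing English QA system
    return translate(answer_en, EN_JA)           # MT: English -> source language

print(clqa("日本の首都はどこ？"))  # 東京
```

In the real pipeline every stage can introduce errors, which is exactly why the talk asks which MT metric best predicts downstream QA accuracy.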
Slide 6: Purpose of our work
To clarify how translation affects QA accuracy:
Which MT metrics are suitable for the CLQA task?
- Creation of QA data sets using various translation systems
- Evaluation of translation quality and QA accuracy
What kind of translation results influence QA accuracy?
- Case study (manual analysis of the QA results)
Slide 7: QA system
SEMPRE framework [Berant et al., 13].
Three steps of query generation:
1. Alignment: convert entities in the question sentence into "logical forms"
2. Bridging: generate predicates compatible with neighboring predicates
3. Scoring: evaluate candidates using a scoring function
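The three steps above can be illustrated with a deliberately tiny sketch. This is not SEMPRE's actual implementation: the lexicon, the entity/predicate split in `bridging`, and the trivial `scoring` function are all toy assumptions chosen only to show the shape of the pipeline.

```python
# Toy lexicon mapping question words to logical-form fragments.
LEXICON = {"japan": "Country.Japan", "capital": "CapitalCity"}

def alignment(question):
    """Map question words to logical-form fragments via the lexicon."""
    words = question.lower().strip("?").split()
    return [LEXICON[w] for w in words if w in LEXICON]

def bridging(fragments):
    """Combine fragments into candidate (entity, predicate) queries."""
    entities = [f for f in fragments if f.startswith("Country.")]
    predicates = [f for f in fragments if not f.startswith("Country.")]
    return [(e, p) for e in entities for p in predicates]

def scoring(candidates):
    """Pick the best candidate; here, trivially the first one."""
    return candidates[0] if candidates else None

query = scoring(bridging(alignment("Where is the capital of Japan?")))
print(query)  # ('Country.Japan', 'CapitalCity')
```

Note how translation errors propagate: if MT drops or mangles a content word such as "capital", `alignment` produces no predicate fragment and no valid query can be built, which previews the talk's later finding about content words.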
Slide 8: Data set creation
Free917: Training (512 pairs), Dev. (129 pairs), Test (276 pairs).
The test set ("OR" set) is manually translated into Japanese ("JA" set), then translated back into English by five methods, producing the HT, GT, YT, Mo, and Tra sets.
Slide 9: Translation methods
Manual translation ("HT" set): professional human translators.
Commercial MT systems: Google Translate ("GT" set), Yahoo! Translate ("YT" set).
Moses ("Mo" set): phrase-based MT system.
Travatar ("Tra" set): tree-to-string MT system.
Slide 10: Experiments
Evaluation of the translation quality of the created data sets, using the questions in the OR set as the reference.
QA accuracy evaluation using the created data sets, with the same QA model.
Investigation of the correlation between the two.
Slide 11: Metrics for evaluation of translation quality
BLEU+1: evaluates local n-grams.
1-WER: evaluates whole word order strictly.
RIBES: evaluates rank correlation of word order.
NIST: evaluates local word order and the correctness of infrequent words.
Acceptability: human evaluation.
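Of these metrics, 1-WER is the simplest to make concrete. A minimal sketch, assuming whitespace tokenization and normalization by reference length: word error rate is the word-level edit distance to the reference, and subtracting it from 1 makes higher scores better, in line with the other metrics.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance via dynamic programming."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def one_minus_wer(hyp_sentence, ref_sentence):
    """1 - WER, with WER normalized by the reference length."""
    hyp, ref = hyp_sentence.split(), ref_sentence.split()
    return 1.0 - edit_distance(hyp, ref) / len(ref)

score = one_minus_wer("where is capital of japan",
                      "where is the capital of japan")
```

Because every misplaced word is penalized, 1-WER punishes reorderings harshly even when the meaning survives, which is the "strictly" on the slide.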
Slide 12: Translation quality
Slide 13: QA accuracy
Slide 14: Translation quality and QA accuracy
Slide 15: Translation quality and QA accuracy
Slide 16: Sentence-level analysis
47% of the questions in the OR set are not answered correctly; these questions might be difficult to answer even with a correct translation result.
Dividing the questions into two groups:
Correct group (141 × 5 = 705 questions): translated from the 141 questions answered correctly in the OR set.
Incorrect group (123 × 5 = 615 questions): translated from the remaining 123 questions in the OR set.
Slide 17: Sentence-level correlation

Metric         Correct group   Incorrect group
BLEU+1         0.900           0.007
1-WER          0.690           0.092
RIBES          0.418           0.311
NIST           0.942           0.210
Acceptability  0.890           0.547
Slide 18: Sentence-level correlation

Metric         Correct group   Incorrect group
BLEU+1         0.900           0.007
1-WER          0.690           0.092
RIBES          0.418           0.311
NIST           0.942           0.210
Acceptability  0.890           0.547

In the incorrect group there is very little correlation: if the reference cannot be answered correctly, the sentences are not suitable, even as negative samples.
NIST has the highest correlation, showing the importance of content words.
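The correlations in the table above are computed between each metric's sentence-level scores and QA outcomes. As a minimal sketch of that computation (the data below are invented for illustration, not the paper's actual scores), the Pearson coefficient can be written in a few lines of pure Python:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-sentence data: an MT metric score for each translated
# question, and whether the QA system answered it correctly (1) or not (0).
metric_scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.1]
qa_correct = [1, 1, 0, 0, 1, 0]
r = pearson(metric_scores, qa_correct)
```

The table's contrast between the two groups then corresponds to running this computation separately over the correct-group and incorrect-group questions.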
Slide 19: Sample 1
Slide 20: Sample 2
Lack of the question type word.
Slide 21: Sample 3
All questions were answered correctly even though they are grammatically incorrect.
Slide 22: Conclusion
The NIST score has the highest correlation; NIST is sensitive to changes in content words.
If the reference cannot be answered correctly, there is very little correlation between translation quality and QA accuracy, so answerable references should be used.
Three factors cause changes in QA results: content words, question types, and syntax.
Slide 23: Sentence-level correlation: BLEU+1
Slide 24: Sentence-level correlation: 1-WER
Slide 25: Sentence-level correlation: RIBES
Slide 26: Sentence-level correlation: NIST
Slide 27: Sentence-level correlation: Acceptability
Slide 28: Sample 2
Slide 29: Sample 4
Slide 30: Sample 4
These questions were answered incorrectly even though they are grammatically correct.
Correct grammar is not so important.