Slide 1
Speaker: Emily Dinan
Facebook AI Research
ConvAI2 Competition: Results and Analysis
Slide 2
Automatic Evaluation
SET-UP
ORIGINAL PERSONA-CHAT DATASET:
TOTAL utterances: 162,064
TOTAL dialogs: 10,907
TOTAL personas: 1,155
VALID SET: 15,602 utterances; 1,000 dialogs; 100 personas
TEST SET: 15,024 utterances; 968 dialogs; 100 personas
HIDDEN TEST SET: 13,268 utterances; 1,015 dialogs; 100 personas
Slide 3
Automatic Evaluation
SET-UP
THREE BASELINES AVAILABLE IN PARLAI (visit http://parl.ai/):

Model                 PPL    Hits@1   F1
KV Profile Memory     -      55.2     11.9
Seq2Seq + Attention   29.8   12.6     16.18
Language Model        46.0   -        15.02
Three automated metrics:
Perplexity
Hits@1 (out of 20 possible candidates)
F1
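As a concrete reference, here is a minimal sketch of how word-overlap F1 and Hits@1 can be computed. The exact tokenization and normalization used by the official ParlAI evaluation code may differ.

```python
from collections import Counter

def f1_score(prediction, reference):
    """Word-overlap F1 between a model reply and the gold reply."""
    pred, ref = prediction.split(), reference.split()
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

def hits_at_1(scores, gold_index):
    """Hits@1: did the model score the gold candidate highest
    among the 20 candidates?"""
    return int(max(range(len(scores)), key=scores.__getitem__) == gold_index)
```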
Participants had to use the same model for all three metrics, but were not required to be evaluated on all three.
Slide 4
Automatic Evaluation
RESULTS
Over 23 teams submitted!
The ranking was determined by sorting on the minimum rank of each team's score across the three metrics, with ties broken by the second (and then third) smallest rank.
The top 7 teams made it to the next round.
Slide 5
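The ranking rule can be sketched in a few lines. This sketch assumes every team reports all three metrics; in the actual competition a team could skip a metric, which would need extra handling.

```python
def rank_teams(scores):
    """Rank teams by the ConvAI2 rule: sort by the best (minimum) rank a
    team achieves on any metric, breaking ties by its second- and then
    third-best ranks.  `scores` maps team -> (ppl, hits_at_1, f1)."""
    teams = list(scores)
    ranks = {t: [] for t in teams}
    # Per-metric ranks (1 = best).  PPL: lower is better; Hits@1 and F1:
    # higher is better.
    for metric, lower_is_better in [(0, True), (1, False), (2, False)]:
        ordered = sorted(teams, key=lambda t: scores[t][metric],
                         reverse=not lower_is_better)
        for r, t in enumerate(ordered, 1):
            ranks[t].append(r)
    # Sorting each team's rank list puts its minimum rank first, so
    # lexicographic comparison implements the tie-breaking rule.
    return sorted(teams, key=lambda t: sorted(ranks[t]))
```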
Automatic Evaluation
RESULTS
[The results tables shown on slides 5-7 are not captured in this transcript.]
Slide 8
Further Analysis
DISTRACTOR CANDIDATES
Added the last partner message to the list of candidates to rank as a distractor.
The Hugging Face model was the most resistant to this type of attack.
Slide 9
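The distractor setup can be sketched as follows; `score_fn` is a hypothetical stand-in for a model's candidate scorer, not an actual competition interface.

```python
def distractor_hits_at_1(score_fn, candidates, gold_index, last_partner_message):
    """Re-run Hits@1 with the partner's last utterance appended to the
    candidate list as a distractor.  A model that naively echoes its
    context will tend to rank the distractor highly and fail."""
    cands = candidates + [last_partner_message]
    best = max(range(len(cands)), key=lambda i: score_fn(cands[i]))
    return int(best == gold_index)
```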
Further Analysis
REVISED PERSONAS
Can these models understand personas (instead of just copying them)?
Hugging Face performed best on the revised test set, with Little Baby close behind.

ORIGINAL                   REVISED
I am very shy              I am not a social person
I just got my nails done   I love to pamper myself on a regular basis
I am on a diet now         I need to lose weight
Slide 10
Human Evaluation
Mechanical Turk SET-UP
100 evaluations per model.
A Mechanical Turk worker and a model were each assigned a persona and chatted for 4-6 dialog turns each. After the chat, the worker was asked:
"How much did you enjoy talking to this user?" Choices: not at all, a little, somewhat, a lot (scored 1, 2, 3, 4).
Next, the worker was shown the model's persona alongside a random persona and asked:
"Which prompt (character) do you think the other user was given for this conversation?"
Slide 11
Human Evaluation
Mechanical Turk SET-UP
Slide 12
Human Evaluation
Wild SET-UP
Evaluation was done in Facebook Messenger and Telegram. Through Wednesday, anyone could message and get paired randomly with one of the bots to have a conversation and rate it.
Slide 13
Human Evaluation
Wild RESULTS
Some conversations were really great!
Slide 14
Human Evaluation
Wild RESULTS
Some conversations were really great!
Others… not so much.
Slide 15
Human Evaluation
Wild RESULTS
There were problems with spammers in the data collection; after reading through the data, we decided to discount these results.
OPEN PROBLEM: detecting spam
Slide 16
And the winner is…
Slide 17
And the winner is…
Lost in Conversation!
Slide 18
Human Evaluation
Mechanical Turk RESULTS
Slide 19
Human Evaluation
Mechanical Turk RESULTS
LOTS OF VARIANCE!
Slide 20
Human Evaluation
Mechanical Turk RESULTS
CALIBRATION
Reducing annotator bias with Bayesian calibration! Some annotators are quite harsh while others are quite generous, so the average score has high variance.
Method from "Importance of a Search Strategy in Neural Dialogue Modeling," Kulikov et al., 2018 (available on arXiv).
SET-UP:
Slide 21
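The actual method is the Bayesian model of Kulikov et al.; as a rough illustration of the underlying idea only (not their estimator), one can remove each annotator's mean offset from the global mean before averaging per-model scores:

```python
from collections import defaultdict

def calibrate(ratings):
    """Simplified annotator-bias correction: estimate each annotator's
    bias as the deviation of their mean score from the global mean, then
    subtract it from their ratings before averaging per model.
    `ratings` is a list of (annotator, model, score) triples."""
    global_mean = sum(s for _, _, s in ratings) / len(ratings)
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)
    bias = {a: sum(v) / len(v) - global_mean for a, v in by_annotator.items()}
    by_model = defaultdict(list)
    for annotator, model, score in ratings:
        by_model[model].append(score - bias[annotator])
    return {m: sum(v) / len(v) for m, v in by_model.items()}
```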
Human Evaluation
Mechanical Turk RESULTS
AFTER CALIBRATION
Slide 22
Human Evaluation
Mechanical Turk RESULTS
BEFORE CALIBRATION vs. AFTER CALIBRATION
Same conclusion after reducing annotator bias.
Slide 23
Human Evaluation
Mechanical Turk RESULTS
How well did the models use the personas?
Every team did better than the baseline except Happy Minions.
98% detection rate!
Slide 24
Further Analysis
HUMAN & BOT MESSAGE WORD STATISTICS on Mechanical Turk Logs

Model Name            Human Eval Score   Avg. # words (model)   Avg. # words (human)   Avg. # chars (model)   Avg. # chars (human)
Human                 3.46               14.1                   13.7                   59.9                   57.7
Lost in Conversation  3.11               10.18                  11.9                   39.2                   48.2
Hugging Face          2.67               11.5                   11.9                   44.4                   49.2
Little Baby           2.4                11.5                   11.3                   51.5                   47.3
Mohd Shadab Alam      2.36               9.5                    10.2                   33.8                   42.5
Happy Minions         1.92               8.0                    10.2                   27.9                   42.5
ADAPT Centre          1.59               15.1                   11.8                   60.0                   48.0

Some correlation between human message length and eval score…
Slide 25
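The per-model averages above can be reproduced from chat logs with a simple helper. Whitespace tokenization is assumed here; the exact tokenizer used in the competition analysis is not specified.

```python
def message_stats(messages):
    """Average words and characters per message, as in the table above."""
    n = len(messages)
    avg_words = sum(len(m.split()) for m in messages) / n
    avg_chars = sum(len(m) for m in messages) / n
    return avg_words, avg_chars
```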
Further Analysis
HUMAN & BOT MESSAGE WORD STATISTICS on Mechanical Turk Logs

Model Name            Human Eval Score   Freq1h (model)   Freq1h (human)   Freq1k (model)   Freq1k (human)   Unique (model)
Human                 3.46               4.8              4.3              17.2             16.3             99%
Lost in Conversation  3.11               2.2              3.4              9.9              13.2             86%
Hugging Face          2.67               2.5              4.2              9.0              15.6             97%
Little Baby           2.4                4.9              3.7              18.3             15.6             91%
Mohd Shadab Alam      2.36               1.3              3.2              9.5              14.1             83%
Happy Minions         1.92               0.3              4.1              4.3              14.3             53%
ADAPT Centre          1.59               1.7              3.5              8.8              15.1             98%

Uniqueness is important for continued engagement with a bot.
Humans use more rare words; otherwise no clear conclusion.
Slide 26
Further Analysis
BOT MESSAGE WORD STATISTICS on Mechanical Turk Logs

Model Name            Human Eval Score   Unigram Repeats   Bigram Repeats   Trigram Repeats
Human                 3.46               1.83              2.47             0.51
Lost in Conversation  3.11               2.11              5.6              2.67
Hugging Face          2.67               1.49              5.04             0.6
Little Baby           2.4                2.53              2.69             1.43
Mohd Shadab Alam      2.36               3.48              11.34            7.06
Happy Minions         1.92               1.62              6.56             3.81
ADAPT Centre          1.59               6.74              11.53            1.44

Humans have the lowest repeats; otherwise no clear conclusion.
Slide 27
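One way such a repeat statistic can be computed is to count every n-gram occurrence beyond its first use across a speaker's messages; the exact definition used in the competition analysis is not specified.

```python
from collections import Counter

def ngram_repeats(messages, n):
    """Count repeated n-grams across a speaker's messages: the number of
    n-gram occurrences beyond each n-gram's first use."""
    counts = Counter()
    for message in messages:
        tokens = message.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return sum(c - 1 for c in counts.values() if c > 1)
```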
UNDERSTANDING THE HUMAN EVALUATION RESULTS
Blind Evaluation
Randomly sampled logs from Hugging Face and Lost in Conversation.
AVERAGE RATINGS FOR THIS SUBSET OF CONVERSATIONS:
Lost in Conversation: TURKER 3.29, EMILY 2.78, JASON 2.71
Hugging Face: TURKER 2.8, EMILY 2.47, JASON 2
Jason is harsher than Emily, who is harsher than the Turkers (different biases), but all three agree on which model is better.
Slide 28
BLIND EVALUATION SAMplesHUGGING FACE
“good at answering questions, but asks a question that's already been answered”“karaoke repeat, otherwise ok. bit boring”
“contradicts itself twice here”“asked too many questions, and contradicted itself a couple times”“nice acknowledgement about dogs by the model, makes a slight mistake by asking about what kind of music the person plays”“some detail mistakes (rock, artist), otherwise ok”
“too many questions, changes topics a lot”
LOST IN CONVERSATION
“not super interesting, but it's able to respond well to the comment about reading and cooking”
“pretty good up until the last utterance (weird follow-up to "I'm a student who plays baseball")”
“v good. e.g. the position stuff”
“not that bad, just really uninteresting”
“asks what the person does for a living twice…”“repeat mistake”
“too much awesome. otherwise good”
“this conversation is super coherent, and the model responds well to the users messages”Slide29
Blind Evaluation
Hugging Face
Jason: work question is already answered, never really answers, just goes to another question. repeat travel question. 1/4
Emily: repeats the question about traveling, tends to ask a lot of questions, making the conversation hop between subjects rather quickly. 2/4
BOT IN GREEN
Slide 30
Blind Evaluation
Hugging Face
Jason: alaska mistake, seems to ignore school. bit boring. 2/4
Emily: sort of contradicts itself about alaska (it says it's from there, then it says it has been there, which is consistent but unnatural). 2/4
BOT IN BLUE
Slide 31
Blind Evaluation
Lost in Conversation
Jason: v good. e.g. the position stuff. 4/4
Emily: this conversation is super coherent, and the model responds well to the user's messages. 4/4
BOT IN GREEN
Slide 32
Blind Evaluation
Lost in Conversation
Jason: too much awesome. otherwise good. 2/4
Emily: it says "that's awesome" three times. 2/4
BOT IN BLUE
Slide 33
UNDERSTANDING THE HUMAN EVALUATION RESULTS
Blind Evaluation
Hugging Face
Lost in Conversation
Slide 34
UNDERSTANDING THE EVALUATION RESULTS
Human Evaluation Statistics
Hugging Face asks too many questions!
Slide 35
UNDERSTANDING THE EVALUATION RESULTS
Human Evaluation Statistics
Slide 36
Team: ADAPT Centre
Mechanical Turk RESULTS
Lots of repeats in this conversation.
Somewhat nonsensical replies, like "I love the beach so I have to go with my new job".
SCORE: 2/4
BOT IN BLUE
Slide 37
Team: Happy Minions
Mechanical Turk RESULTS
Lots of repeats here; says "I am not sure what you/that means" 3 times.
SCORE: 1/4
BOT IN GREEN
Slide 38
Team: Mohd Shadab Alam
Mechanical Turk RESULTS
Short and repetitive sentences.
SCORE: 1/4
BOT IN GREEN
Slide 39
Team: Little Baby (AI小奶娃)
Mechanical Turk RESULTS
Some nonsensical or random responses.
SCORE: 1/4
BOT IN GREEN
Slide 40
Team: Hugging Face
Mechanical Turk RESULTS
While best at the automatic evaluations, it seems to ask too many questions, which can make the conversations feel disjointed.
SCORE: 2/4
BOT IN BLUE
Slide 41
Team: Lost in Conversation
Mechanical Turk RESULTS
Seems to be good at answering questions.
SCORE: 4/4
BOT IN GREEN
Slide 42
STUFF MY BOT SAYS