Speaker: Emily Dinan, Facebook AI Research
ConvAI2 Competition: Results and Analysis
Uploaded 2018-12-18
Presentation Transcript


Speaker: Emily Dinan

Facebook AI Research

ConvAI2 Competition: Results and Analysis

Automatic Evaluation: SET-UP

ORIGINAL PERSONA-CHAT DATASET:
Total utterances: 162,064
Total dialogs: 10,907
Total personas: 1,155

VALID SET:
Utterances: 15,602
Dialogs: 1,000
Personas: 100

TEST SET:
Utterances: 15,024
Dialogs: 968
Personas: 100

HIDDEN TEST SET:
Utterances: 13,268
Dialogs: 1,015
Personas: 100

Automatic Evaluation: SET-UP

THREE BASELINES AVAILABLE IN PARLAI (visit http://parl.ai/):

Model                  PPL    Hits@1   F1
KV Profile Memory       -      55.2    11.9
Seq2Seq + Attention    29.8    12.6    16.18
Language Model         46.0     -      15.02

Three automated metrics:

Perplexity

Hits@1 (out of 20 possible candidates)

F1

Participants needed to use the same model for all three metrics, but it did not need to be evaluated on all three.
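For reference, the word-overlap F1 and Hits@1 metrics can be sketched as follows (a simplified illustration, not ParlAI's implementation):

```python
from collections import Counter

def f1_score(pred: str, gold: str) -> float:
    """Word-overlap F1 between a predicted and a gold response."""
    pred_tokens, gold_tokens = pred.split(), gold.split()
    # Multiset intersection counts shared words (with multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def hits_at_1(candidate_scores, gold_index) -> float:
    """1.0 if the gold response is the top-ranked of the 20 candidates."""
    best = max(range(len(candidate_scores)), key=candidate_scores.__getitem__)
    return 1.0 if best == gold_index else 0.0
```

Averaging these over the hidden test set gives the Hits@1 and F1 columns above; perplexity comes from the model's own likelihood.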

Automatic Evaluation: RESULTS

Over 23 teams submitted!

The rank was determined by sorting by the minimum rank of the score in any of the three metrics, where ties were broken by considering the second (and then third) smallest ranks. 

The top 7 teams made it to the next round
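The ranking rule above can be expressed as a short sort over each team's per-metric ranks (an illustrative sketch; the team names and ranks in the test are made up):

```python
def leaderboard(metric_ranks):
    """Order teams by the minimum of their per-metric ranks (rank 1 = best),
    breaking ties by the second-smallest rank, then the third.

    metric_ranks: {team: (ppl_rank, hits_at_1_rank, f1_rank)}.
    Sorting each team's rank tuple puts its best rank first, so Python's
    lexicographic list comparison implements exactly this tie-breaking."""
    return sorted(metric_ranks, key=lambda team: sorted(metric_ranks[team]))
```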

[Automatic evaluation leaderboard tables were shown across several slides.]

Further Analysis: DISTRACTOR CANDIDATES

Added the last partner message to the list of candidates to rank as a distractor

The Hugging Face model was most resistant to this type of attack.
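Susceptibility to this attack can be measured as the fraction of turns where a model prefers the copied partner message to the gold response. A hedged sketch, where `score_fn` is a hypothetical stand-in for a model's candidate-scoring function (not any team's actual code):

```python
def distractor_rate(score_fn, turns):
    """Fraction of turns where the model scores the partner's last message
    (the distractor) at least as high as the gold response.

    turns: list of (context, gold_response, last_partner_message)."""
    fooled = 0
    for context, gold, last_partner in turns:
        if score_fn(context, last_partner) >= score_fn(context, gold):
            fooled += 1
    return fooled / len(turns)
```

A lower rate means the model is more resistant to the distractor, as Hugging Face was here.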

Further Analysis: REVISED PERSONAS

Can these models understand personas (instead of just copying)?

Hugging Face performed the best on the revised test set, with Little Baby close behind.

ORIGINAL → REVISED

I am very shy → I am not a social person
I just got my nails done → I love to pamper myself on a regular basis
I am on a diet now → I need to lose weight

Human Evaluation: Mechanical Turk SET-UP

100 evaluations per model. A Mechanical Turk worker and a model were each assigned a persona and chatted for 4-6 dialog turns each. After the chat, the worker was asked:

"How much did you enjoy talking to this user?" Choices: not at all, a little, somewhat, a lot → 1, 2, 3, 4.

Next, the worker was shown the model's persona plus a random persona and asked:

"Which prompt (character) do you think the other user was given for this conversation?"


Human Evaluation

Wild SET-UP

Evaluation was done in Facebook Messenger and Telegram. Through Wednesday, anyone could message and get paired randomly with one of the bots to have a conversation and rate it.

Human Evaluation: Wild RESULTS

Some conversations were really great! Others… not so much.

There were some problems with spammers in the data collection; after reading through the data, we decided to discount these results.

OPEN PROBLEM: detecting spam

And the winner is… Lost in Conversation!

Human Evaluation: Mechanical Turk RESULTS

LOTS OF VARIANCE!

Human Evaluation: CALIBRATION

Reducing annotator bias with Bayesian calibration! Some annotators are quite harsh while others are quite generous, so the average score has high variance. Method from "Importance of a Search Strategy in Neural Dialogue Modeling," Kulikov et al., 2018 (available on arXiv).
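As a simplified illustration of annotator-bias correction (a plain mean-offset adjustment, not the full Bayesian model of Kulikov et al.):

```python
from collections import defaultdict

def calibrate(ratings):
    """ratings: list of (annotator, model, score) triples on a 1-4 scale.

    Estimate each annotator's bias as their mean score minus the global
    mean, subtract it from their ratings, then average per model. This is
    only a sketch of the idea; the actual method infers biases jointly in
    a Bayesian model."""
    global_mean = sum(s for _, _, s in ratings) / len(ratings)
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)
    bias = {a: sum(v) / len(v) - global_mean for a, v in by_annotator.items()}
    by_model = defaultdict(list)
    for annotator, model, score in ratings:
        by_model[model].append(score - bias[annotator])
    return {m: sum(v) / len(v) for m, v in by_model.items()}
```

Harsh and generous annotators cancel out, so per-model averages have lower variance.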

Human Evaluation: Mechanical Turk RESULTS, BEFORE and AFTER CALIBRATION

Same conclusion after reducing annotator bias.

Human Evaluation: How well did the models use the personas?

Mechanical Turk RESULTS

Every team did better than the baseline except Happy Minions

98% detection rate!
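The detection rate is simply the fraction of workers who picked the bot's true persona over the random one. As a minimal sketch (function and argument names are illustrative):

```python
def persona_detection_rate(guesses):
    """guesses: list of (chosen_persona, true_persona) pairs from workers
    asked which of two personas the bot was given. Returns a percentage."""
    return 100 * sum(chosen == true for chosen, true in guesses) / len(guesses)
```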

Further Analysis

HUMAN & BOT MESSAGE WORD STATISTICS on Mechanical Turk logs:

Model Name             Eval Score   Avg. words (model)   Avg. words (human)   Avg. chars (model)   Avg. chars (human)
Human                  3.46         14.1                 13.7                 59.9                 57.7
Lost in Conversation   3.11         10.18                11.9                 39.2                 48.2
Hugging Face           2.67         11.5                 11.9                 44.4                 49.2
Little Baby            2.4          11.5                 11.3                 51.5                 47.3
Mohd Shadab Alam       2.36         9.5                  10.2                 33.8                 42.5
Happy Minions          1.92         8.0                  10.2                 27.9                 42.5
ADAPT Centre           1.59         15.1                 11.8                 60.0                 48.0

Some correlation between human message length and eval score…
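The length columns above are straightforward per-speaker averages; a minimal sketch:

```python
def message_stats(messages):
    """Average words and characters per message for one speaker's messages
    in a set of conversation logs."""
    n = len(messages)
    avg_words = sum(len(m.split()) for m in messages) / n
    avg_chars = sum(len(m) for m in messages) / n
    return avg_words, avg_chars
```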

Further Analysis

HUMAN & BOT MESSAGE WORD STATISTICS on Mechanical Turk logs:

Model Name             Eval Score   Freq1h (model)   Freq1h (human)   Freq1k (model)   Freq1k (human)   Unique (model)
Human                  3.46         4.8              4.3              17.2             16.3             99%
Lost in Conversation   3.11         2.2              3.4              9.9              13.2             86%
Hugging Face           2.67         2.5              4.2              9.0              15.6             97%
Little Baby            2.4          4.9              3.7              18.3             15.6             91%
Mohd Shadab Alam       2.36         1.3              3.2              9.5              14.1             83%
Happy Minions          1.92         0.3              4.1              4.3              14.3             53%
ADAPT Centre           1.59         1.7              3.5              8.8              15.1             98%

Uniqueness is important for continued engagement with a bot.

Humans use more rare words; otherwise, no clear conclusion.
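Assuming Freq1h and Freq1k measure the percentage of tokens that fall outside the 100 and 1,000 most frequent words of a reference corpus, and Unique the percentage of non-repeated messages (the slide does not define these columns, so this is an assumed reading), a sketch:

```python
from collections import Counter

def rare_word_pct(messages, reference_tokens, top_n=100):
    """Percent of a speaker's tokens outside the top_n most frequent words
    of a reference corpus. Assumed reading of the slide's Freq1h (top_n=100)
    and Freq1k (top_n=1000) columns."""
    common = {w for w, _ in Counter(reference_tokens).most_common(top_n)}
    tokens = [t for m in messages for t in m.split()]
    return 100 * sum(t not in common for t in tokens) / len(tokens)

def uniqueness(messages):
    """Percent of a speaker's messages that are unique (no exact repeats)."""
    return 100 * len(set(messages)) / len(messages)
```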

Further Analysis

BOT MESSAGE WORD STATISTICS on Mechanical Turk logs:

Model Name             Eval Score   Unigram Repeats   Bigram Repeats   Trigram Repeats
Human                  3.46         1.83              2.47             0.51
Lost in Conversation   3.11         2.11              5.6              2.67
Hugging Face           2.67         1.49              5.04             0.6
Little Baby            2.4          2.53              2.69             1.43
Mohd Shadab Alam       2.36         3.48              11.34            7.06
Happy Minions          1.92         1.62              6.56             3.81
ADAPT Centre           1.59         6.74              11.53            1.44

Humans have the lowest repeats; otherwise, no clear conclusion.
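One plausible reading of the repeat statistics is counting how many n-grams a speaker reuses within a conversation (again an assumed reading; the slide does not give the exact definition):

```python
def ngram_repeats(messages, n):
    """Count n-grams a speaker repeats across their own messages in one
    conversation: each occurrence of an already-seen n-gram adds one."""
    seen, repeats = set(), 0
    for m in messages:
        tokens = m.split()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram in seen:
                repeats += 1
            seen.add(gram)
    return repeats
```

Averaging over conversations would give per-model numbers comparable to the table above.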

UNDERSTANDING THE HUMAN EVALUATION RESULTS

Blind Evaluation: we randomly sampled logs from Hugging Face and Lost in Conversation.

AVERAGE RATINGS FOR THIS SUBSET OF CONVERSATIONS:

Lost in Conversation: TURKER: 3.29, EMILY: 2.78, JASON: 2.71
Hugging Face: TURKER: 2.8, EMILY: 2.47, JASON: 2

Jason is harsher than Emily, who is harsher than Turkers (different biases), but all three agree on which model is better.

BLIND EVALUATION SAMPLES

HUGGING FACE

"good at answering questions, but asks a question that's already been answered"
"karaoke repeat, otherwise ok. bit boring"
"contradicts itself twice here"
"asked too many questions, and contradicted itself a couple times"
"nice acknowledgement about dogs by the model, makes a slight mistake by asking about what kind of music the person plays"
"some detail mistakes (rock, artist), otherwise ok"
"too many questions, changes topics a lot"

LOST IN CONVERSATION

"not super interesting, but it's able to respond well to the comment about reading and cooking"
"pretty good up until the last utterance (weird follow-up to 'I'm a student who plays baseball')"
"v good. e.g. the position stuff"
"not that bad, just really uninteresting"
"asks what the person does for a living twice…"
"repeat mistake"
"too much awesome. otherwise good"
"this conversation is super coherent, and the model responds well to the users messages"

Blind Evaluation: Hugging Face (BOT IN GREEN)

Jason: "work question is already answered, never really answers, just goes to another question. repeat travel question" 1/4
Emily: "repeats the question about traveling, tends to ask a lot of questions, making the conversation hop between subjects rather quickly" 2/4

Blind Evaluation: Hugging Face (BOT IN BLUE)

Jason: "alaska mistake, seems to ignore school. bit boring" 2/4
Emily: "sort of contradicts itself about Alaska (it says it's from there, then it says it has been there, which is consistent but unnatural)" 2/4

Blind Evaluation: Lost in Conversation (BOT IN GREEN)

Jason: "v good. e.g. the position stuff" 4/4
Emily: "this conversation is super coherent, and the model responds well to the user's messages" 4/4

Blind Evaluation: Lost in Conversation (BOT IN BLUE)

Jason: "too much awesome. otherwise good" 2/4
Emily: "it says 'that's awesome' three times" 2/4

UNDERSTANDING THE HUMAN EVALUATION RESULTS

Blind Evaluation: Hugging Face vs. Lost in Conversation

UNDERSTANDING THE EVALUATION RESULTS

Human Evaluation Statistics

Hugging Face asks too many questions!

Team: ADAPT Centre (Mechanical Turk RESULTS, SCORE: 2/4, BOT IN BLUE)

Lots of repeats in this conversation. Somewhat nonsensical replies, like "I love the beach so I have to go with my new job."

Team: Happy Minions (Mechanical Turk RESULTS, SCORE: 1/4, BOT IN GREEN)

Lots of repeats here; says "I am not sure what you/that means" 3 times.

Team: Mohd Shadab Alam (Mechanical Turk RESULTS, SCORE: 1/4, BOT IN GREEN)

Short and repetitive sentences.

Team: Little Baby (AI小奶娃) (Mechanical Turk RESULTS, SCORE: 1/4, BOT IN GREEN)

Some nonsensical or random responses.

Team: Hugging Face (Mechanical Turk RESULTS, SCORE: 2/4, BOT IN BLUE)

While best at the automatic evaluations, it seems to ask too many questions. This can make the conversations feel disjointed.

Team: Lost in Conversation (Mechanical Turk RESULTS, SCORE: 4/4, BOT IN GREEN)

Seems to be good at answering questions.

Stuff My Bot Says