Generating Natural-Language Video Descriptions Using Text-Mined Knowledge

Uploaded On 2019-11-02


Presentation Transcript

Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
Ray Mooney
Department of Computer Science
University of Texas at Austin
Joint work with Niveda Krishnamoorthy, Girish Malkarnenkar, Kate Saenko, Sergio Guadarrama

Video Description Dataset (Chen & Dolan, ACL 2011)
2,089 YouTube videos with 122K multi-lingual descriptions.
Originally collected for paraphrase and machine translation examples.
Available at: http://www.cs.utexas.edu/users/ml/clamp/videoDescription/

Sample Video

Sample M-Turk Human Descriptions (average ~50 per video)
A MAN PLAYING WITH TWO DOGS
A man takes a walk in a field with his dogs.
A man training the dogs in a big field.
A person is walking his dogs.
A woman is walking her dogs.
A woman is walking with dogs in a field.
A woman is walking with four dogs outside.
A woman walks across a field with several dogs.
All dogs are going along with the woman.
dogs are playing
Dogs follow a man.
Several dogs follow a person.
some dog playing each other
Someone walking in a field with dogs.
very cute dogs
A MAN IS GOING WITH A DOG.
four dogs are walking with woman in field
the man and dogs walking the forest
Dogs are Walking with a Man.
The woman is walking her dogs.
A person is walking some dogs.
A man walks with his dogs in the field.
A man is walking dogs.
a dogs are running
A guy is training his dogs
A man is walking with dogs.
a men and some dog are running
A men walking with dogs.
A person is walking with dogs.
A woman is walking her dogs.
Somebody walking with his/her pets.
the man is playing with the dogs.
A guy training his dogs.
A lady is roaming in the field with his dogs.
A lady playing with her dogs.
A man and 4 dogs are walking through a field.
A man in a field playing with dogs.
A man is playing with dogs.

Our Video Description Task
Generate a short, declarative sentence describing a video in this corpus.
First generate a subject (S), verb (V), object (O) triplet for describing the video: <cat, play, ball>
Next generate a grammatical sentence from this triplet: A cat is playing with a ball.

A person is riding a motorbike.
SUBJECT: person
VERB: ride
OBJECT: motorbike

OBJECT DETECTIONS
cow 0.11, person 0.42, table 0.07, aeroplane 0.05, dog 0.15, motorbike 0.51, train 0.17, car 0.29

SORTED OBJECT DETECTIONS
motorbike 0.51, person 0.42, car 0.29, aeroplane 0.05, …

VERB DETECTIONS
hold 0.23, drink 0.11, move 0.34, dance 0.05, slice 0.13, climb 0.17, shoot 0.07, ride 0.19

SORTED VERB DETECTIONS
move 0.34, hold 0.23, ride 0.19, dance 0.05, …

EXPAND VERBS
move: move 1.0, walk 0.8, pass 0.8, ride 0.8
hold: hold 1.0, keep 1.0
ride: ride 1.0, go 0.8, move 0.8, walk 0.7
dance: dance 1.0, turn 0.7, jump 0.7, hop 0.6

OBJECTS, VERBS, EXPANDED VERBS
Web-scale text corpora: GigaWord, BNC, ukWaC, WaCkypedia, GoogleNgrams
Get dependency parses. Example sentence: A man rides a horse
det(man-2, A-1)
nsubj(rides-3, man-2)
root(ROOT-0, rides-3)
det(horse-5, a-4)
dobj(rides-3, horse-5)
Subject-Verb-Object triplet: <person, ride, horse>
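The step from a dependency parse to an SVO triplet can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `extract_svo` and the regular expression are assumptions, and the sketch only pairs `nsubj` and `dobj` relations that share a head verb. It does not lemmatize ("rides" to "ride") or map nouns to detector classes ("man" to "person"), both of which the slide's final triplet implies.

```python
import re

# Matches one Stanford-style typed dependency, e.g. "nsubj(rides-3, man-2)".
DEP_PATTERN = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def extract_svo(dependencies):
    """Return (subject, verb, object) from dependency strings, or None."""
    subjects, objects = {}, {}
    for dep in dependencies:
        match = DEP_PATTERN.fullmatch(dep.strip())
        if match is None:
            continue  # skip lines that are not typed dependencies
        relation, head, head_idx, dependent, _ = match.groups()
        key = (head, head_idx)  # identify the head word by form and position
        if relation == "nsubj":
            subjects[key] = dependent
        elif relation == "dobj":
            objects[key] = dependent
    # A triplet needs a subject and a direct object attached to the same verb.
    for key, subject in subjects.items():
        if key in objects:
            return (subject, key[0], objects[key])
    return None


parse = [
    "det(man-2, A-1)",
    "nsubj(rides-3, man-2)",
    "root(ROOT-0, rides-3)",
    "det(horse-5, a-4)",
    "dobj(rides-3, horse-5)",
]
triple = extract_svo(parse)
print(triple)
```

On the slide's example parse this yields the surface forms man, rides, horse; normalizing them to <person, ride, horse> would be a separate step.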

OBJECTS, VERBS, EXPANDED VERBS
Web-scale text corpora: GigaWord, BNC, ukWaC, WaCkypedia, GoogleNgrams
SVO Language Model: <person, ride, horse>, <person, walk, dog>, <person, hit, ball>, ...
Regular Language Model
CONTENT PLANNING: <person, ride, motorbike>
SURFACE REALIZATION: A person is riding a motorbike.

Selecting SVO Just Using Vision (Baseline)
Top object detection from vision = Subject
Next highest object detection = Object
Top activity detection = Verb

Sample SVO Selection
Top object detections:
person: 0.67
motorbike: 0.56
dog: 0.11
Top activity detections:
ride: 0.41
keep_hold: 0.32
lift: 0.23
Vision triplet: (person, ride, motorbike)
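The vision-only baseline amounts to two sorts. The sketch below reproduces the sample selection above; the function name `vision_baseline_svo` and the dictionary representation of detector scores are assumptions made for illustration.

```python
def vision_baseline_svo(object_scores, verb_scores):
    """Pick (subject, verb, object) from detector confidences alone."""
    ranked_objects = sorted(object_scores, key=object_scores.get, reverse=True)
    ranked_verbs = sorted(verb_scores, key=verb_scores.get, reverse=True)
    # Top object is the subject, runner-up object is the object,
    # top activity is the verb.
    return (ranked_objects[0], ranked_verbs[0], ranked_objects[1])


object_scores = {"person": 0.67, "motorbike": 0.56, "dog": 0.11}
verb_scores = {"ride": 0.41, "keep_hold": 0.32, "lift": 0.23}
baseline = vision_baseline_svo(object_scores, verb_scores)
print(baseline)
```

Because the roles are assigned purely by rank, swapping the two top object confidences swaps subject and object, which is the failure mode the later "Vision Detections are Faulty!" slide illustrates.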

Evaluating SVO Triples
A ground-truth SVO for a test video is determined by picking the most common S, V, and O used to describe this video (as determined by dependency parsing).
Predicted S, V, and O are compared to ground-truth using two metrics:
Binary: 1 or 0 for exact match or not
WUP: Compare predicted word to ground truth using WUP semantic word similarity score from WordNet Similarity (0≤WUP≤1)
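A small sketch of the ground-truth selection and the binary metric, assuming each human description has already been parsed into an (S, V, O) tuple. The function names and the toy data are hypothetical, and the "All" column is assumed here to mean that all three slots match. The WUP metric is omitted because it needs a WordNet similarity library.

```python
from collections import Counter


def ground_truth_svo(parsed_descriptions):
    """Most common subject, verb, and object, each chosen independently."""
    subjects, verbs, objects = zip(*parsed_descriptions)
    return tuple(
        Counter(slot).most_common(1)[0][0]
        for slot in (subjects, verbs, objects)
    )


def binary_scores(predicted, truth):
    """Return [subject, verb, object, all] as 1 (exact match) or 0."""
    per_slot = [int(p == t) for p, t in zip(predicted, truth)]
    return per_slot + [int(all(per_slot))]


# Toy parsed descriptions for one video (illustrative, not from the dataset).
parsed = [
    ("person", "walk", "dog"),
    ("woman", "walk", "dog"),
    ("person", "play", "dog"),
    ("person", "walk", "dog"),
]
truth = ground_truth_svo(parsed)
scores = binary_scores(("person", "move", "dog"), truth)
print(truth, scores)
```

Averaging these per-video 0/1 scores over the test set would give per-column binary accuracies.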

Test Data
Selected 185 test videos that contain one of the 20 detectable objects and 58 detectable activities based on their words (or synonyms) appearing in their human descriptions.

Baseline SVO Results

Binary Accuracy
                   Subject    Verb      Object    All
Vision baseline    71.35%     8.65%     29.19%    1.62%

WUP Accuracy
                   Subject    Verb      Object    All
Vision baseline    87.76%     40.20%    61.18%    63.05%

Vision Detections are Faulty!
Top object detections:
motorbike: 0.67
person: 0.56
dog: 0.11
Top activity detections:
go_run_bowl_move: 0.41
ride: 0.32
lift: 0.23
Vision triplet: (motorbike, go_run_bowl_move, person)

Using Text-Mining to Determine SVO Plausibility
Build a probabilistic model to predict the real-world likelihood of a given SVO.
P(person,ride,motorbike) > P(motorbike,run,person)
Run the Stanford dependency parser on a large text corpus, and extract the S, V, and O for each sentence.
Train a trigram language model on this SVO data, using Kneser-Ney smoothing to back off to SV and VO bigrams.

SVO Language Model
Candidate triplets: <person, park, bat>, <person, ride, motorcycle>, <person, walk, dog>, <car, move, bag>, <car, move, motorcycle>, <person, hit, ball>
Ranked by language-model score:
person hit ball           -1.17
person ride motorcycle    -1.3
person walk dog           -2.18
person park bat           -4.76
car move bag              -5.47
car move motorcycle       -5.52

Integrated Scoring of SVOs
Consider the top n=5 detected objects and the top k=10 verb detections (plus their verb expansions) for a given test video.
Construct all possible SVO triples from these nouns and verbs.
Pick the best overall SVO using a metric that combines evidence from both vision and language.
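Candidate construction can be sketched as a Cartesian product over the top detections. This is an illustration under assumptions: the function names are hypothetical, the expansion table is taken to be a dictionary from each detected verb to its similar verbs, and subject and object are allowed to coincide because the reranked samples include triples such as person,follow,person.

```python
from itertools import product


def top_items(scores, limit):
    """Names of the `limit` highest-scoring entries, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:limit]


def candidate_triples(object_scores, verb_scores, expansions, n=5, k=10):
    """All (subject, verb, object) triples from top nouns and expanded verbs."""
    nouns = top_items(object_scores, n)
    verbs = []
    for verb in top_items(verb_scores, k):
        # Keep the detected verb plus its expansions, without duplicates.
        for candidate in [verb, *expansions.get(verb, {})]:
            if candidate not in verbs:
                verbs.append(candidate)
    return list(product(nouns, verbs, nouns))


object_scores = {"motorbike": 0.51, "person": 0.42, "car": 0.29, "aeroplane": 0.05}
verb_scores = {"move": 0.34, "hold": 0.23, "ride": 0.19, "dance": 0.05}
expansions = {
    "move": {"move": 1.0, "walk": 0.8, "pass": 0.8, "ride": 0.8},
    "hold": {"hold": 1.0, "keep": 1.0},
    "ride": {"ride": 1.0, "go": 0.8, "move": 0.8, "walk": 0.7},
    "dance": {"dance": 1.0, "turn": 0.7, "jump": 0.7, "hop": 0.6},
}
candidates = candidate_triples(object_scores, verb_scores, expansions)
print(len(candidates))
```

With the four nouns and eleven distinct verbs from the earlier detection slides, this produces 4 × 11 × 4 = 176 candidates to score.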

Combining SVO Scores
Linearly interpolate vision and language-model scores.
Compute SVO vision score assuming independence of components and taking into account similarity of expanded verbs.
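The transcript does not preserve the interpolation formula, so the following is only one plausible reading. It assumes the vision score is the product of the subject, verb, and object confidences (independence), that an expanded verb inherits the detected verb's confidence scaled by its similarity, and that the two scores are combined in log space with weights `w_vision` and `w_language`. All names, the log-space choice, and the language-model values are assumptions, not the authors' stated method.

```python
import math


def vision_score(triple, object_scores, verb_scores, expansions):
    """Product of component confidences, assuming independence."""
    subject, verb, obj = triple
    verb_conf = 0.0
    for detected, conf in verb_scores.items():
        # A verb reached by expansion is discounted by its similarity.
        sims = expansions.get(detected, {detected: 1.0})
        verb_conf = max(verb_conf, conf * sims.get(verb, 0.0))
    return object_scores.get(subject, 0.0) * verb_conf * object_scores.get(obj, 0.0)


def combined_score(vision, lm_logprob, w_vision=0.5, w_language=0.5, floor=1e-12):
    """Linear interpolation of log vision score and LM log-probability."""
    # The floor avoids log(0) when a component was never detected.
    return w_vision * math.log10(max(vision, floor)) + w_language * lm_logprob


object_scores = {"motorbike": 0.51, "person": 0.42, "car": 0.29}
verb_scores = {"move": 0.34, "hold": 0.23, "ride": 0.19}
expansions = {
    "move": {"move": 1.0, "walk": 0.8, "pass": 0.8, "ride": 0.8},
    "hold": {"hold": 1.0, "keep": 1.0},
    "ride": {"ride": 1.0, "go": 0.8, "move": 0.8, "walk": 0.7},
}
# Hypothetical language-model log-probabilities for two candidates.
lm_logprobs = {
    ("person", "ride", "motorbike"): -1.3,
    ("motorbike", "move", "person"): -5.5,
}
best = max(
    lm_logprobs,
    key=lambda t: combined_score(
        vision_score(t, object_scores, verb_scores, expansions), lm_logprobs[t]
    ),
)
print(best)
```

In this toy setting vision alone slightly prefers (motorbike, move, person), while the interpolated score prefers (person, ride, motorbike), which is the kind of correction the language model is meant to provide.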

Sample Reranked SVOs
Baseline Vision triplet: motorbike, march, person
person,ride,motorcycle    -3.02
person,follow,person      -3.31
person,push,person        -3.35
person,move,person        -3.42
person,run,person         -3.50
person,come,person        -3.51
person,fall,person        -3.53
person,walk,person        -3.61
motorcycle,come,person    -3.63
person,pull,person        -3.65

Sample Reranked SVOs
Baseline Vision triplet: person, move, dog
person,walk,dog           -3.35
person,follow,person      -3.35
dog,come,person           -3.46
person,move,person        -3.46
person,run,person         -3.52
person,come,person        -3.55
person,fall,person        -3.57
person,come,dog           -3.62
person,walk,person        -3.65
person,go,dog             -3.70

SVO Accuracy Results (w1 = 0)

Binary Accuracy
                                Subject    Activity    Object    All
Vision baseline                 71.35%     8.65%       29.19%    1.62%
SVO LM (No Verb Expansion)      85.95%     16.22%      24.32%    11.35%
SVO LM (Verb Expansion)         85.95%     36.76%      33.51%    23.78%

WUP Accuracy
                                Subject    Activity    Object    All
Vision baseline                 87.76%     40.20%      61.18%    63.05%
SVO LM (No Verb Expansion)      94.90%     63.54%      69.39%    75.94%
SVO LM (Verb Expansion)         94.90%     66.36%      72.74%    78.00%

Surface Realization: Template + Language Model
Input:
The best SVO triplet from the content planning stage
Best fitting preposition connecting the verb & object (mined from text corpora)
Template: Determiner + Subject + (conjugated Verb) + Preposition (optional) + Determiner + Object
Generate all sentences fitting this template and rank them using a Language Model trained on Google NGrams
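The template step can be sketched as enumerating every filling of the slots and keeping the highest-scoring sentence. Everything named here is illustrative: the determiner set, the hand-written verb forms, and especially `toy_score`, which is a stand-in lookup table and not a language model trained on Google NGrams.

```python
from itertools import product


def candidate_sentences(subject, verb_forms, obj, prepositions, determiners=("a", "the")):
    """All sentences matching Det + Subject + Verb + (Prep) + Det + Object."""
    sentences = []
    # The empty string represents the optional preposition being absent.
    for det_s, verb, prep, det_o in product(
        determiners, verb_forms, ["", *prepositions], determiners
    ):
        words = [det_s, subject, verb, prep, det_o, obj]
        sentence = " ".join(word for word in words if word)
        sentences.append(sentence[0].upper() + sentence[1:] + ".")
    return sentences


def realize(subject, verb_forms, obj, prepositions, score):
    """Return the candidate sentence that the scoring function ranks highest."""
    return max(candidate_sentences(subject, verb_forms, obj, prepositions), key=score)


# Hypothetical scores standing in for a real language model.
toy_counts = {
    "A person is riding a motorbike.": 3,
    "A person rides on the motorbike.": 1,
}


def toy_score(sentence):
    return toy_counts.get(sentence, 0)


verb_forms = ["rides", "is riding", "rode"]
sentences = candidate_sentences("person", verb_forms, "motorbike", ["on"])
best_sentence = realize("person", verb_forms, "motorbike", ["on"], toy_score)
print(len(sentences), best_sentence)
```

Passing the scorer in as a function keeps the template enumeration independent of whichever language model is used for ranking.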

Automatic Evaluation of Sentence Quality
Evaluate generated sentences using standard Machine Translation (MT) metrics.
Treat all human-provided descriptions as “reference translations.”

Human Evaluation of Descriptions
Asked 9 unique MTurk workers to evaluate descriptions of each test video.
Asked to choose between vision-baseline sentence, SVO-LM (VE) sentence, or “neither.”
When preference expressed, 61.04% preferred SVO-LM (VE) sentence.
For 84 videos where the majority of judges had a clear preference, 65.48% preferred the SVO-LM (VE) sentence.

Examples where we outperform the baseline

Examples where we underperform the baseline

Conclusions
We have developed a simple, preliminary broad-scale video description system.
An SVO language model trained on large-scale parsed text improves performance across multiple evaluations.
Many directions remain for improving the complexity and coverage of both language and vision components.