
Slide1

Illinois Group in Star Challenge

PART I: Visual Data Processing
PART II: Audio Search

Liangliang Cao, Xiaodan Zhuang

University of Illinois at Urbana-Champaign

Slide2

What is Star Challenge?

Competition to Develop World’s Next-Generation Multimedia Search Technology
Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore
A real-world computer vision task which requires large amounts of computation power

Slide3

But low rewards

56 teams from 17 countries → Round 1 → 8 teams → Round 2 → 7 teams → Round 3 → 5 teams → Grand Final in Singapore

Slide4

But low rewards

56 teams from 17 countries → Round 1 → 8 teams → Round 2 → 7 teams → Round 3 → 5 teams → Grand Final in Singapore

No rewards at the intermediate rounds; only one team can win US$100,000

Slide5

Xiaodan, Lyon, Paritosh, Mark, Tom, Mandar, Sean, Jui-Ting, Zhen, Huazhong, Xi, Vong, Xu, Mert, Dennis, Jason, Andrey, Yuxiao

But we have a team with no fears…


Slide7

Let’s go over our experience and stories…

Slide8

Outline

Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)

Slide9

3 Audio Retrieval Tasks

Task | Query | Target | Metric
AT1 | IPA sequence | segments that contain the query IPA sequence, regardless of language | Mean Average Precision
AT2 | an utterance spoken by different speakers | all segments that contain the query word/phrase/sentence, regardless of spoken language | Mean Average Precision
AT3 | no queries | extract all recurrent segments which are at least 1 second in length | F-measure

Data set: 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3

Xiaodan will talk about this part……

Slide10

3 Video Retrieval Tasks

Task | Query | Target | Criteria | Metric | Data Set
VT1 | single image (20 queries) | all similar (short) video segments | "visually similar" | Mean Average Precision | 20 categories, multiple labels possible
VT2 | short video shot, <10 s (20 queries) | all similar (long) video segments | perceptually similar | Mean Average Precision | 10 categories, multiple labels possible
VT3 | videos with sound, 3~10 s, on the order of 10K | category number | learning the common visual characteristics | classification accuracy | 10 (20) categories, including one "others" category

Slide11

20 VT1 Categories

100. Not-Applicable, none of the labels
101. Crowd (>10 people)
102. Building with sky as backdrop, clearly visible
103. Mobile devices including handphone/PDA
104. Flag
105. Electronic chart, e.g. stock charts, airport departure chart
106. TV chart overlay, including graphs, text, powerpoint style
107. Person using computer, both visible
108. Track and field, sports
109. Company trademark, including billboard, logo
110. Badminton court, sports
111. Swimming pool, sports
112. Closeup of hand, e.g. using mouse, writing, etc
113. Business meeting (>2 people), mostly seated down, table visible
114. Natural scene, e.g. mountain, trees, sea, no people
115. Food on dishes, plates
116. Face closeup, occupying about 3/4 of screen, frontal or side
117. Traffic scene, many cars, trucks, road visible
118. Boat/ship, over sea, lake
119. PC webpages, screen of PC visible
120. Airplane

Slide12

10 Categories for VT2

201. People entering/exiting door/car
202. Talking face with introductory caption
203. Fingers typing on a keyboard
204. Inside a moving vehicle, looking outside

205. Large camera movement, tracking an object, person, car, etc

206. Static or minute camera movement, people(s) walking, legs visible

207. Large camera movement, panning left/right, top/down of a scene

208. Movie ending credit

209. Woman monologue

210. Sports celebratory hug

Slide13

5 Categories for VT3

101. Crowd (>10 people)
102. Building with sky as backdrop, clearly visible
107. Person using computer, both visible
112. Closeup of hand, e.g. using mouse, writing, etc
116. Face closeup, occupying about 3/4 of screen, frontal or side

Slide14

Video+Audio Tasks in Round 3

1) Audio search (AT1 or AT2)

5 queries will be given, either in the form of IPA sequence or waveform, and the participants are required to solve 4.

2) Video search (VT1) 

5 queries will be given and the participants are required to solve 4.

 

3) Audio + Video search (AT1 + VT2)

The search queries for this task are a combination of IPA sequence/waveform and video category. The participants are required to retrieve segments of data which contain sound and video corresponding to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.

Slide15

Examples of Images

Slide16

More samples

Slide17

PART I: Data

Slide18

Evaluation Video Data in Round2

31 MPEG videos, ~20 hours
17,289 frames for VT1 in total
40,994 frames for VT2 in total: 32,508 pseudo key frames, 8,486 real key frames

Slide19

Evaluation Video Data of Round3

Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10,580 .jpg files
Key frames for VT2: 64,546 files in total, including 10,580 .jpg files (true key frames) + 53,966 .jpg files (pseudo key frames)

Video resolution: 352×288

Slide20

Computation Power

Workstations in IFP: 10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 Cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL Cluster:
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz with 1.5 GB of RAM per node

TeraGrid:

Slide21

Time Cost for Video Tasks

Data decompression: 15 minutes
Video format conversion: 2 hours
Video segmentation (for VT2): 40 minutes
Sound track extraction: 30 minutes

Feature extraction:
Global Feature 2: 2 hours (C)
Global Feature 1: 2 hours (C)
Patch-based Feature 1: 2 hours (C)
Patch-based Feature 2: 5 hours (Matlab)
Semantic Feature 1: 24 hours (Matlab)
Semantic Feature 2: 3 hours (C)
Semantic Feature 3: 4 hours (C)
Motion Feature 1: 24 hours (Matlab)
Motion Feature 2: 3 hours on t-Illiac

Classifier training:
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, Matlab)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes

Slide22

Possible Accelerations for Video

Matlab codes to C
Parallel computing
GPU acceleration: patch-based features

Load time is the major issue
Extracting all the features after one load

Slide23

PART II: Features

Slide24

Features for Round2- VT1

Image features: SIFT, HOG, GIST, APC, LBP, color, texture, etc.
Semantic feature

Slide25

Features for Round2-VT2

Character detector: Harris corner, morphological operations
Optical flow: Lucas-Kanade on spatial intensity gradient
Gender recognition: SODA-Boost based
Motion history image
Spatial interest points

Slide26

GUFE: Grand Unified Feature Extractor

Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on nearest neighbors (NN)
Feature normalization/combination
Result visualization

Slide27

PART III: Algorithms

Slide28

Observations

Samples under the same category are more semantically similar to each other.
The shot boundaries are not well defined.
Some of the key frames are not labeled correctly, e.g., VT1 101, 103 (26-141).

Slide29

Algorithms

Query expansion

Input: a query image and its category number.

0. Preprocessing: compute the matching between the evaluation and the development data.
1. Expand the query image by retrieving all the images from the development data set with the same category.
2. Search the evaluation set with the expanded query.

Output: return the top 50/20 results.
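A minimal sketch of steps 1-2, assuming retrieval is done by cosine similarity over precomputed feature vectors; the deck describes nearest-neighbor query expansion in GUFE but not the exact similarity used, and all names below are illustrative:

```python
import numpy as np

def expanded_query_retrieval(query_feat, query_category,
                             dev_feats, dev_labels,
                             eval_feats, top_k=50):
    """Illustrative query-expansion retrieval.

    query_feat     : (d,) feature vector of the query image
    query_category : category id of the query
    dev_feats      : (n_dev, d) development-set features
    dev_labels     : (n_dev,) development-set category ids
    eval_feats     : (n_eval, d) evaluation-set features
    """
    # 1. Expand the query with all development images of the same category.
    expanded = np.vstack([query_feat[None, :],
                          dev_feats[dev_labels == query_category]])

    # 2. Score every evaluation frame by its best cosine similarity
    #    to any member of the expanded query set.
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

    sims = normalize(eval_feats) @ normalize(expanded).T   # (n_eval, n_expanded)
    scores = sims.max(axis=1)

    # 3. Return the indices of the top-k evaluation frames.
    return np.argsort(-scores)[:top_k]
```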

Slide30

Algorithms

Query expansion
GMM-based approach

Motivation: use a GMM to model the distribution of patches.

1. Train a UBM (Universal Background Model) on patches from all training images.
2. MAP estimation of the distribution of the patches belonging to one image, given the UBM.
3. Compute pair-wise image distances based on a patch kernel and within-class covariance normalization.
4. Retrieve images based on the normalized distance.
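A minimal sketch of steps 1-2 using scikit-learn, with MAP adaptation of the UBM means only; the patch kernel and within-class covariance normalization of step 3 are omitted, and the component count and relevance factor are placeholder values, not the team's settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_patches, n_components=64):
    """Universal Background Model over patch descriptors from all training images."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=100, random_state=0)
    ubm.fit(all_patches)
    return ubm

def map_adapt_means(ubm, image_patches, relevance=16.0):
    """MAP-adapt the UBM means to the patches of a single image.

    Returns the stacked adapted means (a 'supervector'), one common way to
    turn a per-image GMM into a fixed-length representation for distances.
    """
    post = ubm.predict_proba(image_patches)          # (n_patches, K) responsibilities
    n_k = post.sum(axis=0)                           # soft counts per mixture component
    ex_k = (post.T @ image_patches) / (n_k[:, None] + 1e-12)  # weighted patch means
    alpha = (n_k / (n_k + relevance))[:, None]       # adaptation coefficients
    adapted = alpha * ex_k + (1.0 - alpha) * ubm.means_
    return adapted.reshape(-1)
```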

Slide31

PART IV: Performance

Slide32

VT1 Performance (#2 in 8)

Category | MAP
101. Crowd (>10 people) | 0.8419
102. Building with sky as backdrop, clearly visible | 0.977
103. Mobile devices including handphone/PDA | 0.028
107. Person using computer, both visible | 0.2281
109. Company trademark, including billboard, logo | 0.96
112. Closeup of hand, e.g. using mouse, writing, etc | 0.4584
113. Business meeting (>2 people), mostly seated down, table visible | 0.0644
115. Food on dishes, plates | 0.2285
116. Face closeup, occupying about 3/4 of screen, frontal or side | 0.9783
117. Traffic scene, many cars, trucks, road visible | 0.2901

Slide33

VT2 Performance (#1 in 8)

Category | MAP
202. Talking face with introductory caption | 0.8432
206. Static or minute camera movement, people(s) walking, legs visible | 0.0581
207. Large camera movement, panning left/right, top/down of a scene | 0.7789
208. Movie ending credit | 0.2782
209. Woman monologue (Zhen) | 0.9756

Slide34

Performance of Round 3 (#1 in 7)

Task 2 (VT1)

Target | Estimated MAP (R=20)
101. Crowd (>10 people) | 0.64
102. Building with sky as backdrop, clearly visible | 1
107. Person using computer, both visible | 0.7
112. Closeup of hand, e.g. using mouse, writing, etc | 0.527
116. Face closeup, occupying about 3/4 of screen, frontal or side | 1

Task 3 (AT1 + VT2)

Retrieval Target | VT2 only | AT1 + VT2 (Video, R=20)
202, face with introductory caption | 1 | 0.03
209, woman monologue | 0.35 | 0.1
201, people entering door | N/A |

We are:
2nd in Audio search
4th in Video search
2nd in AV search
1st overall

Slide35

Illinois Group in Star Challenge
Part II: Audio Search

A general index/retrieval approach leveraging speech recognition output lattices
Experience in a real-world audio retrieval task: the Star Challenge
Experience in speech retrieval in an unknown language

Slide36

(Audio) Information Retrieval

Problem definition

Task Description: given a query, find the “most relevant” segments in a database

[Diagram: a query, e.g., the IPA sequence /k r u: d p r ai s ^ z/ (“CRUDE PRICES”), is matched against thousands of audio files, returning the top N file IDs]

Slide37

(Audio) Information Retrieval

“Standard” Methods

Published algorithms:

EXACT MATCH: segment* = argmin_i d(query, segment_i), where d is the string edit distance. Fast.

SUMMARY STATISTICS: segment* = argmax_i p(query | segment_i); bag-of-words, no concept of “sequence”. Good for text, e.g., Google, Yahoo, etc.

TRANSFORM AND INFER: segment* = argmax_i p(query | segment_i) ≈ argmax_i E(count(query) | segment_i); word order matters. Flexible, but slow...
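As a concrete illustration of the EXACT MATCH variant, a small sketch that ranks segments by string edit distance between the query phone sequence and each segment's phone string; a real system would match against substrings of much larger archives, and the segments below are invented toy data:

```python
def edit_distance(a, b):
    """Levenshtein (string edit) distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def exact_match_rank(query_phones, segments, top_n=20):
    """Rank (segment_id, phone_sequence) pairs by edit distance to the query."""
    scored = [(edit_distance(query_phones, seg_phones), seg_id)
              for seg_id, seg_phones in segments]
    return [seg_id for _, seg_id in sorted(scored)[:top_n]]

# Example: the query "CRUDE PRICES" as an IPA-like phone sequence.
query = "k r u: d p r ai s ^ z".split()
segments = [("file_001", "dh ^ k r u: d p r ai s ^ z r ou z".split()),
            ("file_002", "w e dh @ f o r k a: s t".split())]
print(exact_match_rank(query, segments, top_n=2))
```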

Slide38

Example: Language-Independent Speech Information Retrieval

Voice activity detection
Perceptual frequency warping
Gaussian mixtures → likelihood vector b_i = p(observation_t | state_t = i)
Retrieval ranking = E(count(query) | segment observations)
Inference algorithm: finite state transducer built from ASR lattices

Slide39

STAR Challenge Audio Retrieval Tasks

Task | Query | Target | Data Set
AT1 | IPA sequence | segments that contain the query IPA sequence, regardless of language | 25 hours monolingual database in Round 1; 13 hours multilingual database in Round 3
AT2 | an utterance spoken by different speakers | all segments that contain the query word/phrase/sentence, regardless of spoken language | (same as AT1)

Slide40

STAR Challenge Audio Retrieval Tasks

Genuine retrieval tasks, without pre-defined categories.
The queries are human speech or IPA sequences, in one or multiple languages.
Queries might cover only part of the speech in the provided database segments.
The returned hits should be ordered, and only the first 50 or 20 are submitted.

Slide41

[System diagram: Speech Recognition | Indexing / Query / Retrieval.
Audio archive → feature extraction on short-time windows → spectral features → speech recognition → lattices.
For all audio segments: lattices → FSA generation → FSM index → group indices → all group indices.
Query → FSM-based query construction (using empirical and knowledge-based phone confusion) → FSM-based retrieval against the group indices → retrieval results.]

Slide42

Automatic speech recognition

Better performance with language-specific systems than with language-independent systems:
no inter-language mismatch between training and testing (acoustic model, pronunciation model, language model);
e.g., for English data, we can use all knowledge of English (training data, pronunciation dictionary, language model) for language-specific (e.g., English) word recognition.

Slide43

Automatic speech recognition

Multilingual database and queries:
We might fail to identify which particular language the test data is in; even if we know, we might not have speech recognition models for that language.
We might encounter unseen languages (Tamil, Malay, …).
Therefore: language-independent, phone-based recognition instead of word-based recognition.

Slide44


Summary of corpora of different languages

Slide45

Recognition results: not all languages/datasets are equal

Slide46

Acoustic model

Spectral features: 39-dim PLP, cepstral mean/variance normalization per speaker
Modeling: HMMs with {11,13,15,17}-mixture Gaussians
Context-dependent, language-dependent modeling:
left context - CENTRAL PHONE + right context % language, referred to as language-dependent triphones
e.g., the sound /A/ in different contexts and different languages:
^’-A+b%Eng
^’-A+b’%Eng
>-A+cm%Chinese
…
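The label convention above packs context and language into one symbol. As a purely illustrative aid (the field separators are read off the examples on this slide, not taken from any released tool), a tiny parser for such labels might look like:

```python
import re

# Hypothetical helper: split a label such as "^'-A+b%Eng" into its parts,
# following the "left - CENTRAL + right % language" convention shown above.
TRIPHONE = re.compile(r"^(?P<left>.+)-(?P<center>.+)\+(?P<right>.+)%(?P<lang>.+)$")

def parse_triphone(label):
    m = TRIPHONE.match(label)
    if not m:
        raise ValueError(f"not a language-dependent triphone label: {label}")
    return m.groupdict()

print(parse_triphone("^'-A+b%Eng"))
# {'left': "^'", 'center': 'A', 'right': 'b', 'lang': 'Eng'}
```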

Slide47

Acoustic model -

Context-Dependent Phones

Slide48

Acoustic model - clustering

Categories for decision tree questions:
Right or left context
Distinctive phone features (manner/place of articulation)
Language identity
Lexical stress
Punctuation mark

^’-A+b%Eng
^’-A+b’%Eng
>-A+cm%Chinese
…

Slide49


Language/Sequence model --- N-gram

If there is a pronunciation model, a particular sequence of context-dependent phone models (acoustic models) can be converted to a particular word sequence, which is modeled by N-gram.

If there is no pronunciation model, the context-dependent phone sequence is directly modeled by N-gram.
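For the second case (no pronunciation model), a minimal sketch of estimating a bigram model directly over phone sequences, with add-one smoothing standing in for whatever smoothing the actual recognizer used, and toy training data:

```python
from collections import Counter
import math

def train_bigram(sequences):
    """Estimate a bigram model over phone (or word) sequences with add-one smoothing."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for seq in sequences:
        seq = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(seq)
        unigrams.update(seq[:-1])               # context counts
        bigrams.update(zip(seq[:-1], seq[1:]))  # (previous, current) counts
    V = len(vocab)

    def log_prob(prev, cur):
        return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + V))
    return log_prob

def sequence_log_prob(log_prob, seq):
    seq = ["<s>"] + list(seq) + ["</s>"]
    return sum(log_prob(p, c) for p, c in zip(seq[:-1], seq[1:]))

# Toy training data: phone sequences from a hypothetical transcript.
lm = train_bigram([["k", "r", "u:", "d"], ["p", "r", "ai", "s", "^", "z"]])
print(sequence_log_prob(lm, ["k", "r", "u:", "d"]))
```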

Slide50

Solutions for English Speech Retrieval

Speech recognition frontend:
Features: Perceptual Linear Predictive cepstra + energy, with first/second-order regression coefficients
Speech recognizer: triphone-clustered English speech recognizer (English acoustic model, English dictionary, English language model)
Lattice generation using a Viterbi decoder

[Pipeline excerpt: audio archive → feature extraction on short-time windows → spectral features → speech recognition → lattices]

Slide51

Using lattices as the representation of speech data

I. A compact way to represent numerous alternative hypotheses output by the speech recognizer.
A simple example: a lattice encoding the two hypotheses “a b a” or “b a”.
Some more complex examples follow on the next slide.

Slide52

Some more complex examples: [lattice figures from Mangu et al. 1999 and James 1995]

Slide53

Using lattices as the representation of speech data

Enable more robust speech retrieval:
The one-best hypothesis from speech recognition is not reliable enough.
Lattices can be represented as finite state machines, which can be used in speech retrieval and can take advantage of general weighted finite state machine algorithms.
Robust matching between query and audio files.
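To make "robust matching" concrete: a common lattice-based relevance score, and the E(count(query) | observations) quantity mentioned earlier, is the expected number of times the query occurs under the recognizer's weighted hypotheses. A minimal sketch, assuming the lattice has already been expanded into a short list of (posterior, phone sequence) paths rather than kept as a weighted FSM:

```python
def count_occurrences(path, query):
    """Number of times the query sequence occurs (contiguously) in a path."""
    n, m = len(path), len(query)
    return sum(1 for i in range(n - m + 1) if path[i:i + m] == query)

def expected_query_count(paths, query):
    """E[count(query) | lattice], with the lattice given as weighted paths.

    paths : list of (posterior_probability, phone_sequence) pairs
    query : phone sequence to search for
    """
    return sum(p * count_occurrences(seq, query) for p, seq in paths)

# Toy lattice: the recognizer is unsure whether it heard "a b a" or "b a".
lattice = [(0.7, ["a", "b", "a"]),
           (0.3, ["b", "a"])]
print(expected_query_count(lattice, ["b", "a"]))   # 1.0: occurs once in each path
```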

Slide54

FSM-based Indexing

Solutions for English Speech Retrieval: indexing the audio archive

Input from the ASR frontend: lattices of the database segments, plus the English vocabulary
Construct log-semiring automata for each segment
Construct FST-based segment index files
Combine segment index files into a few group index files for retrieval

[Diagram: for all audio segments, FSA generation → FSM index → group indices → all group indices]
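The deck builds its index as weighted finite-state machines in the style of Allauzen et al. (2004), cited on the next slide. As a much-simplified stand-in that keeps the key idea (store expected n-gram counts from the lattice hypotheses so retrieval never rescans the audio), one could build a plain inverted index; unlike a factor-transducer index, this toy version only answers queries up to max_n symbols long:

```python
from collections import defaultdict

def build_group_index(segment_lattices, max_n=3):
    """Toy inverted index: phone n-gram -> {segment_id: expected count}.

    segment_lattices maps segment_id -> list of (posterior, phone_sequence)
    paths, i.e., the same simplified lattice representation sketched above.
    """
    index = defaultdict(lambda: defaultdict(float))
    for seg_id, paths in segment_lattices.items():
        for posterior, seq in paths:
            for n in range(1, max_n + 1):
                for i in range(len(seq) - n + 1):
                    index[tuple(seq[i:i + n])][seg_id] += posterior
    return index

def retrieve(index, query, top_n=20):
    """Rank segments by the indexed expected count of the query n-gram."""
    hits = index.get(tuple(query), {})
    return sorted(hits.items(), key=lambda kv: -kv[1])[:top_n]

segments = {"seg_001": [(0.7, ["a", "b", "a"]), (0.3, ["b", "a"])],
            "seg_002": [(1.0, ["c", "b", "a"])]}
idx = build_group_index(segments)
print(retrieve(idx, ["b", "a"]))
```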

Slide55

Indexing the audio archive (example) [Allauzen et al., 2004]

[Figure: speech recognition output lattices for two audio files, and the resulting index for the two files]

Slide56

Making FST-based queries

Solutions for English Speech Retrieval

Queries provided as IPA sequences:
Build an automaton from the IPA sequence, expanding each IPA arc into alternative arcs for ‘similar’ IPA symbols
Build a query FSA, incorporating constraints

Queries provided as audio waveforms:
Process with the ASR frontend
Build log-semiring automata
Build the query FSA
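The full system expands the query into a weighted FSA; as a simplified, list-based sketch of the same idea, the confusion table below is invented for illustration and is not the empirical or knowledge-based confusion data the team actually used:

```python
from itertools import product

# Hypothetical confusion table: each IPA symbol maps to itself plus
# acoustically 'similar' alternatives (illustrative values only).
CONFUSION = {
    "u:": ["u:", "u"],
    "ai": ["ai", "a"],
    "^":  ["^", "@"],
}

def expand_query(ipa_query, max_variants=50):
    """Enumerate alternative phone sequences for a query IPA sequence.

    A weighted FSA would keep these alternatives (with costs) in one compact
    machine; here we simply enumerate the first max_variants combinations.
    """
    alternatives = [CONFUSION.get(sym, [sym]) for sym in ipa_query]
    variants = []
    for combo in product(*alternatives):
        variants.append(list(combo))
        if len(variants) >= max_variants:
            break
    return variants

for v in expand_query("k r u: d p r ai s ^ z".split()):
    print(" ".join(v))
```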

Slide57

Example of query expansion (phone sequence to word FSA)

[Figure: the query /k r u: d p r ai s ^ z/ (“CRUDE PRICES”) passed through FSM-based query construction, using empirical and knowledge-based phone confusion]

Slide58

Retrieval using queries

Solutions for English Speech Retrieval

Parallel retrieval in all group index files
Order retrieved segment IDs; truncate/format results
Fuse results with different precision/recall tradeoffs obtained with different settings (see the sketch below)

[Diagram: all group indices → FSM-based retrieval]
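The deck does not say how runs with different precision/recall tradeoffs were fused, so the merge below is only one plausible choice (simple reciprocal-rank fusion over the ranked lists), shown to make the fusion step concrete; the run names and IDs are invented:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60, top_n=20):
    """Fuse several ranked lists of segment IDs into one ranked list.

    ranked_lists : list of lists, each ordered best-first, e.g. the outputs
                   of retrieval runs with different precision/recall settings.
    k            : standard RRF damping constant.
    """
    scores = defaultdict(float)
    for run in ranked_lists:
        for rank, seg_id in enumerate(run, start=1):
            scores[seg_id] += 1.0 / (k + rank)
    return [seg for seg, _ in sorted(scores.items(), key=lambda kv: -kv[1])][:top_n]

run_high_precision = ["seg_007", "seg_003", "seg_011"]
run_high_recall    = ["seg_003", "seg_011", "seg_007", "seg_042", "seg_019"]
print(reciprocal_rank_fusion([run_high_precision, run_high_recall]))
```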

Slide59

Solutions for Multilingual Speech Retrieval

ASR frontend: multi-lingual word/phone speech recognizer; language-independent phone recognizer
Indexing the audio archive: multi-lingual index; language-independent index
Making FST-based queries: multi-lingual queries; language-independent queries
Retrieval using queries: same as in language-specific retrieval

Slide60

Rough CPU Time for Audio Search

ASR frontend:
Phone recognizer with bigram phone LM: ~25% realtime
3k monophone-based word recognizer with unigram word LM: ~100% realtime
20k monophone-based word recognizer with unigram word LM: ~2500% realtime
Triphone recognizer with bigram phone LM: ~10000% realtime
60k triphone-based word recognizer with unigram word LM: ~10000% realtime

Indexing: less than ~50% realtime
Query construction: ~10 sec each
Retrieval: ~1% realtime

Slide61

Possible Accelerations

ASR frontend:
Faster Viterbi decoder
Optimized FSM-based decoder
Optimized parameters

Indexing:
Optimized parallel job splitting
Applying a constraining/filtering FSA before building the group indices

Retrieval:
Optimized parallel job splitting
Pre-retrieval using less complicated techniques

Slide62

Audio Retrieval

[Diagram: the query /k r u: d p r ai s ^ z/ (“CRUDE PRICES”) searched against thousands of audio files, returning the top N file IDs]

What’s the “optimal” model set?
Language-specific phones/words
Language-independent phones
General sub-word units
Size of inventory: using only the frequent symbols is better
Data driven: selected by clustering tree

Slide63

Automatic Speech Retrieval in an Unknown Language

[System diagram, repeating the pipeline of Slide 41: audio archive → feature extraction on short-time windows → spectral features → speech recognition → lattices → FSA generation → FSM indexing → group indices → all group indices; query → FSM-based query construction (empirical and knowledge-based phone confusion) → FSM-based retrieval → retrieval results. Here both the speech and the query (e.g., /k r u: d p r ai s ^ z/, “CRUDE PRICES”) are in an unknown language, and the open question is which units to represent them with: words, triphones, phones, or other acoustic units, combined with FSM-based fuzzy matching and retrieval.]

Slide64

Automatic Speech Retrieval in an Unknown Language

Modeled as a special case of the cognitive process called assimilation:
Language-dependent (LD) phones (e.g., English, Russian, and Spanish phones) and their LD models are grouped into language-independent (LID) phone clusters based on pairwise KL divergence.
Speech in an unknown language is then represented as LID phone lattices.
Accommodation (introducing new models)?
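The slide names only the criterion, pairwise KL divergence between language-dependent phone models. A minimal sketch of that idea, assuming each phone model is summarized by a single diagonal Gaussian and using off-the-shelf agglomerative clustering in place of whatever clustering procedure was actually used; the cluster count is a placeholder:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def kl_diag_gauss(m1, v1, m2, v2):
    """KL divergence between two diagonal Gaussians N(m1, v1) and N(m2, v2)."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def cluster_phones(phone_models, n_clusters=40):
    """Group language-dependent phone models by symmetrized pairwise KL divergence.

    phone_models : dict mapping labels such as 'A%Eng' -> (mean, variance) arrays
    """
    labels = list(phone_models)
    n = len(labels)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            m1, v1 = phone_models[labels[i]]
            m2, v2 = phone_models[labels[j]]
            d = 0.5 * (kl_diag_gauss(m1, v1, m2, v2) + kl_diag_gauss(m2, v2, m1, v1))
            dist[i, j] = dist[j, i] = d
    Z = linkage(squareform(dist), method="average")
    assignment = fcluster(Z, t=n_clusters, criterion="maxclust")
    return dict(zip(labels, assignment))
```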

Slide65

Multilingual Subword Units in an Unknown Language

Language-dependent versions of the same IPA symbol can end up:
in one cluster, e.g., /z/ or /t/
in different clusters, e.g., /j/ or /I/
Different IPAs may be similar.

Clustering gives a more compact phone set, and performs better, both for retrieval with an audio query and for retrieval with an IPA sequence query.

[Figure: Croatian phone recognition, comparing Croatian included in training with Croatian treated as an unknown language]

Slide66

Thank you.