Slide 1: Illinois Group in Star Challenge
PART I: Visual Data Processing. PART II: Audio Search.
Liangliang Cao, Xiaodan Zhuang
University of Illinois at Urbana-Champaign
Slide 2: What is Star Challenge?
A competition to develop the world's next-generation multimedia search technology.
Hosted by the Agency for Science, Technology and Research (A*STAR), Singapore.
A real-world computer vision task that requires large amounts of computation power.
Slides 3-4: But low rewards
56 teams from 17 countries.
Round 1: 8 teams (no rewards)
Round 2: 7 teams (no rewards)
Round 3: 5 teams (no rewards)
Grand Final in Singapore: only one team can win US$100,000.
Slides 5-6: But we have a team with no fears…
Xiaodan, Lyon, Paritosh, Mark, Tom, Mandar, Sean, Jui-Ting, Zhen, Huazhong, Xi, Vong, Xu, Mert, Dennis, Jason, Andrey, Yuxiao
Slide 7: Let's go over our experience and stories…
Slide 8: Outline
Problems of Visual Retrieval
Data
Features
Algorithms
Results (first 3 rounds)
Slide 9: 3 Audio Retrieval Tasks
AT1. Query: an IPA sequence. Target: segments that contain the query IPA sequence, regardless of language. Metric: Mean Average Precision.
AT2. Query: an utterance spoken by different speakers. Target: all segments that contain the query word/phrase/sentence, regardless of the spoken language. Metric: Mean Average Precision.
AT3. No queries. Target: extract all recurrent segments that are at least 1 second in length. Metric: F-measure.
Data set: 25-hour monolingual database in Round 1; 13-hour multilingual database in Round 3.
Xiaodan will talk about this part…
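Since Mean Average Precision scores most of the tasks above and below, a minimal sketch of how it is computed may help; the segment IDs and ground truth here are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """AP for one query: mean of precision@k at each rank k that holds a relevant hit."""
    relevant, hits, score = set(relevant_ids), 0, 0.0
    for k, seg_id in enumerate(ranked_ids, start=1):
        if seg_id in relevant:
            hits += 1
            score += hits / k          # precision at this cutoff
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over all queries; `runs` is a list of (ranked_ids, relevant_ids) pairs."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)

# Hypothetical example: two queries over retrieved segment IDs.
print(mean_average_precision([
    (["s3", "s1", "s7"], ["s1", "s7"]),   # AP = (1/2 + 2/3) / 2
    (["s2", "s9"],       ["s2"]),         # AP = 1.0
]))
```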
Slide 10: 3 Video Retrieval Tasks
VT1. Query: a single image (20 short queries). Target: video segments; find all similar segments. Criterion: "visually similar". Metric: Mean Average Precision. Data: 20 categories, multiple labels possible.
VT2. Query: a short video shot, <10 s (20 long queries). Target: video segments; find all similar segments. Criterion: perceptually similar. Metric: Mean Average Precision. Data: 10 categories, multiple labels possible.
VT3. Query: videos with sound (3~10 s), on the order of 10K. Target: the category number, learning the common visual characteristics. Metric: classification accuracy. Data: 10 (20) categories, including one "others" category.
Slide 11: 20 VT1 Categories
100. Not applicable, none of the labels
101. Crowd (>10 people)
102. Building with sky as backdrop, clearly visible
103. Mobile devices, including handphone/PDA
104. Flag
105. Electronic chart, e.g., stock charts, airport departure chart
106. TV chart overlay, including graphs, text, PowerPoint style
107. Person using computer, both visible
108. Track and field, sports
109. Company trademark, including billboard, logo
110. Badminton court, sports
111. Swimming pool, sports
112. Closeup of hand, e.g., using mouse, writing, etc.
113. Business meeting (>2 people), mostly seated down, table visible
114. Natural scene, e.g., mountain, trees, sea, no people
115. Food on dishes, plates
116. Face closeup, occupying about 3/4 of screen, frontal or side
117. Traffic scene, many cars, trucks, road visible
118. Boat/ship, over sea, lake
119. PC webpages, screen of PC visible
120. Airplane
Slide 12: 10 Categories for VT2
201. People entering/exiting door/car
202. Talking face with introductory caption
203. Fingers typing on a keyboard
204. Inside a moving vehicle, looking outside
205. Large camera movement, tracking an object, person, car, etc.
206. Static or minute camera movement, people walking, legs visible
207. Large camera movement, panning left/right, top/down of a scene
208. Movie ending credits
209. Woman monologue
210. Sports celebratory hug
Slide 13: 5 Categories for VT3
101. Crowd (>10 people)
102. Building with sky as backdrop, clearly visible
107. Person using computer, both visible
112. Closeup of hand, e.g., using mouse, writing, etc.
116. Face closeup, occupying about 3/4 of screen, frontal or side
Slide 14: Video+Audio Tasks in Round 3
1) Audio search (AT1 or AT2): 5 queries will be given, either as IPA sequences or waveforms, and the participants are required to solve 4.
2) Video search (VT1): 5 queries will be given and the participants are required to solve 4.
3) Audio + video search (AT1 + VT2): the search queries are a combination of an IPA sequence/waveform and a video category. Participants must retrieve segments whose sound and video correspond to the given IPA sequence/waveform and video category, respectively. 3 queries will be given and the participants are required to solve 2.
Slide 15: Examples of Images
Slide 16: More Samples
Slide 17: PART I: Data
Slide 18: Evaluation Video Data in Round 2
31 MPEG videos, ~20 hours
17289 frames for VT1 in total
40994 frames for VT2 in total: 32508 pseudo key frames, 8486 real key frames
Slide 19: Evaluation Video Data in Round 3
Video files: 27 MPEG-1 files (13 hours of video/audio in total)
Key frames for VT1: 10580 .jpg files
Key frames for VT2: 64546 files in total, including 10580 true key frames + 53966 pseudo key frames
Video resolution: 352×288
Slide 20: Computation Power
Workstations in IFP: 10 servers, 2~4 CPUs each, 36 CPUs in total
IFP-32 cluster: 32 dual-core 2.8 GHz 64-bit CPUs
CSL clusters:
Trusted-ILLIAC: 256 nodes with dual 2.2 GHz Opterons, 2 GB of RAM, and 73 GB SCSI Ultra320 disks
Monolith: 128-node cluster with dual Pentium III CPUs at 1 GHz, 1.5 GB of RAM per node
TeraGrid
Slide 21: Time Cost for Video Tasks
Data decompression: 15 minutes
Video format conversion: 2 hours
Video segmentation (for VT2): 40 minutes
Sound track extraction: 30 minutes
Feature extraction:
Global feature 1: 2 hours (C)
Global feature 2: 2 hours (C)
Patch-based feature 1: 2 hours (C)
Patch-based feature 2: 5 hours (Matlab)
Semantic feature 1: 24 hours (Matlab)
Semantic feature 2: 3 hours (C)
Semantic feature 3: 4 hours (C)
Motion feature 1: 24 hours (Matlab)
Motion feature 2: 3 hours on Trusted-ILLIAC
Classifier training:
Classifier 1: 1 hour (on IFP cluster, 25 CPUs, Matlab)
Classifier 2: 20 minutes
Classifier 3: less than 10 minutes
Slide 22: Possible Accelerations for Video
Matlab code to C
Parallel computing
GPU acceleration for patch-based features
Load time is the major issue: extract all features after a single load, as in the sketch below.
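A minimal sketch of the "one load, all features" idea, assuming precomputed extractor functions; `load_frames` and the two extractors are placeholders, not the actual IFP code.

```python
# Sketch: decode each video once, then run every extractor on the in-memory
# frames, instead of reloading the video per feature (the stated bottleneck).
import numpy as np

def load_frames(video_path):
    """Placeholder decoder: returns all frames as one array (T, H, W, 3)."""
    return np.zeros((100, 288, 352, 3), dtype=np.uint8)  # dummy frames

FEATURE_EXTRACTORS = {
    "global1": lambda frames: frames.mean(axis=(1, 2, 3)),  # stand-in feature
    "global2": lambda frames: frames.std(axis=(1, 2, 3)),   # stand-in feature
}

def extract_all(video_path):
    frames = load_frames(video_path)                # decode cost paid once
    return {name: fn(frames) for name, fn in FEATURE_EXTRACTORS.items()}

features = extract_all("video001.mpg")
```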
Slide 23: PART II: Features
Slide 24: Features for Round 2, VT1
Image features: SIFT, HOG, GIST, APC, LBP, color, texture, etc.
Semantic features
Slide 25: Features for Round 2, VT2
Character detector: Harris corner, morphological operations
Optical flow: Lucas-Kanade on the spatial intensity gradient (see the sketch after this list)
Gender recognition: SODA-Boost based
Motion history image
Spatial interest points
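A rough sketch of Lucas-Kanade tracking via OpenCV, not the team's original implementation; the frames here are synthetic stand-ins for consecutive key frames.

```python
# Corners from the spatial intensity gradient, then pyramidal LK tracking.
import cv2
import numpy as np

rng = np.random.default_rng(0)
prev = rng.integers(0, 255, (288, 352), dtype=np.uint8)   # dummy frame
curr = np.roll(prev, shift=2, axis=1)                     # simulated motion

p0 = cv2.goodFeaturesToTrack(prev, maxCorners=100,
                             qualityLevel=0.01, minDistance=7)
if p0 is not None:
    p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None)
    flow = (p1 - p0).reshape(-1, 2)[status.flatten() == 1]  # displacements
    print("median flow (dx, dy):", np.median(flow, axis=0))
```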
Slide 26: GUFE: Grand Unified Feature Extractor
Designed by Dennis
Collects features generated by team members into one standard format
Retrieval by query expansion based on nearest neighbors
Feature normalization/combination
Result visualization
Slide 27: PART III: Algorithms
Slide 28: Observations
Samples under the same category are more semantically similar to each other.
The shot boundaries are not well defined.
Some of the key frames are not labeled correctly, e.g., VT1 101, 103 (26-141).
Slide 29: Algorithms: Query Expansion
Input: a query image and its category number.
0. Preprocessing: compute the matching between the evaluation and the development data.
1. Expand the query image by retrieving all images from the development data set with the same category.
2. Search the evaluation set with the expanded query.
Output: the top 50/20 results.
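A minimal sketch of this query-expansion retrieval, assuming precomputed feature matrices; the min-distance-to-expanded-set scoring and all names are illustrative, not the team's exact method.

```python
import numpy as np

def expand_and_search(query_feat, query_cat, dev_feats, dev_labels,
                      eval_feats, top_k=50):
    # 1. Expand the query with all development images of the same category.
    expanded = np.vstack([query_feat[None, :],
                          dev_feats[dev_labels == query_cat]])
    # 2. Score each evaluation frame by its distance to the nearest member
    #    of the expanded query set, then rank ascending.
    d = np.linalg.norm(eval_feats[:, None, :] - expanded[None, :, :], axis=2)
    scores = d.min(axis=1)
    return np.argsort(scores)[:top_k]       # indices of the top hits

# Hypothetical usage with random 64-dim features.
rng = np.random.default_rng(1)
hits = expand_and_search(rng.normal(size=64), 101,
                         rng.normal(size=(200, 64)),
                         rng.integers(100, 121, size=200),
                         rng.normal(size=(500, 64)), top_k=20)
```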
Slide 30: Algorithms: GMM-Based Approach
Motivation: use a GMM to model the distribution of patches.
1. Train a UBM (Universal Background Model) on patches from all training images.
2. MAP-estimate the distribution of the patches belonging to each image, given the UBM.
3. Compute pairwise image distances based on a patch kernel and within-class covariance normalization.
4. Retrieve images based on the normalized distance.
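A sketch of steps 1-2 (UBM training plus MAP adaptation of the means only), using scikit-learn's GaussianMixture as the UBM; the patch matrices and the relevance factor are assumptions, and the distance/within-class-normalization step is omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(all_patches, n_components=64):
    ubm = GaussianMixture(n_components, covariance_type="diag", max_iter=50)
    ubm.fit(all_patches)                      # 1. universal background model
    return ubm

def map_adapt_means(ubm, patches, r=16.0):
    """2. MAP-adapt the UBM means to one image's patches (relevance factor r)."""
    post = ubm.predict_proba(patches)         # (n_patches, n_components)
    n_i = post.sum(axis=0) + 1e-10
    e_i = post.T @ patches / n_i[:, None]     # per-component data means
    alpha = (n_i / (n_i + r))[:, None]
    return alpha * e_i + (1 - alpha) * ubm.means_

rng = np.random.default_rng(2)
ubm = train_ubm(rng.normal(size=(5000, 32)), n_components=8)
image_model = map_adapt_means(ubm, rng.normal(size=(300, 32)))
```

Pairwise image distances (step 3) would then be computed between these adapted models.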
Slide 31: PART IV: Performance
Slide 32: VT1 Performance (#2 of 8)
Category: MAP
101. Crowd (>10 people): 0.8419
102. Building with sky as backdrop, clearly visible: 0.977
103. Mobile devices including handphone/PDA: 0.028
107. Person using computer, both visible: 0.2281
109. Company trademark, including billboard, logo: 0.96
112. Closeup of hand, e.g., using mouse, writing, etc.: 0.4584
113. Business meeting (>2 people), mostly seated down, table visible: 0.0644
115. Food on dishes, plates: 0.2285
116. Face closeup, occupying about 3/4 of screen, frontal or side: 0.9783
117. Traffic scene, many cars, trucks, road visible: 0.2901
Slide 33: VT2 Performance (#1 of 8)
Category: MAP
202. Talking face with introductory caption: 0.8432
206. Static or minute camera movement, people walking, legs visible: 0.0581
207. Large camera movement, panning left/right, top/down of a scene: 0.7789
208. Movie ending credits: 0.2782
209. Woman monologue (Zhen): 0.9756
Slide 34: Performance in Round 3 (#1 of 7)
Task 2 (VT1): estimated MAP (R=20) per target:
101. Crowd (>10 people): 0.64
102. Building with sky as backdrop, clearly visible: 1.0
107. Person using computer, both visible: 0.7
112. Closeup of hand, e.g., using mouse, writing, etc.: 0.527
116. Face closeup, occupying about 3/4 of screen, frontal or side: 1.0
Task 3 (AT1 + VT2): video MAP (R=20), VT2 only vs. AT1 + VT2:
202. Talking face with introductory caption: 1.0 (VT2 only) vs. 0.03 (AT1 + VT2)
209. Woman monologue: 0.35 (VT2 only) vs. 0.1 (AT1 + VT2)
201. People entering door: N/A
Overall, we are:
2nd in audio search
4th in video search
2nd in AV search
1st overall
Slide 35: Illinois Group in Star Challenge, Part II: Audio Search
What this part covers:
A general indexing/retrieval approach leveraging speech recognition output lattices
Experience in a real-world audio retrieval task: the Star Challenge
Experience in speech retrieval in an unknown language
Slide 36: (Audio) Information Retrieval: Problem Definition
Task description: given a query, find the "most relevant" segments in a database.
Example: the query /k r u: d p r ai s ^ z/ ("CRUDE PRICES") runs against thousands of audio files; the system returns the top N file IDs.
Slide 37: (Audio) Information Retrieval: "Standard" Methods
Published algorithms:
EXACT MATCH: segment* = argmin_i d(query, segment_i), where d is the string edit distance. Fast.
SUMMARY STATISTICS: segment* = argmax_i p(query | segment_i); bag-of-words, no concept of "sequence". Good for text, e.g., Google, Yahoo, etc.
TRANSFORM AND INFER: segment* = argmax_i p(query | segment_i) ≈ argmax_i E(count(query) | segment_i); word order matters. Flexible, but slow. A toy version of the EXACT MATCH scheme is sketched below.
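A toy illustration of the EXACT MATCH scheme over phone strings; real systems match substrings rather than whole segments, and the segment transcripts here are invented.

```python
# Rank segments by string edit distance to the query phone sequence.
def edit_distance(a, b):
    """Classic Levenshtein distance over symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

query = "k r u: d".split()
segments = {"seg1": "dh ^ k r u: d oi l".split(),
            "seg2": "p r ai s".split()}
best = min(segments, key=lambda s: edit_distance(query, segments[s]))
print(best)   # seg1
```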
Slide 38: Example: Language-Independent Speech Information Retrieval
Frontend: voice activity detection, perceptual frequency warping, and Gaussian mixtures, producing likelihood vectors b_i = p(observation_t | state_t = i).
Retrieval ranking: E(count(query) | segment observations).
Inference algorithm: a finite state transducer built from ASR lattices computes E(count(query) | observations).
Slide 39: STAR Challenge Audio Retrieval Tasks
AT1. Query: an IPA sequence. Target: segments that contain the query IPA sequence, regardless of language.
AT2. Query: an utterance spoken by different speakers. Target: all segments that contain the query word/phrase/sentence, regardless of the spoken language.
Data set: 25-hour monolingual database in Round 1; 13-hour multilingual database in Round 3.
Slide 40: STAR Challenge Audio Retrieval Tasks
Genuine retrieval tasks, without pre-defined categories.
The queries are human speech or IPA sequences, in one or multiple languages.
Queries might be only part of the speech in the provided database segments.
The returned hits should be ordered, and only the first 50 or 20 are submitted.
Slide 41: System Overview
Speech recognition: audio archive → feature extraction on short-time windows → spectral features → speech recognition → lattices.
Indexing: FSA generation for all audio segments → FSM indices → group indices → all group indices.
Query: FSM-based query construction, drawing on empirical and knowledge-based phone confusion.
Retrieval: FSM-based retrieval over the group indices → retrieval results.
Slide 42: Automatic Speech Recognition
Better performance with language-specific systems than with language-independent systems: there is no inter-language mismatch between training and testing (acoustic model, pronunciation model, language model). E.g., for English data, we can use all our knowledge of English (training data, pronunciation dictionary, language model) in language-specific (e.g., English) word recognition.
Slide 43: Automatic Speech Recognition
Multilingual database and queries:
We might fail to identify which particular language the test data is in; even if we know, we might not have speech recognition models for that language.
We might encounter unseen languages (Tamil, Malay, …).
Solution: language-independent phone-based recognition instead of word-based recognition.
Slide 44: Summary of corpora of different languages
Slide 45: Recognition results: not all languages/datasets are equal
Slide 46: Acoustic Model
Spectral features: 39-dim PLP, with cepstral mean/variance normalization per speaker.
Modeling: HMMs with {11,13,15,17}-mixture Gaussians.
Context-dependent, language-dependent modeling: left context - CENTRAL PHONE + right context % language, referred to as language-dependent "triphones".
E.g., the sound /A/ in different contexts and different languages: ^'-A+b%Eng, ^'-A+b'%Eng, >-A+cm%Chinese, …
Slide 47: Acoustic Model: Context-Dependent Phones
Slide 48: Acoustic Model: Clustering
Categories for decision tree questions:
Right or left context
Distinctive phone features (manner/place of articulation)
Language identity
Lexical stress
Punctuation mark
E.g., ^'-A+b%Eng, ^'-A+b'%Eng, >-A+cm%Chinese, …
Slide 49: Language/Sequence Model: N-gram
If there is a pronunciation model, a particular sequence of context-dependent phone models (acoustic models) can be converted to a particular word sequence, which is modeled by an N-gram.
If there is no pronunciation model, the context-dependent phone sequence is modeled directly by an N-gram, as in the toy sketch below.
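A toy bigram (N-gram with N=2) over phone sequences, to make the idea concrete; the training sequences here are invented.

```python
from collections import Counter

def train_bigram(sequences):
    """Maximum-likelihood bigram over symbol sequences with sentence markers."""
    uni, bi = Counter(), Counter()
    for seq in sequences:
        toks = ["<s>"] + seq + ["</s>"]
        uni.update(toks[:-1])                      # bigram histories
        bi.update(zip(toks[:-1], toks[1:]))        # adjacent pairs
    return lambda prev, cur: bi[(prev, cur)] / uni[prev] if uni[prev] else 0.0

p = train_bigram([["k", "r", "u:", "d"], ["k", "r", "ai"]])
print(p("k", "r"))    # 1.0: /r/ always follows /k/ in this toy corpus
print(p("r", "u:"))   # 0.5
```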
Slide 50: Solutions for English Speech Retrieval
Speech recognition frontend:
Features: Perceptual Linear Predictive cepstra + energy, with first- and second-order regression coefficients.
Speech recognizer: triphone-clustered English speech recognizer (English acoustic model, English dictionary, English language model); lattice generation using a Viterbi decoder.
Pipeline: audio archive → feature extraction on short-time windows → spectral features → speech recognition → lattices.
Slide 51: Using lattices as the representation of speech data
A compact way to represent the numerous alternative hypotheses output by the speech recognizer.
A simple example: a lattice encoding "a b a" or "b a".
Some more complex examples follow.
Slide 52: Some more complex lattice examples (from Mangu et al., 1999 and James, 1995)
Slide 53: Using lattices as the representation of speech data
Lattices enable more robust speech retrieval: the single best hypothesis from speech recognition is not reliable enough.
Lattices can be represented as finite state machines, which can be used in speech retrieval and take advantage of general weighted finite-state machine algorithms.
This allows robust matching between the query and the audio files, as in the toy sketch below.
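A tiny illustration of why lattices help, assuming the "a b a" or "b a" example above: the expected count of a query over all weighted paths. The real system does this with weighted FSM operations; this sketch enumerates paths explicitly, which is only feasible for toy lattices.

```python
def paths(lattice, node, prob=1.0, seq=()):
    """Enumerate (label_sequence, path_probability) from `node` to 'END'."""
    if node == "END":
        yield seq, prob
        return
    for label, nxt, p in lattice[node]:
        yield from paths(lattice, nxt, prob * p, seq + (label,))

def expected_count(lattice, query):
    q, total = tuple(query), 0.0
    for seq, prob in paths(lattice, "START"):
        hits = sum(seq[i:i + len(q)] == q for i in range(len(seq) - len(q) + 1))
        total += prob * hits
    return total

# "a b a" with prob 0.6, "b a" with prob 0.4 (the simple example above).
lattice = {"START": [("a", "N1", 0.6), ("b", "N2", 0.4)],
           "N1": [("b", "N2", 1.0)],
           "N2": [("a", "END", 1.0)]}
print(expected_count(lattice, ["b", "a"]))   # 1.0: both paths contain "b a"
```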
Slide 54: Solutions for English Speech Retrieval: Indexing the Audio Archive
Input from the ASR frontend: lattices of database segments, plus the English vocabulary.
Construct log-semiring automata for each segment.
Construct FST-based segment index files.
Combine segment index files into a few group index files for retrieval.
Pipeline: FSA generation → FSM index per segment → group indices → all group indices.
Slide 55: Indexing the audio archive (example, following Allauzen et al., 2004)
Speech recognition output: lattices for two audio files, and the index built for the two files.
Slide 56: Solutions for English Speech Retrieval: Making FST-Based Queries
Queries provided as IPA sequences: build an automaton from the IPA sequence, expanding each IPA arc into alternative arcs for "similar" IPA symbols; build a query FSA incorporating constraints.
Queries provided as audio waveforms: process with the ASR frontend, build log-semiring automata, and build the query FSA.
Slide 57: Example of query expansion (phone sequence to word FSA)
The query /k r u: d p r ai s ^ z/ ("CRUDE PRICES") goes through FSM-based query construction, drawing on both empirical and knowledge-based phone confusion. A toy sketch of confusion-based expansion follows.
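A toy sketch of confusion-based query expansion; the confusion table is invented, standing in for the empirical and knowledge-based sources, and the real system encodes the alternatives as extra FSA arcs rather than enumerating variants.

```python
# Hypothetical phone-confusion table: phone -> confusable alternatives.
CONFUSABLE = {"u:": ["u", "U"], "s": ["z"], "ai": ["ei"]}

def expand_query(phones):
    """Yield phone-sequence variants, one substitution at a time."""
    yield list(phones)
    for i, ph in enumerate(phones):
        for alt in CONFUSABLE.get(ph, []):
            variant = list(phones)
            variant[i] = alt
            yield variant

for v in expand_query("k r u: d".split()):
    print(" ".join(v))       # k r u: d / k r u d / k r U d
```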
Slide 58: Solutions for English Speech Retrieval: Retrieval Using Queries
Parallel retrieval over all group index files.
Order the retrieved segment IDs; truncate and format the results.
Fuse results with different precision/recall tradeoffs, obtained under different settings.
Slide 59: Solutions for Multilingual Speech Retrieval
ASR frontend: multilingual word/phone speech recognizer; language-independent phone recognizer.
Indexing the audio archive: multilingual index; language-independent index.
Making FST-based queries: multilingual queries; language-independent queries.
Retrieval using queries: same as in language-specific retrieval.
Slide 60: Rough CPU Time for Audio Search
ASR frontend:
Phone recognizer with bigram phone LM: ~25% of real time
3k-word monophone-based word recognizer with unigram word LM: ~100% of real time
20k-word monophone-based word recognizer with unigram word LM: ~2500% of real time
Triphone recognizer with bigram phone LM: ~10000% of real time
60k-word triphone-based word recognizer with unigram word LM: ~10000% of real time
Indexing: less than ~50% of real time
Query construction: ~10 seconds each
Retrieval: ~1% of real time
(Here "x% of real time" means processing one hour of audio takes x/100 hours, so the ~25% phone recognizer handles an hour of audio in about 15 minutes.)
Slide 61: Possible Accelerations
ASR frontend: faster Viterbi decoder; optimized FSM-based decoder; optimized parameters.
Indexing: optimized parallel job splitting; applying a constraining/filtering FSA before building the group indices.
Retrieval: optimized parallel job splitting; pre-retrieval using less complicated techniques.
Slide 62: Audio Retrieval: What is the "optimal" model set?
The task again: the query /k r u: d p r ai s ^ z/ ("CRUDE PRICES") against thousands of audio files, returning the top N file IDs.
Candidate model sets:
Language-specific phones/words
Language-independent phones
General sub-word units
Size of inventory: using only the frequent symbols is better.
Data-driven: units selected by a clustering tree.
Slide 63: Automatic Speech Retrieval in an Unknown Language
(Recap of the system diagram: spectral features → speech recognition → lattices, then FSM-based indexing, query construction, and retrieval; example query /k r u: d p r ai s ^ z/, "CRUDE PRICES".)
Speech and queries in an unknown language: represented as what? Words, triphones, phones, or other acoustic units?
Approach: FSM-based fuzzy matching and retrieval.
Slide 64: Automatic Speech Retrieval in an Unknown Language
Modeled as a special case of the cognitive process called assimilation: language-dependent (LD) phones and their LD models (e.g., English, Russian, Spanish phones) are grouped into language-independent (LID) phone clusters based on pairwise KL divergence; see the sketch below. Speech in an unknown language is then represented as LID phone lattices.
Accommodation (introducing new models)?
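A rough sketch of grouping language-dependent phone models by pairwise KL divergence, as the slide describes; reducing each phone to a single diagonal Gaussian and using average-linkage clustering are simplifying assumptions, not the actual system.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def kl_diag(mu1, var1, mu2, var2):
    """KL divergence between two diagonal Gaussians."""
    return 0.5 * np.sum(var1 / var2 + (mu2 - mu1) ** 2 / var2
                        - 1.0 + np.log(var2 / var1))

def cluster_phones(models, n_clusters):
    """models: {phone_name: (mean_vector, variance_vector)} -> cluster labels."""
    names = list(models)
    n = len(names)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            m1, v1 = models[names[i]]
            m2, v2 = models[names[j]]
            # Symmetrized KL as the pairwise distance.
            d[i, j] = d[j, i] = (kl_diag(m1, v1, m2, v2)
                                 + kl_diag(m2, v2, m1, v1))
    labels = fcluster(linkage(squareform(d), method="average"),
                      n_clusters, criterion="maxclust")
    return dict(zip(names, labels))

rng = np.random.default_rng(3)
models = {f"ph{i}": (rng.normal(size=4), rng.uniform(0.5, 2.0, size=4))
          for i in range(6)}
print(cluster_phones(models, n_clusters=3))
```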
Slide 65: Multilingual Subword Units in an Unknown Language
Language-dependent versions of the same IPA symbol can end up in one cluster (e.g., /z/ or /t∫/) or in different clusters (e.g., /j/ or /I/); conversely, different IPA symbols may be similar.
For both retrieval with an audio query and retrieval with an IPA sequence query, clustering gives a more compact phone set and performs better.
(Results figure: Croatian phone recognition, with Croatian either included in training or treated as an unknown language.)
Slide 66: Thank you.