Slide 1
Accounting for the relative importance of objects in image retrieval
Sung Ju Hwang and Kristen Grauman
University of Texas at Austin

Slide 2
Image retrieval
[Figure: a query image matched against the image database, returning Image 1, Image 2, …, Image k]
Content-based retrieval from an image database.

Slide 3
Relative importance of objects
[Figure: a query image and candidate database images]
Which image is more relevant to the query?

Slide 4
Relative importance of objects
[Figure: query image tagged cow, bird, water; database images tagged cow, bird, water, sky and cow, fence, mud]
Which image is more relevant to the query?

Slide 5
Relative importance of objects
An image can contain many different objects, but some are more “important” than others.
[Figure: scene labeled sky, water, mountain, architecture, bird, cow]

Slide 6
Relative importance of objects
Some objects are background.
[Figure: same scene labeled sky, water, mountain, architecture, bird, cow]

Slide 7
Relative importance of objects
Some objects are less salient.
[Figure: same scene and labels]

Slide 8
Relative importance of objects
Some objects are more prominent or perceptually define the scene.
[Figure: same scene and labels]

Slide 9
Our goal
Goal: retrieve those images that share important objects with the query image.
[Figure: two candidate retrievals compared side by side ("versus")]
How to learn a representation that accounts for this?

Slide 10
Idea: image tags as importance cue
The order in which a person assigns tags provides implicit cues about the objects' importance to the scene.
TAGS: Cow, Birds, Architecture, Water, Sky

Slide 11
Idea: image tags as importance cue
TAGS: Cow, Birds, Architecture, Water, Sky
The order in which a person assigns tags provides implicit cues about the objects' importance to the scene.
Learn this connection to improve cross-modal retrieval and CBIR; then query with untagged images to retrieve the most relevant images or tags.

Slide 12
Related work
Previous work using tagged images focuses on the noun ↔ object correspondence:
Duygulu et al. 2002; Lavrenko et al. 2003; Monay & Gatica-Perez 2003; Berg et al. 2004; Barnard et al. 2004; Fergus et al. 2005; Schroff et al. 2007; Gupta & Davis 2008; Li et al. 2009; …
Related work building richer image representations from "two-view" text+image data:
Hardoon et al. 2004; Bekkerman & Jeon 2007; Quattoni et al. 2007; Quack et al. 2008; Gupta et al. 2008; Blaschko & Lampert 2008; Qi et al. 2009; Yakhnenko & Honavar 2009; …
[Figure: example text+image pair, e.g. a player profile: "height: 6-11, weight: 235 lbs, position: forward, croatia, college: …"]

Slide 13
Approach overview: building the image database
1. Extract visual and tag-based features from the tagged training images.
2. Learn projections from each feature space into a common "semantic space".
[Figure: tagged training images, with tag lists such as Cow, Grass / Horse, Grass, Car / House, Grass, Sky / …]

Slide 14
Approach overview: retrieval from the database
Three retrieval tasks against the image database:
- Image-to-image retrieval: untagged query image → retrieved images
- Tag-to-image retrieval: tag-list query (e.g., Cow, Tree, Grass) → retrieved images
- Image-to-tag auto annotation: untagged query image → retrieved tag list (e.g., Cow, Tree)

Slide 15
Dual-view semantic space
Visual features and tag lists are two views generated by the same underlying concept.
[Figure: both views mapped into a common semantic space]

Slide 16
Learning mappings to semantic space
Canonical Correlation Analysis (CCA): choose projection directions that maximize the correlation of the two views (View 1, View 2) projected from the same instance.
Semantic space: the new common feature space.

Slide 17
Kernel Canonical Correlation Analysis
Linear CCA: given paired data {(x_i, y_i)}, select directions w_x, w_y so as to maximize the canonical correlation

    max_{w_x, w_y}  (w_x' C_xy w_y) / sqrt((w_x' C_xx w_x)(w_y' C_yy w_y))

Kernel CCA: same objective, but with the projections in kernel space, given a pair of kernel functions k_x(·,·), k_y(·,·).
[Akaho 2001; Fyfe et al. 2001; Hardoon et al. 2004]

Slide 18
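As a concrete illustration, here is a minimal numpy sketch of linear CCA on synthetic two-view data; the synthetic data, dimensions, and regularizer are illustrative assumptions (the paper itself uses kernel CCA on visual and tag kernels):

```python
import numpy as np

def cca_top_pair(X, Y, reg=1e-6):
    """Top pair of CCA directions: whiten each view, then take the leading
    singular pair of the whitened cross-covariance matrix."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return Wx @ U[:, 0], Wy @ Vt[0], s[0]  # directions + top canonical correlation

# Two "views" driven by a shared latent concept, plus view-specific noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
X = z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
Y = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))
wx, wy, rho = cca_top_pair(X, Y)
print(rho > 0.9)  # the shared concept yields a strongly correlated pair
```

In the kernel variant, the same maximization is carried out over coefficient vectors in the span of the training kernels, which is what lets heterogeneous visual and tag kernels share one semantic space.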
Building the kernels for each view
[Figure: word frequency / rank kernels (tag view) and visual kernels (visual view) feeding into the semantic space]

Slide 19
Visual features
- Color histogram: captures the HSV color distribution
- Gist [Torralba et al.]: captures the total scene structure
- Visual words (k-means on DoG+SIFT): captures local appearance
Average the component χ² kernels to build a single visual kernel.

Slide 20
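The averaged visual kernel can be sketched as follows; the γ value and the three toy histogram channels are illustrative assumptions:

```python
import numpy as np

def chi2_kernel(A, B, gamma=1.0, eps=1e-12):
    """Exponentiated chi-squared kernel between rows of A and B,
    suitable for histogram features (color, gist, visual words)."""
    diff = A[:, None, :] - B[None, :, :]
    summ = A[:, None, :] + B[None, :, :] + eps
    dist = (diff * diff / summ).sum(axis=-1)   # chi^2 distance
    return np.exp(-gamma * dist)

# Toy per-channel histograms for 3 images (rows sum to 1):
# stand-ins for the color, gist, and visual-word channels.
rng = np.random.default_rng(1)
channels = [rng.dirichlet(np.ones(8), size=3) for _ in range(3)]

# Average the component chi^2 kernels into a single visual kernel.
K_visual = sum(chi2_kernel(H, H) for H in channels) / len(channels)
print(np.allclose(np.diag(K_visual), 1.0))  # identical histograms → kernel value 1
```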
Tag features: word frequency
Traditional bag-of-(text)words. Example tag list: Cow, Bird, Water, Architecture, Mountain, Sky.

tag           count
Cow           1
Bird          1
Water         1
Architecture  1
Mountain      1
Sky           1
Car           0
Person        0

Slide 21
Tag features: absolute rank
Absolute rank in this image's tag list.

tag           value
Cow           1
Bird          0.63
Water         0.50
Architecture  0.43
Mountain      0.39
Sky           0.36
Car           0
Person        0

Slide 22
Tag features: relative rank
Percentile rank, compared to the word's typical rank in all tag lists.

tag           value
Cow           0.9
Bird          0.6
Water         0.8
Architecture  0.5
Mountain      0.8
Sky           0.8
Car           0
Person        0

Slide 23
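The three tag encodings can be sketched as below. The 1/log2(1+r) decay for absolute rank reproduces the example values on the slide (1, 0.63, 0.50, 0.43, 0.39, 0.36), but that exact formula, the toy vocabulary, and the percentile rule for relative rank are our assumptions:

```python
import numpy as np

VOCAB = ["Cow", "Bird", "Water", "Architecture", "Mountain", "Sky", "Car", "Person"]
IDX = {w: i for i, w in enumerate(VOCAB)}

def word_frequency(tags):
    """Traditional bag-of-words: 1 if the tag is present, else 0."""
    v = np.zeros(len(VOCAB))
    for t in tags:
        v[IDX[t]] = 1.0
    return v

def absolute_rank(tags):
    """Decays with the tag's absolute rank r in this image's tag list."""
    v = np.zeros(len(VOCAB))
    for r, t in enumerate(tags, start=1):
        v[IDX[t]] = 1.0 / np.log2(1 + r)
    return v

def relative_rank(tags, training_tag_lists):
    """Percentile of the tag's rank here vs. its ranks in all training lists
    (fraction of lists where it was ranked no better than here)."""
    v = np.zeros(len(VOCAB))
    for r, t in enumerate(tags, start=1):
        past = [lst.index(t) + 1 for lst in training_tag_lists if t in lst]
        if past:
            v[IDX[t]] = np.mean([r <= p for p in past])
    return v

tags = ["Cow", "Bird", "Water", "Architecture", "Mountain", "Sky"]
v = absolute_rank(tags)  # matches the slide: 1, 0.63, 0.50, 0.43, 0.39, 0.36
```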
Building the kernels for each view
[Figure: word frequency / rank kernels (tag view) and visual kernels (visual view) feeding into the semantic space]

Slide 24
Experiments
We compare the retrieval performance of our method (KCCA semantic space) with two baselines:
- Visual-only baseline
- Words+Visual baseline [Hardoon et al. 2004; Yakhnenko et al. 2009]
[Figure: query image and 1st retrieved image for each method]

Slide 25
Evaluation
We use Normalized Discounted Cumulative Gain at top K (NDCG@K) to evaluate retrieval performance:

    NDCG@K = (1/Z) · Σ_{p=1..K} reward(p) / log2(1 + p)

where reward(p) is the score for the p-th ranked example and Z is the sum of all the scores for the perfect ranking (normalization). Doing well in the top ranks is more important.
[Kekäläinen & Järvelin, 2002]

Slide 26
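A minimal implementation of NDCG@K under a log2 discount (the discount base and the toy reward values are assumptions):

```python
import numpy as np

def ndcg_at_k(rewards, k):
    """NDCG@K: discounted cumulative gain of the returned ranking, normalized
    by the gain of the perfect (descending-reward) ranking.
    rewards[p-1] = score of the p-th retrieved item; assumes len(rewards) >= k."""
    rewards = np.asarray(rewards, dtype=float)
    discount = 1.0 / np.log2(np.arange(2, k + 2))      # 1/log2(1+p), p = 1..k
    dcg = (rewards[:k] * discount).sum()               # gain of this ranking
    ideal = (np.sort(rewards)[::-1][:k] * discount).sum()  # perfect ranking
    return dcg / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1], k=3))        # perfect ranking → 1.0
print(ndcg_at_k([1, 2, 3], k=3) < 1.0)  # misranked top items are penalized
```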
Evaluation
We present the NDCG@K scores using two different reward terms:
- Object presence/scale: rewards similarity between the query's objects/scales and those in the retrieved image(s).
- Ordered tag similarity: rewards similarity between the query's ground-truth tag ranks and those in the retrieved image(s).
[Figure: example ordered tag lists, e.g. Cow, Tree, Grass, Person vs. Cow, Tree, Fence, Grass]

Slide 27
Datasets

                  LabelMe          Pascal
Images            6352             9963
Database          3799             5011
Query             2553             4952
Focus             scene-oriented   object-central
Ordered tags      label-add order  Mechanical Turk
Unique taggers    56               758
Tags per image    ~23              ~5.5

Slide 28
Image-to-image retrieval
We want to retrieve the images most similar to the given query image in terms of object importance.
[Figure: untagged query image projected via the visual kernel space into the semantic space, alongside the tag-list kernel space; retrieved images from the image database]

Slide 29
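A hedged sketch of the retrieval step, assuming a visual-view projection (here a matrix Wx) has already been learned by CCA/KCCA; the cosine-similarity ranking and the toy data are illustrative choices:

```python
import numpy as np

def retrieve(query_feat, db_feats, Wx, k=3):
    """Project the untagged query and all database images into the semantic
    space with the visual-view projection Wx, then rank by cosine similarity."""
    q = query_feat @ Wx
    D = db_feats @ Wx
    q = q / np.linalg.norm(q)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    return np.argsort(-(D @ q))[:k]   # indices of the k most similar images

# Toy database: 4 "images" in a 5-D visual feature space, identity projection.
db = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0, 0.0],
               [0.9, 0.1, 0.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0, 0.0]])
Wx = np.eye(5)
ranked = retrieve(db[0], db, Wx, k=2)  # image 0 itself first, then image 2
```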
Image-to-image retrieval results
[Figure: query image with top retrievals for our method, the Words+Visual baseline, and the Visual-only baseline]

Slide 30
Image-to-image retrieval results
[Figure: another query image with top retrievals for our method, Words+Visual, and Visual-only]

Slide 31
Image-to-image retrieval results
Our method better retrieves images that share the query's important objects, by both measures (up to 39% improvement).
[Figure: retrieval accuracy measured by object+scale similarity and by ordered tag-list similarity]

Slide 32
Tag-to-image retrieval
We want to retrieve the images that are best described by the given tag list.
[Figure: query tags (Cow, Person, Tree, Grass) projected via the tag-list kernel space into the semantic space; retrieved images from the image database]

Slide 33
Tag-to-image retrieval results
Our method better respects the importance cues implied by the user's keyword query (31% improvement).
[Figure: retrieval accuracy comparison]

Slide 34
Image-to-tag auto annotation
We want to annotate the query image with ordered tags that best describe the scene.
[Figure: untagged query image projected into the semantic space; output tag lists such as Cow, Tree, Grass / Cow, Grass, Field / Cow, Fence]

Slide 35
Image-to-tag auto annotation results
[Figure: example annotations, e.g. Boat, Person, Water, Sky, Rock / Bottle, Knife, Napkin, Light, Fork / Person, Tree, Car, Chair, Window / Tree, Boat, Grass, Water, Person]

Method        k=1      k=3      k=5      k=10
Visual-only   0.0826   0.1765   0.2022   0.2095
Word+Visual   0.0818   0.1712   0.1992   0.2097
Ours          0.0901   0.1936   0.2230   0.2335

(k = number of nearest neighbors used)

Slide 36
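One way to realize nearest-neighbor annotation: take the k nearest database images in the semantic space and merge their ordered tag lists. The reciprocal-rank aggregation rule below is our assumption, not necessarily the paper's exact scheme:

```python
from collections import defaultdict

def annotate(neighbor_tag_lists):
    """Merge the neighbors' ordered tag lists: score each tag by its summed
    reciprocal rank across neighbors, then emit tags in decreasing score."""
    scores = defaultdict(float)
    for tags in neighbor_tag_lists:
        for rank, tag in enumerate(tags, start=1):
            scores[tag] += 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

# Neighbors' tag lists as on the earlier example (Cow is consistently first).
neighbors = [["Cow", "Tree", "Grass"], ["Cow", "Grass", "Field"], ["Cow", "Fence"]]
print(annotate(neighbors)[:2])   # ['Cow', 'Grass']
```

Tags agreed on by many neighbors, and ranked early by them, dominate the output list, which is what lets the ordered output reflect importance rather than mere presence.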
Implicit tag cues as localization prior [Hwang & Grauman, CVPR 2010]
Training: learn an object-specific connection between localization parameters and implicit tag features, P(location, scale | tags).
Testing: given a novel image, localize objects based on both tags and appearance (object detector + implicit tag features).
[Figure: example images with tag lists such as Woman, Table, Mug, Ladder / Mug, Key, Keyboard, Toothbrush, Pen, Photo, Post-it / Computer, Poster, Desk, Screen, Mug / Mug, Eiffel, Desk / Mug, Office / Mug, Coffee]

Slide 37
Conclusion
We want to learn what is implied (beyond which objects are present) by how a human provides tags for an image.
Our approach requires minimal supervision to learn the connection between the importance conveyed by tags and visual features.
It shows consistent gains over:
- content-based visual search
- a tag+visual approach that disregards importance