Slide1
Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities
Bangpeng Yao and Li Fei-Fei
Computer Science Department, Stanford University
{bangpeng,feifeili}@cs.stanford.edu
Slide2
Human-Object Interaction
Robots interact with objects
Automatic sports commentary: “Kobe is dunking the ball.”
Medical care
Slide3
Vs.
Human-Object Interaction
Playing saxophone
Playing bassoon
Playing saxophone
Grouplet is a generic feature for structured objects, or interactions of groups of objects.
(Previous talk: Grouplet)
Caltech101
HOI activity: Tennis Forehand
Holistic image based classification
Detailed understanding and reasoning
Berg & Malik, 2005: 48%
Grauman & Darrell, 2005: 59%
Gehler & Nowozin, 2009: 77%
OURS: 62%
Slide4
Human-Object Interaction
Holistic image based classification
Detailed understanding and reasoning
Human pose estimation: Torso, Right-arm, Left-arm, Right-leg, Left-leg, Head
Object detection: Tennis racket
HOI activity: Tennis Forehand
Slide7
Outline
Background and Intuition
Mutual Context of Object and Human Pose
  Model Representation
  Model Learning
  Model Inference
Experiments
Conclusion
Slide9
Human pose estimation & Object detection
Human pose estimation is challenging.
Felzenszwalb & Huttenlocher, 2005
Ren et al, 2005
Ramanan, 2006
Ferrari et al, 2008
Yang & Mori, 2008
Andriluka et al, 2009
Eichner & Ferrari, 2009
Difficult part appearance
Self-occlusion
Image region looks like a body part
Slide11
Human pose estimation & Object detection
Facilitate: object detection helps human pose estimation, given the object is detected.
Slide12
Human pose estimation & Object detection
Object detection is challenging.
Viola & Jones, 2001
Lampert et al, 2008
Divvala et al, 2009
Vedaldi et al, 2009
Small, low-resolution, partially occluded
Image region similar to detection target
Slide14
Human pose estimation & Object detection
Facilitate: human pose estimation helps object detection, given the pose is estimated.
Slide15
Human pose estimation & Object detection
Mutual Context
Slide16
Context in Computer Vision
Previous work – Use context cues to facilitate object detection:
Hoiem et al, 2006
Rabinovich et al, 2007
Oliva & Torralba, 2007
Heitz & Koller, 2008
Desai et al, 2009
Divvala et al, 2009
Murphy et al, 2003
Shotton et al, 2006
Harzallah et al, 2009
Li, Socher & Fei-Fei, 2009
Marszalek et al, 2009
Bao & Savarese, 2010
Helpful, but methods with context outperform those without context only moderately (~3-4%).
Viola & Jones, 2001
Lampert et al, 2008
Slide17
Context in Computer Vision
Previous work – Use context cues to facilitate object detection: helpful, but methods with context outperform those without context only moderately (~3-4%).
Our approach – Two challenging tasks serve as mutual context of each other:
With mutual context:
Without context:
Slide18
Slide19
Mutual Context Model Representation
A: Activity (Croquet shot, Volleyball smash, Tennis forehand)
O: Object (Croquet mallet, Volleyball, Tennis racket)
H: Human pose. Intra-class variations: more than one H for each A; unobserved during training.
P: Body parts P_1, P_2, ..., P_N. l_P: location; θ_P: orientation; s_P: scale.
f: Image evidence f_O, f_1, f_2, ..., f_N. Shape context. [Belongie et al, 2002]
Slide20
Mutual Context Model Representation
Markov Random Field over A, O, H, P_1, ..., P_N and the image evidence f_O, f_1, ..., f_N, with a clique potential and a clique weight for each edge.
Potentials among A, O, and H: frequency of co-occurrence between A, O, and H.
Potentials among O, H, and the body parts: spatial relationship among object and body parts (location, orientation, size).
Learn structural connectivity among the body parts and the object (obtained by structure learning).
Potentials on the image evidence: discriminative part detection scores. [Andriluka et al, 2009]
Shape context + AdaBoost [Belongie et al, 2002] [Viola & Jones, 2001]
Slide24
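The model representation above can be illustrated with a minimal sketch. It assumes the model score is a weighted sum of clique potential values, one per edge; the slide text only names "clique potential" and "clique weight", so this exact form is an assumption. The function names, the toy training labels, and all numeric values are hypothetical and not taken from the talk.

```python
from collections import Counter


def cooccurrence_potential(pairs):
    """Map each observed (a, b) pair to its relative frequency."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def model_score(potentials, weights):
    """Weighted sum of clique potential values (assumed form, one term per edge)."""
    return sum(weights[edge] * value for edge, value in potentials.items())


# Hypothetical (activity, object) labels for four training images.
train_pairs = [
    ("tennis forehand", "tennis racket"),
    ("tennis forehand", "tennis racket"),
    ("volleyball smash", "volleyball"),
    ("croquet shot", "croquet mallet"),
]
psi_AO = cooccurrence_potential(train_pairs)

# One potential value per edge of a tiny graph A - O - P_1 - f_1.
potentials = {
    ("A", "O"): psi_AO[("tennis forehand", "tennis racket")],  # co-occurrence
    ("O", "P_1"): 0.8,    # made-up spatial-relationship value
    ("P_1", "f_1"): 0.6,  # made-up part detection score
}
weights = {("A", "O"): 1.0, ("O", "P_1"): 2.0, ("P_1", "f_1"): 0.5}
score = model_score(potentials, weights)
```

Keeping potentials and weights in separate dictionaries keyed by edge mirrors the split between potential parameters and potential weights that the learning slides treat as separate goals.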
Slide25
Model Learning
Input: cricket shot, cricket bowling
Goals:
Hidden human poses (hidden variables)
Structural connectivity (structure learning)
Potential parameters (parameter estimation)
Potential weights (parameter estimation)
Slide29
Model Learning
Goal: Hidden human poses
Approach: croquet shot
Slide30
Model Learning
Goal: Structural connectivity
Approach: Hill-climbing
Add an edge
Remove an edge
Joint density of the model
Gaussian prior on the number of edges
Slide31
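The hill-climbing search over structural connectivity can be sketched as follows. This is a minimal illustration, assuming each step toggles a single edge (add or remove) and that the score is a data-fit term plus the log of a Gaussian prior on the number of edges. The `fit` function stands in for the model's joint density, which the slide does not spell out; the node names and gain values are hypothetical.

```python
import itertools


def log_gaussian_prior(num_edges, mean, std):
    """Unnormalised log of a Gaussian prior on the number of edges."""
    return -0.5 * ((num_edges - mean) / std) ** 2


def hill_climb(nodes, fit, mean_edges, std_edges, max_steps=100):
    """Greedy search: apply the single add/remove that most improves the score."""
    candidates = [frozenset(p) for p in itertools.combinations(nodes, 2)]

    def score(edges):
        return fit(edges) + log_gaussian_prior(len(edges), mean_edges, std_edges)

    edges = set()
    best = score(edges)
    for _ in range(max_steps):
        # Symmetric difference adds the edge if absent, removes it if present.
        neighbours = [edges ^ {e} for e in candidates]
        top = max(neighbours, key=score)
        if score(top) <= best:
            break  # local optimum: no single change helps
        edges, best = top, score(top)
    return edges, best


# Hypothetical stand-in for the log joint density: a fixed gain per edge.
gain = {
    frozenset({"O", "right-arm"}): 3.0,
    frozenset({"torso", "head"}): 2.0,
    frozenset({"O", "head"}): -1.0,
}


def fit(edges):
    return sum(gain.get(e, -0.5) for e in edges)


nodes = ["O", "torso", "head", "right-arm"]
learned, learned_score = hill_climb(nodes, fit, mean_edges=2, std_edges=1.0)
```

With these toy gains the search adds the object/right-arm edge, then the torso/head edge, and stops because any further change lowers the score. Hill-climbing only guarantees a local optimum, so the result can depend on the starting edge set.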
Model Learning
Goal: Potential parameters
Approach:
Maximum likelihood
Standard AdaBoost
Slide32
Model Learning
Goal: Potential weights
Approach: Max-margin learning
Notations
x_i: Potential values of the i-th image.
w_r: Potential weights of the r-th pose.
y(r): Activity of the r-th pose.
ξ_i: A slack variable for the i-th image.
Slide33
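The notation above can be made concrete with a small sketch of a max-margin objective. The slide's formula did not survive extraction, so the form used here is an assumption: a multiclass hinge constraint in which the score of an image's own pose must exceed, by a margin of 1, the score of every pose that belongs to a different activity, with ξ_i absorbing any violation. The helper names, the `beta` trade-off parameter, and the toy numbers are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def slack(x_i, pose_i, w, y):
    """Smallest xi_i >= 0 such that, for every pose r with y[r] != y[pose_i],
    w[pose_i] . x_i - w[r] . x_i >= 1 - xi_i (assumed constraint form)."""
    own = dot(w[pose_i], x_i)
    violations = [
        1 - (own - dot(w[r], x_i))
        for r in range(len(w))
        if y[r] != y[pose_i]
    ]
    return max([0.0] + violations)


def objective(X, poses, w, y, beta=1.0):
    """Regulariser on the weights plus beta times the total slack."""
    reg = 0.5 * sum(dot(w_r, w_r) for w_r in w)
    return reg + beta * sum(slack(x, p, w, y) for x, p in zip(X, poses))


# Toy setup: three poses, the first two belonging to the same activity.
w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y = ["tennis forehand", "tennis forehand", "volleyball smash"]
X = [[2.0, 0.0], [0.2, 0.4]]  # potential values x_i of two images
poses = [0, 1]                # pose assigned to each image
xi = [slack(x, p, w, y) for x, p in zip(X, poses)]
total = objective(X, poses, w, y)
```

Poses that share an activity are not pushed apart, which lets several poses represent one activity. A solver would minimise `objective` over `w`; this sketch only evaluates it.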
Learning Results
Cricket defensive shot
Cricket bowling
Croquet shot
Tennis serve
Volleyball smash
Tennis forehand
Slide35
Slide36
Model Inference
The learned models
Compositional Inference [Chen et al, 2007]
Head detection
Torso detection
Tennis racket detection
Layout of the object and body parts.
Output
Slide39
Slide40
Dataset and Experiment Setup
Sport data set [Gupta et al, 2009]: 6 classes
Cricket defensive shot
Cricket bowling
Croquet shot
Tennis forehand
Tennis serve
Volleyball smash
180 training (supervised with object and part locations) & 120 testing images
Tasks:
Object detection;
Pose estimation;
Activity classification.
Slide42
Object Detection Results
Cricket bat
Cricket ball
Croquet mallet
Tennis racket
Volleyball
Valid region
Sliding window
Pedestrian context
Our Method
[Andriluka et al, 2009]
[Dalal & Triggs, 2006]
Slide43
Object Detection Results
Volleyball
Cricket ball
Sliding window
Pedestrian context
Our method
Small object
Background clutter
Slide44
Slide45
Human Pose Estimation Results
Method                 Torso  Upper Leg   Lower Leg   Upper Arm   Lower Arm   Head
Ramanan, 2006          .52    .22  .22    .21  .28    .24  .28    .17  .14    .42
Andriluka et al, 2009  .50    .31  .30    .31  .27    .18  .19    .11  .11    .45
Our full model         .66    .43  .39    .44  .34    .44  .40    .27  .29    .58
One pose per class     .63    .40  .36    .41  .31    .38  .35    .21  .23    .52
Tennis serve model: Andriluka et al, 2009; Our estimation result
Volleyball smash model: Andriluka et al, 2009; Our estimation result
Estimation result
Slide48
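To compare the rows of the pose estimation table at a glance, the per-part values can be averaged. The sketch below uses the numbers from the slide; the unweighted mean over the ten columns is a derived summary added here for illustration and is not a figure reported in the talk.

```python
# Per-part values as listed on the slide, in column order:
# torso, upper leg (x2), lower leg (x2), upper arm (x2), lower arm (x2), head.
results = {
    "Ramanan, 2006": [.52, .22, .22, .21, .28, .24, .28, .17, .14, .42],
    "Andriluka et al, 2009": [.50, .31, .30, .31, .27, .18, .19, .11, .11, .45],
    "Our full model": [.66, .43, .39, .44, .34, .44, .40, .27, .29, .58],
    "One pose per class": [.63, .40, .36, .41, .31, .38, .35, .21, .23, .52],
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


averages = {method: round(mean(vals), 3) for method, vals in results.items()}
best_method = max(averages, key=averages.get)

for method, avg in averages.items():
    print(f"{method:<22} {avg:.3f}")
```

Under this summary the full model scores higher than the one-pose-per-class variant, which is consistent with the per-part values in the table.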
Slide49
Activity Classification Results
No scene information
Scene is critical!!
Cricket shot
Tennis forehand
Bag-of-words
SIFT+SVM
Gupta et al, 2009
Our model
Slide50
Conclusion
Human-Object Interaction
Grouplet representation vs. Mutual context model
Next Steps
Pose estimation & Object detection on PPMI images.
Modeling multiple objects and humans.
Slide51
Acknowledgment
Stanford Vision Lab reviewers: Barry Chai (1985-2010), Juan Carlos Niebles, Hao Su
Silvio Savarese, U. Michigan
Anonymous reviewers