Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities - PowerPoint Presentation

Uploaded On 2018-11-01

Presentation Transcript

Slide1

Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities

Bangpeng Yao and Li Fei-Fei

Computer Science Department, Stanford University

{bangpeng,feifeili}@cs.stanford.edu

Slide2

Human-Object Interaction

Robots interact with objects

Automatic sports commentary: “Kobe is dunking the ball.”

Medical care

Slide3

Human-Object Interaction

(Previous talk: Grouplet)

Playing saxophone vs. playing bassoon

Grouplet is a generic feature for structured objects, or interactions of groups of objects.

Holistic image based classification

Caltech101

Berg & Malik, 2005: 48%
Grauman & Darrell, 2005: 59%
Gehler & Nowozin, 2009: 77%
OURS: 62%

Detailed understanding and reasoning

HOI activity: Tennis Forehand

Slide4

Human-Object Interaction

Holistic image based classification

Detailed understanding and reasoning

Human pose estimation: Torso, Right-arm, Left-arm, Right-leg, Left-leg, Head

Slide5

Human-Object Interaction

Holistic image based classification

Detailed understanding and reasoning

Human pose estimation

Object detection: Tennis racket

Slide6

Human-Object Interaction

Holistic image based classification

Detailed understanding and reasoning

Human pose estimation: Torso, Right-arm, Left-arm, Right-leg, Left-leg, Head

Object detection: Tennis racket

HOI activity: Tennis Forehand

Slide7

Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion

Slide8

Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion

Slide9

Human pose estimation & Object detection

Human pose estimation is challenging.

Difficult part appearance

Self-occlusion

Image region looks like a body part

Felzenszwalb & Huttenlocher, 2005; Ren et al, 2005; Ramanan, 2006; Ferrari et al, 2008; Yang & Mori, 2008; Andriluka et al, 2009; Eichner & Ferrari, 2009

Slide10

Human pose estimation & Object detection

Human pose estimation is challenging.

Felzenszwalb & Huttenlocher, 2005; Ren et al, 2005; Ramanan, 2006; Ferrari et al, 2008; Yang & Mori, 2008; Andriluka et al, 2009; Eichner & Ferrari, 2009

Slide11

Human pose estimation & Object detection

Facilitate

Given the object is detected.

Slide12

Human pose estimation & Object detection

Object detection is challenging.

Small, low-resolution, partially occluded

Image region similar to detection target

Viola & Jones, 2001; Lampert et al, 2008; Divvala et al, 2009; Vedaldi et al, 2009

Slide13

Human pose estimation & Object detection

Object detection is challenging.

Viola & Jones, 2001; Lampert et al, 2008; Divvala et al, 2009; Vedaldi et al, 2009

Slide14

Human pose estimation & Object detection

Facilitate

Given the pose is estimated.

Slide15

Human pose estimation & Object detection

Mutual Context

Slide16

Context in Computer Vision

Previous work – Use context cues to facilitate object detection:

Hoiem et al, 2006; Rabinovich et al, 2007; Oliva & Torralba, 2007; Heitz & Koller, 2008; Desai et al, 2009; Divvala et al, 2009; Murphy et al, 2003; Shotton et al, 2006; Harzallah et al, 2009; Li, Socher & Fei-Fei, 2009; Marszalek et al, 2009; Bao & Savarese, 2010

With context vs. without context: ~3-4%

Helpful, but context improves performance only moderately.

Viola & Jones, 2001; Lampert et al, 2008

Slide17

Context in Computer Vision

Previous work – Use context cues to facilitate object detection:

Hoiem et al, 2006; Rabinovich et al, 2007; Oliva & Torralba, 2007; Heitz & Koller, 2008; Desai et al, 2009; Divvala et al, 2009; Murphy et al, 2003; Shotton et al, 2006; Harzallah et al, 2009; Li, Socher & Fei-Fei, 2009; Marszalek et al, 2009; Bao & Savarese, 2010

With context vs. without context: ~3-4%

Helpful, but context improves performance only moderately.

Our approach – Two challenging tasks serve as mutual context of each other:

With mutual context:

Without context:

Slide18

Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion

Slide19

Mutual Context Model Representation

A: Activity (croquet shot, volleyball smash, tennis forehand).

O: Object (croquet mallet, volleyball, tennis racket).

H: Human pose. Intra-class variations: more than one H for each A; unobserved during training.

P: Body parts. l_P: location; θ_P: orientation; s_P: scale.

f: Image evidence. Shape context [Belongie et al, 2002].
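The variables defined on this slide can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the class names `BodyPart` and `Interaction`, the field names, and the choice of an integer index for the hidden pose H are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class BodyPart:
    """One body part P_n with the three attributes named on the slide."""
    name: str
    location: Tuple[float, float]  # l_P (assumed to be an image coordinate)
    orientation: float             # θ_P (assumed to be in radians)
    scale: float                   # s_P


@dataclass
class Interaction:
    """One human-object interaction instance (hypothetical container)."""
    activity: str                  # A, e.g. "tennis forehand"
    obj: str                       # O, e.g. "tennis racket"
    parts: List[BodyPart]          # P_1 ... P_N
    pose: Optional[int] = None     # H: unobserved during training, so it starts empty


example = Interaction(
    activity="tennis forehand",
    obj="tennis racket",
    parts=[BodyPart("torso", (120.0, 80.0), 0.0, 1.0)],
)
```

Leaving `pose` empty by default mirrors the statement that H is unobserved during training and has to be filled in by the learning procedure.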

[Model graph: O, P_1 ... P_N with image evidence f_O, f_1 ... f_N]

Slide20

Mutual Context Model Representation

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Markov Random Field (clique potential, clique weight)

Co-occurrence potentials: frequency of co-occurrence between A, O, and H.

Slide21

Mutual Context Model Representation

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Markov Random Field (clique potential, clique weight)

Co-occurrence potentials: frequency of co-occurrence between A, O, and H.

Spatial potentials: spatial relationship among object and body parts (location, orientation, size).

Slide22

Mutual Context Model Representation

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Markov Random Field (clique potential, clique weight)

Co-occurrence potentials: frequency of co-occurrence between A, O, and H.

Spatial potentials: spatial relationship among object and body parts (location, orientation, size).

Learn structural connectivity among the body parts and the object. Obtained by structure learning.

Slide23

Mutual Context Model Representation

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Co-occurrence potentials: frequency of co-occurrence between A, O, and H.

Spatial potentials: spatial relationship among object and body parts (location, orientation, size).

Learn structural connectivity among the body parts and the object.

Detection potentials: discriminative part detection scores [Andriluka et al, 2009]. Shape context [Belongie et al, 2002] + AdaBoost [Viola & Jones, 2001].
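One way to read the potentials on this slide is that each clique contributes a potential value and the model combines them with per-clique weights. The sketch below assumes a plain weighted sum; the slide's own formula did not survive in the transcript, so the function names, the clique labels, and the placeholder values for the spatial and detection potentials are hypothetical.

```python
from collections import Counter


def cooccurrence(pairs):
    """Relative frequency of each (x, y) pair in a list of training annotations."""
    counts = Counter(pairs)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def model_score(potentials, weights):
    """Assumed combination rule: sum over cliques of weight * potential."""
    return sum(weights[name] * value for name, value in potentials.items())


# Toy (activity, object) annotations; not taken from the sport data set.
train_pairs = [
    ("tennis forehand", "tennis racket"),
    ("tennis forehand", "tennis racket"),
    ("croquet shot", "croquet mallet"),
    ("volleyball smash", "volleyball"),
]
freq_ao = cooccurrence(train_pairs)

potentials = {
    "activity-object co-occurrence": freq_ao[("tennis forehand", "tennis racket")],
    "object-part spatial": 0.8,   # placeholder value, assumed for illustration
    "part detection": 0.6,        # placeholder value, assumed for illustration
}
weights = {
    "activity-object co-occurrence": 1.0,
    "object-part spatial": 0.5,
    "part detection": 2.0,
}
score = model_score(potentials, weights)
```

In this toy setup only the co-occurrence potential is computed from data; a real system would also derive the spatial and detection terms from the image.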

Markov Random Field (clique potential, clique weight)

Slide24

Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion

Slide25

Model Learning

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Input: cricket shot, cricket bowling

Goals:

Hidden human poses

Slide26

Model Learning

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Input: cricket shot, cricket bowling

Goals:

Hidden human poses

Structural connectivity

Slide27

Model Learning

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Input: cricket shot, cricket bowling

Goals:

Hidden human poses

Structural connectivity

Potential parameters

Potential weights

Slide28

Model Learning

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Input: cricket shot, cricket bowling

Goals:

Hidden human poses (hidden variables)

Structural connectivity (structure learning)

Potential parameters (parameter estimation)

Potential weights (parameter estimation)

Slide29

Model Learning

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Approach:

croquet shot

Goals:

Hidden human poses

Structural connectivity

Potential parameters

Potential weights

Slide30

Model Learning

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Approach: Hill-climbing

Objective: joint density of the model, with a Gaussian prior on the number of edges

Moves: add an edge; remove an edge
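The search described here can be sketched as a greedy loop that toggles one edge at a time and keeps the change only if the score improves. This is a minimal sketch under stated assumptions: `log_density` stands in for the log joint density of the model and is supplied by the caller, the prior's `mean` and `std` are made-up parameters, and the toy density at the bottom exists only to make the example runnable.

```python
import itertools


def log_edge_prior(num_edges, mean, std):
    """Log of an unnormalised Gaussian prior on the number of edges."""
    return -((num_edges - mean) ** 2) / (2.0 * std ** 2)


def hill_climb_structure(nodes, log_density, mean, std, max_steps=100):
    """Greedy structure search: each step adds or removes the single best edge."""
    candidates = [frozenset(pair) for pair in itertools.combinations(nodes, 2)]
    edges = set()

    def score(edge_set):
        return log_density(edge_set) + log_edge_prior(len(edge_set), mean, std)

    best = score(edges)
    for _ in range(max_steps):
        best_move, best_move_score = None, best
        for edge in candidates:
            trial = edges ^ {edge}  # add the edge if absent, remove it if present
            trial_score = score(trial)
            if trial_score > best_move_score:
                best_move, best_move_score = edge, trial_score
        if best_move is None:       # no single move improves the score
            break
        edges ^= {best_move}
        best = best_move_score
    return edges, best


# Toy density (hypothetical): rewards agreement with a known target structure.
target = {frozenset(("O", "P1")), frozenset(("P1", "P2"))}
toy_log_density = lambda edge_set: -len(edge_set ^ target)

learned_edges, learned_score = hill_climb_structure(
    ["O", "P1", "P2"], toy_log_density, mean=2.0, std=10.0
)
```

Like any hill-climbing search, this stops at the first structure that no single add or remove can improve, which may be a local optimum.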

Goals:

Hidden human poses

Structural connectivity

Potential parameters

Potential weights

Slide31

Model Learning

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Approach: Maximum likelihood; standard AdaBoost

Goals:

Hidden human poses

Structural connectivity

Potential parameters

Potential weights

Slide32

Model Learning

[Model graph: A, O, H, P_1 ... P_N, f_O, f_1 ... f_N]

Approach: Max-margin learning

Notations

x_i: Potential values of the i-th image.

w_r: Potential weights of the r-th pose.

y(r): Activity of the r-th pose.

ξ_i: A slack variable for the i-th image.
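The slide's optimisation formula did not survive in the transcript, so the following is only a hedged illustration of how the four symbols could fit together in a generic multi-class max-margin objective. It assumes that each image is assigned to one pose, that rival poses are those belonging to a different activity, and that the trade-off constant `beta` exists; none of these details are confirmed by the slide.

```python
def max_margin_objective(weights, potentials, pose_of_image, activity_of_pose, beta=1.0):
    """Evaluate 0.5 * sum_r ||w_r||^2 + beta * sum_i xi_i for given weights.

    weights[r]          -> w_r, potential weights of the r-th pose
    potentials[i]       -> x_i, potential values of the i-th image
    pose_of_image[i]    -> index of the pose assigned to image i (assumption)
    activity_of_pose[r] -> y(r), activity of the r-th pose
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    slacks = []
    for x_i, c_i in zip(potentials, pose_of_image):
        own = dot(weights[c_i], x_i)
        # Rival poses are those whose activity differs from the image's own.
        rivals = [r for r in range(len(weights))
                  if activity_of_pose[r] != activity_of_pose[c_i]]
        violation = max((1.0 - (own - dot(weights[r], x_i)) for r in rivals),
                        default=0.0)
        slacks.append(max(0.0, violation))  # xi_i: smallest slack meeting the margin

    regulariser = 0.5 * sum(dot(w, w) for w in weights)
    return regulariser + beta * sum(slacks), slacks


# Toy values, chosen by hand for illustration.
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
toy_potentials = [[2.0, 0.0], [0.0, 0.5]]
objective, slack_values = max_margin_objective(
    toy_weights, toy_potentials,
    pose_of_image=[0, 1],
    activity_of_pose=["tennis serve", "volleyball smash"],
)
```

The function only evaluates the objective for fixed weights; actually minimising it would need a quadratic-programming or subgradient solver, which is outside this sketch.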

Goals:

Hidden human poses

Structural connectivity

Potential parameters

Potential weights

Slide33

Learning Results

Cricket defensive shot

Cricket bowling

Croquet shot

Slide34

Learning Results

Tennis serve

Volleyball smash

Tennis forehand

Slide35

Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion

Slide36

Model Inference

The learned models

Slide37

Model Inference

The learned models

Head detection

Torso detection

Tennis racket detection

Layout of the object and body parts.

Compositional Inference [Chen et al, 2007]

Slide38

Model Inference

The learned models

Output

Slide39

Outline: Background and Intuition; Mutual Context of Object and Human Pose (Model Representation, Model Learning, Model Inference); Experiments; Conclusion

Slide40

Dataset and Experiment Setup

Sport data set: 6 classes [Gupta et al, 2009]

Cricket defensive shot, Cricket bowling, Croquet shot, Tennis forehand, Tennis serve, Volleyball smash

180 training (supervised with object and part locations) & 120 testing images

Tasks:

Object detection;

Pose estimation;

Activity classification.

Slide41

Dataset and Experiment Setup

Sport data set: 6 classes [Gupta et al, 2009]

Cricket defensive shot, Cricket bowling, Croquet shot, Tennis forehand, Tennis serve, Volleyball smash

180 training (supervised with object and part locations) & 120 testing images

Tasks:

Object detection;

Pose estimation;

Activity classification.

Slide42

Object Detection Results

Cricket bat, Croquet mallet, Tennis racket, Volleyball, Cricket ball

Valid region

Our Method

Sliding window

Pedestrian context

[Andriluka et al, 2009]

[Dalal & Triggs, 2006]

Slide43

Object Detection Results

Volleyball

Cricket ball

Sliding window

Pedestrian context

Our method

Small object

Background clutter

Slide44

Dataset and Experiment Setup

Sport data set: 6 classes [Gupta et al, 2009]

Cricket defensive shot, Cricket bowling, Croquet shot, Tennis forehand, Tennis serve, Volleyball smash

180 training & 120 testing images

Tasks:

Object detection;

Pose estimation;

Activity classification.

Slide45

Human Pose Estimation Results

Method                  Torso  Upper Leg  Lower Leg  Upper Arm  Lower Arm  Head
Ramanan, 2006           .52    .22 .22    .21 .28    .24 .28    .17 .14    .42
Andriluka et al, 2009   .50    .31 .30    .31 .27    .18 .19    .11 .11    .45
Our full model          .66    .43 .39    .44 .34    .44 .40    .27 .29    .58

Slide46

Human Pose Estimation Results

Method                  Torso  Upper Leg  Lower Leg  Upper Arm  Lower Arm  Head
Ramanan, 2006           .52    .22 .22    .21 .28    .24 .28    .17 .14    .42
Andriluka et al, 2009   .50    .31 .30    .31 .27    .18 .19    .11 .11    .45
Our full model          .66    .43 .39    .44 .34    .44 .40    .27 .29    .58

Tennis serve model: Andriluka et al, 2009; Our estimation result

Volleyball smash model: Andriluka et al, 2009; Our estimation result

Slide47

Human Pose Estimation Results

Method                  Torso  Upper Leg  Lower Leg  Upper Arm  Lower Arm  Head
Ramanan, 2006           .52    .22 .22    .21 .28    .24 .28    .17 .14    .42
Andriluka et al, 2009   .50    .31 .30    .31 .27    .18 .19    .11 .11    .45
Our full model          .66    .43 .39    .44 .34    .44 .40    .27 .29    .58
One pose per class      .63    .40 .36    .41 .31    .38 .35    .21 .23    .52
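To compare the rows at a glance, the per-part scores can be averaged per method. The unweighted mean over the ten part entries is a summary chosen here for illustration; the slides do not report it, and the variable names are hypothetical.

```python
# Per-part scores copied from the table, in column order:
# torso, upper legs (2), lower legs (2), upper arms (2), lower arms (2), head.
results = {
    "Ramanan, 2006":         [.52, .22, .22, .21, .28, .24, .28, .17, .14, .42],
    "Andriluka et al, 2009": [.50, .31, .30, .31, .27, .18, .19, .11, .11, .45],
    "Our full model":        [.66, .43, .39, .44, .34, .44, .40, .27, .29, .58],
    "One pose per class":    [.63, .40, .36, .41, .31, .38, .35, .21, .23, .52],
}


def mean_score(values):
    """Unweighted mean over the ten part entries (an assumed summary statistic)."""
    return sum(values) / len(values)


means = {method: round(mean_score(scores), 3) for method, scores in results.items()}
best_method = max(means, key=means.get)

for method, value in means.items():
    print(f"{method}: {value:.3f}")
```

On this summary the full model also stays ahead of the one-pose-per-class variant, consistent with every individual column of the table.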

Estimation result

Slide48

Dataset and Experiment Setup

Sport data set: 6 classes [Gupta et al, 2009]

Cricket defensive shot, Cricket bowling, Croquet shot, Tennis forehand, Tennis serve, Volleyball smash

180 training & 120 testing images

Tasks:

Object detection;

Pose estimation;

Activity classification.

Slide49

Activity Classification Results

Bag-of-words (SIFT+SVM)

Gupta et al, 2009

Our model

No scene information

Scene is critical!!

Cricket shot

Tennis forehand

Slide50

Conclusion

Human-Object Interaction

Grouplet representation vs. Mutual context model

Next Steps

Pose estimation & Object detection on PPMI images.

Modeling multiple objects and humans.

Slide51

Acknowledgment

Stanford Vision Lab reviewers:

Barry Chai (1985-2010)

Juan Carlos Niebles

Hao Su

Silvio Savarese, U. Michigan

Anonymous reviewers