Slide1

Multimodal Sequential Modeling and Recognition of Human Activities

Mouna Selmi (1) and Mounîm A. El-Yacoubi (2)
(1) UR SAGE: Systèmes Avancés en Génie Electrique, Université de Sousse, Tunisia
(2) SAMOVAR, Telecom SudParis, CNRS, University Paris Saclay, Palaiseau, France

1

Slide2

Outline

Context
Objective
Proposed Approach
Experiments
Conclusion
Perspectives

2

Slide3

Context

A rapidly growing ageing population:
- Loss of autonomy is a major concern
- Most seniors prefer to stay in their own homes
- Rising costs of nursing home care and a lack of resources

[Chart: older population (%) by year]

3

Slide4

Context

Activity of daily living (ADL) recognition enables:
- Evaluating the degree of dependency of the elderly
- Detecting critical situations
- Detecting changes in behavior patterns (memory impairments, physical impairments, etc.)

ADL recognition systems can use:
- Environment sensors
- Microphones
- Video sensors
- Connected devices

4

Slide5

Problem: Recognition of Activities of Daily Living (ADLs)

ADLs: walking, eating, drinking, phone call, etc.
Recognition is challenging:
- High intra-class variability: different people perform the same activity in different ways
- Ambiguity between classes: the motion is similar for drinking and answering the phone (raising the hand)
- Lighting conditions: daylight, evening, windows, sunny days, etc.
- Motion capture may be difficult: occlusion (objects hiding parts of the body, self-occlusion, etc.)
Hence the need to consider other sources of information, e.g. drinking (object = glass) vs. phone call (object = telephone).

5

Slide6

Objective

Recognizing ADLs in natural videos is challenging:
- Intra-class variability and inter-class ambiguity
- Motion may not be sufficient to characterize the action
State of the art: activity recognition based mostly on motion.
Solution: combine the multimodal aspects of an activity:
- Motion
- Contextual information (objects, scene, etc.)
- Sound
In this work: motion + contextual (object) information.

[Images: similar motions disambiguated by the manipulated object: Object = Glass vs. Object = Banana]

6

Slide7

Overview of our ADL recognition approach

Video → ADL Motion Representation + Context Representation → Classification → ADL Label

7

Slide8

How to represent an ADL by motion features?

Activity motion representation:

Holistic approaches: 2D/3D features from the silhouette
+ Explicit body representation
+ Rich representation
- Require background subtraction and/or body tracking

Local approaches: local interest points (IPs): Harris 3D, cuboids, …
+ Discriminative representation
+ Avoid preprocessing
- Sparse representation

8

Slide9

How to classify ADLs?

Classification strategies:
- Static: key frames
- Sequential: generative or discriminative models

Our multimodal approach:
- Takes into account the sequential order within the activity
- Learns to discriminate between activities in a probabilistic way
- Extracts the semantic information from each modality before combination (pre-classification stage)

9

Slide10

Motion encoding: 3 levels

- Low-level features: dense trajectory descriptors extracted from the video
- Middle-level features: local BOWs, one per segment of L frames
- High-level features: a local SVM converts each local BOW into a class-probability vector

[Diagram: the video is split into overlapping segments of L frames; each segment yields a local BOW, which the local SVM maps to a probability vector, e.g. (0.1, 0.4, 0.2, …, 0.1)]

10

Slide11

Motion encoding: 3 levels

Low level: dense point trajectories
- Dense points [Wang et al., 2011]: uniformly sampled, at multiple spatial scales
- IPs are tracked over 15 frames using a median filtering kernel
- Trajectories are described by HOG, HOF and MBH (Motion Boundary Histogram) descriptors

11

Slide12

Middle level: Local Bag Of Words (BOW)

A local BOW is computed for each temporal segment:
- Split the video into segments of fixed size L = 30 frames, with an overlap rate of 50%
- ++ Yields a sequence of feature vectors of fixed size
- -- The feature vectors have a huge dimensionality (thousands of entries)

Pipeline: feature extraction → clustering with the k-means algorithm → temporal segments → local BOWs

12
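The segmentation and local-BOW step above can be sketched as follows (a minimal illustration, not the authors' code; the descriptor dimensionality, codebook size and toy data are placeholders):

```python
import numpy as np

def temporal_segments(n_frames, seg_len=30, overlap=0.5):
    """Start indices of fixed-size segments with the given overlap rate."""
    step = int(seg_len * (1.0 - overlap))  # 50% overlap -> step of 15 frames
    return [s for s in range(0, n_frames - seg_len + 1, step)]

def local_bow(descriptors, frame_ids, codebook, start, seg_len=30):
    """Normalized histogram of nearest-codeword assignments for one segment."""
    in_seg = (frame_ids >= start) & (frame_ids < start + seg_len)
    seg_desc = descriptors[in_seg]
    # assign each descriptor to its nearest k-means centroid
    d2 = ((seg_desc[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)  # normalized local BOW

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 8))    # toy trajectory descriptors
frame_ids = rng.integers(0, 90, size=500)  # frame each descriptor belongs to
codebook = rng.normal(size=(16, 8))        # k-means centroids (k = 16)

starts = temporal_segments(90)             # segment start frames for a 90-frame clip
bows = np.stack([local_bow(descriptors, frame_ids, codebook, s) for s in starts])
```

With L = 30 and 50% overlap over a 90-frame clip, this yields segments starting at frames 0, 15, 30, 45 and 60, each represented by one local BOW.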

Slide13

High-level feature: Local Probabilistic SVM

SVM: a low-level classifier that converts each segment into a class-conditional probability vector:
- Improves performance by generating high-level information
- Reduces the feature dimensionality (from ~2000 down to 5 to 50)
- Avoids the overfitting problem

[Diagram: the video sequence is split into segments H1, H2, …, HN; each segment is mapped to a probability vector (P(Drink), P(Eat), P(Phone)) over the classes Answer phone, Drink, Eat]

13
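The role of this pre-classification layer can be sketched with a stand-in scorer (this is not the authors' SVM: a linear scorer plus softmax is used here purely to illustrate mapping a ~2000-dimensional local BOW down to an A-dimensional probability vector; a Platt-scaled SVM would play the same role):

```python
import numpy as np

def segment_class_probabilities(bow, W, b):
    """Map one local BOW (dim ~2000) to a class-probability vector (dim A).

    Stand-in for the probabilistic SVM of the slides: a linear scorer
    followed by a softmax, so the output is a valid probability vector.
    """
    scores = W @ bow + b      # one score per activity class
    scores -= scores.max()    # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()

rng = np.random.default_rng(1)
A, D = 5, 2000                # 5 activity classes, 2000-d local BOW
W = rng.normal(size=(A, D)) * 0.01
b = np.zeros(A)
bow = rng.random(D)

p = segment_class_probabilities(bow, W, b)
# each 2000-d segment descriptor is reduced to an A-dimensional probability vector
```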

Slide14

How to represent Context?

Context information:
- Environment sensors
- Scene nature
- Manipulated objects (considered in this work)

14

Slide15

Encoding of Context: Object Information

We encode the occurrence frequency of each object and concatenate it with the vector of motion-based conditional class probabilities. For each segment of L frames:

s_t = (P(a_1), …, P(a_i), …, P(a_A), F(o_1), …, F(o_j), …, F(o_N))

where P(a_i) are the motion-based class probabilities and F(o_j) are the object occurrence frequencies.

15
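The concatenation step can be sketched as follows (a minimal illustration; normalizing raw object counts into frequencies is an assumption, the slides only say "occurrence frequency"):

```python
import numpy as np

def segment_feature(class_probs, object_counts):
    """Build s_t = (P(a_1..a_A), F(o_1..o_N)) for one segment.

    class_probs: motion-based class probabilities from the local SVM.
    object_counts: raw occurrence counts per object, turned into
    frequencies before concatenation (an assumption for this sketch).
    """
    class_probs = np.asarray(class_probs, dtype=float)
    counts = np.asarray(object_counts, dtype=float)
    freqs = counts / max(counts.sum(), 1.0)
    return np.concatenate([class_probs, freqs])

# toy example: A = 3 activity classes, N = 3 object types
s_t = segment_feature([0.7, 0.2, 0.1], [3, 0, 1])
```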

Slide16

Proposed approach

16

Hidden conditional random field (HCRF): a discriminative sequential model

Slide17

Hidden conditional random field: HCRF

Discriminative sequential classifier: models the relationships between the partial high-level features conveyed by each segment through the SVM outputs.

Example: Drink a glass of water (y)
- Go to the freezer (h1)
- Open the freezer (h2)
- Take a bottle of water (h3)
- Take a glass of water (h4)
- Drink water (h5)

x_t: feature vector at segment t (class probabilities + object frequencies)
h_t: hidden state at segment t

17
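For reference, the standard HCRF formulation (after Quattoni et al.) that this slide describes can be written as follows; the decomposition into state, label and transition potentials and the parameter names theta_s, theta_l, theta_e are the usual ones, not taken from the slides:

```latex
P(y \mid x_{1:T}; \theta) = \frac{1}{Z(x;\theta)}
  \sum_{h_1,\dots,h_T} \exp\Big(
    \sum_{t=1}^{T} \theta_s \cdot \phi(h_t, x_t)
    + \sum_{t=1}^{T} \theta_l \cdot \psi(y, h_t)
    + \sum_{t=1}^{T-1} \theta_e \cdot \varphi(y, h_t, h_{t+1})
  \Big)
```

Training maximizes this conditional likelihood over the activity labels y, marginalizing over the hidden state sequences h_1, …, h_T.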

Slide18

Experiments: CAD-120 (Cornell University, 2013)

10 daily living activities: making cereal, taking medicine, stacking objects, un-stacking objects, microwaving food, picking objects, cleaning objects, taking food, arranging objects and having a meal.
- Video and Kinect data
- 4 persons; each activity was performed 3 times
- Experimental protocol: leave-one-out

18
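The evaluation protocol can be sketched as follows (one common reading, namely leave-one-person-out over the 4 subjects; this is an assumption, the slides only say "leave-one-out", and the subject identifiers are placeholders):

```python
def leave_one_out_splits(subjects):
    """Leave-one-subject-out cross-validation splits.

    For each subject, train on the remaining subjects and test on
    the held-out one; accuracies are then averaged over the splits.
    """
    splits = []
    for held_out in subjects:
        train = [s for s in subjects if s != held_out]
        splits.append((train, held_out))
    return splits

splits = leave_one_out_splits(["p1", "p2", "p3", "p4"])
```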

Slide19

Evaluation of our SVM-HCRF model without context modeling

- Dense points outperform the primitives extracted from the skeletons provided by the Kinect:
  - the skeleton is a sparse representation
  - skeletons are not very robust to occlusions
- The SVM-HCRF model enables better modeling of the temporal aspect of the activities.

Method | Recognition rate
Skeleton features + structural SVM [Koppula et al., 2013] | 27.4%
Dense points + SVM-HCRF (ours) | 73.4%

19

Slide20

Evaluation of our SVM-HCRF model with context modeling

Observations:
- Combining object information with motion features within the SVM-HCRF framework improves accuracy.
- Our model combines the multimodal information of activities in a sequential way.
- Our model relies on dense IP trajectories, which are less prone to occlusion issues than the Kinect skeleton joints.

Method | Recognition rate
Koppula et al., IJRR 2013 (motion: Kinect skeleton dynamics) | 80.6%
Koppula and Saxena, ICML 2013 (motion: Kinect skeleton dynamics) | 83.1%
SVM-HCRF + object occurrence frequency (ours) | 90.3%

20

Slide21

Conclusion

New two-layer SVM-HCRF model for ADL recognition:
- Combines the multimodal aspects of ADLs (motion and object information) in a seamless way
- The low-level SVM classifier increases performance and processing speed
- Models, sequentially and discriminatively, the semantic information from motion and objects within each local segment
- Explicitly learns the underlying temporal sub-structures of an activity and their interrelationships

21

Slide22

Perspectives

- Further exploit the multimodal aspects of human activities:
  - motion, objects, scene, etc.
  - other sensors: connected devices, door sensors, the Internet of Things
- Evaluate our ADL recognition model on datasets recorded with elderly and frail people

22

Slide23

Juliette Project:
- Telecom SudParis (development of technological tools)
- Aldebaran Robotics (humanoid robots)
- Brain Vision Systems (image processing)
- Institut de la Vision (empowering visually impaired people)

Applications: monitoring of elderly people in smart homes; triggering alarms when abnormal events occur.

Goal: robots "living" in smart homes, assisting visually impaired or elderly people, and detecting abnormal behaviors:
- falls
- door left open
- no activity for a long duration

Human Activity Recognition for Health & Ageing

Nao

Romeo

23

Slide24

Recording of a human activity dataset:
- Activities: daily home tasks
- Where: HomeLab of the Institut de la Vision, Paris
- Tasks in 3 HomeLab locations: main entrance, kitchen, living room
- Participants: 18 with "normal" vision, 8 visually impaired, 4 blind

Data Corpus

24

Slide25

Context: Recognition of Complex Human Activities

- Daily home tasks
- One person involved
- Real conditions
- Joint activity detection and recognition over a continuous video stream

Activities: walk, drink water, prepare a meal, sit down, sort out mail, dial a phone number, etc.

25

Slide26

Thank you!

26