Slide 1
Multimodal Sequential Modeling and Recognition of Human Activities
Mouna Selmi (1) and Mounîm A. El-Yacoubi (2)
(1) UR SAGE: Systèmes Avancés en Génie Electrique, Université de Sousse, Tunisia
(2) SAMOVAR, Telecom SudParis, CNRS, University Paris Saclay, Palaiseau, France
Slide 2: Outline
- Context
- Objective
- Proposed Approach
- Experiments
- Conclusion
- Perspectives
Slide 3: Context
A fast-growing ageing population:
- Loss of autonomy is an important problem to address
- The majority of seniors prefer to stay in their own homes
- Nursing home care is increasingly costly and resources are scarce
[Figure: share of the old population (%) by year]
Slide 4: Context
Activity of Daily Living (ADL) recognition enables:
- Evaluating the degree of dependency of the elderly
- Detecting their critical situations
- Detecting changes in their behavior patterns (memory impairments, physical impairments, etc.)
ADL recognition systems can use:
- Environment sensors
- Microphones
- Video sensors
- Connected devices
Slide 5: Problem: Recognition of Activities of Daily Living (ADLs)
ADLs: walking, eating, drinking, making a phone call, etc.
Recognition is challenging:
- High intra-class variability: different people perform the same activity in different ways
- Ambiguity between classes: the motion is similar for drinking and answering the phone (raising the hand)
- Lighting conditions: daylight, evening, windows, sunny days, etc.
- Motion capture may be difficult: occlusion (objects hiding parts of the body, self-occlusion, etc.)
Hence the need to consider other sources of information, e.g. drinking (object = glass) vs. phone call (object = telephone).
Slide 6: Objective
Recognizing ADLs in natural videos is challenging:
- Intra-class variability and inter-class ambiguity
- Motion alone may not be sufficient to characterize the action
State of the art: activity recognition based mostly on motion.
Solution: combine the multimodal aspects of an activity:
- Motion
- Contextual information (objects, scene, etc.)
- Sound
In this work: motion + contextual (object) information.
[Images: Object = Glass; Object = Banana]
Slide 7: Overview of our ADL recognition approach
[Diagram: Video → ADL Motion Representation + Context Representation → Classification → ADL Label]
Slide 8: How to represent ADLs by motion features?
Activity motion representation:
Holistic approaches:
- 2D/3D features from the silhouette
- (+) Explicit body representation
- (+) Rich representation
- (-) Require background subtraction and/or body tracking
Local approaches:
- Local interest points (IPs): Harris3D, Cuboid, etc.
- (+) Discriminative representation
- (+) Avoid preprocessing
- (-) Sparse representation
Slide 9: How to classify ADLs?
Classification strategies:
- Static: key frames
- Sequential: generative or discriminative models
Our multimodal approach:
- Takes into account the sequential order within the activity
- Learns how to discriminate different activities in a probabilistic way
- Extracts the semantic information from each modality before combination (pre-classification stage)
Slide 10: Motion encoding: 3 levels
[Diagram: the video is split into segments of L frames; each segment is encoded at three levels: low-level features, middle-level local BOWs, and high-level local SVM class-probability vectors, e.g. (0.1, 0.4, 0.2, ..., 0.1)]
Slide 11: Low level: dense point trajectories
- Dense points [Wang et al., 2011]: uniformly sampled, at multiple spatial scales
- IPs are tracked over 15 frames using a median-filtered optical-flow kernel
- Trajectories are described by HOG, HOF and MBH (Motion Boundary Histogram) descriptors
A minimal sketch of the tracking step is given below.
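The following is a minimal sketch of the dense-trajectory tracking step of Wang et al. (2011), not the authors' code: points sampled on a regular grid are propagated frame to frame by a median-filtered dense optical-flow field; descriptor extraction (HOG/HOF/MBH) is omitted. The grid spacing and Farneback parameters are illustrative assumptions.

```python
import cv2
import numpy as np

TRACK_LEN = 15   # trajectories span 15 frames, as on the slide
STEP = 5         # grid spacing for dense sampling (assumed value)

def track_dense_points(frames):
    """frames: list of grayscale images; returns full-length trajectories."""
    h, w = frames[0].shape
    ys, xs = np.mgrid[STEP // 2:h:STEP, STEP // 2:w:STEP]
    tracks = [[(float(x), float(y))] for x, y in zip(xs.ravel(), ys.ravel())]
    for prev, cur in zip(frames, frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, cur, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Median-filter each flow component to suppress tracking noise
        fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 5)
        fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 5)
        for tr in tracks:
            x, y = tr[-1]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < w and 0 <= yi < h and len(tr) <= TRACK_LEN:
                tr.append((x + fx[yi, xi], y + fy[yi, xi]))
    # Keep only trajectories tracked over the full 15-frame span
    return [tr for tr in tracks if len(tr) == TRACK_LEN + 1]
```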
Slide 12: Middle level: local Bag of Words (BOW)
- A local BOW is computed for each temporal segment
- The video is split into segments of fixed size L = 30 frames, with a 50% overlap rate
- (+) Yields a sequence of feature vectors of fixed size
- (-) Feature vectors have a huge dimensionality (thousands)
[Diagram: feature extraction → clustering with the k-means algorithm → one local BOW per temporal segment]
A sketch of this encoding follows.
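Below is a minimal sketch (not the authors' code) of the local BOW encoding: a k-means codebook is learned over the low-level descriptors, then each overlapping 30-frame segment is encoded as a normalized histogram of visual words. The codebook size of 2000 matches the dimensionality quoted on slide 13; function names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

L = 30          # segment length in frames (from the slide)
OVERLAP = 0.5   # 50% overlap rate (from the slide)

def build_codebook(all_descriptors, k=2000):
    """all_descriptors: (n, d) trajectory descriptors pooled over training videos."""
    return KMeans(n_clusters=k, n_init=1, random_state=0).fit(all_descriptors)

def local_bows(descriptors, frame_ids, n_frames, codebook):
    """Encode one video as a sequence of local BOW histograms.
    descriptors: (n, d) array; frame_ids: (n,) frame index of each descriptor."""
    words = codebook.predict(descriptors)
    k = codebook.n_clusters
    step = int(L * (1 - OVERLAP))  # 15-frame hop for 50% overlap
    bows = []
    for start in range(0, max(n_frames - L + 1, 1), step):
        in_seg = (frame_ids >= start) & (frame_ids < start + L)
        hist = np.bincount(words[in_seg], minlength=k).astype(float)
        bows.append(hist / max(hist.sum(), 1.0))  # L1-normalize the histogram
    return np.array(bows)
```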
Slide 13: High-level feature: local probabilistic SVM
An SVM serves as a low-level classifier that converts each segment into a class-conditional probability vector. This:
- Improves performance by generating high-level information
- Allows feature dimensionality reduction
- Avoids the overfitting problem
[Diagram: each segment H1, H2, ..., HN of the video sequence is mapped to a class-probability vector (P(Drink), P(Eat), P(Phone), ...) over classes such as answer phone, drink and eat; the per-segment dimensionality drops from 2000 to 5-50]
A probabilistic-SVM sketch is given below.
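A minimal sketch of this step, assuming scikit-learn (the paper does not specify the implementation or kernel): an SVM with probability calibration maps each segment's local BOW to a class-conditional probability vector, shrinking the per-segment feature from ~2000 dimensions to the number of activity classes.

```python
from sklearn.svm import SVC

def train_segment_svm(train_bows, train_labels):
    """train_bows: (n_segments, k) local BOW histograms; train_labels: the
    activity class of the video each segment comes from."""
    # probability=True enables Platt scaling, yielding calibrated probabilities
    svm = SVC(kernel="rbf", probability=True, random_state=0)
    return svm.fit(train_bows, train_labels)

def to_probability_sequence(svm, video_bows):
    """Convert a video's BOW sequence (T, k) into a (T, n_classes) sequence of
    class-conditional probability vectors, used as the HCRF's observations."""
    return svm.predict_proba(video_bows)
```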
Slide 14: How to represent context?
Context information sources:
- Environment sensors
- Scene nature
- Manipulated objects (considered in this work)
Slide 15: Encoding of context: object information
- We encode the occurrence frequency of each object
- This frequency vector is concatenated with the vector of motion-based conditional class probabilities
For each segment of L frames, the observation is

s_t = (P(a_1), ..., P(a_i), ..., P(a_A), F(o_1), ..., F(o_j), ..., F(o_N))

where the P(a_i) are the motion-based class probabilities over the A activity classes and the F(o_j) are the occurrence frequencies of the N objects. A sketch of this concatenation follows.
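A minimal sketch (not the authors' code) of building the per-segment multimodal observation s_t. The per-segment object detections are assumed to come from an external detector; the `detected_objects` input and helper names are hypothetical.

```python
import numpy as np

def segment_observation(class_probs, detected_objects, object_vocab):
    """class_probs: (A,) SVM probability vector for one segment.
    detected_objects: list of object labels seen in the segment (assumed given).
    object_vocab: ordered list of the N known object labels."""
    freqs = np.zeros(len(object_vocab))
    for obj in detected_objects:
        if obj in object_vocab:
            freqs[object_vocab.index(obj)] += 1
    if freqs.sum() > 0:
        freqs /= freqs.sum()  # turn raw counts into occurrence frequencies
    # s_t = motion-based class probabilities followed by object frequencies
    return np.concatenate([class_probs, freqs])
```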
Slide 16: Proposed approach
Hidden Conditional Random Field (HCRF): a discriminative sequential model.
Slide 17: Hidden Conditional Random Field (HCRF)
- Discriminative sequential classifier
- Models the relationship between the partial high-level features conveyed by each segment through the SVM output
Example: the activity "drink a glass of water" (y) decomposes into hidden sub-activities: go to the freezer (h1), open the freezer (h2), take a bottle of water (h3), take a glass of water (h4), drink water (h5).
Notation: x_t is the feature vector at segment t (class probabilities + object frequencies); h_t is the hidden state at segment t. The standard HCRF objective is recalled below.
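For reference, the standard HCRF conditional likelihood (following Quattoni et al.); the exact feature functions used in this work may differ:

```latex
% Standard HCRF conditional likelihood; the feature map \Phi factorizes
% over segments and adjacent hidden states.
P(y \mid \mathbf{x}; \theta)
  = \frac{\sum_{\mathbf{h}} \exp\big(\theta \cdot \Phi(y, \mathbf{h}, \mathbf{x})\big)}
         {\sum_{y'} \sum_{\mathbf{h}} \exp\big(\theta \cdot \Phi(y', \mathbf{h}, \mathbf{x})\big)},
\qquad \mathbf{x} = (x_1, \dots, x_T),\ \mathbf{h} = (h_1, \dots, h_T)
```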
Slide 18: Experiments: CAD-120 (Cornell University, 2013)
- 10 daily living activities: making cereal, taking medicine, stacking objects, un-stacking objects, microwaving food, picking objects, cleaning objects, taking food, arranging objects and having a meal
- Video and Kinect data
- 4 persons; each activity was performed 3 times
- Experimental protocol: leave-one-out (a sketch of the evaluation loop is given below)
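A sketch of the evaluation loop, under the assumption that "leave-one-out" here means leave-one-person-out over the 4 subjects (a common protocol for CAD-120); `train_model` and `evaluate` are hypothetical helpers standing in for the full pipeline.

```python
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_person_out(videos, labels, person_ids, train_model, evaluate):
    """videos, labels, person_ids: parallel lists; returns per-fold accuracies."""
    scores = []
    # Each fold holds out all videos of one person for testing
    for train_idx, test_idx in LeaveOneGroupOut().split(videos, labels, person_ids):
        model = train_model([videos[i] for i in train_idx],
                            [labels[i] for i in train_idx])
        scores.append(evaluate(model,
                               [videos[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return scores
```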
Slide 19: Evaluation of the SVM-HCRF model without context modeling
- Dense points perform better than primitives extracted from Kinect skeletons:
  - the skeleton is a sparse representation
  - skeletons are not very robust to occlusions
- The SVM-HCRF model enables better modeling of the temporal aspect of the activities

Method                                                        Recognition rate
Skeleton features + structural SVM [Koppula et al., 2013]     27.4%
Dense points + SVM-HCRF (ours)                                73.4%
Slide 20: Evaluation of the SVM-HCRF model with context modeling
Observations:
- Combining object information with motion features within the SVM-HCRF framework improves accuracy
- Our model combines the multimodal information of activities in a sequential way
- Our model relies on dense IP trajectories, which are less prone to occlusion issues than the Kinect skeleton joints

Method                                                           Recognition rate
Koppula et al., IJRR 2013 (motion: Kinect skeleton dynamics)     80.6%
Koppula & Saxena, ICML 2013 (motion: Kinect skeleton dynamics)   83.1%
SVM-HCRF + object occurrence frequency (ours)                    90.3%
Slide 21: Conclusion
A new two-layer SVM-HCRF model for ADL recognition:
- Combines in a fluid manner the multimodal aspects of ADLs: motion and object information
- The low-level SVM classifier increases performance and processing speed
- Models, sequentially and in a discriminative way, the semantic information from motion and objects within each local segment
- Explicitly learns the underlying temporal sub-structures of an activity and their interrelationships
Slide 22: Perspectives
- Further exploit the multimodal aspects of human activities: motion, objects, scene, etc.
- Use other sensors: connected devices, door sensors, the Internet of Things
- Evaluate our ADL recognition model on datasets recorded with elderly and frail people
Slide 23: Human Activity Recognition for Health & Ageing: the Juliette Project
Partners:
- Telecom SudParis (development of technological tools)
- Aldebaran Robotics (humanoid robots Nao and Romeo)
- Brain Vision Systems (image processing)
- Institut de la Vision (empowering visually impaired people)
Applications: monitoring of elderly people in smart homes; triggering alarms when abnormal events occur.
Goal: robots "living" in smart homes, assisting visually impaired or elderly people, and detecting abnormal behaviors:
- Falls
- A door left open
- No activity for a long duration
Slide 24: Data corpus: recording of a human activity dataset
- Activities: daily home tasks
- Location: the HomeLab of the Institut de la Vision, Paris
- Tasks in 3 HomeLab locations: main entrance, kitchen, living room
- Participants: 18 with "normal" vision, 8 visually impaired, 4 blind
Slide 25: Context: recognition of complex human activities
- Daily home tasks; one person involved; real conditions
- Conjoined activity detection and recognition on a continuous activity video stream
- Activities: walk, drink water, prepare a meal, sit down, sort out mail, dial a phone number, etc.
Slide 26: Thank you!