Slide 1: Hierarchical Models of Vision: Machine Learning/Computer Vision
Alan Yuille
UCLA: Dept. Statistics
Joint App.: Computer Science, Psychiatry, Psychology
Dept. Brain and Cognitive Engineering, Korea University
Slide 2: Structure of Talk
Comments on the relations between Cognitive Science and Machine Learning.
Comments about Cog. Sci., ML, and Neuroscience.
Three related hierarchical machine learning models:
(I) Convolutional Networks.
(II) Structured Discriminative Models.
(III) Grammars and Compositional Models.
The examples will be on vision, but the techniques are generally applicable.
Slide 3: Cognitive Science helps Machine Learning
Cognitive Science is useful to ML because the human visual system has many desirable properties (not present in most ML systems):
(i) flexible, adaptive, robust;
(ii) capable of learning from limited data, with the ability to transfer;
(iii) able to perform multiple tasks;
(iv) closely coupled to reasoning, language, and other cognitive abilities.
Cognitive scientists search for fundamental theories, not incremental pragmatic solutions.
Slide 4: Cognitive Science and Machine Learning
Machine Learning is useful to Cog. Sci. because it has experience dealing with complex tasks on huge datasets (e.g., the fundamental problem of vision).
Machine Learning, and Computer Vision in particular, has developed a very large number of mathematical and computational techniques, which seem necessary to deal with the complexities of the world.
Data drives the modeling tools. Simple data requires only simple tools. But simple tasks also require simple tools (a point neglected by Computer Vision).
Slide 5: Combining Cognitive Science and ML
Augmented Reality: we need computer systems that can interact with humans.
How can a visually impaired person best be helped by a ML/CV system? They want to be able to ask the computer questions ("Who was that person?"), i.e., to interact with it as if it were human. Turing tests for vision (S. Geman and D. Geman).
Image Analyst (Medicine, Military): wants a ML system that can reason about images, make analogies to other images, and so on.
Slide 6: Data Set Dilemmas
Too complicated a dataset: requires a lot of engineering to perform well ("neural network tricks", N students testing 100 x N parameter settings).
Too simple a dataset: results may not generalize to the real world, and may focus on side issues.
Tyranny of Datasets: you can only evaluate performance on a limited set of tasks (e.g., a dataset may support "object classification" but not "object segmentation", "cat part detection", or "what is the cat doing?").
Slide 7: Datasets and Generalization
Machine Learning methods are tested on large benchmarked datasets. Two of the applications below involve 20,000 and 1,000,000 images.
Critical issues for Machine Learning:
(I) Learnability: will the results generalize to new datasets?
(II) Inference: can we compute properties fast enough?
Theoretical results: Probably Approximately Correct (PAC) theorems.
Slide 8: Vision: The Data and the Problem
Complexity, variability, and ambiguity of images.
Enormous range of visual tasks that can be performed. The set of all images is practically infinite.
30,000 objects, 1,000 scenes.
How can humans interpret images in 150 msec?
Fundamental problem: complexity.
Slide 9: Neuroscience: Bio-Inspired
Theoretical models of the visual cortex (e.g., T. Poggio) are hierarchical and closely related to convolutional nets.
Generative models (later in this talk) may help explain the increasing evidence of top-down mechanisms.
Behavior-to-Brain: propose models for the visual cortex that can be tested by fMRI, multi-electrodes, and related techniques (multi-electrodes: T.S. Lee; fMRI: D.K. Kersten).
Caveat: real neurons don't behave like neurons in textbooks...
Conjecture: the structure of the brain and of ML systems is driven by the statistical structure of the environment. The Pattern Theory manifesto.
Slide 10: Hierarchical Models of Vision: Why Hierarchies?
Bio-inspired: mimics the structure of the human/macaque visual system.
Computer vision architectures: low-, middle-, and high-level. From the ambiguous low level to the unambiguous high level.
Optimal design for representing, learning, and retrieving image patterns?
Slide 11: Three Types of Hierarchies
(I) Convolutional Neural Networks: ImageNet dataset. Krizhevsky, Sutskever, and Hinton (2012). LeCun, Salakhutdinov.
(II) Discriminative Part-Based Models (McAllester, Ramanan, Felzenszwalb 2008; L. Zhu et al. 2010). PASCAL dataset.
(III) Generative Models: Grammars and Compositional Models (Geman, Mumford, S.C. Zhu, L. Zhu, ...).
Slide 12: Example I: Convolutional Nets
Krizhevsky, Sutskever, and Hinton (2012).
Dataset: ImageNet (Fei-Fei Li). 1,000,000 images, 1,000 objects.
Task: detect and localize objects.
Slide 13: Example I: Neural Network
Architecture: a convolutional neural network.
Convolutional: each hidden unit applies the same localized linear filter to the input.
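This weight-sharing idea can be sketched in a few lines of NumPy; a toy "valid"-mode 2-D convolution for illustration, not the network's actual code:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared filter over the image ('valid' mode).

    Every output unit applies the *same* kernel to its local patch;
    this weight sharing is what makes the layer convolutional.
    """
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_filter = np.array([[1.0, -1.0]])   # a toy horizontal-gradient filter
response = conv2d_valid(image, edge_filter)
print(response.shape)   # (5, 4): one hidden unit per image location
```

A real network stacks many such filters per layer and learns their weights; the point here is only that the filter parameters are shared across positions.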
Slide 14: Example I: Neurons
Slide 15: Example I: The Hierarchy
Slide 16: Example I: Model Details (new model)
Slide 17: Learning
Slide 18: Example I: Learnt Filters
Image features learnt: the usual suspects.
Slide 19: Example I: Dropout
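Dropout randomly zeroes hidden units during training to prevent co-adaptation. A minimal "inverted dropout" sketch (my illustration of the standard technique, not the slide's code):

```python
import numpy as np

def dropout(activations, p_drop=0.5, rng=None, train=True):
    """Inverted dropout: zero each unit with probability p_drop at
    training time, rescaling the survivors so the expected activation
    is unchanged. At test time the layer is the identity."""
    if not train:
        return activations
    rng = rng or np.random.default_rng(0)
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

h = np.ones(8)
h_train = dropout(h, p_drop=0.5)        # some units zeroed, rest scaled to 2.0
h_test = dropout(h, train=False)        # unchanged at test time
```

The rescaling by 1/(1 - p_drop) is a common convenience so no change is needed at test time; the original formulation instead scaled the weights at test time.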
Slide 20: Example I: Results
Slide 21: Example I: Conclusion
This convolutional net was the most successful algorithm in the ImageNet Challenge 2012.
It requires a very large amount of data to train.
The devil is in the details ("tricks for neural networks").
The algorithm was implemented on Graphics Processing Units (GPUs) to deal with the complexity of inference and learning.
Slide 22: Example II: Structured Discriminative Models
Star models: McAllester, Felzenszwalb, Ramanan (2008).
Objects are made from "parts" (not semantic parts).
Discriminative models; hierarchical variant: L. Zhu, Y. Chen, et al. 2010.
Learning: latent support vector machines.
Inference: window search plus dynamic programming.
Application: PASCAL object detection challenge. 20,000 images, 20 objects.
Task: identify and localize (bounding box).
Slide 23: Example II: Mixture Models
Each object is represented by six models, to allow for different viewpoints.
An energy function/probabilistic model is defined on a hierarchical graph.
Nodes represent parts, which can move relative to each other, enabling spatial deformations.
Constraints on the deformations are imposed by potentials on the graph structure (parent-child spatial constraints).
Figure: parts at three levels, blue (1), yellow (9), purple (36); deformations of a horse and of a car.
Slide 24: Example II: Mixture Models
Each object is represented by 6 hierarchical models (a mixture of models). These mixture components account for pose/viewpoint changes.
Slide 25: Example II: Features and Potentials
Edge-like cues: Histograms of Oriented Gradients (HOGs).
Appearance cues: Bag-of-Words models (dictionary obtained by clustering SIFT or HOG features).
Learning: (i) weights for the importance of features, (ii) weights for the spatial relations between parts.
Slide 26: Example II: Learning by Latent SVM
The graph structure is known.
The training data is partly supervised: it gives image regions labeled object/non-object.
But you do not know the mixture (viewpoint) component or the positions of the parts. These are hidden variables.
Learning: Latent Support Vector Machine (Latent SVM). Learn the weights while simultaneously estimating the hidden variables (part positions, viewpoint).
Slide 27: Example II: Details (1)
Each hierarchy is a 3-layer tree. Each node represents a part.
Total of 46 nodes: 1 + 9 + 4 x 9.
Each node has a spatial position (parts can "move", i.e., are "active").
Graph edges from parent to child impose spatial constraints.
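The 3-layer tree above (1 root, 9 parts, 4 sub-parts per part) can be written down explicitly; a sketch with made-up node names:

```python
def build_hierarchy(n_children=9, n_grandchildren=4):
    """Build the 3-layer part tree: root object, 9 parts, 4 sub-parts each."""
    root = "obj"
    nodes = [root]
    edges = []                               # parent-child spatial constraints
    for i in range(n_children):
        part = f"part{i}"
        nodes.append(part)
        edges.append((root, part))
        for j in range(n_grandchildren):
            sub = f"part{i}.{j}"
            nodes.append(sub)
            edges.append((part, sub))
    return nodes, edges

nodes, edges = build_hierarchy()
print(len(nodes))   # 46 = 1 + 9 + 4*9
```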
Slide 28: Example II: Details (2)
The object model has variables:
1. The positions of the parts.
2. The mixture component (e.g., pose).
3. Whether the object is present or not.
4. The model parameters (to be learnt).
Note: during learning the part positions and the pose are unknown, so they are latent (hidden) variables.
Slide 29: Example II: Details (3)
The "energy" of the model is defined as a function of the part positions, the mixture component, and the image in the region.
The object is detected by optimizing this energy over those variables. If the best score passes a threshold, we have detected the object; the optimizing configuration then specifies the mixture component and the positions of the parts.
Slide 30: Example II: Details (4)
There are three types of potential terms:
(1) Spatial terms, which specify the distribution on the positions of the parts.
(2) Data terms for the edges of the object, defined using HOG features.
(3) Regional appearance data terms, defined by Histograms Of Words (HOWs, using grey SIFT features and K-means).
Slide 31: Example II: Details (5)
Edge-like: Histograms of Oriented Gradients (HOGs, upper row).
Regional: Histograms Of Words (HOWs, bottom row).
Dense sampling: 13,950 HOGs + 27,600 HOWs.
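A HOG feature histograms gradient orientations within a cell, weighted by gradient magnitude. A toy single-cell version (illustrative only; real HOG adds block normalization and bilinear binning):

```python
import numpy as np

def hog_cell(patch, n_bins=9):
    """Toy HOG for one cell: histogram of gradient orientations,
    weighted by gradient magnitude (unsigned orientation, 0..pi)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi               # unsigned orientation
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), mag.ravel()):
        hist[b] += m
    return hist / (np.linalg.norm(hist) + 1e-8)    # L2 normalization

patch = np.tile(np.arange(8.0), (8, 1))  # pure horizontal intensity ramp
h = hog_cell(patch)                      # all mass lands in the 0-degree bin
```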
Slide 32: Example II: Details (6)
Detecting an object requires solving this optimization for each image region.
We solve it by scanning over the subwindows of the image, using dynamic programming to estimate the part positions, and doing exhaustive search over the mixture components.
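Because the graph is a tree, the best part placement can be found by max-sum dynamic programming from the leaves to the root. A toy sketch with made-up scores and a quadratic deformation cost:

```python
import numpy as np

def best_score(unary, pairwise, children):
    """Max-sum dynamic programming on a tree.
    unary[v]:  score of node v at each of P candidate positions.
    pairwise(pu, pv): spatial compatibility of parent at pu, child at pv.
    children[v]: children of node v; node 0 is the root.
    Returns the best total score over all joint part placements."""
    P = len(unary[0])
    def msg(v):
        score = np.array(unary[v], dtype=float)
        for c in children.get(v, []):
            child = msg(c)
            # for each parent position, maximize over the child's positions
            score += np.array([max(child[pc] + pairwise(pu, pc)
                                   for pc in range(P)) for pu in range(P)])
        return score
    return max(msg(0))

unary = {0: [0, 1, 0], 1: [2, 0, 0], 2: [0, 0, 3]}   # 3 candidate positions
children = {0: [1, 2]}                                # root with two parts
quadratic = lambda pu, pv: -0.5 * (pu - pv) ** 2      # deformation penalty
print(best_score(unary, quadratic, children))         # → 5.0
```

The cost is linear in the number of nodes (times P^2 per edge), which is what makes scanning over all subwindows feasible.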
Slide 33: Example II: Details (7)
The input to learning is a set of labeled image regions.
Learning requires us to estimate the model parameters while simultaneously estimating the hidden variables.
Slide 34: Example II: Details (8)
We use Yu and Joachims' (2009) formulation of latent SVM.
This specifies a non-convex criterion to be minimized, which can be re-expressed as the sum of a convex part and a concave part.
Slide 35: Example II: Details (9)
Yu and Joachims (2009) propose the CCCP algorithm (Yuille and Rangarajan 2001) to minimize this criterion.
It iterates between estimating the hidden variables and the parameters (like the EM algorithm).
We propose a variant, incremental CCCP, which is faster.
Result: our method works well for learning the parameters without complex initialization.
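The alternation (fill in hidden variables, then update weights) can be sketched as a toy latent-SVM loop. Everything below (data, features, the simplification of imputing hidden variables for all examples rather than only the positives) is my illustration, not the authors' algorithm:

```python
import numpy as np

def latent_svm(examples, labels, n_iters=10, C=1.0, lr=0.1):
    """CCCP-style alternation for a toy latent SVM.
    Each example is a list of feature vectors, one per hidden configuration.
    Step 1: impute each example's best-scoring hidden configuration.
    Step 2: subgradient descent on the resulting hinge loss.
    (A full latent SVM fixes latents only for positives; negatives keep
    the max inside the loss. Imputing for all is a simplification here.)"""
    d = len(examples[0][0])
    w = np.zeros(d)
    for _ in range(n_iters):
        # Step 1: hidden variables given current weights
        feats = [max(ex, key=lambda f: float(np.dot(w, f))) for ex in examples]
        # Step 2: regularized hinge-loss subgradient steps
        for _ in range(20):
            grad = w.copy()
            for f, y in zip(feats, labels):
                if y * np.dot(w, f) < 1:
                    grad -= C * y * np.asarray(f)
            w -= lr * grad
    return w

examples = [[np.array([2.0, 0.0]), np.array([1.0, 1.0])],    # positive
            [np.array([-2.0, 0.0]), np.array([-1.0, -1.0])]] # negative
w = latent_svm(examples, labels=[1, -1])
```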
Slide 36: Example II: Details (10)
Iterative algorithm:
Step 1: fill in the latent positions with the best score (by dynamic programming).
Step 2: solve the structural SVM problem using a partial negative training set (incrementally enlarged).
Initialization: no pretraining (no clustering); no displacement of the nodes (no deformation); pose assignment by maximum overlap; simultaneous multi-layer learning.
Slide 37: Detection Results on PASCAL 2010: Cat
Slide 38: Example II: Cat Results
Slide 39: Example II: Horse Results
Slide 40: Example II: Car Results
Slide 41: Example II: Conclusion
All current methods that perform well on the PASCAL Object Detection Challenge use these types of models.
Performance is fairly good for medium to large objects. The errors are understandable: cat versus dog, car versus train.
But it seems highly unlikely that this is how humans perform these tasks; humans can probably learn from much less data.
The devil is in the details: small "engineering" changes can yield big improvements.
Improved results come from combining these "top-down" object models with "bottom-up" edge cues: Fidler, Mottaghi, Yuille, Urtasun. CVPR 2013.
Slide 42: Example III: Grammars/Compositional Models
Generative models of objects and scenes.
These models have explicit representations of parts: e.g., they can "parse" objects instead of just detecting them.
Explicit representations give the ability to perform multiple tasks (arguably closer to human cognition).
Part sharing gives efficiency of inference and learning.
Adaptive and flexible; can learn from little data.
Tyranny of datasets: "will they work on PASCAL?".
Slide 43: Example III: Generative Models
Basic grammars (Grenander, Fu, Mjolsness, Biederman).
Images are generated from dictionaries of elementary components, with stochastic rules for spatial and structural relations.
Slide 44: Example III: Analysis by Synthesis
Analyze an image by inverting image formation.
Inverse problem: determine how the data was generated; what caused it?
Inverse computer graphics.
Slide 45: Example III: Real Images
Image Parsing (Z. Tu, X. Chen, A.L. Yuille, and S.C. Zhu 2003).
Learn probabilistic models for the visual patterns that can appear in images.
Interpret/understand an image by decomposing it into its constituent parts.
Inference algorithm: bottom-up and top-down.
Slide 46: Example III: Advantages
Rich explicit representations enable:
Understanding of objects, scenes, and events.
Reasoning about the functions and roles of objects and the goals and intentions of agents, and predicting the outcomes of events (S.C. Zhu, MURI).
Slide 47: Example III: Advantages
Ability to transfer between contexts and generalize or extrapolate (e.g., from Cow to Yak). Reduces the hypothesis space (PAC theory).
Ability to reason about the system, intervene, and do diagnostics.
Allows the system to answer many different questions based on the same underlying knowledge structure.
Scales up to multiple objects by part sharing.
Slide 48: Example III: Car Detection
Kokkinos and Yuille 2010. A 3-layer model.
The object is made from parts: Car = Red-Part AND Blue-Part AND Green-Part.
Parts are made by AND-ing contours: Red-Part = Con-1 AND Con-2 ...
These contours correspond to AND-ing tokens extracted from the image.
The model has flexible geometry to deal with different types of cars: an SUV looks different than a Prius.
Parts move relative to the object; contours move relative to the parts.
This spatial variation is quantified by a probability distribution learnt from data.
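Scoring such an AND-composition (child scores plus a learnt distribution on relative positions) might look like the sketch below, where the Gaussian spatial prior and all numbers are made up for illustration:

```python
import numpy as np

def compose_score(part_scores, offsets, mean_offsets, sigma=1.0):
    """AND-node score: sum of child scores plus a Gaussian log-prior on
    each child's offset from the parent (the learnt spatial model)."""
    spatial = sum(-0.5 * np.sum((np.array(o) - np.array(m)) ** 2) / sigma**2
                  for o, m in zip(offsets, mean_offsets))
    return sum(part_scores) + spatial

# toy car: three parts, one slightly displaced from its expected offset
score = compose_score(
    part_scores=[1.2, 0.8, 1.0],
    offsets=[(0, 0), (2, 0), (4, 1)],
    mean_offsets=[(0, 0), (2, 0), (4, 0)],
)
print(score)   # 2.5: 3.0 from the parts, -0.5 deformation penalty
```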
Slide 49: Example III: Generative Models
Slide 50: Example III: Analogy: Building a Puzzle
Bottom-up solution: combine pieces until you build the car. Does not exploit the box's cover.
Top-down solution: try fitting each piece to the box's cover. Most pieces are uniform/irrelevant.
Bottom-up/top-down solution: form car-like structures, but use the cover to suggest combinations.
Uses AI techniques from McAllester and Felzenszwalb.
Slide 51: Example III: Localize and Parse
Slide 52: Example III: Summary
The car/object is represented as a hierarchical graphical model.
Inference algorithms: message passing/dynamic programming/A*.
Learning algorithms: parameter estimation; multi-instance learning (latent SVM is a special case).
Slide 53: Example III: Part Sharing
Exploit part sharing to deal with multiple objects.
More efficient inference and representation, with exponential gains: quantified in Yuille and Mottaghi, ICML 2013.
Learning requires less data: a part learnt for a Cow can be used for a Yak.
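The efficiency gain from part sharing can be illustrated by memoizing shared part evaluations across objects (a toy count, not the ICML 2013 analysis; the part names are invented):

```python
from functools import lru_cache

evaluations = 0

@lru_cache(maxsize=None)
def evaluate_part(part):
    """Pretend to score one part against the image; count real evaluations."""
    global evaluations
    evaluations += 1
    return hash(part) % 7   # stand-in score

objects = {
    "cow": ["legs", "torso", "horned-head"],
    "yak": ["legs", "torso", "shaggy-head"],   # shares legs and torso
}
scores = {name: sum(evaluate_part(p) for p in parts)
          for name, parts in objects.items()}
print(evaluations)   # 4: only the distinct parts are evaluated, not 6
```

With deep hierarchies and many objects the same caching applies at every level, which is where the exponential gains come from.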
Slide 54: Example III: AND/OR Graphs for Baseball
Part sharing enables the model to deal with objects with multiple poses and viewpoints (~100).
Inference and learning by bottom-up and top-down processing.
Slide 55: Example III: Results on Baseball Players
Performed well on benchmarked datasets.
Zhu, Chen, Lin, Lin, Yuille CVPR 2008, 2010.
Slide 56: Example III: Structure Learning
Task: given 10 training images, with no labeling, no alignment, and highly ambiguous features, estimate the graph structure (nodes and edges) and estimate the parameters.
Challenges: the combinatorial explosion problem; correspondence is unknown.
Slide 57: Example III: Unsupervised Learning
Structure induction bridges the gap between low-, mid-, and high-level vision.
Between Chomsky and Hinton?
Slide 58: Example III: Learning Multiple Objects
Unsupervised learning algorithm to learn parts shared between different objects.
Zhu, Chen, Freeman, Torralba, Yuille 2010.
Structure induction: learning the graph structures and learning the parameters.
Slide 59: Example III: Many Objects/Viewpoints
120 templates: 5 viewpoints & 26 classes.
Slide 60: Example III: Learn a Hierarchical Dictionary
Low-level to mid-level to high-level.
Automatically shares parts and stops.
Slide 61: Example III: Part Sharing Decreases with Levels
Slide 62: Example III: Summary
These generative models with explicit, rich representations offer potential advantages: flexibility, adaptability, transfer.
They enable reasoning about the functions and roles of objects and the goals and intentions of agents, and predicting the outcomes of events.
Access to semantic descriptions; making analogies between images.
Augmented Reality: e.g., a computer vision system communicating with a visually impaired human.
"In the long term models will be generative." G. Hinton, 2013.
Slide 63: Conclusions
Three examples of hierarchical models of vision: Convolutional Networks, Structured Discriminative Models, Generative Grammars/Compositional Models.
Relations to neuroscience.
Machine Learning and Cognitive Science.
Augmented Reality: humans and computers.
Importance of data and tasks.
Slide 64: Theoretical Frameworks
All three models are formulated in terms of probability distributions/energy functions defined over graphs or grammars.
Discriminative versus generative models: P(W|I) versus P(I|W) P(W).
Representation: are properties represented explicitly? (A requirement for performing tasks.)
Inference algorithms and learning algorithms.
Generalization (PAC theorems).
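The discriminative/generative contrast can be written in Bayesian terms (W = world state, I = image):

```latex
% Discriminative: model the posterior directly.
P(W \mid I)
% Generative: model the image-formation likelihood and the prior,
% then invert with Bayes' rule (analysis by synthesis):
P(W \mid I) \;=\; \frac{P(I \mid W)\, P(W)}{P(I)} \;\propto\; P(I \mid W)\, P(W)
```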
Slide 65: A Probabilistic Model is Defined by Four Elements
(i) Graph structure (nodes/edges): representation.
(ii) State variables W and input I: representation.
(iii) Potentials phi: probability.
(iv) Parameters/weights lambda: probability.
The state variables are defined at the graph nodes.
The potentials and parameters are defined over the graph edges, and relate the model to the image I.
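These four elements can be written down directly. A minimal sketch of an unnormalized model with one hypothetical edge potential (all names and numbers are illustrative, not any specific model from the talk):

```python
import math

# (i) Graph structure: nodes and edges                  -- representation
nodes = ["a", "b"]
edges = [("a", "b")]

# (ii) State variables W at the nodes, given input I    -- representation
W = {"a": 1.0, "b": -1.0}
I = 0.5

# (iii) Potential phi and (iv) weight lambda per edge   -- probability
def phi(wu, wv, image):
    return -(wu - wv) ** 2 + wu * image     # couples the model to I

lam = {("a", "b"): 2.0}

# Unnormalized probability: exponential of the weighted potentials
energy = sum(lam[e] * phi(W[e[0]], W[e[1]], I) for e in edges)
p_unnorm = math.exp(energy)
```

Normalizing p_unnorm over all states W gives the distribution P(W|I); inference and learning then operate on this graph-structured function.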