Lifelong Machine Learning
Bing Liu, University of Illinois at Chicago, liub@uic.edu
Introduction (Chen and Liu, 2016 - book)
Classic Machine Learning (ML) paradigm: isolated single-task learning
Given a dataset, run an ML algorithm to build a model, e.g., SVM, CRF, neural networks, topic modeling, ...
Without considering previously learned knowledge
Weaknesses of "isolated learning":
Knowledge learned is not retained or accumulated
Needs a large number of training examples
Suitable for well-defined & narrow tasks
Humans never learn in isolation
We retain knowledge learned in the past and use it to learn more knowledge
Learn effectively from a few or no examples
Our knowledge learned in the past enables us to learn new things with little data or effort.
Nobody has ever given me 2,000 training documents and asked me to build a classifier.
Without accumulated knowledge, I could not do it.
E.g., if someone gives me 2,000 training docs in Arabic (with no translation), it is impossible for me to do.
Lifelong machine learning (LML)
Statistical ML is getting increasingly mature.
It is time to work on lifelong machine learning:
Retain/accumulate learned knowledge from the past & use it to help future learning
Become more knowledgeable & better at learning
Chatbots and physical robots need LML in their interactions with humans and environments.
Without LML, true AI is unlikely. We need a paradigm shift to LML.
A Motivating Example (Liu, 2012, 2015)
My interest in LML came from my experiences in a sentiment analysis startup.
Sentiment analysis (SA): sentiment and target, e.g., "The screen is great, but the voice quality is poor."
Positive about the screen but negative about the voice quality
Extensive knowledge sharing across tasks/domains:
Sentiment expressions & product features (aspects)
Knowledge Shared Across Domains
After working on many SA projects for clients, I realized:
A lot of concepts are shared across domains.
As we see more and more application domains, fewer and fewer things are new.
(1) It is easy to see the sharing of sentiment words, e.g., good, bad, poor, terrible, etc.
(2) There is also a great deal of sharing of product features.
Sharing of Product Features
A great deal of product feature overlap across domains:
Every product review domain has the aspect price.
Most electronic products share the aspect battery.
Many also share the aspect screen.
...
It is rather "silly" not to exploit such sharing to make SA much more effective.
What does it mean for learning?
How to systematically exploit such sharing?
Retain/accumulate knowledge learned in the past
Leverage the knowledge for new task learning
I.e., lifelong machine learning (LML)
This leads to our work:
Lifelong topic modeling (Chen and Liu, 2014a, b)
Lifelong sentiment classification (Chen et al., 2015)
Others
LML is Useful in General
Such sharing and relatedness are everywhere.
E.g., NLP is particularly suitable for LML:
Words/phrases have the same meaning across domains.
Sentences follow the same syntax in all domains/fields.
It is hard to imagine humans having to learn everything from scratch whenever we face a new problem or environment.
If that were the case, intelligence would be unlikely.
Definition of LML (Thrun, 1995; Chen and Liu, 2016 - book)
The learner has performed learning on a sequence of tasks, from 1 to N.
When faced with the (N+1)th task, it uses the relevant knowledge in its knowledge base (KB) to help learn the (N+1)th task.
After learning the (N+1)th task, the KB is updated with the learned results from the (N+1)th task.
Key Characteristics of LML (Chen and Liu, 2016 - book)
Continuous learning process
Knowledge accumulation
Use of previously learned knowledge to help future learning
Two Types of Knowledge
Global knowledge: assume there is a global latent structure among tasks that is shared by all (Bou Ammar et al., 2014; Ruvolo and Eaton, 2013b; Tanaka and Yamamura, 1997; Thrun, 1996b; Wilson et al., 2007).
The global structure can be learned & used in new task learning.
These methods grew out of multi-task learning.
Local knowledge: the new task uses relevant pieces of past knowledge based on its needs (Chen and Liu, 2014a, b; Chen et al., 2015; Fei et al., 2016; Liu et al., 2016; Shu et al., 2016).
Transfer, Multitask vs. Lifelong
Transfer learning vs. lifelong learning:
Transfer learning uses source labeled data to help target learning.
Transfer learning is not continuous but one-time.
No retention of knowledge in transfer learning.
One-directional: transfer learning helps the target domain only.
Multitask learning vs. lifelong learning:
Multitask learning jointly optimizes the learning of multiple tasks.
Although it is possible to make it continuous, it does not retain any explicit knowledge except data.
It is hard to re-learn everything when faced with a new task.
ELLA - assuming global structure (Ruvolo & Eaton, 2013)
ELLA: Efficient Lifelong Learning Algorithm
It uses shared global knowledge.
It is based on the batch multitask learning method GO-MTL (Kumar et al., 2012).
ELLA is an online multitask learning method, which makes it LML:
The model for a new task can be added efficiently.
The model for each past task can be updated rapidly.
Assumption
ELLA follows GO-MTL in assuming:
All task models share some basic latent model components L = (l_1, l_2, ..., l_k).
L is learned from past tasks.
The model parameter vector θ^t of each task t is a linear combination of the shared latent components: θ^t = L s^t.
s^t: a weight vector.
θ^t: e.g., the parameters of logistic regression or linear regression.
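As a minimal numerical illustration of this assumption (hypothetical sizes and values, not taken from ELLA itself), each task's parameter vector is just a linear combination of the shared latent columns:

import numpy as np

d, k = 6, 2                    # d model parameters, k shared latent components (hypothetical sizes)
L = np.random.randn(d, k)      # shared latent basis L = (l_1, ..., l_k), learned from past tasks
s_t = np.array([0.7, -0.3])    # sparse task-specific weight vector s^t
theta_t = L @ s_t              # task t's parameters: theta^t = L s^t, e.g., logistic regression weights
print(theta_t)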
Initial Objective Function of ELLA
Objective function (an average rather than a sum)
N is the total number of tasks; n_t is the number of training instances in task t.
ELLA makes this efficient by two approximations:
Get rid of the inner summation.
Do not update s^t after learning task t.
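The slide's equation itself is not preserved in this text; for reference, the averaged GO-MTL/ELLA objective in the notation above has the following form (following Ruvolo and Eaton, 2013; the loss \mathcal{L}, sparsity weight \mu, and regularization weight \lambda are their notation):

\[ e(L) = \frac{1}{N} \sum_{t=1}^{N} \min_{s^{t}} \left\{ \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}\big(f(x_i^{t}; L\, s^{t}),\, y_i^{t}\big) + \mu \lVert s^{t} \rVert_1 \right\} + \lambda \lVert L \rVert_F^2 \]

The two approximations above remove the need to revisit all past training instances and to re-optimize every s^t whenever a new task arrives.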
Lifelong Sentiment Classification (Chen, Ma, and Liu, 2015)
“I bought a cellphone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is great too. ....”
Goal: classify documents or sentences as + or -.
One needs to manually label a lot of training data for each domain, which is highly labor-intensive.
Can we avoid labeling for every domain?
A Simple Lifelong Learning Method
If we have worked on a large number of past domains with their training data D:
Build a classifier using D, and test on the new domain.
Note: using only one past/source domain, as in transfer learning, is not enough.
In many cases, this improves accuracy by as much as 19% (= 80% - 61%). Why?
In some other cases, it is not so good; e.g., it works poorly for toy reviews. Why? The word "toy".
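A minimal sketch of this baseline (assuming scikit-learn and tiny hypothetical per-domain datasets; not the authors' exact experimental setup): pool the labeled reviews of all past domains into D, train one classifier on D, and apply it directly to the new domain.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# past_domains: hypothetical list of (texts, labels) pairs from earlier SA tasks
past_domains = [(["great phone", "poor battery"], [1, 0]),
                (["nice camera", "terrible screen"], [1, 0])]

# Pool all past labeled data into D
texts = [t for docs, _ in past_domains for t in docs]
labels = [y for _, ys in past_domains for y in ys]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

# Apply directly to documents from a new, unlabeled domain
new_docs = ["the toy is great for kids"]
print(clf.predict(vec.transform(new_docs)))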
Lifelong Sentiment Classification (Chen, Ma, and Liu, 2015)
It adopts a Bayesian optimization framework for LML using stochastic gradient descent.
Objective function for each positive review d_i: P(+ | d_i) - P(- | d_i) (and P(- | d_i) - P(+ | d_i) for each negative review).
Lifelong learning uses:
Word counts from the past data as priors
Penalty terms to deal with domain-dependent sentiment words and the reliability of knowledge
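To make "word counts from the past data as priors" concrete, here is a simplified naive Bayes sketch (hypothetical counts; the actual LSC model further adjusts these virtual counts via stochastic gradient descent with the penalty terms mentioned above): accumulated past counts act as pseudo-counts when estimating P(w | +) and P(w | -) in the new domain.

from collections import Counter

# Hypothetical accumulated knowledge base: word counts in +/- reviews from past domains
kb_pos = Counter({"great": 50, "good": 40, "poor": 2})
kb_neg = Counter({"poor": 45, "terrible": 30, "great": 3})

# Small labeled data in the new domain
new_pos = Counter({"great": 3, "fun": 2})
new_neg = Counter({"boring": 2, "poor": 1})

vocab = set(kb_pos) | set(kb_neg) | set(new_pos) | set(new_neg)
lam, alpha = 1.0, 1.0  # weight of the past (virtual) counts and smoothing constant

def word_prob(w, new_counts, kb_counts):
    # Past counts enter as priors (pseudo-counts) on top of the new-domain counts.
    num = new_counts[w] + lam * kb_counts[w] + alpha
    den = (sum(new_counts.values()) + lam * sum(kb_counts.values())
           + alpha * len(vocab))
    return num / den

print(word_prob("great", new_pos, kb_pos), word_prob("great", new_neg, kb_neg))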
Knowledge Base
Two types of previous knowledge:
Document-level knowledge
Domain-level knowledge
Exploiting Knowledge via Penalties
Domain-dependent sentiment words
Domain-level knowledge: if a word appears in only one or two past domains, the knowledge associated with it is probably not reliable or general.
One Result of the LSC Model
Better F1-score (left) and accuracy (right) with more past tasks.
Cumulative & self-motivated learning (Fei et al., 2016)
At time t, a t-class classifier F_t has been learned from the past data D^t = {D_1, ..., D_t} with classes Y^t = {l_1, ..., l_t}.
F_t classifies each test instance x into either one of the known classes in Y^t or the unknown class l_0.
At time t+1, a new class l_{t+1} (with data D_{t+1}) is added, and F_t is updated to a (t+1)-class classifier F_{t+1}: y = F_{t+1}(x), y ∈ {l_1, l_2, ..., l_t, l_{t+1}, l_0}.
Self-motivated learning: detect unseen/new things and learn them.
Learning cumulatively
How to incrementally add a class without retraining from scratch?
"Human learning": use the past knowledge F_t to help learn the new class l_{t+1}.
Find similar classes SC among the known classes Y^t. E.g.:
Old classes: Y^t = {movie, cat, politics, soccer}
New class: l_{t+1} = basketball
SC = {soccer}
Build F_{t+1} by focusing on separating l_{t+1} and SC.
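A rough sketch of this idea (random placeholder vectors and a simple centroid cosine similarity; the actual method of Fei et al. (2016) is more involved): pick the most similar known classes SC and train the new class's binary model mainly against them.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder document vectors for each known class and for the new class (e.g., basketball)
rng = np.random.default_rng(0)
known = {c: rng.random((20, 50)) for c in ["movie", "cat", "politics", "soccer"]}
new_docs = rng.random((20, 50))

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similar classes SC: known classes whose centroid is closest to the new class centroid
new_center = new_docs.mean(axis=0)
sims = {c: cos(new_center, docs.mean(axis=0)) for c, docs in known.items()}
SC = [max(sims, key=sims.get)]  # top-1 similar class; {soccer} in the slide's example

# Build the new class's binary component of F_{t+1} by separating it from SC
X = np.vstack([new_docs] + [known[c] for c in SC])
y = np.array([1] * len(new_docs) + [0] * sum(len(known[c]) for c in SC))
clf = LogisticRegression().fit(X, y)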
Open Classification (Fei and Liu, 2016)
Detect unseen-class documents (not in training).
Traditional classification makes the closed-world assumption:
Classes in testing have been seen in training, i.e., no new classes in the test data.
This is not true in most real-life environments: new data may contain unseen-class documents.
We need open (world) classification: detect the documents of unseen classes.
Open Classification
Open space risk formulation (see Fei & Liu, 2016)
Don't give each class too much open space.
SVM gives one half-space to each class: too much.
Ideally, a "ball" covers each class l_i.
CBS Learning – space transformation
To detect unseen classes, we proposed CBS learning: center-based similarity (CBS) space learning.
It performs a space transformation: each document vector d is transformed into a CBS-space vector.
We can use many similarity measures.
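A minimal sketch of the transformation (assuming one class center and two simple similarity features; the actual CBS method in Fei and Liu (2016) uses multiple centers and richer similarity measures):

import numpy as np

def cbs_transform(docs, center):
    # Map each document vector to features measuring closeness to the class center.
    sims = docs @ center / (np.linalg.norm(docs, axis=1) * np.linalg.norm(center) + 1e-12)
    dists = np.linalg.norm(docs - center, axis=1)
    return np.column_stack([sims, dists])       # one CBS feature vector per document

docs = np.random.rand(100, 300)                 # hypothetical document vectors
center = docs[:50].mean(axis=0)                 # center of the seen (positive) class
cbs_features = cbs_transform(docs, center)      # a classifier is then learned in this CBS space
print(cbs_features.shape)                       # (100, 2)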
Learning in the original doc space
Training data: positive = solid squares; negative = circles.

Learning in the original space: no good
Training data: positive = solid squares; negative = circles. Test data: negative = triangles.

CBS Learning
Training data: positive = solid squares; negative = circles. Test data: negative = triangles.
LTM: Lifelong Topic Modeling (Chen and Liu, ICML-2014)
Topic modeling finds topics from a collection of documents (Blei et al., 2003).
A document is a distribution over topics.
A topic is a distribution over terms/words, e.g., {price, cost, cheap, expensive, ...}.
Question: how to find good past knowledge and use it to help new topic modeling tasks?
Data: online product reviews
What past knowledge?
Different tasks share topics, e.g., product features (topics) in sentiment analysis.
Shared knowledge:
Should be in the same topic => must-links, e.g., {picture, photo}
Should not be in the same topic => cannot-links, e.g., {battery, picture}
LTM System

LTM Model
An Example
Given a newly discovered topic {price, book, cost, seller, money}, we find 3 matching topics from the topic base S:
Domain 1: {price, color, cost, life, picture}
Domain 2: {cost, screen, price, expensive, voice}
Domain 3: {price, money, customer, expensive}
If we require words to appear in at least two domains, we get two must-links (knowledge): {price, cost} and {price, expensive}.
Each set is likely to belong to the same aspect/topic.
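A small sketch of this mining step (simplified; LTM uses frequent itemset mining over the matched topics with a minimum support, here fixed at two domains):

from itertools import combinations

# Matched topics from the topic base S (each acts as one transaction)
matched_past_topics = [
    {"price", "color", "cost", "life", "picture"},       # Domain 1
    {"cost", "screen", "price", "expensive", "voice"},    # Domain 2
    {"price", "money", "customer", "expensive"},          # Domain 3
]
min_support = 2  # a word pair must co-occur in at least two domains

vocab = sorted(set().union(*matched_past_topics))
must_links = []
for w1, w2 in combinations(vocab, 2):
    support = sum(1 for t in matched_past_topics if w1 in t and w2 in t)
    if support >= min_support:
        must_links.append({w1, w2})

print(must_links)  # two must-links for this data: {price, cost} and {price, expensive}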
Model Inference: Gibbs Sampling
How do we use the must-link knowledge, e.g., {price, cost} & {price, expensive}?
Model inference is based on the Generalized Pólya Urn model (GPU).
Idea: when assigning a topic t to a word w (e.g., price), also assign a fraction of t to the words in must-links with w (e.g., cost and expensive).
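A simplified sketch of the GPU promotion step inside the collapsed Gibbs sampler (hypothetical count matrix, word ids, and promotion weight; the full LTM sampler is considerably more elaborate): whenever word w is assigned topic t, its must-linked words also receive a fractional count for t.

import numpy as np

V, K = 1000, 10                      # vocabulary size, number of topics (hypothetical)
word_topic = np.zeros((V, K))        # word-topic count matrix used by the sampler
must_links = {3: [17, 42]}           # word id 3 ("price") linked to "cost" and "expensive"
promotion = 0.3                      # fraction of a count given to must-linked words

def gpu_increment(w, t, delta=1.0):
    # Add a (possibly fractional) count of topic t to word w and to its must-linked words.
    word_topic[w, t] += delta
    for linked in must_links.get(w, []):
        word_topic[linked, t] += promotion * delta

gpu_increment(3, 5)                  # assigning topic 5 to word 3 also promotes words 17 and 42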
Simple Pólya Urn Model (SPU)

Generalized Pólya Urn Model (GPU)

Gibbs Sampling
Lifelong Relaxation Labeling (Shu et al., 2016)
Relaxation labeling (RL) is an unsupervised graph-based label propagation algorithm (unsupervised classification).
Goal: identify the label of each node.
Each node n_i in the graph is associated with a distribution P(L(n_i)), where L(n_i) is the label of n_i over a label set Y.
Each edge has two conditional distributions: P(L(n_i) | L(n_j)) and P(L(n_j) | L(n_i)).
Relaxation Labeling (contd.)
The neighbors Ne(n_i) of a node n_i are associated with a weight distribution w(n_j | n_i).
RL iteratively updates the label distribution of each node until convergence.
Initially, we have P^0(L(n_i)). Let P^{r+1}(L(n_i)) be the change of P(L(n_i)) at iteration r+1.
Relaxation Labeling (contd.)
The updated label distribution at iteration r+1 is computed as follows:
P^{r+1}(L(n_i) = y) = Σ_{n_j ∈ Ne(n_i)} w(n_j | n_i) Σ_{y' ∈ Y} P(L(n_i) = y | L(n_j) = y') × P^r(L(n_j) = y')
The final label of node n_i is its most probable label.
What past knowledge to use?
Lifelong-RL uses two forms of knowledge:
Prior edges: graphs are usually not given or fixed but are built from some data. If the data is small, many edges may be missing, but such edges may exist in the graphs of some previous tasks.
Prior labels: the initial P^0(L(n_i)) is quite hard to set, but results from previous tasks can be used to set it more accurately.
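A compact sketch of the RL update described above on a toy graph (hypothetical weights and conditionals; in Lifelong-RL, prior edges and prior labels from past tasks would simply be used to fill in w and P0 here):

import numpy as np

# Toy graph: three nodes, two labels; row i of P is the current distribution P(L(n_i))
P = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])                      # P0, e.g., initialized from past-task results
w = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])                 # neighbor weights w(n_j | n_i)
cond = np.array([[0.8, 0.2],
                 [0.2, 0.8]])                   # P(L(n_i)=y | L(n_j)=y'), shared by all edges here

for _ in range(50):                             # iterate until (approximate) convergence
    new_P = np.zeros_like(P)
    for i in range(len(P)):
        for j in range(len(P)):
            new_P[i] += w[i, j] * (cond @ P[j]) # weight times the sum over neighbor labels y'
    P = new_P / new_P.sum(axis=1, keepdims=True)  # renormalize (a no-op here, since w and cond are normalized)

print(P.argmax(axis=1))                         # final label of each node = its most probable label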
Lifelong-RL architecture
Conclusions
Introduced LML & discussed some current work.
Our understanding of LML is still very limited.
Current research focuses on only one type of task.
IMHO: without accumulating learned knowledge and using that knowledge to learn more, Artificial General Intelligence (AGI) is unlikely.
As statistical machine learning becomes increasingly mature, we should go for a paradigm shift to LML.
LML Challenges (Chen and Liu, 2016 - book)
There are many challenges for LML, e.g.:
Correctness of knowledge
Applicability of knowledge
Knowledge representation and reasoning
Learning with tasks of multiple types
Self-motivated learning
Compositional learning
Learning in interaction with humans & systems
Thank you
Q&A