Lifelong Machine Learning - PowerPoint Presentation

Uploaded by trish-goza on 2018-12-05

Presentation Transcript

Slide1

Lifelong Machine Learning

Bing Liu
University of Illinois at Chicago
liub@uic.edu

Slide2

Introduction (Chen and Liu, 2016 - book)

Classic Machine Learning (ML) paradigm: isolated single-task learning

Given a dataset, run an ML algorithm to build a model, e.g., SVM, CRF, Neural Nets, Topic Modeling, ...
Without considering previously learned knowledge

Weaknesses of "isolated learning":
Knowledge learned is not retained or accumulated
Needs a large number of training examples
Suitable only for well-defined & narrow tasks

Slide3

Humans never learn in isolation

We retain knowledge learned in the past and use it to learn more knowledge

Learn effectively from a few or no examples

Our knowledge learned in the past enables us to learn new things with little data or effort.
Nobody has ever given me 2000 training documents and asked me to build a classifier.
Without accumulated knowledge, I could not do it.
E.g., if someone gave me 2000 training docs in Arabic (with no translation), it would be impossible for me to do.

Slide4

Lifelong machine learning (LML)

Statistical ML is getting increasingly mature. It is time to work on Lifelong Machine Learning:

Retain/accumulate learned knowledge from the past & use it to help future learning
Become more knowledgeable & better at learning
Chatbots and physical robots need LML in their interactions with humans and environments.
Without LML, true AI is unlikely. We need a paradigm shift to LML.

Slide5

A Motivating Example (Liu, 2012, 2015)

My interest in LML came from my experiences in a sentiment analysis startup.

Sentiment analysis (SA): sentiment and target: "The screen is great, but the voice quality is poor."
Positive about screen but negative about voice quality.
Extensive knowledge sharing across tasks/domains: sentiment expressions & product features (aspects).

Slide6

Knowledge Shared Across Domains

After working on many SA projects for clients, I realized a lot of concept sharing across domains: as we see more and more application domains, fewer and fewer things are new.

(1) It is easy to see the sharing of sentiment words, e.g., good, bad, poor, terrible, etc.
(2) There is also a great deal of sharing of product features.

Slide7

Sharing of Product Features

A great deal of product-feature overlap across domains:
Every product review domain has the aspect price.
Most electronic products share the aspect battery.
Many also share the aspect screen.
...
It is rather "silly" not to exploit such sharing to make SA much more effective.

Slide8

What does it mean for learning?

How to systematically exploit such sharing?
Retain/accumulate knowledge learned in the past.
Leverage the knowledge for new task learning.
I.e., lifelong machine learning (LML).

This leads to our work:
Lifelong topic modeling (Chen and Liu 2014a, b)
Lifelong sentiment classification (Chen et al 2015)
Others

Slide9

LML is Useful in General

Such sharing and relatedness is everywhere. E.g., NLP is particularly suitable for LML:
Words/phrases: same meaning across domains.
Sentences: same syntax in all domains/fields.

It is hard to imagine humans having to learn everything from scratch whenever we face a new problem or environment.
If that were the case, intelligence would be unlikely.

Slide10

Definition of LML (Thrun 1995; Chen and Liu, 2016 - book)

The learner has performed learning on a sequence of tasks, from 1 to N.
When faced with the (N+1)th task, it uses the relevant knowledge in its knowledge base (KB) to help learn the (N+1)th task.
After learning the (N+1)th task, the KB is updated with the learned results from the (N+1)th task.
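The definition above can be sketched as a simple loop. This is a minimal illustration only; `lifelong_learn`, `learn`, and `extract_knowledge` are hypothetical names, not from the talk:

```python
# Minimal sketch of the LML definition above: consult the knowledge base
# (KB) when learning each new task, then update the KB with the results.
# `learn` and `extract_knowledge` are hypothetical placeholder callables.

def lifelong_learn(tasks, learn, extract_knowledge):
    """learn(task, kb) -> model; extract_knowledge(model) -> dict of knowledge."""
    kb = {}        # knowledge base accumulated over tasks 1..N
    models = []
    for task in tasks:                       # the (N+1)th task arrives
        model = learn(task, kb)              # use relevant past knowledge
        kb.update(extract_knowledge(model))  # update KB with learned results
        models.append(model)
    return models, kb
```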

Slide11

Key Characteristics of LML (Chen and Liu, 2016 - book)

Continuous learning process
Knowledge accumulation
Use of previously learned knowledge to help future learning

Slide12

Two Types of Knowledge

Global knowledge: assume there is a global latent structure among tasks that is shared by all (Bou Ammar et al., 2014; Ruvolo and Eaton, 2013b; Tanaka and Yamamura, 1997; Thrun, 1996b; Wilson et al., 2007).
The global structure can be learned & used in new task learning.
These methods grew out of multi-task learning.

Local knowledge: the new task uses relevant pieces of past knowledge based on its needs (Chen and Liu, 2014a,b; Chen et al., 2015; Fei et al., 2016; Liu et al., 2016; Shu et al., 2016).

Slide13

Transfer, Multitask & Lifelong

Transfer learning vs. lifelong learning:
Uses source labeled data to help target learning.
Transfer learning is one-time, not continuous.
No retention of knowledge in transfer learning.
One-directional: transfer helps the target domain only.

Multitask learning vs. lifelong learning:
Jointly optimizes learning of multiple tasks.
Although it is possible to make it continuous, it does not retain any explicit knowledge except data.
Hard to re-learn everything when faced with a new task.

Slide14

ELLA - assuming global structure (Ruvolo & Eaton, 2013)

ELLA: Efficient Lifelong Learning Algorithm. It uses shared global knowledge.
It is based on the batch multitask learning method GO-MTL (Kumar et al., 2012).
ELLA is an online multitask learning method, which makes it LML; ELLA becomes a lifelong learning method:
The model for a new task can be added efficiently.
The model for each past task can be updated rapidly.

Slide15

Assumption

ELLA follows GO-MTL in assuming:
All task models share some basic model components: L = (l_1, l_2, ..., l_k).
L is learned from past tasks.
The model parameter θ_t of each task t is a linear combination of the shared latent components: θ_t = L s_t.
s_t: a weight vector.
θ_t: e.g., the parameters of a logistic regression or linear regression model.
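A minimal numeric sketch of this factorization (shapes and values below are illustrative, not from the paper):

```python
import numpy as np

# Sketch of the GO-MTL/ELLA assumption above: each task's parameter
# vector theta_t is a sparse linear combination L @ s_t of k shared
# latent components (the columns of L). Dimensions are illustrative.

d, k = 5, 3                       # feature dimension, latent components
rng = np.random.default_rng(0)
L = rng.standard_normal((d, k))   # shared basis learned over past tasks
s_t = np.array([0.7, 0.0, -1.2])  # sparse task-specific weight vector
theta_t = L @ s_t                 # e.g., logistic/linear regression weights
```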

Slide16

Initial Objective Function of ELLA

Objective function (an average over tasks rather than a sum):
N is the total number of tasks; n_t is the number of training instances in task t.
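The objective itself appeared as an image on the slide. A hedged reconstruction, following the published GO-MTL/ELLA formulation (notation may differ slightly from the slide):

```latex
e_T(L) \;=\; \frac{1}{N}\sum_{t=1}^{N}\ \min_{s_t}
\Big\{ \frac{1}{n_t}\sum_{i=1}^{n_t}
\mathcal{L}\big(f(x_i^{(t)}; L\, s_t),\, y_i^{(t)}\big)
\;+\; \mu \lVert s_t \rVert_1 \Big\}
\;+\; \lambda \lVert L \rVert_F^2
```

The average over tasks (rather than a sum) matches the slide's remark, and the inner sum over instances is the one ELLA removes via approximation.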

ELLA makes this efficient via two approximations:
(1) Get rid of the inner summation.
(2) Do not update s_t after learning task t.

Slide17

Lifelong Sentiment Classification (Chen, Ma, and Liu 2015)

"I bought a cellphone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is great too. ...."

Goal: classify docs or sentences as + or -.
We need to manually label a lot of training data for each domain, which is highly labor-intensive.
Can we avoid labeling for every domain?

Slide18

A Simple Lifelong Learning Method

Suppose we have worked on a large number of past domains with their training data D.
Build a classifier using D and test on the new domain.
Note: using only one past/source domain, as in transfer learning, is not enough.
In many cases this improves accuracy by as much as 19% (= 80% - 61%). Why?
In some other cases it is not so good, e.g., it works poorly for toy reviews. Why? ("toy")
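A toy sketch of this baseline, pooling labeled reviews from all past domains into one training set. The word-count scorer below is purely illustrative, not the model used in the talk:

```python
from collections import Counter

# Sketch of the simple baseline above: pool labeled reviews from many
# past domains into one training set D and build a single classifier,
# here a toy word-count sentiment scorer (illustrative only).

def train_on_past_domains(past_domains):
    """past_domains: list of [(tokens, label)] per domain; label in {+1, -1}."""
    counts = {+1: Counter(), -1: Counter()}
    for domain in past_domains:
        for tokens, label in domain:
            counts[label].update(tokens)
    return counts

def classify(counts, tokens):
    score = sum(counts[+1][w] - counts[-1][w] for w in tokens)
    return +1 if score >= 0 else -1
```

A domain like toy reviews breaks this scorer because words carry domain-dependent sentiment, which is exactly the problem the penalty terms in the next slides address.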

Slide19

Lifelong Sentiment Classification (Chen, Ma and Liu, 2015)

It adopts a Bayesian optimization framework for LML using stochastic gradient descent.
Objective function for each +ve review: P(+ | d_i) - P(- | d_i) (and P(- | d_i) - P(+ | d_i) for each -ve review).
Lifelong learning uses:
Word counts from the past data as priors.
Penalty terms to deal with domain-dependent sentiment words and the reliability of knowledge.
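A small sketch of the "word counts as priors" idea. This is a simplification: the paper's actual model optimizes the stated objective with SGD and penalty terms, and the constants here are illustrative:

```python
from collections import Counter

# Sketch: past-domain word counts act as pseudo-counts (priors) when
# estimating the new domain's per-class word probabilities, naive-Bayes
# style. `prior_weight` and `alpha` are illustrative smoothing constants.

def word_probs(new_counts, past_counts, vocab, prior_weight=1.0, alpha=0.1):
    pseudo = {w: new_counts[w] + prior_weight * past_counts[w] for w in vocab}
    total = sum(pseudo.values())
    return {w: (pseudo[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}
```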

Slide20

Knowledge Base

Two types of previous knowledge:
Document-level knowledge
Domain-level knowledge

Slide21

Knowledge Base

Two types of knowledge:
Document-level knowledge
Domain-level knowledge

Slide22

Exploiting Knowledge via Penalties

Domain-dependent sentiment words.
Domain-level knowledge: if a word appears in only one or two past domains, the knowledge associated with it is probably not reliable or general.

Slide23

One Result of the LSC Model

Better F1-score (left) and accuracy (right) with more past tasks.

Slide24

Cumulative & Self-motivated Learning (Fei et al., 2016)

At time t, a t-class classifier F_t has been learned from the past data D_t = {D_1, ..., D_t}, with classes Y_t = {l_1, ..., l_t}.
F_t classifies each test instance x to either one of the known classes in Y_t or the unknown class l_0.
At time t+1, a new class l_{t+1} (with data D_{t+1}) is added, and F_t is updated to a (t+1)-class classifier F_{t+1}:
y = F_{t+1}(x), y ∈ {l_1, l_2, ..., l_t, l_{t+1}, l_0}

Self-motivated learning: detect unseen/new things and learn them.

Slide25

Learning Cumulatively

How to incrementally add a class without retraining from scratch?
"Human learning": use the past knowledge F_t to help learn the new class l_{t+1}.
Find similar classes SC among the known classes Y_t. E.g.:
Old classes: Y_t = {movie, cat, politics, soccer}.
New class: l_{t+1} = basketball.
SC = {soccer}.
Build F_{t+1} by focusing on separating l_{t+1} from SC.
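A toy sketch of the "find similar classes" step. The word-count centroid and overlap similarity below are illustrative choices, not the paper's exact measures:

```python
from collections import Counter

# Sketch: represent each known class by a word-count centroid and rank
# past classes by overlap with the new class's documents, so F_{t+1}
# can focus on separating the new class from the most similar ones (SC).

def centroid(docs):
    c = Counter()
    for doc in docs:
        c.update(doc)
    return c

def overlap(c1, c2):
    return sum(min(c1[w], c2[w]) for w in set(c1) & set(c2))

def similar_classes(new_docs, past_centroids, top=1):
    nc = centroid(new_docs)
    ranked = sorted(past_centroids, key=lambda y: -overlap(nc, past_centroids[y]))
    return ranked[:top]
```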

Slide26

Open Classification (Fei and Liu, 2016)

Detect unseen-class docs (not in training).
Traditional classification makes the closed-world assumption:
Classes in testing have been seen in training, i.e., no new classes in the test data.
This is not true in most real-life environments.
New data may contain unseen-class documents.
We need open (world) classification: detect the unseen classes of documents.

Slide27

Open Classification

Open space risk formulation (see Fei & Liu 2016):
Don't give each class too much open space.
SVM gives one half-space to each class: too much.
Ideally, a "ball" to cover each class l_i.
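A toy sketch of the "ball per class" idea (centers, radii, and the rejection label below are illustrative):

```python
import math

# Sketch: cover each known class with a ball around its center; a test
# point falling outside every ball is assigned the unknown class l_0.

def open_classify(x, centers, radii, unknown="l_0"):
    best_label, best_dist = unknown, math.inf
    for label, center in centers.items():
        dist = math.dist(x, center)
        if dist <= radii[label] and dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```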

Slide28

CBS Learning - space transformation

To detect unseen classes, we proposed CBS learning: Center-Based Similarity (CBS) space learning.
It performs space transformation: each document vector d is transformed to a CBS space vector.
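A minimal sketch of the transformation; using cosine similarity to a small set of centers is an illustrative choice:

```python
import math

# Sketch of CBS space transformation: re-represent each document vector d
# by its similarities to class centers, instead of its raw features.

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def cbs_transform(d, centers):
    return [cosine(d, c) for c in centers]
```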

We can use many similarity measures.

Slide29

Learning in the original doc space

Training data: Positive: solid squares; Negative: circles.

Slide30

Learning in the original space: no good

Training data: Positive: solid squares; Negative: circles.
Test data: Negative: triangles.

Slide31

CBS Learning

Training data: Positive: solid squares; Negative: circles.
Test data: Negative: triangles.

Slide32

LTM: Lifelong Topic Modeling (Chen and Liu, ICML-2014)

Topic modeling finds topics from a collection of documents (Blei et al 2003).
A document is a distribution over topics.
A topic is a distribution over terms/words, e.g., {price, cost, cheap, expensive, ...}.
Question: how to find good past knowledge and use it to help new topic modeling tasks?
Data: online product reviews.

Slide33

What past knowledge?

Different tasks share topics, e.g., product features (topics) in sentiment analysis. Shared knowledge:
Words that should be in the same topic => must-links, e.g., {picture, photo}.
Words that should not be in the same topic => cannot-links, e.g., {battery, picture}.

Slide34

LTM System

Slide35

LTM Model

Slide36

An Example

Given a newly discovered topic {price, book, cost, seller, money}, we find 3 matching topics in the topic base S:
Domain 1: {price, color, cost, life, picture}
Domain 2: {cost, screen, price, expensive, voice}
Domain 3: {price, money, customer, expensive}

If we require words to appear together in at least two domains, we get two must-links (knowledge): {price, cost} and {price, expensive}.
Each set is likely to belong to the same aspect/topic.
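The mining step in this example can be sketched as follows; the pair-counting scheme is a straightforward reading of the slide, with illustrative names:

```python
from itertools import combinations

# Sketch of must-link extraction: from matched topics across domains,
# keep word pairs that co-occur in the topics of >= min_domains domains.

def must_links(matched_topics, min_domains=2):
    pair_count = {}
    for topic in matched_topics:  # one matched topic (word set) per domain
        for pair in combinations(sorted(set(topic)), 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    return {p for p, c in pair_count.items() if c >= min_domains}
```

On the slide's three matched topics, this yields exactly {price, cost} and {price, expensive}.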

Slide37

Model Inference: Gibbs Sampling

How to use the must-link knowledge, e.g., {price, cost} & {price, expensive}?
Model inference is based on the Generalized Pólya Urn Model (GPU).
Idea: when assigning a topic t to a word w (e.g., price), also assign a fraction of t to the words in must-links with w (e.g., cost and expensive).
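The promotion step can be sketched as below; the fraction value and data layout are illustrative, not the paper's exact parameterization:

```python
# Sketch of the Generalized Pólya Urn update above: assigning topic t to
# word w also promotes w's must-linked words under t by a fraction.

def gpu_increment(word_topic_counts, word, topic, must_link, fraction=0.3):
    key = (word, topic)
    word_topic_counts[key] = word_topic_counts.get(key, 0.0) + 1.0
    for linked in must_link.get(word, []):
        lkey = (linked, topic)
        word_topic_counts[lkey] = word_topic_counts.get(lkey, 0.0) + fraction
    return word_topic_counts
```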

Slide38

Simple Pólya Urn Model (SPU)

Slide39

Simple Pólya Urn Model (SPU)

Slide40

Generalized Pólya Urn Model (GPU)

Slide41

Gibbs Sampling

Slide42

Lifelong Relaxation Labeling (Shu et al., 2016)

Relaxation Labeling (RL) is an unsupervised graph-based label propagation algorithm: unsupervised classification.
Goal: identify the label of each node.
Each node n_i in the graph is associated with a distribution P(L(n_i)), where L(n_i) is the label of n_i over a label set Y.
Each edge has two conditional distributions: P(L(n_i) | L(n_j)) and P(L(n_j) | L(n_i)).

Slide43

Relaxation Labeling (contd.)

The neighbors Ne(n_i) of a node n_i are associated with a weight distribution w(n_j | n_i).
RL iteratively updates the label distribution of each node until convergence.
Initially, we have P^0(L(n_i)). Let P^{r+1}(L(n_i)) be the updated P(L(n_i)) at iteration r + 1.
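One update step can be sketched as follows. This assumes a standard relaxation-labeling form (the slide showed the exact formula as an equation image), with illustrative data structures:

```python
# Sketch of one relaxation-labeling iteration: each node's label
# distribution is recomputed from its neighbors' current distributions
# via the edge conditionals and neighbor weights.

def rl_step(P, neighbors, weight, cond, labels):
    """P[n][y]; neighbors[n]; weight[(m, n)]; cond[(n, m)][(y, y2)] = P(L(n)=y | L(m)=y2)."""
    new_P = {}
    for n in P:
        new_P[n] = {}
        for y in labels:
            new_P[n][y] = sum(
                weight[(m, n)]
                * sum(cond[(n, m)][(y, y2)] * P[m][y2] for y2 in labels)
                for m in neighbors[n])
    return new_P
```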

Slide44

Relaxation Labeling (contd.)

The updated label distribution at iteration r + 1 is computed by combining the neighbors' current distributions through the neighbor weights and edge conditionals (the formula appeared as an equation on the slide).
The final label of node n_i is its most probable label.

Slide45

What past knowledge to use?

Lifelong-RL uses two forms of knowledge:
Prior edges: graphs are usually not given or fixed but are built from some data. If the data is small, many edges may be missing, but such edges may exist in the graphs of some previous tasks.
Prior labels: the initial P^0(L(n_i)) is quite hard to set, but results from previous tasks can be used to set it more accurately.

Slide46

Lifelong-RL architecture

Slide47

Conclusions

Introduced LML & discussed some current work.
Our understanding of LML is still very limited; current research focuses on only one type of task.
IMHO: without accumulating learned knowledge and using that knowledge to learn more, Artificial General Intelligence (AGI) is unlikely.
As statistical machine learning is increasingly mature, we should go for a paradigm shift to LML.

Slide48

LML Challenges (Chen and Liu 2016-book)

There are many challenges for LML, e.g.:
Correctness of knowledge
Applicability of knowledge
Knowledge representation and reasoning
Learning with tasks of multiple types
Self-motivated learning
Compositional learning
Learning in interaction with humans & systems

Slide49

Thank you

Q&A