

Presentation Transcript

Slide1

Large-Scale Distributed Non-negative Sparse Coding and Sparse Dictionary Learning

Author: Vikas Sindhwani and Amol Ghoting

Presenter: Jinze Li

Slide2

Problem Introduction

We are given a collection of N data points or signals in a high-dimensional space R^D: x_i ∈ R^D, 1 ≤ i ≤ N. Let h_j ∈ R^D, 1 ≤ j ≤ K denote a dictionary of K "atoms", which we collect as the rows of the matrix H = (h_1 . . . h_K)^T ∈ R^{K×D}. Given a suitable dictionary H, the goal of sparse coding is to represent each data point approximately as a sparse linear combination of atoms in the dictionary.
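As a concrete illustration of this notation, the short NumPy sketch below (illustrative dimensions and random data, not the authors' code) builds a dictionary H with K atoms as its rows and forms signals as sparse non-negative combinations of those atoms.

    import numpy as np

    rng = np.random.default_rng(0)

    N, D, K = 1000, 64, 32                           # N signals in R^D, K dictionary atoms
    H = np.abs(rng.standard_normal((K, D)))          # dictionary H (K x D), atoms as rows
    H /= np.linalg.norm(H, axis=1, keepdims=True)    # unit-norm atoms

    W = np.abs(rng.standard_normal((N, K)))          # non-negative codes, one row per signal
    W[W < 1.2] = 0.0                                 # zero out most entries: each code is sparse

    X = W @ H                                        # each signal x_i is a sparse combination of atoms
    print(X.shape, np.count_nonzero(W) / W.size)     # (1000, 64) and the fraction of non-zeros in W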

Slide3

Formulation

Constraints:

W_i ≥ 0 (non-negative coding)
||W_i||_p < γ (sparse coding)
h_j ≥ 0 (non-negative dictionary)
||h_j||_q ≤ ν (sparse dictionary)
||h_j||_2^2 = 1 (uniqueness)
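The sketch below checks these constraints for a candidate pair (W, H), assuming the sparsity norms p and q are taken as l0 counts of non-zeros (the slide leaves p and q generic; hard sparsity is what the later implementation slides enforce).

    import numpy as np

    def check_constraints(W, H, gamma, nu, tol=1e-8):
        # W: N x K codes (one row W_i per signal), H: K x D dictionary (one row h_j per atom).
        ok_w_nonneg = np.all(W >= -tol)                                  # W_i >= 0
        ok_w_sparse = np.all((np.abs(W) > tol).sum(axis=1) <= gamma)     # ||W_i||_0 <= gamma
        ok_h_nonneg = np.all(H >= -tol)                                  # h_j >= 0
        ok_h_sparse = np.all((np.abs(H) > tol).sum(axis=1) <= nu)        # ||h_j||_0 <= nu
        ok_h_norm = np.allclose((H ** 2).sum(axis=1), 1.0, atol=1e-6)    # ||h_j||_2^2 = 1
        return bool(ok_w_nonneg and ok_w_sparse and ok_h_nonneg and ok_h_sparse and ok_h_norm)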

Slide4

The first term in the objective function measures the squared reconstruction error.

The second term is a regularizer that enforces the learnt dictionary to be close to a prior H0 (if H0 is unknown, λ is set to 0). Maintaining H as a sparse matrix turns its updates across iterations into very cheap sequential operations.
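A minimal sketch of evaluating this objective, assuming the squared-Frobenius form ||X - WH||_F^2 + λ||H - H0||_F^2 suggested by the description above (the paper's exact scaling may differ):

    import numpy as np

    def objective(X, W, H, H0=None, lam=0.0):
        # First term: squared reconstruction error of X (N x D) by codes W (N x K)
        # and dictionary H (K x D).
        recon_err = np.sum((X - W @ H) ** 2)
        # Second term: regularizer pulling H towards a prior dictionary H0.
        # lam is set to 0 when no prior H0 is available.
        prior_term = 0.0 if (H0 is None or lam == 0.0) else lam * np.sum((H - H0) ** 2)
        return recon_err + prior_term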

Slide5

Formulation

Optimization Strategy

Basic idea: Block Coordinate Descent (BCD).

Each iteration of the algorithm cycles over the variables W and h_1 ... h_K, optimizing a single variable at a time while holding the others fixed. W can be optimized in parallel. In the dictionary learning phase, we cycle sequentially over h_1 ... h_K, keeping W fixed and solving the resulting subproblem.
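A skeleton of this BCD scheme is sketched below. It is illustrative only: nomp, nlasso, and hals_update stand for the per-signal sparse coders and the HALS dictionary update sketched alongside the later slides, and with those definitions in scope the skeleton runs as written.

    import numpy as np

    def bcd(X, H, n_outer=10, use_omp=True, gamma=10):
        # X: N x D data, H: K x D initial dictionary.
        N = X.shape[0]
        K = H.shape[0]
        W = np.zeros((N, K))
        for _ in range(n_outer):
            # Sparse coding phase: the objective is separable over signals,
            # so each row of W can be computed independently (and in parallel).
            for i in range(N):
                W[i] = nomp(X[i], H, gamma) if use_omp else nlasso(X[i], H, lam=0.1)
            # Dictionary learning phase: cycle sequentially over the atoms
            # h_1 ... h_K with W fixed.
            W, H = hals_update(X, W, H)
        return W, H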

Slide6

Non-negative Sparse Coding

OMP (orthogonal matching pursuit): greedily adds atoms that help in reducing the current reconstruction residual.

Lasso: relies on numerical optimization procedures.

Slide7

OMP (Orthogonal matching pursuit)

Let s = Hx and S = HH^T. Here s is the vector of cosine similarities between the signal and the dictionary atoms, and S is the K×K matrix of inter-atom similarities.

Slide8

OMP
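The NOMP algorithm on this slide appears as a figure in the original deck. Below is a hedged sketch of a non-negative OMP-style coder built from the quantities s = Hx and S = HH^T defined on the previous slide; it illustrates the idea and is not the authors' exact NOMP (the non-negative least-squares refit uses scipy.optimize.nnls).

    import numpy as np
    from scipy.optimize import nnls

    def nomp(x, H, gamma):
        # Greedily code one signal x (length D) against dictionary H (K x D, atoms as rows),
        # selecting at most gamma atoms with non-negative coefficients.
        K = H.shape[0]
        s = H @ x                                  # similarities between the signal and the atoms
        S = H @ H.T                                # K x K inter-atom similarities
        w = np.zeros(K)
        support = []
        for _ in range(gamma):
            c = s - S @ w                          # correlations with the current residual: H(x - H^T w)
            c[support] = -np.inf                   # never reselect an atom already in the support
            j = int(np.argmax(c))
            if c[j] <= 0:                          # no remaining atom reduces the residual with w_j >= 0
                break
            support.append(j)
            coef, _ = nnls(H[support].T, x)        # refit all selected coefficients, w >= 0
            w[:] = 0.0
            w[support] = coef
        return w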

Slide9

Non-negative Lasso

The non-negativity constraint is encoded through the indicator function δ(w), which is 0 when w is feasible (w ≥ 0) and ∞ otherwise; the problem is solved with a proximal method. In proximal methods, the idea is to linearize the smooth component R around the current iterate w_t.
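A minimal proximal-gradient sketch for such a non-negative lasso subproblem, assuming the form min_w 0.5||x - H^T w||^2 + λ Σ_k w_k with w ≥ 0 and a fixed 1/L step size (the paper's exact formulation, step-size rule, and warm-starting are not reproduced here):

    import numpy as np

    def nlasso(x, H, lam, n_iter=200):
        # The smooth component R(w) = 0.5 * ||x - H^T w||^2 is linearized around the
        # current iterate w_t; the prox of the l1 penalty plus the non-negativity
        # constraint is a shift followed by clipping at zero.
        K = H.shape[0]
        S = H @ H.T
        s = H @ x
        L = np.linalg.eigvalsh(S)[-1] + 1e-12            # Lipschitz constant of grad R
        w = np.zeros(K)
        for _ in range(n_iter):
            grad = S @ w - s                             # grad R(w_t) = H(H^T w_t - x)
            w = np.maximum(0.0, w - (grad + lam) / L)    # proximal step
        return w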

Slide10

Sparse Dictionary Learning

We now discuss the dictionary learning phase, where we update H.

We use the Hierarchical Alternating Least Squares (HALS) algorithm, which solves rank-one minimization problems to sequentially update both the rows of H and the columns of W.

Slide11

Sparse Dictionary Learning
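The HALS update equations on this slide are shown as figures in the original deck. Below is a hedged sketch of one HALS sweep of rank-one updates; it omits the sparse projection of each atom (next slide) and the prior H0 term, so it is illustrative rather than the paper's exact update.

    import numpy as np

    def hals_update(X, W, H):
        # X: N x D data, W: N x K codes, H: K x D dictionary (atoms as rows).
        # Each atom h_k and the corresponding code column w_k are updated in turn
        # while all other variables are held fixed.
        K = H.shape[0]
        for k in range(K):
            # Residual with atom k removed: R_k = X - sum_{j != k} w_j h_j^T
            Rk = X - W @ H + np.outer(W[:, k], H[k])
            hk = np.maximum(0.0, Rk.T @ W[:, k])             # rank-one update of the k-th row of H
            H[k] = hk / (np.linalg.norm(hk) + 1e-12)         # unit-norm atom (uniqueness constraint)
            W[:, k] = np.maximum(0.0, Rk @ H[k])             # rank-one update of the k-th column of W
        return W, H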

Slide12

Sparse Dictionary Learning

Projection problem

First, let h be any ν-sparse vector in R^D and let I(h) be its support. Let i_1 . . . i_D be a permutation of the integer set 1 ≤ i ≤ D such that q_{i_1}, . . . , q_{i_D} is in sorted order, and define I* = {i_1, . . . , i_s}, where s is the largest integer such that s ≤ ν and q_{i_1} > . . . > q_{i_s} > 0. Now it is easy to see that the solution to this projection problem is supported on I* (the explicit solution formula appears as an equation in the original slide).
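A sketch of this projection: keep the s ≤ ν largest strictly positive entries of q and zero out the rest. The final unit-norm scaling is an assumption based on the ||h_j||_2^2 = 1 constraint; the slide's explicit formula is in the original figure.

    import numpy as np

    def project_sparse(q, nu):
        # Project q in R^D onto the non-negative, nu-sparse vectors by keeping
        # the largest s <= nu strictly positive entries (the set I* above).
        q = np.asarray(q, dtype=float)
        h = np.zeros_like(q)
        order = np.argsort(q)[::-1]                      # i_1, ..., i_D with q in descending order
        keep = [i for i in order[:nu] if q[i] > 0]       # I*: at most nu strictly positive entries
        h[keep] = q[keep]
        norm = np.linalg.norm(h)
        return h / norm if norm > 0 else h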

Slide13

Implementation Details

The parallelization strategy hinges on two observations:

The objective function is separable in the sparse coding variables, which can therefore be optimized in an embarrassingly parallel fashion.

By enforcing hard sparsity on the dictionary elements, we turn H into an object that can be manipulated very efficiently in memory.
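A sketch of the embarrassingly parallel coding step using Python's multiprocessing facilities (illustrative only; code_one is a placeholder for NOMP or NLASSO, and this is not the authors' single-node or Hadoop implementation).

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor
    from functools import partial

    def code_one(x, H, gamma):
        # Placeholder per-signal coder standing in for NOMP/NLASSO: keep the gamma
        # atoms most positively correlated with x.
        c = np.maximum(0.0, H @ x)
        w = np.zeros(H.shape[0])
        top = np.argsort(c)[::-1][:gamma]
        w[top] = c[top]
        return w

    def parallel_coding(X, H, gamma, workers=4):
        # The objective is separable over signals, so the rows of W are independent.
        # On platforms that spawn processes, call this under an `if __name__ == "__main__":` guard.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            rows = pool.map(partial(code_one, H=H, gamma=gamma), X)
        return np.vstack(list(rows))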

Slide14

Efficient Matrix Computations

The key design question is which of the matrices involved (such as W, X^T W, W^T W, and S) are materialized in memory in our single-node multicore and Hadoop-based cluster implementations.

Slide15

The maximum memory requirement for W can be bounded by O(Nγ) for NOMP.

Alternatively, matrix-vector products against S may be computed implicitly in time O(Kν) as Sv = H(H^T v).
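A sketch of the implicit product: when H is stored sparsely with at most ν non-zeros per row, Sv = H(H^T v) touches only O(Kν) entries, so the K×K matrix S never needs to be formed (scipy.sparse is used here for illustration).

    import numpy as np
    import scipy.sparse as sp

    def implicit_S_matvec(H_sparse, v):
        # Compute S v = H (H^T v) without forming S = H H^T; each of the two
        # sparse matrix-vector products costs O(K * nu).
        return H_sparse @ (H_sparse.T @ v)

    # Illustrative usage with a random nu-sparse dictionary in CSR format.
    K, D, nu = 512, 4096, 16
    rng = np.random.default_rng(0)
    rows = np.repeat(np.arange(K), nu)
    cols = rng.integers(0, D, size=K * nu)
    vals = np.abs(rng.standard_normal(K * nu))
    H_sparse = sp.csr_matrix((vals, (rows, cols)), shape=(K, D))
    v = rng.standard_normal(K)
    Sv = implicit_S_matvec(H_sparse, v)              # equals (H_sparse @ H_sparse.T) @ v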

Slide16

Single-Node In-memory Multicore Implementation

Plan 1:

The first plan does not materialize the matrix W, but explicitly maintains the dense D×K matrix X^T W. As each invocation of NOMP or NLASSO completes, the associated sparse coding vector w_i is used to update the summary statistics matrices W^T W and X^T W, and is then discarded, since everything needed for dictionary learning is contained in these summary statistics. When DK << Nγ, this leaves more room to accommodate larger data sizes in memory. However, not materializing W means that NLASSO cannot exploit warm-starts from the sparse coding solutions found with respect to the previous dictionary.
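A sketch of Plan 1's streaming accumulation of the summary statistics (illustrative; the code vector is dense here for simplicity, whereas the implementation works with sparse codes, and `coder` stands for NOMP or NLASSO).

    import numpy as np

    def plan1_summary_statistics(X, H, coder, gamma):
        # Never materialize W: after coding each signal, fold its code w_i into the
        # running statistics W^T W (K x K) and X^T W (D x K), then discard w_i.
        # These statistics contain everything the dictionary update needs.
        N, D = X.shape
        K = H.shape[0]
        WtW = np.zeros((K, K))
        XtW = np.zeros((D, K))
        for i in range(N):
            w = coder(X[i], H, gamma)      # NOMP or NLASSO on signal x_i
            WtW += np.outer(w, w)
            XtW += np.outer(X[i], w)
            # w goes out of scope here; the full matrix W is never stored.
        return WtW, XtW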

Slide17

Plan 2:

In an alternate plan, we materialize W instead, which consumes less memory if Nγ << DK. We then serve the columns of X^T W, i.e., the vectors X^T v_k in Eqn 14, by performing a sparse matrix-vector product on the fly. However, this requires extracting a column from W, which is stored in a row-major dynamic sparse format. To make column extraction efficient, we utilize the fact that the indices of the non-zero entries for each row are held in sorted order, and the associated value arrays conform to this order. Hence, we can keep a record of where the i-th column was found in each row, and simply advance this pointer when the (i+1)-th column is required. Thus, all columns can be served efficiently with one pass over W.
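A sketch of serving the columns of a row-major sparse W in one pass using per-row pointers (a simplified stand-in for the dynamic sparse format used in the implementation).

    import numpy as np

    def iter_columns(indices, values, n_rows, n_cols):
        # indices[r] holds the sorted column ids of row r's non-zeros and values[r]
        # the matching values. One cursor per row is advanced as the requested
        # column id grows, so all columns are served in a single pass over W.
        ptr = [0] * n_rows
        for k in range(n_cols):                     # columns requested in increasing order
            col = np.zeros(n_rows)
            for r in range(n_rows):
                p = ptr[r]
                if p < len(indices[r]) and indices[r][p] == k:
                    col[r] = values[r][p]
                    ptr[r] = p + 1                  # this entry is never needed again
            yield k, col

Each yielded column v_k of W can then be multiplied as X^T v_k to produce the corresponding column of X^T W on the fly.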

Slide18

Benefits of Plan 2:

The matrix-vector product of X^T against the columns of W can be parallelized, and NLASSO can use warm-starts.

Slide19

Clustering

Slide20

The execution proceeds in two phases: the preprocessing phase and the learning phase. The preprocessing phase is a one-time step whose objective is to re-organize the data for better parallel execution of the BCD algorithm. The learning phase is iterative and is coordinated by a control node that first spawns NOMP or NLASSO MR jobs on the worker nodes, then runs sequential dictionary learning, and finally monitors the global mean reconstruction error to decide on convergence.
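A high-level sketch of the learning-phase control loop described above. The helpers spawn_coding_jobs, run_dictionary_learning, and mean_reconstruction_error are hypothetical stand-ins for the MapReduce job submission, the sequential dictionary update, and the monitored error.

    def learning_phase(data, H, spawn_coding_jobs, run_dictionary_learning,
                       mean_reconstruction_error, tol=1e-4, max_iter=50):
        # Control-node driver: each iteration launches parallel NOMP/NLASSO jobs on
        # the workers, runs sequential dictionary learning, and stops once the
        # global mean reconstruction error stops improving.
        prev_err = float("inf")
        for _ in range(max_iter):
            coding_results = spawn_coding_jobs(data, H)        # MR jobs on worker nodes
            H = run_dictionary_learning(coding_results, H)     # sequential on the control node
            err = mean_reconstruction_error(data, coding_results, H)
            if prev_err - err < tol:                           # converged
                break
            prev_err = err
        return H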

Slide21

Performance comparison

Slide22

Slide23