Large-Scale Distributed Non-negative Sparse Coding and Sparse Dictionary Learning
Author: Vikas Sindhwani and Amol Ghoting
Presenter: Jinze Li
Problem Introduction
We are given a collection of N data points or signals in a high-dimensional space R^D: x_i ∈ R^D, 1 ≤ i ≤ N. Let h_j ∈ R^D, 1 ≤ j ≤ K denote a dictionary of K "atoms", which we collect as the rows of the matrix H = (h_1 . . . h_K)^T ∈ R^(K×D). Given a suitable dictionary H, the goal of sparse coding is to represent data points approximately as sparse linear combinations of atoms in the dictionary.
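In symbols, sparse coding seeks for each signal a coefficient vector w_i ∈ R^K (the i-th row of the coding matrix W used below) with few non-zero entries such that

x_i ≈ Σ_{j=1..K} W_ij h_j = H^T w_i.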
Formulation
Constraints:
W_i ≥ 0          (non-negative coding)
||W_i||_p ≤ γ    (sparse coding)
h_j ≥ 0          (non-negative dictionary)
||h_j||_q ≤ ν    (sparse dictionary)
||h_j||_2^2 = 1  (uniqueness)
The first term in the objective function measures the squared reconstruction error. The second term is a regularizer that enforces the learnt dictionary to be close to a prior H0 (if no prior H0 is available, λ is set to 0). Maintaining H as a sparse matrix turns its updates across iterations into very cheap sequential operations.
Formulation
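The objective itself appears as a figure in the original slides; a plausible reconstruction, consistent with the two terms and the constraints described above (the exact scaling constants are an assumption), is

min over W, H of  (1/2) ||X − W H||_F^2 + (λ/2) ||H − H_0||_F^2

subject to  W_i ≥ 0,  ||W_i||_p ≤ γ,  h_j ≥ 0,  ||h_j||_q ≤ ν,  ||h_j||_2^2 = 1,

where X ∈ R^(N×D) stacks the signals x_i as rows and W ∈ R^(N×K) stacks the codes W_i as rows.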
Optimization Strategy
Basic idea: Block Coordinate Descent (BCD).
Each iteration of the algorithm cycles over the variables W and h_1, ..., h_K, optimizing a single variable at a time while holding the others fixed. The sparse coding variables W can be optimized in parallel. In the dictionary learning phase, we keep W fixed, cycle sequentially over h_1, ..., h_K, and solve the resulting subproblem for each atom.
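A minimal Python sketch of this outer loop is given below. The sparse_code and update_atom callables are hypothetical placeholders for the NOMP/NLASSO and HALS routines discussed on the following slides, and the joblib-based parallelism is an assumption rather than the paper's implementation.

import numpy as np
from joblib import Parallel, delayed  # assumption: a worker pool for the coding phase

def bcd(X, H, sparse_code, update_atom, n_iters=10, n_jobs=4):
    """Block Coordinate Descent sketch: alternate a parallel sparse coding
    phase (each code w_i depends only on x_i and the fixed dictionary H)
    with a sequential sweep over the dictionary atoms h_1 ... h_K.
    X: (N, D) signals as rows; H: (K, D) atoms as rows."""
    N, K = X.shape[0], H.shape[0]
    W = np.zeros((N, K))
    for _ in range(n_iters):
        # Sparse coding phase: embarrassingly parallel over the N signals.
        codes = Parallel(n_jobs=n_jobs)(
            delayed(sparse_code)(X[i], H) for i in range(N))
        W = np.vstack(codes)
        # Dictionary learning phase: with W fixed, cycle sequentially over atoms.
        for j in range(K):
            H[j] = update_atom(X, W, H, j)
    return W, H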
Non-negative Sparse Coding
OMP (Orthogonal Matching Pursuit): a greedy method that, at each step, selects the atom that helps most in reducing the current reconstruction residual.
Lasso: relaxes the hard sparsity constraint to an l1 penalty and relies on numerical optimization procedures.
OMP (Orthogonal Matching Pursuit)
Precompute s = Hx and S = HH^T:
s is the vector of cosine similarities between the signal and the dictionary atoms;
S is the K×K matrix of inter-atom similarities.
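The non-negative OMP listing itself is shown as a figure on the next slide; the code below is only a minimal sketch of one plausible variant, using the precomputed s and S and SciPy's non-negative least squares for the restricted fit (the stopping rule and atom-selection details are assumptions).

import numpy as np
from scipy.optimize import nnls

def nomp(x, H, gamma, tol=1e-6):
    """Greedy non-negative sparse coding of one signal x against dictionary H.
    x: (D,) signal; H: (K, D) atoms as rows (assumed unit-norm);
    gamma: hard sparsity level (maximum number of selected atoms)."""
    s = H @ x            # similarities between the signal and the atoms
    S = H @ H.T          # K x K inter-atom similarity matrix
    K = H.shape[0]
    w = np.zeros(K)
    support = []
    alpha = s.copy()     # correlations of the atoms with the current residual
    for _ in range(int(gamma)):
        j = int(np.argmax(alpha))
        if alpha[j] <= tol or j in support:
            break        # no atom correlates positively with the residual
        support.append(j)
        # Non-negative least squares restricted to the selected atoms.
        w_sub, _ = nnls(H[support].T, x)
        w[:] = 0.0
        w[support] = w_sub
        # Update residual correlations via the precomputed similarities:
        # alpha = H (x - H_A^T w_A) = s - S[:, A] w_A.
        alpha = s - S[:, support] @ w_sub
    return w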
Non-negative Lasso
The non-negativity constraint is expressed through an indicator function δ(w), defined as δ(w) = 0 if w ≥ 0 and +∞ otherwise, and the resulting problem is solved with a proximal method. In proximal methods, the idea is to linearize the smooth component R around the current iterate w_t and then take a proximal step.
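A minimal proximal-gradient sketch of this idea is shown below (an ISTA-style iteration, assuming the smooth component R is the squared reconstruction error; the step size, iteration count, and lack of acceleration are assumptions).

import numpy as np

def nlasso(x, H, lam, n_iters=200):
    """Non-negative lasso sketch for one signal x: minimize
    R(w) + lam*||w||_1 + delta(w), with R(w) = 1/2 ||x - H^T w||^2 and
    delta the indicator of the non-negative orthant."""
    S = H @ H.T                        # Gram matrix of the atoms
    s = H @ x
    step = 1.0 / np.linalg.norm(S, 2)  # 1 / Lipschitz constant of grad R
    w = np.zeros(H.shape[0])
    for _ in range(n_iters):
        grad = S @ w - s               # gradient of R at the current iterate w_t
        v = w - step * grad            # gradient step on the linearized R
        # Proximal step for lam*||w||_1 + delta(w >= 0):
        # non-negative soft-thresholding.
        w = np.maximum(0.0, v - step * lam)
    return w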
Sparse Dictionary Learning
We now discuss the dictionary learning phase, where we update H.
We use the Hierarchical Alternating Least Squares (HALS) algorithm, which solves rank-one minimization problems to sequentially update the rows of H as well as the columns of W.
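The HALS update equations are shown as figures in the slides; the code below is a sketch of one plausible rank-one update for a single atom h_j, combining the unconstrained least-squares minimizer with the constraint projection described on the next slide (the exact ordering and normalization used in the paper may differ).

import numpy as np

def hals_update_atom(X, W, H, j, nu):
    """Rank-one HALS-style update of atom h_j with W held fixed.
    X: (N, D) data; W: (N, K) codes; H: (K, D) dictionary; nu: sparsity level."""
    w_j = W[:, j]
    denom = w_j @ w_j
    if denom == 0.0:
        return H[j]
    # Residual with atom j removed is R_j = X - W H + w_j h_j^T, so the
    # unconstrained rank-one minimizer is q = R_j^T w_j / ||w_j||^2.
    q = (X.T @ w_j - H.T @ (W.T @ w_j) + denom * H[j]) / denom
    # Project onto the constraint set: non-negativity, keep at most nu entries,
    # then rescale to unit norm (see the projection step on the next slide).
    q = np.maximum(q, 0.0)
    if np.count_nonzero(q) > nu:
        keep = np.argsort(q)[-nu:]
        mask = np.zeros(q.shape, dtype=bool)
        mask[keep] = True
        q[~mask] = 0.0
    norm = np.linalg.norm(q)
    return q / norm if norm > 0 else H[j]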
Sparse Dictionary Learning
Projection problem
First, let h be any ν-sparse vector in R^D and let I(h) denote its support. Let i_1, ..., i_D be a permutation of the index set 1 ≤ i ≤ D such that q_{i_1}, ..., q_{i_D} is in sorted (descending) order, and define I* = {i_1, ..., i_s}, where s is the largest integer such that s ≤ ν and q_{i_1} > ... > q_{i_s} > 0. It is then easy to see that the solution of the projection problem is supported on I*: it keeps the entries of q indexed by I* and sets the remaining entries to zero.
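A sketch of this projection step in Python (the final rescaling to unit norm is an assumption based on the ||h_j||_2^2 = 1 uniqueness constraint):

import numpy as np

def project_sparse_nonneg(q, nu):
    """Keep the largest positive entries of q (at most nu of them), zero out
    the rest (support I* = {i_1, ..., i_s}), and rescale to unit l2 norm."""
    h = np.maximum(q, 0.0)                       # non-negativity
    order = np.argsort(h)[::-1]                  # i_1, ..., i_D in descending order
    s = min(int(nu), int(np.count_nonzero(h)))   # largest s <= nu with q_{i_s} > 0
    h[order[s:]] = 0.0                           # restrict the support to I*
    norm = np.linalg.norm(h)
    return h / norm if norm > 0 else h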
Implementation Details
The parallelization strategy hinges on two observations:
The objective function is separable in the sparse coding variables, which can therefore be optimized in an embarrassingly parallel fashion.
By enforcing hard sparsity on the dictionary elements, we turn H into an object that can be manipulated very efficiently in memory.
Efficient Matrix Computations
The key design question is which of the matrices arising in the computation are materialized in memory in our single-node multicore and Hadoop-based cluster implementations.
The maximum memory requirement for W can be bounded by O(Nγ) for NOMP.
Alternatively, matrix-vector products against S may be computed implicitly in O(Kν) time as Sv = H(H^T v).
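A small sketch of the implicit product with a sparse dictionary (the dimensions and density below are illustrative only):

import numpy as np
from scipy import sparse

def S_matvec(H, v):
    """Compute S v = H (H^T v) without ever forming S = H H^T.
    With at most nu non-zeros per row of H, both products cost O(K * nu)."""
    return H @ (H.T @ v)

# Illustrative usage with a sparse dictionary of K = 1000 atoms in D = 5000 dimensions.
H = sparse.random(1000, 5000, density=0.002, format="csr")
v = np.random.rand(1000)
print(S_matvec(H, v).shape)  # (1000,)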
Single-Node In-memory Multicore Implementation
Plan 1:
The first plan does not materialize the matrix W, but explicitly maintains the dense D×K matrix X^T W. As each invocation of NOMP or NLASSO completes, the associated sparse coding vector w_i is used to update the summary statistics matrices, W^T W and X^T W, and is then discarded, since everything needed for dictionary learning is contained in these summary statistics. When DK << Nγ, this leaves more room to accommodate larger data sizes in memory. However, not materializing W means that NLASSO cannot exploit warm-starts from the sparse coding solutions found with respect to the previous dictionary.
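A sketch of the Plan 1 accumulation loop (sparse_code is a hypothetical placeholder for a NOMP or NLASSO call; the paper's multicore scheduling is not shown):

import numpy as np

def plan1_statistics(X, H, sparse_code):
    """Stream over the signals, compute each code w_i, fold it into the
    summary statistics W^T W (K x K) and X^T W (D x K), and discard it;
    dictionary learning only needs these two matrices."""
    N, D = X.shape
    K = H.shape[0]
    WtW = np.zeros((K, K))
    XtW = np.zeros((D, K))
    for i in range(N):
        w = sparse_code(X[i], H)     # sparse coding vector w_i
        WtW += np.outer(w, w)        # accumulate W^T W
        XtW += np.outer(X[i], w)     # accumulate X^T W
        # w_i is discarded here rather than stored in a materialized W.
    return WtW, XtW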
Plan 2:
In an alternate plan, we materialize W instead, which consumes less memory if Nγ << DK. We then serve the columns of X^T W, i.e., the vectors X^T v_k in Eqn 14, by performing a sparse matrix-vector product on the fly. However, this requires extracting a column from W, which is stored in a row-major dynamic sparse format. To make column extraction efficient, we utilize the fact that the indices of non-zero entries for each row are held in sorted order, and the associated value arrays conform to this order. Hence, we can keep a record of where the i-th column was found for each row, and simply advance this pointer when the (i+1)-th column is required. Thus, all columns can be served efficiently with one pass over W.
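A sketch of this column-serving pass over a row-major (CSR-like) W, assuming column indices are kept sorted within each row; the dynamic sparse format used in the paper is not reproduced:

import numpy as np
from scipy import sparse

def serve_columns(W_csr):
    """Yield the columns of W in order (column index, row indices, values),
    advancing one pointer per row so that all columns are served with a
    single pass over the stored non-zeros."""
    N, K = W_csr.shape
    indptr, indices, data = W_csr.indptr, W_csr.indices, W_csr.data
    ptr = indptr[:-1].copy()             # per-row pointer into indices/data
    for k in range(K):
        rows, vals = [], []
        for i in range(N):
            p = ptr[i]
            if p < indptr[i + 1] and indices[p] == k:
                rows.append(i)
                vals.append(data[p])
                ptr[i] = p + 1           # advance this row's pointer
        yield k, np.array(rows, dtype=int), np.array(vals)

# Usage: serve the columns of X^T W on the fly, one sparse product at a time.
# for k, rows, vals in serve_columns(W_csr):
#     xtw_k = X[rows].T @ vals         # the k-th column of X^T W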
Benefits of Plan 2:
The matrix-vector products of X^T against the columns of W can be parallelized, and NLASSO can use warm-starts.
Clustering
The execution proceeds in two phases: the preprocessing phase and the learning phase. The preprocessing phase is a one-time step whose objective is to re-organize the data for better parallel execution of the BCD algorithm. The learning phase is iterative and is coordinated by a control node that first spawns NOMP or NLASSO MR jobs on the worker nodes, then runs sequential dictionary learning, and finally monitors the global mean reconstruction error to decide on convergence.
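A schematic of the control-node loop, with the three callables standing in for the cluster primitives (job submission, sequential dictionary learning, and error monitoring); this is only an illustration of the control flow described above:

def learning_phase(spawn_coding_jobs, learn_dictionary, mean_error,
                   tol=1e-4, max_iters=50):
    """Iterate: spawn NOMP/NLASSO MapReduce jobs on the workers, run the
    sequential dictionary learning step on the control node, and stop when
    the global mean reconstruction error stops improving."""
    H = None
    prev = float("inf")
    for _ in range(max_iters):
        stats = spawn_coding_jobs()    # parallel sparse coding on worker nodes
        H = learn_dictionary(stats)    # sequential dictionary update
        err = mean_error(H)            # global mean reconstruction error
        if prev - err < tol:
            break
        prev = err
    return H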
Performance comparison