Slide 1
Schedule for near future….
Previous: SGD

Slide 2
Midterm
Will cover all the lectures scheduled through today.
There are some sample questions up already from previous years – the syllabus is not very different for the first half of the course.
Problems are mostly going to be harder than the quiz questions.
Questions often include material from a homework, so make sure you understand a HW even if you decided to drop it.
Closed book and closed internet.
You can bring in one sheet of 8.5x11 or A4 paper, front and back.

Slide 3
Wrap-up on iterative parameter mixing
Slide 4
Recap: NAACL 2010 (screenshot)

Slide 5
Recap: Iterative Parameter Mixing
Parallelizing perceptrons – take 2
[Diagram: the instances/labels are split into example subsets 1–3; each worker computes local vk’s starting from the previous w; the local results w-1, w-2, w-3 are combined by some sort of weighted averaging into the new w.]

Slide 6
(repeats the previous diagram)

Slide 7
Recap: Iterative Parameter Mixing
Parallel Perceptrons – take 2
Idea: do the simplest possible thing iteratively.
Split the data into shards.
Let w = 0.
For n = 1, …: train a perceptron on each shard with one pass, starting with w; then average the weight vectors (somehow) and let w be that average (sketched below).
Extra communication cost: redistributing the weight vectors – done less frequently than if fully synchronized, more frequently than if fully parallelized.
All-Reduce
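Not from the slides – a minimal Python sketch of this loop, with uniform averaging and the per-shard passes run serially for clarity:

    import numpy as np

    def perceptron_one_pass(shard, w):
        # one classic perceptron pass over a shard of (x, y) pairs, y in {-1, +1}
        for x, y in shard:
            if y * np.dot(w, x) <= 0:
                w = w + y * x
        return w

    def iterative_parameter_mixing(shards, n_features, n_epochs):
        w = np.zeros(n_features)
        for _ in range(n_epochs):
            # every shard starts from the same shared w (run in parallel in the real scheme)
            local = [perceptron_one_pass(s, w.copy()) for s in shards]
            # combine by (here: uniform) weighted averaging
            w = np.mean(local, axis=0)
        return w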
Slide 8
All-reduce (screenshot)

Slide 9
Introduction
Common pattern:
do some learning in parallel,
aggregate local changes from each processor to shared parameters,
distribute the new shared parameters back to each processor,
and repeat….
AllReduce is implemented in MPI, and also in VW code (John Langford) in a Hadoop-compatible scheme.
[Diagram labels: MAP, ALLREDUCE]
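A sketch of this pattern using mpi4py (an assumption – the slides name MPI and VW but show no code; local_pass is a hypothetical stand-in for the per-processor learning step):

    import numpy as np
    from mpi4py import MPI

    def local_pass(w):
        # hypothetical stand-in: one pass of learning on this rank's local shard
        return w + 0.01 * np.random.randn(w.size)

    comm = MPI.COMM_WORLD
    w = np.zeros(100)                              # shared parameters, replicated on every rank
    for epoch in range(10):
        local = local_pass(w)                      # MAP: learn in parallel
        total = np.zeros_like(local)
        comm.Allreduce(local, total, op=MPI.SUM)   # ALLREDUCE: aggregate the local changes
        w = total / comm.Get_size()                # every rank now holds the averaged parameters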
Slides 10–15: (screenshots)

Slide 16
Gory details of VW Hadoop-AllReduce
Spanning-tree server: a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server.
Worker nodes (“fake” mappers): input for each worker is locally cached; workers all connect to the spanning-tree server.
Workers all execute the same code, which might contain AllReduce calls: workers synchronize whenever they reach an all-reduce.
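Not VW's actual code – a toy sketch of the spanning-tree idea: partial sums flow up the tree to the root, and the global total flows back down, so every node ends up holding the aggregate:

    def allreduce_sum(root):
        # each node is {"value": float, "children": [subtrees]} – a toy stand-in for compute nodes
        def up(n):                # reduce phase: sum values toward the root
            n["partial"] = n["value"] + sum(up(c) for c in n["children"])
            return n["partial"]
        def down(n, total):       # broadcast phase: push the total back out
            n["value"] = total
            for c in n["children"]:
                down(c, total)
        down(root, up(root))

    tree = {"value": 1.0, "children": [{"value": 2.0, "children": []},
                                       {"value": 3.0, "children": []}]}
    allreduce_sum(tree)           # every node's value is now 6.0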
Slide 17
Hadoop AllReduce
don’t wait for duplicate jobs (screenshot)

Slide 18
Second-order method – like Newton’s method (screenshot)

Slide 19
2^24 features, ~100 non-zeros/example, 2.3B examples.
An example is a user/page/ad triple plus conjunctions of these, positive if there was a click-thru on the ad.

Slide 20
50M examples; explicitly constructed kernel: 11.7M features, 3,300 nonzeros/example.
Old method: SVM, 3 days (reporting time to get to a fixed test error).

Slide 21
(screenshot)

Slide 22
Matrix Factorization

Slide 23
Recovering latent factors in a matrix
[An n x m matrix V – n rows, m columns – with entries v11 … vij … vnm.]

Slide 24
Recovering latent factors in a matrix
[V is approximated by the product of an n x K matrix W – columns x = (x1 … xn) and y = (y1 … yn), so K = 2 here – and a K x m matrix H with rows (a1 … am) and (b1 … bm). One concrete way to compute such factors is sketched below.]
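One concrete way to recover such factors – truncated SVD with numpy; the sizes here are arbitrary, and the lecture's SGD approach comes later:

    import numpy as np

    n, m, K = 100, 50, 2
    V = np.random.rand(n, m)                   # the n x m matrix to factor
    U, s, Vt = np.linalg.svd(V, full_matrices=False)
    W = U[:, :K] * s[:K]                       # n x K: the (x, y) columns above
    H = Vt[:K, :]                              # K x m: the (a, b) rows above
    V_approx = W @ H                           # best rank-K approximation in Frobenius norm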
Slide 25
What is this for?
[Same n x K times K x m factorization diagram.]

Slide 26
MF for collaborative filtering

Slides 27–30: What is collaborative filtering? (screenshots)

Slide 31: (screenshot)

Slide 32
Recovering latent factors in a matrix
[An n-users x m-movies matrix V with entries v11 … vij … vnm; V[i,j] = user i’s rating of movie j.]

Slide 33
Recovering latent factors in a matrix
[The n-users x m-movies matrix V (V[i,j] = user i’s rating of movie j) is approximated by an n x K matrix of user factors times a K x m matrix of movie factors.]

Slide 34: (screenshot)

Slide 35
MF for image modeling

Slide 36
Data: many copies of an image, rotated and shifted (a matrix with one image per row).
Image “prototypes”: a smaller number of row vectors (green = negative).
Reconstructed images: linear combinations of prototypes.

Slide 37
MF for images
[A 1000 x 10,000 matrix V – 1000 images, 10,000 pixels, V[i,j] = pixel j in image i – factored with 2 prototypes, PC1 and PC2.]

Slide 38
MF for modeling text

Slide 39
The Neatest Little Guide to Stock Market Investing
Investing For Dummies, 4th Edition
The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns
The Little Book of Value Investing
Value Investing: From Graham to Buffett and Beyond
Rich Dad’s Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!
Investing in Real Estate, 5th Edition
Stock Investing For Dummies
Rich Dad’s Advisors: The ABC’s of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss
https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/
Slide 40
https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/
TFIDF counts would be better
Slide 41
Recovering latent factors in a matrix
[An n-documents x m-terms doc-term matrix V (V[i,j] = TFIDF score of term j in doc i), approximated by an n x K matrix times a K x m matrix.]

Slide 42
(screenshot)

Slide 43
Investing for real estate
Rich Dad’s Advisors: The ABCs of Real Estate Investing …

Slide 44
The little book of common sense investing: …
Neatest Little Guide to Stock Market Investing
Slide 45
MF is like clustering
Slide 46
k-means as MF
[The original data set X (n examples, entries v11 … vij … vnm) is approximated by Z M, where Z is an n x r matrix of 0/1 indicators for the r clusters and M holds the cluster means. A small example follows.]
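A small numpy illustration of this correspondence (the data, assignments, and cluster count are made up):

    import numpy as np

    # n = 6 examples in 2-D, r = 2 clusters, with the assignments given
    X = np.array([[0., 0.], [1., 0.], [0., 1.],
                  [10., 10.], [11., 10.], [10., 11.]])
    assign = np.array([0, 0, 0, 1, 1, 1])
    Z = np.eye(2)[assign]            # n x r matrix of 0/1 cluster indicators
    M = np.linalg.pinv(Z) @ X        # r x d: the least-squares solution = the cluster means
    X_approx = Z @ M                 # each row of X replaced by its cluster mean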
Slide 47
How do you do it?
[Same n x K times K x m factorization diagram.]

Slide 48
talk pilfered from ….. KDD 2011

Slide 49: (screenshot)

Slide 50
Recovering latent factors in a matrix
[The n-users x m-movies matrix V (V[i,j] = user i’s rating of movie j) is approximated by W H, where W is n x r and H is r x m.]
Slides 51–53: (screenshots)

Slide 54
for image denoising (screenshot)

Slide 55
Matrix factorization as SGD
step size
why does this work?
Slide 56
Matrix factorization as SGD – why does this work? Here’s the key claim (restated below):
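The claim itself appears as a screenshot; restated approximately, with Z the set of observed cells: the loss decomposes over those cells,

    L(W, H) = \sum_{(i,j) \in Z} L_{ij}(W_{i*}, H_{*j}),
    \qquad \text{e.g.}\quad L_{ij} = \bigl(V_{ij} - W_{i*} H_{*j}\bigr)^2,

so an SGD step on one observed cell (i, j) reads and updates only row W_{i*} and column H_{*j}; steps on cells that share no row and no column are independent.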
Slide 57
Checking the claim
Think of SGD for logistic regression: the LR loss compares y and ŷ = dot(w, x).
MF is similar, but now we update both w (the user weights) and x (the movie weights), as in the sketch below.
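A sketch of the resulting update for squared loss with L2 regularization (the learning-rate and regularizer values are assumptions, not from the slides):

    import numpy as np

    def sgd_step(W, H, i, j, v, lr=0.01, reg=0.05):
        # one SGD step on a single observed rating V[i,j] = v
        err = v - W[i] @ H[:, j]     # residual, playing the role of y - yhat in LR
        grad_W = -err * H[:, j] + reg * W[i]
        grad_H = -err * W[i] + reg * H[:, j]
        W[i]    -= lr * grad_W       # update the user factors...
        H[:, j] -= lr * grad_H       # ...and the movie factors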
Slide 58
What loss functions are possible?
N1, N2 – diagonal matrices, sort of like IDF factors for the users/movies.
“Generalized” KL-divergence (written out below).
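In its usual form (a restatement – the slide shows it as a screenshot):

    D(V \,\|\, WH) = \sum_{i,j} \Bigl( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \Bigr)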
Slide 59
What loss functions are possible? (screenshot)

Slide 60
What loss functions are possible? (screenshot)

Slide 61
ALS = alternating least squares
Slide 62
talk pilfered from ….. KDD 2011

Slides 63–65: (screenshots)

Slide 66
Like McDonald et al., with perceptron learning.

Slide 67
Slow convergence…..

Slides 68–73: (screenshots)

Slide 74
More detail….
Randomly permute the rows/cols of the matrix.
Chop V, W, H into blocks of size d x d (m/d blocks in W, n/d blocks in H).
Group the data: pick a set of blocks with no overlapping rows or columns (a stratum); repeat until all blocks in V are covered.
Train the SGD: process strata in series; process the blocks within a stratum in parallel (sketched below).
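A sketch of that schedule for a d x d grid of blocks (block_sgd is a hypothetical stand-in for SGD over one block's observed cells):

    def block_sgd(V_block, W_block, H_block):
        # hypothetical: run SGD over the observed cells of this block, touching
        # only this block's rows of W and columns of H
        pass

    def dsgd_epoch(V_blocks, W_blocks, H_blocks, d):
        for k in range(d):                            # d diagonal-style strata cover all of V
            stratum = [(i, (i + k) % d) for i in range(d)]
            # blocks in a stratum share no rows or columns, so they could run in
            # parallel; strata themselves are processed in series
            for i, j in stratum:
                block_sgd(V_blocks[i][j], W_blocks[i], H_blocks[j])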
Slide 75
More detail…. (screenshot; the matrix Z here was called V above)

Slide 76
More detail….
Initialize W, H randomly – not at zero.
Choose a random ordering (random sort) of the points in a stratum in each “sub-epoch”.
Pick the strata sequence by permuting the rows and columns of M, and using M’[k,i] as the column index of row i in sub-epoch k.
Use “bold driver” to set the step size: increase the step size when the loss decreases (in an epoch), and decrease it when the loss increases (sketched below).
Implemented in Hadoop and R/Snowfall.
[Figure: the matrix M.]
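A sketch of the bold-driver heuristic (the 5%/50% constants are typical choices, not from the slides):

    def bold_driver(step, prev_loss, loss, grow=1.05, shrink=0.5):
        # grow the step size while the epoch loss keeps falling;
        # cut it sharply as soon as the loss goes up
        return step * grow if loss < prev_loss else step * shrink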
Slide 77
(screenshot)

Slide 78
Wall Clock Time: 8 nodes, 64 cores, R/snow (plot)
Slides 79–82: (screenshots)

Slide 83
Number of Epochs (plot)
Slides 84–87: (screenshots)

Slide 88
Varying rank: 100 epochs for all (plot)

Slide 89
Hadoop scalability
Hadoop process setup time starts to dominate.

Slide 90
Hadoop scalability (plot)

Slide 91: (screenshot)