

Presentation Transcript

Slide 1

Schedule for near future….

Previous: SGD

Slide 2

Midterm

- Will cover all the lectures scheduled through today.
- There are some sample questions up already from previous years; the syllabus is not very different for the first half of the course.
- Problems are mostly going to be harder than the quiz questions.
- Questions often include material from a homework, so make sure you understand a HW even if you decided to drop it.
- Closed book and closed internet.
- You can bring in one sheet of 8.5x11 or A4 paper, front and back.

Slide 3

Wrap-up on iterative parameter mixing

Slide 4

NAACL 2010

Recap

Slide 5

Parallelizing perceptrons – take 2

[Figure: the instances/labels are split into example subsets 1, 2, 3, …; each subset computes a local v_k with one perceptron pass starting from the previous w; the local weight vectors w_1, w_2, w_3, … are combined by some sort of weighted averaging into the new w.]

Recap: Iterative Parameter Mixing

Slide 6

[Same figure and caption as the previous slide.]

Slide 7

Parallel Perceptrons – take 2

Idea: do the simplest possible thing iteratively.
- Split the data into shards.
- Let w = 0.
- For n = 1, …:
  - Train a perceptron on each shard with one pass, starting from w.
  - Average the weight vectors (somehow) and let w be that average.

Extra communication cost: redistributing the weight vectors (All-Reduce); this happens less frequently than if fully synchronized, more frequently than if fully parallelized.

Recap: Iterative Parameter Mixing
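To make the loop above concrete, here is a minimal single-machine sketch of iterative parameter mixing in Python/NumPy. It simulates the shards as a list of arrays and uses plain uniform averaging; the data, dimensions, and helper names are invented for illustration, and a real system would run the per-shard passes on separate workers.

```python
import numpy as np

def perceptron_pass(w, X, y):
    """One pass of the perceptron over a shard, starting from w."""
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:      # mistake
            w = w + y_i * x_i              # standard perceptron update
    return w

def iterative_parameter_mixing(shards, dim, epochs=10):
    """shards: list of (X, y) pairs, one per simulated worker."""
    w = np.zeros(dim)
    for _ in range(epochs):
        # "map": one perceptron pass per shard, each starting from the shared w
        local_ws = [perceptron_pass(w.copy(), X, y) for X, y in shards]
        # "reduce": combine by (uniform) averaging
        w = np.mean(local_ws, axis=0)
    return w

# toy usage: two shards of linearly separable data with labels in {-1, +1}
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]))
shards = [(X[:100], y[:100]), (X[100:], y[100:])]
w = iterative_parameter_mixing(shards, dim=5)
```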

Slide 8

All-reduce

Slide 9

Introduction

Common pattern:
- do some learning in parallel
- aggregate local changes from each processor to shared parameters
- distribute the new shared parameters back to each processor
- and repeat….

AllReduce is implemented in MPI, and also in the VW code (John Langford) in a Hadoop-compatible scheme.
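As a sketch of how this pattern maps onto an AllReduce call, here is what one step might look like with mpi4py (an assumption for illustration; the slides refer to MPI and VW's Hadoop AllReduce, not this particular library). local_gradient is a hypothetical stand-in for whatever learning each worker does; a single Allreduce both aggregates the local changes and redistributes the result.

```python
# run with e.g.: mpirun -np 4 python allreduce_sketch.py
# minimal sketch assuming mpi4py is available; local_gradient() is hypothetical
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
dim = 10
w = np.zeros(dim)

def local_gradient(w):
    # placeholder for "do some learning in parallel" on this worker's shard
    return np.random.randn(dim)

for epoch in range(5):
    g_local = local_gradient(w)
    g_sum = np.empty_like(g_local)
    # aggregate local changes from every processor AND hand the summed
    # result back to every processor in one collective call
    comm.Allreduce(g_local, g_sum, op=MPI.SUM)
    w -= 0.01 * (g_sum / comm.Get_size())
```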

[Figure: MAP and ALLREDUCE phases]

Slides 10-15: [figures only]

Slide 16

Gory details of VW Hadoop-AllReduce

- Spanning-tree server: a separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server.
- Worker nodes ("fake" mappers):
  - input for each worker is locally cached
  - workers all connect to the spanning-tree server
  - workers all execute the same code, which might contain AllReduce calls: workers synchronize whenever they reach an all-reduce

Slide 17

Hadoop AllReduce

don't wait for duplicate jobs

Slide 18

Second-order method - like Newton’s method

Slide 19

2^24 features, ~100 non-zeros/example
2.3B examples
An example is a user/page/ad and conjunctions of these, positive if there was a click-thru on the ad.

Slide 20

50M examples
explicitly constructed kernel → 11.7M features, 3,300 nonzeros/example
old method: SVM, 3 days (reporting the time to get to a fixed test error)

Slide 21

Slide 22

Matrix Factorization

Slide 23

Recovering latent factors in a matrix

[Figure: an n (rows) × m (columns) matrix with entries v11 … vij … vnm]

Slide 24

Recovering latent factors in a matrix

[Figure: the n × m matrix of entries v11 … vij … vnm is approximated (≈) by an n × K matrix with rows (x1, y1), …, (xn, yn) times a K × m matrix with rows (a1, …, am) and (b1, …, bm); here K = 2.]
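A minimal NumPy illustration of the shapes involved (sizes are arbitrary): W is n × K, H is K × m, and each entry of the reconstruction is the inner product of a row of W with a column of H.

```python
import numpy as np

n, m, K = 6, 4, 2                 # arbitrary sizes for illustration
rng = np.random.RandomState(0)
W = rng.randn(n, K)               # one K-dimensional row per "row object"
H = rng.randn(K, m)               # one K-dimensional column per "column object"

V_approx = W @ H                  # n x m reconstruction
# entry (i, j) is the inner product of row i of W with column j of H
i, j = 3, 1
assert np.isclose(V_approx[i, j], W[i] @ H[:, j])
```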

Slide 25

What is this for?

[Figure: the same V ≈ W H decomposition as on the previous slide.]

Slide 26

MF for collaborative filtering

Slide 27

What is collaborative filtering?

Slides 28-31: [more examples of collaborative filtering; figures only]

Slide 32

Recovering latent factors in a matrix

[Figure: V is an n (users) × m (movies) matrix with entries v11 … vij … vnm]
V[i,j] = user i's rating of movie j

Slide 33

Recovering latent factors in a matrix

[Figure: V, the n (users) × m (movies) matrix with entries vij, is approximated by an n × K user-factor matrix times a K × m movie-factor matrix.]
V[i,j] = user i's rating of movie j

Slide 34: [figure]

Slide 35

MF for image modeling

Slide 36

Data: many copies of an image, rotated and shifted (a matrix with one image per row)
Image "prototypes": a smaller number of row vectors (green = negative)
Reconstructed images: linear combinations of the prototypes

Slide 37

MF for images

[Figure: V is a 1000 (images) × 10,000 (pixels) matrix, factored using 2 prototypes (PC1 and PC2).]
V[i,j] = pixel j in image i

Slide 38

MF for modeling text

Slide 39

The Neatest Little Guide to Stock Market Investing

Investing For Dummies, 4th Edition

The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns

The Little Book of Value Investing

Value Investing: From Graham to Buffett and Beyond

Rich Dad’s Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!

Investing in Real Estate, 5th Edition

Stock Investing For Dummies

Rich Dad’s Advisors: The ABC’s of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss

https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/

Slide 40

https://technowiki.wordpress.com/2011/08/27/latent-semantic-analysis-lsa-tutorial/

TFIDF counts would be better

Slide 41

Recovering latent factors in a matrix

[Figure: the doc-term matrix V, with n documents as rows and m terms as columns, is approximated by an n × K document-factor matrix times a K × m term-factor matrix.]
V[i,j] = TFIDF score of term j in doc i
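A minimal LSA-style sketch of this doc-term setup, using scikit-learn purely as an illustration (the slides do not name a library): TfidfVectorizer builds the n × m matrix of TFIDF scores and TruncatedSVD recovers a rank-K factorization of it.

```python
# minimal LSA-style sketch; scikit-learn is an assumption, not named in the slides
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the neatest little guide to stock market investing",
    "investing for dummies",
    "the little book of common sense investing",
    "investing in real estate",
    "the abc's of real estate investing",
]

V = TfidfVectorizer().fit_transform(docs)   # n docs x m terms, TFIDF scores
svd = TruncatedSVD(n_components=2)          # K = 2 latent factors
W = svd.fit_transform(V)                    # n x K document factors
H = svd.components_                         # K x m term factors
```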

Slide 42

[Figure]

Slide 43

Investing for real estate

Rich Dad’s Advisor’s: The ABCs of Real Estate Investment …

Slide 44

The little book of common sense investing: …

Neatest Little Guide to Stock Market Investing

Slide 45

MF is like clustering

Slide 46

k-means as MF

[Figure: X, the original data set (n examples, entries v11 … vij … vnm), is approximated by Z, an n × r matrix of indicators for r clusters (rows like (0, 1) and (1, 0)), times M, the matrix of cluster means.]
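A small NumPy sketch of the correspondence: run a few Lloyd iterations of k-means by hand, then write the result as X ≈ Z M, where Z holds one-hot cluster indicators and M holds the cluster means. Sizes and data here are invented for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 5, rng.randn(50, 2) - 5])   # n=100 examples
r = 2                                                          # clusters
M = X[rng.choice(len(X), r, replace=False)]                    # initial means

for _ in range(10):   # Lloyd's algorithm (assumes no cluster goes empty)
    assign = np.argmin(((X[:, None, :] - M[None, :, :]) ** 2).sum(-1), axis=1)
    M = np.vstack([X[assign == k].mean(axis=0) for k in range(r)])

Z = np.eye(r)[assign]   # n x r indicator matrix, one 1 per row
X_approx = Z @ M        # each row of X replaced by its cluster mean
```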

Slide 47

How do you do it?

[Figure: the same V ≈ W H picture, an n × K matrix times a K × m matrix.]

Slide 48

talk pilfered from

…..

KDD 2011

Slide 49: [figure]

Slide 50

Recovering latent factors in a matrix

[Figure: V, the n (users) × m (movies) matrix with entries vij, is approximated by W (n × r) times H (r × m).]
V[i,j] = user i's rating of movie j

Slides 51-53: [figures only]

Slide 54

for image denoising

Slide 55

Matrix factorization as SGD

[Figure: the SGD update equations, with the step size annotated]

why does this work?
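The update equations themselves live in the figure; for squared loss they take the following standard form (my reconstruction, not copied from the slide), with step size ε:

```latex
% SGD on one observed entry (i,j), squared loss, step size \epsilon
L_{ij} = \bigl(v_{ij} - w_i^\top h_j\bigr)^2 , \qquad
e_{ij} = v_{ij} - w_i^\top h_j

w_i \leftarrow w_i + 2\epsilon\, e_{ij}\, h_j , \qquad
h_j \leftarrow h_j + 2\epsilon\, e_{ij}\, w_i
```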

Slide 56

Matrix factorization as SGD - why does this work? Here’s the key claim:

Slide 57

Checking the claim

Think of SGD for logistic regression: the LR loss compares y and ŷ = dot(w, x).
This is similar, but now we update both w (the user weights) and x (the movie weights).
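A minimal single-machine sketch of that analogy in NumPy: for each observed rating, compute the residual and nudge both the user row and the movie column, just as SGD for LR nudges w. Function and hyperparameter names are illustrative.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_movies, K=10, step=0.01, epochs=20, seed=0):
    """ratings: list of (i, j, v) triples, v = user i's rating of movie j."""
    rng = np.random.RandomState(seed)
    W = 0.1 * rng.randn(n_users, K)    # user factors  (the "w" side)
    H = 0.1 * rng.randn(K, n_movies)   # movie factors (the "x" side)
    for _ in range(epochs):
        rng.shuffle(ratings)           # stochastic order, as in any SGD
        for i, j, v in ratings:
            w_i = W[i].copy()
            err = v - w_i @ H[:, j]    # compare v and v_hat = dot(user, movie)
            W[i]    += step * err * H[:, j]   # update the user weights
            H[:, j] += step * err * w_i       # update the movie weights
    return W, H
```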

Slide 58

What loss functions are possible?

N1, N2: diagonal matrices, sort of like IDF factors for the users/movies

"generalized" KL-divergence
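For reference, the usual form of the generalized KL-divergence between V and the reconstruction WH (the standard NMF objective; assuming this is the form the slide intends):

```latex
D_{\mathrm{GKL}}\!\left(V \,\middle\|\, WH\right)
  = \sum_{i,j} \left( V_{ij}\,\log\frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right)
```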

Slide 59

What loss functions are possible?

Slide 60

What loss functions are possible?

Slide 61

ALS = alternating least squares
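A minimal dense NumPy sketch of ALS (the slide gives only the name; this is the textbook unregularized, fully observed version): hold H fixed and solve a least-squares problem for W, then hold W fixed and solve for H.

```python
import numpy as np

def als(V, K=5, iters=20, seed=0):
    """Alternating least squares for a fully observed matrix V ~= W @ H."""
    n, m = V.shape
    rng = np.random.RandomState(seed)
    W = rng.randn(n, K)
    H = rng.randn(K, m)
    for _ in range(iters):
        # fix H, solve min_W ||V - W H||^2  (V^T ~= H^T W^T)
        W = np.linalg.lstsq(H.T, V.T, rcond=None)[0].T
        # fix W, solve min_H ||V - W H||^2
        H = np.linalg.lstsq(W, V, rcond=None)[0]
    return W, H
```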

Slide 62

talk pilfered from

…..

KDD 2011

Slides 63-65: [figures only]

Slide 66

Like McDonnell et al with perceptron learning

Slide 67

Slow convergence…..

Slides 68-73: [figures only]

Slide 74

More detail….

- Randomly permute the rows/columns of the matrix.
- Chop V, W, H into blocks of size d × d; m/d blocks in W, n/d blocks in H.
- Group the data: pick a set of blocks with no overlapping rows or columns (a stratum); repeat until all blocks in V are covered.
- Train with SGD:
  - process strata in series
  - process the blocks within a stratum in parallel
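A small sketch of the stratum structure described above (my own illustration, not code from the paper): with a d × d blocking, stratum s consists of the d blocks (b, (b+s) mod d), which touch disjoint row blocks and disjoint column blocks and can therefore be processed in parallel.

```python
# enumerate the strata for a d x d blocking of V
d = 4
for s in range(d):                                   # d strata, processed in series
    stratum = [(b, (b + s) % d) for b in range(d)]   # d non-overlapping blocks
    # every block in `stratum` owns a distinct row block and a distinct
    # column block, so the d blocks can be trained with SGD in parallel
    print("stratum", s, ":", stratum)
```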

Slide 75

More detail….

[Figure; note the notation change: Z here is what was called V earlier.]

Slide 76

More detail….

- Initialize W, H randomly (not at zero).
- Choose a random ordering (random sort) of the points in a stratum in each "sub-epoch".
- Pick the strata sequence by permuting the rows and columns of M, and using M'[k,i] as the column index of row i in sub-epoch k.
- Use "bold driver" to set the step size:
  - increase the step size when the loss decreases (in an epoch)
  - decrease the step size when the loss increases
- Implemented in Hadoop and R/Snowfall.

M = [figure]
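A minimal sketch of the "bold driver" rule described above; the growth and shrink factors are illustrative, since the slide gives no constants.

```python
def bold_driver(step, prev_loss, curr_loss, grow=1.05, shrink=0.5):
    """Increase the step size after an epoch in which the loss decreased,
    cut it sharply after an epoch in which the loss increased."""
    if curr_loss <= prev_loss:
        return step * grow
    return step * shrink

# usage inside the training loop:
# step = bold_driver(step, prev_loss, curr_loss); prev_loss = curr_loss
```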

Slide 77: [figure]

Slide 78

Wall Clock Time: 8 nodes, 64 cores, R/snow

Slides 79-82: [figures only]

Slide 83

Number of Epochs

Slides 84-87: [figures only]

Slide 88

Varying rank (100 epochs for all)

Slide 89

Hadoop scalability

Hadoop process setup time starts to dominate

Slide 90

Hadoop scalability

Slide 91: [figure]