SGD on Hadoop for Big Data - PowerPoint Presentation

Uploaded by sherrill-nordquist on 2018-03-15



Presentation Transcript

SGD on Hadoop for Big Data & Huge Models

Alex Beutel
Based on work done with Abhimanu Kumar, Vagelis Papalexakis, Partha Talukdar, Qirong Ho, Christos Faloutsos, and Eric Xing

Outline

- When to use SGD for distributed learning
- Optimization
  - Review of DSGD
  - SGD for tensors
  - SGD for ML models: topic modeling, dictionary learning, MMSB
- Hadoop
  - General algorithm
  - Setting up the MapReduce body
  - Reducer communication
  - Distributed normalization
- "Always-On SGD": how to deal with the straggler problem
- Experiments

When distributed SGD is useful

- Collaborative filtering: predict movie preferences
- Topic modeling: what are the topics of webpages, tweets, or status updates?
- Dictionary learning: remove noise or missing pixels from images
- Tensor decomposition: find communities in temporal graphs

The scale motivates distribution: 300 million photos uploaded to Facebook per day, 1 billion users on Facebook, 400 million tweets per day.

Gradient Descent

Stochastic Gradient Descent (SGD)

SGD Background
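The gradient-descent and SGD equations on these slides survive only as images in this transcript. As an illustrative sketch (my own minimal formulation, not necessarily the slides'), one SGD step for a single observed entry x ≈ u·v of a factored matrix could look like:

```python
import numpy as np

def sgd_step(u, v, x, lr=0.05):
    # One stochastic step on the squared error (x - u.v)^2 for a single
    # observed entry: move u and v along the negative gradient.
    err = x - u @ v
    return u + lr * err * v, v + lr * err * u
```

Repeating this step over randomly sampled observed entries is all of SGD; the rest of the talk is about deciding which entries may be processed concurrently.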

DSGD for Matrices (Gemulla, 2011)

[Diagram: a users × movies ratings matrix X factored as X ≈ U V, with U the users × genres factors and V the genres × movies factors.]

Diagonal blocks of X are independent: SGD updates inside one block touch disjoint rows of U and columns of V, so the blocks of a stratum can be processed in parallel.

Partition your data & model into d × d blocks. This results in d = 3 strata. Process strata sequentially, and process the blocks in each stratum in parallel.
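The stratum construction described above can be sketched as follows (a hypothetical `strata` helper; the diagonal-rotation pairing is the scheme the slide describes):

```python
def strata(d):
    # Stratum s pairs row-block b with column-block (b + s) mod d, so the
    # d blocks inside one stratum share no rows of U and no columns of V.
    return [[(b, (b + s) % d) for b in range(d)] for s in range(d)]
```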

Tensors

What is a tensor? Tensors are used for structured data with more than 2 dimensions; think of a tensor as a 3D matrix.

[Diagram: a subject × verb × object tensor. For example: (Derek Jeter, plays, baseball).]

Tensor Decomposition

[Diagram, built up over four slides: the tensor X decomposed into factor matrices U, V, and W. Some pairs of blocks are independent, but blocks that share a slice of U, V, or W are not independent and cannot be updated in parallel.]

Tensor Decomposition

For d = 3 blocks per stratum, we require d² = 9 strata.
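Extending the matrix rotation to three modes gives exactly the d² strata counted above; a sketch with a hypothetical helper name:

```python
def tensor_strata(d):
    # Offsets (s, t) give d**2 strata; within a stratum the d block triples
    # (b, (b + s) % d, (b + t) % d) touch disjoint slices of U, V, and W.
    return [[(b, (b + s) % d, (b + t) % d) for b in range(d)]
            for s in range(d) for t in range(d)]
```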

Coupled Matrix + Tensor Decomposition

[Diagram: a subject × verb × object tensor X coupled with a subject × document matrix Y; X ≈ U, V, W and Y ≈ U, A, with the factor U shared between the two decompositions.]

Constraints & Projections

Example: topic modeling, factoring a documents × words matrix through topics.

Constraints: sometimes we want to restrict the solution:
- Non-negative
- Sparse
- On the simplex (so vectors become probabilities)
- Inside the unit ball

How to enforce these? Projections. Example: non-negativity.

More projections: sparsity (soft thresholding), the simplex, the unit ball.
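The projection formulas are images in this transcript; a hedged NumPy sketch of the standard operators (the simplex projection uses the usual sort-based rule, which may differ from the slides' exact formulation):

```python
import numpy as np

def soft_threshold(x, lam):
    # Sparsity: shrink toward zero and clip, the proximal step for an L1 penalty.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def project_unit_ball(x):
    # Scale down only if the vector leaves the unit ball.
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

def project_simplex(x):
    # Euclidean projection onto {x : x >= 0, sum(x) = 1} via the sort-based rule.
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, x.size + 1) > 0)[0][-1]
    return np.maximum(x + (1.0 - css[rho]) / (rho + 1), 0.0)
```

Non-negativity is the simplest case of all: `np.maximum(x, 0.0)` after each update.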

Dictionary Learning: learn a dictionary of concepts and a sparse reconstruction. Useful for fixing noise and missing pixels in images. Constraints: a sparse encoding, with dictionary atoms kept within the unit ball.

Mixed Membership Network Decomposition: used for modeling communities in graphs (e.g. a social network). Constraints: simplex and non-negativity.

Implementing on Hadoop

High-level algorithm:

for epoch e = 1 … T do
  for subepoch s = 1 … d² do
    let Bs be the set of blocks in stratum s
    for block b = 1 … d in parallel do
      run SGD on all points in block b
    end
  end
end

[Diagram: Stratum 1, Stratum 2, Stratum 3, …]
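The nested loop above can be sketched in Python (a sequential stand-in; the innermost loop over blocks is where a real implementation would parallelize):

```python
def run_epoch(d, sgd_on_block):
    # One epoch: d**2 subepochs (strata) processed sequentially. The d block
    # triples inside a stratum are mutually independent, so a distributed
    # runner could process them in parallel; here they are looped for clarity.
    for s in range(d):
        for t in range(d):
            for b in range(d):
                sgd_on_block((b, (b + s) % d, (b + t) % d))
```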

Bad Hadoop Algorithm

[Diagram, over three slides: each subepoch is its own MapReduce job. Subepoch 1: mappers route points, then the reducers run SGD on their blocks and update the factor triples (U2, V1, W3), (U3, V2, W1), (U1, V3, W2). Subepoch 2: the next stratum's triples, e.g. (U2, V1, W2), (U3, V2, W3), (U1, V3, W1). And so on, relaunching mappers and reducers for every subepoch.]

Hadoop Challenges

MapReduce is typically very bad for iterative algorithms:
- T × d² iterations
- Sizable overhead per Hadoop job
- Little flexibility

High Level Algorithm

[Diagram, repeated over four slides: the factor blocks U1–U3, V1–V3, W1–W3 live in HDFS. In each subepoch every reducer is assigned one block triple from the current stratum: first (U1, V1, W1), (U2, V2, W2), (U3, V3, W3), then rotated triples such as (U1, V1, W3), (U2, V2, W1), (U3, V3, W2) and (U1, V1, W2), (U2, V2, W3), (U3, V3, W1) in later subepochs, so the triples within any one subepoch are mutually independent.]

Hadoop Algorithm

[Diagram, over seven slides.] Process points: map each point to its block, with the info necessary to order it. The mappers emit points; Hadoop's partition-and-sort phase delivers each reducer its blocks in subepoch order; each reducer runs SGD on a block triple (e.g. U1, V1, W1) and then updates those factor blocks. Parameters are read from and written back to HDFS between stages, so blocks such as W1, W2, W3 can move to a different reducer in the next subepoch.

Use: Partitioner, KeyComparator, GroupingComparator.

Hadoop Summary

- Use mappers to send data points to the correct reducers, in order
- Use reducers as machines in a normal cluster
- Use HDFS as the communication channel between reducers
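A sketch of how a mapper might build the composite key that makes this routing work, assuming indices are assigned to blocks round-robin (`i % d`) and reducer b owns block-row b. The names and the modular block assignment are illustrative, not the talk's actual code: the first key component is what a custom Partitioner would hash on, and the second is what the KeyComparator would sort by.

```python
def map_point(i, j, k, d):
    # Hypothetical mapper for a tensor entry at index (i, j, k): the
    # block-row fixes the reducer; the stratum offsets (s, t) fix the
    # subepoch in which that block is scheduled.
    b = i % d
    s = (j % d - b) % d
    t = (k % d - b) % d
    return b, s * d + t          # (partition key, sort key)
```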

Distributed Normalization

[Diagram: a documents × words matrix factored through topics; machine b holds slices π_b and β_b, over three machines.]

σ(b) is a k-dimensional vector summing the terms of β_b. Transfer σ(b) to all machines. Each machine then calculates the global σ from the σ(b) it received, and normalizes its own slice of β.
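The two steps can be sketched as follows (hypothetical helper names; β is assumed to be split across machines by rows, with k topic columns, so each σ(b) is the k-vector of this machine's column sums):

```python
import numpy as np

def local_topic_sums(beta_b):
    # sigma(b): k-dimensional per-topic sums of this machine's slice of beta.
    return beta_b.sum(axis=0)

def normalize_slice(beta_b, sigmas):
    # After every sigma(b) has been transferred to all machines, each
    # machine forms the global sigma and normalizes its own slice.
    return beta_b / np.sum(sigmas, axis=0)
```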

Barriers & Stragglers

[Diagram: the same map / partition-and-sort / reduce pipeline as before. Reducers that finish running SGD on their block early must sit at the synchronization barrier until the stragglers finish: wasting time waiting!]

Solution: "Always-On SGD"

For each reducer:
1. Run SGD on all points in the current block Z
2. Shuffle the points in Z and decrease the step size
3. Check whether the other reducers are ready to sync
4. If not ready to sync: run SGD on the points in Z again (back to step 2)
5. If ready to sync: sync parameters and get a new block Z
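The loop above, sketched as a reducer-side routine (hypothetical names; `sgd_pass` stands in for one SGD sweep over the block, and `sync_ready` for the barrier check):

```python
import random

def always_on_reducer(block, sync_ready, sgd_pass, decay=0.5):
    # Instead of idling at the barrier, keep re-running SGD on the current
    # block Z with shuffled points and a decayed step size until every
    # reducer is ready to sync.
    step = 1.0
    sgd_pass(block, step)               # first pass over block Z
    while not sync_ready():
        random.shuffle(block)           # reshuffle the points in Z
        step *= decay                   # decrease the step size
        sgd_pass(block, step)           # extra pass on the same points
    # ...then sync parameters and fetch the next block
```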

"Always-On SGD"

[Diagram: the same pipeline, but a reducer that is waiting to sync runs SGD on its old points again instead of idling.]

"Always-On SGD"

[Timeline for Reducers 1–4: the first SGD pass over block Z, extra SGD updates while waiting, then reading parameters from HDFS and writing parameters to HDFS.]

Experiments

FlexiFaCT (Tensor Decomposition):
- Convergence
- Scalability in data size
- Scalability in tensor dimension: handles up to 2 billion parameters!
- Scalability in rank of the decomposition: handles up to 4 billion parameters!
- Scalability in number of machines

Fugue (using "Always-On SGD"):
- Dictionary learning: convergence
- Community detection: convergence
- Topic modeling: convergence
- Topic modeling: scalability in data size
- Topic modeling: scalability in rank
- Topic modeling: scalability over number of machines

Key Points

- A flexible method for tensors & ML models
- Can use stock Hadoop, with HDFS as the communication channel
- When waiting for slower machines, run updates on old data again

Questions?

Alex Beutel
abeutel@cs.cmu.edu
http://alexbeutel.com