
Slide 1

Subsampling Graphs

Slide 2

RECAP of PageRank-Nibble

Slide 3

Why I’m talking about graphs

Lots of large data is graphs:
- Facebook, Twitter, citation data, and other social networks
- The web, the blogosphere, the semantic web, Freebase, Wikipedia, Twitter, and other information networks
- Text corpora (like RCV1), large datasets with discrete feature values, and other bipartite networks: nodes = documents or words; links connect document → word or word → document
- Computer networks; biological networks (proteins, ecosystems, brains, …); …
- Heterogeneous networks with multiple types of nodes: people, groups, documents

Slide 4

Our first operation on graphs: local graph partitioning

Why? Because it’s about the best we can do in terms of subsampling/exploratory analysis of a graph.

Slide 5

What is Local Graph Partitioning?

Global

Local

Slide 6

What is Local Graph Partitioning?

Slide 7

Main results of the paper (Reid Andersen et al.)

- An approximate personalized PageRank computation that only touches nodes “near” the seed, but has small error relative to the true PageRank vector
- A proof that a sweep over the approximate personalized PageRank vector finds a cut with conductance sqrt(α ln m), unless no good cut exists (that is, unless no subset S contains significantly more mass in the approximate PageRank vector than in a uniform distribution)

Slide 8

Approximate PageRank: Key Idea

- p is the current approximation (start at 0)
- r is the set of “recursive calls to make”, i.e., the residual error; start with all mass on the seed s
- u is the node picked for the next call

Slide 9

Analysis of PageRank-Nibble

Slide 10

Analysis

linearity

re-group & linearity

Slide 11

Approximate PageRank: Key Idea

By definition PageRank is fixed point of:

Claim:

Proof:
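The fixed point, claim, and proof on this slide are images in the original; below is a plausible LaTeX reconstruction in the Andersen–Chung–Lang style, where pr(α, s) is the personalized PageRank vector for seed distribution s and W is the walk matrix (the exact notation is an assumption):

```latex
% PageRank as a fixed point:
\mathrm{pr}(\alpha, s) \;=\; \alpha s + (1-\alpha)\,\mathrm{pr}(\alpha, s)\,W
% Claim (linearity in the seed):
\mathrm{pr}(\alpha, s) \;=\; \alpha s + (1-\alpha)\,\mathrm{pr}(\alpha, sW)
% Proof sketch: unroll the fixed point into a power series and regroup:
\mathrm{pr}(\alpha, s)
= \alpha \sum_{t=0}^{\infty} (1-\alpha)^t\, s W^t
= \alpha s + (1-\alpha)\,\alpha \sum_{t=0}^{\infty} (1-\alpha)^t\, (sW)\, W^t
= \alpha s + (1-\alpha)\,\mathrm{pr}(\alpha, sW)
```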

Slide 12

Approximate PageRank: Algorithm
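The algorithm box on this slide is an image in the original; below is a minimal runnable sketch of the push procedure it describes, in the non-lazy formulation (the graph-access callables and the default values of alpha and eps are illustrative assumptions, not from the slides):

```python
from collections import defaultdict, deque

def apr(neighbors, degree, seed, alpha=0.15, eps=1e-4):
    """Approximate personalized PageRank by repeated 'push' operations.

    p is the current approximation (starts at 0); r is the residual,
    the set of "recursive calls to make", starting with all mass on
    the seed.  A push on u moves alpha*r(u) into p(u) and spreads the
    remaining (1-alpha)*r(u) evenly over u's neighbors; only nodes
    with r(u) > eps*degree(u) are pushed, so the computation stays
    local to the seed's neighborhood.
    """
    p, r = defaultdict(float), defaultdict(float)
    r[seed] = 1.0
    queue = deque([seed])            # candidates with r(u)/d(u) possibly > eps
    while queue:
        u = queue.popleft()
        if r[u] <= eps * degree(u):  # stale entry; threshold no longer met
            continue
        ru = r[u]
        p[u] += alpha * ru
        r[u] = 0.0
        share = (1.0 - alpha) * ru / degree(u)
        for v in neighbors(u):
            r[v] += share
            if r[v] > eps * degree(v):
                queue.append(v)
    return dict(p), dict(r)
```

Mass is conserved (the entries of p and r always sum to 1), and each push removes at least α·ε·degree(u) from |r|₁, which is exactly what the analysis on the following slides bounds.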

Slide 13

Analysis

So, at every point in the apr algorithm, the invariant holds: p plus the PageRank of the residual r equals the true personalized PageRank of the seed.

Also, at each point, |r|₁ decreases by at least α·ε·degree(u) per push, so after T push operations, where the i-th push is on a node of degree d_i, we know that α·ε·(d_1 + … + d_T) ≤ 1, which bounds the size of r and p.
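The displayed relations here are images in the original; a reconstruction consistent with the push step (each push on u requires r(u) ≥ ε·d(u) and moves α·r(u) of mass out of r, starting from |r₀|₁ = 1):

```latex
% invariant maintained by every push:
\mathrm{pr}(\alpha, s) \;=\; p + \mathrm{pr}(\alpha, r)
% each push removes \alpha\, r(u) \ge \alpha\,\epsilon\, d(u) from \|r\|_1, so after T pushes:
\alpha\,\epsilon \sum_{i=1}^{T} d_i \;\le\; 1
\quad\Longrightarrow\quad
\sum_{i=1}^{T} d_i \;\le\; \frac{1}{\alpha\,\epsilon}
```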

Slide 14

Analysis

With the invariant, this bounds the error of p relative to the true PageRank vector.

Slide 15

Comments – API

- p, r are hash tables; they are small, with O(1/(εα)) entries
- push just needs p, r, and the neighbors of u
- Could implement with an API:
  - List<Node> neighbor(Node u)
  - int degree(Node u)
- so d(v) = api.degree(v)

Slide 16

Comments – Ordering

Might pick the u with the largest r(u)/d(u) … or …

Slide 17

Comments – Ordering for Scanning

Scan repeatedly through an adjacency-list encoding of the graph. For every line you read, u, v1, …, v_d(u), such that r(u)/d(u) > ε: do a push on u.

Benefit: storage is O(1/(εα)) for the hash tables, and this avoids any seeking.
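As a sketch, the scanning strategy above might look like the following (the (u, [v1, …, vd]) line format and the function names are assumptions, not from the slides):

```python
def apr_by_scanning(adj_lines, seed, alpha=0.15, eps=1e-4):
    """Streaming approximate personalized PageRank: repeatedly scan an
    adjacency-list encoding of the graph, pushing on any node u whose
    line is read while r(u)/d(u) > eps.  Only the hash tables p and r,
    of size O(1/(eps*alpha)), are kept in memory; the graph itself is
    only ever read sequentially, with no seeking.

    adj_lines: zero-argument callable returning a fresh iterable of
    (u, [v1, ..., vd]) pairs, e.g. one pass over a file on disk.
    """
    p, r = {}, {seed: 1.0}
    pushed = True
    while pushed:                        # one full scan per iteration
        pushed = False
        for u, neighbors in adj_lines():
            ru = r.get(u, 0.0)
            if ru > eps * len(neighbors):
                p[u] = p.get(u, 0.0) + alpha * ru
                r[u] = 0.0
                share = (1.0 - alpha) * ru / len(neighbors)
                for v in neighbors:
                    r[v] = r.get(v, 0.0) + share
                pushed = True
    return p, r
```

The loop terminates because every push removes at least α·ε of residual mass per unit of degree, so the total number of pushes is bounded; the final scan that finds nothing to push ends the computation.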

Slide 18

Possible optimizations?

Scanning is much faster than doing random access for the first few scans, but slower for the last few, when there will be only a few ‘pushes’ per scan.

Optimizations you might imagine:
- Parallelize?
- Hybrid seek/scan: index the nodes in the graph on the first scan; start seeking when you expect too few pushes to justify a scan (say, less than one push per megabyte of scanning)
- Hotspots: save the adjacency-list representation of nodes with a large r(u)/d(u) in a separate file of “hot spots” as you scan, then rescan that smaller list of “hot spots” until their scores drop below the threshold

Slide 19

After computing apr…

Given a graph that’s too big for memory, and/or that’s only accessible via an API, we can extract a sample in an interesting area:
- Run the apr/rwr from a seed node
- Sweep to find a low-conductance subset
- Then: compute statistics, test out some ideas, visualize it, …

Slide 20

Key idea: a “sweep”

Order the vertices v1, v2, … by highest apr score, then pick a prefix S = {v1, v2, …, vk}:

S = {}; volS = 0; B = {}
For k = 1, …, n:
  S += {vk}
  volS += degree(vk)
  B += {u : vk → u and u not in S}
  Φ[k] = |B| / volS
Pick the k for which Φ[k] is smallest.
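A runnable sketch of the sweep, with two judgment calls flagged: it counts boundary *edges* (the standard conductance numerator) where the slide accumulates a set B of boundary nodes, and it skips the full vertex set, whose score is degenerately 0:

```python
def sweep(apr_scores, neighbors, degree):
    """Sweep over vertices in decreasing order of apr score, tracking
    phi = (boundary edges) / vol(S) for each prefix S, and return the
    prefix with the smallest score.  Assumes an undirected graph with
    symmetric neighbor lists and no self-loops."""
    order = sorted(apr_scores, key=apr_scores.get, reverse=True)
    S, volS, boundary = set(), 0, 0
    best_phi, best_k = float("inf"), 0
    for k, v in enumerate(order[:-1], 1):   # proper prefixes only
        S.add(v)
        volS += degree(v)
        for u in neighbors(v):
            # an edge to a node already in S stops being a boundary edge;
            # an edge to an outside node becomes one
            boundary += -1 if u in S else 1
        phi = boundary / volS
        if phi < best_phi:
            best_phi, best_k = phi, k
    return set(order[:best_k]), best_phi
```

On two triangles joined by a single edge, with apr scores concentrated on one triangle, the sweep recovers that triangle with conductance 1/7 (one cut edge over volume 7).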

Slide 21

Scoring a graph prefix

Conductance: the number of edges leaving S, divided by vol(S), where vol(S) is the sum of deg(x) for x in S. For small S, this is Prob(a random edge out of S leaves S).
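In symbols (the displayed formula is an image in the original; this is the standard definition the text describes):

```latex
\Phi(S) \;=\; \frac{\bigl|\{(u,v)\in E : u\in S,\; v\notin S\}\bigr|}{\mathrm{vol}(S)},
\qquad
\mathrm{vol}(S) \;=\; \sum_{x\in S} \deg(x)
% for small S:\quad \Phi(S) \approx \Pr(\text{a random edge out of } S \text{ leaves } S)
```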

Slide 22

Putting this together

Given a graph that’s too big for memory, and/or that’s only accessible via an API, we can extract a sample in an interesting area:
- Run the apr/rwr from a seed node
- Sweep to find a low-conductance subset
- Then: compute statistics, test out some ideas, visualize it, …

Slide 23

Visualizing a Graph with Gephi

Slide 24

Screen shots/demo

Gephi – a Java tool. Reads several inputs, e.g. a .csv file.
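For instance, the sampled subgraph can be dumped as an edge-list CSV that Gephi’s spreadsheet importer reads (the Source/Target header is Gephi’s convention for edge tables; the filename is arbitrary):

```python
import csv

def write_edge_csv(edges, path="subgraph_edges.csv"):
    """Write a sampled subgraph as an edge list for Gephi: one header
    row (Source, Target) followed by one row per edge."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target"])
        for u, v in edges:
            writer.writerow([u, v])
```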

[demo]

Slides 25–34: screenshots from the demo (images only; no text content).