Dealing with Diversity in Mining and Query Processing

Uploaded On 2020-11-06



Presentation Transcript

Slide1

Dealing with Diversity in Mining and Query Processing

Jeffrey Xu Yu (于旭)
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong
yu@se.cuhk.edu.hk

Slide2

Books on Social Networks

Social and Economic Networks, by Matthew O. Jackson
Social Network Data Analysis, by Charu C. Aggarwal
Exploratory Social Network Analysis with Pajek, by Wouter de Nooy, Andrej Mrvar, and Vladimir Batagelj
Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg
Networks: An Introduction, by M.E.J. Newman

Slide3

Some Online Courses

Mining of Massive Datasets (Anand Rajaraman and Jeff Ullman): http://infolab.stanford.edu/~ullman/mmds.html
Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg: http://www.cs.cornell.edu/home/kleinber/networks-book
Topics in Data Management & Mining – Social Networks, Laks V.S. Lakshmanan: http://www.cs.ubc.ca/~laks/534l/cpsc534l.html

Slide4

Stanford Large Network Dataset Collection: http://snap.stanford.edu/data

Social networks
Communication networks
Citation networks
Collaboration networks
Web graphs
Amazon networks
Internet networks
Road networks
Autonomous systems
Signed networks
Wikipedia networks and metadata
Twitter and Memetracker

Slide5

Graph Database: http://en.wikipedia.org/wiki/Graph_database

Pregel: Google’s internal graph processing platform
Trinity: Microsoft Research Asia
Neo4j: commercial graph database
…

Slide6

Diversified Ranking

Why diversified ranking?
Information requirements are diverse.
Queries are incomplete.

Slide7

Problem Statement

For query-dependent diversity ranking, the goal is to find K nodes in a graph that are relevant to the query node and dissimilar to each other.

For query-independent diversity ranking, the goal is to find K prestige nodes in a graph that are dissimilar to each other.

Main applications: ranking nodes in social networks, ranking papers, etc.

Slide8

Challenges

Diversity measures: there are no widely accepted diversity measures on graphs in the literature.
Scalability: most existing methods do not scale to large graphs.
Lack of intuitive interpretation.

Slide9

Existing Methods

Grasshopper [Zhu, et al., HLT-NAACL’07]
ManiRank [Zhu, et al., WWW’11]
DivRank [Mei, et al., KDD’10]
DRAGON [Tong, et al., KDD’11]
Resistive Graph Centers [Dubey, et al., KDD’11]

Slide10

Grasshopper/ManiRank

The main idea
Work in an iterative manner.
Select a node at each iteration by random walk.
Set the selected node to be an absorbing node, and perform the random walk again to select the second node.
Repeat the same process for K iterations to get K nodes.

No diversity measure
Diversity is achieved only by intuition and experiments.

Cannot scale to large graphs (time complexity O(…)).

 

Slide11

Grasshopper/ManiRank

Initial random walk with no absorbing states
Absorbing random walk after ranking the first item

Slide12

DivRank

Based on a vertex-reinforced random walk.
No diversity measure.
Convergence properties are not clear.
Time and space complexity is …

 

Slide13

DRAGON, Resistive Graph Centers

DRAGON [Tong, et al., KDD’11]
The diversity measure lacks a clear topological interpretation.

Resistive Graph Centers [Dubey, et al., KDD’11]
Based on personalized PageRank with a learnable teleportation parameter.
Cannot scale to large graphs.

Slide14

A Summary

Comparison with existing methods

Slide15

Our Approach

The main idea
Relevance of the top-K nodes (denoted by a set S) is achieved by large (Personalized) PageRank scores.
Diversity of the top-K nodes is achieved by a large expansion ratio.
Expansion ratio of a node set S: σ(S) = |N(S)|/n
A larger expansion ratio implies better diversity.
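A minimal sketch of the expansion ratio. The slide does not define N(S); the code assumes N(S) is S together with every neighbor of a node in S, and the adjacency-dict graph and the function name are illustrative only.

```python
def expansion_ratio(adj, S):
    """sigma(S) = |N(S)| / n, assuming N(S) = S plus all neighbors of S."""
    expanded = set(S)
    for u in S:
        expanded.update(adj[u])
    return len(expanded) / len(adj)

# Hypothetical 6-node graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
same_triangle = expansion_ratio(adj, {0, 1})   # both picks in one triangle
one_per_triangle = expansion_ratio(adj, {0, 5})  # picks spread over both triangles
print(same_triangle, one_per_triangle)
```

Picking one node per triangle reaches the whole graph, while two nodes from the same triangle reach only half of it, which is the sense in which a larger ratio means a more diverse set.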

Slide16

The K-step Expansion

K-step expansion ratio of S: σk(S) = |Nk(S)|/n
Our diversity measures
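The k-step variant can be sketched with a breadth-first expansion. This assumes Nk(S) is the set of nodes within k hops of S, including S itself; the graph and names are hypothetical.

```python
def k_step_expansion_ratio(adj, S, k):
    """sigma_k(S) = |N_k(S)| / n, with N_k(S) = nodes within k hops of S (assumed)."""
    reached = set(S)
    frontier = set(S)
    for _ in range(k):
        # Nodes first reached at this hop distance.
        frontier = {v for u in frontier for v in adj[u]} - reached
        reached |= frontier
    return len(reached) / len(adj)

# Hypothetical path graph 0-1-2-3-4-5-6-7.
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 8} for i in range(8)}
print(k_step_expansion_ratio(path, {0}, 1))
print(k_step_expansion_ratio(path, {0, 7}, 2))
```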

Slide17

A Discrete Optimization Problem

The diversified ranking problem on a graph is formulated as a discrete optimization problem.
Submodularity: F(S) is shown to be submodular and non-decreasing.
The greedy algorithm
A 1-1/e approximation algorithm for solving Eq. (1).
Linear time and space complexity w.r.t. the size of the graph.

Slide18

The Greedy Algorithm

Works in K rounds.
Selects a node with maximal marginal gain at each round.

Marginal gain
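The greedy rounds can be sketched as follows. The objective F(S) is not spelled out in this transcript, so the code assumes a weighted sum of relevance and expansion ratio, F(S) = (1 - lam) * sum of relevance over S + lam * |N(S)|/n, with a hypothetical trade-off parameter lam and hypothetical relevance scores standing in for PageRank.

```python
def greedy_diversified(adj, relevance, K, lam=0.5):
    """Pick K nodes, each round taking the node with the largest marginal gain."""
    n = len(adj)
    chosen, covered = [], set()
    for _ in range(K):
        best, best_gain = None, -1.0
        for u in adj:
            if u in chosen:
                continue
            newly = ({u} | adj[u]) - covered          # nodes u would add to N(S)
            gain = (1 - lam) * relevance[u] + lam * len(newly) / n
            if gain > best_gain:
                best, best_gain = u, gain
        if best is None:
            break
        chosen.append(best)
        covered |= {best} | adj[best]
    return chosen

# Hypothetical graph (two triangles joined by edge 2-3) and relevance scores.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
relevance = {0: 0.3, 1: 0.25, 2: 0.2, 3: 0.05, 4: 0.15, 5: 0.05}
print(greedy_diversified(adj, relevance, K=2))           # diversity-aware pick
print(greedy_diversified(adj, relevance, K=2, lam=0.0))  # relevance only
```

With lam = 0 the selection degenerates to the two most relevant nodes, both from the same triangle; with lam = 0.5 the second pick moves to the other triangle.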

Slide19

Generalized Diversified Ranking Optimization

Maximize Fk(S) subject to the cardinality constraint |S| <= K
Submodularity: Fk(S) is shown to be submodular and non-decreasing.
Randomized greedy algorithm
Near 1-1/e approximation algorithm.
Linear time and space complexity w.r.t. the size of the graph.

Slide20

Generalized Diversified Ranking Optimization

Randomized greedy algorithm
Same idea as the greedy algorithm
Works in K rounds.
At each round, select the node with maximal marginal gain.
But evaluating the maximal marginal gain is expensive.
Our idea: use a probabilistic counting data structure to sketch the k-step neighborhood of each node.

Marginal gain

Slide21

FM Sketch and Its Properties

A probabilistic counting structure, devised by Flajolet and Martin.
It is used to estimate the cardinality of a multi-set using only log C + t bits, where C denotes the cardinality and t is a small constant.
Each FM Sketch is a bitmap of log C + t bits.
Advantage: to estimate the cardinality of the union of two multi-sets, we only need a bitwise OR between the two FM Sketches.
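A minimal single-bitmap FM Sketch, to make the union property concrete. The bitmap length is fixed at 32 bits for simplicity, SHA-256 is used as a stand-in hash, and all function names are illustrative. A single bitmap gives a coarse estimate; practical implementations average many bitmaps.

```python
import hashlib

BITS = 32  # bitmap length (the slide's log C + t), fixed here for simplicity

def _rho(item):
    """Index of the lowest set bit of a deterministic hash of item."""
    h = int.from_bytes(hashlib.sha256(repr(item).encode()).digest()[:8], "big")
    h |= 1 << (BITS - 1)           # guarantee the index stays below BITS
    return (h & -h).bit_length() - 1

def fm_sketch(items):
    """Bitmap with bit rho(x) set for every item x; duplicates change nothing."""
    bitmap = 0
    for x in items:
        bitmap |= 1 << _rho(x)
    return bitmap

def fm_estimate(bitmap):
    """Cardinality estimate from the position of the lowest zero bit."""
    r = 0
    while (bitmap >> r) & 1:
        r += 1
    return (2 ** r) / 0.77351      # Flajolet-Martin correction constant

a = fm_sketch(range(0, 50))
b = fm_sketch(range(30, 90))
union = a | b                      # sketch of the union, no rescan of the items
print(fm_estimate(union))
```

Because each item only ever sets one bit, OR-ing two sketches yields exactly the sketch of the union, which is what makes the marginal-gain evaluation cheap.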

Slide22

The Randomized Greedy Algorithm

Randomized greedy algorithm
For each node u, use an FM Sketch to sketch Nk({u}).
Use the following rule to sketch Nk({u}), which can be implemented in a recursive way.
Use an FM Sketch to sketch Nk(S).
Evaluating the marginal gain can be implemented by a bitwise OR between the sketches of Nk(S) and Nk({u}).
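The recursive rule itself is not preserved in this transcript. The sketch below assumes the natural one: the k-step sketch of u is the OR of the (k-1)-step sketches of u and of its neighbors. The one-bit-per-node hashing and all names are illustrative.

```python
import hashlib
from functools import reduce
from operator import or_

def node_bit(u, bits=32):
    """One-bit FM-style contribution of node u (deterministic hash)."""
    h = int.from_bytes(hashlib.sha256(repr(u).encode()).digest()[:8], "big")
    h |= 1 << (bits - 1)           # keep the lowest set bit inside the bitmap
    return h & -h

def k_step_sketches(adj, k):
    """Sketch of N_k({u}) for every u, built from the (k-1)-step sketches."""
    sketch = {u: node_bit(u) for u in adj}               # 0-step: just u
    for _ in range(k):
        sketch = {u: reduce(or_, (sketch[v] for v in adj[u]), sketch[u])
                  for u in adj}
    return sketch

# Hypothetical path graph 0-1-2-3-4-5.
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 6} for i in range(6)}
sk = k_step_sketches(path, 2)
sketch_S = sk[0]                       # sketch of N_2(S) for S = {0}
sketch_S_plus_5 = sketch_S | sk[5]     # sketch of N_2(S with node 5 added)
```

Each of the k passes touches every edge once, so building all sketches costs time proportional to k times the graph size, and a candidate's marginal gain is then estimated from one OR.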

Slide23

Experimental Studies

We conduct experiments on 5 real networks (3 collaboration networks, 1 citation network, and 1 social network).
We show some results with Flickr, which is a popular photo-sharing website (from the ASU social computing data repository).
Undirected social network (80,513 nodes, 5,899,882 edges, and 195 different groups)

Slide24

Some Testing Results on Flickr

Slide25

Make a Top-K Algorithm Diversified

The result of searching “apple” in Google Images

Existing top-k search algorithms
Search results are ranked independently.
When searching “apple” in Google Images, 9 out of the top 15 results are the logo of Apple Inc.

 

Slide26

Structural Keyword Search (1)

Example: Keyword Search in Graphs
Input: a graph with text information on each node, and a user-given keyword query
Output: the top-k minimal Steiner trees that contain all user-given keywords

[Figure: a DBLP graph for the keyword query “graph patterns keyword search”. The author node a1 (Author: Jiawei Han) is linked through four “Action: Write” nodes w1 to w4 to the paper nodes p1 to p4. The paper titles shown are “Mining Graph Patterns”, “Mining Significant Graph Patterns by Leap Search”, “Optimizing Index for Taxonomy Keyword Search”, and “Keyword Search in Text Cube: Finding Top-k Aggregated Cell Documents”. Four answer trees are shown: v1 = {a1, w1, p1, w2, p2}, v2 = {a1, w1, p1, w4, p4}, v3 = {a1, w3, p3, w2, p2}, v4 = {a1, w3, p3, w4, p4}.]

Slide27

Structural Keyword Search (2)

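[Figure: the four answer trees with their scores: v1 (score=0.8), v2 (score=0.5), v3 (score=0.5), and v4 (score=0.4). Pairwise similarities are drawn between the trees: four pairs are labeled 0.6 and two pairs are labeled 0.2.]

Suppose the similarity of every pair of results is known, e.g., the values labeled in the figure.
For a fixed number of returned results, one result set is better than another because the results in the other set are similar to each other.
One result set is better than another because it has a larger total score.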

 

Slide28

Diversified Top-K

We should consider both similarity and score.
Let S = {v1, v2, ...} be a list of search results.
Let score(vi) be the score of result vi.
Let sim(vi, vj) be the similarity of vi and vj.
For any vi and vj, vi and vj are similar if sim(vi, vj) > τ.
τ: a user-given threshold

Diversified top-k results, D(S):
At most k results: |D(S)| <= k
No two results in D(S) are similar.
Total score of results in D(S) is maximized.
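The definition can be checked on a tiny instance by exhaustive search. The scores below are the four tree scores from the earlier example; the assignment of the 0.6 and 0.2 similarities to specific pairs is an assumption (0.6 for trees that share a paper), as are the threshold value and the function name. The search is exponential and only meant to illustrate the definition.

```python
from itertools import combinations

def diversified_top_k(scores, sim, k, tau):
    """Best-scoring set of at most k results with no pair more similar than tau."""
    best, best_score = (), 0.0
    for size in range(1, k + 1):
        for subset in combinations(sorted(scores), size):
            if any(sim[frozenset(p)] > tau for p in combinations(subset, 2)):
                continue                      # contains a similar pair
            total = sum(scores[v] for v in subset)
            if total > best_score:
                best, best_score = subset, total
    return best, best_score

scores = {"v1": 0.8, "v2": 0.5, "v3": 0.5, "v4": 0.4}
# Assumed pairing: trees sharing a paper get 0.6, the other two pairs get 0.2.
sim = {frozenset(p): 0.6 for p in [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v3", "v4")]}
sim.update({frozenset(p): 0.2 for p in [("v1", "v4"), ("v2", "v3")]})

best, total = diversified_top_k(scores, sim, k=2, tau=0.5)
print(best, total)
```

Under these assumptions {v1, v4} wins: {v1, v2} has a higher raw total but its members are similar, and {v2, v3} is diverse but scores less.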

 

Slide29

A Diversity Graph

[Figure: a sample diversity graph over the results v1 to v6; the node scores shown are 6, 8, 7, 7, 1, and 10.]

Diversity Graph
Undirected graph G = (V, E)
For any two results vi and vj, there is an edge (vi, vj) in E if and only if vi is similar to vj.
The diversified top-k result set is an independent set of G.
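A sketch of building a diversity graph from a similarity function and a threshold, and of checking that a candidate result set is independent. The Jaccard similarity, the threshold, and the toy results are assumptions made for illustration.

```python
from itertools import combinations

def build_diversity_graph(results, sim, tau):
    """One node per result; an edge joins two results whose similarity exceeds tau."""
    adj = {v: set() for v in results}
    for a, b in combinations(results, 2):
        if sim(a, b) > tau:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def is_independent(adj, chosen):
    """True if no two chosen results are joined by an edge."""
    return all(b not in adj[a] for a, b in combinations(chosen, 2))

# Hypothetical results, each described by the set of papers it contains.
papers = {"v1": {"p1", "p2"}, "v2": {"p1", "p4"}, "v3": {"p3", "p2"}, "v4": {"p3", "p4"}}

def jaccard(a, b):
    return len(papers[a] & papers[b]) / len(papers[a] | papers[b])

graph = build_diversity_graph(list(papers), jaccard, tau=0.3)
print(graph)
```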

 

 

 

Slide30

Existing Top-K Search Frameworks

Most existing top-K search frameworks avoid exploring all search results by finding an early stop condition.

Incremental Top-K
Results are generated one by one in ranked order.
Stops when K results are output.

Bounding Top-K
Results are generated, not necessarily in ranked order.
A non-increasing score upper bound u for unseen results is maintained.
Stops when the K-th largest score generated is no smaller than u.
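The bounding stop condition can be sketched with a min-heap holding the K best scores seen so far. The stream format, a tuple of result, score, and upper bound on every unseen score, is an assumption made for this illustration.

```python
import heapq

def bounding_top_k(stream, K):
    """stream yields (result, score, upper_bound_on_unseen); returns top K and #seen."""
    top = []                                   # min-heap of (score, result)
    seen = 0
    for result, score, unseen_bound in stream:
        seen += 1
        if len(top) < K:
            heapq.heappush(top, (score, result))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, result))
        # Stop once the K-th largest score is no smaller than the unseen bound.
        if len(top) == K and top[0][0] >= unseen_bound:
            break
    return sorted(top, reverse=True), seen

# Hypothetical stream: results arrive out of score order with a shrinking bound.
stream = [("a", 5, 9), ("b", 8, 9), ("c", 9, 7), ("d", 7, 6), ("e", 6, 4), ("f", 3, 0)]
top, seen = bounding_top_k(stream, K=2)
print(top, seen)
```

The search stops after three of the six results, because nothing unseen can score above 7 while the current second-best score is already 8.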

Slide31

Our Framework

We support the existing top-K frameworks
Results are generated one by one.
Stops if a certain stop condition is satisfied.

Our framework
We extend the existing algorithms to get top-K diversified results with three new functions.
sufficient(): a new early stop condition
necessary(): the necessary stop condition
div-search(): search the top-k diversified results on the current results

Step 1: Check the stop condition sufficient(). Stop if sufficient() is satisfied.
Step 2: Generate the next result using the original top-K algorithm.
Step 3: Check the necessary() condition. If necessary() is satisfied, search the diversified top-K results using div-search(). Go to Step 1.
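The three steps can be sketched as a loop around an incremental (ranked) result stream. The stop tests here are simplified stand-ins, not the slide's exact conditions: the sufficient test relaxes the bound by ignoring similarity among seen results, the necessary test is simply "at least k results seen", and div_search is exhaustive. Results are hypothetical (name, score) pairs.

```python
from itertools import combinations

def div_search(S, similar, k):
    """Exhaustive diversified top-k over the results seen so far (tiny inputs only)."""
    best = []
    for size in range(1, k + 1):
        for c in combinations(S, size):
            if all(not similar(a, b) for a, b in combinations(c, 2)):
                if sum(s for _, s in c) > sum(s for _, s in best):
                    best = list(c)
    return best

def diversified_top_k(ranked, similar, k):
    it, S, D = iter(ranked), [], []
    while True:
        # Step 1: sufficient() stand-in. No set using an unseen result can beat D.
        if S:
            unseen = S[-1][1]                  # ranked stream: later scores are no larger
            seen = [s for _, s in S]           # already in non-increasing order
            bound = max(sum(seen[:i]) + (k - i) * unseen for i in range(k))
            if sum(s for _, s in D) >= bound:
                return D, len(S)
        # Step 2: generate the next result with the original top-K algorithm.
        r = next(it, None)
        if r is None:
            return div_search(S, similar, k), len(S)
        S.append(r)
        # Step 3: necessary() stand-in, then div-search on the current results.
        if len(S) >= k:
            D = div_search(S, similar, k)

ranked = [("a1", 10), ("a2", 9), ("a3", 8), ("b1", 7), ("c1", 2), ("c2", 1)]
same_group = lambda x, y: x[0][0] == y[0][0]   # hypothetical similarity: same first letter
result = diversified_top_k(ranked, same_group, k=2)
print(result)
```

The loop returns after four of the six results: once b1 arrives, the diversified pair scores 17, and no pair containing an unseen result can exceed 10 + 7.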

Slide32

 

Sufficient Stop Condition

Sufficient stop condition sufficient()
S: the set of currently generated results
best(S): an upper bound of the optimal solution, calculated from the currently generated results
D_i(S): the current diversified top-i results, with score score(D_i(S))
unseen: the score upper bound of all unseen results
For each i, in the ideal situation, all the remaining k - i results come from the unseen results and each has score unseen.
We have best(S) = max over 0 <= i <= k of ( score(D_i(S)) + (k - i) * unseen ).
The sufficient stop condition is score(D_k(S)) >= best(S).

 

Slide33

Necessary Stop Condition

 

Necessary stop condition necessary()
S: the set of currently generated results
Assume the stop condition of the original algorithm is satisfied; otherwise the algorithm cannot stop.
S': the set of results when necessary() was last satisfied (or the empty set if necessary() has never been satisfied)
If the diversified results found on S' number only i for a certain i < k, at least k - i more results must be generated in order to get k results.
The necessary stop condition is …

 

Slide34

The Possible Search Algorithms

[Figure: a diversity graph over the nodes v0 to v100 and u1 to u100, shown twice. Greedy Solution: score=199. Optimal Solution: score=9900.]

Greed is Not Good
Given the diversity graph for the currently generated result set, finding the diversified top-k results on it is an NP-hard problem.
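The gap between greedy and optimal can be reproduced on a small constructed graph. This is not the slide's exact figure: the graph below (a hub, one extra node, and 100 leaves adjacent to both) is a hypothetical construction chosen so that the totals match the slide's 199 and 9900 for k = 100.

```python
def greedy_independent(adj, score, k):
    """Repeatedly take the highest-scoring node not adjacent to anything chosen."""
    chosen = []
    for v in sorted(score, key=score.get, reverse=True):
        if len(chosen) == k:
            break
        if all(v not in adj[c] for c in chosen):
            chosen.append(v)
    return chosen

leaves = [f"leaf{i}" for i in range(100)]
score = {"hub": 100, "extra": 99, **{v: 99 for v in leaves}}
adj = {"hub": set(leaves), "extra": set(leaves), **{v: {"hub", "extra"} for v in leaves}}

greedy = greedy_independent(adj, score, k=100)
greedy_total = sum(score[v] for v in greedy)     # hub, then extra: 199
leaf_total = sum(score[v] for v in leaves)       # all 100 leaves: 9900
print(greedy_total, leaf_total)
```

Taking the hub first blocks every leaf, so greedy ends with two nodes, while the 100 mutually non-adjacent leaves form a far better solution.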

 

 

 

Slide35

Three New Search Algorithms

We propose three exact algorithms
div-astar: an A* based approach
div-dp: decompose div-astar using an operator
div-cut: further decompose div-dp using two operators

[Figure: panels labeled div-astar, div-dp, and div-cut, each drawn as blocks marked NP.]

Slide36

An A* Based Approach

We use a heap to maintain partial solutions.
Each partial solution records:
the set of results selected in the partial solution
the total score of the results in that set
the upper bound of the score if the partial solution is expanded to a full solution
Entries in the heap are expanded in non-increasing order of their upper bound.
The algorithm stops when the upper bound of the next solution is no larger than the score of the current best solution.
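A best-first sketch in the spirit of the description above. The slide's bound formula is not preserved, so the bound used here is an assumed relaxation: the current score plus the largest remaining scores among later candidates that are not adjacent to the partial solution, ignoring edges among those candidates. The graph, scores, and names are hypothetical.

```python
import heapq

def div_astar(adj, score, k):
    order = sorted(score, key=score.get, reverse=True)   # fixed expansion order

    def bound(chosen, start):
        # Relaxation: ignore edges among the remaining candidates.
        free = [score[v] for v in order[start:] if all(v not in adj[c] for c in chosen)]
        return sum(score[c] for c in chosen) + sum(free[: k - len(chosen)])

    best, best_score = (), 0
    heap = [(-bound((), 0), (), 0)]          # max-heap on the upper bound
    while heap:
        neg_ub, chosen, start = heapq.heappop(heap)
        if -neg_ub <= best_score:            # no remaining entry can beat the best
            break
        total = sum(score[c] for c in chosen)
        if total > best_score:
            best, best_score = chosen, total
        if len(chosen) == k:
            continue
        # Only extend with later nodes, so each set is generated once.
        for i in range(start, len(order)):
            v = order[i]
            if all(v not in adj[c] for c in chosen):
                child = chosen + (v,)
                heapq.heappush(heap, (-bound(child, i + 1), child, i + 1))
    return best, best_score

score = {"a": 10, "b": 8, "c": 7, "d": 7, "e": 6, "f": 1}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f")]
adj = {v: set() for v in score}
for x, y in edges:
    adj[x].add(y)
    adj[y].add(x)

solution = div_astar(adj, score, k=3)
print(solution)
```

On this graph a score-greedy pick would start with a and end at 18, while the search finds the set b, c, e with total 21.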

 

Slide37

An A* Based Approach

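Calculation of the upper bound of a partial solution
The bound is defined by an optimization subject to (s.t.) constraints; it uses the set of adjacent nodes of a result in the diversity graph.
The equation is a relaxation of the optimal solution w.r.t. the partial solution.
One constraint is there to avoid generating redundant results.
The worst-case time to calculate the bound is not preserved in this transcript.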

 

Slide38

An A* Based Approach

[Figure: the sample diversity graph with its node scores.]

An example
Step 1: Expand node (…), with …

Slide39

An A* Based Approach

[Figure: the sample diversity graph with its node scores.]

An example
Step 2: Expand node (…), with …

Slide40

An A* Based Approach

[Figure: the sample diversity graph with its node scores.]

An example
Step 3: Expand node (…), with …

Slide41

An A* Based Approach

[Figure: the sample diversity graph with its node scores.]

An example
Step 4: Expand node (…), with …

Slide42

An A* Based Approach

[Figure: the sample diversity graph with its node scores.]

An example
Step 5: Expand node (…), with …
Current best score is …, and next best score is …: stop
Optimal solution: …

 

Slide43

A DP Based Approach

The diversity graph may contain many disconnected components.
It is costly to apply the A* algorithm on the whole diversity graph.
Combine the results of disconnected components using an operator based on Dynamic Programming (DP).

Dynamic Programming
Suppose G contains two disconnected components G1 and G2.
State D_i(G): the optimal score of the diversified top-i results on G
State transition equation: D_i(G) = max over 0 <= j <= i of ( D_j(G1) + D_{i-j}(G2) )
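The combination step can be sketched as a max-plus merge of two per-component score lists. The two input lists are the per-component scores from the example tables that follow in this transcript (index i holds the score for i results, with 0 where the transcript shows 0); the function name is illustrative.

```python
def combine(s1, s2, k):
    """score_i(G) = max over j of score_j(G1) + score_{i-j}(G2), for i = 0..k."""
    out = []
    for i in range(k + 1):
        out.append(max(s1[j] + s2[i - j] for j in range(i + 1)))
    return out

first = [0, 10, 18, 20, 0, 0]     # per-size scores on the first component
second = [0, 10, 18, 22, 0, 0]    # per-size scores on the second component
combined = combine(first, second, 5)
print(combined)  # [0, 10, 20, 28, 36, 40]
```

The result matches the combined table of the example: for instance, five results are best obtained as two from the first component (18) plus three from the second (22).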

 

 

Slide44

A DP Based Approach

An Example

[Figure: a diversity graph with two disconnected components; the node scores shown include 6, 8, 7, 7, 1, 10 and 10, 6, 7, 8, 9.]

optimal solution: {…}

Score of the best solution with i results (the solution column listing the selected nodes is not preserved):

i    first component    second component    combined
0    0                  0                   0
1    10                 10                  10
2    18                 18                  20
3    20                 22                  28
4    0                  0                   36
5    0                  0                   40

 

 

 

 

 

 

Slide45

A Cut Point Based Approach

Cut point of a graph
Suppose G is a connected graph.
A cut point is a point whose removal makes G disconnected.
G can be further decomposed using cut points.

Suppose v is a cut point of G; there are two situations:
Case 1: v is excluded from the final solution.
After removing v, G becomes several disconnected components.
Case 2: v is included in the final solution.
After removing v and all of v’s adjacent nodes, G becomes several disconnected components.
Add v to each result in Case 2.
The results of Case 1 and Case 2 are combined using an operator to compute the solution on G.

 

Slide46

A Cut Point Based Approach

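Let v be a cut point of G.
Let the first solution be the one obtained by excluding v.
Let the second solution be the one obtained by including v.
The two solutions are mutually exclusive.
D_i(G): the optimal score of the diversified top-i results on G
Calculating D_i(G): for each i, take the larger of the two solutions' scores.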
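A sketch of the two cases for a single cut point. The components left after each removal are solved exhaustively (a stand-in for div-astar), merged with the max-plus combination, and the two cases are merged by an element-wise maximum. Scores are read as "best with at most i results"; the graph and all names are hypothetical.

```python
from itertools import combinations

def best_scores(nodes, adj, score, k):
    """best[i] = best total of an independent set with at most i nodes (exhaustive)."""
    best = [0] * (k + 1)
    for i in range(1, k + 1):
        best[i] = best[i - 1]
        for c in combinations(sorted(nodes), i):
            if all(b not in adj[a] for a, b in combinations(c, 2)):
                best[i] = max(best[i], sum(score[v] for v in c))
    return best

def components(nodes, adj):
    nodes, comps = set(nodes), []
    while nodes:
        stack, comp = [nodes.pop()], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u] & nodes:
                nodes.discard(v)
                stack.append(v)
        comps.append(comp)
    return comps

def combine(s1, s2):
    return [max(s1[j] + s2[i - j] for j in range(i + 1)) for i in range(len(s1))]

def solve_split(nodes, adj, score, k):
    total = [0] * (k + 1)
    for comp in components(nodes, adj):
        total = combine(total, best_scores(comp, adj, score, k))
    return total

def with_cut_point(nodes, adj, score, cut, k):
    excluded = solve_split(nodes - {cut}, adj, score, k)             # Case 1
    rest = solve_split(nodes - {cut} - adj[cut], adj, score, k - 1)  # Case 2
    included = [0] + [score[cut] + r for r in rest]                  # add the cut point back
    return [max(a, b) for a, b in zip(excluded, included)]

score = {"a": 6, "b": 8, "x": 13, "d": 7, "e": 10}
adj = {"a": {"b", "x"}, "b": {"a"}, "x": {"a", "d"}, "d": {"x", "e"}, "e": {"d"}}
result = with_cut_point(set(score), adj, score, cut="x", k=3)
print(result)
```

Here x separates {a, b} from {d, e}; including x removes a and d as well, leaving b and e as single-node components.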

 

 

Slide47

A Cut Point Based Approach

Handling multiple cut points
Step 1: Construct a cut-point tree (cptree).
Each node: associated with a cut point (a leaf node is associated with a virtual cut point).
Each edge: associated with a subgraph that connects two cut points (the subgraph can be empty or disconnected).
A sample cptree: [figure]
Step 2: Search the cptree in a bottom-up fashion.

 

 

 

 

 

 

 

 

 

 

 

 

 

Slide48

 

 

 

 

 

 

 

 

 

 

A Cut Point Based Approach

An Example
Suppose four partial solutions (…, …, …, …) have been computed.
We now compute … and …

Slide49

 

 

 

 

 

 

 

 

 

 

A Cut Point Based Approach

An Example
Computing the first of the two solutions:
(Case 1) the cut point is excluded.
(Case 2) the cut point is included; the subgraph used is the result after removing the adjacent nodes of the cut point.
The other parts can be computed similarly.

Slide50

 

 

 

 

 

 

 

 

 

 

A Cut Point Based Approach

An Example
Computing the second of the two solutions:
(Case 1) the cut point is excluded.
(Case 2) the cut point is included.
The other parts can be computed similarly.
Do not forget to add the included cut point to all the results of this solution.

Slide51

A Cut Point Based Approach

An Example

[Figure: the example diversity graph, now including a node with score 13.]

Score of the best solution with i results (the solution column listing the selected nodes is not preserved). The final table is the element-wise maximum of the first two.

i    table 1    table 2    final
0    0          0          0
1    13         10         13
2    23         20         23
3    33         28         33
4    36         36         36
5    39         40         40

 

Slide52

Further Improvements

Example
A node can be removed from the diversity graph when there exists another node s.t. a condition holds (the condition is not preserved in this transcript).
After the removal, two further nodes become cut points.

[Figure: the example diversity graph before and after the removal.]

Slide53

Performance Studies

Experimental Setup
We use 2 real datasets: Enwiki and Reuters
Enwiki: 11,930,681 articles from English Wikipedia
Reuters: 21,578 news articles from Reuters
Query: a set of keywords
Answer: top-k documents

We compare three algorithms
div-astar: A* based approach
div-dp: Dynamic programming based approach
div-cut: Cut point based approach

We vary 3 parameters:
k (two groups)
Small k: 40, 80, 120, 160, 200, default 120
Large k: 500, 700, 900, 1300, 2000, default 900
Similarity threshold: 0.4, 0.5, 0.6, 0.7, 0.8, default 0.6
Keyword frequency: 5 levels (1, 2, 3, 4, 5), default 3

 

Slide54

Performance Studies

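Score function:
Given a query and a document, the score uses the term frequency of each keyword, a quantity defined for the dataset, and the total number of words in the document.
Similarity function:
Given two documents, the similarity is computed between them.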

 

 

 

Slide55

Performance Studies

[Figure: performance on Enwiki when varying k, with panels for small k and large k.]

 

Slide56

Conclusion

We study diversified ranking.
We study the diversified top-k search problem.
The diversity uses only the similarity of the search results themselves.
We propose a framework such that most top-k algorithms can be easily extended to handle diversified top-k search.

 

Slide57

APWeb 2013 in Sydney, Australia

The 15th International Asia-Pacific Web Conference (APWeb), 4-6 April, 2013, Sydney, Australia
Just before ICDE 2013.
Paper Submission Deadline: October 20.
Three Keynote Speakers
H.V. Jagadish (University of Michigan)
Dan Suciu (University of Washington)
Mark Sanderson (RMIT)
A Special Issue on WWW Journal

Slide58

Research Postgraduate Study at SEEM/CUHK [www.se.cuhk.edu.hk/programmes]

Research Postgraduate Programs
M.Phil, PhD, M.Phil-PhD (Articulated)
Deadlines:
December 1, 2012 (First Round)
January 31, 2013 (Official Final Round). But, due to Chinese New Year, submit early, before January 20.
Postgraduate Studentship: HK$13,600 per month (non-taxable)
Current Tuition Fees: HK$42,100/year

Hong Kong PhD Fellowship Scheme 2013-2014 (135 positions in HK)
Deadline: December 1, 2012
Monthly stipend of HK$20,000
10,000 travel allowance
Current Tuition Fees: HK$42,100/year

Slide59

Taught Postgraduate Study at SEEM/CUHK [www.se.cuhk.edu.hk/programmes]

Taught Postgraduate Programmes
MSc Programme in SEEM (Systems Engineering and Engineering Management)
MSc Programme in ECLT (E-Commerce and Logistics Technologies)
Current Tuition Fees: (Provisional) HK$128,000
Full-Time One-Year study in HK
Application deadline:
1st Round: January 15, 2013
2nd Round: March 15, 2013
Early applications are encouraged; offers may be made to eligible applicants well before March 15.

Slide60

Thank you!

Questions?