Slide1
Dealing with Diversity in Mining and Query Processing
Jeffrey Xu Yu (于旭)
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong
yu@se.cuhk.edu.hk
Slide2
Books on Social Networks
Social and Economic Networks, by Matthew O. Jackson
Social Network Data Analysis, by Charu C. Aggarwal
Exploratory Social Network Analysis with Pajek, by Wouter de Nooy, Andrej Mrvar, and Vladimir Batagelj
Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg
Networks: An Introduction, by M.E.J. Newman
Slide3
Some Online Courses
Mining of Massive Datasets (Anand Rajaraman and Jeff Ullman): http://infolab.stanford.edu/~ullman/mmds.html
Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg: http://www.cs.cornell.edu/home/kleinber/networks-book
Topics in Data Management & Mining – Social Networks, Laks V.S. Lakshmanan: http://www.cs.ubc.ca/~laks/534l/cpsc534l.html
Slide4
Stanford Large Network Dataset Collection
http://snap.stanford.edu/data
Social networks
Communication networks
Citation networks
Collaboration networks
Web graphs
Amazon networks
Internet networks
Road networks
Autonomous systems
Signed networks
Wikipedia networks and metadata
Twitter and Memetracker
Slide5
Graph Database
http://en.wikipedia.org/wiki/Graph_database
Pregel: Google’s internal graph processing platform
Trinity: Microsoft Research Asia
Neo4j: commercial graph database
…
Slide6
Diversified Ranking
Why diversified ranking?
Users' information requirements are diverse.
Queries are often incomplete.
Slide7
Problem Statement
For query-dependent diversified ranking, the goal is to find K nodes in a graph that are relevant to the query node and dissimilar to each other.
For query-independent diversified ranking, the goal is to find K high-prestige nodes in a graph that are dissimilar to each other.
Main applications
Ranking nodes in social networks, ranking papers, etc.
Slide8
Challenges
Diversity measures
No widely accepted diversity measures on graphs in the literature.
Scalability
Most existing methods do not scale to large graphs.
Lack of intuitive interpretation.
Slide9
Existing Methods
Grasshopper [Zhu, et al., HLT-NAACL’07]
ManiRank [Zhu, et al., WWW’11]
DivRank [Mei, et al., KDD’10]
DRAGON [Tong, et al., KDD’11]
Resistive Graph Centers [Dubey, et al., KDD’11]
Slide10
Grasshopper/ManiRank
The main idea
Work in an iterative manner.
Select one node per iteration by a random walk.
Set the selected node to be an absorbing node, and perform the random walk again to select the second node.
Repeat the same process for K iterations to get K nodes.
No diversity measure
Diversity is achieved only by intuition and experiments.
Cannot scale to large graphs (time complexity O(…))
Grasshopper/ManiRank
Figure panels: initial random walk with no absorbing states; absorbing random walk after ranking the first item.
Slide12
DivRank
Based on a vertex-reinforced random walk.
No diversity measure.
Convergence properties are not clear.
Time and space complexity is …
DRAGON, Resistive Graph Centers
DRAGON [Tong, et al., KDD’11]
The diversity measure lacks a clear topological interpretation.
Resistive Graph Centers [Dubey, et al., KDD’11]
Based on personalized PageRank with a learnable teleportation parameter.
Does not scale to large graphs.
A Summary
Comparison with existing methods
Slide15
Our Approach
The main idea
Relevance of the top-K nodes (denoted by a set S) is achieved by large (Personalized) PageRank scores.
Diversity of the top-K nodes is achieved by a large expansion ratio.
Expansion ratio of a node set S: σ(S) = |N(S)|/n
A larger expansion ratio implies better diversity.
Slide16
The K-step Expansion
K-step expansion ratio of S: σ_k(S) = |N_k(S)|/n
Our diversity measures
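The expansion ratio can be illustrated with a short Python sketch. It assumes that N_k(S) contains S itself together with every node within k hops of S, and that n is the number of nodes in the graph; the slides do not spell out either convention, and the path graph and the function names are hypothetical.

```python
from collections import deque

def k_step_neighborhood(adj, seeds, k):
    """Nodes within k hops of any seed, seeds included (multi-source BFS)."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seen)
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue  # do not expand beyond k hops
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen

def expansion_ratio(adj, seeds, k=1):
    """sigma_k(S) = |N_k(S)| / n for a graph given as an adjacency dict."""
    return len(k_step_neighborhood(adj, seeds, k)) / len(adj)

# Hypothetical path graph 0-1-2-3-4-5
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
clustered = expansion_ratio(path, [0, 1])  # N(S) = {0, 1, 2}
spread = expansion_ratio(path, [0, 4])     # N(S) = {0, 1, 3, 4, 5}
```

Two adjacent nodes cover half of the path, while two nodes placed far apart cover five of its six nodes, which is the sense in which a larger expansion ratio indicates a more diverse set.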
Slide17
A Discrete Optimization Problem
The diversified ranking problem on a graph is posed as a discrete optimization problem, Eq. (1).
Submodularity: F(S) is shown to be submodular and non-decreasing.
The greedy algorithm
A 1-1/e approximation algorithm for solving Eq. (1).
Linear time and space complexity w.r.t. the size of the graph.
Slide18
The Greedy Algorithm
Works in K rounds
Selects a node with maximal marginal gain at each round
Marginal gain
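The K-round selection can be sketched in Python as follows. The objective used here, F(S) = (1 - lam) * sum of relevance scores + lam * |N(S)|/n, is an assumption made for illustration only, since Eq. (1) is not legible in this text. The graph, the relevance values, and the names `objective` and `greedy_select` are hypothetical, and this naive version recomputes F from scratch rather than achieving the linear-time bound stated on the slide.

```python
def neighborhood(adj, nodes):
    """N(S): the nodes of S plus all their direct neighbors (assumed definition)."""
    covered = set(nodes)
    for u in nodes:
        covered.update(adj[u])
    return covered

def objective(adj, rel, nodes, lam):
    """Assumed objective: weighted relevance plus weighted expansion ratio."""
    relevance = sum(rel[u] for u in nodes)
    expansion = len(neighborhood(adj, nodes)) / len(adj)
    return (1 - lam) * relevance + lam * expansion

def greedy_select(adj, rel, K, lam=0.5):
    """K rounds; each round adds the node with the largest marginal gain."""
    selected = []
    for _ in range(min(K, len(adj))):
        base = objective(adj, rel, selected, lam)
        best_node, best_gain = None, float("-inf")
        for u in adj:
            if u in selected:
                continue
            gain = objective(adj, rel, selected + [u], lam) - base
            if gain > best_gain:
                best_node, best_gain = u, gain
        selected.append(best_node)
    return selected

# Hypothetical graph: two stars with hubs "a" and "b"
adj = {"a": ["a1", "a2", "a3"], "a1": ["a"], "a2": ["a"], "a3": ["a"],
       "b": ["b1", "b2"], "b1": ["b"], "b2": ["b"]}
rel = {"a": 0.30, "a1": 0.25, "a2": 0.10, "a3": 0.10,
       "b": 0.15, "b1": 0.05, "b2": 0.05}
picked = greedy_select(adj, rel, K=2)
```

Ranking by relevance alone would return "a" and "a1", which sit in the same star. Under the assumed objective the second pick is "b", because "a1" adds nothing to the neighborhood already covered by "a".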
Slide19
Generalized Diversified Ranking Optimization
Maximize F_k(S) subject to the cardinality constraint |S| <= K
Submodularity: F_k(S) is shown to be submodular and non-decreasing.
Randomized greedy algorithm
Near 1-1/e approximation algorithm.
Linear time and space complexity w.r.t. the size of the graph.
Slide20
Generalized Diversified Ranking Optimization
Randomized greedy algorithm
Same idea as the greedy algorithm
Works in K rounds
At each round, select the node with maximal marginal gain.
But evaluating the maximal marginal gain is expensive.
Our idea: use a probabilistic counting data structure to sketch the k-step neighborhood of each node.
Marginal gain
Slide21
FM Sketch and Its Properties
A probabilistic counting structure, devised by Flajolet and Martin.
It can be used to estimate the cardinality of a multi-set using only log C + t bits, where C denotes the cardinality and t is a small constant.
Each FM Sketch is a (log C + t)-bit bitmap.
Advantage: to estimate the cardinality of the union of two multi-sets, we only need a bitwise-OR between the two FM Sketches.
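A minimal single-bitmap FM sketch in Python may make the union property concrete. The 32-bit bitmap length, the use of SHA-256 as the hash, and the function names are assumptions for this illustration; 0.77351 is the correction constant from Flajolet and Martin's analysis. A single bitmap gives a high-variance estimate, so practical uses average many bitmaps, which this sketch omits.

```python
import hashlib

NUM_BITS = 32   # bitmap length (assumed for this sketch)
PHI = 0.77351   # Flajolet-Martin correction constant

def _rho(item):
    """Index of the least-significant 1 bit of a deterministic hash of item."""
    digest = hashlib.sha256(str(item).encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    if h == 0:
        return NUM_BITS - 1
    return min((h & -h).bit_length() - 1, NUM_BITS - 1)

def fm_sketch(items):
    """Bitmap with bit rho(x) set for every item x; duplicates change nothing."""
    bitmap = 0
    for item in items:
        bitmap |= 1 << _rho(item)
    return bitmap

def fm_union(a, b):
    """Sketch of the union of two multi-sets is the bitwise OR of their sketches."""
    return a | b

def fm_estimate(bitmap):
    """Estimate the cardinality as 2^R / PHI, where R is the lowest unset bit."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return (2 ** r) / PHI

left = fm_sketch(range(0, 600))
right = fm_sketch(range(400, 1000))
both = fm_union(left, right)
```

Because each item always sets the same bit, `both` is identical to the sketch built directly from all 1000 items, so the union never needs the original sets.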
Slide22
The Randomized Greedy Algorithm
Randomized greedy algorithm
For each node u, use an FM Sketch to sketch N_k({u}).
Use the following rule to sketch N_k({u}), which can be implemented in a recursive way: …
Use an FM Sketch to sketch N_k(S).
Evaluating the marginal gain can be implemented by a bitwise-OR between the sketches of N_k(S) and N_k({u}).
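The recursive sketching rule is missing from this text, so the Python sketch below assumes a natural one: N_0({u}) = {u}, and N_j({u}) is the union of N_(j-1)({u}) with N_(j-1)({v}) over all neighbors v of u. To keep the output checkable, it uses exact bitmaps with one bit per node in place of FM sketches; with FM sketches the same OR operations apply and the bit count is replaced by the sketch's cardinality estimate. All names and the path graph are hypothetical.

```python
def k_step_bitmaps(adj, k, single):
    """Bitmap of N_k({u}) for every node u, built by k rounds of OR-propagation."""
    sketch = {u: single(u) for u in adj}       # round 0: N_0({u}) = {u}
    for _ in range(k):
        nxt = {}
        for u in adj:
            bits = sketch[u]
            for v in adj[u]:
                bits |= sketch[v]              # set union as bitwise OR
            nxt[u] = bits
        sketch = nxt
    return sketch

def popcount(bits):
    """Exact cardinality of a one-bit-per-node bitmap."""
    return bin(bits).count("1")

# Hypothetical path graph 0-1-2-3-4, k = 2
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
nk = k_step_bitmaps(path, 2, single=lambda u: 1 << u)

covered = nk[0]                                            # N_2(S) for S = {0}
gain_of_1 = popcount(covered | nk[1]) - popcount(covered)  # adds node 3 only
gain_of_4 = popcount(covered | nk[4]) - popcount(covered)  # adds nodes 3 and 4
```

Each marginal gain costs one OR and one count, regardless of how large the neighborhoods are, which is the point of sketching them once up front.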
Slide23
Experimental Studies
We conduct experiments on 5 real networks (3 collaboration networks, 1 citation network, and 1 social network).
We show some results on Flickr, a popular photo-sharing website (data from the ASU social computing data repository).
Undirected social network (80,513 nodes, 5,899,882 edges, and 195 different groups)
Slide24
Some Testing Results on Flickr
Slide25
Make a Top-K Algorithm Diversified
Figure: the result of searching “apple” in Google image search.
Existing top-K search algorithms
Search results are ranked independently.
When searching “apple” in Google image search, 9 out of the top 15 results are the logo of Apple Inc.
Structural Keyword Search (1)
Example: Keyword Search in Graphs
Input: a graph with text information on each node, and a user-given keyword query
Output: top-k minimal Steiner trees that contain all user-given keywords
Figure: a DBLP example for the keywords “graph patterns” and “keyword search”. Author node a1 (Author: Jiawei Han) is linked through action nodes w1 to w4 (Action: Write) to paper nodes p1 to p4. The paper titles shown are “Mining Graph Patterns” (p1), “Mining Significant Graph Patterns by Leap Search”, “Optimizing Index for Taxonomy Keyword Search”, and “Keyword Search in Text Cube: Finding Top-k Aggregated Cell Documents”.
The four results are the trees v1 (a1, w1, p1, w2, p2), v2 (a1, w1, p1, w4, p4), v3 (a1, w3, p3, w2, p2), and v4 (a1, w3, p3, w4, p4).
Slide27
Structural Keyword Search (2)
Figure: the four results with their scores: v1 (score=0.8), v2 (score=0.5), v3 (score=0.5), v4 (score=0.4). The similarity labels between results are 0.6 (four pairs) and 0.2 (two pairs).
Suppose the similarity of … and … is …, e.g., …
Let …
… is better than … because … and … are similar to each other
… is better than … because … has a larger total score
Diversified Top-K
We should consider both similarity and score.
We are given a list of search results, a score for each result, and a similarity for each pair of results.
Whether two results are similar is decided from their similarity and a user-given threshold.
Diversified top-K results:
At most K results
No two results in the answer are similar
The total score of the results in the answer is maximized
A Diversity Graph
Figure: an example diversity graph on results v1 to v6; the score labels shown are 6, 8, 7, 7, 1, and 10.
Diversity Graph
Undirected graph
For any two results, there is an edge between them in the diversity graph if and only if they are similar to each other.
The diversified top-K result set is an independent set of the diversity graph.
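The independent-set view can be checked by brute force on the four-result example. The sketch below is an illustration under stated assumptions: the scores 0.8, 0.5, 0.5, 0.4 and the similarity values 0.6 and 0.2 come from the slides, but which pairs carry which similarity, the threshold 0.5, the rule "similar means similarity above the threshold", and K = 2 are all assumed here. Exhaustive enumeration is only practical for tiny inputs.

```python
from itertools import combinations

def diversity_graph(sim, threshold):
    """Edge set: pairs whose similarity exceeds the threshold (assumed rule)."""
    return {pair for pair, s in sim.items() if s > threshold}

def is_independent(nodes, edges):
    """True if no two of the given nodes are joined by an edge."""
    return all(frozenset(p) not in edges for p in combinations(nodes, 2))

def diversified_top_k(scores, edges, k):
    """Exhaustive search for the best independent set with at most k nodes."""
    best, best_score = (), 0.0
    for size in range(1, k + 1):
        for cand in combinations(sorted(scores), size):
            if is_independent(cand, edges):
                total = sum(scores[v] for v in cand)
                if total > best_score:
                    best, best_score = cand, total
    return set(best), best_score

scores = {"v1": 0.8, "v2": 0.5, "v3": 0.5, "v4": 0.4}
sim = {  # assumed assignment of the 0.6 / 0.2 labels to pairs
    frozenset(("v1", "v2")): 0.6, frozenset(("v1", "v3")): 0.6,
    frozenset(("v2", "v4")): 0.6, frozenset(("v3", "v4")): 0.6,
    frozenset(("v1", "v4")): 0.2, frozenset(("v2", "v3")): 0.2,
}
edges = diversity_graph(sim, threshold=0.5)
answer, total = diversified_top_k(scores, edges, k=2)
```

Under these assumptions the plain top-2 by score, v1 and v2, is rejected because the pair is joined by an edge, and the answer is v1 with v4, whose total beats the other independent pair v2 with v3.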
Existing Top-K Search Frameworks
Most existing top-K search frameworks avoid exploring all search results by finding an early stop condition.
Incremental Top-K
Results are generated one by one in ranked order.
Stops when K results are output.
Bounding Top-K
Results are generated not necessarily in ranked order.
A non-increasing score upper bound u for unseen results is maintained.
Stops when the K-th largest score generated is no smaller than u.
Slide31
Our Framework
We support the existing top-K frameworks
Results are generated one by one
Stops if a certain stop condition is satisfied
Our framework
We extend the existing algorithms to get top-K diversified results by three new functions.
sufficient(): a new early stop condition
necessary(): the necessary stop condition
div-search(): search top-K diversified results on the current results
Step 1: Check the stop condition sufficient(). Stop if sufficient() is satisfied.
Step 2: Generate the next result using the original top-K algorithm.
Step 3: Check the necessary() condition. If necessary() is satisfied, search the diversified top-K results using div-search(). Go to Step 1.
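The three-step loop can be sketched in Python with the three functions passed in as callables. The function name `diversified_search`, the callable signatures, and the toy stand-ins used in the demonstration are all assumptions; the real sufficient(), necessary(), and div-search() are defined on other slides and are not reproduced here.

```python
def diversified_search(next_result, sufficient, necessary, div_search):
    """Wrap an existing top-K generator with the three new functions."""
    results, best = [], []
    while not sufficient(results, best):      # Step 1: early stop check
        item = next_result()                  # Step 2: original top-K algorithm
        if item is None:                      # generator exhausted
            best = div_search(results)
            break
        results.append(item)
        if necessary(results):                # Step 3: maybe re-run the search
            best = div_search(results)
    return best

# Toy stand-ins, for illustration only
stream = iter([("v1", 0.8), ("v2", 0.5), ("v3", 0.5), ("v4", 0.4)])
best = diversified_search(
    next_result=lambda: next(stream, None),
    sufficient=lambda results, best: len(results) >= 3,
    necessary=lambda results: len(results) % 2 == 1,
    div_search=lambda results: [max(results, key=lambda r: r[1])],
)
leftover = next(stream)  # the fourth result was never generated
```

The design point is that the expensive div_search call runs only when necessary() allows it, and result generation stops as soon as sufficient() holds, so the fourth result in the toy stream is never requested.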
Slide32
Sufficient Stop Condition
Sufficient stop condition sufficient()
…: the set of current generated results
…: an upper bound of the optimal solution calculated from the current generated results
…: the current diversified top-K results, with score …
…: the score upper bound of all unseen results
For each …, in the ideal situation, for the unseen results, all the remaining … results are set to be …
We have …
The sufficient stop condition is …
Necessary Stop Condition
Necessary stop condition necessary()
…: the set of current generated results
Assume the stop condition of the original algorithm is satisfied; otherwise the algorithm cannot stop.
…: the set of results at the last time necessary() was satisfied (or … if necessary() has never been satisfied)
If … for a certain …, we need at least … more results generated in order to get … results
The necessary stop condition is …
The Possible Search Algorithms
Given the diversity graph for the current generated result set, finding the diversified top-K results on it is an NP-hard problem.
Greed is Not Good
Figure: a diversity graph with nodes v0, v1 to v100, and u1 to u100; the score labels shown include 100, 99, 1, and 0.5.
Greedy Solution: score=199
Optimal Solution: score=9900
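The gap between the greedy and the optimal totals can be reproduced with a small Python sketch. The instance below is hypothetical: it is chosen so that the two totals match the 199 and 9900 quoted on the slide, and it is not claimed to be the graph in the figure. A hub v0 with score 100 is similar to v1 to v100 (score 99 each), there are 99 unrelated results of score 1, and K = 100.

```python
def greedy_independent(scores, adj, k):
    """Repeatedly take the highest-scoring result not similar to a chosen one."""
    chosen, blocked = [], set()
    for v in sorted(scores, key=scores.get, reverse=True):
        if len(chosen) == k:
            break
        if v not in blocked:
            chosen.append(v)
            blocked.update(adj.get(v, ()))   # neighbors become ineligible
    return chosen

# Hypothetical instance
scores = {"v0": 100}
scores.update({f"v{i}": 99 for i in range(1, 101)})
scores.update({f"u{i}": 1 for i in range(1, 100)})
adj = {"v0": {f"v{i}" for i in range(1, 101)}}
for i in range(1, 101):
    adj[f"v{i}"] = {"v0"}

greedy = greedy_independent(scores, adj, k=100)
greedy_score = sum(scores[v] for v in greedy)

# v1..v100 are pairwise non-adjacent, so they form a valid answer of size 100
alternative = [f"v{i}" for i in range(1, 101)]
alternative_score = sum(scores[v] for v in alternative)
```

Taking v0 first blocks all 100 results of score 99, so greedy finishes with 100 + 99 * 1 = 199, while skipping v0 yields 100 * 99 = 9900.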
Three New Search Algorithms
We propose three exact algorithms
div-astar: an A* based approach
div-dp: decompose div-astar using operator …
div-cut: further decompose div-dp using operators … and …
Figure: the NP-hard subproblems handled by div-astar, div-dp, and div-cut.
Slide36
An A* Based Approach
We use a heap to maintain partial solutions.
Each partial solution records:
the set of results selected in the partial solution
the total score of the results in that set
the upper bound of the score if the partial solution is expanded to a full solution
Entries in the heap are expanded in non-increasing order of their upper bound.
The algorithm stops when the upper bound of the next solution is no larger than the score of the current best solution.
An A* Based Approach
Calculation of …
… is the set of adjacent nodes of … in …
The equation is a relaxation of the optimal solution w.r.t. …
… is to avoid generating redundant results
… can be calculated in … time in the worst case
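A heap-driven best-first search of this kind can be sketched in Python. The upper bound used below is a simple admissible one chosen for illustration (current score plus the best remaining non-conflicting scores, ignoring conflicts among them); it is not the bound from the slide, whose equation is missing here. The function name, the data layout, and the assumption that scores are positive are also hypothetical.

```python
import heapq
from itertools import count

def div_astar(scores, adj, k):
    """Best-first search for a max-score independent set with at most k nodes."""
    order = sorted(scores, key=scores.get, reverse=True)

    def bound(total, chosen, start):
        # Optimistic completion: ignore conflicts among the remaining candidates.
        room, extra = k - len(chosen), 0
        for v in order[start:]:
            if room == 0:
                break
            if not (adj.get(v, set()) & chosen):
                extra += scores[v]
                room -= 1
        return total + extra

    best, best_score = frozenset(), 0
    tie = count()  # tie-breaker so heap entries never compare sets
    heap = [(-bound(0, frozenset(), 0), next(tie), 0, 0, frozenset())]
    while heap:
        neg_ub, _, total, start, chosen = heapq.heappop(heap)
        if -neg_ub <= best_score:
            break  # no remaining entry can beat the current best
        for i in range(start, len(order)):
            v = order[i]
            if adj.get(v, set()) & chosen:
                continue  # v is similar to a selected result
            new_chosen, new_total = chosen | {v}, total + scores[v]
            if new_total > best_score:
                best, best_score = new_chosen, new_total
            if len(new_chosen) < k:
                ub = bound(new_total, new_chosen, i + 1)
                heapq.heappush(heap, (-ub, next(tie), new_total, i + 1, new_chosen))
    return set(best), best_score

scores = {"v1": 0.8, "v2": 0.5, "v3": 0.5, "v4": 0.4}
adj = {"v1": {"v2", "v3"}, "v2": {"v1", "v4"},
       "v3": {"v1", "v4"}, "v4": {"v2", "v3"}}
answer, total = div_astar(scores, adj, k=2)
```

Expanding only candidates that come later in the score order is one way to avoid generating the same set twice, and the early break mirrors the stop rule of comparing the next upper bound with the current best score.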
An A* Based Approach
An example (…)
Figure: the diversity graph, with node scores as labels.
Step 1: Expand node (…), with …
Step 2: Expand node (…), with …
Step 3: Expand node (…), with …
Step 4: Expand node (…), with …
Step 5: Expand node (…), with …
Current best score is …, and next best score is …: stop
Optimal solution: …
A DP Based Approach
The diversity graph may contain many disconnected components.
It is costly to apply the A* algorithm on the whole diversity graph.
Combine the results of disconnected components using operator …, based on Dynamic Programming (DP).
Dynamic Programming
Suppose … contains two disconnected components … and …
State …: the optimal score of the diversified top-… results on …
State transition equation: …
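The state transition is not legible in this text. A plausible form, assumed here, is that the best score with i results on the whole graph is the maximum over j of the best score with j results on one component plus the best score with i - j results on the other. The Python sketch below applies that rule to the score columns of the deck's DP example, where an entry of 0 marks a size the component cannot supply; the function name `combine` is hypothetical.

```python
def combine(s1, s2, k):
    """Merge per-component score tables: out[i] = max_j s1[j] + s2[i - j]."""
    out = [0] * (k + 1)
    for i in range(k + 1):
        out[i] = max(s1[j] + s2[i - j] for j in range(i + 1))
    return out

# Score columns for i = 0..5, taken from the example tables in the deck
first_component = [0, 10, 18, 20, 0, 0]
second_component = [0, 10, 18, 22, 0, 0]
merged = combine(first_component, second_component, k=5)
```

The merged column equals the third table of the example (0, 10, 20, 28, 36, 40), which supports this reading of the transition. Each merge costs on the order of k squared additions, independent of the component sizes.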
A DP Based Approach
An Example (…)
Figure: a diversity graph with two disconnected components; the node scores shown include 6, 8, 7, 7, 1, 10 and 10, 6, 7, 8, 9.
Score tables (i: number of results; s: score):
i: 0, 1, 2, 3, 4, 5
s, first table: 0, 10, 18, 20, 0, 0
s, second table: 0, 10, 18, 22, 0, 0
s, combined table: 0, 10, 20, 28, 36, 40
optimal solution: {…}
A Cut Point Based Approach
Cut point of a graph
Suppose the graph is connected.
A cut point is a point whose removal makes the graph disconnected.
The graph can be further decomposed using cut points.
Suppose … is a cut point of the graph; there are two situations:
…: the cut point is excluded from the final solution
After removing the cut point, the graph becomes several disconnected components.
…: the cut point is included in the final solution
After removing the cut point and all of its adjacent nodes, the graph becomes several disconnected components.
Add the cut point to each result in …
… and … are combined using operator … to compute …
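Cut points are what graph theory calls articulation points, and they can be found with a single depth-first search. The Python sketch below uses the standard low-link technique; it is a generic illustration rather than the deck's own procedure, it assumes a simple undirected graph given as an adjacency dict, and the example graph is hypothetical. It is recursive, so very deep graphs would need an iterative variant.

```python
def cut_points(adj):
    """Articulation points of an undirected simple graph via DFS low-links."""
    disc, low, found = {}, {}, set()
    clock = [0]

    def dfs(u, parent):
        disc[u] = low[u] = clock[0]
        clock[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])      # back edge
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # v's subtree cannot reach above u without passing through u
                if parent is not None and low[v] >= disc[u]:
                    found.add(u)
        if parent is None and children > 1:        # root with several subtrees
            found.add(u)

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return found

# Hypothetical graph: two triangles sharing the node "c"
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d", "e"],
         "d": ["c", "e"], "e": ["c", "d"]}
points = cut_points(graph)
```

Removing "c" leaves the components {a, b} and {d, e}, which can then be solved separately and merged, in the spirit of the two situations listed above.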
A Cut Point Based Approach
Let … be a cut point of …
Let … be the solution by excluding …
Let … be the solution by including …
… and … are mutually exclusive with each other
…: the optimal score of the diversified top-… results on …
Calculating …
A Cut Point Based Approach
Handling multiple cut points
Step 1: Construct a cut-point tree (cptree)
Each node: associated with a cut point (a leaf node is associated with a virtual cut point)
Each edge: associated with a subgraph that connects two cut points (the subgraph can be empty or disconnected)
A sample cptree: …
Step 2: Search the cptree in a bottom-up fashion
A Cut Point Based Approach
An Example
Suppose …, …, …, … have been computed
We now compute … and …
Slide49
A Cut Point Based Approach
An Example
Computing …
(Case 1) … is excluded: …
(Case 2) … is included: …
… is the result after removing the adjacent nodes of … from …
We have …
… can be computed similarly
Slide50
A Cut Point Based Approach
An Example
Computing …
(Case 1) … is excluded: …
(Case 2) … is included: …
We have …
… can be computed similarly
Do not forget to add {…} to all the results of …
Slide51
A Cut Point Based Approach
An Example (…)
Figure: the example diversity graph with a cut point; the node scores shown include 6, 8, 7, 7, 1, 10, 10, 6, 7, 8, 9, and 13.
Score tables (i: number of results; s: score):
i: 0, 1, 2, 3, 4, 5
s, first table: 0, 13, 23, 33, 36, 39
s, second table: 0, 10, 20, 28, 36, 40
s, final table: 0, 13, 23, 33, 36, 40 (the entry-wise maximum of the first two tables)
Further Improvements
Example: … can be removed from …
There exists … s.t. …
After removing …, … and … become cut points
Figure: the example diversity graph before and after the removal.
Slide53
Performance Studies
Experimental Setup
We use 2 real datasets: Enwiki and Reuters
Enwiki: 11,930,681 articles from English Wikipedia
Reuters: 21,578 news articles from Reuters
Query: a set of keywords
Answer: top-K documents
We compare three algorithms
div-astar: A* based approach
div-dp: Dynamic programming based approach
div-cut: Cut point based approach
We vary 3 parameters:
K (two groups)
Small K: 40, 80, 120, 160, 200, default 120
Large K: 500, 700, 900, 1300, 2000, default 900
Similarity threshold: 0.4, 0.5, 0.6, 0.7, 0.8, default 0.6
Keyword frequency: 5 levels (1, 2, 3, 4, 5), default 3
Performance Studies
Score function: given a query … and a document …, …
… is the term frequency of keyword … for dataset …
… is the total number of words in …
Similarity function: given two documents … and …, …
Performance Studies
Vary K (Enwiki)
Figure panels: Small K; Large K.
Conclusion
We study diversified ranking.
We study the diversified top-K search problem.
The diversity uses only the similarity of the search results themselves.
We propose a framework such that most top-K algorithms can be easily extended to handle diversified top-K search by applying the framework.
APWeb 2013 in Sydney, Australia
The 15th International Asia-Pacific Web Conference (APWeb), 4-6 April, 2013, Sydney, Australia
Just before ICDE 2013.
Paper Submission Deadline: October 20.
Three Keynote Speakers
H.V. Jagadish (University of Michigan)
Dan Suciu (University of Washington)
Mark Sanderson (RMIT)
A Special Issue on WWW Journal
Slide58
Research Postgraduate Study at SEEM/CUHK [www.se.cuhk.edu.hk/programmes]
Research Postgraduate Programs
M.Phil, PhD, M.Phil-PhD (Articulated)
Deadlines: December 1, 2012 (First Round); January 31, 2013 (Official Final Round). Because of Chinese New Year, submit before January 20.
Postgraduate Studentship: HK$13,600 per month (non-taxable)
Current Tuition Fees: HK$42,100/year
Hong Kong PhD Fellowship Scheme 2013-2014 (135 positions in HK)
Deadline: December 1, 2012
Monthly stipend of HK$20,000
10,000 travel allowance
Current Tuition Fees: HK$42,100/year
Slide59
Taught Postgraduate Study at SEEM/CUHK [www.se.cuhk.edu.hk/programmes]
Taught Postgraduate Programmes
MSc Programme in SEEM (Systems Engineering and Engineering Management)
MSc Programme in ECLT (E-Commerce and Logistics Technologies)
Current Tuition Fees: (Provisional) HK$128,000
Full-Time One-Year study in HK
Application deadline: 1st Round: January 15, 2013; 2nd Round: March 15, 2013
Early applications are encouraged; offers may be made to eligible applicants well before March 15.
Slide60
Thank you!
Questions?