Slide1
Mining Large Graphs: Spectral Methods, Tensors and Influence propagation
Christos Faloutsos
CMU
Slide2
Thanks
Alex Smola
Jia Yu (Tim) Pan
Google, June 2013
C. Faloutsos (CMU)
Slide3
Roadmap
Graph problems:
G1: Fraud detection – BP
G2: Botnet detection – spectral
G3: Beyond graphs: tensors and ``NELL’’
Influence propagation and spike modeling
C1: spikeM model
Conclusions
Slide4
E-bay Fraud detection
w/ Polo Chau & Shashank Pandit, CMU [WWW’07]
Slide5
E-bay Fraud detection
Slide6
E-bay Fraud detection
Slide7
E-bay Fraud detection - NetProbe
Slide8
E-bay Fraud detection - NetProbe
Compatibility matrix (heterophily) over states F, A, H (F = fraudster, A = accomplice, H = honest)
Entries shown: F: 99%; A: 99%; H: 49%, 49%
(details)
Slide9
Background 1:
Belief Propagation Equations
[Pearl ‘82][Yedidia+ ‘02]
…[Pandit+ ‘07][Gonzalez+ ‘09][Chechetka+ ‘10]
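The message-passing idea behind these equations can be sketched in a few lines. This is a minimal illustration only, not the NetProbe code: it assumes two states instead of the three on the NetProbe slide, a single shared compatibility matrix `psi` with made-up heterophily values, and a hypothetical 3-node path graph. The function name `belief_propagation` and all numbers are assumptions for illustration.

```python
import numpy as np

def belief_propagation(edges, priors, psi, n_iter=10):
    """Sum-product BP on an undirected graph with one shared compatibility matrix."""
    n, k = priors.shape
    neighbors = {i: [] for i in range(n)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    # m[(i, j)] is the message from node i to node j, initialised to uniform.
    m = {(i, j): np.ones(k) / k for i in neighbors for j in neighbors[i]}
    for _ in range(n_iter):
        new_m = {}
        for (i, j) in m:
            # Prior of i times every message into i, except the one from j.
            incoming = priors[i].copy()
            for nb in neighbors[i]:
                if nb != j:
                    incoming = incoming * m[(nb, i)]
            msg = psi.T @ incoming  # sum over x_i of incoming[x_i] * psi[x_i, x_j]
            new_m[(i, j)] = msg / msg.sum()
        m = new_m
    # Belief b_i(x_i) is proportional to prior times all incoming messages.
    beliefs = np.array(priors, dtype=float)
    for i in range(n):
        for nb in neighbors[i]:
            beliefs[i] = beliefs[i] * m[(nb, i)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)

# Hypothetical example: path 0 - 1 - 2, states (0, 1), heterophily compatibility.
psi = np.array([[0.1, 0.9],
                [0.9, 0.1]])
priors = np.array([[0.9, 0.1],   # node 0 is believed to be in state 0
                   [0.5, 0.5],   # nodes 1 and 2 are unlabeled
                   [0.5, 0.5]])
beliefs = belief_propagation([(0, 1), (1, 2)], priors, psi)
```

Under heterophily the neighbor of node 0 leans toward the opposite state, and the node two hops away leans back toward the state of node 0. On a tree such as this path, the messages stop changing after a few sweeps.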
Belief of node i in state x_i: b_i(x_i)
Slide10
Popular press
And less desirable attention:
E-mail from ‘Belgium police’ (‘copy of your code?’)
Slide11
Roadmap
Graph problems:
G1: Fraud detection – BP
Ebay
Symantec
Unification
G2: Botnet detection – spectral
G3: Beyond graphs: tensors and ``NELL’’
Influence propagation and spike modeling
Conclusions
Slide12
Polonium: Tera-Scale Graph Mining and Inference for Malware Detection
Polo Chau, Machine Learning Dept
Carey Nachenberg, Vice President & Fellow
Jeffrey Wilhelm, Principal Software Engineer
Adam Wright, Software Engineer
Prof. Christos Faloutsos, Computer Science Dept
PATENT PENDING
SDM 2011, Mesa, Arizona
Slide13
Polonium: The Data
60+ terabytes of data anonymously contributed by participants of worldwide Norton Community Watch program
50+ million machines
900+ million executable files
Constructed a machine-file bipartite graph (0.2 TB+)
1 billion nodes (machines and files)
37 billion edges
Slide14
Polonium: Key Ideas
Use “guilt-by-association” (i.e., homophily)
E.g., files that appear on machines with many bad files are more likely to be bad
Scalability: handles 37 billion-edge graph
Slide15
Polonium: One-Interaction Results
84.9% True Positive Rate at 1% False Positive Rate
True Positive Rate: % of malware correctly identified
False Positive Rate: % of non-malware wrongly labeled as malware
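The two rates defined above can be computed directly from labels and predictions. The sketch below is only an illustration of the definitions; the function name `tpr_fpr` and the toy labels are made up and are not Polonium data.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate); 1 = malware, 0 = benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    # TPR: share of malware correctly identified.
    # FPR: share of non-malware wrongly labeled as malware.
    return tp / (tp + fn), fp / (fp + tn)

# Toy example: 4 malware files, 6 benign files.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
```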
(plot annotation: Ideal)
Slide16
Roadmap
Graph problems:
G1: Fraud detection – BP
Ebay
Symantec
Unification
G2: Botnet detection – spectral
G3: Beyond graphs: tensors and ``NELL’’
Influence propagation and spike modeling
Conclusions
Slide17
Unifying Guilt-by-Association Approaches:
Theorems and Fast Algorithms
Danai Koutra
U Kang
Hsing-Kuo Kenneth Pao
Tai-You Ke
Duen Horng (Polo) Chau
Christos Faloutsos
ECML PKDD, 5-9 September 2011, Athens, Greece
Slide18
Problem Definition: GBA techniques
Given: Graph; & few labeled nodes
Find: labels of rest (assuming network effects)
Slide19
Homophily and Heterophily
(figure panels: Step 1, Step 2; homophily, heterophily)
All methods handle homophily
NOT all methods handle heterophily
BUT proposed method does!
Slide20
Are they related?
RWR (Random Walk with Restarts): google’s pageRank (‘if my friends are important, I’m important, too’)
SSL (Semi-supervised learning): minimize the differences among neighbors
BP (Belief propagation): send messages to neighbors, on what you believe about them
Slide21
Are they related?
YES!
Slide22
Background 1:
Belief Propagation Equations
[Pearl ‘82][Yedidia+ ‘02]
…[Pandit+ ‘07][Gonzalez+ ‘09][Chechetka+ ‘10]
Slide23
Correspondence of Methods
Method | Matrix | Unknown | Known
RWR | [I – c A D^-1] | x | (1-c) y
SSL | [I + a (D - A)] | x | y
FaBP | [I + a D - c’A] | b_h | φ_h
Each row reads: Matrix × Unknown = Known
A: adjacency matrix, e.g. [0 1 0; 1 0 1; 0 1 0]; D: diagonal degree matrix diag(d1, d2, d3)
x, b_h: final labels/ beliefs; y, φ_h: prior labels/ beliefs
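Since all three methods reduce to a linear system of the same shape, they can be solved side by side on the 3-node example from the slide. The sketch below is illustrative only: the prior vectors and the values of c, a and h are made up, and the mapping from the homophily factor h to the FaBP coefficients (a = 4h²/(1-4h²), c’ = 2h/(1-4h²)) is quoted from memory of the FaBP paper and should be checked against it.

```python
import numpy as np

# Path graph 1 - 2 - 3 from the slide: adjacency matrix A and degree matrix D.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
D = np.diag(A.sum(axis=1))
I = np.eye(3)

def rwr(A, D, y, c):
    """Random walk with restarts: [I - c A D^-1] x = (1 - c) y."""
    return np.linalg.solve(I - c * A @ np.linalg.inv(D), (1 - c) * y)

def ssl(A, D, y, a):
    """Semi-supervised learning: [I + a (D - A)] x = y."""
    return np.linalg.solve(I + a * (D - A), y)

def fabp(A, D, phi, a, c_prime):
    """Linearized BP (FaBP): [I + a D - c' A] b = phi."""
    return np.linalg.solve(I + a * D - c_prime * A, phi)

# Assumed priors: only node 1 carries a label.
y = np.array([1.0, 0.0, 0.0])
phi = np.array([0.01, 0.0, 0.0])   # FaBP priors are small deviations around zero

# Assumed homophily factor and the coefficient mapping recalled from the paper.
h = 0.1
a_fabp = 4 * h**2 / (1 - 4 * h**2)
c_fabp = 2 * h / (1 - 4 * h**2)

x_rwr = rwr(A, D, y, c=0.85)
x_ssl = ssl(A, D, y, a=0.5)
b_fabp = fabp(A, D, phi, a_fabp, c_fabp)
```

In all three cases the score decays with distance from the labeled node, which is the guilt-by-association behaviour the slide is pointing at.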
Slide24
Correspondence of Methods
We know when it converges!
Slide25
Results: Scalability
FaBP is linear on the number of edges.
(plot: runtime (min) vs. # of edges, Kronecker graphs)
Slide26
Results: Parallelism
FaBP ~2x faster & wins/ties on accuracy.
(plot: % accuracy vs. runtime (min))
Slide27
Conclusions for BP
‘NetProbe’, ‘Polonium’, and belief propagation: exploit network effects.
FaBP: fast & accurate (and -> convergence conditions)
Slide28
Roadmap
Graph problems:
G1: Fraud detection – BP
Ebay
Symantec
Unification
G2: Botnet detection – spectral
G3: Beyond graphs: tensors and ``NELL’’
Influence propagation and spike modeling
Conclusions
Slide29
EigenSpokes
B. Aditya Prakash, Mukund Seshadri, Ashwin Sridharan, Sridhar Machiraju and Christos Faloutsos:
EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs,
PAKDD 2010, Hyderabad, India, 21-24 June 2010.
Slide30
EigenSpokes
Eigenvectors of adjacency matrix
equivalent to singular vectors (symmetric, undirected graph)
Slide31
EigenSpokes
Eigenvectors of adjacency matrix (N × N)
equivalent to singular vectors (symmetric, undirected graph)
(details)
Slide35
EigenSpokes
EE plot: Scatter plot of scores of u1 vs u2
One would expect
Many points @ origin
A few scattered ~randomly
(axes: u1 = 1st principal component, u2 = 2nd principal component)
Slide36
EigenSpokes
EE plot: Scatter plot of scores of u1 vs u2 (figure annotation: 90°)
Slide37
EigenSpokes - pervasiveness
Present in mobile social graph across time and space
Patent citation graph
Slide38
EigenSpokes - explanation
Near-cliques, or near-bipartite-cores, loosely connected
Slide41
EigenSpokes - explanation
Near-cliques, or near-bipartite-cores, loosely connected
So what?
Extract nodes with high scores
high connectivity
Good “communities”
spy plot of top 20 nodes
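The explanation above can be reproduced on a toy graph. The sketch below uses an idealized case that is an assumption, not the paper's data: two cliques that are fully disconnected (rather than loosely connected) plus a few isolated nodes. The helper name `clique_block_graph` and the 0.1 threshold are hypothetical.

```python
import numpy as np

def clique_block_graph(sizes, n_total):
    """Adjacency matrix with one clique per entry of `sizes`; remaining nodes isolated."""
    A = np.zeros((n_total, n_total))
    start = 0
    for s in sizes:
        A[start:start + s, start:start + s] = 1.0
        start += s
    np.fill_diagonal(A, 0.0)  # no self-loops
    return A

# Nodes 0-4 form a 5-clique, nodes 5-8 a 4-clique, nodes 9-14 are isolated.
A = clique_block_graph([5, 4], 15)

# eigh returns eigenvalues in ascending order for a symmetric matrix.
vals, vecs = np.linalg.eigh(A)
u1, u2 = vecs[:, -1], vecs[:, -2]   # top two eigenvectors

# EE-plot coordinates of every node: (u1[i], u2[i]).
# Each clique loads on only one eigenvector, so points fall on the axes.
on_axes = np.abs(u1 * u2)

# "Extract nodes with high scores": threshold the first eigenvector.
community = np.flatnonzero(np.abs(u1) > 0.1).tolist()
```

Here every node has a nonzero score on at most one of u1 and u2, so the scatter plot shows two perpendicular spokes plus a pile of points at the origin, and thresholding u1 recovers the 5-clique.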
Slide42
Bipartite Communities!
magnified bipartite community
patents from same inventor(s)
`cut-and-paste’ bibliography!
Slide43
(maybe, botnets?)
Victim IPs?
Botnet members?
Exploring it with Dr. Eric Mao (III-Taiwan)
Slide44
Roadmap
Graph problems:
G1: Fraud detection – BP
G2: Botnet detection – spectral
G3: Beyond graphs: tensors and ``NELL’’
Influence propagation and spike modeling
Conclusions
Slide45
GigaTensor: Scaling Tensor Analysis Up By 100 Times – Algorithms and Discoveries
U Kang, Evangelos Papalexakis, Abhay Harpale, Christos Faloutsos
KDD’12
Slide46
Background: Tensors
Tensors (=multi-dimensional arrays) are everywhere
Hyperlinks & anchor text [Kolda+,05]
(figure: URL 1 × URL 2 × Anchor Text tensor; anchor texts Java, C++, C#; nonzero entries are 1)
Slide47
Background: Tensors
Tensors (=multi-dimensional arrays) are everywhere
Sensor stream (time, location, type)
Predicates (subject, verb, object) in knowledge base
“Barack Obama is president of U.S.”
“Eric Clapton plays guitar”
NELL (Never Ending Language Learner) data: dimensions (26M), (26M), (48M); Nonzeros = 144M
Slide48
Background: Tensors
Tensors (=multi-dimensional arrays) are everywhere
Sensor stream (time, location, type)
Predicates (subject, verb, object) in knowledge base
Anomaly detection in computer networks (IP-source, IP-destination, time-stamp)
Slide49
Problem Definition
How to decompose a billion-scale tensor?
Corresponds to SVD in 2D case
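To make the SVD analogy concrete, the sketch below fits a CP (PARAFAC) decomposition with plain alternating least squares in NumPy. It is a small in-memory illustration only and is not the GigaTensor algorithm, which targets tensors far larger than RAM; the function name `cp_als`, the toy tensor and the iteration count are all assumptions.

```python
import numpy as np

def cp_als(X, rank, n_iter=100, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    n_i, n_j, n_k = X.shape
    A = rng.standard_normal((n_i, rank))
    B = rng.standard_normal((n_j, rank))
    C = rng.standard_normal((n_k, rank))
    for _ in range(n_iter):
        # Each update solves a linear least-squares problem with the other two factors fixed.
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def reconstruct(A, B, C):
    """Rebuild the tensor from its three factor matrices."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

# Toy (subject, object, verb) tensor with two disjoint blocks, i.e. two "concepts".
X = np.zeros((6, 6, 4))
X[:3, :3, :2] = 1.0
X[3:, 3:, 2:] = 1.0

A, B, C = cp_als(X, rank=2)
rel_error = np.linalg.norm(reconstruct(A, B, C) - X) / np.linalg.norm(X)
```

Each column of A, B and C plays the role of a singular vector: large entries in column r of A, B and C pick out the subjects, objects and verbs that belong to concept r. ALS is not guaranteed to reach the best fit from every starting point, so `rel_error` is worth checking after a run.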
Slide50
Problem Definition
(example concepts: ‘Politicians’, ‘Artists’)
Slide51
Problem Definition
Q1: Dominant concepts/topics?
Q2: Find synonyms to a given noun phrase?
(and how to scale up: |data| > RAM)
NELL (Never Ending Language Learner) data: dimensions (26M), (26M), (48M); Nonzeros = 144M
Slide52
Experiments
GigaTensor solves 100x larger problem
(plot: tensor dimensions (I), (J), (K); number of nonzero = I / 50; Tensor Toolbox runs out of memory, GigaTensor handles 100x larger)
Slide53
A1: Concept Discovery
Concept Discovery in Knowledge Base
Slide54
A1: Concept Discovery
Slide55
A2: Synonym Discovery
Slide56
Roadmap
Graph problems:
G1: Fraud detection – BP
G2: Botnet detection – spectral
G3: Beyond graphs: tensors and ``NELL’’
Influence propagation and spike modeling
Conclusions
Slide57
Rise and Fall Patterns of Information Diffusion: Model and Implications
Yasuko Matsubara (Kyoto University), Yasushi Sakurai (NTT), B. Aditya Prakash (CMU), Lei Li (UCB), Christos Faloutsos (CMU)
KDD’12, Beijing China
Slide58
Rise and fall patterns in social media
Meme (# of mentions in blogs): short phrases sourced from U.S. politics in 2008
“you can put lipstick on a pig”
“yes we can”
Slide59
Rise and fall patterns in social media
four classes on YouTube [Crane et al. ’08]
six classes on Meme [Yang et al. ’11]
Slide60
Rise and fall patterns in social media
Can we find a unifying model, which includes these patterns?
Slide61
Rise and fall patterns in social media
Answer: YES!
We can represent all patterns by a single model
Slide62
Main idea - SpikeM
1. Un-informed bloggers (uninformed about rumor)
2. External shock at time n_b (e.g., breaking news)
3. Infection (word-of-mouth)
(figure: snapshots at time n=0, n=n_b, n=n_b+1)
Infectiveness of a blog-post at age n: β (strength of infection, i.e. quality of news) times a decay function
Slide63
Main idea - SpikeM
Slide64
-1.5 slope
J. G. Oliveira & A.-L. Barabási. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005).
(plot: Prob(RT > x) (log) vs. response time (log); slope -1.5)
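The three ingredients (un-informed bloggers, an external shock at time n_b, and infection with a power-law decay of slope -1.5) can be put together in a short simulation. The update rule below is an assumption written in the spirit of the slides, since the slide equations did not survive as text; it omits periodicity and noise, and the function name `spike_sketch` and every parameter value are made up for illustration.

```python
import numpy as np

def spike_sketch(N=1000.0, beta=0.001, n_b=5, S_b=10.0, T=60, decay=-1.5):
    """Newly informed bloggers per time tick under a shock-plus-infection model."""
    U = N                    # un-informed bloggers still available
    dB = np.zeros(T)         # dB[n] = bloggers who become informed at time n
    S = np.zeros(T)
    S[n_b] = S_b             # external shock at time n_b (e.g., breaking news)
    for n in range(T - 1):
        t = np.arange(n_b, n + 1)               # past ticks since the shock (empty before it)
        age = (n + 1 - t).astype(float)         # age of each past post at time n + 1
        infectiveness = beta * age ** decay     # strength times power-law decay
        new = U * np.sum((dB[t] + S[t]) * infectiveness)
        new = min(new, U)                       # cannot inform more bloggers than remain
        dB[n + 1] = new
        U -= new
    return dB

dB = spike_sketch()
peak_time = int(np.argmax(dB))
```

With these assumed values nothing happens before the shock, activity then climbs quickly while many un-informed bloggers remain, and it fades once that pool is depleted and older posts lose infectiveness.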
Slide65
SpikeM - with periodicity
Full equation of SpikeM
Periodicity: bloggers change their activity over time (e.g., daily, weekly, yearly)
(figure: activity vs. time n; peak at noon, dip at 3am)
(details)
Slide66
Analysis – exponential rise and power-law fall
(details)
Rise-part (lin-log and log-log plots)
SI -> exponential
SpikeM -> exponential
Slide67
Analysis – exponential rise and power-law fall
(details)
Fall-part (lin-log and log-log plots)
SI -> exponential
SpikeM -> power law
Slide68
Tail-part forecasts
SpikeM can capture tail part
Slide69
69
e.g., given (1) first spike,
(2) release date of two sequel movies
(3) access volume before the release date
?
(1) First spike
(2) Release date
(3) Two weeks before release
C. Faloutsos (CMU)
Google, June 2013
?Slide70
“What-if” forecasting
SpikeM can forecast upcoming spikes
Slide71
Conclusions for spikes
Exp rise; PL decay
‘spikeM’ captures all patterns, with a few parameters
And can do extrapolation
And forecasting
Slide72
Roadmap
Graph problems:
G1: Fraud detection – BP
G2: Botnet detection – spectral
G3: Beyond graphs: tensors and ``NELL’’
Influence propagation and spike modeling
Future research
Conclusions
Slide73
Challenge #1: Time evolving networks / tensors
Periodicities?
Burstiness?
What is ‘typical’ behavior of a node, over time
Heterogeneous graphs (= nodes w/ attributes)
…
Slide74
Challenge #2: ‘Connectome’ – brain wiring
Which neurons get activated by ‘bee’
How wiring evolves
Modeling epilepsy
N. Sidiropoulos, George Karypis, V. Papalexakis, Tom Mitchell
Slide75
Thanks
Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab
Slide76
Project info: PEGASUS
www.cs.cmu.edu/~pegasus
Results on large graphs: with Pegasus + hadoop + M45
Apache license
Code, papers, manual, video
Prof. U Kang
Prof. Polo Chau
Slide77
Cast
Akoglu, Leman
Chau, Polo
Kang, U
McGlohon, Mary
Tong, Hanghang
Prakash, Aditya
Koutra, Danai
Beutel, Alex
Papalexakis, Vagelis
Slide78
References
Deepayan Chakrabarti, Christos Faloutsos: Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38(1): (2006)
Slide79
References
Christos Faloutsos, Tamara G. Kolda, Jimeng Sun: Mining large graphs and streams using matrix and tensor tools. Tutorial, SIGMOD Conference 2007: 1174
Slide80
References
Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, Christos Faloutsos, "Rise and Fall Patterns of Information Diffusion: Model and Implications", KDD’12, pp. 6-14, Beijing, China, August 2012
Slide81
References
Jimeng Sun, Dacheng Tao, Christos Faloutsos: Beyond streams and graphs: dynamic tensor analysis. KDD 2006: 374-383
Slide82
Overall Conclusions
G1: fraud detection
BP: powerful method
FaBP: faster; equally accurate; known convergence
G2: botnets -> Eigenspokes
G3: Subject-Verb-Object -> Tensors/GigaTensor
Spikes: ‘spikeM’ (exp rise; PL drop)