Slide1
Clickstream Models & Sybil Detection
Gang Wang (王刚)
UC Santa Barbara
gangw@cs.ucsb.edu
Slide2
Modeling User Clickstream Events
User-generated events
  E.g. profile load, link follow, photo browse, friend invite
  Assume we have event type, userID, timestamp
Intuition: Sybil users act differently from normal users
  Goal-oriented: focus on specific actions, fewer "extraneous" events
  Time-limited: focused on efficient use of time, smaller gaps?
  Is forcing Sybil users to mimic normal users a win?
[Table: clickstream log with columns UserID, Event Generated, Timestamp]
Slide3
System Overview
[Figure: system pipeline diagram with the labels Clickstream Log, Sequence Clustering, Cluster Coloring, Known Good Users, Legit, Sybils, and Incoming Clickstream (marked "?")]
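The "Cluster Coloring" step in the overview is not spelled out on the slide. The following is a minimal sketch of one plausible rule, assuming a cluster is treated as legitimate when enough of its members are known good users. The function name `color_clusters`, the threshold rule, and the 0.05 default are all hypothetical and not taken from the talk.

```python
def color_clusters(clusters, known_good, min_good_fraction=0.05):
    """Label each cluster 'legit' or 'sybil' from its share of known good users.

    Assumption: the threshold rule and its default value are illustrative only.
    """
    colors = {}
    for cluster_id, members in clusters.items():
        good = sum(1 for user in members if user in known_good)
        fraction = good / len(members) if members else 0.0
        colors[cluster_id] = "legit" if fraction >= min_good_fraction else "sybil"
    return colors


# Example: cluster "c1" contains a known good user, cluster "c2" does not.
colors = color_clusters({"c1": ["a", "b", "c"], "c2": ["x", "y"]}, known_good={"a"})
```

An incoming clickstream would then inherit the color of whichever cluster it is assigned to.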
Slide4
Clickstream Models
Clickstream log
  User clicks (click type) with timestamp
Modeling Clickstream
  Event-only Sequence Model: order of events
    e.g. ABCDA
  Time-based Model: sequence of inter-arrival times
    e.g. {t1, t2, t3, …}
  Hybrid Model: sequence of click events with time
    e.g. A(t1)B(t2)C(t3)D(t4)A
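The three models can be derived from the same log. The sketch below assumes each log row carries a user ID, a single-letter click type, and a timestamp in seconds. The `Click` record and the helper names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Click(NamedTuple):
    user_id: str
    event: str        # single-letter click type (hypothetical coding, e.g. "A")
    timestamp: float  # seconds


def group_by_user(log):
    """Return {user_id: [Click, ...]} with each user's clicks sorted by time."""
    per_user = defaultdict(list)
    for click in log:
        per_user[click.user_id].append(click)
    for clicks in per_user.values():
        clicks.sort(key=lambda c: c.timestamp)
    return dict(per_user)


def event_sequence(clicks):
    """Event-only model: the order of events, e.g. 'ABCDA'."""
    return "".join(c.event for c in clicks)


def inter_arrival_times(clicks):
    """Time-based model: gaps between consecutive clicks."""
    return [b.timestamp - a.timestamp for a, b in zip(clicks, clicks[1:])]


def hybrid_sequence(clicks):
    """Hybrid model: each event paired with the gap to the next one (None for the last)."""
    gaps = inter_arrival_times(clicks) + [None]
    return [(c.event, gap) for c, gap in zip(clicks, gaps)]
```

The hybrid form mirrors the slide's A(t1)B(t2)C(t3)D(t4)A notation: every event except the last carries the time until the next click.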
Slide5
Clickstream Clustering
Similarity Graph
  Vertices: users (or sessions)
  Edges: weighted by the similarity score of two users' clickstreams
Clustering similar clickstreams together
  Graph partitioning using METIS
Q: How to compare two clickstreams?
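A similarity graph of this kind can be sketched as follows. The code builds weighted edges between users and serializes them in the plain-text graph format that METIS reads (header `n m 001` for edge weights, 1-based vertex ids, positive integer weights). The similarity function, the cutoff, and the scaling factor of 1000 are assumptions for illustration; the talk does not say how scores were converted to weights.

```python
from itertools import combinations


def shared_event_similarity(a, b):
    """Toy similarity (an assumption): Jaccard overlap of the click types used."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def build_similarity_graph(sequences, similarity, min_similarity=0.0):
    """Return (users, edges) where edges maps index pairs (i, j), i < j, to a score."""
    users = sorted(sequences)
    edges = {}
    for i, j in combinations(range(len(users)), 2):
        score = similarity(sequences[users[i]], sequences[users[j]])
        if score > min_similarity:
            edges[(i, j)] = score
    return users, edges


def to_metis_format(num_vertices, edges, scale=1000):
    """Serialize to METIS graph text: header line, then one adjacency line per vertex."""
    adjacency = [[] for _ in range(num_vertices)]
    for (i, j), score in edges.items():
        weight = max(1, round(score * scale))  # METIS needs positive integer weights
        adjacency[i].append((j + 1, weight))   # METIS vertex ids are 1-based
        adjacency[j].append((i + 1, weight))
    lines = [f"{num_vertices} {len(edges)} 001"]
    for neighbors in adjacency:
        lines.append(" ".join(f"{v} {w}" for v, w in sorted(neighbors)))
    return "\n".join(lines) + "\n"


users, edges = build_similarity_graph(
    {"u1": "AAB", "u2": "AAC", "u3": "DD"}, shared_event_similarity
)
graph_text = to_metis_format(len(users), edges)
```

Comparing every pair of users is quadratic in the number of users, so a real deployment would likely need to prune or sample pairs before partitioning.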
Slide6
Distance Functions Of Each Model
Click Sequence (CS) Model
  Ngram overlap
  Ngram+count
Time-based Model
  Compare the distribution of inter-arrival times
  K-S test
Hybrid Model
  Bucketize inter-arrival times
  Compute 5grams (similar to the CS Model)
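The time-based and hybrid distances listed above can be sketched in plain Python. The first function computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distributions of inter-arrival times. The bucket edges used for the hybrid model are hypothetical, since the slides do not give the actual boundaries.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Hypothetical bucket edges in seconds: <1s, 1-10s, 10-60s, 1-10min, >=10min.
BUCKET_EDGES = [1, 10, 60, 600]


def bucketize(gap_seconds):
    """Map an inter-arrival time to a small integer bucket index (0..4)."""
    return bisect_right(BUCKET_EDGES, gap_seconds)


def hybrid_tokens(events, gaps):
    """Pair each event with the bucket of the gap that follows it; the last event has none."""
    tokens = [f"{e}{bucketize(g)}" for e, g in zip(events, gaps)]
    if len(events) > len(gaps):
        tokens.append(events[-1])
    return tokens
```

Once gaps are bucketized, the hybrid token list can be fed to the same n-gram comparison used for the click sequence model.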
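Example, Ngram overlap:
  S1 = AAB
  S2 = AAC
  ngram1 = {A, B, AA, AB, AAB}
  ngram2 = {A, C, AA, AC, AAC}
Example, Ngram+count:
  S1 = AAB
  S2 = AAC
  ngram1 = {A(2), B(1), AA(1), AB(1), AAB(1)}
  ngram2 = {A(2), C(1), AA(1), AC(1), AAC(1)}
  Count vectors over (A, B, C, AA, AC, AB, AAB, AAC), normalized by the total count of 6:
  V1 = (2,1,0,1,0,1,1,0)/6
  V2 = (2,0,1,1,1,0,0,1)/6
  Distance: Euclidean distance between V1 and V2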
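The AAB / AAC example can be reproduced with a short sketch. The count-based distance follows the slide (normalized count vectors, Euclidean distance). For the overlap variant the slide gives no formula, so the code assumes one minus the Jaccard overlap of the two n-gram sets. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngrams(sequence, max_n):
    """Count every contiguous subsequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(sequence) - n + 1):
            counts[tuple(sequence[start:start + n])] += 1
    return counts


def overlap_distance(seq_a, seq_b, max_n):
    """1 - Jaccard overlap of the two n-gram sets (Jaccard is an assumption)."""
    a, b = set(ngrams(seq_a, max_n)), set(ngrams(seq_b, max_n))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def count_distance(seq_a, seq_b, max_n):
    """Euclidean distance between n-gram count vectors, each normalized by its total."""
    a, b = ngrams(seq_a, max_n), ngrams(seq_b, max_n)
    total_a, total_b = sum(a.values()) or 1, sum(b.values()) or 1
    return sqrt(sum((a[g] / total_a - b[g] / total_b) ** 2 for g in set(a) | set(b)))


d_overlap = overlap_distance("AAB", "AAC", 3)  # the sets share A and AA out of 8 n-grams
d_count = count_distance("AAB", "AAC", 3)      # V1 and V2 differ by 1/6 in six dimensions
```

Because `ngrams` accepts any sequence, the same functions work on lists of hybrid tokens as well as on plain event strings.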
Slide7
Detection In A Nutshell
Inputs:
  Trained clusters
  Input sequences for testing
Methodology: given a test sequence
  K nearest neighbor: find the top-k nearest sequences in the trained clusters
  Nearest Cluster: find the nearest cluster based on average distance to sequences in the cluster
  Nearest Cluster (center): pre-compute the center(s) of each cluster, find the nearest cluster center
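The three methods above can be sketched against any pairwise distance function. Several details here are assumptions rather than the talk's stated design: k-nearest-neighbor votes by majority label, the cluster "center" is taken to be the medoid (the member closest to all others), and `toy_distance` is only a placeholder for the model-specific distances.

```python
from collections import Counter


def toy_distance(a, b):
    """Placeholder distance: fraction of click types not shared by both sequences."""
    union = set(a) | set(b)
    return len(set(a) ^ set(b)) / len(union) if union else 0.0


def knn_label(test_seq, labeled_seqs, distance, k=3):
    """Majority label among the k training sequences nearest to test_seq."""
    nearest = sorted(labeled_seqs, key=lambda item: distance(test_seq, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def nearest_cluster_avg(test_seq, clusters, distance):
    """Cluster id with the smallest average distance from test_seq to its members."""
    return min(
        clusters,
        key=lambda cid: sum(distance(test_seq, s) for s in clusters[cid]) / len(clusters[cid]),
    )


def medoid(members, distance):
    """Member with the smallest total distance to the rest (stand-in for a 'center')."""
    return min(members, key=lambda m: sum(distance(m, other) for other in members))


def nearest_cluster_center(test_seq, centers, distance):
    """Cluster id whose pre-computed center is closest to test_seq."""
    return min(centers, key=lambda cid: distance(test_seq, centers[cid]))


clusters = {"legit": ["ABCD", "ABCE"], "sybil": ["FFFG", "FGFG"]}
centers = {cid: medoid(members, toy_distance) for cid, members in clusters.items()}
labeled = [(seq, cid) for cid, members in clusters.items() for seq in members]
```

The center variant only needs one distance computation per cluster at test time, which is why it is the cheapest of the three.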
Slide8
Clustering Sequences
How well can each method separate Sybils from legitimate users?

(False positives, False negatives) of users

Model (Sequence Type)             | Distance Function | 20 clusters | 50 clusters | 100 clusters
Click Sequence Model (Categories) | unigram           | (3%, 6%)    | (1%, 7%)    | (2%, 4%)
Click Sequence Model (Categories) | unigram+count     | (1%, 4%)    | (1%, 3%)    | (1%, 3%)
Click Sequence Model (Categories) | 10gram            | (1%, 3%)    | (1%, 3%)    | (2%, 2%)
Click Sequence Model (Categories) | 10gram+count      | (1%, 4%)    | (2%, 4%)    | (1%, 2%)
Time-based Model                  | K-S Test          | (9%, 8%)    | (2%, 10%)   | (5%, 10%)
Hybrid Model (Categories)         | 5gram             | (3%, 2%)    | (2%, 2%)    | (2%, 2%)
Hybrid Model (Categories)         | 5gram+count       | (3%, 4%)    | (4%, 5%)    | (1%, 2%)
Slide9
Detection Accuracy
Basics
  Training on one group of users, and testing on the other group of users
  Clusters trained using Hybrid Model
Key takeaways
  High accuracy with 50 clicks in the test sequence
  Nearest Cluster (Center) method achieves high accuracy with minor computation overhead

(False positives, False negatives) of users

Number of Clicks in the Sequence (length) | K-nearest Neighbors (k=3) | Nearest Cluster (Avg. Distance) | Nearest Cluster (Center)
Length <=50                               | (1.5%, 2.1%)              | (0.6%, 2.6%)                    | (0.4%, 2.3%)
Length <=100                              | (0.9%, 1.8%)              | (0.2%, 2.5%)                    | (0.3%, 2.3%)
All                                       | (0.6%, 3%)                | (0.4%, 2.8%)                    | (0.4%, 2.3%)
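The (false positive, false negative) pairs in the table can be computed per user as in the sketch below. The slides do not define the denominators, so the code assumes the usual convention: false positives are legitimate users flagged as Sybils, divided by all legitimate users, and false negatives are missed Sybils, divided by all Sybils. The function name and label strings are hypothetical.

```python
def error_rates(predicted, truth):
    """Return (false_positive_rate, false_negative_rate) over users.

    Assumption: rates are normalized by the number of legitimate users and
    the number of Sybil users respectively.
    """
    legit = [u for u, label in truth.items() if label == "legit"]
    sybil = [u for u, label in truth.items() if label == "sybil"]
    fp = sum(1 for u in legit if predicted[u] == "sybil")
    fn = sum(1 for u in sybil if predicted[u] == "legit")
    return (fp / len(legit) if legit else 0.0, fn / len(sybil) if sybil else 0.0)


truth = {"a": "legit", "b": "legit", "e": "legit", "f": "legit", "c": "sybil", "d": "sybil"}
predicted = {"a": "sybil", "b": "legit", "e": "legit", "f": "legit", "c": "sybil", "d": "legit"}
rates = error_rates(predicted, truth)
```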
Slide10
Can Model Be Effective Over Time?
Experiment method
  Using first two-week data to train the model
  Testing on the following two-week data

(False positives, False negatives) of users

Model                | K-nearest Neighbors (k=3) | Nearest Cluster (Avg. Distance) | Nearest Cluster (Center)
Click Sequence Model | (1.8%, 1%)                | (3%, 2%)                        | (3%, 0.8%)
Hybrid Model         | (3%, 2%)                  | (3%, 1%)                        | (1.2%, 1.4%)
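The train-then-test-later setup described above amounts to splitting the log by time window. The sketch below assumes log rows of the form (user_id, event, timestamp) with timestamps in seconds. The function name and the tuple layout are hypothetical.

```python
TWO_WEEKS = 14 * 24 * 3600  # window length in seconds


def split_by_time(log, start, window=TWO_WEEKS):
    """Split rows into a training window and the equally long window that follows it."""
    train = [row for row in log if start <= row[2] < start + window]
    test = [row for row in log if start + window <= row[2] < start + 2 * window]
    return train, test


log = [
    ("u1", "A", 0),
    ("u1", "B", TWO_WEEKS - 1),
    ("u2", "A", TWO_WEEKS),
    ("u2", "C", 2 * TWO_WEEKS - 1),
    ("u3", "A", 2 * TWO_WEEKS),  # falls outside both windows
]
train_rows, test_rows = split_by_time(log, start=0)
```

Clusters would be built from `train_rows` only, and the detection methods would then be applied to sequences drawn from `test_rows`.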
Slide11
Still Ongoing Work
With broad interest and applications
As Sybil detection tool
  Code being tested internally at Renren
    Trained with 10K users (2-week log)
    Testing on 1 million users (1-week log)
    5 Sybil clusters, 22K suspicious profiles
  Further improvement
    Training with longer clickstreams (half of the users have <5 clicks in 2 weeks)
    More conservative in labeling Sybil clusters
As user modeling tool
  Code being tested by LinkedIn as user profiler
Slide12
Some Useful Tools
Graph Partitioning
  Metis http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
Community Detection
  Louvain code https://sites.google.com/site/findcommunities/
Slide13
Other Ongoing Works/Ideas
Fighting against crowdturfing
  Crowdturfing: real users are paid to spam
  How to detect these malicious real users
    User behavior model
    Network-wise temporal anomaly detection
Information Dissemination
  Content sharing via social edges
  How often will users click on the content
  How often will users comment on the content
  Sybil detection, targeted ad placement
Slide14
Questions?
http://current.cs.ucsb.edu
Thank You!