Uploaded On 2022-06-15



Presentation Transcript

Slide1

Clickstream Models & Sybil Detection

Gang Wang (王刚)

UC Santa Barbara

gangw@cs.ucsb.edu

Slide2

Modeling User Clickstream Events

User-generated events
  E.g. profile load, link follow, photo browse, friend invite
  Assume we have event type, userID, timestamp

Intuition: Sybil users act differently from normal users
  Goal-oriented: focus on specific actions, fewer "extraneous" events
  Time-limited: focused on efficient use of time, smaller gaps?
  Forcing Sybil users to mimic normal users: win?

Log columns: UserID, Event Generated, Timestamp
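A minimal sketch of the log format described above, assuming one record per click with the three listed columns. The class name `ClickEvent`, the field names, and the use of seconds for timestamps are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict
from typing import NamedTuple


class ClickEvent(NamedTuple):
    # Hypothetical field names; the slide only names the three columns.
    user_id: str
    event_type: str
    timestamp: float  # assumed to be seconds


def group_by_user(events):
    """Return {user_id: [ClickEvent, ...]}, each list sorted by timestamp."""
    streams = defaultdict(list)
    for ev in events:
        streams[ev.user_id].append(ev)
    for evs in streams.values():
        evs.sort(key=lambda e: e.timestamp)
    return dict(streams)


# Toy log, deliberately out of order.
log = [
    ClickEvent("u1", "profile_load", 10.0),
    ClickEvent("u2", "friend_invite", 11.0),
    ClickEvent("u1", "photo_browse", 4.0),
    ClickEvent("u1", "link_follow", 25.0),
]
streams = group_by_user(log)
```

Grouping by user and sorting by time gives one clickstream per user, which is the unit the later slides cluster and classify.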

Slide3

System Overview

[Figure: system pipeline. Labels: Clickstream Log, Sequence Clustering, Cluster Coloring, Known Good Users, Legit, Sybils, Incoming Clickstream, ?]

Slide4

Clickstream Models

Clickstream log
  User clicks (click type) with timestamp

Modeling Clickstream
  Event-only Sequence Model: order of events, e.g. ABCDA
  Time-based Model: sequence of inter-arrival times, e.g. {t1, t2, t3, …}
  Hybrid Model: sequence of click events with time, e.g. A(t1)B(t2)C(t3)D(t4)A
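The three models can be sketched as three views of the same timestamped click list. This is an illustration under assumptions: each click is a hypothetical `(event_type, timestamp)` pair, and the hybrid view pairs every click except the last with the gap to the next click, mirroring A(t1)B(t2)C(t3)D(t4)A.

```python
def event_sequence(clicks):
    """Event-only model: the order of click types, e.g. 'ABCDA'."""
    return "".join(ev for ev, _ in clicks)


def inter_arrival_times(clicks):
    """Time-based model: gaps between consecutive clicks."""
    times = [t for _, t in clicks]
    return [b - a for a, b in zip(times, times[1:])]


def hybrid_sequence(clicks):
    """Hybrid model: each click with the gap to the next click (None for the last)."""
    if not clicks:
        return []
    gaps = inter_arrival_times(clicks)
    paired = [(ev, gap) for (ev, _), gap in zip(clicks, gaps)]
    return paired + [(clicks[-1][0], None)]


# Toy clickstream: (click type, timestamp in assumed seconds).
clicks = [("A", 0.0), ("B", 2.0), ("C", 3.5), ("D", 9.5), ("A", 10.0)]

events_only = event_sequence(clicks)
gaps = inter_arrival_times(clicks)
hybrid = hybrid_sequence(clicks)
```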

Slide5

Clickstream Clustering

Similarity Graph
  Vertices: users (or sessions)
  Edges: weighted by the similarity score of two users' clickstreams

Clustering similar clickstreams together
  Graph partitioning using METIS

Q: How to compare two clickstreams?

Slide6

Distance Functions Of Each Model

Click Sequence (CS) Model

  Ngram overlap
    S1= AAB    ngram1= {A, B, AA, AB, AAB}
    S2= AAC    ngram2= {A, C, AA, AC, AAC}

  Ngram+count
    S1= AAB    ngram1= {A(2), B(1), AA(1), AB(1), AAB(1)}
    S2= AAC    ngram2= {A(2), C(1), AA(1), AC(1), AAC(1)}

    Euclidean Distance
    V1=(2,1,0,1,0,1,1,0)/6
    V2=(2,0,1,1,1,0,0,1)/6

Time-based Model
  Compare the distribution of inter-arrival time
  K-S test

Hybrid Model
  Bucketize inter-arrival time
  Compute 5grams (similar to CS Model)
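The n-gram example above can be reproduced with a short sketch. The count-based distance follows the slide: counts of all n-grams up to length `max_n`, normalized by the total count, then Euclidean distance. The slide does not give a formula for "Ngram overlap", so the Jaccard distance used in `overlap_distance` is an assumption, as are all function names. Both functions assume non-empty sequences.

```python
import math
from collections import Counter


def ngram_counts(seq, max_n):
    """Count every contiguous n-gram of length 1..max_n in seq."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def count_distance(s1, s2, max_n):
    """Ngram+count: Euclidean distance between normalized count vectors."""
    c1, c2 = ngram_counts(s1, max_n), ngram_counts(s2, max_n)
    total1, total2 = sum(c1.values()), sum(c2.values())
    keys = set(c1) | set(c2)  # Counter returns 0 for n-grams a sequence lacks
    return math.sqrt(sum((c1[k] / total1 - c2[k] / total2) ** 2 for k in keys))


def overlap_distance(s1, s2, max_n):
    """Ngram overlap, assumed here to be Jaccard distance on n-gram sets."""
    g1, g2 = set(ngram_counts(s1, max_n)), set(ngram_counts(s2, max_n))
    return 1.0 - len(g1 & g2) / len(g1 | g2)


counts_s1 = ngram_counts("AAB", 3)
d_count = count_distance("AAB", "AAC", 3)
d_overlap = overlap_distance("AAB", "AAC", 3)
```

For S1 = AAB and S2 = AAC the two normalized vectors differ by 1/6 in six of the eight coordinates, so the Euclidean distance is sqrt(6/36) = sqrt(1/6), about 0.408. Under the Jaccard assumption the two sets share 2 of 8 n-grams, giving an overlap distance of 0.75.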

Slide7

Detection In A Nutshell

Inputs:
  Trained clusters
  Input sequences for testing

Methodology: given a test sequence A
  K nearest neighbor: find the top-k nearest sequences in the trained clusters
  Nearest Cluster: find the nearest cluster based on average distance to sequences in the cluster
  Nearest Cluster (center): pre-compute the center(s) of each cluster, find the nearest cluster center
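The three detection methods can be sketched over feature vectors as follows. Several details are assumptions: sequences are already mapped to numeric vectors, Euclidean distance is used, k-NN decides by majority vote, and a cluster "center" is taken to be the coordinate-wise mean. The slides do not specify these choices, and the cluster labels here stand in for the result of cluster coloring.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_label(test, clusters, k=3):
    """K nearest neighbor: majority label among the k closest training vectors."""
    labelled = [(euclidean(test, v), label)
                for label, vecs in clusters.items() for v in vecs]
    nearest = sorted(labelled)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def nearest_cluster_avg(test, clusters):
    """Nearest Cluster: smallest average distance to the cluster's members."""
    def avg_dist(label):
        vecs = clusters[label]
        return sum(euclidean(test, v) for v in vecs) / len(vecs)
    return min(clusters, key=avg_dist)


def compute_centers(clusters):
    """Pre-compute one center per cluster (assumed: coordinate-wise mean)."""
    return {label: tuple(sum(col) / len(vecs) for col in zip(*vecs))
            for label, vecs in clusters.items()}


def nearest_cluster_center(test, centers):
    """Nearest Cluster (center): compare against pre-computed centers only."""
    return min(centers, key=lambda label: euclidean(test, centers[label]))


# Toy trained clusters, already coloured as legit or Sybil.
clusters = {
    "legit": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "sybil": [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
}
centers = compute_centers(clusters)
test_vec = (4.5, 5.0)
results = (
    knn_label(test_vec, clusters),
    nearest_cluster_avg(test_vec, clusters),
    nearest_cluster_center(test_vec, centers),
)
```

The center variant compares a test sequence against one vector per cluster rather than every training sequence, which is why it needs the least computation at test time.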

Slide8

Clustering Sequences

Each cell is (False positives, False negatives) of users.

Model (Sequence Type)             | Distance Function | 20 clusters | 50 clusters | 100 clusters
----------------------------------+-------------------+-------------+-------------+-------------
Click Sequence Model (Categories) | unigram           | (3%, 6%)    | (1%, 7%)    | (2%, 4%)
                                  | unigram+count     | (1%, 4%)    | (1%, 3%)    | (1%, 3%)
                                  | 10gram            | (1%, 3%)    | (1%, 3%)    | (2%, 2%)
                                  | 10gram+count      | (1%, 4%)    | (2%, 4%)    | (1%, 2%)
Time-based Model                  | K-S Test          | (9%, 8%)    | (2%, 10%)   | (5%, 10%)
Hybrid Model (Categories)         | 5gram             | (3%, 2%)    | (2%, 2%)    | (2%, 2%)
                                  | 5gram+count       | (3%, 4%)    | (4%, 5%)    | (1%, 2%)

How well can each method separate Sybils from legitimate users?
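For readers who want to compute a (false positives, false negatives) pair like those in the table, here is a minimal sketch. It assumes the false-positive rate is the fraction of legitimate users labelled Sybil and the false-negative rate is the fraction of Sybils labelled legitimate. The slides do not state the denominators, so this definition and the function name are assumptions.

```python
def error_rates(truth, predicted):
    """Return (false_positive_rate, false_negative_rate) over users.

    truth and predicted map user id -> "legit" or "sybil".
    Assumes at least one legit and one sybil user in truth.
    """
    legit = [u for u, label in truth.items() if label == "legit"]
    sybil = [u for u, label in truth.items() if label == "sybil"]
    false_pos = sum(predicted[u] == "sybil" for u in legit)
    false_neg = sum(predicted[u] == "legit" for u in sybil)
    return false_pos / len(legit), false_neg / len(sybil)


# Toy ground truth and classifier output.
truth = {"a": "legit", "b": "legit", "c": "legit", "d": "legit",
         "x": "sybil", "y": "sybil"}
predicted = {"a": "legit", "b": "sybil", "c": "legit", "d": "legit",
             "x": "sybil", "y": "legit"}
fp_rate, fn_rate = error_rates(truth, predicted)
```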

Slide9

Detection Accuracy

Basics
  Training on one group of users, and testing on the other group of users
  Clusters trained using Hybrid Model

Key takeaways
  High accuracy with 50 clicks in the test sequence
  Nearest Cluster (Center) method achieves high accuracy with minor computation overhead

Each cell is (False positives, False negatives) of users.

Number of Clicks in the Sequence (length) | K-nearest Neighbors (k=3) | Nearest Cluster (Avg. Distance) | Nearest Cluster (Center)
------------------------------------------+---------------------------+---------------------------------+-------------------------
Length <=50                               | (1.5%, 2.1%)              | (0.6%, 2.6%)                    | (0.4%, 2.3%)
Length <=100                              | (0.9%, 1.8%)              | (0.2%, 2.5%)                    | (0.3%, 2.3%)
All                                       | (0.6%, 3%)                | (0.4%, 2.8%)                    | (0.4%, 2.3%)

Slide10

Can Model Be Effective Over Time?

Experiment method

Using first two-week data to train the model

Testing on the following two-week data

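Each cell is (False positives, False negatives) of users.

Model                | K-nearest Neighbors (k=3) | Nearest Cluster (Avg. Distance) | Nearest Cluster (Center)
---------------------+---------------------------+---------------------------------+-------------------------
Click Sequence Model | (1.8%, 1%)                | (3%, 2%)                        | (3%, 0.8%)
Hybrid Model         | (3%, 2%)                  | (3%, 1%)                        | (1.2%, 1.4%)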
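The experiment setup, training on the first two weeks and testing on the following two, amounts to a split by timestamp. A minimal sketch, assuming events are hypothetical `(user_id, event_type, timestamp)` tuples with timestamps in seconds measured from an arbitrary start:

```python
DAY = 24 * 3600
WEEK = 7 * DAY


def split_by_time(events, start, train_weeks=2, test_weeks=2):
    """Split events into a training window and the window that follows it."""
    train_end = start + train_weeks * WEEK
    test_end = train_end + test_weeks * WEEK
    train = [e for e in events if start <= e[2] < train_end]
    test = [e for e in events if train_end <= e[2] < test_end]
    return train, test


# Toy log spanning a little over four weeks.
events = [
    ("u1", "A", 0),
    ("u1", "B", 13 * DAY),
    ("u2", "A", 14 * DAY),
    ("u2", "C", 27 * DAY),
    ("u3", "A", 28 * DAY),  # falls outside both windows
]
train, test = split_by_time(events, start=0)
```

Splitting on time rather than on users means the model is evaluated on behavior that occurs after training, which is the question this slide asks.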

Slide11

Still Ongoing Work

With broad interest and applications

As Sybil detection tool
  Code being tested internally at Renren
  Trained with 10K users (2-week log)
  Testing on 1 million users (1-week log)
  5 Sybil clusters, 22K suspicious profiles
  Further improvement
    Training with longer clickstreams (half of the users have <5 clicks in 2 weeks)
    More conservative in labeling Sybil clusters

As user modeling tool
  Code being tested by LinkedIn as user profiler

Slide12

Some Useful Tools

Graph Partitioning
  Metis: http://glaros.dtc.umn.edu/gkhome/metis/metis/overview

Community Detection
  Louvain code: https://sites.google.com/site/findcommunities/

Slide13

Other Ongoing Works/Ideas

Fighting against crowdturfing
  Crowdturfing: real users are paid to spam
  How to detect these malicious real users
    User behavior model
    Network-wide temporal anomaly detection

Information Dissemination
  Content sharing via social edges
    How often will users click on the content
    How often will users comment on the content
  Sybil detection, targeted ad placement

Slide14

Questions?

http://current.cs.ucsb.edu

Thank You!