Information Retrieval Methods for Social Media

tawny-fly . @tawny-fly
Uploaded On 2016-07-04



Presentation Transcript

Slide1

Information Retrieval Methods for Social Media

600.466

Slide2

Social Media?

Slide3

Slide4

Properties of Social Media: Scale

Twitter (Chirp 2010)
  More than 100M user accounts
  More than 600M search queries a day
  55M tweets a day
Facebook
  More than 400M active users
  More than 25 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month

Slide5

Properties of Social Media: Immediacy

Need to share breaking news
Search: Content vs. Peer recommendation

Slide6

Properties of Social Media: Duplication

Duplication of content
  Blogs: Copy-Paste
  Twitter: “Re-tweet”
  Groups: Cross-posting
  Email: Signature lines, Inline Replies

Slide7

Properties of Social Media: Semi-structuredness

Informal but structured
Informal != low quality (e.g. Wikipedia)
Structure
Metadata
Connectivity

Slide8

Suggested Reading

“Towards a PeopleWeb”, Raghu Ramakrishnan & Andrew Tomkins, IEEE Computer, Aug 2007

“Important properties of users and objects will move from being tied to individual Web sites to being globally available. The conjunction of a global object model with portable user context will lead to a richer content structure and introduce significant shifts in online communities and information discovery.”

Slide9

Properties of Social Media

Scale
Immediacy
Heterogeneity
Duplication
Semi-structuredness

Slide10

Properties of Social Media Graphs

Small-world property
  Six degrees of separation
  Facebook: 5.73 (Bunyan, 2009)
  MS Messenger: ~7 (Leskovec & Horvitz, 2007)
Mathematically
  Low average path length
  High clustering coefficient

Slide11
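The two quantities named under "Mathematically" for the small-world property can be computed directly on a small graph. The sketch below is only an illustration: the toy graph and the function names are made up for this example, and the clustering coefficient is taken as the average of the per-node local coefficients, which is one common definition.

```python
from collections import deque

def average_path_length(adj):
    """Mean shortest-path length over ordered pairs of connected nodes (BFS from each node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def clustering_coefficient(adj):
    """Average over nodes of (links among neighbors) / (possible links among neighbors)."""
    coeffs = []
    for u, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            coeffs.append(0.0)
            continue
        nbrs = list(neighbors)
        links = sum(1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in adj[nbrs[i]])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

# Hypothetical undirected toy graph: two triangles joined by the edge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
print(average_path_length(toy), clustering_coefficient(toy))
```

A small-world graph combines a low value of the first number with a high value of the second; on real social graphs the all-pairs BFS used here would be replaced by sampling.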

Network Evolution and Path Size

Slide12

Properties of Social Media Graphs

Power law degree distribution (asymptotically)
  Property of most real world networks
  Existence of “hubs”
  Scale free networks

Slide13

Probabilistic Modeling of Networks

Erdős-Rényi Model
  Choose a pair of nodes uniformly at random and add an edge.
  G(n, p): n nodes, each possible edge present independently with probability p
  Not scale free (small average path length but low clustering coefficient)
Scale free networks don’t evolve by chance

Slide14
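To make the Erdős-Rényi G(n, p) model concrete, here is a minimal sketch that samples such a graph and tabulates its degrees. The parameter values, the seed, and the function names are assumptions chosen for illustration. Degrees concentrate around (n - 1)p, so the large "hubs" of a scale-free network do not appear.

```python
import random
from collections import Counter

def erdos_renyi(n, p, seed=0):
    """Sample G(n, p): every one of the n*(n-1)/2 possible edges is kept with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def degree_histogram(adj):
    """Map each degree value to the number of nodes having that degree."""
    return Counter(len(neighbors) for neighbors in adj.values())

g = erdos_renyi(n=500, p=0.02)
hist = degree_histogram(g)
mean_degree = sum(d * count for d, count in hist.items()) / len(g)
print("mean degree:", mean_degree, "max degree:", max(hist))
```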

Probabilistic Modeling of Networks

Preferential Attachment (Barabási and Albert, 1999)
  Rich become richer
  Stochastic process: using Pólya’s urn

Slide15
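The "rich become richer" process of preferential attachment can be sketched with an urn-style list in which every node appears once per edge endpoint, so a uniform draw from the list picks a node with probability proportional to its degree. This is a simplified illustration, not the exact construction from the Barabási-Albert paper; the starting clique, the parameter m, and the function names are assumptions.

```python
import random

def preferential_attachment(n, m=2, seed=0):
    """Grow a graph to n nodes; each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Assumed starting point: a clique on m + 1 nodes, so every node has nonzero degree.
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # The urn holds one entry per edge endpoint.
    urn = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(urn))
        for t in sorted(targets):
            edges.append((new, t))
            urn.extend((new, t))
    return edges

def degrees(edges):
    """Count how many edges touch each node."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

edges = preferential_attachment(n=2000, m=2)
deg = degrees(edges)
print("max degree:", max(deg.values()), "median degree:", sorted(deg.values())[len(deg) // 2])
```

Most nodes stay near degree m while a few early nodes collect many links, which is the hub structure described for scale-free networks.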

Why model?

Study network evolution, degeneration
Develop algorithms
Detect communities
Who are the movers and shakers?
Detect diffusion of ideas across networks
Detect anomalies

Slide16

Crawling Social Networks

HTML (Slashdot)
RSS/Atom feeds (blogs)
API driven (Twitter, Facebook, …)
Data liberation

Slide17

// Fetch and print the friends timeline with Twitter4J.
Twitter twitter = new TwitterFactory().getInstance(twitterID, twitterPassword);
List<Status> statuses = twitter.getFriendsTimeline();
System.out.println("Showing friends timeline.");
for (Status status : statuses) {
    System.out.println(status.getUser().getName() + ":" + status.getText());
}

http://twitter4j.org

Slide18

Storage and Indexing

Graph stores can be designed more efficiently than a traditional RDBMS or flat files (document IR)
A family of “triple stores” or graph databases (#NoSQL movement)
  Neo4J
  CouchDB
  Hypertable
  …

Slide19

Data is becoming more and more connected

(Eifrem, OSCON 2009)

Slide20

Social Media Graphs

(Eifrem, OSCON 2009)

Slide21

Social Media Graphs: Representation

Nodes
Relationships between nodes
Properties on both
Storing in Flat Files vs. Graph Databases
  Neo4J, a disk-based solution, works well for sizes up to a few billion (single JVM)

Slide22
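The representation described for social media graphs (nodes, relationships between nodes, and properties on both) can be sketched as a tiny in-memory structure. This is not the Neo4J API; the class names, method names, and sample data are hypothetical and only show the shape of the data model.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    source: str
    target: str
    rel_type: str
    properties: dict = field(default_factory=dict)

class PropertyGraph:
    """Nodes and typed, directed relationships, each carrying key-value properties."""

    def __init__(self):
        self.nodes = {}
        self.outgoing = {}  # node_id -> list of Relationship objects leaving that node

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        self.outgoing.setdefault(node_id, [])

    def relate(self, source, target, rel_type, **properties):
        self.outgoing[source].append(Relationship(source, target, rel_type, properties))

    def neighbors(self, node_id, rel_type=None):
        """Targets reachable in one hop, optionally restricted to one relationship type."""
        return [r.target for r in self.outgoing[node_id]
                if rel_type is None or r.rel_type == rel_type]

graph = PropertyGraph()
graph.add_node("alice", kind="user", joined=2008)
graph.add_node("bob", kind="user", joined=2009)
graph.add_node("post1", kind="post", text="hello")
graph.relate("alice", "bob", "FOLLOWS", since=2009)
graph.relate("alice", "post1", "WROTE")
print(graph.neighbors("alice", "FOLLOWS"))
```

Keeping each node's relationships next to the node is what makes one-hop traversals cheap compared with scanning a flat file of edges.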

Processing Large Scale Graph Data

Better representation
Parallel computation
  MapReduce
  BSP

Slide23

Parallelism via Map-Reduce

A paradigm that views input as (key, value) pairs; algorithms process these pairs in one of two stages
  Map: Perform operations on individual pairs
  Reduce: Combine all pairs with the same key
Functional programming origins
Abstracts away system specific issues
Manipulate large quantities of data

Slide24

Parallelism via Map-Reduce

Input is a sequence of key-value pairs (records)
Processing of any record is independent of the others
Need to recast algorithms and sometimes data to fit this model
  Think of structured data (Graphs!)

Slide25

Slide26

Input: Collection of documents
Output: For each word find all documents with the word

def mapper(filename, content):
    for word in content.split():
        output(word, filename)

def reducer(key, values):
    output(key, unique(values))

Example: Building inverted indexes

Slide27
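The inverted-index mapper and reducer can be run locally with a small driver that stands in for the framework. In this sketch `run_mapreduce` is a hypothetical single-process substitute for the shuffle stage, the mapper yields pairs instead of calling `output`, and lowercasing words is an added assumption; the sample documents are made up.

```python
from collections import defaultdict

def mapper(filename, content):
    """Emit one (word, filename) pair per word occurrence."""
    for word in content.split():
        yield word.lower(), filename

def reducer(key, values):
    """Collapse all filenames seen for a word into a sorted list without duplicates."""
    return key, sorted(set(values))

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # the "shuffle": group values by key
    return dict(reducer(k, vs) for k, vs in groups.items())

docs = [
    ("a.txt", "social media graphs"),
    ("b.txt", "Graphs at scale"),
    ("c.txt", "social scale"),
]
index = run_mapreduce(docs, mapper, reducer)
print(index["graphs"])
```

Each call to `mapper` sees only one document, which is the record independence that lets a real framework run the map stage in parallel.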

Map-Reducing graph data

Note: By design the mappers cannot communicate with each other.
The graph representation should be such that all information (e.g. neighborhood) needed for processing a node is locally available.
The adjacency list representation is perfectly suited.
  Key: vertex in the graph
  Value: neighbors of the vertex and their associated values

Slide28
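As an illustration of the adjacency-list records described for map-reducing graph data (key: a vertex, value: its neighbors and their associated values), the sketch below builds such records from an edge list and reverses every edge with one map-reduce pass. The edge weights, the function names, and the single-process grouping step are assumptions made for this example.

```python
from collections import defaultdict

def to_adjacency_records(edges):
    """Turn (src, dst, weight) triples into (vertex, {neighbor: weight}) records."""
    adj = defaultdict(dict)
    for src, dst, weight in edges:
        adj[src][dst] = weight
        adj.setdefault(dst, {})  # vertices without outgoing edges still get a record
    return sorted(adj.items())

def reverse_mapper(vertex, neighbors):
    """Sees a single record only: re-emit each outgoing edge keyed by its destination."""
    yield vertex, None  # marker so vertices with no incoming edges are kept
    for dst, weight in neighbors.items():
        yield dst, (vertex, weight)

def reverse_reducer(vertex, values):
    """Collect all edges that arrived at this vertex into its incoming adjacency list."""
    incoming = {}
    for item in values:
        if item is not None:
            src, weight = item
            incoming[src] = weight
    return vertex, incoming

def transpose(records):
    """Single-process stand-in for one map-reduce job over adjacency records."""
    groups = defaultdict(list)
    for vertex, neighbors in records:
        for key, value in reverse_mapper(vertex, neighbors):
            groups[key].append(value)
    return dict(reverse_reducer(k, vs) for k, vs in groups.items())

edges = [("a", "b", 0.5), ("a", "c", 0.5), ("b", "c", 1.0), ("c", "a", 1.0)]
records = to_adjacency_records(edges)
incoming = transpose(records)
print(records)
print(incoming)
```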

Computing PageRank (MapReduce)

PageRank update with dampening parameter α, where P is the transition probability matrix.
One map-reduce per iteration

Slide29
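The PageRank update formula itself did not survive in this transcript. One standard way to write it is given below, under the assumptions that α is the probability of following a link (some texts instead use α for the jump probability), that P is row-stochastic with P_{jk} the probability of moving from vertex j to vertex k, and that n is the number of vertices. The slide may have used a different convention.

```latex
r^{(t+1)} = \alpha \, P^{\top} r^{(t)} + \frac{1-\alpha}{n} \, \mathbf{1},
\qquad\text{i.e.}\qquad
r^{(t+1)}_k = \alpha \sum_{j \to k} P_{jk} \, r^{(t)}_j + \frac{1-\alpha}{n}.
```

Each application of this update is one pass over the graph, which is why one map-reduce job per iteration is needed.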

MapReduce: PageRank Iteration

Map(Key k, Value v) {
  r_old = k.rank;
  r = 0;
  foreach node n in v.getNeighbors() {
    r += p(n, k)*r_old + dampening_factor
  }
  v.rank = r;
  Emit(k, v);
}

Slide30
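Read literally, the PageRank Map() pseudocode adds the dampening term once per neighbor and multiplies by the vertex's own old rank. The sketch below instead follows the conventional formulation, which is an assumption about the intent: the mapper spreads a vertex's rank over its out-neighbors, and the reducer sums what arrives and applies the damping once. `ALPHA`, the toy graph, uniform transition probabilities, and the single-process driver are all assumptions, and every vertex is assumed to have at least one outgoing edge.

```python
from collections import defaultdict

ALPHA = 0.85  # assumed damping value; the slides do not give one

def pagerank_map(vertex, value):
    """value = (rank, out_neighbors). Emit the structure plus a rank share per neighbor."""
    rank, neighbors = value
    yield vertex, ("graph", neighbors)  # pass the adjacency list through to the reducer
    for n in neighbors:
        yield n, ("rank", rank / len(neighbors))  # p(vertex, n) = 1 / out-degree

def pagerank_reduce(vertex, values, num_vertices):
    """Sum incoming rank shares and apply the damping term once per vertex."""
    neighbors, incoming = [], 0.0
    for tag, payload in values:
        if tag == "graph":
            neighbors = payload
        else:
            incoming += payload
    rank = ALPHA * incoming + (1 - ALPHA) / num_vertices
    return vertex, (rank, neighbors)

def pagerank_iteration(records):
    """One map-reduce job, simulated in a single process."""
    groups = defaultdict(list)
    for vertex, value in records.items():
        for key, out in pagerank_map(vertex, value):
            groups[key].append(out)
    return {k: pagerank_reduce(k, vs, len(records))[1] for k, vs in groups.items()}

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical toy graph
state = {v: (1.0 / len(graph), nbrs) for v, nbrs in graph.items()}
for _ in range(50):
    state = pagerank_iteration(state)
ranks = {v: r for v, (r, _) in state.items()}
print(ranks)
```

The whole graph structure is re-emitted on every iteration, which is the locality cost of expressing an iterative graph algorithm as repeated map-reduce jobs.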

Processing Large Scale Graph Data

MapReduce is not the best model for large scale graph processing
  Simple graph concepts (PageRank, BFS, …) are not easy to program
  MapReduce does not preserve data locality in consecutive operations

Slide31

A New Paradigm to Process Large Scale Graph Data

Bulk Synchronous Parallel
  Developed in the 80s by Leslie Valiant
  Introduced by Google for graph computation
  “Pregel: a system for large-scale graph processing” (Malewicz et al, PODC 2009)

Slide32

Bulk Synchronous Parallel

Sequence of steps – “SuperSteps”
Each SuperStep S
  Execute a user defined Compute() function on every vertex in parallel
  Input to Compute(): All messages from SuperStep S – 1
  Output of Compute(): Messages to other vertices
1B vertices, 80B edges
2000 workers
Bellman-Ford: 200s
(Malewicz et al, PODC 2009)

Slide33

Why “SuperStep”?

Internally consists of three stages

Slide34

Computing PageRank (BSP version)

Compute() {
  r_old = r;
  r = 0;
  for each incoming message m {
    r += m.p * r_old + dampening_factor;
  }
  if (r - r_old < epsilon) done()
}

Slide35
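The BSP Compute() pseudocode for PageRank omits the step where a vertex sends messages, and read literally it adds the dampening term once per message. A runnable sketch under the conventional reading is shown below; this is an assumption, not the Pregel API. Each vertex sums its incoming messages, applies damping once, and sends its rank share along its out-edges for the next superstep. `ALPHA`, `EPSILON`, the `Vertex` class, the simplified stop rule, and the toy graph are hypothetical, and every vertex is assumed to have at least one outgoing edge.

```python
ALPHA = 0.85    # assumed damping value
EPSILON = 1e-9  # assumed convergence threshold

class Vertex:
    def __init__(self, name, out_edges, num_vertices):
        self.name = name
        self.out_edges = out_edges
        self.num_vertices = num_vertices
        self.rank = 1.0 / num_vertices
        self.halted = False

    def compute(self, superstep, messages):
        """Runs once per superstep; returns (target, value) messages for the next one."""
        if superstep > 0:
            new_rank = ALPHA * sum(messages) + (1 - ALPHA) / self.num_vertices
            self.halted = abs(new_rank - self.rank) < EPSILON
            self.rank = new_rank
        share = self.rank / len(self.out_edges)
        return [(target, share) for target in self.out_edges]

def run_bsp(graph, max_supersteps=200):
    """Sequential simulation of supersteps: compute, exchange messages, barrier."""
    vertices = {v: Vertex(v, nbrs, len(graph)) for v, nbrs in graph.items()}
    inbox = {v: [] for v in vertices}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in vertices}
        for vertex in vertices.values():  # conceptually executed in parallel
            for target, value in vertex.compute(superstep, inbox[vertex.name]):
                outbox[target].append(value)
        inbox = outbox  # barrier: messages become visible in the next superstep
        # Simplification: stop once every vertex has voted to halt.
        if superstep > 0 and all(v.halted for v in vertices.values()):
            break
    return {name: v.rank for name, v in vertices.items()}

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical toy graph
ranks = run_bsp(graph)
print(ranks)
```

Unlike the map-reduce formulation, the vertex state (rank and out-edges) stays in place between supersteps and only the rank shares travel as messages.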

Suggested Reading

“Truly, Madly, Deeply Parallel”, Robert Matthews, New Scientist, Feb 1996
“Pregel: a system for large-scale graph processing” (Malewicz et al, PODC 2009)

Slide36

Social Network Analysis

Retrieving information from structure
Example: Community Discovery
Many practical applications
One approach: “Edge Betweenness”
  betweenness(e) = #triangles(e)/max(e)
  iteratively prune edges with low betweenness

Slide37
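The community-discovery approach on the Social Network Analysis slide scores an edge by #triangles(e)/max(e) and repeatedly prunes low-scoring edges. The sketch below is one possible reading: max(e) is taken to be the largest number of triangles the edge could belong to, min(deg(u), deg(v)) - 1, which is an assumption since the slide does not define it. The stopping rule (a target number of components), the function names, and the toy graph are hypothetical as well.

```python
def edge_score(adj, u, v):
    """#triangles(e) / max(e), with max(e) read as min(deg(u), deg(v)) - 1 (an assumption)."""
    triangles = len(adj[u] & adj[v])  # common neighbors close a triangle over (u, v)
    possible = min(len(adj[u]), len(adj[v])) - 1
    # An edge to a degree-1 node cannot be in any triangle; score it high so it is kept.
    return triangles / possible if possible > 0 else float("inf")

def components(adj):
    """Connected components found by depth-first search."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        result.append(comp)
    return result

def discover_communities(adj, target):
    """Remove the lowest-scoring edge repeatedly until `target` components exist."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}  # work on a copy
    while len(components(adj)) < target:
        edges = [(u, v) for u in adj for v in adj[u] if u < v]
        if not edges:
            break
        u, v = min(edges, key=lambda e: edge_score(adj, *e))
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Hypothetical toy graph: two triangles joined by the bridge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
communities = discover_communities(toy, target=2)
print(communities)
```

The bridge c-d takes part in no triangle, so it scores lowest and is removed first, leaving the two triangles as communities.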

Book

Networks, Crowds, and Markets: Reasoning About a Highly Connected World
By David Easley and Jon Kleinberg
http://www.cs.cornell.edu/home/kleinber/networks-book/

Slide38

Recap …

Properties of social media
  Scale
  Immediacy
  Heterogeneity
  Duplication
  Semi-structuredness
Properties of social media graphs
  Small-worldness
  Scale free property
  Evolution models
Crawling
  API driven
Indexing & Retrieval
  Graph databases
Processing large scale social networks
  MapReduce
  Bulk Synchronous Parallel
IR from structure
  Social Network Analysis

Slide39

Social Media Related Projects

Twitter