Slide1
Information Retrieval Methods for Social Media
600.466
Slide2
Social Media?
Slide3
Slide4
Properties of Social Media: Scale
Twitter (Chirp 2010)
More than 100M user accounts
More than 600M search queries a day
55M tweets a day
Facebook
More than 400M active users
More than 25 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month
Slide5
Properties of Social Media: Immediacy
Need to share breaking news
Search: Content vs. Peer recommendation
Slide6
Properties of Social Media: Duplication
Duplication of content
Blogs: Copy-Paste
Twitter: “Re-tweet”
Groups: Cross-posting
Email: Signature lines, Inline Replies
Slide7
Properties of Social Media: Semi-structuredness
Informal but structured
Informal != low quality (e.g. Wikipedia)
Structure
Metadata
Connectivity
Slide8
Suggested Reading
“Towards a PeopleWeb”, Raghu Ramakrishnan & Andrew Tomkins, IEEE Computer, Aug 2007
“Important properties of users and objects will move from being tied to individual Web sites to being globally available. The conjunction of a global object model with portable user context will lead to a richer content structure and introduce significant shifts in online communities and information discovery.”
Slide9
Properties of Social Media
Scale
Immediacy
Heterogeneity
Duplication
Semi-structuredness
Slide10
Properties of Social Media Graphs
Small-world property
Six degrees of separation
Facebook: 5.73 (Bunyan, 2009)
MS Messenger: ~7 (Leskovec & Horvitz, 2007)
Mathematically
Low average path length
High clustering coefficient
Slide11
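As a rough illustration of the two quantities named above, the sketch below computes the average shortest-path length and the average local clustering coefficient of a small undirected graph stored as an adjacency dict. The toy graph and the function names are assumptions made for illustration; they do not come from the slides.

```python
from collections import deque

def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs (BFS from each node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def clustering_coefficient(adj):
    """Average over nodes of (links among neighbors) / (possible links among neighbors)."""
    coeffs = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)  # fewer than two neighbors: no pair to close
            continue
        nbrs = list(nbrs)
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbrs[j] in adj[nbrs[i]])
        coeffs.append(2.0 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs)

# Hypothetical toy graph: a triangle (a, b, c) with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
apl = average_path_length(toy)
cc = clustering_coefficient(toy)
```

A small-world graph, in the sense used above, is one where the first number stays small while the second stays high as the graph grows.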
Network Evolution and Path Size
Slide12
Properties of Social Media Graphs
Power law degree distribution (asymptotically)
Property of most real world networks
Existence of “hubs”
Scale free networks
Slide13
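One common way to write the power-law property is shown below. The symbols γ (the exponent) and C (a normalising constant) are notation assumed here, not taken from the slides; P(k) denotes the fraction of nodes with degree k.

```latex
P(k) \;\approx\; C\,k^{-\gamma} \qquad \text{for large } k
```

Because this tail decays polynomially rather than exponentially, a few nodes of very high degree (the “hubs”) remain likely even in large networks.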
Probabilistic Modeling of Networks
Erdős-Rényi model
Choose a pair of nodes uniformly at random and add an edge.
G(n, p)
Not scale free (small average path length but low clustering coefficient)
Scale free networks don’t evolve by chance
Slide14
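A minimal sketch of the G(n, p) random graph model mentioned above, in which each of the possible node pairs is linked independently with probability p. The function name, the adjacency-dict representation, and the parameter values are assumptions for illustration.

```python
import itertools
import random

def erdos_renyi(n, p, seed=None):
    """Return an undirected G(n, p) graph as an adjacency dict of sets."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < p:      # include each possible edge with probability p
            adj[u].add(v)
            adj[v].add(u)
    return adj

# Hypothetical parameters: 200 nodes, edge probability 0.05.
graph = erdos_renyi(200, 0.05, seed=0)
degrees = [len(nbrs) for nbrs in graph.values()]
mean_degree = sum(degrees) / len(degrees)
```

In such a graph the degrees concentrate around p(n - 1), so very high-degree hubs are rare, which is one way to see why this model does not produce the scale-free behaviour described earlier.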
Probabilistic Modeling of Networks
Preferential Attachment (Barabási and Albert, 1999)
Rich become richer
Stochastic process: using Pólya’s urn
Slide15
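The “rich become richer” process above can be sketched with an urn-like list in which every node appears once per edge endpoint, so a uniform draw from the list is a degree-proportional draw. This is a simplified illustration, not the exact procedure from the cited paper; the function name, the fully connected seed graph, and the parameters are assumptions.

```python
import random

def preferential_attachment(n, m, seed=None):
    """Grow a graph to n nodes; each new node links to m distinct existing
    nodes chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Assumed starting point: a fully connected core of m + 1 nodes.
    adj = {v: set(u for u in range(m + 1) if u != v) for v in range(m + 1)}
    # Urn: one entry per edge endpoint, so high-degree nodes are drawn more often.
    urn = [v for v, nbrs in adj.items() for _ in nbrs]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(urn))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            urn.extend([new, t])   # both endpoints gain one degree
    return adj

graph = preferential_attachment(500, 2, seed=1)
degrees = sorted((len(nbrs) for nbrs in graph.values()), reverse=True)
```

Inspecting `degrees` typically shows a handful of early nodes with far more links than the rest, which is the hub structure the slides associate with scale-free networks.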
Why model?
Study network evolution, degeneration
Develop algorithms
Detect communities
Who are the movers and shakers?
Detect diffusion of ideas across networks
Detect anomalies
Slide16
Crawling Social Networks
HTML (Slashdot)
RSS/Atom feeds (blogs)
API driven (Twitter, Facebook, …)
Data liberation
Slide17
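For the RSS/Atom route listed above, a feed is just XML, so a crawler can extract posts with a standard XML parser. The sketch below parses an inline RSS 2.0 sample using Python's standard library; the sample feed, its URLs, and the function name are made up for illustration, and a real crawler would first fetch the feed over HTTP.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real crawler would download this from a blog.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

def parse_rss_items(xml_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

items = parse_rss_items(SAMPLE_RSS)
```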
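import java.util.List;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

// Log in with the account credentials and fetch the friends timeline.
Twitter twitter = new TwitterFactory().getInstance(twitterID, twitterPassword);
List<Status> statuses = twitter.getFriendsTimeline();
System.out.println("Showing friends timeline.");
for (Status status : statuses) {
    // Print each status as "<user name>:<text>".
    System.out.println(status.getUser().getName() + ":" + status.getText());
}
http://twitter4j.org
Slide18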
Storage and Indexing
Graph stores can be designed more efficiently than traditional RDBMS or flat files (document IR)
A family of “triple stores” or graph databases (#NoSQL movement)
Neo4J
CouchDB
Hypertable
…
Slide19
Data is becoming more and more connected
(Eifrem, OSCON 2009)
Slide20
Social Media Graphs
(Eifrem, OSCON 2009)
Slide21
Social Media Graphs: Representation
Nodes
Relationship between nodes
Properties on both
Storing in flat files vs. graph databases
Neo4J, disk based solution
Works well for sizes up to a few billion (single JVM)
Slide22
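The representation above (nodes, relationships between nodes, properties on both) can be sketched as a small in-memory structure. This is an illustrative toy, not Neo4J's API; all class, method, and property names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    source: str
    target: str
    rel_type: str
    properties: dict = field(default_factory=dict)

class PropertyGraph:
    """Tiny in-memory property graph: nodes, typed relationships, properties on both."""

    def __init__(self):
        self.nodes = {}
        self.outgoing = {}   # node_id -> list of Relationship objects

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)
        self.outgoing.setdefault(node_id, [])

    def add_relationship(self, source, target, rel_type, **properties):
        rel = Relationship(source, target, rel_type, properties)
        self.outgoing[source].append(rel)
        return rel

    def neighbors(self, node_id, rel_type=None):
        """Targets of outgoing relationships, optionally filtered by type."""
        return [r.target for r in self.outgoing[node_id]
                if rel_type is None or r.rel_type == rel_type]

# Hypothetical social-media data.
g = PropertyGraph()
g.add_node("alice", kind="user")
g.add_node("bob", kind="user")
g.add_node("post1", kind="tweet", text="hello")
g.add_relationship("alice", "bob", "FOLLOWS", since=2009)
g.add_relationship("alice", "post1", "POSTED")
```

Keeping each node's relationships next to the node is what makes neighborhood lookups cheap compared with scanning a flat file of edges.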
Processing Large Scale Graph Data
Better representation
Parallel computation
MapReduce
BSP
Slide23
Parallelism via Map-Reduce
A paradigm that views input as (key, value) pairs, with algorithms processing these pairs in one of two stages
Map: Perform operations on individual pairs
Reduce: Combine all pairs with the same key
Functional programming origins
Abstracts away system specific issues
Manipulate large quantities of data
Slide24
Parallelism via Map-Reduce
Input is a sequence of key-value pairs (records)
Processing of any record is independent of the others
Need to recast algorithms and sometimes data to fit this model
Think of structured data (Graphs!)
Slide25
Slide26
Example: Building inverted indexes
Input: Collection of documents
Output: For each word find all documents with the word
def mapper(filename, content):
    # Emit one (word, filename) pair for every word in the document.
    for word in content.split():
        yield word, filename
def reducer(key, values):
    # Emit the word together with the distinct documents that contain it.
    yield key, sorted(set(values))
Slide27
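To see the inverted-index example above run end to end, the sketch below adds a single-process stand-in for the MapReduce runtime: it applies the mapper to each record, groups the emitted pairs by key (the shuffle), and applies the reducer to each group. The driver `run_mapreduce` and the two sample documents are assumptions for illustration; a real framework would distribute these stages across machines.

```python
from collections import defaultdict

def mapper(filename, content):
    # Emit one (word, filename) pair for every word in the document.
    for word in content.split():
        yield word, filename

def reducer(key, values):
    # Emit the word together with the distinct documents that contain it.
    yield key, sorted(set(values))

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)        # the "shuffle" step
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results

# Hypothetical two-document collection.
docs = [("a.txt", "social media graphs"), ("b.txt", "social network analysis")]
index = run_mapreduce(docs, mapper, reducer)
```

Each mapper call only looks at its own record, which is the independence property that lets the map stage run in parallel.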
Map-Reducing graph data
Note: By design the mappers cannot communicate with each other.
The graph representation should be such that all information (e.g. neighborhood) needed for processing a node is locally available.
The adjacency list representation is perfectly suited.
Key: vertex in the graph
Value: neighbors of the vertex and their associated values
Slide28
Computing PageRank (MapReduce)
PageRank update with dampening parameter α, where P is the transition probability matrix.
One map-reduce per iteration
Slide29
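One common way to write the PageRank update is given below. It assumes that α is the probability of following a link, that P is row-stochastic (entry P_ij is the probability of moving from node i to node j), and that n is the number of nodes; these conventions are assumptions here, and the slides may use a different but equivalent form.

```latex
\mathbf{r}^{(t+1)} \;=\; \alpha\, P^{\top} \mathbf{r}^{(t)} \;+\; \frac{1-\alpha}{n}\,\mathbf{1}
```

Each application of this update corresponds to one iteration, which is why one map-reduce pass per iteration is needed.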
MapReduce: PageRank Iteration
Map(Key k, Value v)
{
    r_old = k.rank;
    r = 0;
    foreach node n in v.getNeighbors() {
        r += p(n, k)*r_old + dampening_factor
    }
    v.rank = r;
    Emit(k, v);
}
Slide30
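A runnable sketch of one PageRank iteration as a map-reduce pass follows. It uses the usual two-stage formulation, in which mappers emit rank shares along out-edges and reducers sum them, so it differs in structure from the single Map() function above. The damping value 0.85, the uniform transition probabilities, the tagged-tuple message format, and the toy graph are all assumptions for illustration; dangling nodes are not handled.

```python
from collections import defaultdict

ALPHA = 0.85  # assumed damping value; the slides do not fix one

def pagerank_map(node, value):
    """value = (rank, out_neighbors). Emit the adjacency list and rank shares."""
    rank, neighbors = value
    yield node, ("links", neighbors)              # carry graph structure forward
    for n in neighbors:
        yield n, ("share", rank / len(neighbors))  # uniform transition probability

def pagerank_reduce(node, values, num_nodes):
    """Sum incoming shares and apply the damped update."""
    neighbors, incoming = [], 0.0
    for tag, payload in values:
        if tag == "links":
            neighbors = payload
        else:
            incoming += payload
    rank = ALPHA * incoming + (1 - ALPHA) / num_nodes
    return node, (rank, neighbors)

def pagerank_iteration(graph):
    """One map-reduce pass over {node: (rank, out_neighbors)}."""
    groups = defaultdict(list)
    for node, value in graph.items():
        for key, out in pagerank_map(node, value):
            groups[key].append(out)
    return dict(pagerank_reduce(k, v, len(graph)) for k, v in groups.items())

# Hypothetical 3-node graph in which every node has at least one out-edge.
graph = {"a": (1 / 3, ["b", "c"]), "b": (1 / 3, ["c"]), "c": (1 / 3, ["a"])}
for _ in range(50):
    graph = pagerank_iteration(graph)
ranks = {node: rank for node, (rank, _) in graph.items()}
```

Note that the adjacency list has to be re-emitted on every pass so the next iteration still knows the graph, which is one source of the overhead discussed for MapReduce on graph data.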
Processing Large Scale Graph Data
MapReduce is not the best model for large scale graph processing
Simple graph concepts (PageRank, BFS, …) are not easy to program
MapReduce does not preserve data locality in consecutive operations
Slide31
A New Paradigm to Process Large Scale Graph Data
Bulk Synchronous Parallel
Developed in the 80s by Leslie Valiant
Introduced by Google for graph computation
“Pregel: a system for large-scale graph processing” (Malewicz et al, PODC 2009)
Slide32
Bulk Synchronous Parallel
Sequence of steps – “SuperSteps”
Each SuperStep S
Execute a user defined Compute() function on every vertex in parallel
Input to Compute(): All messages from SuperStep S – 1
Output of Compute(): Messages to other vertices
1B vertices, 80B edges
2000 workers
Bellman-Ford: 200s
(Malewicz et al, PODC 2009)
Slide33
Why “SuperStep”?
Internally consists of three stages: local computation, communication, barrier synchronization
Slide34
Computing PageRank (BSP version)
Compute()
{
    r_old = r;
    r = 0;
    for each incoming message m {
        r += m.p * r_old + dampening_factor;
    }
    if (|r - r_old| < epsilon) done()
}
Slide35
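The superstep structure can be made concrete with a single-process simulation: in each superstep every vertex runs its compute step on the messages sent in the previous superstep, then sends new messages, and a barrier delivers them all at once. This is a simplified sketch rather than Pregel's actual API. The damping value, the threshold, the global stopping rule (stop when no vertex changes by more than the threshold, instead of per-vertex halting), and the toy graph are assumptions.

```python
ALPHA = 0.85      # assumed damping value
EPSILON = 1e-10   # assumed convergence threshold

def bsp_pagerank(out_edges, max_supersteps=200):
    """Simulate BSP-style supersteps for PageRank in one process."""
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    # Superstep 0: every vertex sends its initial rank share to its neighbors.
    inbox = {v: [] for v in out_edges}
    for v, nbrs in out_edges.items():
        for u in nbrs:
            inbox[u].append(rank[v] / len(nbrs))
    for superstep in range(1, max_supersteps + 1):
        outbox = {v: [] for v in out_edges}
        active = False
        for v, nbrs in out_edges.items():        # Compute() on each vertex
            r_old = rank[v]
            rank[v] = ALPHA * sum(inbox[v]) + (1 - ALPHA) / n
            if abs(rank[v] - r_old) >= EPSILON:
                active = True
            for u in nbrs:                        # messages for the next superstep
                outbox[u].append(rank[v] / len(nbrs))
        inbox = outbox                            # barrier: deliver all messages
        if not active:
            return rank, superstep
    return rank, max_supersteps

# Hypothetical 3-node graph in which every node has at least one out-edge.
ranks, steps = bsp_pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Unlike the map-reduce formulation, the graph structure stays in place between supersteps and only the rank messages move.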
Suggested Reading
“Truly, Madly, Deeply Parallel”, Robert Matthews, New Scientist, Feb 1996
“Pregel: a system for large-scale graph processing” (Malewicz et al, PODC 2009)
Slide36
Social Network Analysis
Retrieving information from structure
Example: Community Discovery
Many practical applications
One approach: “Edge Betweenness”
betweenness(e) = #triangles(e)/max(e)
iteratively prune edges with low betweenness
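A minimal sketch of the pruning procedure above, using the triangle-based score as written. It assumes that max(e) means the largest number of triangles the edge could take part in, taken here as min(deg(u), deg(v)) - 1, and that edges for which this is zero are never pruned; both are assumptions, as are the function names and the toy graph. Note that this score is not the shortest-path edge betweenness of Girvan and Newman, where edges with high values are the ones removed.

```python
def edge_scores(adj):
    """Score per edge: shared neighbors / maximum possible shared neighbors."""
    scores = {}
    for u in adj:
        for v in adj[u]:
            if u < v:                                  # visit each edge once
                triangles = len(adj[u] & adj[v])
                possible = min(len(adj[u]), len(adj[v])) - 1
                scores[(u, v)] = triangles / possible if possible > 0 else float("inf")
    return scores

def components(adj):
    """Connected components via depth-first search."""
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, part = [start], set()
        while stack:
            u = stack.pop()
            if u not in part:
                part.add(u)
                stack.extend(adj[u] - part)
        seen |= part
        parts.append(part)
    return parts

def split_communities(adj, target=2):
    """Remove the lowest-scoring edge repeatedly until `target` components exist."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}    # work on a copy
    while len(components(adj)) < target:
        scores = edge_scores(adj)
        if not scores:
            break
        u, v = min(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Hypothetical graph: two triangles joined by a single bridge edge (2, 3).
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
communities = split_communities(toy)
```

On the toy graph the bridge edge takes part in no triangle, so it is pruned first and the two triangles emerge as separate communities.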
Slide37
Book
Networks, Crowds, and Markets: Reasoning About a Highly Connected World
By David Easley and Jon Kleinberg
http://www.cs.cornell.edu/home/kleinber/networks-book/
Slide38
Recap …
Properties of social media
Scale
Immediacy
Heterogeneity
Duplication
Semi-structuredness
Properties of social media graphs
Small-worldness
Scale free property
Evolution models
Crawling
API driven
Indexing & Retrieval
Graph databases
Processing large scale social networks
MapReduce
Bulk Synchronous Parallel
IR from structure
Social Network Analysis
Slide39
Social Media Related Projects
Twitter