Slide 1: Searching Web Forums
Amélie Marian, Rutgers University
Joint work with Gayatree Ganu
Slide 2: Forum Popularity and Search
Forums with the most traffic [http://rankings.big-boards.com]:
- BMW: 50K unique visitors/day, 25M posts, 0.6M members
- Filipino Community
- Subaru Impreza Owners
- Rome Total War
- Pakistan Cricket Fan Site
- Prison Talk
- Online Money Making
- ...
Despite their popularity, forums lack good search capabilities.
Slide 3: Outline
Multi-Granularity Search
- Challenges: unstructured text; background information omitted; discussion digression
- Contributions: return results at varying focus levels, allowing more or less context (CIKM 2013)
Egocentric Search
- Challenges: multiple interpersonal relations with varying importance
- Contributions: proposed a multidimensional user similarity measure; use authorship to improve personalized and keyword search
Slide 4: Hierarchical Model
- Hierarchy over objects at three searchable levels: pertinent sentences, larger posts, and entire discussions (threads)
- The hierarchy captures strength of association and the containment relationship
- Lower levels hold smaller objects
- An edge represents containment; an edge weight of 2 indicates that the child's text is repeated in the parent's text
[Figure: containment hierarchy — Dataset → Thread 1, Thread 2 → Post 1–Post 4 → Sent 1–Sent 6 → Word 1–Word 4; a word repeated under several sentences carries edge weight 2]
Slide 5: Alternate Scoring Functions — Example Textual Results
Query: hair loss. The top-4 results are drawn from the sentences labeled (A)-(H) below.
Post1: (A) Aromasin certainly caused my hair loss and the hair started falling 14 days after the chemo. However, I bought myself a rather fashionable scarf to hide the baldness. I wear it everyday, even at home. (B) Onc was shocked by my hair loss so I guess it is unusual on Aromasin. I had no other side effects from Aromasin, no hot flashes, no stomach aches or muscle pains, no headaches or nausea and none of the chemo brain.
Post2: (C) Probably everyone is sick of the hair loss questions, but I need help with this falling hair. I had my first cemotherapy on 16th September, so due in one week for the 2nd treatment. (D) Surely the hair loss can't be starting this fast..or can it?. I was running my fingers at the nape of my neck and about five came out in my fingers. Would love to hear from anyone else have AC done (Doxorubicin and Cyclophosphamide) only as I am not due to have the 3rd drug (whatever that is - 12 weekly sessions) after the 4 sessions of AC. Doctor said that different people have different side effects, so I wanted to know what you all went through. (E) Haven't noticed hair loss elsewhere, just the top hair and mainly at the back of my neck. (F) I thought the hair would start thining out between 2nd and 3rd treatment, not weeks after the 1st one. I have very curly long ringlets past my shoulders and am wondering if it would be better to just cut it short or completely shave it off. I am willing to try anything to make this stop, does anyone have a good recommendation for a shampoo, vitamins or supplements and (sadly) a good wig shop in downtown LA.
Post3: My suggestion is, don't focus so much on organic. Things can be organic and very unhealthy. I believe it when I read that nothing here is truly organic. They're allowed a certain percentage. I think 5% of the food can not be organic and it still can carry the organic label. What you want is nonprocessed, traditional foods. Food that comes from a farm or a farmer's market. Small farmers are not organic just because it is too much trouble to get the certification. Their produce is probably better than most of the industrial organic stuff. (G) Sorry Jennifer, chemotherapy and treatment followed by hair loss is extremely depressing and you cannot prepare enough for falling hair, especially hair in clumps. (H) I am on femara and hair loss is non-stop, I had full head of thick hair.
Top-4 ranked results under each scoring function:
- tf*idf: Sent (E) (4.742), Sent (A) (4.711), Sent (C) (4.696), Sent (G) (4.689)
- BM25: Sent (D) (10.570), Sent (B) (10.458), Sent (H) (10.362), Sent (E) (10.175)
- HScore: Post2 (0.131), Sent (G) (0.093), Post1 (0.092), Sent (H) (0.089)

Score_tf*idf(t, d) = (1 + log(tf_{t,d})) · log(N / df_t) · (1 / CharLength)
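As a reading aid, here is a minimal sketch of this size-normalized tf*idf; the function and argument names are illustrative, not from the slides:

```python
import math

def tfidf_score(tf_td: int, df_t: int, n_docs: int, char_length: int) -> float:
    """Size-normalized tf*idf from the slide:
    (1 + log(tf_{t,d})) * log(N / df_t) * (1 / CharLength)."""
    if tf_td == 0 or df_t == 0:
        return 0.0
    return (1 + math.log(tf_td)) * math.log(n_docs / df_t) / char_length

# Toy usage: a term occurring 3 times in a 200-character post,
# appearing in 50 of 10,000 indexed posts.
print(tfidf_score(tf_td=3, df_t=50, n_docs=10_000, char_length=200))
```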
Slide 6: Scoring Multi-Granularity Results
Goal: a unified scoring for objects at multiple granularity levels, with largely varying sizes and an inherent containment relationship.
Hierarchical Scoring Function (HScore). The score of node i with respect to search term t, where i has children j:
  HScore(i, t) = ... if i is a non-leaf node
  HScore(i, t) = 1 if i is a leaf node containing t
  HScore(i, t) = 0 if i is a leaf node not containing t
where ew_ij is the edge weight between parent i and child j, P(j) is the number of parents of j, and C(i) is the number of children of i.
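Since the non-leaf formula is elided above, here is a hedged sketch of how such a recursive HScore could be computed. The non-leaf aggregation used here (edge-weighted sum of child scores, normalized by P(j) and C(i)) is an assumption suggested by the definitions, not necessarily the paper's exact formula:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    terms: set = field(default_factory=set)      # word content, for leaves
    children: List[Tuple["Node", float]] = field(default_factory=list)  # (child j, ew_ij)
    num_parents: int = 1                         # P(node)

def hscore(node: Node, t: str) -> float:
    if not node.children:
        # Leaf cases from the slide: 1 if the leaf contains t, else 0.
        return 1.0 if t in node.terms else 0.0
    # ASSUMED non-leaf aggregation (elided on the slide): edge-weighted sum
    # of child scores, normalized by each child's parent count P(j) and
    # this node's child count C(i).
    c_i = len(node.children)
    return sum(ew * hscore(child, t) / child.num_parents
               for child, ew in node.children) / c_i

# Toy usage: a sentence containing "hair" and "loss", inside a post.
w1, w2 = Node(terms={"hair"}), Node(terms={"loss"})
sent = Node(children=[(w1, 1.0), (w2, 1.0)])
post = Node(children=[(sent, 1.0)])
print(hscore(post, "hair"))   # 0.5
```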
Slide 7: Effect of the Size Weighting Parameter on HScore
The size parameter controls the intermixing of granularities.
[Chart: number of results of each granularity in the top-20 list, as a function of the size parameter]
Slide 8: Multi-Granularity Result Generation
Sorted ordering: Post3 (2.5), Post1 (2.1), Post2 (2), Sent1 (1.6), Sent2 (1.5), Sent3 (1.4), Sent4 (1.3), Sent6 (0.4), Sent5 (0.1), Post4 (0.1), Thread1 (0.1), Thread2 (0.1).
For result size k = 4, optimizing for the sum of scores:
- Overlapping: {Post3, Post1, Post2, Sent1}, sum score = 8.2, but Sent1 (1.6) is contained in Post1
- Greedy: {Post3, Post1, Post2, Sent6}, sum score = 7.0
- Best: {Post3, Post2, Sent1, Sent2}, sum score = 7.6
33% of sample queries had overlap among at least 3 of their top-10 results.
[Figure: scored hierarchy — Thread1 (0.1), Thread2 (0.1); Post1 (2.1), Post2 (2), Post3 (2.5), Post4 (0.1); Sent1 (1.6), Sent2 (1.5), Sent3 (1.4), Sent4 (1.3), Sent5 (0.1), Sent6 (0.4)]
Slide 9: Multi-Granularity Result Generation
Goal: generate a non-overlapping result set that maximizes "quality", where quality is the sum of the scores of all results in the set. This is a maximal independent set problem (NP-hard).
Existing algorithm: Lexicographic All Independent Sets (LAIS) outputs maximal independent sets with polynomial delay, in a specific order.
Slide 10: Optimal Algorithm for k-set (OAKS)
Fix the node ordering by decreasing score. OAKS is efficient when, as is typical, k << n:
- Start with the first k-sized independent set, i.e., the greedy set
- Branch from nodes preceding the k-th node of the set, and check whether the branch is maximal
- Complete new k-sized maximal sets and save them in a priority queue
- Reject sets from the priority queue whose starting node occurs after the current best set's k-th node
(A sketch of this branch-and-complete loop follows.)
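A hedged Python sketch of the loop described above, over an explicit conflict (overlap) graph. It keeps the greedy start, the branching on skipped predecessors, the prefix-maximality test, and the priority queue, but omits the paper's pruning rule, so take it as an illustration rather than the authors' exact algorithm; the data-structure names are assumptions:

```python
import heapq

def oaks(scores, conflicts, k):
    """scores: node -> score; conflicts: node -> set of overlapping nodes.
    Assumes a symmetric conflict relation and that a size-k set exists."""
    order = sorted(scores, key=scores.get, reverse=True)   # decreasing score
    pos = {v: i for i, v in enumerate(order)}

    def total(s):
        return sum(scores[v] for v in s)

    def complete(prefix, start):
        """Greedily grow `prefix` to size k using nodes from `start` onward."""
        sel = list(prefix)
        for v in order[start:]:
            if len(sel) == k:
                break
            if all(v not in conflicts[u] for u in sel):
                sel.append(v)
        return tuple(sel) if len(sel) == k else None

    best = complete([], 0)                  # greedy first independent set
    heap, seen = [(-total(best), best)], {best}
    while heap:
        _, cur = heapq.heappop(heap)
        if total(cur) > total(best):
            best = cur
        for v in order[:pos[cur[-1]]]:      # branch on skipped predecessors
            if v in cur:
                continue
            p = pos[v]
            prefix = [u for u in cur if pos[u] < p and v not in conflicts[u]] + [v]
            # reject branches that are not maximal on the first p+1 nodes
            if any(u not in prefix and all(u not in conflicts[w] for w in prefix)
                   for u in order[:p]):
                continue
            cand = complete(prefix, p + 1)
            if cand and cand not in seen:
                seen.add(cand)
                heapq.heappush(heap, (-total(cand), cand))
    return best
```

On the running example (scores as in Slide 8, conflicts given by containment), the greedy start is {Post3, Post1, Post2, Sent6}, and branching from Sent1 yields the better set {Post3, Post2, Sent1, Sent2}.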
Slide 11: OAKS
Sorted ordering: Post3 (2.5), Post1 (2.1), Post2 (2), Sent1 (1.6), Sent2 (1.5), Sent3 (1.4), Sent4 (1.3), Sent6 (0.4), Sent5 (0.1), Post4 (0.1), Thread1 (0.1), Thread2 (0.1).
For k = 4, Greedy = {Post3, Post1, Post2, Sent6}, sum score = 7.0.
In the 1st iteration:
- {Post3, Post2, Sent1, Sent2}, sum score = 7.6
- {Post3, Post1, Sent3, Sent4}, sum score = 7.3
OAKS branches from the nodes before Sent6, i.e., Sent1, Sent2, Sent3, Sent4. Branching from Sent1 removes everything adjacent to Sent1, leaving {Post3, Post2, Sent1}. Is this maximal on the first 4 nodes? Yes! It is then completed to size k and inserted into the queue: {Post3, Post2, Sent1, Sent2}.
[Figure: same scored hierarchy as in Slide 8]
Slide 12: Evaluating the OAKS Algorithm
OAKS runtime: small overhead for a practical k (= 20); scoring time = 0.96 sec, OAKS result-set generation time = 0.09 sec.
Comparing LAIS and OAKS on 100 relatively infrequent queries, bucketed by corpus frequency (20-30, 30-40, ...):

Word Frequency | Sets Evaluated: LAIS | Sets Evaluated: OAKS | Run Time (sec): LAIS | Run Time (sec): OAKS
20-30 | 57.59 | 8.12 | 0.78 | 0.12
30-40 | 102.07 | 5.06 | 7.88 | 0.01
40-50 | 158.80 | 5.88 | 26.94 | 0.01
50-60 | 410.18 | 6.30 | 82.20 | 0.02
60-70 | 716.40 | 5.26 | 77.61 | 0.01
70-80 | 896.59 | 8.30 | 143.33 | 0.04

OAKS is very efficient; the time it requires depends on k. OAKS improves over the greedy sum score in 31% of queries @top20.
Slide 13: Dataset and Evaluation Setting
Data collected from breastcancer.org: 31K threads, 301K posts, 1.8M unique sentences, 46K keywords.
18 sample queries, e.g., broccoli, herceptin side effects, emotional meltdown, scarf or wig, shampoo recommendation, ...
Experimental search strategies (top-20 results):
- Mixed-Hierarchy: optimal mixed-granularity results
- Posts-Hierarchy: hierarchical scoring of posts only
- Posts-tf*idf: existing traditional search
- Mixed-BM25
Slide 14: Evaluating Perceived Relevance
Graded relevance scale: exactly relevant answer; relevant but too broad; relevant but too narrow; partially relevant answer; not relevant.
Crowd-sourced relevance judgments using Mechanical Turk:
- Over 7 annotations per result
- Quality control via honey-pot questions
- EM algorithm for consensus (a minimal consensus sketch follows)
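A minimal sketch of EM-style consensus over worker labels, as a simplified single-accuracy-per-worker variant in the spirit of Dawid-Skene; the paper's exact model may differ:

```python
from collections import defaultdict

def em_consensus(annotations, n_iter=20):
    """annotations: list of (item, worker, label) triples."""
    workers = {w for _, w, _ in annotations}
    acc = {w: 0.8 for w in workers}          # initial worker accuracies

    def e_step():
        # E-step: vote for each item's label, weighting workers by accuracy
        votes = defaultdict(lambda: defaultdict(float))
        for item, w, label in annotations:
            votes[item][label] += acc[w]
        return {item: max(v, key=v.get) for item, v in votes.items()}

    consensus = e_step()
    for _ in range(n_iter):
        # M-step: re-estimate each worker's accuracy against the consensus
        hits, counts = defaultdict(float), defaultdict(int)
        for item, w, label in annotations:
            counts[w] += 1
            hits[w] += (label == consensus[item])
        acc = {w: (hits[w] + 1) / (counts[w] + 2) for w in workers}  # smoothed
        new = e_step()
        if new == consensus:
            break
        consensus = new
    return consensus

# Toy usage: three workers label one result on the graded relevance scale.
ann = [("r1", "w1", "Rel Broad"), ("r1", "w2", "Rel Broad"), ("r1", "w3", "Not Rel")]
print(em_consensus(ann))   # {'r1': 'Rel Broad'}
```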
Mixed-Hierarchy relevance labels for the query "shampoo recommendation", by rank and size parameter α:

Rank | α=0.1 | α=0.2 | α=0.3 | α=0.4
1 | Rel Broad | Rel Broad | Rel Broad | Partial
2 | Rel Broad | Rel Broad | Rel Broad | Partial
3 | Rel Broad | Rel Broad | Rel Broad | Partial
4 | Rel Broad | Rel Broad | Exactly Rel | Rel Broad
5 | Rel Broad | Rel Broad | Exactly Rel | Partial
6 | Exactly Rel | Exactly Rel | Rel Narrow | Rel Narrow
7 | Rel Broad | Exactly Rel | Rel Narrow | Not Rel
8 | Rel Broad | Rel Broad | Not Rel | Partial
9 | Rel Broad | Rel Narrow | Rel Broad | Partial
10 | Exactly Rel | Rel Narrow | Partial | Rel Narrow
11 | Rel Broad | Rel Broad | Exactly Rel | Not Rel
12 | Rel Broad | Rel Broad | Exactly Rel | Not Rel
13 | Rel Broad | Exactly Rel | Partial | Not Rel
14 | Not Rel | Exactly Rel | Rel Narrow | Partial
15 | Not Rel | Exactly Rel | Not Rel | Rel Broad
16 | Not Rel | Rel Broad | Rel Narrow | Not Rel
17 | Exactly Rel | Rel Broad | Exactly Rel | Not Rel
18 | Exactly Rel | Exactly Rel | Partial | Partial
19 | Not Rel | Rel Broad | Rel Narrow | Not Rel
20 | Not Rel | Exactly Rel | Partial | Not Rel
Slide 15: Evaluating Perceived Relevance — Mean Average Precision

Search System | MAP@ | α=0.1 | α=0.2 | α=0.3 | α=0.4
Mixed-Hierarchy | 10 | 0.98 | 0.98 | 0.90 | 0.70
Mixed-Hierarchy | 20 | 0.97 | 0.95 | 0.85 | 0.66
Posts-Hierarchy | 10 | 0.76 | 0.75 | 0.77 | 0.78
Posts-Hierarchy | 20 | 0.72 | 0.71 | 0.73 | 0.75
Posts-tf*idf | 10 | 0.76 | 0.73 | 0.76 | 0.76
Posts-tf*idf | 20 | 0.74 | 0.72 | 0.72 | 0.73
Mixed-BM25 (b=0.75, k=1.2) | 10 | 0.55
Mixed-BM25 (b=0.75, k=1.2) | 20 | 0.54

Mixed-Hierarchy clearly outperforms the posts-only methods: users perceive mixed-granularity results as more relevant. (A sketch of how AP@k/MAP is computed follows.)
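A minimal sketch of AP@k and MAP over ranked judgments, under the assumption that the graded labels are binarized (everything except Not Rel counted as relevant); the slides do not specify the exact binarization:

```python
def average_precision(rels, k=20):
    """AP@k over a ranked list of binary relevance flags."""
    hits, total = 0, 0.0
    for i, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def mean_average_precision(runs, k=20):
    """Mean of AP@k over per-query ranked judgment lists."""
    return sum(average_precision(r, k) for r in runs) / len(runs)

# Toy usage: one query, binarized labels for the top-5 results.
labels = ["Rel Broad", "Exactly Rel", "Not Rel", "Partial", "Not Rel"]
rels = [l != "Not Rel" for l in labels]
print(mean_average_precision([rels], k=5))   # ~0.917
```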
Slide 16: Egocentric Search
The previous technique did not take the authorship of posts into account.
- Some forum participants are similar, sharing the same topics of interest or the same needs, though not necessarily at the same time: rank similar authors' posts higher for personalized search.
- Some forum participants are experts, prolific and knowledgeable: expert opinions carry more weight in keyword search.
An author score can enhance both personalized and keyword search.
Slide 17: Author Score
Forum participants have several reasons to be linked. We build a multidimensional heterogeneous graph over authors incorporating many relations; but users assign different importance to different relations.
[Figure: multidimensional author graph — authors auth_1 ... auth_n linked via topics (weights W(a,t)), queries (weights W(q,t)), direct author-author edges (W(a1,a2)), user-profile attributes (location, age, cancer stage, treatment, ...), co-participation, and explicit references]
Slide 18: Contributions
The critical problem in leveraging authorship for search is incorporating multiple user relations whose varying importance is learned egocentrically from user behavior.
Outline:
- Author score computation using a multidimensional graph
- Personalized prediction of user interactions: the authors most likely to provide answers
- Re-ranking keyword search results using author expertise
Slide 19: Multi-Dimensional Random Walks (MRW)
Random Walks (RW) for finding the most influential users:
  P_{t+1} = M × P_t, iterated until convergence
  M = α(A + D) + (1 − α)E, with relation matrix A, matrix D handling dangling nodes, uniform matrix E, and α usually set to 0.85
Rooted RW for node similarity: teleport back to the root node with probability (1 − α); this computes the similarity of every node w.r.t. the root node.
Multidimensional RW for heterogeneous networks: the transition matrix is computed as
  A = λ_1 A_1 + λ_2 A_2 + ... + λ_n A_n, where Σ_i λ_i = 1 and all λ_i ≥ 0
Egocentric weights, for root node r:
  λ_i(r) = Σ_{m ∈ A_i} ew_{A_i}(r, m) / Σ_{A_k} Σ_{j ∈ A_k} ew_{A_k}(r, j)
Example: nodes a, b, c with edges a→b (weight 2) and b→c (weight 3); columns hold each node's outgoing edges, and c is dangling:

A = | 0 0 0 |      D = | 0 0 0.33 |      E = | 0.33 0.33 0.33 |
    | 2 0 0 |          | 0 0 0.33 |          | 0.33 0.33 0.33 |
    | 0 3 0 |          | 0 0 0.33 |          | 0.33 0.33 0.33 |
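A numpy sketch tying these pieces together: egocentric weights from the root's edge mass per relation, a combined transition matrix, and a rooted power iteration. Conventions (columns hold outgoing edges; dangling columns get uniform mass, as in D above) follow the example matrices; the helper names and the second relation are illustrative:

```python
import numpy as np

def egocentric_weights(relations, root):
    """lambda_i(r): the root's total edge weight in relation A_i,
    normalized over all relations (as in the formula above)."""
    mass = np.array([Ai[:, root].sum() for Ai in relations], dtype=float)
    return mass / mass.sum()

def rooted_mrw(relations, weights, root, alpha=0.85, tol=1e-10):
    """Rooted multidimensional random walk: combine relations with the
    egocentric weights, then iterate P <- alpha*M*P + (1-alpha)*e_root."""
    A = sum(w * Ai for w, Ai in zip(weights, relations))
    n = A.shape[0]
    col = A.sum(axis=0)
    M = np.where(col > 0, A / np.where(col > 0, col, 1.0), 0.0)  # column-stochastic
    M = np.where(col > 0, M, 1.0 / n)     # D: uniform mass for dangling columns
    e = np.zeros(n); e[root] = 1.0        # rooted teleport vector
    p = np.full(n, 1.0 / n)
    while True:
        p_next = alpha * (M @ p) + (1 - alpha) * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Toy usage on the example graph above (nodes 0=a, 1=b, 2=c) plus a second relation.
A1 = np.array([[0., 0., 0.], [2., 0., 0.], [0., 3., 0.]])   # a->b (2), b->c (3)
A2 = np.array([[0., 1., 0.], [0., 0., 0.], [1., 0., 0.]])   # illustrative relation
lam = egocentric_weights([A1, A2], root=0)                   # [2/3, 1/3]
print(rooted_mrw([A1, A2], lam, root=0))
```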
Slide 20: Personalized Answer Search
Link prediction by leveraging user similarities: given participant behavior, find users similar to the one asking a question, and predict who will respond to it. Similarities are learned from the first 90% of threads (training).
Relations used: topics covered in text, co-participation in threads, signature profiles, proximity of posts.
MRW similarity is compared with baselines:
- Single relations
- PathSim: an existing approach for heterogeneous networks, with predefined paths of fixed length and no dynamic choice of path
Link prediction enables suggesting which threads or which users to follow.
Slide 21: Predicting User Interactions
[Chart: MAP for link prediction across methods]
The multidimensional RW has the best prediction performance.
Slide 22: Predicting User Interactions
Leverage the content of the initial post to find users who are experts on the question. TopicScore is computed as the cosine similarity between an author's history and the initial post:
  UserScore = β · MRWScore + (1 − β) · TopicScore

MAP by neighborhood size (β = 1 is purely MRW, β = 0 is purely topical expertise; parentheses give the % improvement over purely MRW; a sketch of TopicScore and the blend follows):

Neighbors | β=0 | β=0.1 | β=0.2 | β=1
Top 5 | 0.52 | 0.64 (8%) | 0.61 (4%) | 0.59
Top 10 | 0.31 | 0.50 (8%) | 0.49 (5%) | 0.46
Top 15 | 0.24 | 0.43 (8%) | 0.42 (6%) | 0.40
Top 20 | 0.20 | 0.39 (6%) | 0.39 (7%) | 0.37
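A minimal sketch of TopicScore and the UserScore blend; the bag-of-words representation is an assumption (the paper's exact text representation may differ), and β = 0.1 is chosen because it performs best in the table above:

```python
import math
from collections import Counter

def topic_score(author_history: str, initial_post: str) -> float:
    """Cosine similarity between an author's past text and the initial post
    (bag-of-words sketch)."""
    a = Counter(author_history.lower().split())
    b = Counter(initial_post.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def user_score(mrw_score: float, topic: float, beta: float = 0.1) -> float:
    """UserScore = beta * MRWScore + (1 - beta) * TopicScore."""
    return beta * mrw_score + (1 - beta) * topic

# Toy usage: blend a user's MRW similarity with topical expertise.
t = topic_score("hair loss after chemo and aromasin", "help with hair loss")
print(user_score(mrw_score=0.3, topic=t))
```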
Slide 23: Enhanced Keyword Search
A non-rooted RW finds the most influential (expert) users. Re-rank the top-k results of IR scoring using author scores:
  Final score of post = ω · IR_score_λ + (1 − ω) · Authority_score
where IR_score_λ is the posts-only tf*idf score with size parameter λ.
Re-ranking search results with the author score yields higher MAP relevance, a 4-5% improvement. (A sketch of the blend follows.)
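A minimal sketch of the re-ranking blend; ω's value and the helper names are illustrative, not from the slides:

```python
def final_score(ir_score: float, authority: float, omega: float = 0.8) -> float:
    """Final score of post = omega * IR_score + (1 - omega) * Authority_score."""
    return omega * ir_score + (1 - omega) * authority

def rerank(results, authority, omega=0.8):
    """Re-rank (post_id, ir_score) pairs by the blended score."""
    return sorted(results,
                  key=lambda r: final_score(r[1], authority.get(r[0], 0.0), omega),
                  reverse=True)

# Toy usage: similar IR scores, very different author authority.
print(rerank([("p1", 0.90), ("p2", 0.85)], {"p1": 0.1, "p2": 0.9}))
# -> [('p2', 0.85), ('p1', 0.90)]
```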
Slide 24: PERSEUS (Patient Emotion and stRucture SEarch USer tool) — Conclusions
- Designed a hierarchical model and score that generate search results at several granularities of web-forum objects
- Proposed the OAKS algorithm for the best non-overlapping result set
- Conducted extensive user studies showing that a mixed collection of granularities yields better relevance than posts-only results
- Combined the multiple relations linking users to compute similarities
- Enhanced search results using multidimensional author similarity
Future directions:
- Multi-granular search on web pages, blogs, emails; dynamic focus-level selection
- Search in and out of context over dialogue, interviews, Q&A
- Optimal result-set selection for targeted advertising and result diversification
- Time-sensitive recommendations: changing friendships, progressive search needs
Slide 25: Thank you!
Slide 26: Why Random Walks?
[Figure: rooted RW examples (a)-(d) — small graphs with root r, target t and intermediate nodes a, b, c, u; annotated scores (0.4, 0.26, 0.16, 0.16, 0.6) show how extra paths and competing neighbors change a node's rooted-RW score]
[Figure: multi-dimensional rooted RW example — two relations A_1, A_2 with differing edge weights over roots r_1, r_2 and nodes b, c; score(b w.r.t. r_1) = 0.072, score(c w.r.t. r_1) = 0.096, score(b w.r.t. r_2) = 0.097, score(c w.r.t. r_2) = 0.066]
Slide 27: LAIS
Sorted ordering: Post3 (2.5), Post1 (2.1), Post2 (2), Sent1 (1.6), Sent2 (1.5), Sent3 (1.4), Sent4 (1.3), Sent6 (0.4), Sent5 (0.1), Post4 (0.1), Thread1 (0.1), Thread2 (0.1).
Greedy = {Post3, Post1, Post2, Sent6}. In the 1st iteration LAIS outputs:
- {Post3, Post2, Sent1, Sent2, Sent6}
- {Post3, Post1, Sent3, Sent4, Sent6}
- {Post1, Post2, Sent6, Sent5}
- {Post3, Post1, Post2, Post4}
- {Post3, Sent6, Thread1}
- {Post1, Post2, Thread2}
[Figure: same scored hierarchy as in Slide 8, with edge weights of 2 on repeated text]
Slide 28: Current Search Functionality at breastcancer.org
- Filtering criteria: keyword search, member search
- Ranking based on date
- Posts-only results