
Slide1

Searching Web Forums

Amélie Marian, Rutgers University

Joint work with Gayatree Ganu

Slide2

Forum Popularity and Search

Forums with most traffic [http://rankings.big-boards.com]:

- BMW: 50K unique visitors/day, 25M posts, 0.6M members
- Filipino Community
- Subaru Impreza Owners
- Rome Total War
- …
- Pakistan Cricket Fan Site
- Prison Talk
- Online Money Making

Despite their popularity, forums lack good search capabilities.

Slide3

Outline

Multi-Granularity Search

- Challenges: unstructured text; background information omitted; discussion digression
- Contributions: return results at varying focus levels, allowing more or less context (CIKM 2013)

Egocentric Search

- Challenges: multiple interpersonal relations with varying importance
- Contributions: proposed a multidimensional user similarity measure; use authorship to improve personalized and keyword search

Slide4

Hierarchical Model

Hierarchy over objects at three searchable levels: pertinent sentences, larger posts, and entire discussions or threads.

- The hierarchy captures strength of association and containment relationships
- Lower levels hold smaller objects
- An edge represents containment
- An edge weight of 2 indicates that the child's text is repeated in the parent's text

(A code sketch of the model follows the figure.)

[Figure: example hierarchy. A Dataset node sits over Thread 1 and Thread 2; threads contain Posts 1-4; posts contain Sentences 1-6; sentences contain words. Repeated-text edges carry weight 2.]
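To make the model concrete, here is a minimal Python sketch of the hierarchy as a weighted DAG. The thread/post/sentence layout is inferred from the running example on later slides; the weight-2 call at the end is purely illustrative, since the figure does not show which text is repeated.

```python
from collections import defaultdict

class Hierarchy:
    """DAG over forum objects: dataset > threads > posts > sentences.
    An edge weight of 2 marks child text repeated in the parent."""

    def __init__(self):
        self.children = defaultdict(dict)  # parent -> {child: edge weight}
        self.parents = defaultdict(set)    # child  -> {its parents}

    def add_edge(self, parent, child, weight=1):
        self.children[parent][child] = weight
        self.parents[child].add(parent)

h = Hierarchy()
layout = {"Dataset": ["Thread1", "Thread2"],
          "Thread1": ["Post1", "Post2"], "Thread2": ["Post3", "Post4"],
          "Post1": ["Sent1", "Sent2"], "Post2": ["Sent3", "Sent4"],
          "Post3": ["Sent5"], "Post4": ["Sent6"]}
for parent, kids in layout.items():
    for kid in kids:
        h.add_edge(parent, kid)
# If Sent4's text were quoted again inside Post2, its edge weight becomes 2:
h.add_edge("Post2", "Sent4", weight=2)
```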

Slide5

Alternate Scoring Functions

Example textual results. Query: hair loss. The candidate posts:

Post1: (A) Aromasin certainly caused my hair loss and the hair started falling 14 days after the chemo. However, I bought myself a rather fashionable scarf to hide the baldness. I wear it everyday, even at home. (B) Onc was shocked by my hair loss so I guess it is unusual on Aromasin. I had no other side effects from Aromasin, no hot flashes, no stomach aches or muscle pains, no headaches or nausea and none of the chemo brain.

Post2: (C) Probably everyone is sick of the hair loss questions, but I need help with this falling hair. I had my first cemotherapy on 16th September, so due in one week for the 2nd treatment. (D) Surely the hair loss can’t be starting this fast..or can it? I was running my fingers at the nape of my neck and about five came out in my fingers. Would love to hear from anyone else have AC done (Doxorubicin and Cyclophosphamide) only as I am not due to have the 3rd drug (whatever that is - 12 weekly sessions) after the 4 sessions of AC. Doctor said that different people have different side effects, so I wanted to know what you all went through. (E) Haven’t noticed hair loss elsewhere, just the top hair and mainly at the back of my neck. (F) I thought the hair would start thining out between 2nd and 3rd treatment, not weeks after the 1st one. I have very curly long ringlets past my shoulders and am wondering if it would be better to just cut it short or completely shave it off. I am willing to try anything to make this stop, does anyone have a good recommendation for a shampoo, vitamins or supplements and (sadly) a good wig shop in downtown LA.

Post3: My suggestion is, don’t focus so much on organic. Things can be organic and very unhealthy. I believe it when I read that nothing here is truly organic. They’re allowed a certain percentage. I think 5% of the food can not be organic and it still can carry the organic label. What you want is nonprocessed, traditional foods. Food that comes from a farm or a farmer’s market. Small farmers are not organic just because it is too much trouble to get the certification. Their produce is probably better than most of the industrial organic stuff. (G) Sorry Jennifer, chemotherapy and treatment followed by hair loss is extremely depressing and you cannot prepare enough for falling hair, especially hair in clumps. (H) I am on femara and hair loss is non-stop, I had full head of thick hair.

Top-4 results under each scoring function:

- tf*idf: Sent (E) (4.742), Sent (A) (4.711), Sent (C) (4.696), Sent (G) (4.689)
- BM25: Sent (D) (10.570), Sent (B) (10.458), Sent (H) (10.362), Sent (E) (10.175)
- HScore: Post2 (0.131), Sent (G) (0.093), Post1 (0.092), Sent (H) (0.089)

Score_tf*idf(t, d) = (1 + log(tf_{t,d})) × log(N / df_t) × 1/CharLength(d)
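A minimal sketch of this character-length-normalized tf*idf; the helper names are illustrative, and df_t and N would come from a precomputed index:

```python
import math

def tfidf_score(term, text, df_t, N):
    """(1 + log tf_{t,d}) * log(N / df_t) * 1/CharLength(d); 0 if t absent."""
    tf = text.lower().split().count(term.lower())  # naive whitespace tokens
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(N / df_t) / len(text)

# Short, focused sentences beat long posts once scores are length-normalized:
sent_e = "Haven't noticed hair loss elsewhere, just the top hair."
post_3 = "My suggestion is, don't focus so much on organic. " * 8 + "Hair matters."
print(tfidf_score("hair", sent_e, df_t=50, N=1000))  # higher
print(tfidf_score("hair", post_3, df_t=50, N=1000))  # lower: same idf, longer doc
```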

Slide6

Scoring Multi-Granularity Results

Goal: unified scoring for objects at multiple granularity levels, with largely varying sizes and an inherent containment relationship.

Hierarchical Scoring Function (HScore)

Score for node i with respect to search term t, where i has children j:

HScore(i, t) = (1 / C(i)^α) × Σ_j (ew_ij / P(j)) × HScore(j, t)   … if i is a non-leaf node
             = 1   … if i is a leaf node containing t
             = 0   … if i is a leaf node not containing t

where:
- ew_ij = edge weight between parent i and child j
- P(j) = number of parents of j
- C(i) = number of children of i
- α = size-weighting parameter (next slide)
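A minimal recursive sketch of HScore, assuming the recurrence reconstructed above (the paper's exact normalization may differ); sentences serve as leaves here, with word sets standing in for the word layer:

```python
def hscore(node, term, children, parents, words, alpha=0.3):
    """children: node -> {child: edge weight}; parents: node -> set of parents."""
    kids = children.get(node, {})
    if not kids:  # leaf: 1 if it contains the term, else 0
        return 1.0 if term in words.get(node, set()) else 0.0
    total = sum(ew / len(parents[c]) * hscore(c, term, children, parents, words, alpha)
                for c, ew in kids.items())
    return total / len(kids) ** alpha  # C(i)^alpha dampens large fan-out

children = {"Thread1": {"Post1": 1, "Post2": 1},
            "Post1": {"Sent1": 1, "Sent2": 1}, "Post2": {"Sent3": 1, "Sent4": 1}}
parents = {"Post1": {"Thread1"}, "Post2": {"Thread1"}, "Sent1": {"Post1"},
           "Sent2": {"Post1"}, "Sent3": {"Post2"}, "Sent4": {"Post2"}}
words = {"Sent1": {"hair", "loss"}, "Sent2": {"scarf"},
         "Sent3": {"hair"}, "Sent4": {"chemo"}}
print(hscore("Post1", "hair", children, parents, words))  # mixes child evidence
```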

Slide7

Effect of Size-Weighting Parameter α on HScore

Parameter α controls the intermixing of granularities.

[Chart: number of results of each granularity in the top-20 list as the size parameter α varies.]

Slide8

Multi-Granularity Result Generation

Sorted ordering: Post3 (2.5), Post1 (2.1), Post2 (2), Sent1 (1.6), Sent2 (1.5), Sent3 (1.4), Sent4 (1.3), Sent6 (0.4), Sent5 (0.1), Post4 (0.1), Thread1 (0.1), Thread2 (0.1)

For result size k = 4, optimizing for the sum of scores:

- With overlap: {Post3, Post1, Post2, Sent1}, sum score = 8.2 (but Sent1 is contained in Post1, so its 1.6 is double-counted)
- Greedy: {Post3, Post1, Post2, Sent6}, sum score = 7.0
- Best: {Post3, Post2, Sent1, Sent2}, sum score = 7.6

33% of sample queries had overlap among at least 3 of the top-10 results.

[Figure: the example hierarchy, annotated with the node scores listed above.]

Slide9

Multi-Granularity Result Generation

Goal: generate a non-overlapping result set maximizing "quality".

- Quality = sum of the scores of all results in the set
- This is the maximal independent set problem (NP-hard)

Existing algorithm: Lexicographic All Independent Sets (LAIS) outputs maximal independent sets with polynomial delay, in a specific order.

Slide10

Optimal Algorithm for k-set (OAKS)

Fix the node ordering by decreasing score.

Efficient OAKS algorithm (typically k << n):

1. Start with the first k-sized independent set, i.e., the greedy set.
2. Branch from nodes preceding the k-th node of the set; check if the branch is maximal.
3. Find new k-sized maximal sets; save them in a priority queue.
4. Reject sets from the priority queue whose starting node occurs after the current best set's k-th node.
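A simplified Python sketch of this branch-and-search idea (hedged: it omits the paper's pruning rule from step 4, so it explores more sets than OAKS would). `conflicts` encodes ancestor/descendant overlap between nodes:

```python
import heapq

def greedy_k(nodes, conflicts, banned, k):
    """First k-sized independent set in score order, skipping banned nodes."""
    chosen, blocked = [], set(banned)
    for n in nodes:
        if n not in blocked:
            chosen.append(n)
            blocked |= conflicts[n] | {n}
            if len(chosen) == k:
                break
    return chosen

def oaks(nodes, scores, conflicts, k):
    """Best-scoring k-sized non-overlapping set (simplified OAKS-style search)."""
    total = lambda s: sum(scores[n] for n in s)
    best = greedy_k(nodes, conflicts, set(), k)
    heap = [(-total(best), best, frozenset())]
    seen = {frozenset(best)}
    while heap:  # the paper also rejects sets starting after best's k-th node
        neg, cur, banned = heapq.heappop(heap)
        if -neg > total(best):
            best = cur
        for n in cur:  # branch: forbid one member, rebuild greedily
            cand = greedy_k(nodes, conflicts, banned | {n}, k)
            key = frozenset(cand)
            if len(cand) == k and key not in seen:
                seen.add(key)
                heapq.heappush(heap, (-total(cand), cand, banned | {n}))
    return best

scores = {"Post3": 2.5, "Post1": 2.1, "Post2": 2.0, "Sent1": 1.6, "Sent2": 1.5,
          "Sent3": 1.4, "Sent4": 1.3, "Sent6": 0.4}
nodes = sorted(scores, key=scores.get, reverse=True)
conflicts = {"Post1": {"Sent1", "Sent2"}, "Post2": {"Sent3", "Sent4"},
             "Sent1": {"Post1"}, "Sent2": {"Post1"}, "Sent3": {"Post2"},
             "Sent4": {"Post2"}, "Post3": set(), "Sent6": set()}
print(oaks(nodes, scores, conflicts, k=4))  # ['Post3', 'Post2', 'Sent1', 'Sent2']
```

On the running example this returns the best set with sum score 7.6, improving over the greedy 7.0.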

Slide11

OAKS

Sorted ordering: Post3 (2.5), Post1 (2.1), Post2 (2), Sent1 (1.6), Sent2 (1.5), Sent3 (1.4), Sent4 (1.3), Sent6 (0.4), Sent5 (0.1), Post4 (0.1), Thread1 (0.1), Thread2 (0.1)

For k = 4, Greedy = {Post3, Post1, Post2, Sent6}, SumScore = 7.0

OAKS branches from nodes before Sent6, i.e., Sent1, Sent2, Sent3, Sent4.

Branch from Sent1, removing all nodes adjacent to Sent1: {Post3, Post2, Sent1}. Maximal on the first 4 nodes? Yes. Then complete to size k and insert in the queue: {Post3, Post2, Sent1, Sent2}.

In the 1st iteration:

- {Post3, Post2, Sent1, Sent2}, SumScore = 7.6
- {Post3, Post1, Sent3, Sent4}, SumScore = 7.3

[Figure: the example hierarchy with node scores, as on Slide 8.]

Slide12

Evaluating OAKS Algorithm

Comparing OAKS runtime: small overhead for practical k (= 20).
Scoring time = 0.96 sec; OAKS result set generation time = 0.09 sec.

Word Frequency    Sets Evaluated           Run Time (sec)
                  LAIS       OAKS          LAIS      OAKS
20-30             57.59      8.12          0.78      0.12
30-40             102.07     5.06          7.88      0.01
40-50             158.80     5.88          26.94     0.01
50-60             410.18     6.30          82.20     0.02
60-70             716.40     5.26          77.61     0.01
70-80             896.59     8.30          143.33    0.04

Comparing LAIS and OAKS: 100 relatively infrequent queries, with corpus frequency in the ranges 20-30, 30-40, …

OAKS is very efficient; the time required by OAKS depends on k. OAKS improves over the Greedy SumScore for 31% of queries @top20.

Slide13

Dataset and Evaluation Setting

Data collected from breastcancer.org: 31K threads, 301K posts, 1.8M unique sentences, 46K keywords.

18 sample queries, e.g., broccoli, herceptin side effects, emotional meltdown, scarf or wig, shampoo recommendation, …

Experimental search strategies (top-20 results):

- Mixed-Hierarchy: optimal mixed-granularity result
- Posts-Hierarchy: hierarchical scoring of posts only
- Posts-tf*idf: existing traditional search
- Mixed-BM25

Slide14

Evaluating Perceived Relevance

Graded relevance scale: Exactly relevant answer; Relevant but too broad; Relevant but too narrow; Partially relevant answer; Not relevant.

Crowd-sourced relevance using Mechanical Turk:

- Over 7 annotations per result
- Quality control: honey-pot questions
- EM algorithm for consensus (sketched below)
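The slides name EM for consensus without specifying the variant; here is a minimal one-coin Dawid-Skene-style sketch (an assumption, with honey-pot filtering omitted), which alternates between inferring label posteriors and re-estimating each worker's accuracy:

```python
from collections import defaultdict

def em_consensus(votes, labels, iters=20):
    """votes: list of (worker, item, label); returns item -> consensus label."""
    acc = defaultdict(lambda: 0.7)  # initial accuracy guess per worker
    post = {}
    for _ in range(iters):
        # E-step: posterior over labels for each item, given worker accuracies
        post = {}
        for w, item, lab in votes:
            p = post.setdefault(item, dict.fromkeys(labels, 1.0))
            for l in labels:
                p[l] *= acc[w] if l == lab else (1 - acc[w]) / (len(labels) - 1)
        for p in post.values():
            z = sum(p.values())
            for l in p:
                p[l] /= z
        # M-step: worker accuracy = expected agreement with the posteriors
        agree, n = defaultdict(float), defaultdict(int)
        for w, item, lab in votes:
            agree[w] += post[item][lab]
            n[w] += 1
        for w in n:
            acc[w] = min(0.99, max(0.01, agree[w] / n[w]))
    return {item: max(p, key=p.get) for item, p in post.items()}

votes = [("w1", "r1", "Exactly Rel"), ("w2", "r1", "Exactly Rel"),
         ("w3", "r1", "Partial"),
         ("w1", "r2", "Not Rel"), ("w2", "r2", "Not Rel"), ("w3", "r2", "Rel Broad")]
labels = ["Exactly Rel", "Rel Broad", "Rel Narrow", "Partial", "Not Rel"]
print(em_consensus(votes, labels))  # w3, the chronic dissenter, is down-weighted
```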

Example: Mixed-Hierarchy relevance judgments for the query "shampoo recommendation", by rank and size parameter α:

Rank   α = 0.1       α = 0.2       α = 0.3       α = 0.4
1      Rel Broad     Rel Broad     Rel Broad     Partial
2      Rel Broad     Rel Broad     Rel Broad     Partial
3      Rel Broad     Rel Broad     Rel Broad     Partial
4      Rel Broad     Rel Broad     Exactly Rel   Rel Broad
5      Rel Broad     Rel Broad     Exactly Rel   Partial
6      Exactly Rel   Exactly Rel   Rel Narrow    Rel Narrow
7      Rel Broad     Exactly Rel   Rel Narrow    Not Rel
8      Rel Broad     Rel Broad     Not Rel       Partial
9      Rel Broad     Rel Narrow    Rel Broad     Partial
10     Exactly Rel   Rel Narrow    Partial       Rel Narrow
11     Rel Broad     Rel Broad     Exactly Rel   Not Rel
12     Rel Broad     Rel Broad     Exactly Rel   Not Rel
13     Rel Broad     Exactly Rel   Partial       Not Rel
14     Not Rel       Exactly Rel   Rel Narrow    Partial
15     Not Rel       Exactly Rel   Not Rel       Rel Broad
16     Not Rel       Rel Broad     Rel Narrow    Not Rel
17     Exactly Rel   Rel Broad     Exactly Rel   Not Rel
18     Exactly Rel   Exactly Rel   Partial       Partial
19     Not Rel       Rel Broad     Rel Narrow    Not Rel
20     Not Rel       Exactly Rel   Partial       Not Rel

Slide15

Evaluating Perceived Relevance

Mean Average Precision (MAP)

Search System      @     α = 0.1   α = 0.2   α = 0.3   α = 0.4
Mixed-Hierarchy    10    0.98      0.98      0.90      0.70
                   20    0.97      0.95      0.85      0.66
Posts-Hierarchy    10    0.76      0.75      0.77      0.78
                   20    0.72      0.71      0.73      0.75
Posts-tf*idf       10    0.76      0.73      0.76      0.76
                   20    0.74      0.72      0.72      0.73
Mixed-BM25         10    0.55      (b = 0.75, k = 1.2)
                   20    0.54

Clearly, Mixed-Hierarchy outperforms the posts-only methods: users perceive higher relevance for mixed-granularity results.
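A small sketch of the average-precision computation behind these numbers, assuming the graded judgments are binarized so that every grade except Not Rel counts as relevant (that assumption reproduces the 0.98 for the α = 0.1 column of the previous slide; the table averages this over all 18 queries):

```python
def average_precision(flags):
    """flags: 1/0 relevance judgments in rank order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(flags, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at each relevant rank
    return total / hits if hits else 0.0

# alpha = 0.1 column: ranks 1-13, 17, 18 relevant; 14-16, 19, 20 not
flags = [1] * 13 + [0, 0, 0, 1, 1, 0, 0]
print(round(average_precision(flags), 2))  # 0.98
```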

Slide16

EgoCentric Search

The previous technique did not take the authorship of posts into account.

- Some forum participants are similar, sharing the same topics of interest or the same needs, though not necessarily at the same time: rank similar authors' posts higher for personalized search
- Some forum participants are experts, prolific and knowledgeable: expert opinions carry more weight in keyword search
- An author score can enhance both personalized and keyword search

Slide17

Author Score

Forum participants have several reasons to be linked. Build a multidimensional heterogeneous graph over authors, incorporating many relations. But users assign different importance to different relations.

[Figure: multidimensional author graph. Topics link to authors with weights W(a,t) and to queries with weights W(q,t); author-author edges W(a1,a2) come from user profiles (location, age, cancer stage, treatment), co-participation, and explicit references.]

Slide18

Contributions

Critical problem for leveraging authorship for search: incorporating multiple user relations with varying importance, learned egocentrically from user behavior.

Outline:

- Author score computation using a multidimensional graph
- Personalized predictions of user interactions: authors most likely to provide answers
- Re-ranking keyword search results using author expertise

Slide19

Multi-Dimensional Random Walks (MRW)

Random Walks (RW) for finding the most influential users:

- P_{t+1} = M × P_t, iterated until convergence
- M = α(A + D) + (1 − α)E, with relation matrix A, matrix D for dangling nodes, uniform matrix E, and α usually set to 0.85

Rooted RW for node similarity:

- Teleport back to the root node with probability (1 − α)
- Computes the similarity of all nodes w.r.t. the root node

Multidimensional RW for heterogeneous networks (a sketch follows the example below):

- Transition matrix computed as A = λ_1 × A_1 + λ_2 × A_2 + ... + λ_n × A_n, where Σ_i λ_i = 1 and all λ_i ≥ 0
- Egocentric weights for root node r: λ_i(r) = Σ_{m ∈ A_i} ew_{A_i}(r, m) / Σ_{A_k} Σ_{j ∈ A_k} ew_{A_k}(r, j)

Example: chain graph a →(2) b →(3) c.

A =      a   b   c
    a    0   0   0
    b    2   0   0
    c    0   3   0

D =      a   b   c
    a    0   0   0.33
    b    0   0   0.33
    c    0   0   0.33
(c is a dangling node, so its column is uniform)

E =      a     b     c
    a    0.33  0.33  0.33
    b    0.33  0.33  0.33
    c    0.33  0.33  0.33
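A minimal numpy sketch of the rooted multidimensional walk defined above (dangling columns teleport to the root; the relation names in comments are illustrative, and the egocentric weights here use all edge weight incident to the root):

```python
import numpy as np

def rooted_mrw(relations, root, alpha=0.85, iters=100):
    """relations: list of (n x n) edge-weight matrices A_i, columns = sources."""
    n = relations[0].shape[0]
    # Egocentric lambda_i(root): root's edge weight in A_i over the total
    w = np.array([A[:, root].sum() + A[root, :].sum() for A in relations])
    lam = w / w.sum() if w.sum() > 0 else np.full(len(relations), 1 / len(relations))
    A = sum(l * Ai for l, Ai in zip(lam, relations))
    # Column-normalize; dangling columns teleport to the root
    M = np.zeros((n, n))
    for j in range(n):
        col = A[:, j]
        M[:, j] = col / col.sum() if col.sum() > 0 else np.eye(n)[:, root]
    p = np.eye(n)[:, root]  # start at the root
    r = np.eye(n)[:, root]  # teleport distribution: back to the root
    for _ in range(iters):
        p = alpha * (M @ p) + (1 - alpha) * r
    return p  # similarity of every node w.r.t. the root

# Two relations over 3 users (user 0 is the root), with different edge weights:
A1 = np.array([[0, 0, 0], [2., 0, 0], [0, 4., 0]])  # e.g., co-participation
A2 = np.array([[0, 0, 0], [1., 0, 0], [0, 2., 0]])  # e.g., profile similarity
print(rooted_mrw([A1, A2], root=0))
```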

Slide20

Personalized Answer Search

Link prediction by leveraging user similarities:

- Given participant behavior, find users similar to the user asking a question; predict who will respond to the question
- Learn similarities from the first 90% of threads (training)
- Relations used: topics covered in text, co-participation in threads, signature profiles, proximity of posts
- MRW similarity compared with baselines: single relations; PathSim, an existing approach for heterogeneous networks (predefined paths of fixed length, no dynamic choice of path)

Link prediction enables suggesting which threads or which users to follow.

Slide21

Predicting User Interactions

MAP for link prediction: the multidimensional RW has the best prediction performance.

Slide22

Predicting User Interactions

Leverage the content of the initial post to find users who are experts on the question:

- TopicScore: cosine similarity between the author's history and the initial post
- UserScore = β × MRWScore + (1 − β) × TopicScore (a small sketch follows the table)

MAP by neighborhood size; β = 0 is purely topical expertise, β = 1 is purely MRW, and parentheses give the improvement over purely MRW:

Neighbors   β = 0   β = 0.1     β = 0.2     β = 1
Top 5       0.52    0.64 (8%)   0.61 (4%)   0.59
Top 10      0.31    0.50 (8%)   0.49 (5%)   0.46
Top 15      0.24    0.43 (8%)   0.42 (6%)   0.40
Top 20      0.20    0.39 (6%)   0.39 (7%)   0.37
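A small sketch of the UserScore blend; the bag-of-words cosine below is a toy stand-in for the paper's text representation:

```python
import math
from collections import Counter

def cosine(a, b):
    """Toy bag-of-words cosine similarity between two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def user_score(mrw_score, author_history, initial_post, beta=0.1):
    topic = cosine(author_history, initial_post)  # expertise on the question
    return beta * mrw_score + (1 - beta) * topic

post = "shampoo recommendation for hair loss during chemo"
history = "I reviewed shampoo brands and wigs after chemo hair loss"
print(user_score(mrw_score=0.4, author_history=history, initial_post=post))
```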

Slide23

Enhanced Keyword Search

Non-rooted RW to find the most influential expert users. Re-rank the top-k results of IR scoring using author scores:

Final score of post = ω × IR_score_λ + (1 − ω) × Authority_score

(posts only, tf*idf scoring with size parameter λ)

Re-ranking search results with the author score yields higher MAP relevance: a 4-5% improvement.

Slide24

Patient Emotion and stRucture Search USer tool (PERSEUS): Conclusions

- Designed a hierarchical model and score that allow generating search results at several granularities of web forum objects
- Proposed the OAKS algorithm for the best non-overlapping result set
- Conducted extensive user studies showing that a mixed collection of granularities yields better relevance than posts-only results
- Combined multiple relations linking users for computing similarities
- Enhanced search results using multidimensional author similarity

Future directions:

- Multi-granular search on web pages, blogs, emails; dynamic focus-level selection
- Search in and out of context over dialogue, interviews, Q&A
- Optimal result-set selection for targeted advertising, result diversification
- Time-sensitive recommendations: changing friendships, progressive search needs

Slide25

Thank you!

Slide26

Why Random Walks?

Rooted RW examples:

[Figure: rooted RW examples (a)-(d) on progressively larger graphs rooted at r, showing similarity scores of the other nodes w.r.t. r (e.g., 0.4, 0.26, 0.16 in panel (d)).]

Multi-dimensional rooted RW example:

[Figure: two relations A1 and A2 over roots r1, r2 and nodes b, c, with differing edge weights.]

score(b w.r.t. r1) = 0.072
score(c w.r.t. r1) = 0.096
score(b w.r.t. r2) = 0.097
score(c w.r.t. r2) = 0.066

Slide27

LAIS

Sorted ordering: Post3 (2.5), Post1 (2.1), Post2 (2), Sent1 (1.6), Sent2 (1.5), Sent3 (1.4), Sent4 (1.3), Sent6 (0.4), Sent5 (0.1), Post4 (0.1), Thread1 (0.1), Thread2 (0.1)

Greedy = {Post3, Post1, Post2, Sent6}. In the 1st iteration, LAIS enumerates the maximal independent sets:

- {Post3, Post2, Sent1, Sent2, Sent6}
- {Post3, Post1, Sent3, Sent4, Sent6}
- {Post1, Post2, Sent6, Sent5}
- {Post3, Post1, Post2, Post4}
- {Post3, Sent6, Thread1}
- {Post1, Post2, Thread2}

[Figure: the example hierarchy with node scores, as on Slide 8; two repeated-text edges carry weight 2.]

Slide28

Current Search Functionality at breastcancer.org

- Filtering criteria: keyword search, member search
- Ranking based on date
- Posts-only results