

Presentation Transcript

Slide1

SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement

Raju Balakrishnan, Subbarao Kambhampati
Arizona State University

Funding from

Slide2

Deep Web Integration Scenario

[Figure: Deep Web integration scenario. The mediator forwards the user query to multiple web databases (Web DB) and collects the answer tuples each one returns.]

Millions of sources containing structured tuples

Uncontrolled collection of redundant information

Search engines have nominal access. We don’t Google for a “Honda Civic 2008 Tampa”

Slide3

Why Another Ranking?

Example Query: “Godfather Trilogy” on Google Base

Importance: the search looks for titles matching the query, yet none of the results is the classic Godfather.

Rankings are oblivious to result Importance & Trustworthiness

Trustworthiness (bait and switch)

The titles and cover image match exactly.

Prices are low. Amazing deal!

But when you proceed to checkout you realize that the product is a different one! (Or when you open the mail package, if you are really unlucky.)

Slide4

Agenda

Problem Definition
SourceRank: Ranking Based on Agreement
Computing Agreement
Computing Source Collusion
System Implementation and Results

Slide5

Source Selection in the Deep Web

Problem: Given a user query, select a subset of sources to provide important and trustworthy answers.

Surface web search combines link analysis with query relevance to consider the trustworthiness and relevance of the results. Unfortunately, deep web records do not have hyperlinks.

Slide6

Source Agreement

Observations:
Many sources return answers to the same query.
Comparison of the semantics of the answers is facilitated by the structure of the tuples.

Idea: Compute the importance and trustworthiness of sources based on the agreement of the answers returned by different sources.

Slide7

Agreement Implies Trust & Importance.

Important results are likely to be returned by a large number of sources. e.g. For the query “Godfather” hundreds of sources return the classic “The Godfather” while a few sources return the little known movie “Little Godfather”.

Two independent sources are not likely to agree upon corrupt/untrustworthy answers. e.g. a wrong author for a book (say, the Godfather author given as "Nino Rota") would not be agreed upon by other sources. As we know, truth is one (or a few), but lies are many.

Slide8

Which tire?

Agreement is not just for search.

Slide9

Agreement Implies Trust & Relevance

The probability that two independently selected irrelevant/false tuples agree is very small, because false answers are spread over a large space of possibilities.

The probability that two independently picked relevant and true tuples agree is much higher, because the relevant true answers are few.
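One illustrative way to see this (a sketch under a simplifying uniform-selection assumption; U_f and U_r are symbols introduced here, not notation from the talk): if false tuples are drawn independently from the large space U_f of possible wrong answers, and relevant true tuples from the much smaller space U_r, then

```latex
% Illustrative sketch only; U_f (possible false answers) and U_r (relevant,
% true answers) are assumed symbols, with |U_r| << |U_f|.
P(\text{agree} \mid \text{false}) \approx \frac{1}{|U_f|}
\;\ll\;
\frac{1}{|U_r|} \approx P(\text{agree} \mid \text{relevant})
```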

Slide10

Method: Sampling based Agreement

Link semantics from Si to Sj with weight w: Si acknowledges a fraction w of the tuples in Sj. Since the weight is a fraction, the links are asymmetric.

A smoothing term induces smoothing links to account for the unseen samples; R1 and R2 are the result sets of S1 and S2.

Agreement is computed using keyword queries. Partial titles of movies/books are used as queries. The mean agreement over all the queries is used as the final agreement.
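A minimal sketch of how these sampling-based agreement links might be assembled, assuming query samples have already been sent to every source; the function name agreement_graph, the result_agreement callback, and the value of the smoothing constant are illustrative assumptions, not the exact formulation from the talk.

```python
# Hedged sketch: build agreement-weighted links between sources from sampled
# query results. `result_agreement` and the smoothing constant are assumptions.
from collections import defaultdict

SMOOTHING = 0.1  # assumed small constant inducing smoothing links for unseen samples

def agreement_graph(results_by_source, result_agreement):
    """results_by_source: {source: {query: [answer tuples]}}.
    Returns edge weights w[(si, sj)]: the fraction of sj's tuples acknowledged
    by si, averaged over the sampling queries (so the graph is asymmetric)."""
    sources = list(results_by_source)
    weights = defaultdict(float)
    for si in sources:
        for sj in sources:
            if si == sj:
                continue
            fractions = []
            for q, rj in results_by_source[sj].items():
                ri = results_by_source[si].get(q, [])
                if rj:  # fraction of sj's result set that si agrees with
                    fractions.append(result_agreement(ri, rj) / len(rj))
            mean_agreement = sum(fractions) / len(fractions) if fractions else 0.0
            # combine observed agreement with a smoothing term for unseen samples
            weights[(si, sj)] = SMOOTHING + (1 - SMOOTHING) * mean_agreement
    return weights
```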

Slide11

Method: Calculating SourceRank

How can I use the agreement graph for improved search?

The source graph is viewed as a Markov chain, with the edge weights as transition probabilities between the sources.

The prestige of sources, accounting for the transitive nature of agreement, may be computed with a Markov random walk.

SourceRank is the stationary visit probability of this random walk on the database vertex.

This static SourceRank may be combined with a query-specific source-relevance measure for the final ranking.
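A minimal power-iteration sketch of that random walk, assuming edge weights like those in the sketch above (whose smoothing links keep the chain well connected); the convergence threshold and renormalization step are assumptions, not the reported implementation.

```python
# Hedged sketch: stationary visit probabilities of a random walk over the
# agreement graph. Row-normalizing the outgoing weights of each source turns
# them into Markov-chain transition probabilities.
def source_rank(sources, weights, iterations=100, tol=1e-9):
    n = len(sources)
    rank = {s: 1.0 / n for s in sources}
    out_total = {s: sum(weights.get((s, t), 0.0) for t in sources if t != s)
                 for s in sources}
    for _ in range(iterations):
        new_rank = {}
        for t in sources:
            new_rank[t] = sum(
                rank[s] * weights.get((s, t), 0.0) / out_total[s]
                for s in sources if s != t and out_total[s] > 0
            )
        # renormalize so the scores stay a probability distribution
        z = sum(new_rank.values()) or 1.0
        new_rank = {s: v / z for s, v in new_rank.items()}
        if max(abs(new_rank[s] - rank[s]) for s in sources) < tol:
            rank = new_rank
            break
        rank = new_rank
    return rank  # static score per source; combine with query-specific relevance
```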

Slide12

Computing Agreement is Hard

Computing semantic agreement between two records is the record linkage problem, and is known to be hard.

Semantically identical entities may be represented syntactically differently by two databases (non-common domains).

Source 1: "Godfather, The: The Coppola Restoration" | James Caan / Marlon Brando more | $9.99
Source 2: "The Godfather - The Coppola Restoration Giftset [Blu-ray]" | Marlon Brando, Al Pacino | 13.99 USD

Example “Godfather” tuples from two web sources. Note that titles and castings are denoted differently.

Slide13

Method: Computing Agreement

Agreement computation has three levels:

Comparing attribute values: Soft-TFIDF with Jaro-Winkler as the similarity measure is used.

Comparing records: We do not assume a predefined schema matching; this is an instance of a bipartite matching problem. Optimal matching is expensive, so greedy matching is used: each value is greedily matched against the most similar value in the other record. Attribute importance is weighted by IDF (e.g. matching titles ("Godfather") is more informative than matching formats ("paperback")).

Comparing result sets: Using the record similarities computed above, result-set similarities are computed with the same greedy approach.
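A rough sketch of the greedy, IDF-weighted record matching described above; the token-overlap token_sim below is only a stand-in for Soft-TFIDF with Jaro-Winkler, and the idf dictionary is an illustrative input.

```python
# Hedged sketch of greedy record matching: each attribute value is matched to
# the most similar unmatched value in the other record, weighted by IDF.
def token_sim(a, b):
    # stand-in similarity (token overlap), not Soft-TFIDF/Jaro-Winkler
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def record_similarity(rec1, rec2, idf, sim=token_sim):
    """rec1, rec2: lists of attribute-value strings (no schema matching assumed).
    idf: dict mapping a value to an importance weight (illustrative)."""
    unmatched = list(rec2)
    score, weight_total = 0.0, 0.0
    for v1 in rec1:
        if not unmatched:
            break
        # greedily pick the most similar remaining value in the other record
        best = max(unmatched, key=lambda v2: sim(v1, v2))
        unmatched.remove(best)
        w = idf.get(v1, 1.0)
        score += w * sim(v1, best)
        weight_total += w
    return score / weight_total if weight_total else 0.0
```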

Slide14

Detecting Source Collusion

Observation 1: Even non-colluding sources in the same domain may contain the same data, e.g. movie databases may contain all Hollywood movies.

Observation 2: The top-k answers of even non-colluding sources may be similar, e.g. answers to the query "Godfather" may contain all three movies in the Godfather trilogy.

Sources may also copy data from each other, or create mirrors, boosting the SourceRank of the group.

Slide15

Source Collusion--Continued

Basic method: If two sources return the same top-k answers to queries that have a large number of answers (e.g. queries like "the" or "DVD"), they are likely to be colluding. We compute the degree of collusion of two sources as their agreement on such large-answer queries; the words with the highest DF in the crawl are used as the queries. The agreement between two databases is then adjusted for collusion by multiplying it by (1 - collusion).
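A small sketch of this collusion adjustment, assuming the top-k answers for a few high-DF probe words have been collected per source and that a result-set agreement function (as sketched earlier) is available; the names and the probe choice are illustrative.

```python
# Hedged sketch: estimate collusion as agreement on large-answer ("stop word")
# probe queries, then discount the agreement between the two sources.
def collusion(top_k_by_source, s1, s2, result_agreement):
    """top_k_by_source: {source: {probe_query: top-k answers}} for high-DF
    probe words such as "the" or "DVD" (illustrative choice)."""
    probes = set(top_k_by_source[s1]) & set(top_k_by_source[s2])
    if not probes:
        return 0.0
    scores = [
        result_agreement(top_k_by_source[s1][q], top_k_by_source[s2][q])
        / max(len(top_k_by_source[s2][q]), 1)
        for q in probes
    ]
    return sum(scores) / len(scores)

def adjusted_agreement(agreement, collusion_score):
    # near-mirrors are heavily discounted; natural agreement is mostly preserved
    return agreement * (1 - collusion_score)
```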

Slide16

Factal: Search based on SourceRank

http://factal.eas.asu.edu

"I personally ran a handful of test queries this way and got much better results [than Google Products] using Factal" --- Anonymous WWW'11 Reviewer.

Slide17

Evaluation

Precision and DCG are compared with the following baseline methods:

CORI: Adapted from text database selection. The union of sampled documents from the sources is indexed, and the sources with the highest number of term hits are selected [Callan et al. 1995].

Coverage: Adapted from relational databases. The mean relevance of the top-5 results to the sampling queries [Nie et al. 2004].

Google Products: the product search that runs over Google Base.

All experiments distinguish SourceRank from the baseline methods at the 0.95 confidence level.

Slide18

Online Top-4 Sources-Movies

29%

Though the combinations are not our competitors, note that they are not better:

1. SourceRank implicitly considers query relevance, since the selected sources fetch answers by query similarity. Combining again with query similarity may be an "overweighting".

2. The search is vertical.

Slide19

Online Top-4 Sources-Books

48%

Slide20

Google Base Top-5 Precision-Books

24%

675 Google Base sources responding to a set of book queries are used as the book-domain sources.

GBase-Domain is Google Base searching only over these 675 domain sources.

Source selection by SourceRank (or by coverage) is followed by ranking by Google Base.

Slide21

Google Base Top-5 Precision-Movies

25%

Slide22

Trustworthiness of Source Selection

Google Base Movies

We corrupted the results in the sample crawl by replacing attribute values not specified in the queries with random strings (since partial titles are the queries, we corrupted all attributes except the titles).

If the source selection is sensitive to corruption, the ranks should decrease with the corruption levels.

Every relevance measure based on query similarity is oblivious to the corruption of attributes not specified in the queries.
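A minimal sketch of the corruption procedure, assuming each result tuple is a dict of attribute strings; the random-string generator and the protected parameter are illustrative.

```python
# Hedged sketch: corrupt a sampled tuple by replacing attributes that are not
# part of the query (here, everything except the title) with random strings.
import random
import string

def corrupt_tuple(tup, corruption_level, protected=("title",)):
    """tup: dict of attribute -> value. With probability `corruption_level`,
    each unprotected attribute is replaced by a random string."""
    def random_string(n=8):
        return "".join(random.choice(string.ascii_lowercase) for _ in range(n))
    return {
        attr: (random_string()
               if attr not in protected and random.random() < corruption_level
               else value)
        for attr, value in tup.items()
    }
```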

Slide23

Trustworthiness - Google Base Books

Slide24

Collusion—Ablation Study

Two databases with the same one million tuples from IMDB are created.

The correlation between their ranking functions is then reduced step by step.

Natural agreement should be preserved while near-mirrors are caught.

Observations:

At high correlation the adjusted agreement is very low.

Adjusted agreement is almost the same as the pure agreement at low correlations.

Slide25

Computation Time

The random walk is known to be feasible at large scale.

The time to compute the agreements is evaluated against the number of sources.

Note that the computation is offline.

Easy to parallelize.

Slide26

Contributions

Agreement-based trust assessment for the deep web
Agreement-based relevance assessment for the deep web
Collusion detection between web sources
Evaluations on Google Base sources and online web databases


The search using SourceRank is demonstrated on Friday, 10-15:30.