
The Anatomy of a Large-Scale - PowerPoint Presentation

Uploaded On 2018-12-10


Hypertextual Web Search Engine. By Sergey Brin and Lawrence Page. Presented by Joshua Haley, Zeyad Zainal, Michael Lopez, Michael Galletti, Britt Phillips, and Jeff Masson. ID: 739444



Presentation Transcript

Slide1

The Anatomy of a Large-Scale Hypertextual Web Search Engine

By Sergey Brin and Lawrence Page

Presented by

Joshua Haley

Zeyad Zainal

Michael Lopez

Michael Galletti

Britt Phillips

Jeff Masson

Slide2

Slide3

Searching in the 90s

Search engine technology had to cope with huge growth in the size of the web and in the number of queries.

Slide4

Google Will Scale

They wanted a search engine that:

Crawls fast

Uses storage space efficiently

Processes indexes fast

Handles queries fast

They Had to Deal with Scaling Difficulties

Disk speeds and OS robustness were not improving as quickly as hardware performance and cost

Slide5

The Google Goals

Improve Search Quality

Remove Junk Results (Prioritizing of Results)

Academic Search Engine Research

Create Literature on the subject of Databases

Gather Usage Data

Databases can support research

Support Novel Research Activities on Web Data

Slide6

System Features

Two important features that help it produce high-precision results:

PageRank

Anchor Text

Slide7

PageRank

The graph structure of hyperlinks had gone largely unused by other search engines

Google built a graph of 518 million hyperlinks

Text matching using page titles performs well after pages are prioritized

Similar results when looking at entire pages

Slide8

PageRank Formula

Not all pages linking to others are counted equally

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

A: page

T1…Tn: pages linking to it

C(A): number of links going out of page A

d: “damping factor”, between 0 and 1 (the paper usually sets it to 0.85)
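
The PageRank formula above can be iterated until the values settle. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `pagerank`, the `links` dictionary, the starting value of 1.0, and the fixed iteration count are all illustrative. The default d = 0.85 is the value the paper says it usually uses.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical input format).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page with no outgoing links passes on nothing
            share = d * pr[source] / len(targets)  # PR(T) / C(T), damped by d
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Tiny example graph: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this toy graph C collects rank from both A and B, so it ends up ranked above A, which in turn ranks above B.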

PageRank for 26 million pages can be calculated in a few hours on a medium-size workstation

Slide9

Intuitive Justification

A page can have a high PageRank if many pages link to it

Or if a page with a high PageRank links to it (e.g. Yahoo News)

A page that is low quality, or is a broken link, is unlikely to be linked to from such a page

PageRank handles both cases by recursively propagating weights through the link structure

Slide10

Anchor Text

Anchors often provide more accurate descriptions of a page than the page itself.

Anchors exist for documents that aren’t text-based (e.g. images, videos, etc.)

Google indexed more than 259 million anchors from just 24 million pages.

Slide11

Other Features

Words in a larger or bolder font are weighted higher than other words

Slide12

Related Work

Slide13

Early Search Engines

The World Wide Web Worm (WWWW)

One of the first web search engines (developed in 1994)

Had a database of 300,000 multimedia objects

Some early search engines retrieved results by post-processing the results of other search engines.

Slide14

Information Retrieval

The science of searching for documents, for information within documents, and for metadata about documents.

Most research is on small collections of scientific papers or news stories on a related topic.

The Text Retrieval Conference (TREC) is the primary benchmark for information retrieval

It uses the “Very Large Corpus”, a small and well-controlled collection, for its benchmarks

The Very Large Corpus benchmark is only 20 GB, compared with the 147 GB from Google's crawl of 24 million pages

Slide15

Information Retrieval

Techniques that work well on the Text Retrieval Conference benchmark often don’t produce good results on the web

Ex: A search for “Bill Clinton” could return a page that only says “Bill Clinton Sucks” and has a picture of him.

Brin and Page believe that a search for “Bill Clinton” should return reasonable results because there is so much information on the topic.

The standard information retrieval work needs to be extended to deal effectively with the web

Slide16

Differences Between the Web and Well Controlled Collections

Documents differ internally in their language, vocabulary, type or format, and may even be machine generated.

External meta information is information that can be inferred about a document but is not contained within it.

Ex: reputation of the source, update frequency, quality, popularity, etc.

A page like Yahoo needs to be treated differently than an article or web page that receives one view every ten years.

Slide17

Differences Between the Web and Well Controlled Collections

There is no control over what people can put on the web

Some companies manipulate search engines to route traffic for profit

Metadata efforts have largely failed with web search engines, because text that is not shown to the user can be abused to manipulate a search engine into returning pages that have nothing to do with the query.

Slide18

The Anatomy of a Large-Scale Hypertextual Web Search Engine

By Sergey Brin and Lawrence Page

Presented by

Joshua Haley

Zeyad Zainal

Michael Lopez

Michael Galletti

Britt Phillips

Jeff Masson

Slide19

System Anatomy

High-level discussion of architecture

Descriptions of data structures

Repository

Lexicon

HitLists

Forward and Inverted Indices

Major applications

Crawling

Indexing

Searching

Slide20

Google Architecture Overview

Implemented in C, C++

Runs efficiently on Linux, Solaris

Many distributed web crawlers

Receive list of URLs to crawl from URL Server

Crawlers send pages to Store Server

Compressed pages sent to Repository

Repository assigns each page a docID

Indexer

Documents from Repository converted into HitLists

Sends HitLists to Barrels

Sends links to anchor file

Slide21

Google Architecture Overview

URL Resolver

Reads from anchor file

Converts relative URLs to absolute URLs, and those to docIDs

Sends the anchor text to Barrels under the docID the anchor points to

Pairs of docIDs stored in Links database
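
The URL Resolver steps above can be sketched as follows. This is a hedged illustration rather than the original design: the function name `resolve_anchors`, the `(source_url, href, text)` tuple format, and the policy of handing out the next unused integer as a new docID are all assumptions. Only `urllib.parse.urljoin` from the Python standard library is used to turn relative links into absolute ones.

```python
from collections import defaultdict
from urllib.parse import urljoin


def resolve_anchors(anchors, url_to_docid):
    """Turn anchor records into docID pairs and per-target anchor text.

    anchors: iterable of (source_url, href, anchor_text) tuples (hypothetical format).
    url_to_docid: dict of known URL -> docID; extended in place for unseen URLs.
    """
    links = []                       # (source docID, target docID) pairs, the "Links database"
    anchor_text = defaultdict(list)  # target docID -> anchor texts pointing at it
    next_id = max(url_to_docid.values(), default=0) + 1  # assumed allocation policy

    for source_url, href, text in anchors:
        source_id = url_to_docid[source_url]
        target_url = urljoin(source_url, href)  # relative URL -> absolute URL
        if target_url not in url_to_docid:
            url_to_docid[target_url] = next_id
            next_id += 1
        target_id = url_to_docid[target_url]
        links.append((source_id, target_id))
        anchor_text[target_id].append(text)  # text is filed under the page it points to
    return links, dict(anchor_text)


known = {
    "http://example.com/a/index.html": 1,
    "http://example.com/a/b.html": 2,
}
anchors = [
    ("http://example.com/a/index.html", "b.html", "page b"),
    ("http://example.com/a/index.html", "/c.html", "page c"),
]
links, anchor_text = resolve_anchors(anchors, known)
```

The list of docID pairs is exactly the edge list that a PageRank computation needs as input.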

Sorter

Takes Barrels presorted by docID (the Forward Index)

Re-sorts by wordID to create Inverted Index

Dumps a list of associated wordIDs to Lexicon

Lexicon

Keeps a list of words

Searcher

Uses Lexicon, Inverted Index, and PageRank to answer queries

Slide22

Repository

BigFiles

Virtual files spanning multiple file systems

Operating systems did not provide enough file descriptors for the system's needs

Repository access

No additional data structures necessary

Reduces complexity

Can rebuild all data structures from Repository

Repository layout: 53.5 GB = 147.8 GB uncompressed

Repository: sync | length | compressed packet | sync | length | compressed packet | ...

Uncompressed packet: docID | ecode | urlLen | pageLen | url | page
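
The record layout above can be sketched with Python's `struct` and `zlib` modules. Everything about field widths here is an assumption, since the slide names the fields but not their sizes: the `HEADER` format, the 4-byte `SYNC` marker, and the function names `pack_record` and `unpack_records` are hypothetical.

```python
import struct
import zlib

# Assumed field widths: docID (4 bytes), ecode (1), urlLen (2), pageLen (4), little-endian.
HEADER = struct.Struct("<IBHI")
LENGTH = struct.Struct("<I")   # assumed 4-byte length of the compressed packet
SYNC = b"\xde\xad\xbe\xef"     # hypothetical sync marker


def pack_record(doc_id, ecode, url, page):
    """Build one repository record: sync | length | compressed packet."""
    url_bytes = url.encode("utf-8")
    page_bytes = page.encode("utf-8")
    packet = HEADER.pack(doc_id, ecode, len(url_bytes), len(page_bytes))
    packet += url_bytes + page_bytes
    compressed = zlib.compress(packet)
    return SYNC + LENGTH.pack(len(compressed)) + compressed


def unpack_records(blob):
    """Walk a repository blob sequentially and yield (docID, ecode, url, page)."""
    offset = 0
    while offset < len(blob):
        if blob[offset:offset + len(SYNC)] != SYNC:
            raise ValueError("lost sync at offset %d" % offset)
        offset += len(SYNC)
        (length,) = LENGTH.unpack_from(blob, offset)
        offset += LENGTH.size
        packet = zlib.decompress(blob[offset:offset + length])
        offset += length
        doc_id, ecode, url_len, page_len = HEADER.unpack_from(packet, 0)
        body = packet[HEADER.size:]
        url = body[:url_len].decode("utf-8")
        page = body[url_len:url_len + page_len].decode("utf-8")
        yield doc_id, ecode, url, page


repository = (
    pack_record(1, 0, "http://example.com/", "<html>first page</html>")
    + pack_record(2, 0, "http://example.com/b", "<html>second page</html>")
)
records = list(unpack_records(repository))
```

Because each record carries its own length, the file can be read front to back without any separate index, which is why the other data structures can be rebuilt from the repository alone.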

Repository

Contains full HTML of every web page

Compression decision

bzip offers 4 : 1 compression

zlib offers 3 : 1, is faster

Opted for speed over ratio (zlib)

Slide23

Document Index and Lexicon

Document Index

Stores information about each document

Fixed-width ISAM (Index Sequential Access Mode) index, ordered by docID

Information includes:

Status

Pointer into Repository

Checksum

Various Statistics

Record fetching

If the document has been crawled, its record points to a docinfo file with the URL and title

Otherwise it points to the URL in the URLlist

docID Allocation

File of all URL checksums paired with docIDs

Sorted by checksum

Find docID

1. Checksum of URL is computed

2. Binary search over file

May be done in batches
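
The two lookup steps above map directly onto a sorted list and a binary search. This is a small sketch under assumptions: the slide does not name the checksum function, so CRC-32 from `zlib` stands in for it, and `build_table` and `find_docid` are hypothetical names.

```python
import bisect
import zlib


def url_checksum(url):
    # CRC-32 is a stand-in; the checksum actually used is not given in the slides.
    return zlib.crc32(url.encode("utf-8"))


def build_table(url_to_docid):
    """Return a list of (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(url), doc_id) for url, doc_id in url_to_docid.items())


def find_docid(table, url):
    """Step 1: compute the URL checksum. Step 2: binary search the sorted table."""
    target = url_checksum(url)
    # (target,) sorts before any (target, docID), so this finds the first match.
    index = bisect.bisect_left(table, (target,))
    if index < len(table) and table[index][0] == target:
        return table[index][1]
    return None  # URL has no docID yet


table = build_table({
    "http://example.com/": 1,
    "http://example.com/about": 2,
    "http://example.com/news": 3,
})
```

A batch lookup could sort the query checksums first and walk both sorted lists in one pass instead of searching once per URL.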

Lexicon

Small enough to fit in the main memory of a machine

Holds 14 million words

List of words (concatenated together, separated by nulls)

Hash table of pointers

Slide24

HitLists and Encoding

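Hit

Occurrence of a word in a document, given 2 bytes

Fancy and plain hits (fancy hits occur in a URL, title, anchor text, or meta tag; plain hits are everything else)

Records capitalization, font size relative to the rest of the document, and position

HitList

List of Hits for some word in some document

Requires the most space

Many possible encoding schemes

Simple

Hand-optimized

Huffman

Hand-optimized encoding chosen as a time vs. space compromise: less space than the simple encoding, less bit manipulation than Huffman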

Bit Allocation for Different Hits [2 Bytes]

Plain: cap: 1 | font size: 3 | position: 12

Fancy: cap: 1 | font size = 7 (3 bits) | type: 4 | position: 8

Anchor: cap: 1 | font size = 7 (3 bits) | type: 4 | hash: 4 | position: 4
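
The bit allocation above can be sketched with shifts and masks. This is an illustrative packing, not the original code: the function names are hypothetical, positions that overflow their field are simply clamped (an assumption), and the order of hash and position inside the low byte of an anchor hit follows the order listed above. Fancy hits are taken to have 8 position bits, as in the paper, so that every hit fills exactly 16 bits.

```python
FANCY = 7  # font-size value reserved to mark fancy and anchor hits


def pack_plain(cap, size, position):
    """Plain hit: 1 bit capitalization | 3 bits font size | 12 bits position."""
    if not 0 <= size < FANCY:
        raise ValueError("plain hits use font sizes 0-6; 7 marks a fancy hit")
    position = min(position, 0xFFF)  # assumption: clamp positions past 12 bits
    return ((cap & 1) << 15) | (size << 12) | position


def pack_fancy(cap, hit_type, position):
    """Fancy hit: 1 bit capitalization | size = 7 | 4 bits type | 8 bits position."""
    position = min(position, 0xFF)
    return ((cap & 1) << 15) | (FANCY << 12) | ((hit_type & 0xF) << 8) | position


def pack_anchor(cap, hit_type, doc_hash, position):
    """Anchor hit: the 8 position bits are split into 4 bits hash and 4 bits position."""
    low_byte = ((doc_hash & 0xF) << 4) | min(position, 0xF)
    return ((cap & 1) << 15) | (FANCY << 12) | ((hit_type & 0xF) << 8) | low_byte


def unpack(hit):
    """Decode a 16-bit hit into its fields."""
    cap = hit >> 15
    size = (hit >> 12) & 0x7
    if size != FANCY:
        return {"kind": "plain", "cap": cap, "size": size, "position": hit & 0xFFF}
    return {"kind": "fancy", "cap": cap, "type": (hit >> 8) & 0xF, "position": hit & 0xFF}


def anchor_fields(hit):
    """Split the low byte of an anchor hit into (hash, position)."""
    return (hit >> 4) & 0xF, hit & 0xF


plain = pack_plain(cap=1, size=3, position=100)
fancy = pack_fancy(cap=0, hit_type=2, position=200)
anchor = pack_anchor(cap=1, hit_type=5, doc_hash=9, position=3)
```

Reserving font size 7 as the fancy marker means no extra flag bit is needed to tell the two layouts apart.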

Anchor Hits

Hash of the docID the anchor occurs in

Storing

Lists stored in barrels

Space-saving

List length is combined with the wordID in the Forward Index and with the docID in the Inverted Index

If the list length will not fit in the remaining bits, an escape code is placed there and the next two bytes store the list length

Slide25

Forward and Inverted Indices

Forward Index

64 barrels

Each one corresponds to a range of wordIDs

If a document contains words that fall into a barrel's range, its docID is recorded into that barrel

A list of wordIDs with their HitLists follows

wordIDs stored relative to Barrel starting index

Fit in 24 bits, leaving 8 for list length

System requires slightly more storage for duplicated docIDs

However, coding complexity greatly reduced

Inverted Index

Created after Barrels go through Sorter

For each valid wordID there is a pointer from Lexicon into corresponding Barrel

Points to docList of docIDs and matching HitLists

Represents every document in which a particular word appears
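
The relationship between the two indices above can be sketched in a few lines: the forward index is ordered by docID, and re-sorting the same entries by wordID yields the inverted index. This is a simplified in-memory illustration, with barrels, bit packing, and disk layout left out. The function names and the use of word positions as hit lists are assumptions.

```python
from collections import defaultdict


def build_forward_index(documents):
    """documents: {docID: [wordID, ...]} -> list of (docID, wordID, hit list), docID order."""
    forward = []
    for doc_id in sorted(documents):
        hits = defaultdict(list)
        for position, word_id in enumerate(documents[doc_id]):
            hits[word_id].append(position)  # the hit list here is just word positions
        for word_id, positions in hits.items():
            forward.append((doc_id, word_id, positions))
    return forward


def invert(forward):
    """Re-sort forward entries by wordID to get wordID -> docList of (docID, hit list)."""
    inverted = defaultdict(list)
    for doc_id, word_id, positions in sorted(forward, key=lambda e: (e[1], e[0])):
        inverted[word_id].append((doc_id, positions))
    return dict(inverted)


def intersect(inverted, word_a, word_b):
    """Merge two docLists sorted by docID to find documents containing both words."""
    a = [doc_id for doc_id, _ in inverted.get(word_a, [])]
    b = [doc_id for doc_id, _ in inverted.get(word_b, [])]
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


documents = {1: [10, 20, 10], 2: [20, 30], 3: [10, 30]}
inverted = invert(build_forward_index(documents))
both = intersect(inverted, 10, 30)
```

Keeping each docList sorted by docID is what makes the linear-time merge in `intersect` possible for multi-word queries.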

docList Ordering

Sort by docID

Quick merging for multi-word queries

Sort by ranking of occurrence

One-word queries trivial

Answers to multi-word queries likely near start of list

Merging is difficult

Development is difficult

Compromise!

Keep two sets of Barrels: one for title and anchor hits, another for all hits

Slide26

Crawling The Web

Slide27

Crawling

Accessing millions of web pages and logging data

DNS caching for increased performance

Email from web admins

Unpredictable bugs

Copyright problems

Robots.txt

Slide28

Indexing The Web

Parsing HTML data

Handle wide variety of errors

Encoding to Barrels

Turning words into wordIDs

Hashing all the data (words are looked up in the in-memory hash table of the Lexicon)

Sorting data recursively (bucket sort)

Slide29

Searching

Quality first

Limited depth (stops after 40,000 matching documents)

No one factor will have too much impact

Titles, font size, distance, count

Creates relevance (IR) score

Combines PageRank and IR score into the final rank

Slide30

User Feedback

User input vital to improved search results

Verified users can evaluate results and send their ratings back

Adjust ranking system

Verify that old results are still valid

Slide31

Results and Performance

Slide32

The most important measure of a search engine is the quality of its search results

“Our own experience with Google has shown it to produce better results than the major commercial search engines for most searches.”

Results are generally high quality pages with minimal broken links

Slide33

Storage Requirements

Total size of repository is about 53 GB (relatively cheap source of data)

All the other data used by the engine requires about 55 GB

With better encoding and compression of the Document Index, a high-quality search engine may fit onto a 7 GB drive

Slide34

System and Search Performance

Google’s major operations: Crawling, Indexing, Sorting

Indexer runs faster than the Crawler

Indexer runs at 54 pages/second

Using four machines, sorting takes 24 hours

Most queries answered in between 1 and 10 seconds

No query caching or subindices on common terms

Slide35

Conclusions

Google is designed to be a scalable search engine, providing high quality search results.

Future Work:

Query caching, smart disk allocation, subindices

Smart algorithms to decide what old web pages should be recrawled and which new ones should be crawled

Using proxy caches to build search databases and adding Boolean operators, negation, and stemming

Support user context and result summarization

Slide36

High Quality Search: Users want high quality results without being frustrated and wasting time.

Google returns higher quality search results than current commercial search engines; link structure analysis determines the quality of pages, and link descriptions (anchor text) determine relevance.

Scalable Architecture:

Google is efficient in both space and time

Google has overcome bottlenecks in CPU, memory access and capacity, and disk I/O during various operations

Crawling, Indexing, and Sorting are efficient enough to build an index of 24 million pages in less than a week