The Anatomy of a Large-Scale Hypertextual Web Search Engine
By Sergey Brin and Lawrence Page

Presented by:
Joshua Haley
Zeyad Zainal
Michael Lopez
Michael Galletti
Britt Phillips
Jeff Masson
Searching in the 90's
Search engine technology had to deal with huge growth in the web.
Google Will Scale
They wanted a search engine that:
Has fast crawling capabilities
Uses storage space efficiently
Processes indexes quickly
Handles queries quickly
They had to deal with scaling difficulties: disk speeds and OS robustness were not scaling as well as hardware performance and cost.
The Google Goals
Improve search quality
Remove junk results (prioritizing of results)
Support academic search engine research
Create literature on the subject of search engine databases
Gather usage data
Databases can support research
Support novel research activities on web data
System Features
Two important features that help it produce high-precision results:
PageRank
Anchor Text
PageRank
The graph structure of hyperlinks hadn't been used by other search engines
Google's citation graph covers 518 million hyperlinks
Text matching using page titles performs well once pages are prioritized by PageRank
Results are similar when matching against entire pages
The PageRank Formula
Not all pages linking to others are counted equally.

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

A: the page being ranked
T1...Tn: the pages linking to A
C(A): the number of links going out of page A
d: a "damping factor", usually set to 0.85

PageRank for 26 million pages can be calculated in a few hours.
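To make the formula concrete, here is a minimal sketch (not the paper's implementation) of the iterative computation on a tiny hypothetical link graph; the toy graph, the iteration count, and the uniform starting ranks are all assumptions for illustration.

```python
# Minimal PageRank sketch over a toy graph; illustrative only.
# graph maps each page to the list of pages it links to,
# so len(graph[T]) is C(T), the number of links out of T.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(graph, d=0.85, iterations=50):
    """Repeatedly apply PR(A) = (1-d) + d * sum(PR(T)/C(T))."""
    pr = {page: 1.0 for page in graph}  # uniform starting ranks (assumption)
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(
                pr[t] / len(graph[t]) for t in graph if page in graph[t]
            )
            for page in graph
        }
    return pr

print(pagerank(graph))  # ranks settle after a few dozen iterations
```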
Intuitive Justification
A page can have a high PageRank if many pages link to it
Or if a high-PageRank page links to it (e.g., Yahoo News)
Intuitively, a page wouldn't be linked to at all if it weren't high quality or if its link were broken
PageRank handles these cases by propagating weights through the link structure
Anchor Text
Anchors often provide more accurate descriptions of a page than the page itself.
Anchors exist even for documents that aren't text-based (e.g., images, videos, etc.)
Google indexed more than 259 million anchors from just 24 million pages.
Other Features
Words in larger or bold fonts carry more weight than other words.
Related Work
Early Search Engines
The World Wide Web Worm (WWWW)
One of the first web search engines (developed in 1994)
Had a database of 300,000 multimedia objects
Some early search engines retrieved results by post-processing the results of other search engines.
Information Retrieval
The science of searching for documents, for information within documents, and for metadata about documents.
Most IR research is done on small, well-controlled collections of scientific papers or news stories on a related topic.
The Text Retrieval Conference (TREC) is the primary benchmark for information retrieval.
Its "Very Large Corpus" benchmark, a small and well-controlled collection, is only 20 GB.
Information Retrieval (cont.)
Techniques tuned to the Text Retrieval Conference benchmark often don't produce good results on the web.
Ex: a search for "Bill Clinton" could return a page that says only "Bill Clinton Sucks" alongside a picture of him.
Brin and Page believe that a search for "Bill Clinton" should return reasonable results, because there is so much information on the topic.
Standard information retrieval work needs to be extended to deal effectively with the web.
Differences Between the Web and Well-Controlled Collections
Documents differ internally in their language, vocabulary, type, and format, and may even be machine generated.
External meta-information is information that can be inferred about a document but is not contained within it.
Ex: reputation of the source, update frequency, quality, popularity, etc.
A page like Yahoo, viewed millions of times a day, needs to be treated differently than an obscure article that receives one view every ten years.
Differences Between the Web and Well-Controlled Collections (cont.)
There is no control over what people can put on the web.
Some companies deliberately manipulate search engines to route traffic for profit.
Metadata efforts have largely failed for web search engines: because metadata can be manipulated, a user can be returned a page that has nothing to do with the query.
System Anatomy
High-level discussion of the architecture
Descriptions of the data structures:
Repository
Lexicon
HitLists
Forward and Inverted Indices
Major applications:
Crawling
Indexing
Searching
Google Architecture Overview
Implemented in C and C++
Runs efficiently on Linux and Solaris
Many distributed web crawlers
Receive a list of URLs to crawl from the URL Server
Crawlers send fetched pages to the Store Server
Compressed pages are sent to the Repository
Each page is assigned a docID
Indexer
Converts documents from the Repository into HitLists
Sends the HitLists to the Barrels
Sends links to an anchors file
Google Architecture Overview (cont.)
URL Resolver
Reads from the anchors file
Converts URLs to docIDs and sends the anchor text into the Barrels
Stores pairs of docIDs in the Links database
Sorter
Takes the Barrels, presorted by docID (the Forward Index)
Re-sorts them by wordID to create the Inverted Index
Dumps a list of the associated wordIDs into the Lexicon
Lexicon
Keeps the list of all words
Searcher
Uses the Lexicon, the Inverted Index, and PageRank to answer queries
Repository
BigFiles
Virtual files spanning multiple file systems
Built because existing operating systems did not provide enough for the system's needs
Repository access
No additional data structures are necessary to read it
Reduces complexity
All other data structures can be rebuilt from the Repository
Size: 53.5 GB compressed = 147.8 GB uncompressed
On disk, the Repository is a sequence of [sync, length, compressed packet] records; each uncompressed packet holds [docID, ecode, urlLen, pageLen, url, page]
Contains the full HTML of every web page
Compression decision
bzip offers 4:1 compression
zlib offers 3:1 but is faster
They opted for speed over compression ratio
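As a rough illustration of the record layout above, here is a hedged sketch in Python; the exact field widths and the sync marker value are assumptions, since the slides don't specify them, but the zlib choice mirrors the compression decision.

```python
import struct
import zlib

SYNC = b"\xde\xad\xbe\xef"  # hypothetical sync marker (assumption)

def pack_packet(doc_id, ecode, url, page):
    """Build one [sync][length][compressed packet] record."""
    url_b, page_b = url.encode(), page.encode()
    # Uncompressed packet: [docID, ecode, urlLen, pageLen, url, page];
    # the 4/2/2/4-byte field widths are assumptions for illustration.
    body = struct.pack("<IHHI", doc_id, ecode, len(url_b), len(page_b))
    body += url_b + page_b
    compressed = zlib.compress(body)  # zlib: ~3:1, chosen for speed
    return SYNC + struct.pack("<I", len(compressed)) + compressed

def unpack_packet(record):
    body = zlib.decompress(record[8:])  # skip sync + length
    doc_id, ecode, url_len, page_len = struct.unpack_from("<IHHI", body)
    offset = struct.calcsize("<IHHI")
    url = body[offset:offset + url_len].decode()
    page = body[offset + url_len:offset + url_len + page_len].decode()
    return doc_id, ecode, url, page
```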
Document Index and Lexicon
Document Index
Stores information about each document
Fixed-width ISAM (Index Sequential Access Mode) index, ordered by docID
Information includes:
Status
Pointer into the Repository
Checksum
Various statistics
Record fetching
If the document has been crawled, its record points to a docinfo file containing its URL and title
Otherwise, the record points to the URL in a URLlist
docID allocation
A file of all document checksums paired with their docIDs, sorted by checksum
To find a docID:
1. Compute the checksum of the URL
2. Binary search over the file
Lookups may be done in batches
Lexicon
Can fit in the main memory of a single machine
Holds 14 million words
Implemented as a linked list of words plus a hash table of pointers
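A minimal sketch of the checksum-based docID lookup described above; using crc32 as the checksum and an in-memory sorted list in place of the on-disk file are both assumptions.

```python
import bisect
import zlib

# Sorted list of (checksum, docID) pairs, standing in for the
# checksum file; sorted order is what makes binary search possible.
checksum_index = []

def assign_docid(url, doc_id):
    checksum = zlib.crc32(url.encode())  # crc32 is an assumed checksum
    bisect.insort(checksum_index, (checksum, doc_id))

def find_docid(url):
    checksum = zlib.crc32(url.encode())
    i = bisect.bisect_left(checksum_index, (checksum, 0))
    if i < len(checksum_index) and checksum_index[i][0] == checksum:
        return checksum_index[i][1]
    return None  # URL not yet seen
```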
HitLists and Encoding
Hit
One occurrence of a word in a document, stored in 2 bytes
Two kinds: fancy hits and plain hits
Records capitalization, font size relative to the rest of the document, and position
HitList
The list of Hits for a given word in a given document
Accounts for most of the space in the indices
Many possible encoding schemes: simple, hand-optimized, Huffman
A compromise between time and space

Bit allocation for different hits (2 bytes each):
Plain:  capitalization: 1 | font size: 3 | position: 12
Fancy:  capitalization: 1 | font size = 7 (flag): 3 | type: 4 | position: 8
Anchor: capitalization: 1 | font size = 7 (flag): 3 | type: 4 | docID hash: 4 | position: 4

Anchor hits
The 4-bit hash identifies the docID the anchor occurs in
Storing
HitLists are stored in the Barrels
Space saving: the HitList length is combined with the wordID (forward index) or the docID (inverted index)
If the length doesn't fit in the remaining bits, an escape code is placed there and the next two bytes store the actual length
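To illustrate the 2-byte plain-hit layout above, here is a hedged bit-packing sketch; the ordering of the fields within the word is an assumption.

```python
# Pack a plain hit into 2 bytes: 1 bit capitalization,
# 3 bits font size, 12 bits position (field order assumed).
def pack_plain_hit(capitalized, font_size, position):
    assert 0 <= font_size < 7        # size 7 is the fancy-hit flag
    assert 0 <= position < (1 << 12)
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    capitalized = bool(hit >> 15)
    font_size = (hit >> 12) & 0b111
    position = hit & 0xFFF
    return capitalized, font_size, position

hit = pack_plain_hit(True, 3, 42)
print(f"{hit:#06x}", unpack_plain_hit(hit))  # round-trips in 2 bytes
```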
Forward and Inverted Indices
Forward Index
Stored in 64 Barrels, each covering a range of wordIDs
If a document contains words in a Barrel's range, its docID is recorded into that Barrel, followed by the list of wordIDs and their HitLists
wordIDs are stored relative to the Barrel's starting wordID, so each fits in 24 bits, leaving 8 bits for the HitList length
Duplicated docIDs require somewhat more storage, but coding complexity is greatly reduced
Inverted Index
Created when the Sorter re-sorts the Barrels
For each valid wordID, the Lexicon holds a pointer into the corresponding Barrel
The pointer leads to a docList of docIDs with their matching HitLists
This docList represents every document in which that word appears
docList ordering
Option 1: sort by docID, which makes merging multi-word queries quick
Option 2: sort by a ranking of the word's occurrence; one-word queries become trivial and multi-word answers likely appear near the start of the list, but merging and development are difficult
The compromise: keep two sets of Barrels
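A minimal sketch of the inversion step: turning forward-barrel entries keyed by docID into inverted entries keyed by wordID. In-memory dicts stand in for the on-disk barrels, and the toy data is made up.

```python
from collections import defaultdict

# Forward barrel: docID -> [(wordID, hitlist), ...] (toy data).
forward_barrel = {
    1: [(10, ["hit", "hit"]), (12, ["hit"])],
    2: [(10, ["hit"])],
}

def invert(forward_barrel):
    """Produce wordID -> [(docID, hitlist), ...], docLists in docID order."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_barrel):
        for word_id, hitlist in forward_barrel[doc_id]:
            inverted[word_id].append((doc_id, hitlist))
    return dict(inverted)

print(invert(forward_barrel))
# {10: [(1, ['hit', 'hit']), (2, ['hit'])], 12: [(1, ['hit'])]}
```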
Crawling the Web
Crawling
Accessing millions of web pages and logging their data
Each crawler keeps its own DNS cache for increased performance
Running a crawler at scale brings practical problems:
Email from web admins
Unpredictable bugs
Copyright complaints
Respecting robots.txt
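A tiny sketch of the per-crawler DNS cache mentioned above: resolve each hostname once and reuse the answer. Error handling is omitted.

```python
import socket

dns_cache = {}  # hostname -> IP address

def resolve(host):
    # Only hit the resolver on a cache miss.
    if host not in dns_cache:
        dns_cache[host] = socket.gethostbyname(host)
    return dns_cache[host]
```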
Indexing the Web
Parsing HTML data
Must handle a wide variety of errors in real-world pages
Encoding to Barrels
Turning words into wordIDs
Hashing all the data
Sorting the data recursively with a bucket sort
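A minimal sketch of turning words into wordIDs with an in-memory table; assigning IDs in first-seen order is an assumption.

```python
lexicon = {}  # word -> wordID

def word_id(word):
    word = word.lower()
    if word not in lexicon:
        lexicon[word] = len(lexicon)  # next free wordID
    return lexicon[word]

print([word_id(w) for w in "the web the anatomy".split()])
# [0, 1, 0, 2]
```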
Searching
Quality comes first
Search depth is limited (stops after 40,000 hits)
No single factor can have too much impact
Titles, font size, proximity, and hit counts combine into a relevance (IR) score
The final ranking combines the PageRank and the IR score
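The slides say the final rank combines PageRank with the IR score but give no formula, so the weighted sum below is purely an assumed stand-in.

```python
def final_rank(ir_score, pagerank, ir_weight=0.7):
    # Assumed linear combination; the real weighting is not given.
    return ir_weight * ir_score + (1 - ir_weight) * pagerank

docs = [("doc1", 0.9, 0.2), ("doc2", 0.6, 0.8)]  # (name, IR, PageRank)
ranked = sorted(docs, key=lambda d: final_rank(d[1], d[2]), reverse=True)
print([name for name, _, _ in ranked])  # ['doc1', 'doc2']
```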
User Feedback
User input is vital to improving search results
Verified users can evaluate results and send their ratings back
Ratings are used to adjust the ranking system
And to verify that old results are still valid
Results and Performance
The most important measure of a search engine is the quality of its search results.
"Our own experience with Google has shown it to produce better results than the major commercial search engines for most searches."
Results are generally high quality pages with minimal broken links.
Storage Requirements
The total size of the repository is about 53 GB (a relatively cheap source of data)
The total of all data used by the engine requires about 55 GB
With better compression, only a 7 GB drive would be needed
System and Search Performance
Google's major operations: crawling, indexing, and sorting
The indexer is faster than the crawler
The indexer runs at 54 pages per second
Using four machines, sorting takes 24 hours
Most queries are answered within 10 seconds
This is without query caching or subindices on common terms
Conclusions
Google is designed to be a scalable search engine, providing high quality search results.
Future work:
Query caching, smart disk allocation, and subindices
Smart algorithms to decide which old web pages should be recrawled and which new ones should be crawled
Using proxy caches to build search databases; adding boolean operators, negation, and stemming
Supporting user context and result summarization
High Quality Search: users want high quality results without being frustrated and wasting time.
Google returns higher quality search results than current commercial search engines: link structure analysis determines the quality of pages, and link (anchor) descriptions determine relevance.
Scalable Architecture:
Google is efficient in both space and time
Google has overcome bottlenecks in CPU, memory access, memory capacity, and disk I/O across its various operations
Crawling, indexing, and sorting are efficient enough to build an index of 24 million pages in less than a week