An Adaptive Model for Optimizing Performance of an Incremental Web Crawler

Jenny Edwards
Faculty of Information Technology, University of Technology, Sydney
PO Box 123, Broadway NSW 2007, Australia

Kevin McCurley
IBM Research Division, Almaden Research Center
650 Harry Road, San Jose, CA 95120-6099, USA

John Tomlin
IBM Research Division, Almaden Research Center
650 Harry Road, San Jose, CA 95120-6099, USA

This work was completed while the author was [...]

[...] of which we become aware, either through new links or exogenous information, and define them as also being obsolete.

A particular feature of the model we describe is that it makes no a priori assumptions about the way pages change, simply that we can measure when changes occur, and record the frequency with the pages' metadata. In this sense the model is adaptive, in that the more cycles the crawler is in operation, the more reliable and refined is the data which we have available to drive it. For our purposes a page change is required to be non-trivial, as determined by a shingle (see [4]). Furthermore, while growth in the web changes the data in our model, it has no effect on the size of the model nor the solution time.

No previous work that we know of makes use of precisely this set of assumptions, but much of it has some bearing on our model. After a review of this work, we present a more detailed description of the crawler, followed by a mathematical description of the optimization model. We then describe a series of computational experiments designed to test use of the model and some of its variants on simulated but realistic data.

2. PREVIOUS WORK

There have been several studies of web crawling in its relatively short history, all with a focus rather different from ours. Some have concentrated on aspects relating to caching, e.g., [13] and [9]. Others have been principally interested in the most efficient and effective way to update a fixed size database extracted from the web, often for some specific function, such as data mining; see, e.g., the work of Cho et al. [5, 6, 7]. These studies were performed over time periods ranging from a few days to seven months. However, for differing practical reasons, these crawlers were restricted to subsets of web pages.

Several authors, e.g., Coffman et al. [8], approach crawling from a theoretical point of view, comparing it to the polling systems of queueing theory, i.e., multiple queue-single server systems. However, the problem equivalent to the obsolescence time of a page is unexplored in the queueing literature.

A common assumption has been that page changes are a Poisson or memoryless process, with a parameter giving the rate of change for the pages. Brewington and Cybenko [1], among others, confirm this within the limits of their data gathering. This is somewhat undermined by another study, based on an extensive subset of the web, by Brewington and Cybenko [2], showing that most web pages are modified during US working hours, i.e., 5am to 5pm (Silicon Valley Standard Time), Monday to Friday. Nevertheless, the widely accepted Poisson model forms the basis for a series of studies on crawler strategies. These lead to a variety of analytical models designed to minimize the age or maximize the freshness of a collection by investigating:

- how often a page should be crawled;
- in what order pages should be crawled;
- whether a crawling strategy should be based on the importance of pages or their rates of change.

The definitions or metrics for freshness, age and importance are not completely consistent, but in general the freshness of a page refers to the difference between the time of crawling a page and the current time, while the age of a page is the difference between the time when the page last changed and the current time. Widely differing metrics have been offered for the importance of a page, but as the total number of possible metrics is large and the crawler in this study does not currently use any of them, no formal definitions will be given here.

While differing in detail, the experimental results in the referenced papers agree on general trends, and in particular that:

- the average size of individual pages is growing;
- the proportion of visual and other nontextual material is growing in comparison to text;
- the number of pages has been growing exponentially ([1] gives two different estimates of 318 and 390 days for the web to double in size, though the growth rate may be slower; see our comments at the beginning of Section 1);
- different domains have very different page change rates;
- the average age and lifetime of pages [...] (cf. Tables 1 and 2).

While all of these trends are important, the last three are more relevant to the study of whole web crawling to be discussed here.

2.1 Crawling Models

In a series of papers, Cho et al. [5, 6, 7] address a number of issues relating to the design of effective crawlers. In [7] they examine different crawling strategies using the Stanford University web pages as a particular subset of the web, and examine several scenarios with different physical limitations on the crawling. Their approach is to visit more important pages first, and they describe a number of possible metrics for determining this as well as the order in which the chosen pages will be visited. They show that it is possible to build crawlers which can obtain a significant portion of important pages quickly using a range of metrics. Their model appears to be most useful when trying to crawl large portions of the web with limited resources, or if pages need to be revisited often to detect changes.

Cho and Garcia-Molina [6] derive a series of mathematical models to determine the optimum strategy in a number of crawling scenarios, where the repository's extracted copy of the web is of fixed size. They examine models where all pages are deemed to change at the same average (uniform) rate and where they change at different (non-uniform) rates. For a real life crawler, the latter is more likely to be relevant. For each of these two cases, they examine a variety of synchronization or crawling policies. How often the repository web copy is updated depends on the crawling capacity. Within that limitation is the question of how often each individual page should be crawled to meet a particular objective. Cho and Garcia-Molina examine a uniform allocation policy, in which each page is crawled at the same frequency, and a non-uniform or proportional policy, where each page is crawled with a frequency that is proportional to the frequency with which it is changed on the web, i.e., pages which are updated frequently are crawled more often than those which change only occasionally. Finally, they examine the order in which the pages should be crawled. They develop models of:

- fixed order, where all pages are crawled repeatedly in the same order each cycle;
- random order, where all pages are crawled in each cycle but in a random order, e.g., by always starting with the root URL for a site and crawling all pages linked to it;
- a further policy under which some pages [...] and others never.

Their models show the advantage of crawling pages at a uniform rate and in fixed order. As we noted, most studies make the assumption that pages change at a variable rate which may be approximated by a Poisson distribution, but in the second stage of their study Cho and Garcia-Molina assume the broader gamma distribution for this change rate. Moreover, they prove that their particular model is valid for any distribution, and conclude that when pages change at varying rates, it is always better to crawl these pages at a uniform rate, i.e., ignoring the rate of change, than at a rate which is proportional to the rate of change. However, to maximize freshness they find a closed form solution to their model which provides an optimal crawling rate which is better than the uniform rate. These results were all derived for a batch or periodic crawler, i.e., where a fixed number of pages is crawled in a given time period. These pages are used to update a fixed size repository, either by replacing existing repository pages with newer versions or by replacing less important pages with those deemed to be of greater importance.

Coffman et al. [8] built a theoretical model to minimize the fraction of time pages spend out of date. Also assuming Poisson page change processes, and a general distribution for page access time, they similarly show that optimal results can be obtained by crawling pages as uniformly as possible.

In [5],
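Cho and Garcia-Molina devise an architecture for an incremental crawler, and examine the use of an incremental versus a batch crawler under various conditions, particularly those where the entire web is not crawled. Crawling a subset (720,000 pages from 270 sites) of the web daily, they determined statistics on the rate of change of pages from different domains. They found, for example, that for the sites [...] daily, in contrast to .edu and .gov domains where more than 50% of pages did not change in the four months of the study. They show that the rates of change of pages they crawled can be approximated by a Poisson distribution, with the proviso that the figures for pages which change more often than daily or less often than four monthly are inaccurate. Using different collection procedures, Wills and Mikhailov [13] derive similar conclusions.

A disadvantage of all these models is that they deal only with a fixed size repository of a limited subset of the web. In contrast, our model is flexible, adaptive, based upon the whole web and caters gracefully for its growth.

Table 1: Cumulative Probability of Page Age in Days

  cumulative probability    page age (days)
  0.03                      10^0
  0.14                      10^1
  0.48                      10^2
  0.98                      10^3

Table 2: Cumulative Probability of Mean Lifetime in Days

  cumulative probability    mean lifetime (days)
  [...]                     10^0
  [...]                     10^1
  [...]                     10^2
  [...]                     6 x 10^3

2.2 Page Statistics Derived from Crawling

Statistics on page ages, lifetimes, rates of change, etc. are important for our model. Subject to the assumptions of a constant size repository copy of the web, which is updated with periodic uniform reindexing, Brewington and Cybenko [1] showed that in order to be sure that a randomly chosen page is at least 95% fresh or current up to a day ago, the web (of 800M pages) needs a reindexing period of 8.5 days, and a reindexing period of 18 days is needed to be 95% sure that the repository copy of a random page was current up to a week ago. In another study [2], the same authors estimate the cumulative probability function for the page age in days on a log scale, as shown in Table 1.

As with similar studies, this does not accurately account for pages which change very frequently or those which change very slowly. Allowing for these biases, the authors also estimate the cumulative probability of mean lifetime in days, shown in Table 2.

Brewington and Cybenko then use their data to examine various reindexing strategies based on a single revisit period, and refer to the need for mathematical optimization to determine the optimal reindexing strategy when the reindexing period varies per page. Such a model is the main focus of this paper.

3. THE WEBFOUNTAIN CRAWLER

The model in this paper was designed for a crawler that is part of the WebFountain data mining architecture. The features of this crawler that distinguish it from most previous crawlers are that it is fully distributed and incremental. By distributed, we mean that responsibility for the crawl is spread across a cluster of machines. URLs are grouped by site, and a site is assigned to a machine in the cluster (very large sites such as geocities may actually be split among several machines). There is no global scheduler, nor are there any global queues to be maintained. Moreover, there is no machine with access to a global list of URLs.

An incremental crawler (as opposed to a batch crawler) runs as an ongoing process, and the crawl is never regarded as complete. The underlying philosophy is that the local collection of documents will always grow, always be dynamic, and should be constructed with the goal of keeping the repository as fresh and complete as possible. Instead of devoting all of its effort to crawling newly discovered pages, a percentage of its time is devoted to recrawling pages that were crawled in the past, in order to minimize the number of obsolete pages. Note that our use of the term 'incremental' differs from that of Cho et al. [5, 6, 7]. Their definition assumes that the document collection is of static size, and a ranking function is used to replace documents in the collection with more important documents. We regard the issue of incrementality to be independent of the size of the collection, and we allow for growth of the crawler in order to meet the demands of an ever-expanding web.

The WebFountain crawler is written in C++, is fully distributed, and uses MPI (Message Passing Interface) for communication between the different components. The major components are the Ants, which are the machines assigned to crawl sites; duplicate detectors, which are responsible for detecting duplicates or near-duplicates; and a single machine called a Controller. The Controller is the control point for the machine cluster, and keeps a dynamic list of site assignments on the Ants. It is also responsible for routing messages for discovered URLs, and manages the overall crawl rate, monitoring of disk space, load balancing, etc.

Other crawlers have been written that distribute the load, but they generally distribute the work in different ways. Due to the competitive nature of the Internet indexing and searching business, few details are available about the latest generation of crawlers. The Google crawler [3] is apparently designed as a batch crawler, and is only partially distributed. It uses a single point of control for scheduling. While this might appear convenient, it also provides a bottleneck for intelligent scheduling algorithms, since the scheduling of URLs to be crawled may potentially need to touch a large amount of data (e.g., robots.txt, politeness values, change rate data, DNS records, etc.). Mercator [10] supports incremental crawling using priority values on URLs and interleaving crawling new and old URLs.

The scheduling mechanism of the WebFountain crawler resembles Mercator in that it is fully distributed, very flexible, and can even be changed on the fly. This enables efficient use of all crawling processors and their underlying network. The base software component for determining the ordering on URLs to be crawled consists of a composition of sequencers. Sequencers are software objects that implement a few simple methods to determine the current backlog, whether there are any URLs available to be crawled, and control of loading and flushing data structures to disk. Sequencers are then implemented according to different policies, including a simple FIFO queue or a priority queue. Other Sequencers combine several sequencers and implement a policy for doing so, for example an aggregator that probabilistically selects from among several sequencers according to some weights. In addition, we use the Sequencer mechanism to implement the crawling politeness policy for a site. The ability to combine sequencers and cascade them provides a very convenient means to build a flexible recrawl strategy.

Figure 1: [...] URL Stream in the WebFountain Crawler

The strategy is illustrated in Figure 1. At the top level, there are two queues, one for immediate crawling that is intended to be used from a GUI, and one that aggregates all other URLs. Under that, each Ant is assigned a list of sites to be crawled, and maintains an active list of approximately 1000 sites that are currently being crawled. The selection of URLs to be crawled is taken from this active list in a round robin fashion. This avoids crawling any particular site too frequently (the so-called politeness criterion). In addition, each Ant is multithreaded to minimize latency effects. When a site is added to the active list, a sequencer is constructed that loads all data structures from disk, merges the list of newly discovered URLs with the list of previously crawled URLs for that site, and prepares two queues. One of these queues contains URLs that have never been crawled, and the other contains URLs that are scheduled for a recrawl. It is the way we manage our never-crawled and recrawl lists that is to be determined by our optimisation model.

4. MODEL DESCRIPTION

We constructed this model for robustness under growth of the web and changes to its underlying nature. Hence we do not make any a priori assumptions about the distribution of page change rates. However, we do make an assumption that particular historical information on each page is maintained in the metadata for that page. Specifically, each time a page is crawled, we record whether that page has changed since it was last crawled, and use this information to put the page into one of (at most 256) change-frequency 'buckets', recorded as one byte of the page's metadata. Such data becomes more reliable as the page ages. In practice, page change rates range from minutes to as infrequently as a year or more. Rapidly changing pages (several times a day) are almost all specialized [...]. The remaining pages in the repository are grouped into buckets, each containing pages which have similar rates of change.

Figure 2: [...] Bucket per Time Period

A (theoretically) complete crawl of the web is modeled as taking place in a crawler cycle. Both the number of time periods in such a cycle and the (equal) length of each period may be varied as needed. Our fundamental requirement is to estimate the number of obsolete pages in each frequency bucket at the end of each time period. The way in which this is calculated is best illustrated by following the tree of alternatives in Figure 2.

We consider a generic time period and a generic bucket (dropping the subscripts) of pages in the repository, containing a known number of pages at the beginning of the time period. Some of these pages are already obsolete in our repository copy at the beginning of the time period, leaving the remainder up-to-date. Now consider the number of pages in this bucket crawled during the time period, and assume that obsolete and up-to-date pages are crawled with equal probability. This gives the partition of pages in the bucket seen at the second level of branches in the tree. Finally, take the (given) proportion of pages which change in the bucket during the time period, and assume that such a change is detected if the page is crawled in the same time period. The pages in the bucket are now partitioned among the leaves of the tree, and we easily see which leaves correspond to obsolete pages. Note that by adding the expressions attached to these leaves we obtain an expression for the number of obsolete pages at the end of this time period, as a function of the data at the beginning of the period and of the decision variable which specifies how many pages in this bucket to crawl. This relationship is fundamental to our model, and to the way in which the queue of 'old' URLs is managed.

In addition to the 'old' pages we must deal with the 'new' pages either discovered or exogenously supplied during the crawl. Some of these new pages are crawled during a time period (these are then added to the appropriate buckets in the repository). The remaining new uncrawled pages are also regarded as obsolete and will remain in that state until crawled.

We are now ready to give the definitions of the variables and the formal model.

  T: number of time periods in the model
  B: number of buckets, where a bucket refers to pages which change at approximately the same rate
  average time in seconds to crawl an old page in a given bucket and time period
  average time in seconds to crawl a new page in a given time period
  Cconst: total number of seconds available for crawling in a time period
  oldwt: experimental proportional weight on crawling obsolete pages in a given bucket and time period
  newwt: experimental proportional weight on crawling new pages in a given time period
  oldnwt: probability that when an old page in a given bucket is crawled in a given time period, it finds a new page
  newnwt: probability that when a new uncrawled page is crawled in a given time period, it finds a new page
  bound on the number of new pages brought to the attention of the crawler per time period
  number of pages in a bucket at the end of a time period
  distribution of new pages to buckets, proportional to [...]
  fraction of pages in a bucket which change in a time period
  number of obsolete pages in a bucket at the end of a time period
  number of crawled pages in a bucket in a time period
  number of new uncrawled pages at the end of a time period
  number of new pages crawled in a time period

The objective of the time period model is to minimize the weighted sum of obsolete pages, i.e., the obsolete pages in each bucket and time period weighted by oldwt, plus the new uncrawled pages in each time period weighted by newwt, subject to the following constraints:

- The bandwidth available for crawling during each time period (Cconst) may not be exceeded.
- The number of obsolete existing pages is updated as discussed above at every time period.
- The number of new uncrawled pages is updated every time period, using oldnwt and newnwt.
- The total number of pages in each bucket is updated every time period.
- The number of existing pages crawled in any bucket in any time period may not exceed the number of pages available to be crawled in the bucket in that period.
- The number of new pages crawled in any time period must be less than the number of new uncrawled pages.

These constraints hold for every time period 1, ..., T and every bucket 1, ..., B. The critical solution outputs are the crawl values, which tell us how many URLs in the 'old' and 'new' queues we should crawl in each time period, and hence the probabilities in Figure 1.

5. SOLUTION

Since the equations for updating the numbers of obsolete pages are nonlinear, the model must be solved by a nonlinear programming (NLP) method capable of solving large scale problems. For our model runs, it was assumed that there are 500M unique pages on the web, with a growth rate which allows the web to double in size in approximately 400 days. With a value of 14 for the number of time periods, and 255 for the number of buckets, the basic model has approximately 11200 variables, of which 10000 are nonlinear, and 11200 constraints, of which about 3300 are nonlinear. Even after the use of various presolve techniques to reduce the size of the problem, the solution proved to be non-trivial.

We used the NEOS [12] public server system to run experiments with several different NLP solvers on the models and found that the standard NLP package, MINOS [11], gave the best and most reliable results. Solution times for all variations of the model were around ten minutes on MINOS on a timeshared machine.

Since the WebFountain crawler for which this model was designed is in its early stages of development, we have little actual historical data for such parameters as the rate of change of various pages. We have, therefore, used simulated but realistic data, based on the previous studies we have cited from the literature.

5.1 Results

Since the model is large and nonlinear, many experiments were run for a range of values of the critical parameters. Reassuringly, the model is quite robust under most of these changes. However, it turns out to be quite sensitive to changes in the weights in the objective function, oldwt and newwt. Given the overall aim of minimizing the total number of obsolete pages, we describe the implementation of three of the many possible variations of the objective function weights:

- Strategy 1 gives equal weights (summing to 1) to each time period in a cycle of the crawler.
- Strategy 2 gives the last time period the total weight of 1 and the other time periods zero weight, i.e., the model is minimizing the number of obsolete pages just on the last time period of a crawler cycle.
- Strategy 3 gives the last time period a high weight and the remaining time periods very low weights, i.e., it is still trying to minimize the number of obsolete pages in the last time period but it is also taking into account the obsolete pages in all time periods.

Table 3: Model Statistics

            Objective          Total Pages at end    Total Obsolete Pages at end
  Strat1    [...]592 x 10^7    5.[...]               3.[...] x 10^7
  Strat2    1.722 x 10^7       5.[...]               1.[...] x 10^7
  Strat3    2.138 x 10^7       5.[...]               1.[...] x 10^8

5.2 Experiment 1

As shown in Table 3, Strategy 2 gives the minimum value of the objective function, i.e., the weighted total of obsolete pages, followed by Strategy 3. The objective value for Strategy 1 is considerably higher. Figure 3 illustrates the effects of implementing these strategies.

Strategy 2 recommends crawling each page just once during a cycle of the crawler. This uniform crawling is in line with the theoretical results of Cho and Garcia-Molina [6] and Coffman et al. [8]. Strategy 1 recommends crawling fast changing pages in many time periods of a cycle. For these pages, the crawl rate is usually higher even than the page change rate. However, it does not crawl at all those pages which fall into the lowest 40% of page change rates. Strategy 3 is a compromise. It recommends crawling all pages at least once in a crawler cycle, with the fastest changing 18% being crawled more than once per cycle.

Figure 3: [...] Rates under Different Crawl Strategies

Figure 4: [...] Time Period under Different Strategies

While the total number of pages in the repository at the end of a crawler cycle is similar under each strategy, the total number of obsolete pages is not. Figure 4 examines the total number of obsolete pages in the repository at the end of each time period of a cycle under each strategy.

If we disregard the actual objective function and look at the number of obsolete pages, we see that in any given time period (except the last), Strategy 1 always has fewer obsolete pages than Strategy 3 and considerably fewer than Strategy 2. Because of the weights in the objective functions for the different strategies, the lower number of obsolete pages for Strategies 2 and 3 in the last time period is expected.

Mathematically, Strategy 2 is optimal. However, depending on the purpose(s) for which a crawler is designed, it may not be acceptable to have an incremental crawler ignore the page change rate and crawl all pages at a uniform rate, i.e., just once each crawler cycle. In terms of minimizing the number of obsolete pages each time period, Strategy 1 is clearly superior. However, it does not crawl 40% of the pages during a cycle. This may also be unacceptable for many of the possible uses of the crawler. Strategy 3 again provides a reasonable compromise. It crawls all pages within a cycle and still has a somewhat lower mathematical minimum objective value than Strategy 1.

5.3 Experiment 2

Many runs of the model were made with different values of the parameters. In all cases the results were as expected. Experiment 2 is typical. For the Version 1 model, the 500M repository pages were assumed to be distributed equally to each bucket, i.e., it was assumed that there are the same number of pages corresponding to each of the different page change rates. Each of the 255 buckets received 2M pages. In Version 2, the buckets representing the 25% of the fastest changing page rates and the 25% of the slowest changing pages all received 3M pages initially. The buckets representing the middle 50% of page change rates each received 1M pages initially.

Table 4: Model Statistics for Versions 1 and 2 of Strategies 1 and 3

                    Objective         Total Pages at end    Total Obsolete Pages at end
  S1 (Version 1)    [...] x 10^7      5.224 x 10^8          3.086 x 10^7
  S1 (Version 2)    [...] x 10^7      5.214 x 10^8          3.534 x 10^7
  S3 (Version 1)    [...]138 x 10^7   5.224 x 10^9          1.702 x 10^8
  S3 (Version 2)    [...]500 x 10^7   5.193 x 10^9          1.972 x 10^8

Figure 5: Effects on the Number of Obsolete Pages per Time Period of Varying the Page Change Rates

Table 4 shows the effect of this change on the total number of obsolete pages. Figure 5 shows that this change made little difference to the distribution of obsolete pages for each time period. For both Strategy 1 and Strategy 3, there are a larger number of obsolete pages in Version 2. This is expected, as the models in Version 2 started with a higher number of obsolete pages, and this number will grow during a 14 time period cycle of the crawler unless forced to reduce, as in the last few time periods of Strategy 3.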
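The growth of the obsolete count over a 14 time period cycle can be illustrated with a small simulation of the per-bucket bookkeeping described in Section 4. This is only a sketch under stated assumptions: the names `b`, `y`, `x` and `a`, the closed form obtained by summing the "obsolete" leaves of the tree of alternatives, and the sample numbers are introduced here for illustration and are not taken from the model's own equations. The sketch also holds the bucket size fixed, ignoring new pages added during the cycle.

```python
def obsolete_at_end(b, y, x, a):
    """Estimate obsolete pages in one bucket at the end of a time period.

    Hypothetical symbols (assumptions of this sketch, not the paper's notation):
      b: pages in the bucket at the start of the period
      y: pages already obsolete at the start of the period
      x: pages crawled during the period
      a: proportion of pages in the bucket that change during the period
    Assumes obsolete and up-to-date pages are crawled with equal probability,
    and that a crawled page is up-to-date at the end of the period.
    """
    if b <= 0:
        raise ValueError("bucket must contain at least one page")
    if not 0 <= x <= b:
        raise ValueError("cannot crawl fewer than 0 or more than b pages")
    uncrawled = 1.0 - x / b
    still_obsolete = uncrawled * y            # obsolete and not crawled
    newly_obsolete = uncrawled * a * (b - y)  # up-to-date, not crawled, changed
    return still_obsolete + newly_obsolete


def simulate_cycle(b, y0, crawls, a):
    """Apply the update once per time period; return the obsolete count after each."""
    history = []
    y = y0
    for x in crawls:
        y = obsolete_at_end(b, y, x, a)
        history.append(y)
    return history


if __name__ == "__main__":
    # Illustrative numbers only: one 2M-page bucket, 14 periods, uniform crawling.
    b, periods = 2_000_000, 14
    uniform = [b / periods] * periods
    for t, y in enumerate(simulate_cycle(b, 0.0, uniform, a=0.05), start=1):
        print(f"period {t:2d}: about {y:,.0f} obsolete pages")
```

With no crawling at all, the update reduces to y + a(b - y), so the obsolete count can only grow; crawling a larger share of the bucket scales the whole expression down, which is the lever the optimization model adjusts per bucket and per time period.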
Experiment 2 does show the robustness of the model to changes in the initial parameters.

5.4 Experiment 3

The object of Experiment 3 was to determine if the number of obsolete pages continued to grow with time or if this number reached stabilisation. There were four runs, all of the model with an objective corresponding to Strategy 1. Figure 6 illustrates the results. Strategy 1 (Version 1) was run for 14 time periods and for 28 time periods, as was Version 3. In Version 3, it was assumed that the page change rates [...]. The distribution of obsolete pages over time periods between and within each version shows the expected similarities. As can be seen, in any given time period, the number of obsolete pages for Version 3 is approximately half that of Version 1. More importantly, it can be seen that for both versions, the number of obsolete pages is tending to stabilise. It was not possible to run the crawler model for a longer period and obtain a useful mathematical solution, nor would the crawler be run for this long in practice without an update of the parameters and reoptimization.

Figure 6: Trend towards Stabilisation in the Number of Obsolete Pages over Time

The objectives we used, based on the different weighted sums of obsolete pages, correspond to maximising the freshness of the collection under different crawler aims, e.g., all the web must be crawled each cycle, or a certain percentage of pages in the repository should be guaranteed to be no more than, say, a week old.

Taking all the experiments into consideration, the results are consistent with an implementation philosophy of using Strategy 2 in early cycles of the crawler, to drive down the number of obsolete pages in the repository quickly. It would then be beneficial to switch to Strategy 1 or 3 to maintain a stable number.

6. CONCLUSIONS

The computational results we have obtained (albeit with simulated data) suggest that an efficient crawling strategy can be implemented for the WebFountain (or any) incremental crawler without making any theoretical assumptions about the rate of change of pages, but by using information gleaned from actual cycles of the crawler which adaptively build up more extensive and reliable data. Thus the model we have described is adaptive at two levels: within a crawler cycle it coordinates the management of the URL queues over the cycle's component time periods, and between cycles the data necessary for the optimization is updated for the next cycle, particularly the change rates and new page creation rates. We look forward to full scale implementation of the model when the WebFountain crawler begins regular operation.

7. ACKNOWLEDGEMENTS

The authors would like to thank members of IBM's WebFountain team, in particular, Sridhar Rajagopalan and Andrew Tomkins. They would also like to thank Michael Saunders and Paula Wirth for technical assistance.

8. REFERENCES

[1] B. Brewington and G. Cybenko. How dynamic is the web? In Proceedings of the 9th World Wide Web Conference (WWW9), 2000.
[2] B. Brewington and G. Cybenko. Keeping up with the changing web. Computer, pages 52-58, May 2000.
[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th World Wide Web Conference (WWW7), 1998.
[4] A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of 6th International World Wide Web Conference (WWW6), 1997.
[5] J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proceedings of 26th International Conference on Very Large Databases (VLDB), 2000.
[6] J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In Proceedings of 2000 ACM International Conference on Management of Data (SIGMOD), 2000.
[7] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In Proceedings of the 7th World Wide Web Conference (WWW7), 1998.
[8] E. Coffman, Z. Liu, and R. Weber. Optimal robot scheduling for web search engines. Rapport de recherche no 3317. Technical report, INRIA Sophia Antipolis, 1997.
[9] F. Douglis, A. Feldmann, and B. Krishnamurthy. Rate of change and other metrics: a live study of the world wide web. In Proceedings of USENIX Symposium on Internetworking Technologies and Systems, 1997.
[10] A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219-229, 1999.
[11] MINOS. (http://www.sbsi-sol-optimize.com/minos.htm).
[12] NEOS Server for Optimization.
[13] C. Wills and M. Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the 8th World Wide Web Conference (WWW8), 1999.