


An Adaptive Model for Optimizing Performance of an Incremental Web Crawler

Jenny Edwards
Faculty of Information Technology, University of Technology, Sydney, PO Box 123, Broadway NSW 2007, Australia

Kevin McCurley
IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099, USA

John Tomlin
IBM Research Division, Almaden Research Center, 650 Harry Road, San Jose, CA 95120-6099, USA

(This work was completed while the author was [...].)

[...] of which we become aware, either through new links or exogenous information, and define them as also being obsolete.

A particular feature of the model we describe is that it makes no a priori assumptions about the way pages change, simply that we can measure when changes occur, and record the frequency with the pages' metadata. In this sense the model is adaptive, in that the more cycles the crawler is in operation, the more reliable and refined is the data which we have available to drive it. For our purposes a page change is required to be non-trivial, as determined by a shingle (see [4]). Furthermore, while growth in the web changes the data in our model, it has no effect on the size of the model nor the solution time.

No previous work that we know of makes use of precisely this set of assumptions, but much of it has some bearing on our model. After a review of this work, we present a mathematical description of the optimization model. We then describe a series of computational experiments designed to test use of the model and some of its variants on simulated but realistic data.

2. PREVIOUS WORK

There have been several studies of web crawling in its relatively short history, most with a focus rather different from ours. Some have concentrated on aspects relating to caching, e.g., [13] and [9]. Others have been principally interested in the most efficient and effective way to update a fixed size database extracted from the web, often for some specific function, such as data mining; see, e.g., the work of Cho et al. [5, 6, 7]. These studies were performed over time periods ranging from a few days to seven months. However, for differing practical reasons, these crawlers were restricted to subsets of web pages.

Several authors, e.g., Coffman et al. [8], approach crawling from a theoretical point of view, comparing it to the polling systems of queueing theory, i.e., multiple queue-single server systems. However, the problem equivalent to the obsolescence time of a page is unexplored in the queueing literature.

A common assumption has been that page changes are a Poisson or memoryless process, with a parameter giving the rate of change for the pages. Brewington and Cybenko [1] and [...] confirm this within the limits of their data gathering. This is somewhat undermined by another study, based on an extensive subset of the web, by Brewington and Cybenko [2], showing that most web pages are modified during US working hours, i.e., 5am to 5pm (Silicon Valley Standard Time), Monday to Friday. Nevertheless, the widely accepted Poisson model forms the basis for a series of studies on crawler strategies. These lead to a variety of analytical models designed to minimize the age or maximize the freshness of a collection by investigating: how often a page should be crawled; in what order pages should be crawled; and whether a crawling strategy should be based on the importance of pages or their rates of change.

The definitions or metrics for freshness, age and importance are not completely consistent, but in general the freshness of a page refers to the difference between the time of crawling a page and the current time, while the age of a page is the difference between the time when the page last changed and the current time. Widely differing metrics have been offered for the importance of a page, but as the total number of possible metrics is large and the crawler in this study does not currently use any of them, no formal definitions will be given here.

While differing in detail, the experimental results in the referenced papers agree on general trends, and in particular: the average size of individual pages is growing; the proportion of visual and other nontextual material is growing in comparison to text; the number of pages has been growing exponentially ([1] gives two different estimates of 318 and 390 days for the web to double in size, but the growth rate is [...] slower; see our comments at the beginning of Section 1); different domains have very different page change rates; the average [...] (cf. Tables 1 and 2). The last three are more relevant to the study of whole web crawling to be discussed here.

2.1 Crawling Models

In a series of papers, Cho et al. [5, 6, 7] address a number of issues relating to the design of effective crawlers. In [7] they examine different crawling strategies using the Stanford University web pages as a particular subset of the web, and examine several scenarios with different physical limitations on the crawling. Their approach is to visit more important pages first, and they describe a number of possible metrics for determining this, as well as the order in which the chosen pages will be visited. They show that it is possible to build crawlers which can obtain a significant portion of important pages quickly using a range of metrics. Their model appears to be most useful when trying to crawl large portions of the web with limited resources, or if pages need to be revisited often to detect changes.

Cho and Garcia-Molina [6] derive a series of mathematical models to determine the optimum strategy in a number of crawling scenarios, where the repository's extracted copy of the web is [...]. Their models cover the case where all pages are deemed to change at the same average or uniform rate and the case where they change at different or non-uniform rates. For a real life crawler, the latter is more likely to be relevant. For each of these two cases, they examine a variety of synchronization or crawling policies. How often the repository web copy is updated depends on the crawling capacity [...]. Within that limitation is the question of how often each individual page should be crawled to meet a particular objective [...]. Cho and Garcia-Molina examine a uniform allocation policy, in which each page is crawled at the same frequency, and a proportional policy, where each page is crawled with a frequency that is proportional to the frequency with which it is changed on the web, i.e., pages which are updated frequently are crawled more often than those which change only occasionally. Finally, they examine the order in which the pages should be crawled. They develop models of: fixed order, where all pages are crawled repeatedly in the same order each cycle; random order, where all pages are crawled in each cycle but in a random order, e.g., by always starting with the root URL for a site and crawling all pages linked to it; and [...] and others never. Their models [...] crawling pages at a uniform rate and in fixed order.

As we noted, most studies make the assumption that pages change at a variable rate which may be approximated by a Poisson distribution, but in the second stage of their study Cho and Garcia-Molina assume the broader gamma distribution for this change rate. Moreover, they prove that their particular model is valid for any distribution, and conclude that when pages change at varying rates, it is always better to crawl these pages at a uniform rate, i.e., ignoring the rate of change, than at a rate which is proportional to the rate of change. However, to maximize freshness they find a closed form solution to their model which provides an optimal crawling rate which is better than the uniform rate. These results were all derived for a [...] crawler, i.e., where a fixed number of pages is crawled in a given time period. These pages are used to update a fixed size repository, either by replacing existing repository pages with newer versions or by replacing less important pages with those deemed to be of greater importance.

Coffman et al. [8] built a theoretical model to minimize the fraction of time pages spend out of date. Also assuming Poisson page change processes and a general distribution for page access time, they similarly show that optimal results can be obtained by crawling pages as uniformly as possible.

In [5], Cho and Garcia-Molina devise an architecture for an incremental crawler, and examine the use of an incremental versus a batch crawler under various conditions, particularly those where the entire web is not crawled. Crawling a subset (720,000 pages from 270 sites) of the web daily, they determined statistics on the rate of change of pages from different domains. They found, for example, that for the sites [...] daily, in contrast to .edu and .gov domains, where more than 50% of pages did not change in the four months of the study. They show that the rates of change of pages they crawled [...], with the proviso that the figures for pages which change more often than daily or less often than four monthly are [...]. Using different collection procedures, Wills and Mikhailov [13] derive similar conclusions.

A disadvantage of all these models is that they deal only with a fixed size repository of a limited subset of the web. In contrast, our model is flexible, adaptive, based upon the whole web and caters gracefully for its growth.

Table 1: Cumulative Probability Distribution of Page Age in Days

cumulative probability | page age (days)
0.03 | 10^0
0.14 | 10^1
0.48 | 10^2
0.98 | 10^3

Table 2: Cumulative Probability Distribution of Mean Lifetime in Days

cumulative probability | mean lifetime (days)
[...]
2.2 Page Statistics Derived from Crawling

Statistics on page ages, lifetimes, rates of change, etc. are important for our model. Subject to the assumptions of a constant size repository copy of the web, which is updated with periodic uniform reindexing, Brewington and Cybenko [1] showed that in order to be sure that a randomly chosen page is at least 95% fresh or current up to a day ago, the web (of 800M pages) needs a reindexing period of 8.5 days, and a reindexing period of 18 days is needed to be 95% sure that the repository copy of a random page was current up to one week ago. In another study [2], the same authors estimate the cumulative probability function for the page age in days on a log scale, as shown in Table 1. As with similar studies, this does not accurately account for pages which change very frequently or those which change very slowly. Allowing for these biases, the authors also estimate the cumulative probability of mean lifetime in days, shown in Table 2.

Brewington and Cybenko then use their data to examine various reindexing strategies based on a single revisit period, and refer to the need for an optimization model to determine the optimal reindexing strategy when the reindexing period varies per page. Such a model is the main focus of this paper.

3. THE WEBFOUNTAIN CRAWLER

The model in this paper [...] the WebFountain data mining architecture. The features of this crawler that distinguish it from most previous crawlers are that it is fully distributed and incremental. By distributed, we mean that the responsibility for [...] is shared by a cluster of machines. URLs are grouped by site, and a site is assigned to a machine in the cluster, although very large sites such as geocities may actually be split among several machines. There is no global scheduler, nor are there any global queues to be maintained. Moreover, there is no machine with access to a global list of URLs.

An incremental crawler (as opposed to a batch crawler) runs as an ongoing process, and the crawl is never regarded as complete. The underlying philosophy is that the local collection of documents will always grow, always be dynamic, and should be constructed with the goal of keeping the repository as fresh and complete as possible. Instead of devoting all of its effort to crawling newly discovered pages, a percentage of its time is devoted to recrawling pages that were crawled in the past, in order to minimize the number of obsolete pages. Note that our use of the term 'incremental' differs from that of Cho et al. [5, 6, 7]. Their definition assumes that the document collection is of static size, and a ranking function is used to replace documents in the collection with more important documents. We regard the issue of incrementality to be independent of the size of the collection, and we allow for growth [...] in order to meet the demands of an ever-expanding web.

The WebFountain crawler is written in C++, is fully distributed, and uses MPI (Message Passing Interface) for communication between the different components. The major components are the Ants, which are the machines assigned to crawl sites; duplicate detectors, which are responsible for detecting duplicates or near-duplicates; and a single machine called a Controller. The Controller is the control point for the machine cluster, and keeps a dynamic list of site assignments on the Ants. It is also responsible for routing messages for discovered URLs, and manages the overall crawl rate, monitoring of disk space, load balancing, etc.

Other crawlers have been written that distribute the load, but they distribute the work in different ways. Due to the competitive nature of the Internet indexing and searching business, few details are available about the latest [...]. The Google crawler is apparently designed as a batch crawler, and is only partially distributed. It uses a single point of control for scheduling. While this might appear convenient, it provides a bottleneck for intelligent scheduling algorithms, since the scheduling of URLs to be crawled may potentially need to touch a large amount of data (e.g., robots.txt, politeness values, change rate data, DNS records, etc.). Mercator [10] supports incremental crawling using priority values on URLs and interleaving crawling new and old URLs.

The scheduling mechanism of the WebFountain crawler resembles Mercator in that it is fully distributed, very flexible, and can even be changed on the fly. This enables efficient use of all crawling processors and their underlying network. The base software component for determining the ordering of URLs to be crawled consists of a composition of sequencers. Sequencers are software objects that implement a few simple methods to determine the current backlog, whether there are any URLs available to be crawled, and control of loading and flushing data structures to disk. Sequencers are then implemented according to different policies, including a simple FIFO queue or a priority queue. Other Sequencers are combiners, and implement a policy for joining sequencers, for example an aggregator that probabilistically selects from among several sequencers according to some weights. In addition, we use the Sequencer mechanism to implement the crawling politeness policy for a site. The ability to combine sequencers and cascade them provides a very convenient means to build a flexible recrawl strategy.

Figure 1: The URL Stream in the WebFountain Crawler

The strategy that we use is illustrated in Figure 1. At the top level, there are two queues, one for immediate crawling that is intended to be used from a GUI, and one that aggregates all other URLs. Under that, each Ant is assigned a list of sites to be crawled, and maintains an active list of approximately 1000 sites that are currently being crawled. The selection of URLs to be crawled is taken from this active list in a round robin fashion. This avoids crawling any particular site too frequently - the so-called politeness criterion. In addition, each Ant is multithreaded to minimize latency effects. When a site is added to the active list, a sequencer is constructed that loads all data structures from disk, merges the list of newly discovered URLs with the list of previously crawled URLs for that site, and prepares two queues. One of these queues contains URLs that have never been crawled, and one contains URLs that are scheduled for a recrawl. It is the way we manage our never-crawled and recrawl lists that is to be determined by our optimisation model.

4. MODEL DESCRIPTION

We constructed this model for robustness under growth of the web and changes to its underlying nature. Hence we do not make any a priori assumptions about the distribution of page change rates. However, we do make an assumption that particular historical information on each page is maintained in the metadata for that page. Specifically, each time a page is crawled, we record whether that page has changed since it was last crawled, and use this information to put the page into one of (at most 256) change-frequency 'buckets', recorded as one byte of the page's metadata. Such data becomes more reliable as the page ages. In practice, page change rates range from [...] minutes to as infrequently as a year or [...]. The most rapidly changing pages (several times a day) are almost all [...] specialized [...]. The remaining pages in the repository are grouped into buckets, each containing pages which have similar rates of change.
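The one-byte change-frequency bucket described above can be sketched as follows. The text only states that a changed/unchanged observation is recorded at each crawl and that the page lands in one of at most 256 buckets. The class name `PageHistory`, the use of the observed change fraction, and the linear mapping onto bucket indices are assumptions made for illustration, not the crawler's actual scheme.

```python
from dataclasses import dataclass

NUM_BUCKETS = 256  # at most 256 buckets, so the index fits in one byte


@dataclass
class PageHistory:
    """Hypothetical per-page metadata: crawl count and observed changes."""

    crawls: int = 0
    changes: int = 0

    def record_crawl(self, changed: bool) -> None:
        """Update the history after a crawl that did or did not see a change."""
        self.crawls += 1
        if changed:
            self.changes += 1

    def change_fraction(self) -> float:
        """Fraction of crawls on which the page was found to have changed."""
        return self.changes / self.crawls if self.crawls else 0.0

    def bucket(self) -> int:
        """Map the change fraction linearly onto 0..255 (assumed mapping)."""
        index = int(self.change_fraction() * NUM_BUCKETS)
        return min(NUM_BUCKETS - 1, index)


# A page seen to change on two of four crawls lands in the middle bucket.
history = PageHistory()
for changed in (True, False, False, True):
    history.record_crawl(changed)
print(history.bucket())
```

With only a few crawls the estimate is coarse, which matches the remark that the data becomes more reliable as the page ages: each additional crawl refines the change fraction and may move the page to a neighbouring bucket.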
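Figure 2: [...] Bucket per Time Period

A (theoretically) complete crawl of the web is modeled as taking place in a crawler cycle. The number of time periods in such a cycle, T, and the (equal) length of each period may be varied as needed. Our fundamental requirement is to estimate the number of obsolete pages in each frequency bucket at the end of each time period. The way in which this is calculated is best illustrated by following the tree of alternatives in Figure 2.

We consider a generic time period and a generic bucket (dropping the subscript) of pages in the repository, containing b pages at the beginning of the time period. Let the number of such pages for which our repository copy is already obsolete at the beginning of the time period be y, leaving (b - y) up-to-date. Now let x be the number of pages in this bucket crawled during the time period, and assume that obsolete and up-to-date pages are crawled with equal probability. This gives the partition of pages in the bucket seen at the second level of branches in the tree. Finally, let a be the (given) proportion of pages which change in the bucket during the time period, and assume that such a change is detected if the page is crawled in the same time period. The pages in the bucket are now partitioned among the leaves of the tree, and we easily see that the marked leaves correspond to obsolete pages. Note that adding the expressions attached to these leaves gives an expression for the number of obsolete pages at the end of this time period, as a function of the data at the beginning of the period and of the decision variable x, which specifies how many pages in this bucket to crawl. This relationship is fundamental to our model, and to the way in which the queue of 'old' URLs is managed.

In addition to the 'old' pages we must deal with the 'new' pages either discovered or exogenously supplied during the crawl. Let the number of these new pages crawled during a time period be z (these are then added to the appropriate buckets in the repository). The remaining new uncrawled pages are represented by n. These pages are also regarded as obsolete and will remain in that state until crawled.

We are now ready to give the definitions of the variables and the formal model.

T = number of time periods in model
B = number of buckets, where bucket refers to pages which change at approximately the same rate
[...] = average time in seconds to crawl an old page in bucket i in time period t
[...] = average time in seconds to crawl a new page in time period t
Cconst_t = total number of seconds available for crawling in time period t
oldwt_it = experimental proportional weight on crawling obsolete pages in bucket i in time period t
newwt_t = experimental proportional weight on crawling new pages in time period t
oldnwt_it = probability that when an old page in bucket i is crawled in time period t, it finds a new page
newnwt_t = probability that when a new uncrawled page is crawled in time period t, it finds a new page
[...] = minimum number of new pages brought to the attention of the crawler per time period
b_it = number of pages in bucket i at the end of time period t
[...] = distribution of new pages to buckets, proportional to [...]
a_it = fraction of pages in bucket i which change in time period t
y_it = number of obsolete pages in bucket i at end of time period t
x_it = number of crawled pages in bucket i in time period t
n_t = number of new uncrawled pages at end of time period t
z_t = number of new pages crawled in time period t

The objective of the time period model is to minimize the weighted sum of obsolete pages, i.e.:

minimize sum over t of ( sum over i of oldwt_it * y_it + newwt_t * n_t )

subject to the following constraints.

The bandwidth available for crawling during each time period may not be exceeded: the total time spent crawling old and new pages in period t is at most Cconst_t.

The number of obsolete existing pages is updated as discussed above at every time period:

y_it = (1 - x_it / b_i,t-1) * ((1 - a_it) * y_i,t-1 + a_it * b_i,t-1)

The number of new uncrawled pages is updated every time period, using oldnwt and newnwt to account for new pages found while crawling.

The total number of pages in each bucket is updated every time period.

The number of existing pages, x_it, crawled in any bucket in any time period may not exceed the number of pages available to be crawled in the bucket in that period. Similarly, the number of new pages crawled must be less than the number of new uncrawled pages.

These constraints hold for t = 1, ..., T and i = 1, ..., B. The critical solution outputs are the x_it and z_t values, which tell us how many URLs in the 'old' and 'new' queues we should crawl in each time period, and hence the probabilities in Figure 1.

5. SOLUTION

Since the equations updating the number of obsolete pages are nonlinear, the model must be solved by a nonlinear programming (NLP) method capable of solving large scale problems. For our model runs, it was assumed that there are 500M unique pages on the web, with a growth rate which allows the web to double in size in approximately 400 days. With a value of 14 for the number of time periods, and 255 for the number of buckets, the basic model has approximately 11200 variables, of which 10000 are nonlinear, and 11200 constraints, of which about 3300 are nonlinear. Even after the use of various presolve techniques to reduce the size of the problem, the solution proved to be non-trivial.

We used the NEOS [12] public server system to run experiments with several different NLP solvers on the models and found that the standard NLP package, MINOS [11], gave the best and most reliable results. Solution times for all variations of the model were around ten minutes on MINOS on a timeshared machine.

Since the WebFountain crawler for which this model was designed is in its early stages of development, we have little actual historical data for such parameters as the rate of change of various pages. We have, therefore, used simulated but realistic data, based on the studies cited from the literature.

5.1 Results

Since the model [...], many experiments were run for a range of values of the critical parameters. The model is quite robust under most of these changes. However, it turns out to be quite sensitive to changes in the weights in the objective function, oldwt and newwt. Given the overall aim of minimizing the total number of obsolete pages, we describe the implementation of three of the many possible variations of the objective function weights.

Strategy 1 gives equal weights (summing to 1) to each time period in a cycle of the crawler.

Strategy 2 gives the last time period the total weight of 1 and the other time periods zero weight, i.e., the model is minimizing the number of obsolete pages just on the last time period of a crawler cycle.

Strategy 3 gives the last time period a high weight and the remaining time periods very low weights, i.e., it is still trying to minimize the number of obsolete pages in the last time period but it is also taking into account the obsolete pages in all time periods.

Table 3: Model Statistics

strategy | objective | total pages at end | total obsolete pages at end
Strat 1 | 32.592 x 10^7 | 5.[...] | 3.[...] x 10^7
Strat 2 | 1.722 x 10^7 | 5.[...] | 1.[...] x 10^7
Strat 3 | 2.138 x 10^7 | 5.[...] | 1.[...] x 10^8

5.2 Experiment 1

As shown in Table 3, Strategy 2 gives the minimum value of the objective function, i.e., the weighted total of obsolete pages, followed by Strategy 3. The objective value for Strategy 1 is considerably higher. Figure 3 illustrates the effects of implementing these strategies.

Figure 3: [...] Rates under Different Crawl Strategies

Strategy 2 recommends crawling each page just once during a cycle of the crawler. This uniform crawling is in line with the theoretical results of Cho and Garcia-Molina [6] and Coffman et al. [8]. Strategy 1 recommends crawling fast changing pages in many time periods of a cycle. For these pages, the crawl rate is usually higher even than the page change rate. However, it does not crawl at all those pages which fall into the lowest 40% of page change rates. Strategy 3 is a compromise. It recommends crawling all pages at least once in a crawler cycle, with the fastest changing 18% being crawled more than once per cycle.

While the total number of pages at the end of a cycle is similar under each strategy, the total number of obsolete pages is not. Figure 4 examines the number of obsolete pages in each time period of a cycle under each strategy.

Figure 4: Number of Obsolete Pages each Time Period under Different Strategies

If we disregard the actual objective function and look at the number of obsolete pages, we see that in any given time period (except the last), Strategy 1 always has fewer obsolete pages than Strategy 3 and considerably fewer than Strategy 2. Because of the weights in the objective functions for the different strategies, the lower number of obsolete pages for Strategies 2 and 3 in the last time period is expected.

Mathematically, Strategy 2 is optimal. However, depending on the purpose(s) for which the crawl is used, it may not be the most useful [...]. It implies that an incremental crawler can ignore the page change rate and crawl all pages at a uniform rate, i.e., just once each crawler cycle. In terms of minimizing the number of obsolete pages each time period, Strategy 1 is clearly superior. However, it does not crawl 40% of the pages [...] many of the possible uses of the crawler. Strategy 3 crawls all pages within a cycle and still has a somewhat lower mathematical minimum objective value than Strategy 1.

5.3 Experiment 2

Many versions of the model were run, changing the values of different parameters. In all cases the results were as expected. Experiment 2 is a typical example. For the Version 1 model, the 500M repository pages were assumed to be distributed equally to each bucket, i.e., it was assumed that there are the same number of pages corresponding to each of the different page change rates. Each of the 255 buckets received 2M pages. In Version 2, the buckets representing the 25% of the fastest changing page rates and the 25% of the slowest changing pages all received 3M pages initially. The buckets representing the middle 50% of page change rates each received 1M pages initially.

Table 4: Model Statistics for Versions 1 and 2 of Strategies 1 and 3

strategy | objective | total pages at end | total obsolete pages at end
S1 (Version 1) | [...] x 10^7 | 5.224 x 10^8 | 3.086 x 10^7
S1 (Version 2) | [...] x 10^7 | 5.214 x 10^8 | 3.534 x 10^7
S3 (Version 1) | 2.138 x 10^7 | 5.224 x 10^9 | 1.702 x 10^8
S3 (Version 2) | [...].500 x 10^7 | 5.193 x 10^9 | 1.972 x 10^8

Figure 5: Effects on the Number of Obsolete Pages per Time Period of Varying the Page Change Rates

Table 4 shows the effect on the total number of obsolete pages of this change. Figure 5 shows that this change made little difference to the distribution of obsolete pages for each time period. For both Strategy 1 and Strategy 3, there are a larger number of obsolete pages in Version 2. This is expected, as the models in Version 2 started with a higher number of obsolete pages and [...] this number will grow during a 14 time period cycle of the crawler unless forced to reduce, as in the last few time periods of Strategy 3.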
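The per-bucket bookkeeping of Section 4 can be traced through a whole crawler cycle with a few lines of code. This is a sketch under stated assumptions rather than the optimization model itself: it follows one bucket with b pages, y of them obsolete, x crawled per period and a fraction a changing per period, using the tree-of-alternatives relation (uncrawled pages are obsolete if they already were or if they change). The bucket size of 2M and the 14 periods echo Experiment 2, while the change fraction of 0.05, the uniform crawl plan, and the absence of web growth are hypothetical simplifications.

```python
def obsolete_after_period(b, y, x, a):
    """Obsolete pages in one bucket at the end of a time period.

    b: pages in the bucket at the start of the period
    y: pages already obsolete at the start of the period
    x: pages crawled during the period (obsolete and fresh equally likely)
    a: fraction of pages that change during the period
    Crawled pages are assumed current at the end of the period.
    """
    if not 0 <= x <= b:
        raise ValueError("crawl count must lie between 0 and bucket size")
    uncrawled_share = 1.0 - x / b
    return uncrawled_share * ((1.0 - a) * y + a * b)


def simulate_cycle(b, y0, a, crawl_plan):
    """Apply the update once per period and return the obsolete counts."""
    history = []
    y = y0
    for x in crawl_plan:
        y = obsolete_after_period(b, y, x, a)
        history.append(y)
    return history


periods = 14
bucket_size = 2_000_000   # Version 1 bucket size from Experiment 2
change_fraction = 0.05    # hypothetical value for illustration
uniform_plan = [bucket_size / periods] * periods  # each page once per cycle

history = simulate_cycle(bucket_size, 0.0, change_fraction, uniform_plan)
for period, obsolete in enumerate(history, start=1):
    print(f"period {period:2d}: {obsolete:12.0f} obsolete pages")
```

Under these assumptions the obsolete count rises from zero and flattens out below the bucket size, the same qualitative stabilisation that Experiment 3 looks for. In the actual model the crawl counts are decision variables chosen by the NLP solver rather than a fixed plan.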
Experiment 2 does show the robustness of the model to changes in the initial parameters.

5.4 Experiment 3

The object of Experiment 3 was to determine if the number of obsolete pages continued to grow with time or if this number reached stabilisation. There were four runs, all of the model with an objective corresponding to Strategy 1. Figure 6 illustrates the results.

Figure 6: Trend [...] Number of Obsolete Pages over Time

Strategy 1 (Version 1) was run for 14 time periods and for 28 time periods, as was Version 3. In Version 3, it was assumed that the page change [...]. The distribution of obsolete pages over time periods between and within each version shows the expected similarities. As can be seen, in any given time period the number of obsolete pages for Version 3 is approximately half that of Version 1. More importantly, it can be seen that for both versions, the number of obsolete pages is tending to stabilise. It was not possible to run the crawler model for a longer period and obtain a useful mathematical solution, nor would the crawler be run for this long in practice without an update of the parameters and reoptimization.

The objectives we used, based on the different weighted sums of obsolete pages, correspond to maximising the freshness of the collection under different crawler aims, e.g., all the web must be crawled each cycle, or a certain percentage of pages in the repository should be guaranteed to be no more than, say, a week old.

Taking all the experiments into consideration, the results are consistent with an implementation philosophy of using Strategy 2 in early cycles of the crawler, to drive down the number of obsolete pages in the repository quickly. It would then be beneficial to switch to Strategy 1 or 3 to maintain a stable number.

6. CONCLUSIONS

The computational results we have obtained (albeit with simulated data) suggest that an efficient crawling strategy can be implemented for an incremental crawler without making any theoretical assumptions about the rate of change of pages, but by using information gleaned from actual cycles of the crawler which adaptively build up more extensive and reliable data. Thus the model we have described is adaptive at two levels: within a crawler cycle it coordinates the management of the URL queues over the cycle's component time periods, and between cycles the data necessary for the optimization is updated for the next cycle - particularly the change rates and new page creation rates. We look forward to full scale implementation of the model when the WebFountain crawler begins regular operation.

7. ACKNOWLEDGEMENTS

The authors would like to thank members of IBM's WebFountain team, in particular Sridhar Rajagopalan and Andrew Tomkins. They would also like to thank Michael Saunders [...] and Paula Wirth for technical assistance.

8. REFERENCES

[1] B. Brewington and G. Cybenko. How dynamic is the web? In Proceedings of the 9th World Wide Web Conference (WWW9), 2000.
[2] B. Brewington and G. Cybenko. Keeping up with the changing web. IEEE Computer, pages 52-58, May 2000.
[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th World Wide Web Conference (WWW7), 1998.
[4] A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of 6th International World Wide Web Conference (WWW6), 1997.
[5] J. Cho and H. Garcia-Molina. The evolution of the web and implications for an incremental crawler. In Proceedings of 26th International Conference on Very Large Databases (VLDB), 2000.
[6] J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness. In Proceedings of 2000 ACM International Conference on Management of Data (SIGMOD), 2000.
[7] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In Proceedings of the 7th World Wide Web Conference (WWW7), 1998.
[8] E. Coffman, Z. Liu, and R. Weber. Optimal robot scheduling for web search engines. Rapport de recherche no 3317. Technical report, INRIA Sophia Antipolis, 1997.
[9] F. Douglis, A. Feldmann, and B. Krishnamurthy. Rate of change and other metrics: a live study of the world wide web. In Proceedings of USENIX Symposium on Internetworking Technologies and Systems, 1997.
[10] A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219-229, 1999.
[11] MINOS. (http://www.sbsi-sol-optimize.com/minos.htm).
[12] NEOS Server for Optimization.
[13] C. Wills and M. Mikhailov. Towards a better understanding of web resources and server responses for improved caching. In Proceedings of the 8th World Wide Web Conference (WWW8), 1999.