Random Web Crawls

Toufik Bennouas, Criteo R&D, 6 boulevard Saint-Denis, 75010 Paris, France. t.bennouas@criteo.com
Fabien de Montgolfier, LIAFA, Université Paris 7, 2 place Jussieu, Case 7014, 75251 Paris, France. fm@liafa.jussieu.fr

ABSTRACT
This paper proposes a random Web crawl model. A Web crawl is a (biased and partial) image of the Web. This paper deals with the hyperlink structure, i.e. a Web crawl is a graph whose vertices are the pages and whose edges are the hypertextual links.
be removed from the Web and then deleted from the crawl. They have very few sources (pages with no incoming links, either submitted by people or unlinked after a while) and a lot of sinks (pages not crawled yet or with no external hyperlink).

2.1 Connectivity and Clustering
Small-World graphs, defined by Watts and Strogatz [24] and studied by many authors since, are graphs that fulfill the following two properties:
1. the characteristic path length (average distance between two vertices) is small: O(log n) or O(log log n)
2. the clustering coefficient (probability that two vertices sharing a common neighbor are linked) is high: O(1).
This definition can be applied to directed graphs when omitting arc direction. The characteristic path length of the Web (in fact, of the crawls) seems small (about 16 clicks [8] or 19 [3]), consistent with the O(log n) axiom. Its clustering coefficient is high. Existing computations of its exact value differ, but it is admitted that it is greater than 0.1, while random graphs (Erdős-Rényi, see Section 3.1) with the same average degree have clustering coefficient p ≈ 0. Crawl diameter (maximum distance between two pages) is potentially infinite, because a dynamic page labeled by n in its URL may refer to a dynamic page labeled by n+1, but since Web crawlers usually perform BFS (see Section 4.1) the diameter of crawls may actually be small.

2.2 Degree distribution
Zipf laws (a.k.a. power laws) are probability laws such that log(Prob(X = d)) = -λ log(d). If the in-degree (respectively out-degree) distribution of a graph follows a Zipf law, Prob(X = d) is the probability for a vertex to have in- (resp. out-) degree d. In other words, the number of vertices with degree d is k · d^(-λ) (k depends on the number of vertices n). A graph class such that the degrees of almost all graphs follow a Zipf law is called scale-free, because some parameters like λ are scale invariant. Scale-free graphs have been extensively studied [4, 6, 17, 18]. Many graphs modeling social networks, interactions between objects (proteins, people, neurons...) or other networks seem to have the scale-free property. For Web crawls, a measurement by Broder et al. [8] on a 200,000,000-page crawl shows that the in- and out-degrees follow Zipf laws. The exponents are λ_in = 2.1 for in-degree and λ_out = 2.72 for out-degree.

2.3 Strongly connected components and the Bow Tie structure
According to Broder, Kumar et al. [8] the Web has a Bow Tie structure: a quarter of the pages are in a Giant Strongly Connected Component (GSCC), a quarter are the "in" pages, leading to the GSCC but not linked from there, another quarter are the "out" pages, reachable from the GSCC but not linking to it, and the last quarter is not related to the GSCC. This famous assertion was reported even by Nature [23] but, for the last four years, an increasing number of people have suspected that it is a crawling artifact. According to the same survey, the distribution of the size of strongly connected components follows a Zipf law with exponent roughly 2.5.

2.4 Cores
Another well-known property of crawls is the existence of cores. A core is a dense directed bipartite subgraph, consisting of many hub pages (or fans) pointing to many authorities. It is supposed [19, 16] that such cores are the central structure of cybercommunities, sets of pages about the same topic. The authorities are the most relevant pages; they do not necessarily point to one another (because of competition, for instance), but the hubs list most of them. Starting from this assumption, the HITS algorithm [15] ranks the pages containing a given keyword according to a hub factor and an authority factor. Kumar et al. [19, 18] enumerate over 200,000 bipartite cores in a 200,000,000-page crawl of the Web. Core sizes (counting hubs, authorities, or both) follow Zipf laws with exponent between 1.09 and 1.4.

2.5 Spectral properties and PageRank factor
Another ranking method, the most popular since it does not depend on given keywords, is Google's PageRank factor [21]. It is an accessibility measure of the page. Briefly, the PageRank of a page is the probability for a random surfer to be present on this page after a very long surf. It can be computed by basic linear algebra algorithms. The PageRank distribution also follows a Zipf law, with the same exponent as the in-degree distribution [22]. Pages with high PageRank are very visible, since they are effectively popular on the Web and are linked from other pages with high PageRank. A crawler therefore easily finds them [5, 20], while it may miss low-ranked pages. This is indeed a useful bias for search engine crawlers!

3. RANDOM GRAPH MODELS
3.1 Basic model: Erdős-Rényi
For a long time the most used random graph model was the Erdős-Rényi model [13]. The random graph depends on two parameters, the number of vertices n and the probability p for two vertices to be linked. The existence of each edge is a random variable independent from the others. For suitable values (p = d/n), E.-R. graphs indeed have a characteristic path length of O(log n), but very small clustering (p = o(1)) and a degree distribution following a Poisson law, not a Zipf law. Therefore they do not accurately describe crawls. Other models have since been proposed where attachment is not independent.

3.2 Incremental generation models
Most random Web graph models [4, 6, 17, 18, 7] propose an incremental construction of the graph. When the existence of a link is probed, it depends on the existing links. This process models the creation of the Web across time. In some models all the links going from a page are inserted at once; in other ones insertion is incremental.

3.3 Preferential attachment models
The first evolving graph model (BA) was given by Barabási and Albert [4]. The main idea is that new nodes are more likely to join to existing nodes with high degrees. This model is now referred to as an example of a preferential attachment model. They concluded that the model generates graphs whose in-degree distribution follows a Zipf law with exponent λ = 3.
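The preferential attachment idea can be illustrated with a short simulation. The following is a generic sketch of the BA process, not the authors' code; the function name, the initial clique, and the fixed number m of edges per new node are illustrative assumptions:

```python
import random

def preferential_attachment(n, m, seed=0):
    """Grow a graph in BA style: each new node attaches m edges to
    existing nodes chosen with probability proportional to degree.
    Returns the degree of every node. (Illustrative sketch.)"""
    random.seed(seed)
    # Start from a small clique on m+1 nodes, each of degree m.
    degree = [m] * (m + 1)
    # 'targets' holds each node id once per incident edge, so a
    # uniform draw from it is a degree-proportional draw.
    targets = [v for v in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:          # m distinct neighbors
            chosen.add(random.choice(targets))
        degree.append(m)                # the new node gets degree m
        for v in chosen:
            degree[v] += 1
            targets.append(v)
        targets.extend([new] * m)
    return degree
```

Running this for a few thousand nodes produces the expected heavy tail: a handful of early nodes accumulate degrees far above the average, while most nodes keep degree close to m.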
Another preferential attachment model, called the Linearized Chord Diagram (LCD), was given in [6]. In this model a new vertex is created at each step and connects to existing vertices with a constant number of edges. A vertex is selected as the end-point of an edge with probability proportional to its in-degree, with an appropriate normalization factor. In-degrees follow a Zipf law with exponent roughly 2 when out-degrees are 7 (constant). In the ACL model [2], each vertex is associated with an in-weight (respectively out-weight) depending on its in-degree (respectively out-degree). A vertex is selected as the end-point of an edge with probability proportional to its weight. In these models edges are added but never deleted. The CL-del model [9] and the CFV model [10] incorporate in their design both the addition and deletion of nodes and edges.

3.4 Copy models
A model was proposed by [17] to explain other relevant properties of the Web, especially the great number of cores, since the ACL model generates graphs which on average contain few cores. The linear growth copying model from Kumar et al. [18] postulates that a Web page author shall copy an existing page when writing their own, including the hyperlinks. In this model, each new page has a master page from which it copies a given amount of links. The master page is chosen proportionally to in-degree. Other links from the new page are then added following uniform or preferential attachment. The result is a graph with all the properties of the previous models, plus the existence of many cores. These models often use many parameters needing fine tuning, and sociological assumptions on how Web pages are written. We propose a model based on a computer science assumption: the Web graphs we know are produced by crawlers. This allows us to design a simpler (it depends on only two parameters, obtained from experiments) and very accurate model of Web crawls.

4. A WEB CRAWL MODEL
In this section, we present the crawling strategies and derive our Web crawl model from them. It aims to mimic the crawling process itself, rather than the page writing process as Web graph models do.

4.1 Web crawl strategies
Let us consider a theoretical crawler. We suppose the crawler visits each page only once. The benefit is to avoid modeling the disappearance of pages or links across time, because the law it follows is still debatable (is the pages' lifetime related to their popularity, or to their degree properties?). When scanning a page, the crawler gets at once the set of its outgoing links. At any time the (potentially infinite) set of valid URLs is divided into:
1. Crawled: the corresponding pages were visited and their outgoing links are known
2. Unvisited: a link to this URL has been found but not probed yet
3. Erroneous: the URL was probed but points to a non-existing or non-HTML file (some search engines index them, but they do not contain URLs and are not interesting for our purposes)
4. Unknown: the URL was never encountered
The crawling algorithm basically chooses and removes from its Unvisited set a URL to crawl, and then adds the outgoing unprobed links of the page, if any, to the Unvisited set. The crawling strategy is the way the Unvisited set is managed. It may be:
DFS (depth-first search): the strategy is LIFO and the data structure is a stack
BFS (breadth-first search): the strategy is FIFO and the data structure is a queue
DEG (higher degree): the most pointed-to URL is chosen; the data structure is a priority queue (a heap)
RND (random): a uniform random URL is chosen
We suppose the crawled pages are ordered by their discovery date. For discussing structural properties, only the crawled pages are to be considered. Notice that the first three strategies can only be correctly implemented on a single computer. The most powerful crawlers are distributed over many computers and their strategy is hard to define. It is usually something between BFS and Random.

4.2 Model description
Our model shall mimic a crawler strategy. It works in two steps: first constructing the set of pages, then adding the hyperlinks.

Constructing the set of pages. Each page p has two fields: its in-degree din(p) and its out-degree dout(p). In the first step of the crawl constructing process, we set a value for each of them. The in-degree and out-degree are set according to two independent Zipf laws. The exponent of each law is a parameter of the model, therefore our model depends on two parameters, λ_in (for in-degree) and λ_out (for out-degree). These values are well known for real crawls: following [8], we have λ_in = 2.1 and λ_out = 2.72. We shall have to choose the pages at random according to their in-degree. To solve this problem, n pages (the maximal size of the crawl) are generated and their in- and out-degrees are set. Then a multiset L is created, where each page p is duplicated din(p) times. The size of this set is the maximal number of hyperlinks. Each time we need to choose a page at random according to the in-degree law, we just have to remove one element from L.

Constructing the hyperlinks. Now the page degrees are pre-set, but the graph topology is not yet defined. An algorithm, simulating a crawl, shall add the links. There are indeed four algorithms, depending on which crawling strategy shall be simulated. The generic algorithm is simply:
1. Remove a page p from the UnvisitedSet and mark p as crawled
2. Remove the first dout(p) pages from L
3. Set these pages as the pages pointed to by p
4. Add the unvisited ones to the UnvisitedSet
5. Go to 1

Figure 1: Out-degree distribution at three steps of a BFS

The UnvisitedSet is seeded with one or more pages. The way it is managed depends on which crawling strategy is simulated, i.e. which algorithm is chosen:
For the DFS algorithm, the UnvisitedSet is a stack (LIFO)
For the BFS algorithm, it is a queue (FIFO)
For the DEG algorithm, it is a priority queue (heap)
For the RND algorithm, a random page is extracted from the UnvisitedSet
Because the average out-degree of a page is large enough, the crawling process will not stop until almost all pages have been crawled. The progress of the crawl (expressed in percent) is the fraction of crawled pages over n. As it approaches n, some weird things will occur, since no more unknown pages are allowed. In our experiments (see the next section) we sometimes go up to 100% progress, but results are more realistic before 30%, when the crawl can expand toward unknown pages. Our model differs radically from preferential attachment or copy models because the neighborhood of a page is not set at writing time but at crawling time. So a page is allowed to point to known or unknown pages as well.

5. RESULTS
We present here simulation results using the different strategies and show how the measurements evolve across time. Thanks to the scale-free effect, the actual number of pages does not matter, as long as it is big enough. We have used several graphs of different sizes but with the same exponents λ_in = 2.1 and λ_out = 2.72 (experimental values from [8]). Unless otherwise specified, we present results from BFS, the most used crawling strategy, and simulations up to 20,000,000 crawled pages.

5.1 Degree distribution
At any step of the crawl, the actual degree distribution follows a Zipf law with the given parameters (2.1 and 2.72) with very small deviation (see Figures 1 and 2). This result is independent of the crawl strategy (BFS, etc.). It demonstrates that our generated crawls really are scale-free graphs.

Figure 2: In-degree distribution at three steps of a BFS

5.2 Small World properties
The distribution of path lengths (Figure 3) clearly follows a Gaussian law for the BFS, DEG and RAND strategies. This distribution is plotted at progress 30%, but it does not change much across time, as shown in Figure 5. DFS produces far greater distances between vertices, and the distribution follows an unknown law (Figure 4). The diameter of DFS crawls is about 10% of the number of vertices! This is because DFS crawls are like long tight trees. This is why DFS is not used by real crawlers, and why this paper focuses on the three other crawl strategies. The clustering coefficient (Figure 6, computed on a 500,000-page simulation) is high and does not decrease too much as the crawl grows bigger. Our crawls definitely are small-world graphs.

5.3 Bow tie structure?
The relative sizes of the four bow-tie components (SCC, IN, OUT and OTHER) are roughly the same for the BFS, DEG and even RAND (but not DFS) strategies (Figure 7). When using only one seed, the size of the largest SCC converges toward two thirds of the size of the graph. These proportions thus differ from the crawl observations of [8], since the "in" and "others" parts are smaller. But with many seeds (which may be seen as many pages submitted to the crawler portal) the size of the "in" component is larger and can be up to one quarter of the pages. Our model indeed replicates the bow-tie topology of genuine crawls very well.

5.4 Cores
We used Agrawal's practical algorithm [1] for core enumeration (notice that the maximal core problem is NP-complete). Figure 8 gives the number of cores of a given minimal size for a crawl of up to 25,000 vertices. As shown, the number of cores is very dependent on the exponents of the Zipf laws, since high exponents mean sparser graphs. It means that our simulated crawls contain many cores, as real crawls do. Figure 9 shows that the number of (4,4)-cores (at least four hubs and four authorities) is proportional to n and after a while stays between n/100 and n/50.

Figure 3: Distribution of path length for BFS, DEG and RAND
Figure 4: Distribution of path length for DFS (log/log scale)
Figure 5: Evolution of diameter and average path length across time for BFS, DEG and RAND
Figure 6: Evolution of clustering coefficient across time
Figure 7: Evolution of the size of the largest SCC (left) and of the OUT component (right) across time. One seed, up to 500,000 pages in the crawl

5.5 PageRank distribution and quality-page crawling speed
Figure 10 shows the PageRank distribution (PageRank is normalized to 1 and logarithms are therefore negative). We have found results similar to the observations of Pandurangan et al. [22]: the distribution is a Zipf law with exponent 2.1. The crawl quickly converges to this value. Figure 11 shows the sum of the PageRank of the crawled pages across time (the PageRank is computed at the end of the crawl, so that it varies from 0 at the beginning to 1 when the crawl stops). In very few steps, the BFS and DEG strategies find the very small set of pages that contains most of the total PageRank. This property of real BFS crawlers has been known since Najork and Wiener [20]. Our results can be compared to the crawling strategy experiments of Boldi et al. [5].

5.6 Discovery speed and frontier
Figure 12 shows another dynamical property: the discovery rate. It is the probability for the extremity of a link to have already been crawled. It converges toward 40% for all strategies. This is an interesting scale-free property: after a while, the probability for a URL to point to a new page remains very high, about 60%. This "expander" property is very useful for true crawlers. This simulation shows that it does not depend only on the dynamical nature of the Web, but also on the crawling process itself.
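The simulations above follow the two-step generation of Section 4.2: draw degrees from two independent Zipf laws, build the multiset L, then simulate a crawl over it. A minimal Python sketch, assuming a BFS frontier, a single seed page, a bounded degree range (max_deg) for sampling, and a minimum degree of 1; these bounds are simplifying assumptions, not part of the paper's model:

```python
import random
from collections import deque

def simulate_crawl(n, lam_in=2.1, lam_out=2.72, seed=0, max_deg=100):
    """Sketch of the Section 4.2 model with a BFS frontier.
    Returns the crawl as a list of (page, successors) pairs."""
    random.seed(seed)

    def zipf_sample(lam, size):
        # Prob(d) proportional to d**(-lam), d in [1, max_deg].
        weights = [d ** -lam for d in range(1, max_deg + 1)]
        return random.choices(range(1, max_deg + 1), weights, k=size)

    d_out = zipf_sample(lam_out, n)
    d_in = zipf_sample(lam_in, n)
    # L: each page p duplicated d_in(p) times, shuffled so that
    # popping an element is an in-degree-proportional draw.
    L = [p for p in range(n) for _ in range(d_in[p])]
    random.shuffle(L)

    crawled, seen = [], {0}
    frontier = deque([0])            # BFS: FIFO queue, seeded with page 0
    while frontier and L:
        p = frontier.popleft()       # step 1: crawl p
        succ = [L.pop() for _ in range(min(d_out[p], len(L)))]  # steps 2-3
        crawled.append((p, succ))
        for q in succ:               # step 4: enqueue unvisited pages
            if q not in seen:
                seen.add(q)
                frontier.append(q)
    return crawled
```

Swapping the deque for a stack, a heap keyed on current in-degree, or uniform random extraction yields the DFS, DEG and RND variants.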
hubs auth cores
2 2 220
2 3 83
2 4 37
2 5 14
2 6 14
2 7 14
3 3 84
3 4 37
3 5 14
3 6 5
3 7 2
4 4 40
4 5 14
4 6 5
4 7 2
5 5 12
5 6 3
6 6 3
Figure 8: Number of small cores (over 1000 pages)

Figure 9: (4,4)-cores distribution

Figure 13 focuses on a well-known topological property of crawls, which our simulations also produce: the very high number of sinks regardless of crawl size. Notice that their existence is a problem for practical PageRank computation [21]. In other words, the large "out" component of the bow-tie is very broad and short... Eiron et al. survey the Web frontier ranking [12].

6. CONCLUSION
As said in Section 2, a good crawl model should output graphs with the following properties:
1. highly clustered
2. with a short characteristic path length
3. in- and out-degree distributions following Zipf laws
4. with many sinks
5. such that high-PageRank vertices (as computed in the final graph) are crawled early
6. with a bow tie structure

Figure 10: PageRank distribution
Figure 11: PageRank capture
Figure 12: Evolution of the discovery rate
Figure 13: Evolution of the proportion of sinks (pages with no crawled successors) among crawled pages
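The PageRank measurements reported in Figures 10 and 11 can be reproduced with the basic power iteration mentioned in Section 2.5. A minimal sketch; the damping factor 0.85 and the uniform redistribution of sink mass are standard conventions assumed here, not stated in the paper:

```python
def pagerank(succ, n, damping=0.85, iters=100):
    """Power iteration on an adjacency list succ[p] = pages p points to.
    Sinks redistribute their rank uniformly. Sketch only."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        # baseline teleportation mass
        new = [(1.0 - damping) / n] * n
        # total rank currently sitting on sinks (pages with no successors)
        sink_mass = sum(rank[p] for p in range(n) if not succ[p])
        for p in range(n):
            if succ[p]:
                share = damping * rank[p] / len(succ[p])
                for q in succ[p]:
                    new[q] += share
        for p in range(n):
            new[p] += damping * sink_mass / n
        rank = new
    return rank
```

On a symmetric graph such as a directed 3-cycle, every page ends up with rank 1/3, and the ranks always sum to 1, which makes the "total PageRank captured" curve of Figure 11 well defined.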
As shown in Section 5, our model meets all these objectives. Property 3 of course is ensured by the model, but the other ones are results of the generating process. The basic assumption of degree distribution, together with the crawling strategy, is enough to mimic the properties observed in large real crawls. This is conceptually simpler than other models that also have the same properties, like the Copy model [17]. The Bow Tie structure we observe differs from [8], since the largest strongly connected component is larger. But together with the other topological properties measured, it proves that we reproduce the topology of real crawls quite well with our very simple model. This is nice, because we make fewer assumptions than [6] or [18]. Our approach is different from the Web graph models, which mimic the page writing strategy instead of the page crawling, but it gives similar results. It points out that we need more numerical or other measures on graphs in order to analyze their structure. The BFS, RAND and DEG strategies are the most used in simple crawlers. We show that they produce very similar results for topological aspects. For dynamical aspects (PageRank capture for instance) BFS and DEG seem better, but are harder to implement in a real crawler. DFS is definitely bad, and for this reason is not used by crawlers. Parallel crawlers use, however, more sophisticated strategies that were not modeled here. So our random Web crawl model can be compared with the existing random Web graph models [4, 6, 17, 18, 7, 2, 9, 10]. But unlike them, it is not based on sociological assumptions about how the pages are written, but on an assumption on the law followed by the page degrees and, for the structural properties, on only one assumption: that the graph is output by a crawler. The design is then quite different from the design of the random Web graph models, but the results are the same. We can interpret this conclusion in a pessimistic way: it is hard to tell what the biases of crawling are. Indeed we have not supposed that the Web graph has any specific property other than degrees following a Zipf law, and yet our random crawls have all the properties of real crawls. This means that one can crawl anything following a Zipf law, not only the Web, and output crawls with the specific properties of Web crawls. So the comparison of the result of a Web graph model with real crawls may not be enough to assert that the model captures properties of the Web.

7. REFERENCES
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB '94, pages 487-499. Morgan Kaufmann Publishers Inc., 1994.
[2] W. Aiello, F. Chung, and L. Lu. A random graph model for massive graphs. In Proceedings of the thirty-second annual ACM Symposium on Theory of Computing, pages 171-180, 2000.
[3] R. Albert, H. Jeong, and A.-L. Barabási. Diameter of the world-wide web. Nature, 401:130-131, September 1999.
[4] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999.
[5] P. Boldi, M. Santini, and S. Vigna. Do your worst to make the best: Paradoxical effects in PageRank incremental computations. In Workshop on Algorithms and Models for the Web-Graph, 2004.
[6] B. Bollobás, O. Riordan, J. Spencer, and G. Tusnády. The degree sequence of a scale-free random graph process. Random Structures and Algorithms, 18(3):279-290, May 2001.
[7] A. Bonato. A survey of models of the web graph. In Proceedings of Combinatorial and Algorithmic Aspects of Networking, August 2004.
[8] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener, A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas. Graph structure of the web. In Proceedings of the ninth international conference on World Wide Web, pages 309-320. Foretec Seminars, Inc., 2000.
[9] F. Chung and L. Lu. Coupling on-line and off-line analyses for random power graphs. Internet Mathematics, 1(4):409-461, 2003.
[10] C. Cooper, A. Frieze, and J. Vera. Random deletion in a scale free random graph process. Internet Mathematics, 1(4):463-483, 2003.
[11] K. Efe, V. Raghavan, C. H. Chu, A. L. Broadwater, L. Bolelli, and S. Ertekin. The shape of the Web and its implications for searching the Web, 2000.
[12] N. Eiron, K. S. McCurley, and J. A. Tomlin. Ranking the web frontier. In WWW conference, 2004.
[13] P. Erdős and A. Rényi. On random graphs. In Publicationes Mathematicae, volume 6, 1959.
[14] H. Ino, M. Kudo, and A. Nakamura. Partitioning of web graphs by community topology. In WWW '05: Proceedings of the 14th international conference on World Wide Web, pages 661-669, 2005.
[15] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604-632, 1999.
[16] J. M. Kleinberg. Hubs, authorities, and communities. ACM Computing Surveys (CSUR), 31(4es):5, 1999.
[17] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, page 57. IEEE Computer Society, 2000.
[18] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large-scale knowledge bases from the web. In VLDB, pages 639-650, 1999.
[19] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. In WWW conference, 1999.
[20] M. Najork and J. L. Wiener. Breadth-first crawling yields high-quality pages. In Proceedings International WWW Conference (10), Hong Kong, 2001.
[21] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical report, Computer Science Department, Stanford University, 1998.
[22] G. Pandurangan, P. Raghavan, and E. Upfal. Using PageRank to characterize web structure. In 8th Annual International Conference on Computing and Combinatorics, pages 330-339, 2002.
[23] Nature, 405:113, 11 May 2000.
[24] D. J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 393:440-442, 1998.