Random Web Crawls

Toufik Bennouas
Criteo R&D, 6 boulevard Saint Denis, 75010 Paris, France
t.bennouas@criteo.com

Fabien de Montgolfier
LIAFA, Université Paris 7, 2 place Jussieu, Case 7014, 75251 Paris, France
fm@liafa.jussieu.fr

ABSTRACT
This paper proposes a random Web crawl model. A Web crawl is a (biased and partial) image of the Web. This paper deals with the hyperlink structure, i.e. a Web crawl is a graph whose vertices are the pages and whose edges are the hypertextual links. Of course a Web crawl has a very special…


…be removed from the Web and then deleted from the crawl. They have very few sources (pages with no incoming links, either submitted by people or unlinked after a while) and a lot of sinks (pages not crawled yet or with no external hyperlink).

2.1 Connectivity and Clustering
Small-world graphs, defined by Watts and Strogatz [24] and studied by many authors since, are graphs that fulfill the following two properties:
1. the characteristic path length (average distance between two vertices) is small: O(log n) or O(log log n)
2. the clustering coefficient (probability that two vertices sharing a common neighbor are linked) is high: O(1).
This definition can be applied to directed graphs when omitting arc direction. The Web (in fact the crawls) has a characteristic path length that seems small (about 16 clicks [8] or 19 [3]), consistent with the O(log n) axiom. Its clustering coefficient is high. Existing computations of its exact value differ, but it is admitted that it is greater than 0.1, while random graphs (Erdős-Rényi, see Section 3.1) with the same average degree have clustering coefficient p ≈ 0. Crawl diameter (maximum distance between two pages) is potentially infinite, because a dynamic page labeled by n in its URL may refer to a dynamic page labeled by n+1, but since Web crawlers usually perform BFS (see Section 4.1) the diameter of crawls may actually be small.

2.2 Degree distribution
Zipf laws (a.k.a. power laws) are probability laws such that log(Prob(X = d)) = C - λ·log(d), i.e. Prob(X = d) is proportional to d^(-λ). If the in-degree (respectively out-degree) distribution of a graph follows a Zipf law, Prob(X = d) is the probability for a vertex to have in- (resp. out-) degree d. In other words, the number of vertices with degree d is k·d^(-λ) (k depends on the number of vertices n). A graph class such that the degrees of almost all graphs follow a Zipf law is called scale-free, because some parameters like λ are scale invariant. Scale-free graphs have been extensively studied [4, 6, 17, 18]. Many graphs modeling social networks, interactions between objects (proteins, people, neurons...) or other networks seem to have the scale-free property. For Web crawls, a measure from Broder et al. [8] on a 200,000,000-page crawl shows that the in- and out-degrees follow Zipf laws. The exponents are λin = 2.1 for the in-degree and λout = 2.72 for the out-degree.

2.3 Strongly connected components and the BowTie structure
According to Broder, Kumar et al. [8], the Web has a BowTie structure: a quarter of the pages are in a Giant Strongly Connected Component (GSCC), a quarter are the "in" pages, leading to the GSCC but not linked from it, another quarter are the "out" pages, reachable from the GSCC but not linking to it, and the last quarter is not related to the GSCC. This famous assertion was reported even by Nature [23] but, over the last four years, an increasing number of people have come to suspect it is a crawling artifact. According to the same survey, the distribution of the size of strongly connected components follows a Zipf law with exponent roughly 2.5.

2.4 Cores
Another well-known property of crawls is the existence of cores. A core is a dense directed bipartite subgraph, consisting of many hub pages (or fans) pointing to many authorities. It is supposed [19, 16] that such cores are the central structure of cybercommunities, sets of pages about the same topics. The authorities are the most relevant pages; they do not necessarily point to one another (because of competition, for instance), but the hubs list most of them. Starting from this assumption, the HITS algorithm [15] ranks the pages containing a given keyword according to a hub factor and an authority factor. Kumar et al. [19, 18] enumerate over 200,000 bipartite cores in a 200,000,000-page crawl of the Web. Core sizes (counting hubs, authorities, or both) follow Zipf laws with exponents between 1.09 and 1.4.

2.5 Spectral properties and PageRank factor
Another ranking method, the most popular since it does not depend on given keywords, is Google's PageRank factor [21]. It is an accessibility measure of the page. Briefly, the PageRank of a page is the probability for a random surfer to be present on this page after a very long surf. It can be computed by basic linear algebra algorithms. The PageRank distribution also follows a Zipf law, with the same exponent as the in-degree distribution [22]. Pages with high PageRank are very visible, since they are effectively popular on the Web and are linked from other pages with high PageRank. A crawler therefore easily finds them [5, 20], while it may miss low-ranked pages. This is indeed a useful bias for search engine crawlers!
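The power-iteration computation alluded to above can be sketched in a few lines. The toy graph, the damping factor of 0.85 and the fixed iteration count below are illustrative assumptions of this sketch, not values taken from the paper:

    # Minimal PageRank sketch by power iteration on an adjacency list
    # {page: [pages it points to]}. Sinks redistribute their rank uniformly.
    def pagerank(graph, damping=0.85, iterations=50):
        pages = list(graph)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for p, targets in graph.items():
                if targets:
                    share = damping * rank[p] / len(targets)
                    for q in targets:          # distribute rank along out-links
                        new_rank[q] += share
                else:
                    for q in pages:            # sink: spread rank over all pages
                        new_rank[q] += damping * rank[p] / n
            rank = new_rank
        return rank

    example = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(example))                   # highest value on the well-linked page "c"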
3. RANDOM GRAPH MODELS

3.1 Basic models: Erdős-Rényi
For a long time the most used random graph model was the Erdős-Rényi model [13]. The random graph depends on two parameters: the number of vertices n and the probability p for two vertices to be linked. The existence of each edge is a random variable independent from the others. For suitable values (p = d/n), E.-R. graphs indeed have a characteristic path length of O(log n), but very small clustering (p = o(1)) and a degree distribution following a Poisson law, not a Zipf law. Therefore they do not accurately describe the crawls. Other models have since been proposed where attachment is not independent.

3.2 Incremental generation models
Most random Web graph models [4, 6, 17, 18, 7] propose an incremental construction of the graph. When the existence of a link is probed, it depends on the existing links. That process models the creation of the Web across time. In some models all the links going out of a page are inserted at once, and in other ones the insertion is incremental.

3.3 Preferential attachment models
The first evolving graph model (BA) was given by Barabási and Albert [4]. The main idea is that new nodes are more likely to join existing nodes with high degrees. This model is now referred to as an example of a preferential attachment model. They concluded that the model generates graphs whose in-degree distribution follows a Zipf law with exponent 3. Another preferential attachment model, called the Linearized Chord Diagram (LCD), was given in [6]. In this model a new vertex is created at each step and connects to existing vertices with a constant number of edges. A vertex is selected as the end-point of an edge with probability proportional to its in-degree, with an appropriate normalization factor. In-degrees follow a Zipf law with exponent roughly 2 when out-degrees are 7 (constant). In the ACL model [2], each vertex is associated with an in-weight (respectively out-weight) depending on its in-degree (respectively out-degree). A vertex is selected as the end-point of an edge with probability proportional to its weight. In these models edges are added but never deleted. The CL-del model [9] and the CFV model [10] incorporate in their design both the addition and deletion of nodes and edges.

3.4 Copy models
A model was proposed by [17] to explain other relevant properties of the Web, especially the great number of cores, since the ACL model generates graphs which on average contain few cores. The linear growth copying model from Kumar et al. [18] postulates that a Web page author copies an existing page when writing their own, including the hyperlinks. In this model, each new page has a master page from which it copies a given number of links. The master page is chosen proportionally to in-degree. Other links from the new page are then added following uniform or preferential attachment. The result is a graph with all properties of the previous models, plus the existence of many cores. These models often use many parameters needing fine tuning, and sociological assumptions on how Web pages are written. We propose a model based on a computer science assumption: the Web graphs we know are produced by crawlers. This allows us to design a simpler (it depends on only two parameters obtained from experiments) and very accurate model of Web crawls.
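To make the preferential attachment idea concrete (this is the Barabási-Albert flavor of the idea, not the crawl model proposed in this paper), here is a minimal sketch where each new vertex attaches m edges to existing vertices chosen proportionally to their current degree; n, m and the seeding below are arbitrary illustrative choices:

    import random

    # Minimal preferential-attachment sketch: a vertex's number of "tickets"
    # tracks its degree, so drawing a ticket picks a vertex degree-proportionally.
    def preferential_attachment(n, m=3, seed=0):
        rng = random.Random(seed)
        edges = []
        tickets = list(range(m))       # one ticket per seed vertex to bootstrap
        for new in range(m, n):
            chosen = set()
            while len(chosen) < m:     # m distinct targets for the new vertex
                chosen.add(rng.choice(tickets))
            for t in chosen:
                edges.append((new, t))
                tickets += [new, t]    # both endpoints gain one ticket
        return edges

    print(len(preferential_attachment(1000)))   # m*(n-m) = 2991 edges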
4. A WEB CRAWL MODEL
In this section, we present the crawling strategies and derive our Web crawl model from them. It aims to mimic the crawling process itself, rather than the page writing process as Web graph models do.

4.1 Web crawl strategies
Let us consider a theoretical crawler. We suppose the crawler visits each page only once. The benefit is to avoid modeling the disappearance of pages or links across time, because the law it follows is still debatable (is the pages' lifetime related to their popularity, or to their degree properties?). When scanning a page, the crawler gets at once the set of its outgoing links. At any time the (potentially infinite) set of valid URLs is divided into:
1. Crawled: the corresponding pages were visited and their outgoing links are known
2. Unvisited: a link to this URL has been found but not probed yet
3. Erroneous: the URL was probed but points to a non-existing or non-HTML file (some search engines index them, but they do not contain URLs and are not interesting for our purposes)
4. Unknown: the URL was never encountered
The crawling algorithm basically chooses and removes from its Unvisited set a URL to crawl, and then adds the outgoing unprobed links of the page, if any, to the Unvisited set. The crawling strategy is the way the Unvisited set is managed. It may be:
DFS (depth-first search): the strategy is LIFO and the data structure is a stack
BFS (breadth-first search): the strategy is FIFO and the data structure is a queue
DEG (higher degree): the most pointed-to URL is chosen; the data structure is a priority queue (a heap)
RND (random): a uniform random URL is chosen
We suppose the crawled pages are ordered by their discovery date. For discussing structural properties, only the crawled pages are to be considered. Notice that the first three strategies can only be correctly implemented on a single computer. The most powerful crawlers are distributed over many computers and their strategy is hard to define; it is usually something between BFS and Random.

4.2 Model description
Our model shall mimic a crawler strategy. It works in two steps: first constructing the set of pages, then adding the hyperlinks.
Constructing the set of pages. Each page p has two fields: its in-degree din(p) and its out-degree dout(p). In the first step of the crawl construction process, we set a value for each of them. The in-degree and out-degree are set according to two independent Zipf laws. The exponent of each law is a parameter of the model, therefore our model depends on two parameters, λin (for the in-degree) and λout (for the out-degree). These values are well known for real crawls: following [8], we take λin = 2.1 and λout = 2.72. We shall have to choose pages at random according to their in-degree. To solve this problem, n pages (the maximal size of the crawl) are generated and their in- and out-degrees are set. Then a multiset L is created, in which each page p is duplicated din(p) times. The size of this multiset is the maximal number of hyperlinks. Each time we need to choose a page at random according to the in-degree law, we just remove one element from L.
Constructing the hyperlinks. Now the page degrees are pre-set, but the graph topology is not yet defined. An algorithm, simulating a crawl, adds the links. There are in fact four algorithms, depending on which crawling strategy is simulated. The generic algorithm is simply:
1. Remove a page p from the UnvisitedSet and mark p as crawled
2. Remove the first dout(p) pages from L
3. Set these pages as the pages pointed to by p
4. Add the unvisited ones to the UnvisitedSet
5. Go to 1
The UnvisitedSet is seeded with one or more pages. The way it is managed depends on which crawling strategy is simulated, i.e. which algorithm is chosen: for the DFS algorithm the UnvisitedSet is a stack (LIFO), for the BFS algorithm it is a queue (FIFO), for the DEG algorithm it is a priority queue (heap), and for the RND algorithm a random page is extracted from the UnvisitedSet. Because the average out-degree of a page is large enough, the crawling process will not stop until almost all pages have been crawled. The progress of the crawl (expressed in percent) is the fraction of crawled pages over n. As it approaches n, some weird things occur since no more unknown pages are allowed. In our experiments (see the next section) we sometimes go up to 100% progress, but results are more realistic before 30%, when the crawl can still expand toward unknown pages. Our model differs radically from preferential attachment or copy models because the neighborhood of a page is not set at writing time but at crawling time. So a page is allowed to point to known or unknown pages as well.
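A minimal sketch of this two-step construction, simulating only the BFS strategy; the zipf_degree helper, its degree cutoff, the page count n and the single-page seed are illustrative assumptions of this sketch (the experiments in the paper use far larger crawls):

    import random
    from collections import deque

    def zipf_degree(rng, exponent, dmax=1000):
        # Draw a degree in 1..dmax with Prob(d) proportional to d**(-exponent).
        weights = [d ** (-exponent) for d in range(1, dmax + 1)]
        return rng.choices(range(1, dmax + 1), weights=weights)[0]

    def random_crawl(n=10000, lam_in=2.1, lam_out=2.72, seed=0):
        rng = random.Random(seed)
        d_in = [zipf_degree(rng, lam_in) for _ in range(n)]
        d_out = [zipf_degree(rng, lam_out) for _ in range(n)]
        L = [p for p in range(n) for _ in range(d_in[p])]  # p duplicated d_in(p) times
        rng.shuffle(L)
        crawled, links = set(), []
        unvisited = deque([0])                 # UnvisitedSet seeded with one page
        seen = {0}
        while unvisited and L:
            p = unvisited.popleft()            # BFS: the UnvisitedSet is a FIFO queue
            crawled.add(p)
            for _ in range(min(d_out[p], len(L))):
                q = L.pop()                    # target drawn according to the in-degree law
                links.append((p, q))
                if q not in seen:              # add unvisited targets to the UnvisitedSet
                    unvisited.append(q)
                    seen.add(q)
        return crawled, links

    crawled, links = random_crawl()
    print(len(crawled), len(links))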
5. RESULTS
We present here simulation results using the different strategies and show how the measurements evolve across time. Thanks to the scale-free effect, the actual number of pages does not matter, as long as it is big enough. We have used several graphs of different sizes but with the same exponents λin = 2.1 and λout = 2.72 (experimental values from [8]). Unless otherwise specified, we present results from BFS, the most used crawling strategy, and simulations of up to 20,000,000 crawled pages.

5.1 Degree distribution
At any step of the crawl, the actual degree distribution follows a Zipf law with the given parameters (2.1 and 2.72) with very small deviation (see Figures 1 and 2). This result is independent from the crawl strategy (BFS, etc.). It demonstrates that our generated crawls really are scale-free graphs.
Figure 1: Out-degree distribution at three steps of a BFS (number of vertices crawled vs. out-degree, log scales).
Figure 2: In-degree distribution at three steps of a BFS (number of vertices crawled vs. in-degree, log scales).

5.2 Small-World properties
The distribution of path length (Figure 3) clearly follows a Gaussian law for the BFS, DEG and RAND strategies. This distribution is plotted at progress 30%, but it does not change much across time, as shown in Figure 5. DFS produces far greater distances between vertices, and the distribution follows an unknown law (Figure 4). The DFS crawl diameter is about 10% of the number of vertices! This is because DFS crawls are like long tight trees. It is why DFS is not used by real crawlers, and this paper focuses on the three other crawl strategies. The clustering coefficient (Figure 6, computed on a 500,000-page simulation) is high and does not decrease much as the crawl gets bigger. Our crawls definitely are small-world graphs.
Figure 3: Distribution of path length for BFS, DEG and RAND.
Figure 4: Distribution of path length for DFS (log/log scale).
Figure 5: Evolution of diameter and average path length across time for BFS, DEG and RAND.
Figure 6: Evolution of the clustering coefficient across time.

5.3 Bow-tie structure?
The relative sizes of the four bow-tie components (SCC, IN, OUT and OTHER) are roughly the same for the BFS, DEG and even RAND (but not DFS) strategies (Figure 7). When using only one seed, the size of the largest SCC converges toward two thirds of the size of the graph. These proportions thus differ from the crawl observations of [8], since the "in" and "others" parts are smaller. But with many seeds (which may be seen as many pages submitted to the crawler portal), the size of the "in" component is larger and can reach up to one quarter of the pages. Our model thus replicates very well the bow-tie topology of genuine crawls.
Figure 7: Evolution of the size of the largest SCC (left) and of the OUT component (right) across time. One seed, up to 500,000 pages in the crawl.

5.4 Cores
We used Agrawal's practical algorithm [1] for core enumeration (notice that the maximal core problem is NP-complete). Figure 8 gives the number of cores of a given minimal size for a crawl of up to 25,000 vertices. As shown, the number of cores is very dependent on the exponents of the Zipf laws, since higher exponents mean sparser graphs. It means that our simulated crawls contain many cores, as real crawls do. Figure 9 shows that the number of (4,4)-cores (at least four hubs and four authorities) is proportional to n and after a while stays between n/100 and n/50.
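Core enumeration in the paper relies on Agrawal and Srikant's algorithm [1]; the brute-force sketch below only counts (2,2)-cores (two hub pages sharing at least two authorities) on a small link list, as a toy illustration of what is being counted, and would be far too slow on a real crawl:

    from itertools import combinations
    from collections import defaultdict

    # Count (2,2)-cores: complete bipartite K(2,2) subgraphs, i.e. two hubs
    # that both point to the same two authorities. Brute force over hub pairs.
    def count_22_cores(links):
        out = defaultdict(set)
        for hub, auth in links:
            out[hub].add(auth)
        total = 0
        for h1, h2 in combinations(out, 2):
            shared = len(out[h1] & out[h2])      # authorities pointed to by both hubs
            total += shared * (shared - 1) // 2  # ways to choose 2 shared authorities
        return total

    links = [("h1", "a"), ("h1", "b"), ("h2", "a"), ("h2", "b"), ("h2", "c"), ("h3", "c")]
    print(count_22_cores(links))                 # 1: hubs h1 and h2 share authorities a and b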
5.5 PageRank distribution and "quality pages" crawling speed
Figure 10 shows the PageRank distribution (PageRank is normalized to 1 and the logarithms are therefore negative). We found results similar to the observations of Pandurangan et al. [22]: the distribution is a Zipf law with exponent 2.1. The crawl quickly converges to this value. Figure 11 shows the sum of the PageRank of the crawled pages across time (the PageRank is computed at the end of the crawl, so it must vary from 0 at the beginning to 1 when the crawl stops). In very few steps, the BFS and DEG strategies find the very small set of pages that contains most of the total PageRank. This property of real BFS crawlers has been known since Najork and Wiener [20]. Our results can be compared to the crawling strategy experiments of Boldi et al. [5].

5.6 Discovery speed and Frontier
Figure 12 shows another dynamical property: the discovery rate. It is the probability for the extremity of a link to be already crawled. It converges toward 40% for all strategies. This is an interesting scale-free property: after a while, the probability for a URL to point to a new page remains very high, about 60%. This "expander" property is very useful for true crawlers. This simulation shows it does not depend only on the dynamical nature of the Web, but also on the crawling process itself.
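The discovery rate can be measured directly on the output of a simulated crawl. The helper below is an illustrative assumption of this sketch (in particular the convention that a page counts as crawled once its out-links have been emitted), not code from the paper:

    # Toy measurement of the discovery rate: the fraction of link targets that
    # were already crawled at the moment the link was emitted. Links are assumed
    # to be listed in the order the crawler produced them.
    def discovery_rate(links):
        crawled, already = set(), 0
        for source, target in links:
            crawled.add(source)          # source is crawled when its links appear
            if target in crawled:
                already += 1
        return already / len(links) if links else 0.0

    links = [(0, 1), (0, 2), (1, 0), (1, 3), (2, 1), (3, 2)]
    print(discovery_rate(links))         # 0.5: half of the targets were already crawled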
Figure 8: Number of small cores (over 1,000 pages).
hubs  auth  cores      hubs  auth  cores
   2     2    220         3     6      5
   2     3     83         3     7      2
   2     4     37         4     4     40
   2     5     14         4     5     14
   2     6     14         4     6      5
   2     7     14         4     7      2
   3     3     84         5     5     12
   3     4     37         5     6      3
   3     5     14         6     6      3
Figure 9: (4,4)-cores distribution (fraction of (4,4)-cores vs. number of vertices crawled, BFS).

Figure 13 focuses on a well-known topological property of crawls, which our simulations also reproduce: the very high number of sinks regardless of crawl size. Notice that their existence is a problem for practical PageRank computation [21]. In other words, the large "out" component of the bow-tie is very broad and short. Eiron et al. survey the ranking of the Web frontier [12].
Figure 10: PageRank distribution.
Figure 11: PageRank capture.
Figure 12: Evolution of the discovery rate.
Figure 13: Evolution of the proportion of sinks (pages with no crawled successors) among crawled pages.

6. CONCLUSION
As said in Section 2, a good crawl model should output graphs with the following properties:
1. highly clustered
2. with a short characteristic path length
3. in- and out-degree distributions following Zipf laws
4. with many sinks
5. such that high PageRank vertices (as computed in the final graph) are crawled early
6. with a bow-tie structure
As shown in Section 5, our model meets all these objectives. Property 3 of course is ensured by the model, but the other ones are results of the generating process. The basic assumption on the degree distribution, together with the crawling strategy, is enough to mimic the properties observed in large real crawls. This is conceptually simpler than other models that also have the same properties, like the Copy model [17]. The BowTie structure we observe differs from [8] since the largest strongly connected component is larger. But together with the other topological properties measured, it proves that we reproduce quite well the topology of real crawls with our very simple model. This is nice, because we have fewer assumptions than [6] or [18]. Our approach is different from the Web graph models, which mimic the page writing strategy instead of the page crawling, but it gives similar results. It points out that we need more numerical or other measures on graphs in order to analyze their structure. BFS, RAND and DEG strategies are the most used in simple crawlers. We show that they produce very similar results for topological aspects. For dynamical aspects (PageRank capture for instance) BFS and DEG seem better, but they are harder to implement in a real crawler. DFS is definitely bad, and for this reason is not used by crawlers. Parallel crawlers use, however, more sophisticated strategies that were not modeled here. So our random Web crawl model can be compared with the existing random Web graph models [4, 6, 17, 18, 7, 2, 9, 10]. But unlike them, it is not based on sociological assumptions about how the pages are written, but on an assumption on the law followed by the page degrees and, for the structural properties, on only one assumption: that the graph is output by a crawler.
The design is then quite different from the design of the random Web graph models, but the results are the same. We can interpret this conclusion in a pessimistic way: it is hard to tell what the biases of crawling are. Indeed we have not supposed that the Web graph has any specific property other than degrees following a Zipf law, and yet our random crawls have all the properties of real crawls. This means that one can crawl anything following a Zipf law, not only the Web, and output crawls with the specific properties of Web crawls. So comparing the output of a Web graph model with real crawls may not be enough to assert that the model captures properties of the Web.

7. REFERENCES
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB '94, pages 487-499. Morgan Kaufmann Publishers Inc., 1994.
[2] W. Aiello, F. Chung, and L. Lu. A random graph model for massive graphs. In Proceedings of the thirty-second annual ACM Symposium on Theory of Computing, pages 171-180, 2000.
[3] R. Albert, H. Jeong, and A.-L. Barabási. Diameter of the world-wide web. Nature, 401:130-131, September 1999.
[4] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999.
[5] P. Boldi, M. Santini, and S. Vigna. Do your worst to make the best: Paradoxical effects in PageRank incremental computations. In Workshop on Algorithms and Models for the Web-Graph, 2004.
[6] B. Bollobás, O. Riordan, J. Spencer, and G. Tusnády. The degree sequence of a scale-free random graph process. Random Structures and Algorithms, 18(3):279-290, May 2001.
[7] A. Bonato. A survey of models of the web graph. In Proceedings of Combinatorial and Algorithmic Aspects of Networking, August 2004.
[8] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener, A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas. Graph structure of the web. In Proceedings of the ninth international conference on World Wide Web, pages 309-320. Foretec Seminars, Inc., 2000.
[9] F. Chung and L. Lu. Coupling on-line and off-line analyses for random power graphs. Internet Mathematics, 1(4):409-461, 2003.
[10] C. Cooper, A. Frieze, and J. Vera. Random deletion in a scale free random graph process. Internet Mathematics, 1(4):463-483, 2003.
[11] K. Efe, V. Raghavan, C. H. Chu, A. L. Broadwater, L. Bolelli, and S. Ertekin. The shape of the Web and its implications for searching the Web, 2000.
[12] N. Eiron, K. S. McCurley, and J. A. Tomlin. Ranking the web frontier. In WWW Conference, 2004.
[13] P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae, volume 6, 1959.
[14] H. Ino, M. Kudo, and A. Nakamura. Partitioning of web graphs by community topology. In WWW '05: Proceedings of the 14th international conference on World Wide Web, pages 661-669, 2005.
[15] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604-632, 1999.
[16] J. M. Kleinberg. Hubs, authorities, and communities. ACM Computing Surveys (CSUR), 31(4es):5, 1999.
[17] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, page 57. IEEE Computer Society, 2000.
[18] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large-scale knowledge bases from the web. In VLDB, pages 639-650, 1999.
[19] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. In WWW Conference, 1999.
[20] M. Najork and J. L. Wiener. Breadth-first crawling yields high-quality pages. In Proceedings of the 10th International WWW Conference, Hong Kong, 2001.
[21] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical report, Computer Science Department, Stanford University, 1998.
[22] G. Pandurangan, P. Raghavan, and E. Upfal. Using PageRank to characterize web structure. In 8th Annual International Conference on Computing and Combinatorics, pages 330-339, 2002.
[23] Nature, 405:113, 11 May 2000.
[24] D. J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 393:440-442, 1998.