Cloaking and Redirection: A Preliminary Study

Baoning Wu and Brian D. Davison
Computer Science & Engineering
Lehigh University
{baw4,davison}@cse.lehigh.edu

Copyright is held by the author/owner(s). AIRWeb'05, May 10, 2005, Chiba, Japan.

Abstract

Cloaking and redirection are two possible search engine spamming techniques. In order to understand cloaking and redirection on the Web, we downloaded two sets of Web pages while mimicking a popular Web crawler and a common Web browser. We estimate that 3% of the first data set and 9% of the second data set utilize cloaking of some kind. By manually checking a sample of the cloaking pages from the second data set, we found that nearly one third of them appear to aim to manipulate search engine ranking. We also examined redirection methods present in the first data set. We propose a method of detecting cloaking pages by calculating the difference of three copies of the same page. We examine the different types of cloaking that are found and the distribution of different types of redirection.

1 Introduction

Cloaking is the practice of sending different content to a search engine than to regular visitors of a web site. Redirection is used to send users automatically to another URL after loading the current URL. Both of these techniques can be used in search engine spamming [13, 7]. Henzinger et al. [8] have pointed out that search engine spam is one of the major challenges of web search engines and that cloaking is among the spamming techniques used today. Since search engine results can be severely affected by spam, search engines typically have policies against cloaking and some kinds of dedicated redirection [5, 16, 1].

Google [5] describes cloaking as the situation in which "the webserver is programmed to return different content to Google than it returns to regular users, usually in an attempt to distort search engine rankings." An obvious way to detect cloaking is, for each page, to calculate whether there is a difference between a copy from a search engine's perspective and a copy from a web browser's perspective. But in reality, this is non-trivial. Unfortunately, it is not enough to know that corresponding copies of a page differ; we still cannot tell whether the page is a cloaking page. The reason is that web pages may be updated frequently, such as in a news website or a blog website, or simply that the web site puts a time stamp on every page it serves. Even if two crawlers were synchronized to visit the same web page at nearly the same moment, some dynamically generated pages may still have different content, such as a banner advertisement that is rotated on each access.

Besides the difficulty of identifying cloaking, it is also hard to tell whether a particular instance of cloaking is considered acceptable or not. We define the cloaking behavior that has the effect of manipulating search engine ranking results as semantic cloaking. Unfortunately, the various search engines may have different criteria for defining unacceptable cloaking. As a result, we have focused on the simpler, more basic task: when we mention cloaking in this paper, we usually refer to the simpler case of whether different content is served to automated crawlers versus web browsers, but not different content to every visitor. We call this syntactic cloaking. So, for example, we will not consider dynamic advertisements to be cloaking.

In order to investigate this issue, we collected two data sets: one is a large data set containing 250,000 pages and the other is a smaller data set containing 47,170 pages. Details of these two data sets are given in Section 3. We manually examined a number of samples of those pages and found several different kinds of cloaking techniques. From this study we make an initial proposition toward building an automated cloaking detection system. Our hope is that these results may be of use to researchers in designing better and more thorough solutions to the cloaking problem.

Since redirection can also be used as a spamming technique, we also calculated some redirection statistics from the data we crawled for the cloaking study. Four types of redirection are studied.

Few publications address the issue of cloaking on the Web. As a result, the main contribution of this paper is to begin a discussion of the problem of cloaking and its prevalence in the web today. We provide a view of actual cloaking and redirection techniques. We additionally propose a method for detecting cloaking by using three copies of the same page.

We next review those few papers that mention cloaking. The data sets we use for this study are introduced in Section 3. The results for cloaking and redirection are shown in Sections 4 and 5 respectively. We conclude this paper with a summary and discussion in Section 6.

2 Related Work

Henzinger et al. [8] mentioned that search engine spam is quite prevalent and that search engine results would suffer greatly if no measures were taken. They also mentioned that cloaking is one of the major search engine spam techniques.

Gyongyi and Garcia-Molina [7] describe cloaking and redirection as spam hiding techniques. They showed that web sites can identify search engine crawlers by their network IP address or user-agent names. They also described the use of refresh meta tags and JavaScript to perform redirection. They additionally mention that some cloaking (such as sending the search engine a version free of navigational links and advertisements but with no change to the content) is accepted by search engines.

Perkins [13] argues that agent-based cloaking is spam. No matter what kind of content is sent to the search engine, the goal is to manipulate search engine rankings, which is an obvious characteristic of search engine spam.

Cafarella and Cutting [4] mention cloaking as one of the spamming techniques. They said that search engines will fight cloaking by penalizing sites that give substantially different content to different browsers.

None of the above papers discuss how to detect cloaking, which is one aspect of the present work. In one cloaking forum [14], many examples of cloaking and methods of detecting cloaking are proposed and discussed. Unfortunately, these discussions can generally be taken as speculation only, as they lack strong evidence or conclusive experiments.

Najork filed for a patent [12] on a method for detecting cloaked pages. He proposed detecting cloaked pages from users' browsers by installing a toolbar and letting the toolbar send the signature of user-perceived pages to search engines. His method does not distinguish rapidly changing or dynamically generated Web pages from real cloaking pages, which is a major concern for our algorithms.

3 Data set

Two data sets were examined for our cloaking and redirection testing. For convenience, we name the first the HITS data and the second the HOT data.

3.1 First data set: HITS data

In related work to recognize spam in the form of link farms [15], we collected Web pages in the neighborhood of the top 100 results for 412 queries by following the HITS data collection process [9]. That is, for each query presented to a popular search engine, we collected the top 200 result references, and for each URL we also retrieved the outgoing link set and up to 100 incoming link pages. The resulting data set contains 2.1M unique Web pages.

From these 2.1M URLs, we randomly selected 250,000 URLs. In order to test for cloaking, we crawled these pages simultaneously from a university IP address (Lehigh) and from a commercial IP address (Verizon DSL). We set the user-agent from the university address to be Mozilla/4.0 (compatible; MSIE 5.5; Windows 98) and the one from the commercial IP to be Googlebot/2.1 (+http://www.googlebot.com/bot.html). From each location we crawled our data set twice with a time interval of one day. So, for each page, we finally have four copies, two of which are from a web browser's perspective and two from a crawler's perspective. For convenience, we name these four copies B1, B2, C1 and C2 respectively. For each page, the time order of retrieval of these four copies is always C1, B1, C2 and B2.
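The collection step in Section 3.1 can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the crawler used in the study: the names build_request, http_fetch and fetch_copies and the injectable fetch argument are hypothetical, and only the two user-agent strings are taken from the text. It requests the same URL under each identity, crawler first, using Python's standard urllib module.

```python
import urllib.request

# User-agent strings quoted in Section 3.1.
BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
CRAWLER_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"


def build_request(url, user_agent):
    """Build a GET request that identifies itself with the given user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def http_fetch(request, timeout=30):
    """Default fetcher: perform the request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_copies(url, fetch=http_fetch):
    """Fetch one crawler copy and one browser copy of `url`, crawler first.

    `fetch` is injectable (a hypothetical convenience, not part of the study)
    so the function can be exercised without network access.
    """
    crawler_copy = fetch(build_request(url, CRAWLER_UA))
    browser_copy = fetch(build_request(url, BROWSER_UA))
    return {"crawler": crawler_copy, "browser": browser_copy}
```

Calling fetch_copies for a URL on two consecutive days would yield four copies in the order C1, B1, C2, B2. The sketch does not reproduce the use of two different IP addresses for the two identities.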
3.2 Second data set: HOT data

We also want to know the cloaking ratio within the top response lists for hot queries. The first step is to collect hot queries from popular search engines. To do this, we collected 10 popular queries of Jan 2005 from Google Zeitgeist [6], the top 100 search terms of 2004 from Lycos [10], the top 10 searches for the week ending Mar 11, 2005 from Ask Jeeves [3], and 10 hot searches in each of 16 categories ending Mar 11, 2005 from AOL [2]. This resulted in 257 unique queries from these web sites.

The second step is to collect the top response list for these hot queries. For each of these 257 queries, we retrieved the top 200 responses from the Google search engine. The number of unique URLs is 47,170. As with the first data set, we downloaded four copies of each of these 47,170 URLs, two from a browser's perspective and two from a crawler's perspective. However, all these copies were downloaded from machines with a university IP address. For convenience, we name these four copies HC1, HB1, HC2 and HB2 respectively. This order also matches the time order of downloading them.

4 Results of Cloaking

In this section, we show the results for the cloaking test.

4.1 Detecting Cloaking in HITS data

Intuitively, the goal of cloaking is to give different content to a search engine than to normal web browsers. This can be different text or links. We use two techniques to compare versions retrieved by a crawler and a browser: we consider the number of differences in the terms and in the links used over time to detect cloaking.

As we mentioned in Section 1, calculating the difference between pages from the browser's and crawler's viewpoints is not strong enough to tell whether the page does cloaking. Our proposed method uses three copies of a page, C1, C2 and B1, to decide if it is a cloaking page. For each URL, we first calculate the difference between C1 and C2 (for convenience, we use NCC to represent this number). Then we calculate the difference between B1 and C1 (for convenience, we use NBC to represent this number). Finally, if NBC is greater than NCC, we mark the page as a cloaking candidate. The intuition is that the page may change frequently, but if the difference between the browser's copy and the crawler's copy is bigger than the difference between two crawler copies, the evidence may be enough that the page is cloaking.

We used two methods to calculate the difference between pages: the difference in terms used, and the difference in links provided. We describe each below, along with the results obtained.

4.1.1 Term Difference

The first method for detecting cloaking is to use the term difference among different copies. Instead of using all the terms in the HTML files, we used the "bag of words" method for analyzing the web pages, i.e., we parse the HTML file into terms and only count each unique term once no matter how many times this term appears. Thus, each page is represented by a set of words after parsing.

For each page, we first calculated the number of different terms between the copies C1 and C2 (designated NCC, as described above). We then calculated the number of different terms between the copies C1 and B1 (designated NBC). We then selected pages that have a bigger NBC than NCC as candidates for cloaking. For this data set, we marked 23,475 candidates out of the original 250K data set. The distribution of the difference for these 23,475 pages forms a power-law-like distribution, shown in Figure 1.

Figure 1: Distribution of the difference of NCC and NBC. (Log-log plot; x-axis: difference between NCC and NBC; y-axis: number of URLs.)

To check what threshold for this difference between NCC and NBC is a good indication of real cloaking, we first put the 23,475 URLs into ten different buckets based on the difference value. The range for each bucket and the number of pages within each bucket are shown in Table 1.

Bucket ID   Range               No. of pages
1           x <= 5              8084
2           5 < x <= 10         2287
3           10 < x <= 20        1938
4           20 < x <= 40        2065
5           40 < x <= 80        2908
6           80 < x <= 160       1731
7           160 < x <= 320      1496
8           320 < x <= 640      912
9           640 < x <= 1280     1297
10          1280 < x            757

Table 1: Buckets of term difference

Then, from each bucket we randomly selected thirty pages and checked them manually to see how many of these thirty pages are real syntactic cloaking pages. The result is shown in Figure 2.

Figure 2: The ratio of syntactic cloaking in each bucket based on term difference.

The trend is obvious in Figure 2: the greater the difference, the higher the proportion of cloaking contained in the bucket. In order to know which threshold is optimal, we calculated the precision, recall and F-measure based on the range of these buckets. For these three measures, we follow the definitions in [11] and select α to be 0.5 in the F-measure formula to give equal weight to recall and precision. Precision is the proportion of selected items that the system got right; recall is the proportion of the target items that the system selected; F-measure is the measure that combines precision and recall. The results for these three measures are shown in Table 2.

Threshold   Precision   Recall   F value
1           0.355       1.000    0.502
5           0.423       0.828    0.560
10          0.480       0.799    0.560
20          0.534       0.758    0.627
40          0.580       0.671    0.622
80          0.633       0.498    0.588
160         0.685       0.388    0.496
320         0.695       0.262    0.380
640         0.752       0.196    0.311
1280        0.899       0.086    0.157

Table 2: F-measure for different thresholds based on term difference.

If we choose F-measure as the criterion, buckets 4 and 5 have the highest value. Since the range of buckets 4 and 5 is around 40 in Table 1, we can set the threshold to be 40 and declare that all pages with a difference above 40 are categorized as cloaking pages. In that case, the precision and recall are 0.580 and 0.671 respectively.

From Figure 2, we can estimate what percentage of our 250,000 page set are cloaking pages. Since we know the total number of pages within each bucket and the number of cloaking pages within the 30 manually checked pages from each bucket, the estimated total number of cloaking pages is the product of the number of pages within each bucket and the ratio of cloaking pages within the 30 pages, summed over all buckets. The result is 7,780, so we expect that we can identify nearly 8,000 cloaking pages (about 3%) within the 250,000 pages.

4.1.2 Link Difference

Similar to term difference, we also analyzed this data set on the basis of link differences. Here link difference means the number of different links between two corresponding pages. First we calculated the link difference between the copies C1 and C2 (termed LCC). We then calculated the link difference between the copies C1 and B1 (termed LBC). Finally we marked the pages that have a higher LBC than LCC as cloaking candidates. In this way, we marked 8,205 candidates. The frequency of these candidates also approximates a power-law distribution, as with term cloaking. It is shown in Figure 3.

Figure 3: Distribution of the difference of LCC and LBC. (Log-log plot; x-axis: difference between LCC and LBC; y-axis: number of URLs.)

As with term difference, we also put these 8,205 candidates into 10 buckets. The range and number of pages within each bucket are shown in Table 3.

Bucket ID   Range               No. of pages
1           x <= 5              4415
2           5 < x <= 10         787
3           10 < x <= 20        746
4           20 < x <= 35        783
5           35 < x <= 55        441
6           55 < x <= 80        299
7           80 < x <= 110       279
8           110 < x <= 145      182
9           145 < x <= 185      100
10          185 < x             173

Table 3: Buckets of link difference

From each bucket, we randomly selected 30 pages and checked manually to see how many of them are real cloaking pages. The result is shown in Figure 4. It is obvious that most of the pages from bucket 4 or above are cloaking pages. We also calculated the F values for the thresholds corresponding to the range of each bucket. The result is shown in Table 4. We can tell that 5 is an optimal threshold with the best F value.

Figure 4: The ratio of syntactic cloaking in each bucket based on link difference.

Threshold   Precision   Recall   F value
1           0.479       1.000    0.648
5           0.727       0.700    0.713
10          0.822       0.627    0.711
20          0.906       0.520    0.660
35          0.910       0.340    0.496
55          0.900       0.236    0.374
80          0.900       0.167    0.283
110         0.900       0.104    0.186
145         0.878       0.060    0.114
185         0.866       0.038    0.072

Table 4: F-measure for different thresholds based on link difference.

Since the number of pages having a link difference is in reality smaller than the number having a term difference, fewer cloaking pages can be found by using link difference alone, but the ones found are more accurate.

4.2 Detecting Cloaking in HOT data

Based on the experience of manually checking for cloaking pages in the first data set, we attempted to detect syntactic cloaking automatically by using all four copies of each page.

4.2.1 Algorithm of detecting cloaking automatically

Our assumption about syntactic cloaking is that the web site will send something consistent to the crawler but send something different yet still consistent to the browser. So, if there exist terms that appear in both of the copies sent to the crawler but never appear in any of the copies sent to the browser, or vice versa, it is quite possible that the page is doing syntactic cloaking. When getting the terms out of each copy, we still use the "bag of words" approach, i.e., we replace all the non-word characters within an HTML file with blanks and then get all the words out of it for the intersection operation.

To easily describe our algorithm, the intersection of the four copies is shown as a Venn diagram in Figure 5. We use capital letters from A to M to represent each intersection component of the four copies. For example, area L contains content that only appears in HC1, but never appears in HC2, HB1 and HB2; area F is the intersection of the four copies, i.e., the content that appears in all four copies.

Figure 5: Intersection of the four copies for a Web page.

The most interesting components to us are areas A and G. Area A represents terms that appear in both browser copies but never appear in any of the crawler copies, while area G represents terms that appear in both crawler copies but never appear in any of the browser copies. So our algorithm for detecting syntactic cloaking automatically is as follows: for each web page, we calculate the number of terms in area A and the number of terms in area G. If the sum of these two numbers is nonzero, we may mark this page as a cloaking page.

There are false positives for this algorithm. As a simple example, suppose there is a dynamic picture on the page, and every time the web server randomly selects one of 4 JPEG files (a1.jpg to a4.jpg) to serve the request. It happens that a1.jpg is sent every time our crawler visits this page, but a2.jpg and a3.jpg are sent when our browser visits this page. By our algorithm, the page will be marked as cloaking, but it can easily be verified that this is not the case. So, again we need a threshold for the algorithm to work more accurately.

For the 47,170 URLs, we found 6466 pages for which the sum of the number of terms in areas A and G is greater than 0. Again, we put them into 10 buckets, as shown in Table 5. The third column is the number of pages within the bucket. From each bucket, we randomly selected 10 pages and manually checked whether each page is real syntactic cloaking. The accuracy is shown in the fourth column of Table 5. We also calculated the F-measure; the results are shown in Table 6.

Bucket   Range               No.    Accuracy
1        x <= 1              725    40%
2        1 < x <= 2          540    30%
3        2 < x <= 4          495    30%
4        4 < x <= 8          623    40%
5        8 < x <= 16         650    90%
6        16 < x <= 32        822    100%
7        32 < x <= 64        600    100%
8        64 < x <= 128       741    100%
9        128 < x <= 256      420    100%
10       256 < x             1120   100%

Table 5: Buckets of unique terms in area A and G

Threshold   Precision   Recall   F value
0           0.647       1.000    0.785
1           0.703       0.965    0.813
2           0.766       0.952    0.849
4           0.836       0.940    0.885
8           0.902       0.881    0.891
16          0.922       0.756    0.831
32          0.960       0.599    0.738
64          0.979       0.470    0.635
128         0.972       0.358    0.523
256         1.000       0.267    0.422

Table 6: F-measure of different thresholds

Since the 4th and 5th buckets have the highest F value in Table 6, we choose the threshold to be the boundary between bucket 4 and bucket 5, i.e., 8. So, our automated cloaking algorithm is revised to only mark pages with the sum of areas A and G greater than 8 as cloaking pages. For our second data set, all pages in bucket 5 to bucket 10 are therefore marked as cloaking pages. Finally, we marked 4,083 pages out of the 47,170 pages, i.e., about 9% of pages from the hot query data set are syntactic cloaking pages.

4.2.2 Distribution of syntactic cloaking within top rankings

Since we have identified 4,083 pages that utilize cloaking, we can now draw the distribution of these cloaking pages within different top rankings. Figure 6 shows the cumulative percentage of cloaking pages within the top 200 response lists returned by Google. As we can see, about 2% of the top 50, about 4% of the top 100 URLs and more than 8% of the top 200 URLs utilize cloaking. The ratio is quite high, and the cloaking may be helping these pages to be ranked high.

Figure 6: Percentage of syntactic cloaking pages within Google's top responses. (x-axis: top results, 0 to 200; y-axis: cumulative percentage of cloaking pages, 0 to 9.)

Since we retrieved the top 10 hot queries from each of 16 categories from AOL, we can consider the topic of the cloaking pages. Intuitively, some popular categories, such as sports or computers, may contain more cloaking pages in the top ranking list. So we also calculated the fraction of cloaking pages within each category. The results are shown in Figure 7. Some categories, such as Shopping and Sports, are more likely to have cloaked results than other categories.

Figure 7: Category-specific cloaking. (Categories: A. Autos, B. Companies, C. Computing, D. Entertainment, E. Games, F. Health, G. House, H. Holidays, I. Local, J. Movies, K. Music, L. Research, M. Shopping, N. Sports, O. TV, P. Travel.)

4.2.3 Syntactic vs Semantic cloaking

Not all syntactic cloaking is considered unacceptable by search engines. For example, a page sent to the crawler that doesn't contain advertising content, or that omits a PHP session identifier used to distinguish different real users, is not a problem for search engines. In contrast to acceptable cloaking, we define semantic cloaking as cloaking behavior with the effect of manipulating search engine results.

To take our cloaking study one step further, we randomly selected 100 pages from the 4,083 pages we detected as syntactic cloaking pages and manually checked the percentage of semantic cloaking among them. In practice, it is difficult to judge whether some behavior is harmful to search engine rankings. For example, some web sites send a login page to the browser but the full page to the crawler. So, we ended up with three categories: acceptable cloaking, unknown and semantic cloaking.

From these 100 pages, we classified 33 pages as semantic cloaking, 32 as unknown and 35 as acceptable cloaking.

4.3 Different types of cloaking

In the process of manually checking 600 pages for the above sections, we found several different types of cloaking.

4.3.1 Types of term cloaking

We identified many different methods of sending different term content to crawlers and web browsers. They can be categorized by the magnitude of the difference.

We first consider the case in which the contents of the pages sent to the crawler and to the web browser are quite different.

- The page provided to the crawler is full of detail, but the one sent to the web browser is empty, or only contains frames or JavaScript.
- The web site sends a text page to the crawler, but sends non-text content (such as Macromedia Flash content) to the web browser.
- The page sent to the crawler incorporates content, but the one sent to the web browser contains only a redirect or a 404 error response.

The second case is when content differs only partially between the pages sent to the crawler and the browser and the remaining content is identical, or one copy has slightly more content than the other.

- The pages sent to the crawler contain more text content than the ones sent to the web browser. For example, only the page sent to the crawler contains the keywords shown in Figure 8.
- Different redirection target URLs are contained in the pages sent to the crawler and to the web browser.
- The web site sends different titles, meta-descriptions or keywords to the crawler than to the web browser. For example, the header sent to the browser uses "Shape of Things movie info at Video Universe" as the meta-description, while the one sent to the crawler uses "Great prices on Shape of Things VHS movies at Video Universe. Great service, secure ordering and fast shipping at everyday discount prices."
- The page sent to the crawler contains JavaScript, but no such JavaScript is sent to the browser, or the pages sent to the crawler contain different JavaScript than those sent to the web browser.
- Pages sent to the crawler do not contain some banner advertisements, while the pages sent to the web browser do.
- The NOSCRIPT element is used to define alternate content if a script is not executed. The page sent to the web browser has the NOSCRIPT tag, while the page sent to the crawler does not.

game computer games PC games console games video games computer action games adventure games role playing games simulation games sports games strategy games contest contests prize prizes game cheats hints strategy computer games PC games computer action games adventure games role playing games Nintendo Playstation simulation games sports games strategy games contest contests prize prizes game computer games PC games computer action games adventure games role playing games simulation games sports games strategy games contest contests prize prizes.

Figure 8: Sample of keywords content only sent to the crawler.

4.3.2 Types of link cloaking

For link cloaking, we again group the situations by the magnitude of the differences between different versions of the same page. In one case, both pages contain a similar number of links; in the other, the pages have quite different numbers of links.

For the first situation, examples found include:

- There are the same number of links within the pages sent to the crawler and to the web browser, but the corresponding link pairs have a different format. For example, the link sent to the web browser may contain a PHP session id while the link sent to the crawler does not. Another example is that the page sent to the crawler only contains absolute URLs, while the page sent to the browser contains relative URLs that in fact point to the same targets as the absolute ones.
- The links in the page sent to the crawler are direct links, while the corresponding links within the page sent to the web browser are encoded redirections.
- The links sent to the web browser are normal links, but the links sent to the crawler are around small images instead of text.
- The web site shows links to different style sheets to the web browser than to the crawler. For example, the page sent to the crawler contains "href=/styles/styleswinie.css", while the page sent to the browser contains "href=/styles/styleswinns.css".

In some cases, the number of links within the page sent to the crawler and the page sent to the web browser can be quite different.

- More links exist in the page sent to the crawler than in the page sent to the web browser. For example, these links may point to a link farm.
- The page sent to the web browser has more links than the page sent to the crawler. For example, these links may be navigational links.
- The page sent to the browser contains some normal links, but in the same position of the page sent to the crawler, only error messages saying "no permission to include links" exist.

From the results shown within this section, it is obvious that cloaking is not rare in the real Web. It happens more often for hot queries or popular topics.

5 Results of Redirect

As we discussed in Section 1, redirection can also be used as a spamming technique. To gain insight into how often redirection appears and into the distribution of different redirect methods, we use the HITS data set mentioned in Section 3. We don't use all four copies but only compare two copies of each page: one from the simulated browser's set (BROWSER) and the other from the crawler's set (CRAWLER).

5.1 Distribution

We check the distribution of four different types of redirection: HTTP 301 Moved Permanently and 302 Moved Temporarily responses, the HTML meta refresh tag, and the use of JavaScript to load a new page.

In order to know the distribution of these four redirects, we tabulated the number of appearances of each type. For the first two types, the situation is simple: we just count the pages with a response status of "301" or "302". The last two are more complicated; the meta refresh tag does not necessarily mean a redirection, and JavaScript is even more complicated for redirection purposes. As a first step, we just count the appearances of the "meta http-equiv=refresh" tag for the third type and count the appearances of "location.replace" and "window.location" for the fourth type. The results for this first step are shown in Table 7.

Type          CRAWLER   BROWSER
301           20        22
302           56        60
Refresh tag   4230      4356
JavaScript    2399      2469

Table 7: Number of pages using different types of redirection.

To get a more accurate number of appearances of the meta refresh tag, we examined this further. In reality, the Refresh tag may just mean refreshing, not necessarily redirection to another page. For example, the Refresh tag may be put inside a NOSCRIPT tag for browsers that do not support JavaScript. To estimate the number of real redirections that use this refresh tag, we randomly selected 20 pages from the 4,230 pages that use the refresh tag and checked them manually. We found that 95% of them are real redirections and only 5% are inside a NOSCRIPT tag. Besides, some pages may have their own URL as the target in the Refresh tag to keep refreshing themselves. We also counted these. There are 47 pages out of 4,230 pages within the CRAWLER data set and 142 pages out of 4,356 pages within the BROWSER data set that refresh to themselves.

We took one more step for the 4,214 (4,356 - 142) pages that use the Refresh tag and refresh to a different page. Usually there is a time value assigned within the refresh tag to show how long to wait before refreshing. If this time is small enough, i.e., 0 or 1 seconds, users cannot see the origin page but are redirected to a new page immediately. We fetched this time value for these 4,214 pages and drew the distribution of time values in the range of 0 seconds to 30 seconds in Figure 9. More than 50% of these pages refresh to a different URL after 0 seconds and about 10% refresh after 1 second.

Figure 9: Distribution of delays before refresh. (x-axis: number of seconds before refresh, 0 to 30; y-axis: percentage of pages.)

To estimate the real distribution of the JavaScript refresh method, we randomly selected 40 pages from the 2,399 pages that were identified in the first step as candidates for using JavaScript for redirection. After manually checking these 40 files, we found that 20% of them are real redirections, 32.5% of them are conditional redirections, and the rest are not for redirection purposes, such as code to avoid showing the page within a frame.

Sometimes the target URL and origin URL are within the same site, while other times they are on different sites. In order to know the percentage of redirections that redirect to the same site, we also analyzed our collected data set for this information. Since JavaScript redirection is complicated, we only count the first three types of redirection here. The sum of the first three types of redirection is 4,306. Within the CRAWLER data set, 2,328 of these 4,306 pages redirect to the same site, while for the BROWSER data set, the number is 2,453.

5.2 Redirection Cloaking

As we mentioned in Section 4.3, a site may return pages redirecting to different locations for different user agents. We consider this redirection cloaking.

We found that there are 153 pairs of pages out of 250,000 pairs that have a different response code for a crawler and a normal browser when redirecting. Usually these web sites send a 404 or 503 response code to one and a 200 response code to the other.

We even found that there are 10 pages that use a different redirection method for a crawler and a normal web browser. For example, they may use 302 or 301 for the crawler, but use a refresh tag with the response code 200 for a normal web browser.

6 Summary and Discussion

Detection of search engine spam is a challenging research area. Cloaking and redirection are two important spamming techniques.

This study is based on a sample of a quarter of a million pages and the top responses from a popular search engine to hot queries on the Web. We identified different kinds of cloaking, gave an estimate of the percentage of pages that are cloaked in the sample, and also showed an estimate of the distribution of different redirects.

There are four issues that we would like to see addressed in future work. The first is that of a bias in the data set used. Our data sets (pages in or near the top results for queries) do not nearly reflect the Web as a whole. However, it might be argued that they reflect the Web that is important (at least for the purposes of finding pages that might affect search engine rankings through cloaking). The second is that this paper does not address IP-based cloaking, so there are likely pages that do indeed provide cloaked content to the major engines when they recognize the crawling IP. We would welcome the partnership of a search engine to collaborate on future crawls. The final issue is the bottom line. While search engines may be interested in finding and eliminating instances of cloaking, our proposed technique requires three or four crawls. Ideally, a future technique would incorporate a two-stage approach that identifies a subset of the full web that is more likely to contain cloaked pages, so that a full crawl using a browser identity would not be necessary.

Our hope is that this study can provide a realistic view of the use of these two techniques and will contribute to robust and effective solutions to the identification of search engine spam.

References

[1] Ask Jeeves / Teoma Site Submit managed by ineedhits.com: Program Terms, 2005. Online at http://ask.ineedhits.com/programterms.asp.
[2] America Online, Inc. AOL Search: Hot searches, Mar. 2005. http://hot.aol.com/hot/hot.
[3] Ask Jeeves, Inc. Ask Jeeves About, Mar. 2005. http://sp.ask.com/docs/about/jeevesiq.html.
[4] M. Cafarella and D. Cutting. Building Nutch: Open source. ACM QUEUE, 2, Apr. 2004.
[5] Google, Inc. Google information for webmasters, 2005. Online at http://www.google.com/webmasters/faq.html.
[6] Google, Inc. Google Zeitgeist, Jan. 2005. http://www.google.com/press/zeitgeist/zeitgeist-jan05.html.
[7] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
[8] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2), Fall 2002.
[9] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[10] Lycos. Lycos 50 with Dean: 2004 web's most wanted, Dec. 2004. http://50.lycos.com/121504.asp.
[11] C. D. Manning and H. Schutze. Foundations of statistical natural language processing, chapter 8, pages 268-269. MIT Press, 2001.
[12] M. Najork. System and method for identifying cloaked web servers, Jul 10 2003. Patent Application number 20030131048.
[13] A. Perkins. White paper: The classification of search engine spam, Sept. 2001. Online at http://www.silverdisc.co.uk/articles/spam-classification/.
[14] WebmasterWorld.com. Cloaking, 2005. Online at http://www.webmasterworld.com/forum24/.
[15] B. Wu and B. D. Davison. Identifying link farm spam pages. In Proceedings of the 14th International World Wide Web Conference, Industrial Track, May 2005.
[16] Yahoo! Inc. Yahoo! Help - Yahoo! Search, 2005. Online at http://help.yahoo.com/help/us/ysearch/deletions/.