Cloaking and Redirection: A Preliminary Study

Baoning Wu and Brian D. Davison
Computer Science & Engineering
Lehigh University
{baw4,davison}@cse.lehigh.edu

Copyright is held by the author/owner(s). AIRWeb'05, May 10, 2005, Chiba, Japan.

Abstract

Cloaking and redirection are two possible search engine spamming techniques. In order to understand cloaking and redirection on the Web, we downloaded two sets of Web pages while mimicking both a popular Web crawler and a common Web browser. We estimate that 3% of the first data set and 9% of the second data set utilize cloaking of some kind. Manually checking a sample of the cloaking pages from the second data set shows that nearly one third of them appear to aim to manipulate search engine ranking. We also examined redirection methods present in the first data set. We propose a method of detecting cloaking pages by calculating the difference of three copies of the same page. We examine the different types of cloaking that are found and the distribution of different types of redirection.

1 Introduction

Cloaking is the practice of sending different content to a search engine than to regular visitors of a web site. Redirection is used to send users automatically to another URL after loading the current URL. Both of these techniques can be used in search engine spamming [13, 7]. Henzinger et al. [8] have pointed out that search engine spam is one of the major challenges of web search engines and that cloaking is among the spamming techniques used today. Since search engine results can be severely affected by spam, search engines typically have policies against cloaking and some kinds of dedicated redirection [5, 16, 1].

Google [5] describes cloaking as the situation in which "the web server is programmed to return different content to Google than it returns to regular users, usually in an attempt to distort search engine rankings." An obvious way to detect cloaking is to calculate, for each page, whether there is a difference between a copy from a search engine's perspective and a copy from a web browser's perspective. But in reality, this is non-trivial.

Unfortunately, it is not enough to know that corresponding copies of a page differ; we still cannot tell whether the page is a cloaking page. The reason is that web pages may be updated frequently, such as on a news web site or a blog web site, or simply that the web site puts a time stamp on every page it serves. Even if two crawlers were synchronized to visit the same web page at nearly the same moment, some dynamically generated pages may still have different content, such as a banner advertisement that is rotated on each access.

Besides the difficulty of identifying cloaking, it is also hard to tell whether a particular instance of cloaking is considered acceptable or not. We define the cloaking behavior that has the effect of manipulating search engine ranking results as semantic cloaking. Unfortunately, the various search engines may have different criteria for defining unacceptable cloaking. As a result, we have focused on the simpler, more basic task: when we mention cloaking in this paper, we usually refer to the simpler case of whether different content is served to automated crawlers versus web browsers, but not different content to every visitor. We call this syntactic cloaking. So, for example, we will not consider dynamic advertisements to be cloaking.

In order to investigate this issue, we collected two data sets: one is a large data set containing 250,000 pages and the other is a smaller data set containing 47,170 pages. The details of these two data sets are given in Section 3. We manually examined a number of samples of those pages and found several different kinds of cloaking techniques. From this study we make an initial proposition toward building an automated cloaking detection system. Our hope is that these results may be of use to researchers in designing better and more thorough solutions to the cloaking problem.

Since redirection can also be used as a spamming technique, we also calculated some statistics on redirection based on the data we crawled for the cloaking study. Four types of redirection are studied.

Few publications address the issue of cloaking on the Web. As a result, the main contribution of this paper is to begin a discussion of the problem of cloaking and its prevalence in the web today. We provide a view of actual cloaking and redirection techniques. We additionally propose a method for detecting cloaking by using three copies of the same page.

We next review those few papers that mention cloaking. The data sets we use for this study are introduced in Section 3. The results for cloaking and redirection are shown in Sections 4 and 5 respectively. We conclude this paper with a summary and discussion in Section 6.

2 Related Work

Henzinger et al. [8] mentioned that search engine spam is quite prevalent and that search engine results would suffer greatly if no measures were taken. They also mentioned that cloaking is one of the major search engine spam techniques.

Gyöngyi and Garcia-Molina [7] describe cloaking and redirection as spam hiding techniques. They showed that web sites can identify search engine crawlers by their network IP address or user-agent names. They also described the use of refresh meta tags and JavaScript to perform redirection. They additionally mention that some cloaking (such as sending the search engine a version free of navigational links and advertisements but with no change to the content) is accepted by search engines.

Perkins [13] argues that agent-based cloaking is spam. No matter what kind of content is sent to the search engine, the goal is to manipulate search engine rankings, which is an obvious characteristic of search engine spam.

Cafarella and Cutting [4] mention cloaking as one of the spamming techniques. They said that search engines will fight cloaking by penalizing sites that give substantially different content to different browsers.

None of the above papers discusses how to detect cloaking, which is one aspect of the present work. In one cloaking forum [14], many examples of cloaking and methods of detecting cloaking are proposed and discussed. Unfortunately, these discussions can generally be taken as speculation only, as they lack strong evidence or conclusive experiments.

Najork filed for a patent [12] on a method for detecting cloaked pages. He proposed detecting cloaked pages from users' browsers by installing a toolbar and letting the toolbar send the signature of user-perceived pages to search engines. His method does not distinguish rapidly changing or dynamically generated Web pages from real cloaking pages, which is a major concern for our algorithms.

3 Data set

Two data sets were examined for our cloaking and redirection testing. For convenience, we name the first data set HITS data and the second HOT data.

3.1 First data set: HITS data

In related work to recognize spam in the form of link farms [15], we collected Web pages in the neighborhood of the top 100 results for 412 queries by following the HITS data collection process [9]. That is, for each query presented to a popular search engine, we collected the top 200 result references, and for each URL we also retrieved the outgoing link set and up to 100 incoming link pages. The resulting data set contains 2.1M unique Web pages.

From these 2.1M URLs, we randomly selected 250,000 URLs. In order to test for cloaking, we crawled these pages simultaneously from a university IP address (Lehigh) and from a commercial IP address (Verizon DSL). We set the user-agent from the university address to be

Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)

and the one from the commercial IP to be

Googlebot/2.1 (+http://www.googlebot.com/bot.html).

From each location we crawled our data set twice with a time interval of one day. So, for each page, we finally have four copies, two of which are from a web browser's perspective and two from a crawler's perspective. For convenience, we name these four copies B1, B2, C1 and C2 respectively. For each page, the time order of retrieval of these four copies is always C1, B1, C2 and B2.

3.2 Second data set: HOT data

We also want to know the cloaking ratio within the top response lists for hot queries.

The first step is to collect hot queries from popular search engines. To do this, we collected 10 popular queries of Jan 2005 from Google Zeitgeist [6], the top 100 search terms of 2004 from Lycos [10], the top 10 searches for the week ending Mar 11, 2005 from Ask Jeeves [3], and 10 hot searches in each of 16 categories ending Mar 11, 2005 from AOL [2]. This resulted in 257 unique queries from these web sites.

The second step is to collect the top response list for these hot queries. For each of these 257 queries, we retrieved the top 200 responses from the Google search engine. The number of unique URLs is 47,170. Like the first data set, we downloaded four copies for each of these 47,170 URLs, two from a browser's perspective and two from a crawler's perspective. But all these copies are downloaded from machines with a university IP address. For convenience, we name these four copies HC1, HB1, HC2 and HB2 respectively. This order also matches the time order of downloading them.

4 Results of Cloaking

In this section, we show the results of the cloaking test.

4.1 Detecting Cloaking in HITS data

Intuitively, the goal of cloaking is to give different content to a search engine than to normal web browsers. This can be different text or links. We use two techniques to compare versions retrieved by a crawler and a browser: we consider the number of differences in the terms and in the links used over time to detect cloaking.

As we mentioned in Section 1, calculating the difference between pages from the browser's and crawler's viewpoints is not strong enough to tell whether the page does cloaking. Our proposed method is to use three copies of a page, C1, C2 and B1, to decide if it is a cloaking page. In detail, for each URL, we first calculate the difference between C1 and C2 (for convenience, we use NCC to represent this number). Then we calculate the difference between B1 and C1 (for convenience, we use NBC to represent this number). Finally, if NBC is greater than NCC, then we mark the page as a cloaking candidate. The intuition is that the page may change frequently, but if the difference between the browser's copy and the crawler's copy is bigger than the difference between two crawler copies, the evidence may be enough that the page is cloaking.

Figure 1: Distribution of the difference of NCC and NBC. [Plot on logarithmic axes: number of URLs versus difference between NCC and NBC.]

We used two methods to calculate the difference between pages: the difference in terms used, and the difference in links provided. We describe each below, along with the results obtained.

4.1.1 Term Difference

The first method for detecting cloaking is to use the term difference among different copies. Instead of using all the terms in the HTML files, we used the "bag of words" method for analyzing the web pages, i.e., we parse the HTML file into terms and count each unique term only once no matter how many times this term appears. Thus, each page is represented by a set of words after parsing.

For each page, we first calculated the number of different terms between the copies C1 and C2 (designated NCC, as described above). We then calculated the number of different terms between the copies C1 and B1 (designated NBC). We then selected pages that have a bigger NBC than NCC as candidates for cloaking. For this data set, we marked 23,475 candidates out of the original 250K data set. The difference between NCC and NBC for these 23,475 pages forms a power-law-like distribution, shown in Figure 1.

To check what threshold for this difference between NCC and NBC is a good indication of real cloaking, we first put the 23,475 URLs into ten different buckets based on the difference value. The range for each bucket and the number of pages within each bucket are shown in Table 1.

Bucket ID   Range              No. of Pages
1           x <= 5             8084
2           5 < x <= 10        2287
3           10 < x <= 20       1938
4           20 < x <= 40       2065
5           40 < x <= 80       2908
6           80 < x <= 160      1731
7           160 < x <= 320     1496
8           320 < x <= 640     912
9           640 < x <= 1280    1297
10          1280 < x           757

Table 1: Buckets of term difference

Then, from each bucket we randomly selected thirty pages and checked them manually to see how many of these thirty pages are real syntactic cloaking pages. The result is shown in Figure 2.

Figure 2: The ratio of syntactic cloaking in each bucket based on term difference.

The trend is obvious in Figure 2: the greater the difference, the higher the proportion of cloaking contained in the bucket. In order to know which threshold is optimal, we calculated the precision, recall and F-measure based on the ranges of these buckets. For these three measures, we follow the definitions in [11] and set α to 0.5 in the F-measure formula to give equal weight to recall and precision. Precision is the proportion of selected items that the system got right; recall is the proportion of the target items that the system selected; F-measure is the measure that combines precision and recall. The results of these three measures are shown in Table 2. If we choose F-measure as the criterion, buckets 4 and 5 have the highest value. Since the boundary between buckets 4 and 5 is 40 in Table 1, we can set the threshold to be 40 and declare that all pages with a difference above 40 are categorized as cloaking pages. In that case, the precision and recall are 0.580 and 0.671 respectively.

Threshold   Precision   Recall   F value
1           0.355       1.000    0.502
5           0.423       0.828    0.560
10          0.480       0.799    0.560
20          0.534       0.758    0.627
40          0.580       0.671    0.622
80          0.633       0.498    0.588
160         0.685       0.388    0.496
320         0.695       0.262    0.380
640         0.752       0.196    0.311
1280        0.899       0.086    0.157

Table 2: F-measure for different thresholds based on term difference.

From Figure 2, we can estimate what percentage of our 250,000-page set are cloaking pages. Since we know the total number of pages within each bucket and the number of cloaking pages within the 30 manually checked pages from each bucket, the estimated number of cloaking pages in a bucket is the product of the number of pages within the bucket and the ratio of cloaking pages within its 30 checked pages; the estimated total is the sum over all buckets. The result is 7,780, so we expect that we can identify nearly 8,000 cloaking pages (about 3%) within the 250,000 pages.

4.1.2 Link Difference

Similar to term difference, we also analyzed this data set on the basis of link differences. Here link difference means the number of different links between two corresponding pages.

First we calculated the link difference between the copies C1 and C2 (termed LCC). We then calculated the link difference between the copies C1 and B1 (termed LBC). Finally we marked the pages that have a higher LBC than LCC as cloaking candidates. In this way, we marked 8,205 candidates. The frequency of these candidates also approximates a power-law distribution, as with term difference. It is shown in Figure 3.

Figure 3: Distribution of the difference of LCC and LBC. [Plot on logarithmic axes: number of URLs versus difference between LCC and LBC.]

As with term difference, we also put these 8,205 candidates into 10 buckets. The range and number of pages within each bucket are shown in Table 3.

Bucket ID   Range              No. of Pages
1           x <= 5             4415
2           5 < x <= 10        787
3           10 < x <= 20       746
4           20 < x <= 35       783
5           35 < x <= 55       441
6           55 < x <= 80       299
7           80 < x <= 110      279
8           110 < x <= 145     182
9           145 < x <= 185     100
10          185 < x            173

Table 3: Buckets of link difference

From each bucket, we randomly selected 30 pages and checked manually to see how many of them are real cloaking pages. The result is shown in Figure 4. It is obvious that most of the pages from bucket 4 or above are cloaking pages. We also calculated the F values for the thresholds corresponding to the range of each bucket. The result is shown in Table 4. We can tell that 5 is an optimal threshold with the best F value.

Figure 4: The ratio of syntactic cloaking in each bucket based on link difference.

Threshold   Precision   Recall   F value
1           0.479       1.000    0.648
5           0.727       0.700    0.713
10          0.822       0.627    0.711
20          0.906       0.520    0.660
35          0.910       0.340    0.496
55          0.900       0.236    0.374
80          0.900       0.167    0.283
110         0.900       0.104    0.186
145         0.878       0.060    0.114
185         0.866       0.038    0.072

Table 4: F-measure for different thresholds based on link difference.

Since fewer pages have a link difference than have a term difference, fewer cloaking pages can be found by using link difference alone, but those found are identified more accurately.

4.2 Detecting Cloaking in HOT data

Based on the experience of manually checking for cloaking pages in the first data set, we attempted to detect syntactic cloaking automatically by using all four copies of each page.

4.2.1 Algorithm of detecting cloaking automatically

Our assumption about syntactic cloaking is that the web site will send something consistent to the crawler but send something different yet still consistent to the browser. So, if there exist terms that appear in both of the copies sent to the crawler but never appear in any of the copies sent to the browser, or vice versa, it is quite possible that the page is doing syntactic cloaking. When getting the terms out of each copy, we still use the "bag of words" approach, i.e., we replace all the non-word characters within an HTML file with blanks and then get all the words out of it for the intersection operation.

To describe our algorithm easily, the intersection of the four copies is shown as a Venn diagram in Figure 5. We use capital letters from A to M to represent each intersection component of the four copies. For example, area L contains content that only appears in HC1, but never appears in HC2, HB1 and HB2; area F is the intersection of the four copies, i.e., the content that appears in all of the four copies.

Figure 5: Intersection of the four copies for a Web page.

The most interesting components to us are areas A and G. Area A represents terms that appear in both browser copies but never appear in any of the crawler copies, while area G represents terms that appear in both crawler copies but never appear in any of the browser copies. So our algorithm for detecting syntactic cloaking automatically is that for each web page, we calculate the number of terms in area A and the number of terms in area G. If the sum of these two numbers is nonzero, we may mark this page as a cloaking page.

There are false positive examples for this algorithm. As a simple example, suppose there is a dynamic picture on the page, and on every request the web server randomly selects one of 4 JPEG files (a1.jpg to a4.jpg) to serve. It happens that a1.jpg is sent every time our crawler visits this page, but a2.jpg and a3.jpg are sent when our browser visits this page. By our algorithm, the page will be marked as cloaking, but it can be easily verified that this is not the case. So, again we need a threshold for the algorithm to work more accurately.

For the 47,170 URLs, we found 6466 pages for which the sum of the number of terms in areas A and G is greater than 0. Again, we put them into 10 buckets, as shown in Table 5. The third column is the number of pages within each bucket. From each bucket, we randomly selected 10 pages and manually checked whether each page is real syntactic cloaking. The accuracy is shown in the fourth column of Table 5. We also calculated the F-measure; the results are shown in Table 6.

Bucket   Range              No.    Accuracy
1        x <= 1             725    40%
2        1 < x <= 2         540    30%
3        2 < x <= 4         495    30%
4        4 < x <= 8         623    40%
5        8 < x <= 16        650    90%
6        16 < x <= 32       822    100%
7        32 < x <= 64       600    100%
8        64 < x <= 128      741    100%
9        128 < x <= 256     420    100%
10       256 < x            1120   100%

Table 5: Buckets of unique terms in area A and G

Threshold   Precision   Recall   F value
0           0.647       1.000    0.785
1           0.703       0.965    0.813
2           0.766       0.952    0.849
4           0.836       0.940    0.885
8           0.902       0.881    0.891
16          0.922       0.756    0.831
32          0.960       0.599    0.738
64          0.979       0.470    0.635
128         0.972       0.358    0.523
256         1.000       0.267    0.422

Table 6: F-measure of different thresholds

Since the 4th and 5th buckets have the highest F values in Table 6, we choose the threshold to be the boundary between bucket 4 and bucket 5, i.e., 8. So, our automated cloaking algorithm is revised to only mark pages with the sum of areas A and G greater than 8 as cloaking pages. So, for our second data set, all pages in bucket 5 to bucket 10 are marked as cloaking pages. Finally, we marked 4,083 pages out of the 47,170 pages, i.e., about 9% of pages from the hot query data set are syntactic cloaking pages.

4.2.2 Distribution of syntactic cloaking within top rankings

Since we have identified 4,083 pages that utilize cloaking, we can now draw the distribution of these cloaking pages within different top rankings. Figure 6 shows the cumulative percentage of cloaking pages within the top 200 response lists returned by Google. As we can see, about 2% of the top 50 URLs, about 4% of the top 100 URLs and more than 8% of the top 200 URLs utilize cloaking. The ratio is quite high, and the cloaking may be helping these pages to be ranked high.

Figure 6: Percentage of syntactic cloaking pages within Google's top responses. [Axes: cumulative percentage of cloaking pages versus top results, 0 to 200.]

Since we retrieved the top 10 hot queries from each of 16 categories from AOL, we can consider the topic of the cloaking pages. Intuitively some popular categories, such as sports or computers, may contain more cloaking pages in the top ranking list. So we also calculated the fraction of cloaking pages within each category. The results are shown in Figure 7. Some categories, such as Shopping and Sports, are more likely to have cloaked results than other categories.

Figure 7: Category-specific Cloaking. [Categories: A. Autos, B. Companies, C. Computing, D. Entertainment, E. Games, F. Health, G. House, H. Holidays, I. Local, J. Movies, K. Music, L. Research, M. Shopping, N. Sports, O. TV, P. Travel.]

4.2.3 Syntactic vs Semantic cloaking

Not all syntactic cloaking is considered unacceptable to search engines. For example, a page sent to the crawler that omits advertising content, or omits a PHP session identifier that is used to distinguish different real users, is not a problem to search engines. In contrast to acceptable cloaking, we define semantic cloaking as cloaking behavior with the effect of manipulating search engine results.

To take our cloaking study one step further, we randomly selected 100 pages from the 4,083 pages we had detected as syntactic cloaking pages and manually checked the percentage of semantic cloaking among them. In practice, it is difficult to judge whether some behavior is harmful to search engine rankings. For example, some web sites send a login page to the browser while sending the full page to the crawler. So, we end up with three categories: acceptable cloaking, unknown and semantic cloaking.

From these 100 pages, we classified 33 pages as semantic cloaking, 32 as unknown and 35 as acceptable cloaking.

4.3 Different types of cloaking

In the process of manually checking 600 pages for the above sections, we found several different types of cloaking.

4.3.1 Types of term cloaking

We identified many different methods of sending different term content to crawlers and web browsers. They can be categorized by the magnitude of the difference.

We first consider the case in which the contents of the pages sent to the crawler and to the web browser are quite different.

- The page provided to the crawler is full of detail, but the one sent to the web browser is empty, or only contains frames or JavaScript.
- The web site sends a text page to the crawler, but sends non-text content (such as Macromedia Flash content) to the web browser.
- The page sent to the crawler incorporates content, but the one sent to the web browser contains only a redirect or a 404 error response.

The second case is when content differs only partially between the pages sent to the crawler and the browser and the remaining content is identical, or one copy has slightly more content than the other.

- The pages sent to the crawler contain more text content than the ones sent to the web browser. For example, only the page sent to the crawler contains the keywords shown in Figure 8.
- Different redirection target URLs are contained in the pages sent to the crawler and to the web browser.
- The web site sends different titles, meta-descriptions or keywords to the crawler than to the web browser. For example, the header sent to the browser uses "Shape of Things movie info at Video Universe" as the meta-description, while the one sent to the crawler uses "Great prices on Shape of Things VHS movies at Video Universe. Great service, secure ordering and fast shipping at everyday discount prices."
- The page sent to the crawler contains JavaScript, but no such JavaScript is sent to the browser, or the pages sent to the crawler have different JavaScript than those sent to the web browser.
- Pages sent to the crawler do not contain some banner advertisements, while the pages sent to the web browser do.
- The NOSCRIPT element is used to define alternate content if a script is not executed. The page sent to the web browser has the NOSCRIPT tag, while the page sent to the crawler does not.

game computer games PC games console games video games computer action games adventure games role playing games simulation games sports games strategy games contest contests prize prizes game cheats hints strategy computer games PC games computer action games adventure games role playing games Nintendo Playstation simulation games sports games strategy games contest contests prize prizes game computer games PC games computer action games adventure games role playing games simulation games sports games strategy games contest contests prize prizes.

Figure 8: Sample of keywords content only sent to the crawler.

4.3.2 Types of link cloaking

For link cloaking, we again group the situations by the magnitude of the differences between different versions of the same page. In one case, both pages contain a similar number of links; in the other, the pages have quite different numbers of links.

For the first situation, examples found include:

- There are the same number of links within the pages sent to the crawler and to the web browser, but the corresponding link pairs have a different format. For example, the link sent to the web browser may contain a PHP session id while the link sent to the crawler does not. Another example is that the page sent to the crawler only contains absolute URLs, while the page sent to the browser contains relative URLs that in fact point to the same targets as the absolute ones.
- The links in the page sent to the crawler are direct links, while the corresponding links within the page sent to the web browser are encoded redirections.
- The links sent to the web browser are normal links, but the links sent to the crawler are around small images instead of text.
- The web site shows links to different style sheets to the web browser than to the crawler. For example, the page sent to the crawler contains "href=/styles/styleswinie.css", while the page sent to the browser contains "href=/styles/styleswinns.css".

In some cases, the number of links within the page sent to the crawler and the page sent to the web browser can be quite different.

- More links exist in the page sent to the crawler than in the page sent to the web browser. For example, these links may point to a link farm.
- The page sent to the web browser has more links than the page sent to the crawler. For example, these links may be navigational links.
- The page sent to the browser contains some normal links, but in the same position of the page sent to the crawler, only error messages saying "no permission to include links" exist.

From the results shown within this section, it is obvious that cloaking is not rare in the real Web. It happens more often for hot queries or popular topics.

5 Results of Redirect

As we have discussed in Section 1, redirection can also be used as a spamming technique. To get an insight into how often redirection appears and into the distribution of different redirect methods, we use the HITS data set mentioned in Section 3. We do not use all four copies but only compare two copies for each page: one from the simulated browser's set (BROWSER) and the other from the crawler's set (CRAWLER).

5.1 Distribution

We check the distribution of four different types of redirection: HTTP 301 Moved Permanently and 302 Moved Temporarily responses, the HTML meta refresh tag, and the use of JavaScript to load a new page.

In order to know the distribution of the above four redirects, we tabulated the number of appearances of each type. For the first two types, the situation is simple: we just count the pages with a response status of "301" and "302". The last two are more complicated; the meta refresh tag does not necessarily mean a redirection, and JavaScript is even more complicated for redirection purposes. For the first step, we just count the appearance of the "<meta http-equiv=refresh>" tag for the third type and count the appearance of "location.replace" and "window.location" for the fourth type. The results of this first step are shown in Table 7.

Type          CRAWLER   BROWSER
301           20        22
302           56        60
Refresh tag   4230      4356
JavaScript    2399      2469

Table 7: Number of pages using different types of redirection.

To get a more accurate count of redirections through the meta refresh tag, we examined this further. In reality, the refresh tag may just mean refreshing, not necessarily redirection to another page. For example, the refresh tag may be put inside a NOSCRIPT tag for browsers that do not support JavaScript. To estimate the number of real redirections using this refresh tag, we randomly selected 20 pages from the 4,230 pages that use the refresh tag and checked them manually. We found that 95% of them are real redirections and only 5% are inside a NOSCRIPT tag. Besides, some pages may have their own URL as the target in the refresh tag, to keep refreshing themselves. We also counted these. There are 47 pages out of 4,230 pages within the CRAWLER data set and 142 pages out of 4,356 pages within the BROWSER data set that refresh to themselves.

We went one step further for the 4,214 (4,356 - 142) pages that use the refresh tag and refresh to a different page. Usually there is a time value assigned within the refresh tag to show how long to wait before refreshing. If this time is small enough, i.e., 0 or 1 seconds, users cannot see the origin page but are redirected to a new page immediately. We fetched this time value for these 4,214 pages and draw the distribution of different time values in the range of 0 seconds to 30 seconds in Figure 9. More than 50% of these pages refresh to a different URL after 0 seconds and about 10% refresh after 1 second.

Figure 9: Distribution of delays before refresh. [Axes: percentage of pages versus number of seconds before refresh, 0 to 30.]

To estimate the real distribution of the JavaScript refresh method, we randomly selected 40 pages from the 2,399 pages that had been identified in the first step as candidates for using JavaScript for redirection. After manually checking these 40 files, we found that 20% of them are real redirections, 32.5% of them are conditional redirections, and the rest are not for redirection purposes, such as code to avoid showing the page within a frame.

Sometimes the target URL and origin URL are within the same site, while other times they are on different sites. In order to know the percentage of redirections that redirect to the same site, we also analyzed our collected data set for this information. Since JavaScript redirection is complicated, we only count the first three types of redirection here. The sum of the first three types of redirection is 4,306. Within the CRAWLER data set, 2,328 of these 4,306 pages redirect to the same site, while for the BROWSER data set, the number is 2,453.

5.2 Redirection Cloaking

As we have mentioned in Section 4.3, a site may return pages redirecting to different locations for different user agents. We consider this redirection cloaking.

We found that there are 153 pairs of pages out of 250,000 pairs that have a different response code for a crawler and a normal browser when redirecting. Usually these web sites send a 404 or 503 response code to one and a 200 response code to the other.

We even found that there are 10 pages that use a different redirection method for a crawler and a normal web browser. For example, they may use 302 or 301 for the crawler, but use a refresh tag with the response code 200 for a normal web browser.

6 Summary and Discussion

Detection of search engine spam is a challenging research area. Cloaking and redirection are two important spamming techniques.

This study is based on a sample of a quarter of a million pages and top responses from a popular search engine to hot queries on the Web. We identified different kinds of cloaking, gave an estimate of the percentage of pages that are cloaked in the sample, and also showed an estimate of the distribution of different redirects.

There are several issues that we would like to see addressed in future work. The first is that of a bias in the data set used. Our data sets (pages in or near the top results for queries) do not nearly reflect the Web as a whole. However, it might be argued that they reflect the Web that is important (at least for the purposes of finding pages that might affect search engine rankings through cloaking). The second is that this paper does not address IP-based cloaking, so there are likely pages that do indeed provide cloaked content to the major engines when they recognize the crawling IP. We would welcome the partnership of a search engine to collaborate on future crawls.

The final issue is the bottom line. While search engines may be interested in finding and eliminating instances of cloaking, our proposed technique requires three or four crawls. Ideally, a future technique would incorporate a two-stage approach that identifies a subset of the full web that is more likely to contain cloaked pages, so that a full crawl using a browser identity would not be necessary.

Our hope is that this study can provide a realistic view of the use of these two techniques and will contribute to robust and effective solutions to the identification of search engine spam.

References

[1] Ask Jeeves / Teoma Site Submit managed by ineedhits.com: Program Terms, 2005. Online at http://ask.ineedhits.com/programterms.asp.
[2] America Online, Inc. AOL Search: Hot searches, Mar. 2005. http://hot.aol.com/hot/hot.
[3] Ask Jeeves, Inc. Ask Jeeves About, Mar. 2005. http://sp.ask.com/docs/about/jeevesiq.html.
[4] M. Cafarella and D. Cutting. Building Nutch: Open source. ACM Queue, 2, Apr. 2004.
[5] Google, Inc. Google information for webmasters, 2005. Online at http://www.google.com/webmasters/faq.html.
[6] Google, Inc. Google Zeitgeist, Jan. 2005. http://www.google.com/press/zeitgeist/zeitgeist-jan05.html.
[7] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
[8] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2), Fall 2002.
[9] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[10] Lycos. Lycos 50 with Dean: 2004 web's most wanted, Dec. 2004. http://50.lycos.com/121504.asp.
[11] C. D. Manning and H. Schütze. Foundations of statistical natural language processing, chapter 8, pages 268-269. MIT Press, 2001.
[12] M. Najork. System and method for identifying cloaked web servers, Jul 10 2003. Patent Application number 20030131048.
[13] A. Perkins. White paper: The classification of search engine spam, Sept. 2001. Online at http://www.silverdisc.co.uk/articles/spam-classification/.
[14] WebmasterWorld.com. Cloaking, 2005. Online at http://www.webmasterworld.com/forum24/.
[15] B. Wu and B. D. Davison. Identifying link farm spam pages. In Proceedings of the 14th International World Wide Web Conference, Industrial Track, May 2005.
[16] Yahoo! Inc. Yahoo! Help - Yahoo! Search, 2005. Online at http://help.yahoo.com/help/us/ysearch/deletions/.
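
Illustrative sketch. The detection rules of Sections 4.1 and 4.2.1 can be expressed compactly as set operations, and the following Python fragment is one possible reading of them rather than the code used in the study. Several details are assumptions: the function names are hypothetical, the "number of different terms" between two copies is taken to be the size of the symmetric difference of their term sets, and terms are obtained by replacing every non-word character with a blank, as the bag-of-words description suggests. The default threshold of 8 is the value chosen from Table 6.

```python
import re


def bag_of_words(html):
    """Replace non-word characters with blanks and return the set of unique terms."""
    return set(re.sub(r"\W+", " ", html).split())


def term_difference(copy_a, copy_b):
    """Assumed definition: number of terms that occur in exactly one of the two copies."""
    return len(bag_of_words(copy_a) ^ bag_of_words(copy_b))


def is_candidate_three_copies(c1, c2, b1):
    """Three-copy rule of Section 4.1: candidate if NBC is greater than NCC."""
    ncc = term_difference(c1, c2)  # crawler copy vs. crawler copy
    nbc = term_difference(b1, c1)  # browser copy vs. crawler copy
    return nbc > ncc


def area_a_plus_g(hc1, hb1, hc2, hb2):
    """Four-copy rule of Section 4.2.1: size of area A plus size of area G."""
    c1, c2 = bag_of_words(hc1), bag_of_words(hc2)
    b1, b2 = bag_of_words(hb1), bag_of_words(hb2)
    area_a = (b1 & b2) - (c1 | c2)  # in both browser copies, in no crawler copy
    area_g = (c1 & c2) - (b1 | b2)  # in both crawler copies, in no browser copy
    return len(area_a) + len(area_g)


def is_syntactic_cloaking(hc1, hb1, hc2, hb2, threshold=8):
    """Mark the page when the A + G count exceeds the chosen threshold."""
    return area_a_plus_g(hc1, hb1, hc2, hb2) > threshold
```

For a page that sends "buy cheap games prizes" only to the crawler and "welcome" only to the browser, with a time stamp that changes on every fetch, the time stamp terms fall outside areas A and G while the consistent crawler-only and browser-only terms are counted. A page that differs only in its time stamp scores zero under both rules.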