Cloak and Dagger: Dynamics of Web Search Cloaking

David Y. Wang, Stefan Savage, and Geoffrey M. Voelker
Department of Computer Science and Engineering
University of California, San Diego

ABSTRACT

Cloaking is a common “bait-and-switch” technique used to hide the true nature of a Web site by delivering blatantly different semantic content to different user segments. It is often used in search engine optimization (SEO) to obtain user traffic illegitimately for scams. In this paper, we measure and characterize the prevalence of cloaking on different search engines, how this behavior changes for targeted versus untargeted advertising and ultimately the response to site cloaking by search engine providers. Using a custom crawler, called Dagger, we track both popular search terms (e.g., as identified by Google, Alexa and Twitter) and targeted keywords (focused on pharmaceutical products) for over five months, identifying when distinct results were provided to crawlers and browsers. We further track the lifetime of cloaked search results as well as the sites they point to, demonstrating that cloakers can expect to maintain their pages in search results for several days on popular search engines and maintain the pages themselves for longer still.

Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: Online Information Services - Web-based services

General Terms
Measurement, Security

Keywords
Cloaking, Search Engine Optimization, Web spam

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CCS'11, October 17–21, 2011, Chicago, Illinois, USA.
Copyright 2011 ACM 978-1-4503-0948-6/11/10 ...$10.00.

1. INTRODUCTION

The growth of e-commerce in the late 20th century in turn created value around the attention of individual Internet users, described crassly by Caldwell as “The Great Eyeball Race” [3]. Since then, virtually every medium of Internet interaction has been monetized via some form of advertising, including e-mail, Web sites, social networks and on-line games, but perhaps none as successfully as search. Today, the top Internet search engines are a primary means for connecting customers and sellers in a broad range of markets, either via “organic” search results or sponsored search placements, together comprising a $14B marketing sector [16].

Not surprisingly, the underlying value opportunities have created strong incentives to influence search results, a field called “search engine optimization” or SEO. Some of these techniques are benign and even encouraged by search engine operators (e.g., simplifying page content, optimizing load times, etc.) while others are designed specifically to manipulate page ranking algorithms without regard to customer interests (e.g., link farms, keyword stuffing, blog spamming, etc.). Thus, a cat and mouse game has emerged between search engine operators and scammers where search operators try to identify and root out pages deemed to use “black hat” optimization techniques or that host harmful content (e.g., phishing pages, malware sites, etc.) while scammers seek to elude such detection and create new pages faster than they can be removed.

In this conflict, one of the most potent tools is cloaking, a “bait-and-switch” technique used to hide the true nature of a Web site by delivering blatantly different content to different user segments. Typically a cloaker will serve “benign” content to search engine crawlers and scam content to normal visitors who are referred via a particular search request. By structuring the benign version of this content to correspond with popular search terms (a practice known as keyword stuffing), Web spammers aggressively acquire unwitting user traffic to their scam pages. Similarly, cloaking may be used to prevent compromised Web servers hosting such scam pages from being identified (i.e., by providing normal content to visitors who are not referred via the targeted search terms). In response to such activity, search engine providers attempt to detect cloaking activity and delist search results that point to such pages.

In this paper, we study the dynamics of this cloaking phenomenon and the response it engenders. We describe a system called Dagger, designed to harvest search result data and identify cloaking in near real-time. Using this infrastructure for over five months we make three primary contributions. First, we provide a contemporary picture of cloaking activity as seen through three popular search engines (Google, Bing and Yahoo) and document differences in how each is targeted. Second, we characterize the differences in cloaking behavior between sites found using undifferentiated “trending” keywords and those that appear in response to queries for targeted keywords (in particular for pharmaceutical products). Finally, we characterize the dynamic behavior of cloaking activity including the lifetime of cloaked pages and the responsiveness of search engines in removing results that point to such sites.

The remainder of this paper is structured as follows. Section 2 provides a technical background on cloaking and the related work we build upon, followed by a description of Dagger's design and implementation in Section 3. Section 4 describes our results, followed in Section 5 by a discussion of our overall findings.

2. BACKGROUND

The term “cloaking”, as applied to search engines, has an uncertain history, but dates to at least 1999 when it entered the vernacular of the emerging search engine optimization (SEO) market.¹ The growing role of search engines in directing Web traffic created strong incentives to reverse engineer search ranking algorithms and use this knowledge to “optimize” the content of pages being promoted and thus increase their rank in search result listings. However, since the most effective way to influence search rankings frequently required content vastly different from the page being promoted, this encouraged SEO firms to serve different sets of page content to search engine crawlers than to normal users; hence, cloaking.

¹ For example, in a September 2000 article in Search Engine Watch, Danny Sullivan comments that “every major performance-based SEO firm I know of does [use cloaking]” [19].

In the remainder of this section we provide a concrete example of how cloaking is used in practice, explain the different techniques used for visitor differentiation, and summarize the previous work that has informed our study.

2.1 An Example

As an example of cloaking, entering the query “bethenney frankel twitter” on Google on May 3, 2011 returned a set of search results with links related to Bethenney Frankel. (The appendix includes screenshots of this example, with Figure 11a showing a screenshot of the search results.) For the eighth result, the boldface content snippet returned by Google indeed indicates that the linked page includes the terms “bethenney frankel twitter” (in addition to other seemingly unrelated terms, a common indicator of keyword stuffing). Further, Google preview shows a snapshot of the page that the Google crawler sees. In this case it appears the link indeed leads to content with text and images related to “bethenney frankel”. Upon clicking on this link on a Windows machine, however, we are sent through a redirect chain and eventually arrive at a landing page (Figure 11b). This page is an example of the fraudulent anti-virus scams that have become increasingly popular [5, 15] and shares little with “bethenney frankel twitter” (indeed, neither the previous snippet nor any hint of the previewed snapshot appears).

Finally, if we revisit the search result link but disguise our identity to appear as a search crawler (in this case by modifying the User-Agent string in our HTTP request header) we get redirected using a 302 HTTP code to the benign root page of the site (Figure 11c), presumably to avoid possible suspicion. This case represents a classic example of IP cloaking. The site is detecting Google IP addresses, as evidenced by the snippet and preview, and providing an SEO-ed page with text and images related to “bethenney frankel twitter” that it deceives Google into indexing.

The core idea here is clear. The scammer is crafting content to reflect the transient popularity of particular terms (in this case due to Bethenney Frankel's sale of her cocktail line) and serving this content to search engine crawlers to capitalize on subsequent user queries, but then directs any user traffic to completely unrelated content intended to defraud them.

2.2 Types of Cloaking

For cloaking to work, the scammer must be able to distinguish between user segments based on some identifier visible to a Web server. The choice of identifier used is what distinguishes between cloaking techniques, which include Repeat Cloaking, User Agent Cloaking, Referrer Cloaking (sometimes also called “Click-through Cloaking”), and IP Cloaking.

In the case of Repeat Cloaking, the Web site stores state on either the client side (using a cookie) or the server side (e.g., tracking client IPs). This mechanism allows the site to determine whether the visitor has previously visited the site, and to use this knowledge in selecting which version of the page to return. Thus first-time visitors are given a glimpse of a scam, in the hopes of making a sale, but subsequent visits are presented with a benign page, stymieing reporting and crawlers (who routinely revisit pages).

In contrast, User Agent Cloaking uses the User-Agent field from the HTTP request header to classify HTTP clients as user browsers or search engine crawlers. User agent cloaking can be used for benign content presentation purposes (e.g., to provide unique content to Safari on an iPad vs. Firefox on Windows), but is routinely exploited by scammers to identify crawlers via the well-known User-Agent strings they advertise (e.g., Googlebot).

Referrer Cloaking takes the idea of examining HTTP headers even further by using the Referer field to determine which URL visitors clicked through to reach their site. Thus, scammers commonly only deliver a scam page to users that visit their site by first clicking through the search engine that has been targeted (e.g., by verifying that the Referer field is http://www.google.com). This technique has also been used, in combination with repeat cloaking and chains of Web site redirections, to create one-time-use URLs advertised in e-mail spam (to stymie security researchers). However, we restrict our focus to search engine cloaking in this paper.

Finally, one of the simplest mechanisms in use today is IP Cloaking, in which a scammer uses the IP address of the requester in determining the identity of the visitor. With an accurate mapping between IP addresses and organizations, a scammer can then easily distinguish all search engine requests and serve them benign content in a manner that is difficult to side-step. Indeed, the only clear way for search engine operators to mechanistically detect such cloaking is through acquiring fresh IP resources, but the signal of “delisting” seems to provide a clear path for efficiently mapping such address space even if it is initially unknown [1]. Although challenging to detect in principle since it would nominally require crawling from a Google IP address, in practice a crawler like Dagger can still detect the use of IP cloaking because cloakers still need to expose different versions of a page to different visitors (Section 3.3). As a result, we believe Dagger can detect all forms of
cloaking commonly used in the Web today.

2.3 Previous Work

The earliest study of cloaking we are aware of is due to Wu and Davison [24]. They first developed the now standard technique of crawling pages multiple times (using both user and crawler identifiers) and comparing the returned content. Using this approach, they refer to situations in which the content differs as “syntactic cloaking”, whereas the subset of these differences that are deemed to be driven by fraudulent intent (using some other classifier [25]) are termed “semantic cloaking”.

Chellapilla and Chickering used a similar detection framework to compare syntactic cloaking on the most popular and monetizable search terms [4]. In this study, monetizability corresponds to the amount of revenue generated from users clicking on sponsored ads returned with search results for a search term. Using logs from a search provider, they found that monetized search terms had a higher prevalence of cloaking (10%) than just popular terms (6%) across the top 5000 search terms. While we do not have access to logs of search provider revenue, we also demonstrate significant differences in cloaking prevalence as a function of how search terms are selected.

Up to this point, all studies had focused exclusively on user agent
cloaking. In 2006, Wang et al. extended this analysis to include referrer cloaking (called click-through cloaking by them), where pages only return cloaked content if accessed via the URL returned in search results [21]. Targeting a handful of suspicious IP address ranges, they found widespread referrer cloaking among the domains hosted there. In a related, much more extensive study, Wang et al. used this analysis in a focused study of redirection spam, a technique by which “doorway” Web pages redirect traffic to pages controlled by spammers [22]. Finally, Niu et al. performed a similar analysis incorporating referrer cloaking but focused exclusively on forum spamming [14].

These prior studies have used a range of different inputs in deciding whether multiple crawls of the same page are sufficiently different that cloaking has occurred. These include differences in the word distribution in the content of the two pages [4, 24], differences in the links in the page [24], differences in HTML tags [12] or differences in the chain of domain names in the redirection chain [21, 22]. A nice summary of these techniques as well as the different algorithms used to compare them is found in [12]. Our approach represents a combination of these techniques and some new ones, driven by a desire to support crawling and detection in near-real time (Section 3).

Finally, the use of cloaking is predominantly driven by so-called “black hat” search engine optimization (SEO) in which the perpetrators seek to attract traffic by using various methods to acquire higher ranking in search engine results. Two contemporaneous studies touch on different aspects of this behavior that are echoed in our own work. John et al. examine one very large scale “link farm” attack used to create high rankings for “trending” search terms and thus attract large amounts of undifferentiated traffic [8]. While this particular attack did not use cloaking per se (except implicitly in the use of JavaScript redirection) we encounter similar structures driving the popularity of cloaked sites in our measurements. Leontiadis et al. perform a broad study of SEO-advertised pharmaceutical sites (driven by focused searches on drug-related terms) and note that cloaking is commonly used by compromised sites and, in part, serves the purpose of “hiding” the existence of such sites from normal visitors [10]. We explore both kinds of search traffic streams in our analysis (driven by trending and focused search terms) and find significant differences between them.

In general, most studies of cloaking have focused on a single snapshot in time. Moreover, the prevalence of cloaking reported in all such studies is difficult to compare due to variations in detection approach, differences in how search terms are selected and changes in scammer behavior during different time periods. Thus, one of our primary contributions is in extending these previous efforts to examine the dynamics of cloaking over time, uncovering the fine-grained variability in cloaking, the lifetime of cloaked search results (bounding the response time of search engines to cloaked results) and the duration of the pages they point to. Ultimately, these dynamics in the ranking of cloaked sites drive the underlying economics of their attack.

3. METHODOLOGY

Dagger consists of five functional components: collecting search terms, fetching search results from search engines, crawling the pages linked from the search results, analyzing the pages crawled, and repeating measurements over time. In this section, we describe the design and implementation of each functional component, focusing on the goals and potential limitations.

3.1 Collecting Search Terms

The first challenge in data collection is building a meaningful test set for measuring cloaking. Since our goal is to understand the dynamics of scammers utilizing cloaking in search results, we want to target our data collection to the search terms that scammers also target rather than a random subset of the overall search space. In particular, we target two different kinds of cloaked search terms: those reflecting popular terms intended to gather high volumes of undifferentiated traffic, and terms reflecting highly targeted traffic where the cloaked content matches the cloaked search terms.

For our first set of search terms, as with previous work we seed our data collection with popular trending search terms. We also enhance this set by adding additional sources from social networks and the SEO community. Specifically, we collect popular search terms from Google Hot Searches, Alexa, and Twitter, which are publicly available and provide real-time updates to search trends at the granularity of an hour.² We extract the top 20 popular search trends via Google Hot Searches and Alexa, which reflect search engine and client-based data collection methods, respectively, while the 10 most popular search terms from Twitter add insight from social networking trends. These sources generally complement each other and extend our coverage of terms. We found that terms from Google Hot Searches only overlapped 3–8% with trending terms from both Twitter and Alexa. Note that, for trending terms, the page being cloaked is entirely unrelated to the search terms used to SEO the page. A user may search for a celebrity news item and encounter a search result that is a cloaked page selling fake anti-virus.

For our second set of search terms, we use a set of terms catering to a specific domain: pharmaceuticals. We gathered a generic set of pharmaceutical terms common to many spam-advertised online pharmacy sites, together with best-selling product terms from the most popular site [11]. Unlike trending search terms, the content of the cloaked pages actually matches the search terms. A user may search for “viagra” and encounter a cloaked page that leads to an online pharmacy site selling Viagra.

We construct another source of search terms using keyword suggestions from Google Suggest. Google Suggest is a search term autocomplete service that not only speeds up user input, but also allows users to explore similar long-tail queries. For example, when users enter “viagra 50mg”, they are prompted with suggestions such as “viagra 50mg cost” and “viagra 50mg canada”. Specifically, we submit search terms from Google Hot Searches and the online pharmacy site to Google Suggest and use the result to create dynamic feeds of search terms for trending and pharmaceutical searches, respectively. While investigating SEO community forums, we found various software packages and services using popular search term services as seeds for extending the terms they target using suggestion services. Combined with a suggestion service, each search term forms the basis of a cluster of related search terms from lists of suggestions [23]. The main attraction of a suggestion service is that it targets further down the tail of the search term distribution, resulting in less competition for the suggestion and a potentially more advantageous search result position. Moreover, long-tail queries typically retain the
same semantic meaning as the original search term seed. Furthermore, recent studies have shown superior conversion rates of long-tail queries [20].

² We originally considered using an even broader range of search term sources, in particular Yahoo Buzz, Ask IQ, AOL Hot Searches, eBay Pulse, Amazon Most Popular Tags, Flickr Popular Tags, and WordPress Hot Topics. Since we detected no more than a few cloaked results in multiple attempts over time, we concluded that scammers are not targeting these search terms and no longer considered those sources.

3.2 Querying Search Results

Dagger then submits the terms, every four hours for trending queries and every day for pharmaceutical queries, to the three most actively used search engines in the US: Google, Yahoo, and Bing. With results from multiple search engines, we can compare the prevalence of cloaking across engines and examine their response to cloaking. From the search results we extract the URLs for crawling as well as additional metadata such as the search result position and search result snippet.

At each measurement point, we start with the base set of search terms and use them to collect the top three search term suggestions from Google Suggest.³ For trending searches, Google Hot Searches and Alexa each supply 80 search terms every four-hour period, while Twitter supplies 40. Together with the 240 additional suggestions based on Google Hot Searches, our search term workload is 440 terms. Note that while overlap does occur within each four-hour period and even between periods, this overlap is simply an artifact of the search term sources and represents popularity as intended. For example, a term that spans multiple sources or multiple periods reflects the popularity of the term. For pharmaceutical queries, we always use a set of 230 terms composed of the original 74 manually-collected terms and 156 from Google Suggest.

Next, we submit the search terms and suggestions to the search engines and extract the top 100 search results and accompanying metadata. We assume that 100 search results, representing 10 search result pages, cover the maximum number of results the vast majority of users will encounter for a given query [17]. Using these results Dagger constructs an index of URLs and metadata, which serves as the foundation for the search space that it will crawl and analyze. At this
point, we use a whitelist to remove entries in the index based on regular expressions that match URLs from “known good” domains, such as http://www.google.com. This step reduces the number of false positives during data processing. In addition, whenever we update this list, we re-process previous results to further reduce the number of false positives. Afterwards, we group similar entries together. For example, two entries that share the same URL, source, and search term are combined into a single entry with the same information, except with a count of two instead of one to signify the quantity. As a result, for each search engine, instead of crawling 44,000 URLs for trending search results (440 search terms × 100 search results), on average Dagger crawls roughly 15,000 unique URLs in each measurement period.

³ We only use three suggestions to reduce the overall load on the system while still maintaining accuracy, as we found no significant difference in our results when using five or even ten suggestions.

3.3 Crawling Search Results

For each search engine, we crawl the URLs from the search results and process the fetched pages to detect cloaking in parallel to minimize any possible time-of-day effects.

Web crawler. The crawler incorporates a multithreaded Java Web crawler using the HttpClient 3.x package from Apache. While this package does not handle JavaScript redirects or other forms of client-side scripting, it does provide many useful features, such as handling HTTP 3xx redirects, enabling HTTP header modification, timeouts, and error handling with retries. As a result, the crawler can robustly fetch pages using various identities.

Multiple crawls. For each URL we crawl the site three times, although only the first two are required for analysis. We begin disguised as a normal search user visiting the site, clicking through the search result using Internet Explorer on Windows. Then we visit the site disguised as the Googlebot Web crawler. These crawls download the views of the page content typically returned to users and crawlers, respectively. Finally, we visit the site for a third time as a user who does not click through the search result, to download the view of the page presented to the site owner. As with previous approaches, we disguise ourselves by setting the User-Agent and the Referer fields in the HTTP request header. This approach ensures that we can detect any combination of user-agent cloaking and referrer cloaking. Moreover, our third crawl allows us to detect pure user-agent cloaking without any checks on the referrer. We found that roughly 35% of cloaked search results for a single measurement perform pure user-agent cloaking. For the most part, these sites are not malicious but many are involved in black-hat SEO operations. In contrast, pages that employ both user-agent and referrer cloaking are nearly always malicious (Section 4.5).

IP cloaking. Past studies on cloaking have not dealt with IP address cloaking, and the methodology we use is no different. However, because the emphasis of our study is in detecting the situation where cloaking is used as an SEO technique in scams, we do not expect to encounter problems caused by IP cloaking. In our scenario, the cloaker must return the scam page to the user to potentially monetize the visit. And the cloaker must return the SEO-ed page to the search crawler for the page to be indexed and rank well. Even if the cloaker could detect that we are not a real crawler, they have few choices for the page to return to our imitation crawler. If they return the scam page, they are potentially leaving themselves open to security crawlers or the site owner. If they return the SEO-ed page, then there is no point in identifying the real crawler. And if they return a benign page, such as the root of the site, then Dagger will still detect the cloaking because the user visit received the scam page, which is noticeably different from the crawler visit. In other words, although Dagger may not obtain the version of the page that the Google crawler sees, Dagger is still able to detect that the page is being cloaked (see Appendix A for an illustrated example).

To confirm this hypothesis, we took a sample of 20K cloaked URLs returned from querying trending search terms. We then crawled those URLs using the above methodology (three crawls, each with different User-Agent and Referer fields). In addition, we performed a fourth crawl using Google Translate, which visits a URL using a Google IP address and will fool reverse DNS lookups into believing the visit is originating from Google's IP address space. From this one experiment, we found more than half of current cloaked search results do in fact employ IP cloaking via reverse DNS lookups, yet in every case they were detected by Dagger because of the scenario described above.

3.4 Detecting Cloaking

We process the crawled data using multiple iterative passes where we apply various transformations and analyses to compile the information needed to detect cloaking. Each pass uses a comparison-based approach: we apply the same transformations to the views of the same URL, as seen from the user and the crawler, and directly compare the results of the transformation using a scoring function to quantify the delta between the two views. In the end, we perform thresholding on the result to detect pages that are actively cloaking and annotate them for later analysis.

While some of the comparison-based mechanisms we use to detect cloaking are inspired by previous work, a key constraint is our real-time requirement for repeatedly searching and crawling to uncover the time dynamics of cloaking. As a result, we cannot use a single snapshot of data, and we avoided intensive offline training for machine learning classifiers [4, 12, 24, 25]. We also avoided running client-side scripts, which would add potentially unbounded latency to our data collection process. Consequently, we do not directly measure all forms of redirection, although we do capture the same end result: a difference in the semantic content of the same URL [21, 22]. Since we continuously remeasure over time, manual inspection is not scalable outside of a couple of snapshots [14]. Moreover, even an insightful mechanism that compares the structure of two views using HTML tags, to limit the effects of dynamic content [12], must be applied cautiously as doing so requires significant processing overhead.

The algorithm begins by removing any entries where either the user or crawler page encountered an error during crawling (a non-200 HTTP status code, connection error, TCP error, etc.); on average, 4% of crawled URLs fall into this category.

At this point, the remaining entries represent the candidate set of pages that the algorithm will analyze for detecting cloaking. To start, the detection algorithm filters out nearly identical pages using text shingling [2], which hashes substrings in each page to construct signatures of the content. The fraction of signatures shared by the two views is an excellent measure of similarity, as we find nearly 90% of crawled URLs are near duplicates between the multiple crawls as a user and as a crawler. From experimentation, we found that a difference of 10% or less in sets of signatures signifies nearly identical content. We remove such pages from the candidate set.

From this reduced set, we make another pass that measures the similarity between the snippet of the search result and the user view of the page. The snippet is an excerpt from the page content obtained by search engines, composed from sections of the page relevant to the original query, that search engines display to the user as part of the search result. In effect, the snippet represents ground truth about the content seen by the crawler. Often users examine the snippet to help determine whether to follow the link in the search result. Therefore, we argue that the user has an implicit expectation that the page content should resemble the snippet.

We evaluate snippet inclusion by first removing noise (character case, punctuation, HTML tags, gratuitous whitespace, etc.) from both the snippet and the body of the user view. Then, we search the content of the user view page for each substring from the snippet; the substrings are delimited by the character sequence ‘...’ (provided in the snippet to identify non-contiguous text in the crawled page). We then compute a score as the number of words from unmatched substrings divided by the total number of words from all substrings. The substring match identifies similar content, while the use of the number of words in the substring quantifies this result. An upper bound score of 1.0 means that no part of the snippet matched, and hence the user view differs from the page content originally seen by the search engine; a lower bound score of 0.0 means that the entire snippet matched, and hence the user view fulfills the expectation of the user. We use a threshold of 0.33, meaning that we filter out entries from the candidate set whose user view does not differ by more than two-thirds from the snippet. We chose this threshold due to the abundance of snippets seen with three distinct substrings, and 0.33 signifies that the majority of the content differs between the snippet and user view. In practice, this step filters 56% of the remaining candidate URLs.

At this point, we know (1) that the views are different in terms of unstructured text, and (2) that the user view does not resemble the snippet content. The possibility still exists, however, that the page is not cloaked. The page could have sufficiently frequent updates (or may rotate between content choices) that the snippet mismatch is misleading. Therefore, as a final test we examine the page structures of the views via their DOMs as in [12]. The key insight for the effectiveness of this approach comes from the fact that, while page content may change frequently, as in blogs, it is far less likely for the page structure to change dramatically.

We compare the page structure by first removing any content that is not part of a whitelist of HTML structural tags, while also attempting to fix any errors, such as missing closing tags, along the way. We compute another score as the sum of an overall comparison and a hierarchical comparison. In the overall comparison, we calculate the ratio of unmatched tags from the entire page divided by the total number of tags. In the hierarchical comparison, we calculate the ratio of the sum of the unmatched tags from each level of the DOM hierarchy divided by the total number of tags. We use these two metrics to allow the possibility of a slight hierarchy change, while leaving the content fairly similar. An upper bound score of 2.0 means that the DOMs failed to match any tags, whereas a lower bound score of 0.0 means that both the tags and hierarchy matched. We use a threshold of 0.66 in this step, which means that we label a page as cloaked only when the combination of tags and hierarchy differs by a third between the structure of the user and crawler views. We chose this threshold from experimentation that showed the large majority of cloaked pages scored over 1.0. Once we detect an entry as actively cloaking, we annotate the entry in the index for later processing.

When using any detection algorithm, we must consider the rate of false positives and false negatives as a sign of accuracy and success. Because it is infeasible to manually inspect all results, we provide estimates based on sampling and manual inspection. For Google search results, we found 9.1% (29 of 317) of cloaked pages were false positives, meaning that we labeled the search result as cloaking, but it is benign; for Yahoo, we found 12% (9 of 75) of cloaked pages were false positives. It is worth noting that although we labeled benign content as cloaking, these pages are technically delivering different data to users and crawlers. If we consider false positives to mean that we labeled the search result as cloaking when it is not, then there are no false positives in either case. In terms of false negatives, when manually browsing collections of search results we found that Dagger detected cloaked redirection pages the majority of the time. The one case where we fail is when the site employs advanced browser detection to prevent us from fetching the browser view, but we have only encountered this case a handful of times.

3.5 Temporal Remeasurement

The basic measurement component captures data related to cloaking at one point in time. However, because we want to study temporal questions such as the lifetime of cloaked pages in search results, Dagger implements a temporal component to fetch search results from search engines and crawl and process URLs at later points in time. In the experiments in this paper, Dagger remeasures every four hours up to 24 hours, and then every day for up to seven days after the original measurement.

The temporal component performs the same basic data collection and processing steps as discussed in the previous components. To measure the rate at which search engines respond to cloaking, we fetch results using the original search term set and construct a new index from the results that will capture any churn in search results since the original measurement. Then we analyze the churn by searching for any entry with a URL that matches the set of cloaked pages originally identified and labeled. Note that there still exists the possibility that for every cloaked page removed from the new index, another cloaked page, which originally was not a part of the index, could have taken its place. Therefore, this process does not measure how clean the search results are at a given time, just whether the original cloaked pages still appear in the search results.

To measure the duration pages are cloaked, the temporal component selects the cloaked URLs from the original index. It then performs the measurement process again, visiting the pages as both a user and a crawler, and applying the detection algorithm to the results. There still exists the possibility that pages perform cloaking at random times rather than continuously, which we might not detect. However, we consider these situations unlikely as spammers need sufficient volume to turn a profit and hence cloaking continuously will result in far greater traffic than cloaking at random.

[Figure 1: Prevalence of cloaked search results in Google, Yahoo, and Bing over time for trending and pharmaceutical searches. Panels: (a) Trends, (b) Pharmacy. Each panel plots the percentage of cloaked search results against date (03/01 through 08/01) for Google, Yahoo, and Bing.]

4. RESULTS

This section presents our findings from using Dagger to study cloaking in trending and pharmaceutical search results in Google, Yahoo, and Bing. We use Dagger to collect search results every four hours for five months, from March 1, 2011 through August 1, 2011, crawling over 47 million search results. We examine the prevalence of cloaking in search results over time, how cloaking correlates to the various search terms we use, how search engines respond to cloaking, and how quickly cloaked pages are taken down. We also broadly characterize the content of cloaked pages, the DNS domains with cloaked pages, and how well cloaked pages perform from an SEO perspective. Where informative, we note how trending and pharmaceutical cloaking characteristics contrast, and also comment on the results of cloaking that we observe compared with results from previous studies.

4.1 Cloaking Over Time

Figure 1a shows the prevalence of cloaking over time for trending search results returned from each search engine. We show the percentage of the cloaked search results averaged across all searches made each day. Recall from Section 3.3 that we crawl the top 100 search results every four hours for 183 unique trending search terms (on average) collected from Google Hot Searches, Google Suggest, Twitter, and Alexa, resulting on average in 13,947 unique URLs to crawl after de-duping and whitelisting. Although we crawl every four hours, we report the prevalence of cloaking at the granularity of a day for clarity (we did not see any interesting time-of-day effects in the results). For example, when cloaking in Google search results peaked at 2.2% on April 12, 2011, we found that 2,094 out of 95,974 search results that day were cloaked.

Initially, through May 4th, we see the same trend for the prevalence of cloaked search results among search engines: Google and Yahoo have nearly the same amount of cloaked results (on average 1.11% on Google and 0.84% on Yahoo), whereas Bing has 3–4× fewer (just 0.25%). One explanation is that Bing is much better at detecting and thwarting cloaking, but we find evidence that cloakers simply do not appear to target Bing nearly as heavily as Google and Yahoo. For instance, cloaked results often point to link farms, a form of SEO involving collections of interlinked pages used to boost page rank for sets of search terms. For large-scale link farms that we have tracked over time, we consistently find them in Google and Yahoo results, but not in Bing results.

Similarly, Figure 1b shows the prevalence of cloaking over time when searching for pharmaceutical terms. We crawl the top 100 search results daily for 230 unique pharmaceutical search terms collected from a popular spam-advertised affiliate program, further extended with Google Suggest, resulting in 13,646 unique URLs to crawl after de-duping and whitelisting. (Note that the gap in results in the first week of June corresponds to a period when our crawler was offline.) Across all days, we see the same relative ranking of search engines in terms of cloaking prevalence, but with overall larger quantities of cloaked results for the same respective time ranges: on average 9.4% of results were cloaked on Google, 7.7% on Yahoo, and 4.0% on Bing.

The difference in quantities of cloaked results for trending and pharmaceutical terms reflec
tsthedifferencesbetweenthesetwotypesofsearches.Intrendingsearchesthetermsconstantlychange,withpopularitybeingtheoneconstant.Thisdynamicallowscloak-erstotargetmanymoresearchtermsandabroaddemographicofpotentialvictims—anyonebydenitionsearchingusingapopularsearchterm—atthecostoflimitedtimetoperformtheSEOneededtorankcloakedpageshighlyinthesearchresults.Incontrast,phar-maceuticalsearchtermsarestaticandrepresentproductsearchesinaveryspecicdomain.CloakersasaresulthavemuchmoretimetoperformSEOtoraisetherankoftheircloakedpages,resultinginmorecloakedpagesinthetopresults.Note,though,thatthesetargetedsearchtermslimittherangeofpotentialvictimstojustuserssearchinginthisnarrowproductdomain.Section4.7furtherexplorestheeffectsofSEOoncloakedresults.Lookingattrendsovertime,cloakerswereinitiallyslightlymoresuccessfulonYahoothanGooglefortrendingsearchterms,forin-stance.However,fromApril1throughMay4th,wefoundaclearshiftintheprevalenceofcloakedsearchresultsbetweensearchengineswithanincreaseinGoogle(1.2%onaverage)andade-creaseinYahoo(0.57%).WesuspectthisisduetocloakersfurtherconcentratingtheireffortsatGoogle(e.g.,weuncoverednewlinkfarmsperformingreverseDNScloakingfortheGoogleIPrange).Inaddition,wesawsubstantialuctuationincloakingfromdaytoday.Weattributethevariationtotheadversarialrelationshipbetweencloakersandsearchengines.CloakersperformblackhatSEOtoarticallyboosttherankingsoftheircloakedpages.Searchenginesrenetheirdefensivetechniquestodetectcloakingeitherdirectly(analyzingpages)orindirectly(updatingtherankingal-gorithm).Weinterpretourmeasurementsatthesetimescalesas 0 0.5 1 1.5 2 2.5 3 3.503/0104/0105/0106/0107/0108/01Percentage of Cloaked Search ResultsDateGoogle Hot SearchesGoogle SuggestTwitterAlexa(a)Google 0 0.5 1 1.5 2 2.5 3 3.5 403/0104/0105/0106/0107/0108/01Percentage of Cloaked Search ResultsDateGoogle Hot SearchesGoogle 
SuggestTwitterAlexa(b)YahooFigure2:Prevalenceofcloakedsearchresultsovertimeassociatedwitheachsourceoftrendingsearchterms.SearchTerm%CloakedPagesviagra50mgcanada61.2%viagra25mgonline48.5%viagra50mgonline41.8%cialis100mg40.4%genericcialis100mg37.7%cialis100mgpills37.4%cialis100mgdosage36.4%cialis10mgprice36.2%viagra100mgprices34.3%viagra100mgpricewalmart32.7%Table1:Top10pharmaceuticalsearchtermswiththehighestper-centageofcloakedsearchresults,sortedindecreasingorder.simplyobservingthestrugglebetweenthetwosides.Finally,wenotethattheabsoluteamountofcloakingwendislessthansomepreviousstudies,butsuchcomparisonsaredifculttointerpetsincecloakingresultsfundamentallydependuponthesearchtermsused.4.2SourcesofSearchTermsCloakerstryingtobroadlyattractasmanyvisitorsaspossibletargettrendingpopularsearches.Sinceweusedavarietyofsourcesforsearchterms,wecanlookathowtheprevalenceofcloakingcorrelateswithsearchtermselection.Figures2aand2bshowtheaverageprevalenceofcloakingforeachsourceonsearchresultsreturnedfromGoogleandYahoo,respectively,fortrendingsearches;wedonotpresentresultsfromBingduetotheoveralllackofcloaking.SimilartoFigure1,eachpointshowsthepercentageofcloakedlinksinthetop100searchresults.Here,though,eachpointshowstheaveragepercentageofcloakedresultsforaparticularsource,whichnormalizestheresultsindependentofthenumberofsearchtermswecrawledfromeachsource.(Becausedifferentsourcesprovideddifferentnumbersofsearchterms,thepercentagesdonotsumtotheoverallpercentagesinFigure1.)Fromthegraphs,weseethat,throughMay4th,usingsearchtermsfromGoogleSuggest,seededinitiallyfromGoogleHotSea-rches,uncoversthemostcloaking.ForGooglesearchresults,aver-agedacrossthedays,GoogleSuggestreturns3.5asmanycloakedsearchresultsasGoogleHotSearchesalone,2.6asTwitter,and3.1asAlexa.Similarly,evenwhenusingYahoo,GoogleSug- 0 10 20 30 40 50 60 70 0 50 100 150 200 250Percentage of Cloaked Search ResultsUnique Search 
TermGoogleYahooFigure3:Distributionofpercentageofcloakedsearchresultsforpharmaceuticalsearchterms,sortedindecreasingorder.gestreturns3.1asmanycloakedsearchresultsasGoogleHotSearchesalone,2.6asTwitter,and2.7asAlexa.AsdiscussedinSection3.1,cloakerstargetingpopularsearchtermsfacestiffSEOcompetitionfromothers(bothlegitimateandillegitimate)alsotargetingthosesameterms.Byaugmentingpopularsearchtermswithsuggestions,cloakersareabletotargetthesamesemantictopicaspopularsearchterms.Yet,becausethesuggestionises-sentiallyanautocomplete,itpossessesthelong-tailsearchbenetsofreducedcompetitionwhileremainingsemanticallyrelevant.Theaboveresultsdemonstratethattheprevalenceofcloakinginsearchresultsishighlyinuencedbythesearchterms.Asanotherperspective,foreachmeasurementperiodthatcrawlsthesearchtermsatagivenpointintime,wecancountthenumberofcloakedresultsreturnedforeachsearchterm.Averagingacrossallmea-surementperiods,23%and14%ofthesearchtermsaccountedfor80%ofthecloakedresultsfromGoogleandYahoo,respectively.Forreference,Table1liststheresultsforthetop10searchtermsonGoogleandFigure3showsthedistributionofthepercentageofcloakedsearchresultsforpharmaceuticalsearchterms.Thequery“viagra50mgcanada”isthepharmaceuticaltermwiththelargestpercentageofcloakedsearchresultsonGooglewith61%.Yet,themedianquery“tramadol50mg”containsonly7%ofcloakedsearchresults.Notethatthepercentagessumtomuchmorethan100%sincedifferentsearchtermscanreturnlinkstothesamecloaked 0 20 40 60 80 100 0 1 2 3 4 5 6 7 8Percentage Remaining in Google Search ResultsTime Delta (Days)overallcloaked(a)Google 0 20 40 60 80 100 0 1 2 3 4 5 6 7 8Percentage Remaining in Yahoo Search ResultsTime Delta (Days)overallcloaked(b)YahooFigure4:Churninthetop100cloakedsearchresultsandoverallsearchresultsfortrendingsearchterms. 
0 20 40 60 80 100 0 5 10 15 20 25 30Percentage Remaining in Google Search ResultsTime Delta (Days)overallcloaked(a)Google 0 20 40 60 80 100 0 5 10 15 20 25 30Percentage Remaining in Yahoo Search ResultsTime Delta (Days)overallcloaked(b)YahooFigure5:Churninthetop100cloakedsearchresultsandoverallsearchresultsforpharmaceuticalsearchterms.pages.AsanexampleinFigure3,thesixthpointshowstheaver-agepercentageofcloakedsearchresults,acrossallmeasurements,forthesearchtermwiththesixthhighestpercentageofcloakedsearchresults.Weplottherst10pointswiththemostsignicantpercentageofcloakedsearchresults,thenplotevery10thsearchterm,forclarity.Fromtheseresultsweseehighvarianceinthepercentageofcloakedsearchresults.4.3SearchEngineResponseNextweexaminehowlongcloakedpagesremaininsearchre-sultsaftertheyrstappear.Foravarietyofreasons,searchenginestrytoidentifyandthwartcloaking.Althoughwehavelittleinsightintothetechniquesusedbysearchenginestoidentifycloaking,4wecanstillobservetheexternaleffectsofsuchtechniquesinpractice.Weconsidercloakedsearchresultstohavebeeneffectively“cle-aned”bysearchengineswhenthecloakedsearchresultnolongerappearsinthetop100results.Ofcourse,thisindicatormaynotbedirectlyduetothepagehavingbeencloaked.Thesearchenginerankingalgorithmscouldhaveadjustedthepositionsofcloakedpagesovertimeduetootherfactors,e.g.,theSEOtechniquesusedbycloakersmayturnouttobeusefulonlyintheshortterm.Eitherway,inthiscaseweconsiderthecloakedpagesasnolongerbeingeffectiveatmeetingthegoalsofthecloakers.4Googlehasapatentinthearea[13],butwehavenotseenevidenceofsuchaclient-assistedapproachusedinpractice.Tomeasurethelifetimeofcloakedsearchresults,weperformre-peatedsearchqueriesforeverysearchtermovertime(Section3.5).Wethenexamineeachnewsetofsearchresultstolookforthecloakedresultsweoriginallydetected.Thelatersearchresultswillcontainanyupdates,includingremovalsandsuppressions,thatsearchengineshavemadesincethetimeoftheinitialmeasurement.Toestablishabaselinewealsomeasurethelifetimeofallourorig-inalsearchresults,cloake
dornot.Thisbaselineallowsustodif-ferentiateanychurnthatoccursnaturallywiththoseattributableto“cleaning”.Weperformtheserepeatedsearchesoneachtermeveryfourhoursupto24hoursandtheneverydayuptosevendays.Figures4aand4bshowthelifetimeofcloakedandoverallsearchresultsforGoogleandYahoofortrendingsearches.Eachpointshowstheaveragepercentageofsearchresultsthatremaininthetop100forthesamesearchtermsovertime.Theerrorbarsdenotethestandarddevationacrossallsearches,andweplotthepointsfor“cloaked”slightlyoff-centertobetterdistinguisherrorbarsondifferentcurves.Theresults,forbothsearchengines,showthatcloakedsearchresultsrapidlybegintofalloutofthetop100withintherstday,withamoregradualdropthereafter.Incontrast,searchresultsingeneralhavesimilartrends,butdeclinemoregradually.ForGoogle,nearly40%ofcloakedsearchresultshavealifetimeofadayorless,andoverthenextsixdaysonlyanadditional25%dropfromthetop100results.Incontrast,forthebaselineonly20%ofoverallsearchresultshavealifetimeofadayofless,and 1 1.5 2 2.5 3 3.5 4 4.5 5 0 1 2 3 4 5 6 7Proportional Increase of Harmful Search ResultsTime Delta 
(Days)Figure6:IncreaseinharmfultrendingsearchresultsovertimeonGoogleaslabeledbyGoogleSafeBrowsing.anadditional15%arecleanedafterthenextsixdays.Yahooex-hibitsasimilartrend,althoughwithlessrapidchurnandwithasmallerseparationbetweencloakedandthebaseline(perhapsre-ectingdifferencesinhowthetwosearchenginesdealwithcloak-ing).Overall,though,whilecloakedpagesdoregularlyappearinsearchresults,manyareremovedorsuppressedbythesearchen-gineswithinhourstoaday.Figures5aand5bshowsimilarresultsforpharmaceuticalsearches.Notethatthemaximumtimedeltais30daysbecause,unliketrend-ingterms,thepharmacysearchtermsdonotchangethroughoutthedurationofourexperimentandwehavealargerwindowofob-servation.Whilewestillseesimilartrends,wherecloakedsearchresultsdropmorerapidlythanthechurnrateandGooglechurnsmorethanYahoo,theresponseforbothGoogleandYahooisslowerforpharmaceuticaltermsthanfortrendingterms.Forexample,whereas45%and25%ofcloakedtrendingsearchresultswere“cleaned”forGoogleandYahoo,respectively,withintwodays,only30%and10%ofcloakedpharmacysearchresultswere“cleaned”forGoogleandYahoo,respectively.Asanotherperspectiveon“cleaning”,GoogleSafeBrowsing[7]isamechanismforshieldingusersbylabelingsearchresultsthatleadtophishingandmalwarepagesas“harmful”.Theseharm-fulsearchresultssometimesemploycloaking,whichGoogleSafeBrowsingisabletodetectandbypass.ThisinsightsuggeststhattheratethatGoogleisabletodiscoverandlabel“harmful”searchresultscorrelateswiththerateatwhichtheycandetectcloaking.WecanmeasurethisSafeBrowsingdetectionbyrepeatedlyquery-ingforthesametermsasdescribedinSection3.5andcountingthenumberof“harmful”searchresults.AsobservedinSection4.2,theprevalenceofcloakingisvolatileanddependsheavilyonthespecicsearchterms.Theprevalenceofdetectedharmfulpagesissimilarlyvolatile;although37%oftheresultsonaverageonGooglearemarkedasharmfulforthetermswesearchfor,thereissubstantialvarianceacrossterms.Therefore,wenormalizethechangeovertimeinthenumberofharmfulsearchresultslabeledbyGoogleSafeBrowsingrelativetotherstmea-surement.Fi
gure6showstheaveragenormalizedchangeinthenumberofharmfullabelsovertime,acrossallqueriesontrendingsearchterms.Thenumberofharmfullabelsincreasesrapidlyfortherstday,withnearly2.5morelabelsthantheoriginalmea-surement,andthenincreasessteadilyovertheremainingsixdays,wheretherearenearly5morelabelsthantheoriginalquery.Thisbehaviormirrorstheresultsoncleaningabove. 0 20 40 60 80 100 0 1 2 3 4 5 6 7 8Percentage of Detected Remaining CloakedTime Delta (Days)GoogleYahooFigure7:Durationpagesarecloaked.4.4CloakingDurationCloakerswilloftensubvertexistingpagesasanSEOtechniquetocapitalizeonthealreadyestablishedgoodreputationofthosepageswithsearchengines.Wehaveseenthatthesearchenginesrespondrelativelyquicklytohavingcloakedpagesinsearchre-sults.Nextweexaminehowlonguntilcloakedpagesarenolongercloaked,eitherbecausecloakersdecidedtostopcloakingorbe-causeasubvertedcloakedpagewasdiscoveredandxedbytheoriginalowner.Tomeasurethedurationthatpagesarecloaked,werepeatedlycrawleverycloakedpagethatwendovertime,independentofwhetherthepagecontinuestoappearinthetop100searchresults.Wethenapplythecloakingdetectionalgorithmonthepage(Sec-tion3.4),andrecordwhenitisnolongercloaked.AsinSection4.3,wecrawleachpageeveryfourhoursupto24hoursandtheneverydayuptosevendays.Figure7showsthetimedurationsthatpagesarecloakedinre-sultsreturnedbyGoogleandYahoo.Eachpointshowstheper-centageofallcloakedpagesforeachmeasurementperiodthatre-maincloakedovertime,andtheerrorbarsshowthestandardde-viationsacrossmeasurementperiods.Weseethatcloakedpageshavesimilardurationsforbothsearchengines:cloakersmanagetheirpagessimilarlyindependentofthesearchengine.Further,pagesarecloakedforlongdurations:over80%remaincloakedpastsevendays.Thisresultisnotverysurprisinggiventhatcloak-ershavelittleincentivetostopcloakingapage.Cloakerswillwanttomaximizethetimethattheymightbenetfromhavingapagecloakedbyattractingcustomerstoscamsites,orvictimstomal-waresites.Further,itisdifcultforthemtorecycleacloakedpagetoreuseatalatertime.BeingblacklistedbyGoogleSafeBrows
-ing,forinstance,requiresmanualinterventiontoregainapositivereputation.Andforthosecloakedpagesthatweresubverted,bydenitionitisdifcultfortheoriginalpageownerstodetectthattheirpagehasbeensubverted.Onlyiftheoriginalpageownersaccesstheirpageasasearchresultlinkwilltheyrealizethattheirpagehasbeensubverted;accessingitanyotherwaywillreturntheoriginalcontentsthattheyexpect.4.5CloakedContentSincethemaingoalofcloakingasanSEOtechniqueistoobtainusertrafc,itisnaturaltowonderwherethetrafcisheading.Bylookingatthekindofcontentdeliveredtotheuserfromcloakedsearchresults,notonlydoesitsuggestwhycloakingisnecessary Category%CloakedPagesTrafcSale81.5%Error7.3%Legitimate3.5%Software2.2%SEO-edBusiness2.0%PPC1.3%Fake-AV1.2%CPALead0.6%Insurance0.3%Linkfarm0.1%Table2:Breakdownofcloakedcontentformanually-inspectedcloakedsearchresultsfromGooglefortrendingsearchterms.Notethat“TrafcSale”pagesarethestartofredirectionchainsthattyp-icallyleadtoFake-AV,CPALead,andPPClandingpages.forhidingsuchcontent,butitalsorevealsthemotivescloakershaveinattractingusers.Wehavenofullyautomatedmeansforidentifyingthecontentbehindcloakedsearchresults.Instead,weclustercloakedsearchresultswiththeexactsameDOMstructureofthepagesasseenbytheuserwhenclickingonasearchresult.WeperformtheclusteringonallcloakedsearchresultsfromGoogleacrossallmeasurementpointsfortrendingsearches.Toformarepresentativesetofcloakedpagesforeachcluster,weselectahandfulofsearchresultsfromvariousmeasurementtimes(weekday,weekend,dayime,morning,etc.)andwithvariousURLcharacteristics.Wethenmanuallylabelpagesinthisrepresentativesettoclassifythecontentofthepagesbeingcloaked.Wemanuallylabelthecontentofeachclusterintooneoftencategories:trafcsales,pay-per-click(PPC),software,insurance,Fake-AV,CPALead,5linkfarm,SEO-edbusiness,error,andlegiti-mate.Trafcsalesarecloakedsearchresultswiththesolepurposeofredirectingusersthroughachainofadvertisingnetworks,mainlyusingJavaScript,beforearrivingatanallandingpage.Althoughweareunabletofollowthemsystematically,frommanuallyexam-in
ingthousandsoftrafcsales,weobservedthesesearchresultsdirectingusersprimarilytoFake-AV,CPALead,andPPCpages.Occasionallycloakedsearchresultsdonotfunnelusersthrougharedirectionchain,whichishowweareabletoclassifythePPC,software,insurance,Fake-AV,andCPALeadsets.Thelinkfarmsetcontainbenignpagesthatprovidemanybacklinkstoboosttherankingsofbeneciarypages.TheSEO-edbusinessreferstobusi-nessesthatemployblack-hatSEOtechniques,suchasutilizingfreehostingtospamalargesetofsearchresultsforasingleterm.Theerrorsarepagesthathaveclientsiderequirementswewereunabletomeet,i.e.,havinganAdobeFlashplugin.Finally,thelegitimatesetreferstopagesthatdisplaynomaliciousbehaviorbutwerela-beledascloakingduetodeliveringdifferingcontent,asisthecasewhensitesrequireuserstologinbeforeaccessingthecontent.Table2showsthebreakdownofcloakedsearchresultsaftermanuallyinspectingthetop62clusters,outof7671total,whichweresortedindecreasingorderofclustersize.These62clustersaccountfor61%ofallcloakedsearchresultsfoundinGooglefortrendingsearchesacrossallmeasurementpoints.Fromthis,weseethatabouthalfofthetimeacloakedsearchresultleadstosomeformofabuse.Further,over49%ofthetime,cloakedsearchresults5Cost-per-actionpagesthataskausertotakesomeaction,suchasllingoutaform,thatwillgenerateadvertisingrevenue. 
0 20 40 60 80 10002/2603/1203/2604/0904/2305/0705/2106/0406/1807/0207/16Percentage of Cloaked Search ResultsDateredirectlinkfarmweakerrormiscFigure8:ProportionaldistributionofcloakedsearchresultsinGoogleovertimefortrendingsearches.sellusertrafcthroughadvertisingnetworks,whichwilleventuallyleadtoFake-AV,CPALeads,orPPC.Interestingly,theDOMstructureofthelargestcluster,whichaloneacountedfor34%ofcloakedsearchresults,wasasingleJavaScriptsnippetthatperformsredirectionaspartoftrafcsales.Otherlargeclustersthataccountedfor1–3%ofcloakedsearchre-sultsalsoconsistprimarilyofJavaScriptthatperformsredirectionaspartoftrafcsales.Despitethefactthattheclusterswehaveexaminedaccountfor61%ofallcloakedsearchresults,therestillexists39%thathavenotbeencategorizedandlikelydonotsharethesamedistribution.Whileincrementallyclustering,wenotedthatthetopclustersgrewlargerandlargerasmoreandmoredatawasadded.Thissuggeststhepresenceoflong-termSEOcampaigns,asrepresentedbythetopclusters,thatconstantlychangethesearchtermstheyaretar-getingandthehoststheyareusing.Therefore,sincetheuncatego-rizedsearchresultsfallwithinthelongtail,theyareunlikelytobeactivelyinvolvedindirecttrafcsales.Instead,wespeculatethattheyfallinthelinkfarmorlegitimatesetsgiventhatthosegroupshavethemostdifculttimeinforminglargeclustersbecausetheyarenotbeingSEO-edasheavilyacrosssearchterms.Thekindsofpagesbeingcloakedisalsodynamicovertime.Fig-ure8showstheclassicationofcloakedpagecontentforsearchresultsfromGoogleusingtrendingterms,fromMarch1,2011throughJuly15,2011.WeclassifytheHTMLofcloakedpages,usingthelesizeandsubstringsasfeatures,intooneofthefol-lowingcategories:Linkfarm,Redirect,Error,Weak,orMisc.TheLinkfarmcategoryrepresentscontentreturnedtoour“Googlebot”crawlerthatcontainsmanyoutboundlinkshiddenusingCSSprop-erties.TheRedirectcategoryrepresentscontentreturnedtoauserthatissmallerthan4KBandcontainsJavaScriptcodeforredirec-tion,oranHTMLmetarefresh.TheErrorcategoryrepresentsusercontentthatissmallerthan4KBandcontainsablankpageortypi-calerrorme
ssages.TheWeakcategorycontainsusercontentbelow4KBinlesizenotalreadyclassied;similarly,theMisccategorycontainsusercontentlargerthan4KBnotalreadyclassied.Asanexample,onMarch7thapproximately7%ofthecloakedcontentdetectedwerelinkfarms,53%wereredirects,6%wereerrors,3%wereweakand31%weremisc.Lookingatthetrendsovertimerevealsthedynamicnatureofthecontentbeinghiddenbycloaking.Inparticular,wesawasurgeinredirectsfromMarch15thtoJune5th.Duringthisperiod,theaveragedistributionofredirectsperdayincreasedfrom11.4%to the rest*.cc*.tv*.fr*.edu*.ca*.pl*.de*.info*.net*.org*.com70%60%50%40%30%20%10%0%Googleahoo%-age of Cloaked Search Results, Trends0%10%20%30%40%50%60%70%%-age of Cloaked Search Results, PharmacyFigure9:HistogramofthemostfrequentlyoccuringTLDsamongcloakedsearchresults.41.3%andlaterdroppedoffto8.7%.Interestingly,asredirectsbe-gintofalloff,weseeacorrespondingincreaseinerrors.Duringthehighperiodofredirects,errorsrepresented8.0%oftheaveragedistribution,butafterwardsrepresented24.3%.Oneexplanationofthiscorrelationisthattheinfrastructuresupportingredirectsbe-ginstocollapseatthispoint.Anecdotally,thisbehaviormatchesthegeneraltimeframeoftheunderminingofkeyFake-AVafliateprograms[9],frequentlyobservedattheendoftheredirectchains.4.6DomainInfrastructureAnalyzingtheunderlyingintentofcloakedpagesconrmedaga-inthatcloakersareattemptingtoattracttrafcbyillegitimatelyoccupyingtopsearchresultpositionsfortrendingandpharmacysearchtermsandtheirsuggestions.Theimplicationisthatthekeyresourcethatspammersmustpossess,toeffectivelyutilizecloak-ingintheirscams,isaccesstoWebsitesandtheirdomains.Ideally,thesesitesshouldbeestablishedsitesalreadyindexedinsearchen-gines.Otherwise,solelyusingtraditionalSEOtactics,suchaslinkfarming,willhavelimitedsuccessinobtaininghighsearchresultpositions.Recentreportsconrmthatmanypageshavebeentar-getedandinfectedbywellknownexploitstotheirsoftwareplat-forms,andsubsequentlyusedtocloakhiddencontentfromsearchengines[18].Inthissection,weexaminethetopleveldomains(TLDs)ofcloakedsea
rchresults.Figure9showshistogramsofthemostfre-quentlyoccuringTLDsamongallcloakedsearchresults,forbothGoogleandYahoo.Weseethatthemajorityofcloakedsearchresultsarein.com.Interestingly,cloakedsearchresultsfromphar-maceuticalqueriesutilizedomainsin.eduand.orgmuchmorefrequently,wheretogethertheyrepresent27.6%ofallcloakedsearchresultsseeninGoogleand31.7%inYahoo.Forcomparison,.eduand.orgtogetherrepresentjust10%inGoogleand14.8%inYahoofortrendingsearches.Cloakersspammingpharmaceuticalsearchtermsareusingthe“reputation”ofpagesinthesedomainstoboosttheirrankinginsearchresultssimilartotherecentaccusa-tionsagainstoverstock.com[6].911008190718061705160415031402130112011020%15%10%5%0%Googleahoo%age of Cloaked Search Results, Trends0%5%10%15%20%%age of Cloaked Search Results, PharmacyFigure10:Distributionofcloakedsearchresultpositions.4.7SEOFinally,weexplorecloakingfromanSEOperspectivebyquan-tifyinghowsuccessfulcloakingisinhigh-levelspamcampaigns.Sinceamajormotivationforcloakingistoattractusertrafc,wecanextrapolateSEOperformancebasedonthesearchresultposi-tionsthecloakedpagesoccupy.Forexample,acampaignthatisabletooccupysearchresultpositionsbetween1–20ispresumablymuchmoresuccessfulthanonethatisonlyabletooccupysearchresultpositionsbetween41–60.Tovisualizethisinformation,wecalculatethepercentageofcloakedsearchresultsfoundbetweenrangesofsearchresultpo-sitionsforeachmeasurement.Thenwetaketheaverageacrossallmeasurements.Again,weonlyfocusonGoogleandYahooduetothelackofcloakedsearchresultsinBing.Forclarity,webinthehistogrambygroupingeverytenpositionstogether,thedefaultnumberofsearchresultsperpage.Figure10showstheresultinghistogramsfortrendingsearchtermsandpharmaceuticalterms,sidebyside.Fortrendingsearchesweseeaskeweddistributionwherecloakedsearchresultsmainlyholdthebottompositions;forbothGoogleandYahoo,positionsfurtherawayfromthetopcontainmorecloaking.Comparedtothepositions1–10ontherstpageofresults,thenumberofcloakedsearchresultsare2.1morelikleytoholdasearchresultposi-tonbetween31–40inGoogle,a
nd4.7morelikelytobeinpo-sition91–100;resultsforYahooaresimilar.Insomeways,thisdistributionindicatesthatcloakingisnotveryeffectivefortrend-ingsearchterms.Itdoesnotleadtoahigherconcentrationinthemostdesirablesearchresultpositions(top20),likelyduetothelimitedamountoftimeavailabletoSEO.Althoughcloakerswillhavefeweropportunitiesfortheirscamsasaresult,presumablyitstillremainsprotableforcloakerstocontinuethepractice.Interestingly,weseeaverydifferenttrendinpharmaceuticalsearcheswherethereisanevendistributionacrosspositions.Thenumberofcloakedpagesarejustaslikelytorankintherstgroupofsearchresults(positions1–10)asanyothergroupwithinthetop100.WuandDavison[24]hadsimilarndingsfrom2005.Onepossibleexplanationisthatthedifferencesagainreectthediffer-encesinthenatureofcloakedsearchterms.Cloakingthetrendingtermsbydenitiontargetpopulartermsthatareverydynamic,withlimitedtimeandheavycompetitionforperformingSEOonthosesearchterms.Cloakingpharmacyterms,however,isahighlyfo-cusedtaskonastaticsetofterms,providingmuchlongertimeframesforperformingSEOoncloakedpagesforthoseterms.As 
a result, cloakers have more time to SEO pages that subsequently span the full range of search result positions.

5. CONCLUSION

Cloaking has become a standard tool in the scammer's toolbox and one that adds significant complexity for differentiating legitimate Web content from fraudulent pages. Our work has examined the current state of search engine cloaking as used to support Web spam, identified new techniques for identifying it (via the search engine snippets that identify keyword-related content found at the time of crawling) and, most importantly, explored the dynamics of cloaked search results and sites over time. We demonstrate that the majority of cloaked search results remain high in rankings for 12 hours and that the pages themselves can persist far longer. Thus, cloaking is likely to be an effective mechanism so long as the overhead of site placement via SEO techniques is less than the revenue obtained from 12 hours of traffic for popular keywords. We believe it is likely that this holds, and search engine providers will need to further reduce the lifetime of cloaked results to demonetize the underlying scam activity.

Acknowledgments

We thank Damon McCoy for insightful comments and discussion of this work, and we also thank the anonymous reviewers for their valuable feedback. This work was supported in part by National Science Foundation grants NSF-0433668 and NSF-0831138, by the Office of Naval Research MURI grant N000140911081, and by generous research, operational and/or in-kind support from Google, Microsoft, Yahoo, Cisco, HP and the UCSD Center for Networked Systems (CNS).

6. REFERENCES

[1] John Bethencourt, Jason Franklin, and Mary Vernon. Mapping Internet Sensors with Probe Response Attacks. In Proceedings of the 14th USENIX Security Symposium, Baltimore, MD, July 2005.
[2] Andrei Z. Broder. On the Resemblance and Containment of Documents. In Proceedings of the Compression and Complexity of Sequences (SEQUENCES'97), pages 21–29, June 1997.
[3] Lee G. Caldwell. The Fast Track to Profit. Pearson Education, 2002.
[4] Kumar Chellapilla and David Maxwell Chickering. Improving Cloaking Detection Using Search Query Popularity and Monetizability. In Proceedings of the SIGIR Workshop on Adversarial Information Retrieval on the Web (AIRWeb), August 2006.
[5] Marco Cova, Corrado Leita, Olivier Thonnard, Angelos Keromytis, and Marc Dacier. An Analysis of Rogue AV Campaigns. In Proceedings of the 13th International Symposium on Recent Advances in Intrusion Detection (RAID), September 2010.
[6] Amir Efrati. Google Penalizes Overstock for Search Tactics. http://online.wsj.com/article/2753779521700.html, February 24, 2011.
[7] Google Safe Browsing API. http://code.google.com/apis/safebrowsing/.
[8] John P. John, Fang Yu, Yinglian Xie, Arvind Krishnamurthy, and Martin Abadi. deSEO: Combating Search-Result Poisoning. In Proceedings of the 20th USENIX Security Symposium, August 2011.
[9] Brian Krebs. Huge Decline in Fake AV Following Credit Card Processing Shakeup. http://krebsonsecurity.com/2011/08/huge-decline-in-fake-av-following-credit-card-processing-shakeup/, August 2011.
[10] Nektarios Leontiadis, Tyler Moore, and Nicolas Christin. Measuring and Analyzing Search-Redirection Attacks in the Illicit Online Prescription Drug Trade. In Proceedings of the 20th USENIX Security Symposium, August 2011.
[11] Kirill Levchenko, Neha Chachra, Brandon Enright, Márk Félegyházi, Chris Grier, Tristan Halvorson, Chris Kanich, Christian Kreibich, He Liu, Damon McCoy, Andreas Pitsillidis, Nicholas Weaver, Vern Paxson, Geoffrey M. Voelker, and Stefan Savage. Click Trajectories: End-to-End Analysis of the Spam Value Chain. In Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, May 2011.
[12] Jun-Lin Lin. Detection of cloaked web spam by using tag-based methods. Expert Systems with Applications, 36(4):7493–7499, 2009.
[13] Marc A. Najork. System and method for identifying cloaked web servers, United States Patent number 6,910,077. Issued June 21, 2005.
[14] Yuan Niu, Yi-Min Wang, Hao Chen, Ming Ma, and Francis Hsu. A Quantitative Study of Forum Spamming Using Context-based Analysis. In Proceedings of 15th Network and Distributed System Security (NDSS) Symposium, February 2007.
[15] Moheeb Abu Rajab, Lucas Ballard, Panayiotis Mavrommatis, Niels Provos, and Xin Zhao. The Nocebo Effect on the Web: An Analysis of Fake Anti-Virus Distribution. In Proceedings of the 3rd USENIX Workshop on Large-Scale Exploits and Emergent Threats (LEET'10), April 2010.
[16] Search Engine Marketing Professional Organization (SEMPO). State of Search Engine Marketing Report Says Industry to Grow from $14.6 Billion in 2009 to $16.6 Billion in 2010. http://www.sempo.org/news/03-25-10, March 2010.
[17] Craig Silverstein, Monika Henzinger, Hannes Marais, and Michael Moricz. Analysis of a Very Large Web Search Engine Query Log. ACM SIGIR Forum, 33(1):6–12, 1999.
[18] Julien Sobrier. Tricks to easily detect malware and scams in search results. http://research.zscaler.com/2010/06/tricks-to-easily-detect-malware-and.html, June 3, 2010.
[19] Danny Sullivan. Search Engine Optimization Firm Sold For $95 Million. http://searchenginewatch.com/2163001, September 2000. Search Engine Watch.
[20] Jason Tabeling. Keyword Phrase Value: Click-Throughs vs. Conversions. http://searchenginewatch.com/3641985, March 8, 2011.
[21] Yi-Min Wang and Ming Ma. Detecting Stealth Web Pages That Use Click-Through Cloaking. Technical Report MSR-TR-2006-178, Microsoft Research, December 2006.
[22] Yi-Min Wang, Ming Ma, Yuan Niu, and Hao Chen. Spam Double-Funnel: Connecting Web Spammers with Advertisers. In Proceedings of the 16th International World Wide Web Conference (WWW'07), pages 291–300, May 2007.
[23] Wordtracker. Five Reasons Why Wordtracker Blows Other Keywords Tools Away. http://www.wordtracker.com/find-the-best-keywords.html.
[24] Baoning Wu and Brian D. Davison. Cloaking and Redirection: A Preliminary Study. In Proceedings of the SIGIR Workshop on Adversarial Information Retrieval on the Web (AIRWeb), May 2005.
[25] Baoning Wu and Brian D. Davison. Detecting Semantic Cloaking on the Web. In Proceedings of the 15th International World Wide Web Conference, pages 819–828, May 2006.

APPENDIX

CLOAKING ILLUSTRATION

Figure 11 uses browser screenshots to illustrate the effects of cloaking when searching on Google on May 3, 2011 with the terms “bethenney frankel twitter”, a topic popular in user queries at that time. The top screenshot shows the first search results page with the eighth result highlighted with content snippets and a preview of the linked page; this view of the page corresponds to the cloaked content returned to the Google crawler when it visited this page. These search results are also examples of what Dagger obtains when querying Google with search terms (Section 3.2).

The middle screenshot shows the page obtained by clicking on the link using Internet Explorer on Windows. Specifically, it corresponds to visiting the page with the User-Agent field set to:

Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.2; Trident/4.0; Media Center PC 4.0; SLCC1; .NET CLR 3.0.04320)

and the Referer field indicating that the click comes from a search result for the terms “bethenney frankel twitter”. This page visit corresponds to when Dagger crawls the URL as a search “user” (Section 3.3), displaying an advertisement for Fake-AV.

The bottom screenshot shows the page obtained by crawling the link while mimicking the Google crawler. Specifically, it corresponds to visiting the page with the User-Agent field set to:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

and corresponds to when Dagger visits the URL as a search “crawler”. Note that this page also uses IP cloaking: the content returned to Dagger, which does not visit it with a Google IP address, is different from the content returned to the real Google search crawler (as reflected in the snippet and preview in the top screenshot). Nevertheless, because the cloaked page does return different content to search users and the Dagger crawler, Dagger can still detect that the page is cloaked.

[Figure 11 panels: (a) Google Search Results Page, (b) User from Windows, (c) Crawler.]

Figure 11: Example of cloaking in practice: (a) the first search results page for the query “bethenney frankel twitter”, including the Google preview; (b) the page obtained by clicking on the highlighted link from a Windows machine (with Referer indicating a search result); and (c) the same page visited but with the User-Agent field in the HTTP request header set to mimic that of the Google search crawler.
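
The two page visits described in this appendix differ only in their HTTP request headers. As a rough illustration (a sketch, not Dagger's actual implementation), the following Python code builds the "user" and "crawler" requests using only the standard library. The two User-Agent strings are the ones quoted in the appendix; the function names `build_requests` and `fetch`, the format of the Referer URL, and the timeout value are assumptions made for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# User-Agent strings quoted in the appendix.
USER_UA = ("Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.2; Trident/4.0; "
           "Media Center PC 4.0; SLCC1; .NET CLR 3.0.04320)")
CRAWLER_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
              "+http://www.google.com/bot.html)")


def build_requests(url, search_terms):
    """Return (user_request, crawler_request) for one search result URL.

    The "user" request carries a browser User-Agent plus a Referer that
    looks like a click from a search results page. The Referer URL format
    here is an assumption for illustration only.
    The "crawler" request carries the Googlebot User-Agent and no Referer.
    """
    referer = "http://www.google.com/search?" + urlencode({"q": search_terms})
    user = Request(url, headers={"User-Agent": USER_UA, "Referer": referer})
    crawler = Request(url, headers={"User-Agent": CRAWLER_UA})
    return user, crawler


def fetch(request, timeout=10):
    """Fetch a prepared request and return the raw response body (bytes).

    Hypothetical helper: the timeout is an arbitrary choice, and error
    handling is left out to keep the sketch short.
    """
    with urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling `fetch` on both requests for the same URL yields the two views of the page that a detection step would then compare. As the appendix notes, a sketch like this cannot reproduce what the real Google crawler sees when a site also cloaks on IP address.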