Shame to be Sham: Addressing Content-Based Grey Hat Search Engine Optimization

Fiana Raiber
Technion, Israel Institute of Technology

Kevyn Collins-Thompson
Microsoft Research, 1 Microsoft Way, Redmond, WA, USA 98052
kevynct@microsoft.com

Oren Kurland
Technion, Israel Institute of Technology
kurland@ie.technion.ac.il

ABSTRACT
We present an initial study identifying a form of content-based grey hat search engine optimization, in which a Web [...]

Footnote: Appropriately, in English the word 'sham' can denote something fake or not genuine.

    Your good accident lawyer will deal with whiplash injury claims, spinal injury claims, personal injury claims, back injury claims. It is the job of the car accident lawyer to negotiate with the insurance company to get you a high auto accident insurance settlement as well as perform other duties such as...

Figure 1: Example passage for the query [auto injury attorney]. In sham documents, content is manipulated, e.g., with excessive related queries (underlined), while retaining some degree of relevance.

[...]tors for when a query is likely to be the target of shamming efforts. Our results show that query-specific features outperform an approach that uses spam classification for this task, as well as content-based query-independent document quality measures [2].

2. RELATED WORK
There is much work on spam classification (e.g., [11, 14, 20, 8, 5, 19]) and on nullifying its effects on retrieval effectiveness [14, 9]. As already noted, the task of sham classification that we pursue here is different from spam classification by definition. Specifically, sham documents are manipulated for specific queries and do bear useful content, while spam documents are absolutely useless. There has been some work on using query-dependent features to classify spam in its absolute, query-independent sense [20]. We use query-dependent features for sham classification.

Using query-independent document quality measures is known to improve retrieval effectiveness [3, 2]. We note that sham documents are not necessarily of low quality. Furthermore, we show that the methods we study for the task of predicting whether a query is the target of shamming outperform methods that use content-based document quality measures, which are very effective for search [2].

The realization that "not all content that complicates ranking is spam" was echoed in previous work [13]. However, we are not aware of any previous work that studies the difficulty of labeling Web pages as grey hat SEO, or that uses pre- and post-retrieval features to predict how likely a query is to be the target of such SEO attempts.

3. SHAM ANNOTATION PROCESS
As a first step in examining the shamming phenomenon, we conducted an annotation effort in which a total of 300 documents (30 queries × 10 documents per query) were evaluated. (Further details regarding the set of documents and the queries are provided in Section 5.) The documents were evenly divided between five Ph.D. and M.Sc. students in the Information Retrieval field. Each document was labeled by two annotators as being either a spam, sham, or legitimate page. In case of a disagreement between the first two annotators, a third annotator was asked to label the document. Thus, for each document we have either two or three labels.

Annotators were instructed to label a document as sham if any of the following conditions was met: (i) the document contained text that did not seem to satisfy any information need, including information needs satisfied by other parts of the same document; (ii) the page contained text that appeared to be related to information needs satisfied by this page, but the text contained many artificial, repeated, or otherwise unnecessary extra words, phrases, or sentences added to the page solely for the purpose of promoting the page in the result lists; and (iii) the page contained sections of content that were copied from other pages (e.g., from a Wikipedia page), presumably for the sole purpose of promoting the page. A document was labeled as spam if it contained no useful information at all for satisfying an information need.

The annotators were asked to base their decision only upon the content of the document, and to ignore non-content elements such as incoming and outgoing links, page URL, site domain, or ads and sponsored links that might appear on the page. The documents were presented to the annotators in a random, query-independent order, and the query itself was not presented.

4. PREDICTORS FOR QUERY SHAMMING
We next examine the problem of predicting which queries are more susceptible to sham, as measured, e.g., by the fraction of sham documents in the top 10 results. To this end, we study the use of various predictors that were developed to estimate retrieval effectiveness [4]. As we show below, some of these are negatively correlated with the percentage of sham, typically because most sham documents are non-relevant, while others have a positive correlation.

Pre-retrieval predictors are based on properties of the query and the corpus. Post-retrieval predictors utilize, in addition, information about the result list of documents that was returned in response to the query [4]. Details regarding the retrieval model used in our experiments are provided in Section 5. In what follows we present the different classes of predictors analyzed for our prediction task.

Pre-retrieval predictors. The SumSCQ predictor measures the similarity between the query and the collection using the sum of the TF.IDF values of the query terms [21]. SumVar [21] is the sum, over query terms, of the variance of the TF.IDF values of the term in the documents in the corpus in which it appears. The SumIDF predictor is the sum of the IDF values of the query terms [10]. Another pre-retrieval predictor that we use is QueryLength. Since sham documents are assumed to negatively affect relevance ranking, a lower pre-retrieval predicted value is assumed to be correlated with a higher percentage of sham.

Post-retrieval predictors. The Clarity [10] prediction value is the KL divergence between a relevance language model [15], R, induced from the documents in the result list D_res, and the (unsmoothed) language model of the collection. Formally, let p(w|d) and Score(d) be the probability assigned to term w by an unsmoothed language model induced from document d, and the score assigned to d by the search algorithm that was used to create D_res, respectively. Then, the probability assigned to w by R is defined as

$$p(w|R) = \sum_{d \in D_{res}} p(w|d) \, \frac{Score(d)}{\sum_{d' \in D_{res}} Score(d')}.$$

The assumption is that sham documents for a given query are similar to one another; for example, due to the repeated terms that they contain. Thus, these documents presumably form a cluster whose language model is focused with respect to that of the corpus. We also consider a variant of WIG [22], which is simply the average retrieval score in D_res.
The premise is that sham documents are likely to be assigned high retrieval scores, as by definition they have been manipulated to be promoted in result lists.

Query-independent document-quality measures. The entropy of the term distribution in a document, Entropy; the stopwords to non-stopwords ratio in a document, SW1; and the percentage of stopwords that appear in the document, SW2, were shown to be highly effective content-based query-independent quality measures for Web search [2], and for predicting query performance over the Web [17]. We used INQUERY's stopword list [1]. The document PageRank score [3] is another measure widely used for improving retrieval effectiveness. We use these measures for our sham prediction task. The prediction value for the result list D_res is the sum of the per-document values assigned by a measure.

While Entropy, SW1, and SW2 are based on the content of a document, PageRank is based on the hyperlink structure. The content-based measures quantify the presumed breadth of textual content in the document. We hypothesize that the content breadth in sham documents is low, as there is focus on a small subset of terms that are the target of shamming. Conversely, since by definition sham documents can still satisfy information needs, we assume that the PageRank score of such documents is likely to be high.

Spam-based predictor. To study the connection between spam and sham documents, we use the recently proposed NS predictor [17]. Documents in D_res that are classified as spam by Waterloo's spam classifier are treated as "non-relevant"; the rest are treated as "relevant". Then, the mean average precision (MAP) at cutoff [...] is computed based on these artificially created "relevant" and "non-relevant" labels. This score is then multiplied by the total number of "relevant" (non-spam) documents.

Query-log-based predictors. For each query, we derived the following features from the query logs of a commercial search engine. RawImpressions: the total frequency with which the query (with stopwords) was used to search. ClicksOnResult: the percentage of clicks on the search engine results page (SERP) that were on a search result rather than on other parts of the page, such as query suggestions. ClicksForPaging: the percentage of clicks on the SERP used to get to the next or previous page of results. For all features we used one month of traffic from January 2013 originating from the U.S. locale.
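The content-based quality measures described in this section can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the whitespace-tokenized input, and the tiny stopword set are assumptions made here (the paper uses INQUERY's stopword list), and SW2 is read as the fraction of the stopword list that occurs in the document, which is one plausible interpretation of its description.

```python
import math
from collections import Counter

# Tiny illustrative stopword set (an assumption; the paper uses INQUERY's list).
STOPWORDS = frozenset({"the", "of", "a", "and", "to", "in", "is", "it"})


def entropy(tokens):
    """Shannon entropy (in bits) of the term distribution of one document."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def sw1(tokens, stopwords=STOPWORDS):
    """Ratio of stopword tokens to non-stopword tokens in one document."""
    n_stop = sum(1 for t in tokens if t in stopwords)
    n_other = len(tokens) - n_stop
    return n_stop / n_other if n_other else 0.0


def sw2(tokens, stopwords=STOPWORDS):
    """Fraction of the stopword list that appears in the document
    (an assumed reading of 'percentage of stopwords that appear')."""
    if not stopwords:
        return 0.0
    return len(stopwords & set(tokens)) / len(stopwords)


def list_prediction(result_list, measure):
    """Prediction value for a result list: the sum of the per-document
    values assigned by the given measure."""
    return sum(measure(doc_tokens) for doc_tokens in result_list)


if __name__ == "__main__":
    # Hypothetical two-document result list, already tokenized.
    docs = [
        "the car accident lawyer is the lawyer to call".split(),
        "injury claims injury claims injury claims".split(),
    ]
    for name, measure in (("Entropy", entropy), ("SW1", sw1), ("SW2", sw2)):
        print(name, list_prediction(docs, measure))
```

Under the hypothesis stated above, a document that repeats a small set of target terms would receive a low Entropy value, and a result list dominated by such documents would receive a low summed value.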
5. EVALUATION
Our experiments were conducted using the ClueWeb09 Category B collection, which contains about 50 million English Web pages. We used queries 1-150 from TREC 2009-2011. The Indri toolkit (www.lemurproject.org/indri) was used for experiments. We applied Krovetz stemming to queries and documents, and removed stopwords on the INQUERY list [1] only from queries.

To create a highly effective result list D_res, we did the following. First, for each query q, the documents in the collection were ranked using the negative cross entropy between the unsmoothed unigram language model induced from q and the Dirichlet-smoothed unigram language model induced from a document (with the smoothing parameter set to 1000). Then, following previous work [9], documents assigned a score below 50 by Waterloo's spam classifier were removed, top to bottom, from that ranking until 100 (presumably non-spam) documents were accumulated. These documents were then re-ranked using SVMrank [12] applied with 130 features. Ten-fold cross validation was performed.

We used a subset of the features used by Microsoft's learning-to-rank datasets (www.research.microsoft.com/en-us/projects/mslr). Our features do not include the Boolean Model, Vector Space Model, LMIR.ABS, and the number of outgoing links. The SiteRank score, the two quality measures, QualityScore and QualityScore2, and the click-based features were also not included, as these information types are not available for the ClueWeb09 collection. Three additional features that we used, and which were found to be highly effective for Web retrieval [2], are Entropy, SW1, and SW2, which were described in Section 4. As is the case with the features mentioned thus far, these three features were also computed separately for all the text in the document, its anchor text, URL, body, and title. Waterloo's spam classifier score [9] is another feature that was used. The BM25 score was computed with k1 = 1 and b = [...]5, following experiments with k1 ∈ {[...]} and b ∈ {[...]}. LMIR.DIR and LMIR.JM, the language-model-based features, were computed with μ = 1000 and λ = [...]9, respectively.

Lastly, 30 queries were randomly sampled, and the top 10 non-Wikipedia documents were used to form D_res. We assume that Wikipedia pages are not sham.

We used two approaches to aggregate the labels assigned by the annotators to the documents. According to the first approach, a document is considered sham if it was labeled as such by at least one of the annotators. This 'weak' label definition is henceforth referred to as AtLeastOne. The second approach, denoted AtLeastTwo, is more strict: a document is considered sham if it was marked as sham by at least two annotators.

To evaluate the quality of the various predictors, we report Pearson's correlation [4] between the values assigned by a predictor to each of the tested queries and the percentage of sham documents computed for a query, according to the AtLeastOne and AtLeastTwo approaches described above.

The number of terms used to construct the relevance language model, which is used by the Clarity predictor, is set to 50. Given that documents assigned a score below 50 by Waterloo's spam classifier are removed in the process of creating D_res, we use a threshold [...] to determine the non-spam documents that are used by the NS predictor. The threshold is selected from {[...]} so as to optimize the prediction quality of NS. Documents assigned a score above that threshold are considered non-spam ("relevant").

Annotation Results. The inter-annotator agreement on the label pairs from the two main judges for each document was 0.69, computed using the free-marginal multi-rater kappa measure [18]. Overall, 79% of the 300 label pairs were in agreement. The average percentage of sham documents per query was 30% based on AtLeastOne labels, and 21% based on AtLeastTwo labels.

Query shamming predictor analysis. The correlations of the predictors described in Section 4 with the percentage of sham pages retrieved for a query are presented in Table 1.

Predictor Type    Predictor Name     Sham Label Defn.
                                     AtLeastOne    AtLeastTwo
Pre-retrieval     SumSCQ             -0.500        -0.401
                  SumVar             -0.398        -0.337
                  SumIDF             -0.424        -0.363
                  QueryLength        -0.553        -0.432
Post-retrieval    Clarity            +0.276        +0.235
                  WIG                +0.361        +0.351
Independent       Entropy            -0.324        -0.242
                  SW1                -0.001        -0.022
                  SW2                -0.024        +0.046
                  PageRank           +0.226        +0.129
Spam              NS                 +0.099        +0.183
Query Log         RawImpressions     +0.178        +0.231
                  ClicksOnResult     -0.216        -0.320
                  ClicksForPaging    +0.277        +0.387

Table 1: Pearson's correlation (r) of various predictor variables (Section 4) with % sham documents in the top 10 results, for weak (AtLeastOne) and strict (AtLeastTwo) sham label definitions.

We can see that the pre-retrieval predictors are negatively correlated with shamming. As these predictors were designed to predict the retrieval effectiveness of using the query for search, we can conclude that sham documents are prevalent in queries whose performance is predicted to be low. Indeed, we found that about [...] of sham documents are non-relevant, for both AtLeastOne and AtLeastTwo.

QueryLength is negatively correlated with the percentage of sham documents, for both the weak (AtLeastOne) and strict (AtLeastTwo) sham label definitions. That is, a shorter query has a higher chance of being the target of shamming. This might be attributed to the fact that short queries are more ambiguous; as a result, documents returned in response to these queries might be promoting the page with respect to different information needs, not necessarily the information need that the current query expresses.

The post-retrieval predictors, Clarity and WIG, are positively correlated with the percentage of sham documents, as hypothesized in Section 4. However, the prediction quality is lower than that for the pre-retrieval predictors.

We can also see that the correlation between the query-independent document-quality measures and shamming is relatively low. Furthermore, the NS predictor is not correlated with sham. These findings suggest that sham documents might be of either high or low quality, and that there is no evident connection between sham and spam documents.

For query log features, the positive correlation of RawImpressions with the level of sham is consistent with the fact that frequent queries are more susceptible to shamming. ClicksForPaging is positively correlated with the percentage of sham, while the ClicksOnResult feature is negatively correlated. These findings resonate with the fact that shamming degrades retrieval performance.

Finally, for the task of predicting the sham percentage for a given query at low (under 0.3), medium (0.3 to 0.6), or high (over 0.6) levels, we performed ordinal regression [6] using all the predictors with default regression settings. The mean absolute error (MAE), the absolute difference between the predicted class ordinal and the true ordinal, averaged over 100 randomized trials with a 2:1 train/test split, was 0.[...]. For comparison, an oracle run using one rater's sham labels to predict sham levels derived from other raters' labels had an MAE of 0.37, while always guessing category 'medium' had an MAE of 0.67. These initial results show that different predictors can be effectively integrated for predicting per-query sham levels.

6. CONCLUSION
We have provided an initial study on the identification of sham documents: a form of grey-hat SEO using content-based feature manipulation, along with an analysis of features associated with low- or high-sham queries. Even search engines that use highly effective ranking methods still suffer from sham documents in the top-most rankings that adversely affect retrieval quality. These findings call for a principled treatment of query-specific grey hat SEO.

Acknowledgments. We thank the reviewers for their comments. This work was supported by and carried out at the Technion-Microsoft Electronic Commerce Research Center.

7. REFERENCES
[1] J. Allan, M. E. Connell, W. B. Croft, F.-F. Feng, D. Fisher, and X. Li. INQUERY and TREC-9. In Proc. of TREC, pages 551–562, 2000.
[2] M. Bendersky, W. B. Croft, and Y. Diao. Quality-biased ranking of web documents. In Proc. of WSDM, pages 95–104, 2011.
[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of WWW, pages 107–117, 1998.
[4] D. Carmel and E. Yom-Tov. Estimating the Query Difficulty for Information Retrieval. Synthesis lectures on information concepts, retrieval, and services. Morgan & Claypool, 2010.
[5] C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. SIGIR Forum, 40(2):11–24, 2006.
[6] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. J. Machine Learning Res., 6:1019–1041, 2005.
[7] C. L. A. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2009 Web track. In Proc. of TREC, 2009.
[8] G. V. Cormack. TREC 2007 spam track overview. In Proc. of TREC, 2007.
[9] G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5):441–465, 2011.
[10] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proc. of SIGIR, pages 299–306, 2002.
[11] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. of AIRWeb, pages 39–47, 2005.
[12] T. Joachims. Training linear SVMs in linear time. In Proc. of KDD, pages 217–226, 2006.
[13] T. Jones, R. S. Sankaranarayana, D. Hawking, and N. Craswell. Nullification test collections for web spam and SEO. In Proc. of AIRWeb, pages 53–60, 2009.
[14] V. Krishnan and R. Raj. Web spam detection with Anti-Trust rank. In Proc. of AIRWeb, pages 37–40, 2006.
[15] V. Lavrenko and W. B. Croft. Relevance-based language models. In Proc. of SIGIR, pages 120–127, 2001.
[16] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 2009.
[17] F. Raiber and O. Kurland. Using document-quality measures to predict web-search effectiveness. In Proc. of ECIR, pages 134–145, 2013.
[18] J. J. Randolph. Online kappa calculator (2008). Retrieved February 16, 2013, http://justus.randolph.name/kappa.
[19] D. Sculley and G. Wachman. Relaxed online SVMs for spam filtering. In Proc. of SIGIR, pages 415–422, 2007.
[20] K. M. Svore, Q. Wu, C. Burges, and A. Raman. Improving web spam classification using rank-time features. In Proc. of AIRWeb, 2007.
[21] Y. Zhao, F. Scholer, and Y. Tsegay. Effective pre-retrieval query performance prediction using similarity and variability evidence. In Proc. of ECIR, pages 52–64, 2008.
[22] Y. Zhou and B. Croft. Query performance prediction in web search environments. In Proc. of SIGIR, pages 543–550, 2007.