Shame to be Sham: Addressing Content-Based Grey Hat Search Engine Optimization






Fiana Raiber
Technion - Israel Institute of Technology

Kevyn Collins-Thompson
Microsoft Research
1 Microsoft Way, Redmond, WA, USA 98052
kevynct@microsoft.com

Oren Kurland
Technion - Israel Institute of Technology
kurland@ie.technion.ac.il

ABSTRACT

We present an initial study identifying a form of content-based grey hat search engine optimization, in which a Web [...]

Footnotes: Appropriately, in English the word 'sham' can denote something fake or not genuine. www.lemurproject.org

"Your good accident lawyer will deal with whiplash injury claims spinal injury claims personal injury claims back injury claims It is the job of the car accident lawyer to negotiate with the insurance company to get you a high auto accident insurance settlement as well as perform other duties such as..."

Figure 1: Example passage for the query [auto injury attorney]. In sham documents, content is manipulated, e.g., with excessive related queries (underlined), while retaining some degree of relevance.

[...]tors for when a query is likely to be the target of shamming efforts. Our results show that query-specific features outperform an approach that uses spam classification for this task, as well as content-based query-independent document quality measures [2].

2. RELATED WORK

There is much work on spam classification (e.g., [11, 14, 20, 8, 5, 19]) and on nullifying its effects on retrieval effectiveness [14, 9]. As already noted, the task of sham classification that we pursue here is different from spam classification by definition. Specifically, sham documents are manipulated for specific queries, and do bear useful content, while spam documents are absolutely useless. There has been some work on using query-dependent features to classify spam in its absolute query-independent sense [20]. We use query-dependent features for sham classification.

Using query-independent document quality measures is known to improve retrieval effectiveness [3, 2]. We note that sham documents are not necessarily of low quality. Furthermore, we show that the methods we study for the task of predicting whether a query is the target of shamming outperform methods that use content-based document quality measures which are very effective for search [2].

The realization that "not all content that complicates ranking is spam" was echoed in previous work [13]. However, we are not aware of any previous studies that examine the difficulty of labeling Web pages as grey hat SEO, or that use pre- and post-retrieval features to predict how likely a query is to be the target of such SEO attempts.

3. SHAM ANNOTATION PROCESS

As a first step in examining the shamming phenomenon, we conducted an annotation effort, wherein a total of 300 documents, 30 queries × 10 documents per query, were evaluated. (Further details regarding the set of documents and the queries are provided in Section 5.) The documents were evenly divided among five Ph.D. and M.Sc. students in the Information Retrieval field. Each document was labeled by two annotators as being either a spam, sham, or legitimate page. In case of a disagreement between the first two annotators, a third annotator was asked to label the document. Thus, for each document we have either two or three labels.

Annotators were instructed to label a document as sham if any of the following conditions was met: (i) the document contained text that did not seem to satisfy any information need, including information needs satisfied by other parts of the same document; (ii) the page contained text that appeared to be related to information needs satisfied by this page, but the text contained many artificial, repeated, or otherwise unnecessary extra words, phrases or sentences added to the page solely for the purpose of promoting the page in the result lists; and (iii) the page contained sections of content that were copied from other pages (e.g., from a Wikipedia page) for presumably the sole purpose of promoting the page. A document was labeled as spam if it contained no useful information at all for satisfying an information need. The annotators were asked to base their decision only upon the content of the document, and to ignore non-content elements such as incoming and outgoing links, page URL, site domain, or ads and sponsored links that might appear on the page. The documents were presented to the annotators in a random query-independent order and the query itself was not presented.

4. PREDICTORS FOR QUERY SHAMMING

We next examine the problem of predicting which queries are more susceptible to sham, as measured, e.g., by the fraction of sham documents in the top 10 results. To this end, we study the use of various predictors which were developed to estimate retrieval effectiveness [4]. As we show below, some of these are negatively correlated with the percentage of sham, typically because most sham documents are non-relevant, while others have a positive correlation.

Pre-retrieval predictors are based on properties of the query and the corpus. Post-retrieval predictors utilize, in addition, information about the result list of documents which was returned in response to the query [4]. Details regarding the retrieval model used in our experiments are provided in Section 5. In what follows we present the different classes of predictors analyzed for our prediction task.

Pre-retrieval predictors. The SumSCQ predictor measures the similarity between the query and the collection using the sum of the TF.IDF values of the query terms [21]. SumVar [21] is the sum over query terms of the variance of the TF.IDF values of the term in documents in the corpus in which it appears. The SumIDF predictor is the sum of the IDF values of the query terms [10]. Another pre-retrieval predictor that we use is QueryLength. Since sham documents are assumed to negatively affect relevance ranking, a lower pre-retrieval predicted value is assumed to be correlated with a higher percentage of sham.

Post-retrieval predictors. The Clarity [10] prediction value is the KL divergence between a relevance language model, R, induced from the documents in the result list, and the (unsmoothed) language model of the collection. Formally, consider the probability assigned to a term by an unsmoothed language model induced from a document in the result list, and the score assigned to that document by the search algorithm that was used to create the result list. The probability assigned to the term by R is defined in terms of these two quantities as [...]. The assumption is that sham documents for a given query are similar to one another; for example, due to the repeated terms that they contain. Thus, these documents presumably form a cluster whose language model is focused with respect to that of the corpus. We also consider a variant of WIG [22] which is simply the average retrieval score in the result list. The premise is that sham documents are likely to be assigned high retrieval scores, as by definition they have been manipulated to be promoted in result lists.

Query-independent document-quality measures. The entropy of the term distribution in a document, Entropy, the stopwords to non-stopwords ratio in a document, SW1, and the percentage of stopwords that appear in the document, SW2, were shown to be highly effective content-based query-independent quality measures for Web search [2], and for predicting query performance over the Web [17]. We used INQUERY's stopword list [1]. The document PageRank score [3] is another measure widely used for improving retrieval effectiveness. We use these measures for our sham prediction task. The prediction value for the result list is the sum of the per-document values assigned by a measure. While Entropy, SW1 and SW2 are based on the content of a document, PageRank is based on the hyperlink structure. The content-based measures quantify the presumed textual content breadth in the document. We hypothesize that the content breadth in sham documents is low, as there is focus on a small subset of terms that are the target of shamming. Conversely, since by definition sham documents can still satisfy information needs, we assume that the PageRank score of such documents is likely to be high.

Spam-based predictor. To study the connection between spam and sham documents, we use the recently proposed NS predictor [17]. Documents in the result list that are classified as spam by Waterloo's spam classifier are treated as "non-relevant"; the rest are treated as "relevant". Then, the mean average precision (MAP) at cutoff [...] is computed based on these artificially created "relevant" and "non-relevant" labels. This score is then multiplied by the total number of "relevant" (non-spam) documents.

Query-log-based predictors. For each query, we derived the following features from the query logs of a commercial search engine. RawImpressions: the total frequency with which the query (with stopwords) was used to search; ClicksOnResult: the percentage of clicks on the search engine results page (SERP) that were on a search result, as opposed to other parts of the page such as query suggestions; ClicksForPaging: the percentage of clicks on the SERP used to get to the next or previous page of results. For all features we used one month of traffic from January 2013 originating from the U.S. locale.

5. EVALUATION

Our experiments were conducted using the ClueWeb09 Category B collection, which contains about 50 million English Web pages. We used queries 1-150 from TREC 2009-2011. The Indri toolkit (www.lemurproject.org/indri) was used for experiments. We applied Krovetz stemming upon queries and documents, and removed stopwords on the INQUERY list [1] only from queries.

To create a highly effective result list, we did the following. First, for each query, the documents in the collection were ranked using the negative cross entropy between the unsmoothed unigram language model induced from the query and the Dirichlet-smoothed unigram language model induced from a document (with the smoothing parameter set to 1000). Then, following previous work [9], documents assigned by Waterloo's spam classifier with a score below 50 were removed top to bottom from that ranking until 100 (presumably non-spam) documents were accumulated. These documents were then re-ranked using SVMrank [12] applied with 130 features. Ten-fold cross validation was performed.

We used a subset of the features used by Microsoft's learning to rank datasets (www.research.microsoft.com/en-us/projects/mslr). Our features do not include the Boolean Model, Vector Space Model, LMIR.ABS, and the number of outgoing links. The SiteRank score, the two quality measures, QualityScore and QualityScore2, and the click-based features were also not included as these information types are not available for the ClueWeb09 collection. Three additional features that we used, and which were found to be highly effective for Web retrieval [2], are Entropy, SW1, and SW2, which were described in Section 4. As is the case with the features mentioned thus far, these three features were also computed separately for all the text in the document, its anchor text, URL, body, and title. Waterloo's spam classifier score [9] is another feature that was used. The BM25 score was computed with k1 = 1 and b = [...]5, following experiments with [...]. LMIR.DIR and LMIR.JM, the language-model-based features, were computed with smoothing parameters of 1000 and [...]9, respectively.

Lastly, 30 queries were randomly sampled, and the top 10 non-Wikipedia documents were used to form the result list for each query. We assume that Wikipedia pages are not sham.

We used two approaches to aggregate the labels assigned by the annotators to the documents. According to the first approach, a document is considered sham if it was labeled as such by at least one of the annotators. This 'weak' label definition is henceforth referred to as AtLeastOne. The second approach, denoted AtLeastTwo, is more strict: a document is considered sham if it was marked as sham by at least two annotators.

To evaluate the quality of the various predictors, we report Pearson's correlation [4] between the values assigned by a predictor to each of the tested queries, and the percentage of sham documents computed for a query, according to the AtLeastOne and AtLeastTwo approaches described above.

The number of terms used to construct the relevance language model, which is used by the Clarity predictor, is set to 50. Given that documents assigned with a score below 50 by Waterloo's spam classifier are removed in the process of creating the result list, we use a threshold to determine the non-spam documents that are used by the NS predictor. The threshold is selected from [...] so as to optimize the prediction quality of NS. Documents assigned with a score above that threshold are considered as non-spam ("relevant").

Annotation Results. The inter-annotator agreement on the label pairs from the two main judges for each document was 0.69, computed using the free-marginal multi-rater kappa measure [18]. Overall, 79% of the 300 label pairs were in agreement. The average percentage of sham documents per query was 30% based on AtLeastOne labels, and 21% based on AtLeastTwo labels.

Query shamming predictor analysis. The correlations of the predictors described in Section 4 with the percentage of sham pages retrieved for a query are presented in Table 1.

Predictor Type    Predictor Name     Sham Label Defn.
                                     AtLeastOne    AtLeastTwo
Pre-retrieval     SumSCQ             -0.500        -0.401
                  SumVar             -0.398        -0.337
                  SumIDF             -0.424        -0.363
                  QueryLength        -0.553        -0.432
Post-retrieval    Clarity            +0.276        +0.235
                  WIG                +0.361        +0.351
Independent       Entropy            -0.324        -0.242
                  SW1                -0.001        -0.022
                  SW2                -0.024        +0.046
                  PageRank           +0.226        +0.129
Spam              NS                 +0.099        +0.183
Query Log         RawImpressions     +0.178        +0.231
                  ClicksOnResult     -0.216        -0.320
                  ClicksForPaging    +0.277        +0.387

Table 1: Pearson's correlation (r) of various predictor variables (Section 4) with % sham documents in the top 10 results, for weak (AtLeastOne) and strict (AtLeastTwo) sham label definitions.

We can see that the pre-retrieval predictors are negatively correlated with shamming. As these predictors were designed to predict the retrieval effectiveness of using the query for search, we can conclude that sham documents are prevalent in queries whose performance is predicted to be low. Indeed, we found that about [...] of sham documents are non-relevant, for both AtLeastOne and AtLeastTwo. QueryLength is negatively correlated with the percentage of sham documents, for both weak (AtLeastOne) and strict (AtLeastTwo) sham label definitions. That is, a shorter query has a higher chance of being the target of shamming. This might be attributed to the fact that short queries are more ambiguous, and as a result documents returned in response to these queries might be promoting the page with respect to different information needs, not necessarily the information need that the current query expresses.

The post-retrieval predictors, Clarity and WIG, are positively correlated with the percentage of sham documents, as hypothesized in Section 4. However, the prediction quality is lower than that for the pre-retrieval predictors.

We can also see that the correlation between the query-independent document-quality measures and shamming is relatively low. Furthermore, the NS predictor is not correlated with sham. These findings suggest that sham documents might be of either high or low quality, and that there is no evident connection between sham and spam documents.

For query log features, the positive correlation of RawImpressions with the level of sham is consistent with the fact that frequent queries are more susceptible to shamming. ClicksForPaging is positively correlated with the percentage of sham, while the ClicksOnResult feature is negatively correlated. These findings resonate with the fact that shamming degrades retrieval performance.

Finally, for the task of predicting sham percentage for a given query at low (below 0.3), medium (0.3 to 0.6), or high (above 0.6) levels, we performed ordinal regression [6] using all the predictors with default regression settings. The mean absolute error (MAE), the absolute difference between the predicted class ordinal and the true ordinal, averaged over 100 randomized trials with a 2:1 train/test split, was 0.[...]. For comparison, an oracle run using one rater's sham labels to predict sham levels derived from other raters' labels had MAE 0.37, while always guessing category 'medium' had MAE 0.67. These initial results show that different predictors can be effectively integrated for predicting per-query sham levels.

6. CONCLUSION

We have provided an initial study on the identification of sham documents: a form of grey-hat SEO using content-based feature manipulation, along with an analysis of features associated with low- or high-sham queries. Even search engines that use highly effective ranking methods still suffer from sham documents in the top-most rankings that adversely affect retrieval quality. These findings call for a principled treatment of query-specific grey hat SEO.

Acknowledgments. We thank the reviewers for their comments. This work was supported by and carried out at the Technion-Microsoft Electronic Commerce Research Center.

7. REFERENCES

[1] J. Allan, M. E. Connell, W. B. Croft, F.-F. Feng, D. Fisher, and X. Li. INQUERY and TREC-9. In Proc. of TREC, pages 551-562, 2000.
[2] M. Bendersky, W. B. Croft, and Y. Diao. Quality-biased ranking of web documents. In Proc. of WSDM, pages 95-104, 2011.
[3] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of WWW, pages 107-117, 1998.
[4] D. Carmel and E. Yom-Tov. Estimating the Query Difficulty for Information Retrieval. Synthesis lectures on information concepts, retrieval, and services. Morgan & Claypool, 2010.
[5] C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. SIGIR Forum, 40(2):11-24, 2006.
[6] W. Chu and Z. Ghahramani. Gaussian processes for ordinal regression. J. Machine Learning Res., 6:1019-1041, 2005.
[7] C. L. A. Clarke, N. Craswell, and I. Soboroff. Overview of the TREC 2009 Web track. In Proc. of TREC, 2009.
[8] G. V. Cormack. TREC 2007 spam track overview. In Proc. of TREC, 2007.
[9] G. V. Cormack, M. D. Smucker, and C. L. A. Clarke. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 14(5):441-465, 2011.
[10] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proc. of SIGIR, pages 299-306, 2002.
[11] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In Proc. of AIRWeb, pages 39-47, 2005.
[12] T. Joachims. Training linear SVMs in linear time. In Proc. of KDD, pages 217-226, 2006.
[13] T. Jones, R. S. Sankaranarayana, D. Hawking, and N. Craswell. Nullification test collections for web spam and SEO. In Proc. of AIRWeb, pages 53-60, 2009.
[14] V. Krishnan and R. Raj. Web spam detection with Anti-Trustrank. In Proc. of AIRWeb, pages 37-40, 2006.
[15] V. Lavrenko and W. B. Croft. Relevance-based language models. In Proc. of SIGIR, pages 120-127, 2001.
[16] T.-Y. Liu. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 2009.
[17] F. Raiber and O. Kurland. Using document-quality measures to predict web-search effectiveness. In Proc. of ECIR, pages 134-145, 2013.
[18] J. J. Randolph. Online kappa calculator (2008). Retrieved February 16, 2013, http://justus.randolph.name/kappa.
[19] D. Sculley and G. Wachman. Relaxed online SVMs for spam filtering. In Proc. of SIGIR, pages 415-422, 2007.
[20] K. M. Svore, Q. Wu, C. Burges, and A. Raman. Improving web spam classification using rank-time features. In Proc. of AIRWeb, 2007.
[21] Y. Zhao, F. Scholer, and Y. Tsegay. Effective pre-retrieval query performance prediction using similarity and variability evidence. In Proc. of ECIR, pages 52-64, 2008.
[22] Y. Zhou and B. Croft. Query performance prediction in web search environments. In Proc. of SIGIR, pages 543-550, 2007.
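The evaluation procedure described in Section 5 (aggregating annotator labels under the AtLeastOne and AtLeastTwo definitions, correlating a predictor with the per-query sham fraction, and the free-marginal kappa agreement measure) can be illustrated with a minimal Python sketch. All function names, the toy annotations, and the toy predictor values are invented for illustration and are not taken from the paper; the kappa function assumes the standard free-marginal form, in which chance agreement is 1 divided by the number of label categories.

```python
from math import sqrt


def is_sham(labels, min_votes):
    """True if at least `min_votes` annotators labeled the document 'sham'."""
    return sum(1 for label in labels if label == "sham") >= min_votes


def sham_fraction(doc_labels, min_votes):
    """Fraction of one query's documents that count as sham."""
    flagged = sum(1 for labels in doc_labels if is_sham(labels, min_votes))
    return flagged / len(doc_labels)


def pearson(xs, ys):
    """Pearson's correlation coefficient r between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def free_marginal_kappa(observed_agreement, num_categories):
    """Agreement corrected for chance, assuming uniform chance agreement."""
    chance = 1.0 / num_categories
    return (observed_agreement - chance) / (1.0 - chance)


# Toy annotations (invented): query -> one list of annotator labels per document.
# A third label is present only where the first two annotators disagreed.
toy_labels = {
    "q1": [["sham", "legitimate", "legitimate"], ["sham", "sham"],
           ["legitimate", "legitimate"], ["spam", "spam"]],
    "q2": [["legitimate", "legitimate"], ["legitimate", "legitimate"],
           ["sham", "sham"], ["legitimate", "legitimate"]],
    "q3": [["sham", "sham"], ["sham", "sham"],
           ["sham", "spam", "sham"], ["legitimate", "legitimate"]],
}
# Invented predictor values (for example, query length) for the same queries.
toy_predictor = {"q1": 2, "q2": 4, "q3": 1}

queries = sorted(toy_labels)
at_least_one = [sham_fraction(toy_labels[q], 1) for q in queries]
at_least_two = [sham_fraction(toy_labels[q], 2) for q in queries]
values = [toy_predictor[q] for q in queries]

print("AtLeastOne r:", round(pearson(values, at_least_one), 3))
print("AtLeastTwo r:", round(pearson(values, at_least_two), 3))
print("kappa:", round(free_marginal_kappa(0.79, 3), 3))
```

With three label categories (spam, sham, legitimate) and the reported 79% pairwise agreement, this form of kappa gives about 0.685, which is consistent with the reported value of 0.69 given that the agreement figure is itself rounded.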

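The content-based document-quality measures of Section 4 (Entropy, SW1, SW2) and the rule that a result list's prediction value is the sum of per-document values can be sketched in Python as follows. This is a rough illustration under stated assumptions: the stopword list is a tiny invented stand-in for INQUERY's list, the entropy uses base-2 logarithms, SW2 is read as the fraction of stopword-list entries that occur in the document, and all function names are hypothetical.

```python
from collections import Counter
from math import log2

# Tiny illustrative stopword list (an assumption; the paper uses INQUERY's list).
STOPWORDS = frozenset({"the", "of", "a", "to", "with", "and", "is", "it"})


def entropy(tokens):
    """Shannon entropy (base 2) of the term distribution of a document."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def sw1(tokens, stopwords=STOPWORDS):
    """Ratio of stopword occurrences to non-stopword occurrences."""
    stop = sum(1 for t in tokens if t in stopwords)
    non_stop = len(tokens) - stop
    return stop / non_stop if non_stop else float("inf")


def sw2(tokens, stopwords=STOPWORDS):
    """Fraction of the stopword list that appears in the document."""
    return len(set(tokens) & stopwords) / len(stopwords)


def result_list_value(measure, documents):
    """Prediction value for a result list: sum of the per-document values."""
    return sum(measure(doc) for doc in documents)


# Two invented token sequences: one keyword-stuffed, one closer to natural prose.
stuffed = "injury claims injury claims injury claims lawyer".split()
natural = "it is the job of the lawyer to negotiate with the insurer".split()

print("Entropy:", entropy(stuffed), entropy(natural))
print("SW1:", sw1(stuffed), sw1(natural))
print("SW2:", sw2(stuffed), sw2(natural))
print("List value (Entropy):", result_list_value(entropy, [stuffed, natural]))
```

In this toy case the keyword-stuffed sequence has lower entropy and no stopwords, which matches the hypothesis that content breadth in sham documents is low.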