Cloaking and Redirection: A Preliminary Study

Baoning Wu and Brian D. Davison
Computer Science & Engineering, Lehigh University
{baw4,davison}@cse.lehigh.edu

Copyright is held by the author/owner(s). AIRWeb'05, May 10, 2005, Chiba, Japan.

Abstract

Cloaking and redirection are two possible search engine spamming techniques. In order to understand cloaking and redirection on the Web, we downloaded two sets of Web pages while mimicking a popular Web crawler and as a common Web browser. We estimate that 3% of the first data set and 9% of the second data set utilize cloaking of some kind. By checking manually a sample of the cloaking pages from the second data set, nearly one third of them appear to aim to manipulate search engine ranking. We also examined redirection methods present in the first data set. We propose a method of detecting cloaking pages by calculating the difference of three copies of the same page. We examine the different types of cloaking that are found and the distribution of different types of redirection.

1 Introduction

Cloaking is the practice of sending different content to a search engine than to regular visitors of a web site. Redirection is used to send users automatically to another URL after loading the current URL. Both of these techniques can be used in search engine spamming [13]. Henzinger et al. [8] have pointed out that search engine spam is one of the major challenges of web search engines and cloaking is among the spamming techniques used today. Since search engine results can be severely affected by spam, search engines typically have policies against cloaking and some kinds of dedicated redirection [5, 16, 1].

Google [5] describes cloaking as the situation in which "the webserver is programmed to return different content to Google than it returns to regular users, usually in an attempt to distort search engine rankings." An obvious solution to detect cloaking is, for each page, to calculate whether there is a difference between a copy from a search engine's perspective and a copy from a web browser's perspective.
But in reality this is non-trivial. Unfortunately, it is not enough to know that corresponding copies of a page differ; we still cannot tell whether the page is a cloaking page. The reason is that web pages may be updated frequently, such as on a news website or a blog website, or simply that the web site puts a time stamp on every page it serves. Even if two crawlers were synchronized to visit the same web page at nearly the same moment, some dynamically generated pages may still have different content, such as a banner advertisement that is rotated on each access.

Besides the difficulty of identifying cloaking, it is also hard to tell whether a particular instance of cloaking is considered acceptable or not. We define the cloaking behavior that has the effect of manipulating search engine ranking results as semantic cloaking. Unfortunately, the various search engines may have different criteria for defining unacceptable cloaking. As a result, we have focused on the simpler, more basic task: when we mention cloaking in this paper, we usually refer to the simpler case of whether different content is served to automated crawlers versus web browsers, but not different content to every visitor. We name this syntactic cloaking. So, for example, we will not consider dynamic advertisements to be cloaking.

In order to investigate this issue, we collected two data sets: one is a large data set containing 250,000 pages and the other is a smaller data set containing 47,170 pages. The details of these data sets will be given in Section 3. We manually examined a number of samples of those pages and found several different kinds of cloaking techniques. From this study we make an initial proposition toward building an automated cloaking detection system. Our hope is that these results may be of use to researchers to design better and more thorough solutions to the cloaking problem.

Since redirection can also be used as a spamming technique, we also calculated some statistics based on our crawled data for cloaking. Four types of redirection are studied.

Few publications address the issue of cloaking on the Web. As a result, the main contribution of this paper is to begin a discussion of the problem of cloaking and its prevalence in the web today. We provide a view of actual cloaking and redirection techniques. We additionally propose a method for detecting cloaking by using three copies of the same page.

We next review those few papers that mention cloaking. The data sets we use for this study are introduced in Section 3. The results of cloaking and redirection are shown in Sections 4 and 5 respectively. We conclude this paper with a summary and discussion in Section 6.

2 Related Work

Henzinger et al. [8] mentioned that search engine spam is quite prevalent and search engine results would suffer greatly without taking measures. They also mentioned that cloaking is one of the major search engine spam techniques.

Gyöngyi and Garcia-Molina [7] describe cloaking and redirection as spam hiding techniques. They showed that web sites can identify search engine crawlers by their network IP address or user-agent names. They also described the use of refresh meta tags and JavaScript to perform redirection. They additionally mention that some cloaking (such as sending the search engine a version free of navigational links and advertisements but with no change to the content) is accepted by search engines.

Perkins [13] argues that agent-based cloaking is spam. No matter what kind of content is sent to the search engine, the goal is to manipulate search engine rankings, which is an obvious characteristic of search engine spam.

Cafarella and Cutting [4] mention cloaking as one of the spamming techniques.
They said that search engines will fight cloaking by penalizing sites that give substantially different content to different browsers.

None of the above papers discuss how to detect cloaking, which is one aspect of the present work. In one cloaking forum [14], many examples of cloaking and methods of detecting cloaking are proposed and discussed. Unfortunately, these discussions can generally be taken as speculation only, as they lack strong evidence or conclusive experiments.

Najork filed for a patent [12] on a method for detecting cloaked pages. He proposed an idea of detecting cloaked pages from users' browsers by installing a toolbar and letting the toolbar send the signature of user-perceived pages to search engines. His method does not distinguish rapidly changing or dynamically generated Web pages from real cloaking pages, which is a major concern for our algorithms.

3 Data set

Two data sets were examined for our cloaking and redirection testing. For convenience, we name the first data set HITSdata and the second HOTdata.

3.1 First data set: HITSdata

In related work to recognize spam in the form of link farms [15], we collected Web pages in the neighborhood of the top 100 results for 412 queries by following the HITS data collection process [9]. That is, for each query presented to a popular search engine, we collected the top 200 result references, and for each URL we also retrieved the outgoing link set, and up to 100 incoming link pages. The resulting data set contains 2.1M unique Web pages.

From these 2.1M URLs, we randomly selected 250,000 URLs. In order to test for cloaking, we crawled these pages simultaneously from a university IP address (Lehigh) and from a commercial IP address (Verizon DSL).
We set the user-agent from the university address to

    Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)

and the one from the commercial IP to

    Googlebot/2.1 (+http://www.googlebot.com/bot.html)

From each location we crawled our dataset twice with a time interval of one day. So, for each page, we finally have four copies, two of which are from a web browser's perspective and two from a crawler's perspective. For convenience, we name these four copies C1, C2, B1 and B2 respectively. For each page, the time order of retrieval of these four copies is always C1, B1, C2 and B2.
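As a rough illustration of the crawl setup described above, the following Python sketch fetches one crawler copy and one browser copy of a URL by changing only the User-Agent header. The helper names (build_request, fetch, fetch_round) and the timeout are assumptions made for this example; the sketch also does not reproduce the two different source IP addresses used for HITSdata, and it is not the crawler used in the study.

```python
from urllib.request import Request, urlopen

# User-agent strings quoted in the text above.
BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
CRAWLER_UA = "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"


def build_request(url, user_agent):
    """Build a GET request that announces the given user agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent, timeout=10.0):
    """Download the raw bytes of one copy of a page (hypothetical helper)."""
    with urlopen(build_request(url, user_agent), timeout=timeout) as response:
        return response.read()


def fetch_round(url):
    """Fetch a crawler copy first, then a browser copy."""
    crawler_copy = fetch(url, CRAWLER_UA)
    browser_copy = fetch(url, BROWSER_UA)
    return crawler_copy, browser_copy
```

Calling fetch_round once per day on two consecutive days would give four copies in the retrieval order described above (crawler, browser, crawler, browser).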
3.2 Second data set: HOTdata

We also want to know the cloaking ratio within the top response lists for hot queries. The first step is to collect hot queries from popular search engines. To do this, we collected 10 popular queries of Jan 2005 from Google Zeitgeist [6], the top 100 search terms of 2004 from Lycos [10], the top 10 searches for the week ending Mar 11, 2005 from Ask Jeeves [3], and 10 hot searches in each of 16 categories ending Mar 11, 2005 from AOL [2]. This resulted in 257 unique queries from these web sites.

The second step is to collect the top response list for these hot queries. For each of these 257 queries, we retrieved the top 200 responses from the Google search engine. The number of unique URLs is 47,170. Like the first data set, we downloaded four copies for each of these 47,170 URLs, two from a browser's perspective and two from a crawler's perspective. But all these copies were downloaded from machines with a university IP address. For convenience, we name these four copies HC1, HB1, HC2 and HB2 respectively. This order also matches the time order of downloading them.

4 Results of Cloaking

In this section, we show the results for the cloaking test.

4.1 Detecting Cloaking in HITSdata

Intuitively, the goal of cloaking is to give different content to a search engine than to normal web browsers. This can be different text or links. We use two techniques to compare versions retrieved by a crawler and a browser: we consider the number of differences in the terms and links used over time to detect cloaking.

As mentioned earlier in Section 1, calculating the difference between pages from the browser's and crawler's viewpoints is not strong enough to tell whether the page does cloaking. Our proposed method is to use three copies of a page, C1, B1 and C2, to decide if it is a cloaking page. In detail, for each URL, we first calculate the difference between C1 and C2 (for convenience, we use NCC to represent this number). Then we calculate the difference between C1 and B1 (for convenience, we use NBC to represent this number).
Finally, if NBC is greater than NCC, then we mark it as a cloaking candidate. The intuition is that the page may change frequently, but if the difference between the browser's copy and the crawler's copy is bigger than the difference between two crawler copies, the evidence may be enough that the page is cloaking.

We used two methods to calculate the difference between pages: the difference in terms used, and the difference in links provided. We describe each below, along with the results obtained.

4.1.1 Term Difference

The first method for detecting cloaking is to use the term difference among different copies. Instead of using all the terms in the HTML files, we used the "bag of words" method for analyzing the web pages, i.e., we parse the HTML file into terms and only count each unique term once no matter how many times this term appears. Thus, each page is marked by a set of words after parsing.

For each page, we first calculated the number of different terms between the copies C1 and C2 (designated NCC as described above). We then calculated the number of different terms between the copies C1 and B1 (designated NBC). We then selected pages that have a bigger NBC than NCC as candidates of cloaking. For this data set, we marked 23,475 candidates of the original 250K data set. The distribution of the difference for these 23,475 pages forms a power-law-like distribution, shown in Figure 1.

Figure 1: Distribution of the difference of NCC and NBC. (Log-log plot; x-axis: difference between NCC and NBC; y-axis: number of URLs.)

To check what threshold for this difference between NCC and NBC is a good indication for real cloaking, we first put the 23,475 URLs into ten different buckets based on the difference value. The range for each bucket and the number of pages within each bucket are shown in Table 1.

Table 1: Buckets of term difference.

    Bucket ID   Range          No. of Pages
    1           [...]          8084
    2           [...] - 10     2287
    3           10 - 20        1938
    4           20 - 40        2065
    5           40 - 80        2908
    6           80 - 160       1731
    7           160 - 320      1496
    8           320 - 640      912
    9           640 - 1280     1297
    10          > 1280         757

Then, from each bucket we randomly selected thirty pages and checked them manually to see how many of these thirty pages are real syntactic cloaking pages within each bucket. The result is shown in Figure 2.

Figure 2: The ratio of syntactic cloaking in each bucket based on term difference.

The trend is obvious in Figure 2: the greater the difference, the higher the proportion of cloaking contained in the bucket. In order to know which is the optimal threshold to choose, we calculated the precision, recall and F-measure based on the range of these buckets. For these three measures, we follow the definitions in [11] and set α to 0.5 in the F-measure formula to give equal weight to recall and precision. Precision is the proportion of selected items that the system got right; recall is the proportion of the target items that the system selected; F-measure is the measure that combines precision and recall. The results of these three measures are shown in Table 2.

Table 2: F-measure for different thresholds based on term difference.

    Threshold   Precision   Recall   F value
    [...]       0.355       1.000    0.502
    [...]       0.423       0.828    0.560
    10          0.480       0.799    0.560
    20          0.534       0.758    0.627
    40          0.580       0.671    0.622
    80          0.633       0.498    0.588
    160         0.685       0.388    0.496
    320         0.695       0.262    0.380
    640         0.752       0.196    0.311
    1280        0.899       0.086    0.157

If we choose F-measure as the criterion, buckets 4 and 5 have the highest values. Since the boundary between buckets 4 and 5 is 40 in Table 1, we can set the threshold to 40 and declare that all pages with a difference above 40 are to be categorized as cloaking pages. In that case, the precision and recall are 0.580 and 0.671 respectively.

From Figure 2, we can make an estimation of what percentage of our 250,000-page set are cloaking pages.
Since we know the total number of pages within each bucket and the number of cloaking pages within the 30 manually checked pages from each bucket, the estimated total number of cloaking pages is obtained by multiplying, for each bucket, the number of pages within the bucket by the ratio of cloaking pages within its 30 checked pages, and summing over the buckets. The result is 7,780, so we expect that we can identify nearly 8,000 cloaking pages (about 3%) within the 250,000 pages.

4.1.2 Link Difference

Similar to term difference, we also analyzed this data set on the basis of link differences. Here link difference means the number of different links between two corresponding pages. First we calculated the link difference between the copies C1 and C2 (termed LCC). We then calculated the link difference between the copies C1 and B1 (termed LBC). Finally, we marked the pages that have a higher LBC than LCC as cloaking candidates. In this way, we marked 8,205 candidates. The frequency of these candidates also approximates a power-law distribution, as with term cloaking. It is shown in Figure 3.

Figure 3: Distribution of the difference of LCC and LBC. (Log-log plot; x-axis: difference between LCC and LBC; y-axis: number of URLs.)

As with term difference, we also put these 8,205 candidates into 10 buckets. The range and number of pages within each bucket are shown in Table 3.

Table 3: Buckets of link difference.

    Bucket ID   Range          No. of Pages
    1           [...]          4415
    2           [...] - 10     787
    3           10 - 20        746
    4           20 - 35        783
    5           35 - 55        441
    6           55 - 80        299
    7           80 - 110       279
    8           110 - 145      182
    9           145 - 185      100
    10          > 185          173

From each bucket, we randomly selected 30 pages and checked manually to see how many of them are real cloaking pages. The result is shown in Figure 4.

Figure 4: The ratio of syntactic cloaking in each bucket based on link difference.

It is obvious that most of the pages from bucket [...] or above are cloaking pages. We also calculated the F values for the thresholds corresponding to the range of each bucket. The result is shown in Table 4.

Table 4: F-measure for different thresholds based on link difference.

    Threshold   Precision   Recall   F value
    [...]       0.479       1.000    0.648
    [...]       0.727       0.700    0.713
    10          0.822       0.627    0.711
    20          0.906       0.520    0.660
    35          0.910       0.340    0.496
    55          0.900       0.236    0.374
    80          0.900       0.167    0.283
    110         0.900       0.104    0.186
    145         0.878       0.060    0.114
    185         0.866       0.038    0.072

We can tell that the second threshold in Table 4 is optimal, with the best F value. Since the number of pages having a link difference is smaller than the number having a term difference, in reality fewer cloaking pages can be found using link difference alone, but the results are more accurate.

4.2 Detecting Cloaking in HOTdata

Based on the experience of manually checking for cloaking pages for the first data set, we attempted to detect syntactic cloaking automatically by using all four copies of each page.

Figure 5: Intersection of the four copies for a web page.

4.2.1 Algorithm of detecting cloaking automatically

Our assumption about syntactic cloaking is that the web site will send something consistent to the crawler but send something different yet still consistent to the browser. So, if there exist terms that only appear in both of the copies sent to the crawler but never appear in any of the copies sent to the browser, or vice versa, it is quite possible that the page is doing syntactic cloaking. Here, when getting the terms out of each copy, we still use the "bag of words" approach, i.e., we replace all the non-word characters within an HTML file with blanks and then get all the words out of it for the intersection operation.
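The bag-of-words step just described, combined with the three-copy comparison of Section 4.1, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and "number of different terms" is interpreted here as the size of the symmetric difference of the two term sets, which the text does not spell out.

```python
import re


def bag_of_words(html):
    """Replace every non-word character with a blank, then keep each
    remaining term once (tag and attribute names survive as terms)."""
    return set(re.sub(r"\W", " ", html).split())


def term_difference(page_a, page_b):
    """Number of terms found in exactly one of the two pages.
    Assumption: the difference is the symmetric set difference."""
    return len(bag_of_words(page_a) ^ bag_of_words(page_b))


def is_cloaking_candidate(c1, c2, b1):
    """Three-copy test: compare crawler-vs-crawler change (NCC)
    with crawler-vs-browser change (NBC)."""
    ncc = term_difference(c1, c2)
    nbc = term_difference(c1, b1)
    return nbc > ncc
```

A page whose crawler copies differ from each other as much as they differ from the browser copy is not marked, which is how the test tolerates frequently updated pages.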
To easily describe our algorithm, the intersection of the four copies is shown as a Venn diagram in Figure 5. We use capital letters to represent each intersection component of the four copies. For example, one area contains content that appears in only one of the four copies and never in the other three; another area is the intersection of the four copies, i.e., the content that appears in all four copies.

The most interesting components to us are areas [...] and G. One of these areas represents terms that appear in both of the browser's copies but never appear in any of the crawler's copies, while the other represents terms that appear in both of the crawler's copies but never appear in any of the browser's copies. So our algorithm for detecting syntactic cloaking automatically is, for each web page, to calculate the number of terms in each of these two areas. If the sum of these two numbers is nonzero, we may mark this page as a cloaking page.

There are false positive examples for this algorithm. As a simple example, suppose there is a dynamic picture on the page, and every time the web server randomly selects one of four JPEG files (a1.jpg to a4.jpg) to serve the request. It may happen that a1.jpg is sent every time our crawler visits this page, but a2.jpg and a3.jpg are sent when our browser visits this page. By our algorithm, the page will be marked as cloaking, but it can easily be verified that this is not the case. So, again, we need a threshold for the algorithm to work more accurately.

For the 47,170 URLs, we found 6466 pages for which the sum of the number of terms in these two areas is greater than 0. Again, we put them into 10 buckets, as shown in Table 5. The third column is the number of pages within each bucket. From each bucket, we randomly selected 10 pages and manually checked to see whether each page is real syntactic cloaking. The accuracy is shown in the fourth column of Table 5.

Table 5: Buckets of unique terms in areas [...] and G.

    Bucket   Range         No. of Pages   Accuracy
    1        [...]         725            40%
    2        [...]         540            30%
    3        [...]         495            30%
    4        [...] - 8     623            40%
    5        8 - 16        650            90%
    6        16 - 32       822            100%
    7        32 - 64       600            100%
    8        64 - 128      741            100%
    9        128 - 256     420            100%
    10       > 256         1120           100%
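The four-copy test described above can be sketched as follows. This is an illustrative Python sketch, not the implementation used in the study: the function names and the default threshold of 0 (the "nonzero sum" rule) are assumptions for the example, and the threshold is left as a parameter because the text notes that a stricter value is needed.

```python
import re


def terms(html):
    """Bag of words: non-word characters become blanks, terms are unique."""
    return set(re.sub(r"\W", " ", html).split())


def cloaking_score(c1, b1, c2, b2):
    """Count terms seen in both crawler copies but in no browser copy,
    plus terms seen in both browser copies but in no crawler copy."""
    tc1, tb1, tc2, tb2 = terms(c1), terms(b1), terms(c2), terms(b2)
    crawler_only = (tc1 & tc2) - (tb1 | tb2)
    browser_only = (tb1 & tb2) - (tc1 | tc2)
    return len(crawler_only) + len(browser_only)


def is_syntactic_cloaking(c1, b1, c2, b2, threshold=0):
    """Mark the page when the score exceeds the chosen threshold."""
    return cloaking_score(c1, b1, c2, b2) > threshold
```

Running this on the rotating-image example from the text (the crawler receives a1.jpg twice, the browser receives a2.jpg and then a3.jpg) gives a score of 1, so the page is flagged under the nonzero rule even though it is not cloaking; raising the threshold filters out such cases.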
We also calculated the F-measure; the results are shown in Table 6.

Table 6: F-measure of different thresholds.

    Threshold   Precision   Recall   F value
    [...]       0.647       1.000    0.785
    [...]       0.703       0.965    0.813
    [...]       0.766       0.952    0.849
    [...]       0.836       0.940    0.885
    8           0.902       0.881    0.891
    16          0.922       0.756    0.831
    32          0.960       0.599    0.738
    64          0.979       0.470    0.635
    128         0.972       0.358    0.523
    256         1.000       0.267    0.422

Since the 4th and 5th buckets have the highest F values in Table 6, we choose the threshold to be the boundary between bucket 4 and bucket 5, i.e., 8. So, our automated cloaking algorithm is revised to only mark pages for which the sum of the two areas is greater than 8 as cloaking pages. So, for our second data set, all pages in bucket 5 to bucket 10 are marked as cloaking pages. Finally, we marked 4,083 pages out of the 47,170 pages, i.e., about 9% of pages from the hot query data set are syntactic cloaking pages.

4.2.2 Distribution of syntactic cloaking within top rankings

Since we have identified 4,083 pages that utilize cloaking, we can now draw the distribution of these cloaking pages within different top rankings. Figure 6 shows the cumulative percentage of cloaking pages within the top 200 response lists returned by Google. As we can see, about 2% of the top 50 URLs, about 4% of the top 100 URLs and more than 8% of the top 200 URLs do utilize cloaking. The ratio is quite high, and the cloaking may be helpful for these pages to be ranked high.

Figure 6: Percentage of syntactic cloaking pages within Google's top responses. (x-axis: top results, 0 to 200; y-axis: cumulative percentage of cloaking pages, 0 to 9.)

Since we retrieved the top 10 hot queries from each of 16 categories from AOL, we can consider the topic of the cloaking pages. Intuitively, some popular categories, such as sports or computers, may contain more cloaking pages in the top ranking list. So we also calculated the fraction of cloaking pages within each category. The results are shown in Figure 7. Some categories, such as Shopping and Sports, are more likely to have cloaked results than other categories.

Figure 7: Category-specific cloaking. (Categories: A. Autos, B. Companies, C. Computing, D. Entertainment, E. Games, F. Health, G. House, H. Holidays, I. Local, J. Movies, K. Music, L. Research, M. Shopping, N. Sports, O. TV, P. Travel.)

4.2.3 Syntactic vs. Semantic cloaking

Not all syntactic cloaking is considered unacceptable to search engines. For example, a page sent to the crawler that doesn't contain advertising content or a PHP session identifier, which is used to distinguish different real users, is not a problem to search engines. In contrast to acceptable cloaking, we define semantic cloaking as cloaking behavior with the effect of manipulating search engine results.

To take one more step in our cloaking study, we randomly selected 100 pages from the 4,083 pages we had detected as syntactic cloaking pages and manually checked the percentage of semantic cloaking among them. In practice, it is difficult to judge whether some behavior is harmful to search engine rankings. For example, some web sites will send a login page to the browser, while sending a full page to the crawler. So, we end up with three categories: acceptable cloaking, unknown and semantic cloaking.

From these 100 pages, we classified 33 pages as semantic cloaking, 32 as unknown and 35 as acceptable cloaking.

4.3 Different types of cloaking

In the process of manually checking 600 pages for the above sections, we found several different types of cloaking.
4.3.1 Types of term cloaking

We identified many different methods of sending different term content to crawlers and web browsers. They can be categorized by the magnitude of the difference.

We first consider the case in which the content of the pages sent to the crawler and web browser are quite different.

- The page provided to the crawler is full of detail, but the one to the web browser is empty, or only contains frames or JavaScript.
- The web site sends a text page to the crawler, but sends non-text content (such as Macromedia Flash content) to the web browser.
- The page sent to the crawler incorporates content, but the one sent to the web browser contains only a redirect or a 404 error response.

The second case is when content differs only partially between the pages sent to the crawler and the browser and the remaining content is identical, or one copy has slightly more content than the other.

- The pages sent to the crawler contain more text content than the ones to the web browser. For example, only the page sent to the crawler contains the keywords shown in Figure 8.
- Different redirection target URLs are contained in the pages sent to the crawler and to the web browser.
- The web site sends different titles, meta-descriptions or keywords to the crawler than to the web browser. For example, the header to the browser uses "Shape of Things movie info at Video Universe" as the meta-description, while the one to the crawler uses "Great prices on Shape of Things VHS movies at Video Universe. Great service, secure ordering and fast shipping at everyday discount prices."
- The page sent to the crawler contains JavaScript, but no such JavaScript is sent to the browser, or the pages have different JavaScripts sent to the crawler than to the web browser.
    game computer games PC games console games video games computer action games adventure games role playing games simulation games sports games strategy games contest contests prize prizes game cheats hints strategy computer games PC games computer action games adventure games role playing games Nintendo Playstation simulation games sports games strategy games contest contests prize prizes game computer games PC games computer action games adventure games role playing games simulation games sports games strategy games contest contests prize prizes.

Figure 8: Sample of keywords content only sent to the crawler.

- Pages to the crawler do not contain some banner advertisements, while the pages to the web browser do.
- The NOSCRIPT element is used to define alternate content if a script is not executed. The page sent to the web browser has the NOSCRIPT tag, while the page sent to the crawler does not.

4.3.2 Types of link cloaking

For link cloaking, we again group the situations by the magnitude of the differences between different versions of the same page. In one case, both pages contain a similar number of links; in the other, the two pages have quite different numbers of links.

For the first situation, examples found include:

- There are the same number of links within the page sent to the crawler and web browser, but the corresponding link pairs have a different format. For example, the link to the web browser may contain a PHP session id while the link to the crawler does not. Another example is that the page to the crawler only contains absolute URLs, while the page to the browser contains relative URLs that are in fact pointing to the same targets as the absolute ones.
- The links in the page to the crawler are direct links, while the corresponding links within the page to the web browser are encoded redirections.
- The links to the web browser are normal links, but the links to the crawler are around small images instead of texts.
- The website shows links to different style sheets to the web browser than to the crawler. For example, the page to the crawler contains "href=/styles/styles_win_ie.css", while the page to the browser contains "href=/styles/styles_win_ns.css".

In some cases, the number of links within the page to the crawler and the page to the web browser can be quite different.

- More links exist in the page sent to the crawler than in the page sent to the web browser. For example, these links may point to a link farm.
- The page sent to the web browser has more links than the page sent to the crawler. For example, these links may be navigational links.
- The page sent to the browser contains some normal links, but in the same position of the page sent to the crawler, only error messages saying "no permission to include links" exist.

From the results shown within this section, it is obvious that cloaking is not rare in the real Web. It happens more often for hot queries or popular topics.

5 Results of Redirection

As we have discussed in Section 1, redirection can also be used as a spamming technique. To get an insight into how often redirection appears and the distribution of different redirection methods, we use the HITSdata set mentioned in Section 3. We don't use all four copies but only compare two copies for each page: one from the simulated browser's set (BROWSER) and the other from the crawler's set (CRAWLER).

5.1 Distribution

We check the distribution of four different types of redirection: HTTP 301 Moved Permanently and 302 Moved Temporarily responses, the HTML meta refresh tag, and the use of JavaScript to load a new page.

In order to know the distribution of the above four different redirects, we tabulated the number of appearances of each type. For the first two types, the situation is simple: we just count the pages with a response status of "301" and "302". The last two are more complicated; the HTML meta refresh tag does not necessarily mean a redirection, and JavaScript is even more complicated for redirection purposes. For the first step, we just count the appearances of the "meta http-equiv=refresh" tag for the third type and count the appearances of "location.replace" and "window.location" for the fourth type. The results for this first step are shown in Table 7.

Table 7: Number of pages using different types of redirection.

    Type          CRAWLER   BROWSER
    301           20        22
    302           56        60
    Refresh tag   4230      4356
    JavaScript    2399      2469

To get a more accurate number of appearances of the refresh tag, we examined this further. In reality, the refresh tag may just mean refreshing, not necessarily redirection to another page. For example, the refresh tag may be put inside a NOSCRIPT tag for browsers that do not support JavaScript. To estimate the number of real redirections using this refresh tag, we randomly selected 20 pages from the 4,230 pages that use the refresh tag and checked them manually. We found that 95% of them are real redirections and only 5% are inside a NOSCRIPT tag. Besides, some pages may have their own URL as the target in the refresh tag to keep refreshing themselves. We also counted these numbers. There are 47 pages out of 4,230 pages within the CRAWLER data set and 142 pages out of 4,356 pages within the BROWSER data set that refresh to themselves.

We did one more step for the 4,214 (4,356 - 142) pages that use the refresh tag and refresh to a different page. Usually there is a time value assigned within the refresh tag to show how long to wait before refreshing. If this time is small enough, i.e., 0 or 1 seconds, users cannot see the origin page but are redirected to a new page immediately. We fetched this time value for these 4,214 pages and drew the distribution of different time values in the range from 0 seconds to 30 seconds in Figure 9. More than 50% of these pages refresh to a different URL after 0 seconds and about 10% refresh after 1 second.
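The refresh-tag analysis above (delay value, self-refresh versus redirect, and placement inside NOSCRIPT) can be approximated with Python's standard html.parser module. This is a sketch under assumptions: the class and function names are invented for the example, the content attribute is assumed to follow the common "seconds; url=target" form, and real-world pages may need more tolerant parsing.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshParser(HTMLParser):
    """Collects (delay, target, inside_noscript) for each meta refresh tag."""

    def __init__(self):
        super().__init__()
        self.refreshes = []
        self._noscript_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self._noscript_depth += 1
            return
        attributes = dict(attrs)
        if tag != "meta" or (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        delay_part, _, rest = (attributes.get("content") or "").partition(";")
        try:
            delay = float(delay_part.strip())
        except ValueError:
            return  # malformed delay: skip this tag
        rest = rest.strip()
        target = None
        if rest.lower().startswith("url"):
            target = rest[3:].lstrip().lstrip("=").strip().strip("'\"") or None
        self.refreshes.append((delay, target, self._noscript_depth > 0))

    def handle_endtag(self, tag):
        if tag == "noscript" and self._noscript_depth > 0:
            self._noscript_depth -= 1


def find_refreshes(page_url, html):
    """Describe every meta refresh on a page relative to its own URL."""
    parser = MetaRefreshParser()
    parser.feed(html)
    parser.close()
    results = []
    for delay, target, in_noscript in parser.refreshes:
        absolute = urljoin(page_url, target) if target else page_url
        results.append({
            "delay": delay,
            "target": absolute,
            "self_refresh": absolute == page_url,
            "in_noscript": in_noscript,
        })
    return results
```

A page would then count as an immediate redirect when an entry has self_refresh set to False and a delay of 0 or 1, matching the cutoff discussed in the text.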
To estimate the real distribution of the JavaScript refresh method, we randomly selected 40 pages from the 2,399 pages that were identified in the first step as candidates for using JavaScript for redirection. After manually checking these 40 files, we found that 20% of them are real redirections, 32.5% of them are conditional redirections, and the rest are not for redirection purposes, such as to avoid showing the page within a frame.

Figure 9: Distribution of delays before refresh (x-axis: number of seconds before refresh; y-axis: percentage of pages).

Sometimes the target URL and origin URL are within the same site, while other times they are on different sites. In order to know the percentage of redirections that redirect to the same site, we also analyzed our collected data set for this information. Since JavaScript redirection is complicated, we only count the first three types of redirection here. The sum of the first three types of redirection is 4,306. Within the CRAWLER data set, there are 328 pages within these 4,306 pages redirecting to the same site, while for the BROWSER data set, the number is 453.

5.2 Redirection Cloaking

As we mentioned in Section 4.3, a site may return pages redirecting to different locations for different user agents. We consider this redirection cloaking. We found that there are 153 pairs of pages out of 250,000 pairs that have different response codes for the crawler and the normal browser when redirecting. Usually these web sites will send a 404 or 503 response code to one and a 200 response code to the other. We even found that there are 10 pages that use different redirection methods for the crawler and the normal web browser. For example, they may use 302 or 301 for the crawler, but use a refresh tag with the response code 200 for the normal web browser.

Summary and Discussion

Detection of search engine spam is a challenging research area. Cloaking and redirection are important spamming techniques.

This study is based on a sample of a quarter of a million pages and top responses from a popular search engine to hot queries on the Web. We identified different kinds of cloaking, gave an estimate of the percentage of pages that are cloaked in the sample, and showed an estimate of the distribution of different redirects.

There are four issues that we would like to see addressed in future work. The first is that of bias in the dataset used. Our data sets (pages in or near the top results for queries) do not nearly reflect the Web as a whole. However, it might be argued that they reflect the Web that is important (at least for the purposes of finding pages that might affect search engine rankings through cloaking). The second is that this paper does not address IP-based cloaking, so there are likely pages that do indeed provide cloaked content to the major engines when they recognize the crawling IP. We would welcome the partnership of a search engine to collaborate on future crawls. The final issue is the bottom line. While search engines may be interested in finding and eliminating instances of cloaking, our proposed technique requires three or four crawls. Ideally, a future technique would incorporate a two-stage approach that identifies a subset of the full web that is more likely to contain cloaked pages, so that a full crawl using a browser identity would not be necessary.

Our hope is that this study can provide a realistic view of the use of these techniques and will contribute to robust and effective solutions to the identification of search engine spam.

References

[1] AskJeeves / Teoma Site Submit managed by ineedhits.com: Program Terms, 2005. Online at http://ask.ineedhits.com/programterms.asp.
[2] America Online, Inc. AOL Search: Hot searches, Mar. 2005. http://hot.aol.com/hot/hot.
[3] Ask Jeeves, Inc. Ask Jeeves About, Mar. 2005. http://sp.ask.com/docs/about/jeevesiq.html.
[4] M. Cafarella and D. Cutting. Building Nutch: Open source. ACM Queue, 2, Apr. 2004.
[5] Google, Inc. Google information for webmasters, 2005. Online at http://www.google.com/webmasters/faq.html.
[6] Google, Inc. Google Zeitgeist, Jan. 2005. http://www.google.com/press/zeitgeist/zeitgeist-jan05.html.
[7] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
[8] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2), Fall 2002.
[9] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
[10] Lycos. Lycos 50 with Dean: 2004 web's most wanted, Dec. 2004. http://50.lycos.com/121504.asp.
[11] C. D. Manning and H. Schütze. Foundations of statistical natural language processing, chapter 8, pages 268-269. MIT Press, 2001.
[12] M. Najork. System and method for identifying cloaked web servers, Jul. 10, 2003. Patent Application number 20030131048.
[13] A. Perkins. White paper: The classification of search engine spam, Sept. 2001. Online at http://www.silverdisc.co.uk/articles/spam-classification/.
[14] WebmasterWorld.com. Cloaking, 2005. Online at http://www.webmasterworld.com/forum24/.
[15] B. Wu and B. D. Davison. Identifying link farm spam pages. In Proceedings of the 14th International World Wide Web Conference, Industrial Track, May 2005.
[16] Yahoo! Inc. Yahoo! Help - Yahoo! Search, 2005. Online at http://help.yahoo.com/help/us/ysearch/deletions/.
