web spam presentation
#1


WEB SPAM


Economic considerations

Search has become the default gateway to the web
Very high premium to appear on the first page of search results
e.g., e-commerce sites
advertising-driven sites


What is web spam

Spamming = any deliberate action solely in order to boost a web page's position in search engine results, regardless of the page's real value
Spam = web pages that are the result of spamming
Spammer = a person who performs spamming
Approximately 10-15% of web pages are spam


Web spam taxonomy

Boosting techniques
Techniques to achieve high relevance or importance for some pages
Hiding techniques
Techniques to hide the adopted boosting techniques from human web users and crawlers



Boosting techniques

Term spamming
Manipulating the text/fields of web pages in order to appear relevant to queries
Link spamming
Creating link structures that boost PageRank

Term spamming

Target algorithms


Techniques
Repetition
of one or a few specific terms e.g., free, cheap
increase relevance for a document with respect to a small number of query terms
Dumping
of a large number of unrelated terms
e.g., copy entire dictionaries
Weaving
Copy legitimate pages and insert spam terms at random positions
Phrase Stitching
Glue together sentences and phrases from different sources
Term spam targets
Body of web page
Title
HTML meta tags
Anchor text
URL



Link spam

There are three kinds of web pages from a spammer's point of view
Inaccessible pages
Accessible pages
e.g., blog comment pages
spammer can post links to his pages
Own pages
Completely controlled by spammer
May span multiple domain names



Target algorithms

Hypertext Induced Topic Search (HITS) algorithm
Hub scores
A spammer should add many outgoing links to the target page t to increase its hub score
Authority scores
Having many incoming links from presumably important hubs increases the authority score
PageRank
Uses incoming link information to assign numerical weights to all pages on the web
The numerical weight assigned to any given page E is also called the PageRank of E, denoted PR(E)
Spammers manipulate the algorithm using links

Targets and Techniques

Outgoing links
Directory cloning
Incoming links
Create a honey pot
Infiltrate a web directory
Post links to unmoderated message boards or guest books
Participate in link exchange
Create own spam farm


Link Farms

A link farm is a densely connected set of pages, created explicitly to deceive a link-based ranking algorithm (such as Google's PageRank)
Spammer's goal
Maximize the PageRank of the target page t
Technique
Get as many links from accessible pages as possible to target page t
Construct a link farm to get a PageRank multiplier effect


Link Farms

One of the most common and effective organizations for a link farm
Hiding techniques
Content hiding
Use same color for text and page background
Spam terms or links on a page can be made invisible when the browser renders the page
Cloaking
Return different page to crawlers and browsers
Redirection
Alternative to cloaking
Redirects are followed by browsers but not crawlers

Detecting Web Spam

Term spamming
Analyze text using statistical methods
Similar to email spam filtering
Also useful: detecting approximate duplicate pages (see the shingling sketch after this list)
Link spamming
Open research area
One approach: TrustRank
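
A minimal sketch of the duplicate-page signal mentioned above, assuming a toy word-shingling scheme and plain Jaccard similarity; production systems hash shingles (e.g., minhash) and tune the shingle length and thresholds.

def shingles(text, k=4):
    # return the set of k-word shingles of a text
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

page1 = "buy cheap cameras and cheap lenses online today with free shipping on all orders"
page2 = "buy cheap cameras and cheap lenses online now with free shipping on all orders"
page3 = "introduction to graph algorithms and data structures for beginners"

# near-duplicate pages score far above unrelated pages
print(jaccard(shingles(page1), shingles(page2)))  # ~0.47
print(jaccard(shingles(page1), shingles(page3)))  # 0.0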



TrustRank idea

Basic principle: approximate isolation
It is rare for a good page to point to a bad (spam) page
Sample a set of seed pages from the web
Have an oracle (human) identify the good pages and the spam pages in the seed set
Expensive task, so must make seed set as small as possible


Trust Propagation

Call the subset of seed pages that are identified as good the trusted pages
Set trust of each trusted page to 1
Propagate trust through links (sketched below)
Each page gets a trust value between 0 and 1
Use a threshold value and mark all pages below the trust threshold as spam
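
To make the propagation concrete, here is a minimal sketch that treats trust propagation as a biased PageRank in which the random jump returns only to the trusted seed pages. The toy graph, the damping factor beta = 0.85, and the iteration count are illustrative assumptions, not the exact TrustRank formulation.

def trust_rank(graph, trusted_seeds, beta=0.85, iterations=50):
    # graph: dict mapping each page to the list of pages it links to
    pages = set(graph) | {q for links in graph.values() for q in links}
    n_seeds = len(trusted_seeds)
    # start with all trust concentrated on the trusted seed pages
    trust = {p: (1.0 / n_seeds if p in trusted_seeds else 0.0) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            outlinks = graph.get(p, [])
            if outlinks:
                share = beta * trust[p] / len(outlinks)  # trust splitting
                for q in outlinks:
                    nxt[q] += share
        for s in trusted_seeds:
            nxt[s] += (1.0 - beta) / n_seeds  # jump back to the seeds
        trust = nxt
    return trust

web = {
    "seed":  ["good1", "good2"],
    "good1": ["good2"],
    "good2": ["seed"],
    "spam1": ["spam2"],   # no trusted page points into the spam cluster
    "spam2": ["spam1"],
}
scores = trust_rank(web, trusted_seeds={"seed"})
print({p: round(scores[p], 3) for p in sorted(scores)})

Marking every page whose score falls below a chosen threshold as likely spam then implements the last step above; the spam cluster in the toy graph ends up with zero trust.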



Picking the seed set

Human has to inspect each seed page, so seed set must be as small as possible
Must ensure every good page gets adequate trust, so we need to make all good pages reachable from the seed set by short paths


Approaches to picking seed set

Suppose we want to pick a seed set of k pages
PageRank
Pick the top k pages by PageRank
Assume that high-PageRank pages are close to other highly ranked pages
We care more about good pages with high PageRank


Rules for trust propagation

Trust attenuation
The degree of trust conferred by a trusted page decreases with distance
Trust splitting
The larger the number of outlinks from a page, the less scrutiny the page author gives each outlink
Trust is split across outlinks


Fighting against web spam

Identify instances of spam
Prevent spamming
Counterbalance
THANK YOU
#2


Chapter 1
INTRODUCTION TO WEB SPAM

Web spamming refers to actions intended to mislead search engines and give some pages higher ranking than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results.

Today, more and more people rely on the wealth of information available on the World Wide Web, and thus, increased exposure on the web may yield significant financial gains for organizations. Often, search engines are the entryways to the web. That is why some people try to mislead search engines, so that their pages rank high in search results, and thus, capture user attention. Hence, just as with emails, we can talk about attempts of spamming the content of the web. The outcome is that the quality of search results decreases.

To provide quality services, it is critical for search engines to address web spam. Search engines currently fight spam with a variety of often manual techniques, but as far as we know, they still lack a fully effective set of tools for combating it. We believe that the first step in combating spam is understanding it, that is, analyzing the techniques the spammers use to mislead search engines. A proper understanding of spamming can then guide the development of appropriate countermeasures.
Chapter 2
WEB SPAM

2.1 Definition

The objective of a search engine is to provide high-quality results by correctly identifying all web pages that are relevant for a specific query, and presenting the user with the most important of those relevant pages. Relevance refers to the textual similarity between the query and a page. Pages can be given a query-specific, numeric relevance score; the higher the number, the more relevant the page is to the query. Importance refers to the global (query independent) popularity of a page, as often inferred from the page link structure (e.g., pages with many inlinks are more important), or perhaps other indicators. In practice, search engines usually combine relevance and importance, computing a combined rank score that is used to order query results presented to the user.
We use the term spamming (also, spamdexing) to refer to any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page's true value. We will use the adjective spam to mark all those web objects (page content items or links) that are the result of some form of spamming. People who perform spamming are called spammers.
One can locate on the World Wide Web a handful of other definitions of web spamming. For instance, some of the definitions are close to ours, stating that any modification done to a page solely because search engines exist is spamming. Specific organizations or web user groups define spamming by enumerating some of the techniques. An important voice in the web spam area is that of search engine optimizers (SEOs). Most SEOs claim that spamming is only increasing relevance for queries not related to the topic(s) of the page. At the same time, many SEOs endorse and practice techniques that have an impact on importance scores to achieve what they call ethical web page positioning or optimization. According to our definition, all types of actions intended to boost ranking, without improving the true value of a page, are considered spamming.
There are two categories of techniques associated with web spam. The first category includes the boosting techniques, i.e. methods through which one seeks to achieve high relevance and/or importance for some pages. The second category includes hiding techniques, methods that by themselves do not influence the search engine's ranking algorithms, but that are used to hide the adopted boosting techniques from the eyes of human web users.
Chapter 3
BOOSTING TECHNIQUES

This section presents spamming techniques that influence the ranking algorithms used by search engines.
3.1 Term Spamming
In evaluating textual relevance, search engines consider where on a web page query terms occur. Each type of location is called a field. The common text fields for a page p are the document body, the title, the meta tags in the HTML header, and page p's URL. In addition, the anchor texts associated with URLs that point to p are also considered to belong to page p (anchor text field), since they often describe very well the contents of p. The terms in p's text fields are used to determine the relevance of p with respect to a specific query (a group of query terms), often with different weights given to different fields. Term spamming refers to techniques that tailor the contents of these text fields in order to make spam pages relevant for some queries.
3.1.1 Target Algorithms
The algorithms used by search engines to rank web pages based on their text fields use various forms of the fundamental tf-idf metric used in information retrieval [1]. Given a specific text field, for each term t that is common to the text field and a query, tf(t) is the frequency of that term in the text field. For instance, if the term apple appears 6 times in a document body that is made up of a total of 30 terms, tf(apple) is 6/30 = 0.2. The inverse document frequency idf(t) of a term t is related to the number of documents in the collection that contain t. For instance, if apple appears in 4 out of the 40 documents in the collection, its idf(apple) score will be 10. The tf-idf score of a page p with respect to a query q is then computed over all common terms t:
tf-idf(p, q) = Σ_{t ∈ p, t ∈ q} tf(t) · idf(t)
With tf-idf scores in mind, spammers can have two goals: either to make a page relevant for a large number of queries (i.e., to receive a non-zero tf-idf score), or to make a page very relevant for a specific query (i.e., to receive a high tf-idf score). The first goal can be achieved
by including a large number of distinct terms in a document. The second goal can be achieved by repeating some targeted terms. (We can assume that spammers cannot have real control over the idf scores of terms. Thus, the only way to increase the tf-idf scores is by increasing the frequency of terms within specific text fields of a page.)
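
The following minimal sketch computes such a score for a single text field. The plain N/df form of idf (matching the apple example above, 40/4 = 10) and whitespace tokenization are simplifying assumptions; real engines weight fields differently and use smoothed, log-scaled variants.

def tf(term, field_terms):
    return field_terms.count(term) / len(field_terms)

def idf(term, collection):
    docs_with_term = sum(1 for doc in collection if term in doc)
    return len(collection) / docs_with_term if docs_with_term else 0.0

def tfidf_score(page_terms, query_terms, collection):
    # sum tf(t) * idf(t) over the terms common to the page field and the query
    common = set(page_terms) & set(query_terms)
    return sum(tf(t, page_terms) * idf(t, collection) for t in common)

# The apple example: tf = 6/30 = 0.2, idf = 40/4 = 10, contribution = 2.0
body = ["apple"] * 6 + ["pie"] * 24                    # 30-term body
collection = [{"apple"}] * 4 + [{"pie"}] * 36          # 40 documents, 4 containing "apple"
print(tfidf_score(body, ["apple", "recipe"], collection))  # 2.0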
3.1.2 Techniques
Term spamming techniques can be grouped based on the text field in which the spamming occurs. Therefore, we distinguish:
Body spam:
In this case, the spam terms are included in the document body. This spamming technique is among the simplest and most popular ones, and it is almost as old as search engines themselves.
Title spam:
Todayâ„¢s search engines usually give a higher weight to terms that appear in the title of a document. Hence, it makes sense to include the spam terms in the document title.
Meta tag spam:
The HTML meta tags that appear in the document header have always been the target of spamming. Because of the heavy spamming, search engines currently give low priority to these tags, or even ignore them completely. Here is a simple example of a spammed keywords meta tag:
<meta name="keywords" content="buy, cheap, cameras, lens, accessories, nikon, canon">
Anchor text spam:
Just as with the document title, search engines assign higher weight to anchor text terms, as they are supposed to offer a summary of the pointed document. Therefore, spam terms are sometimes included in the anchor text of the HTML hyperlinks to a page. Please note that this spamming technique is different from the previous ones, in the sense that the spam terms are added not to the target page itself, but to the other pages that point to the target. As anchor text gets indexed for both pages, spamming it has impact on the ranking of both the source and target pages. A simple anchor text spam is:
<a href="target.html">free, great deals, cheap, inexpensive, cheap, free</a>
URL spam:
Some search engines also break down the URL of a page into a set of terms that are used to determine the relevance of the page. To exploit this, spammers sometimes create long URLs that include sequences of spam terms. For instance, one could encounter spam URLs like:
buy-canon-rebel-300d-lens-case.camerasx.com,
buy-nikon-d100-d70-lens-case.camerasx.com.
Another way of grouping term spamming techniques is based on the type of terms that are added to the text fields. Correspondingly, we have:
Repetition of one or a few specific terms. This way, spammers achieve an increased relevance for a document with respect to a small number of query terms.
Dumping of a large number of unrelated terms, often even entire dictionaries. This way, spammers make a certain page relevant to many different queries. Dumping is effective against queries that include relatively rare, obscure terms: for such queries, it is probable that only a couple of pages are relevant, so even a spam page with a low relevance score would appear among the top results.
Weaving of spam terms into copied contents. Sometimes spammers duplicate text corpora (e.g., news articles) available on the web and insert spam terms into them at random positions. This technique is effective if the topic of the original real text was so rare that only a small number of relevant pages exist. Weaving is also used for dilution, i.e., to conceal some repeated spam terms within the text, so that search engine algorithms that filter out plain repetition would be misled. A short example of spam weaving is: Remember not only airfare to say the right plane tickets thing in the right place, but far cheap travel more difficult still, to leave hotel rooms unsaid the wrong thing at vacation the tempting moment.
Phrase stitching is also used by spammers to create content quickly. The idea is to glue together sentences or phrases, possibly from different sources; the spam page might then show up for queries on any of the topics of the original sentences.
3.2 Link Spamming
Beside term-based relevance metrics, search engines also rely on link information to determine the importance of web pages. Therefore, spammers often create link structures that they hope will increase the importance of one or more of their pages.
3.2.1 Target Algorithms
For our discussion of the algorithms targeted by link spam, we will adopt the following model.
For a spammer, there are three types of pages on the web:
1. Inaccessible pages are those that a spammer cannot modify. These are the pages out of reach; the spammer cannot influence their outgoing links. (Note that a spammer can still point to inaccessible pages.)
2. Accessible pages are maintained by others (presumably not affiliated with the spammer), but can still be modified in a limited way by a spammer. For example, a spammer may be able to add a message to a guest book, and that message may contain a link to a spam site. As infiltrating accessible pages is usually not straightforward, let us say that a spammer has a limited budget of A accessible pages. For simplicity, we assume that at most one outgoing link can be added to each accessible page.
3. Own pages are maintained by the spammer, who thus has full control over their contents. We call the own pages a spam farm. A spammer's goal is to boost the importance of one or more of his or her own pages. For simplicity, say there is a single target page t. There is a certain maintenance cost (domain registration, web hosting) associated with a spammer's own pages, so we can assume that a spammer has a limited budget of O such pages, not including the target page.
With this model we discuss the two well-known algorithms used to compute importance scores based on link information.
HITS
The original HITS algorithm was introduced to rank pages on a specific topic. It is more common, however, to use the algorithm on all pages on the web to assign global hub and authority scores to each page. According to the circular definition of HITS, important hub pages are those that point to many important authority pages, while important authority pages are those pointed to by many hubs. A search engine that uses the HITS algorithm to rank pages returns as query result a blending of the pages with the highest hub and authority scores.
Hub scores can be easily spammed by adding outgoing links to a large number of well known, reputable pages, such as cnn.com or mit.edu. Thus, a spammer should add many outgoing links to the target page t to increase its hub score.
Obtaining a high authority score is more complicated, as it implies having many incoming links from presumably important hubs. A spammer could boost the hub scores of his O pages (once again, by adding many outgoing links to them) and then make those pages point to the target. Links from important accessible hubs could increase the target's authority score even further. Therefore, the rule here is the more the better: within the limitations of the budget, the spammer should have all own and accessible pages point to the target. Non-target own pages should also point to as many other (known important) authorities as possible.
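
A minimal sketch of the hub/authority iteration on a toy graph follows; the example graph, the iteration count, and the normalization are illustrative assumptions. It shows how a page that merely links out to well-known sites acquires a high hub score.

from math import sqrt

def hits(graph, iterations=20):
    # graph: dict mapping each page to the list of pages it points to
    pages = set(graph) | {q for links in graph.values() for q in links}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of the hub scores of the pages pointing to p
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in pages}
        # hub score: sum of the authority scores of the pages p points to
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # normalize so the scores stay bounded
        a_norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# a page that only links out to reputable sites gets a high hub score
web = {"spam_t": ["cnn.com", "mit.edu", "nasa.gov"], "blog": ["cnn.com"]}
hub, auth = hits(web)
print(round(hub["spam_t"], 3), round(auth["cnn.com"], 3))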
PageRank
PageRank uses incoming link information to assign global importance scores to all pages on the web. It assumes that the number of incoming links to a page is related to that page's popularity among average web users (people would point to pages that they find important). The intuition behind the algorithm is that a web page is important if several other important web pages point to it. Correspondingly, PageRank is based on a mutual reinforcement between pages: the importance of a certain page influences and is influenced by the importance of some other pages.
Recent analyses of the algorithm showed that the total PageRank score r_total of a group of pages (at the extreme, a single page) depends on four factors:
r_total = r_static + r_in - r_out - r_sink,
where r_static is the score gained from the static score distribution (random jump); r_in is the score flowing into the pages through the incoming links from external pages; r_out is the score leaving the pages through their outgoing links to external pages; and r_sink is the score loss due to sink pages within the group (i.e., pages without outgoing links).
The previous formula leads us to a class of optimal link structures for our model that maximize the score of the target page. One such optimal structure is presented in Figure 2; it has the nice properties that (1) it makes all own pages reachable from the accessible ones (so that they could be crawled by a search engine), and (2) it contains a minimal number of links.
For this structure, we used the following strategies to maximize the total PageRank score of the spam farm and of page t in particular:
1. Use all available O pages in the spam farm, thus maximizing the static score rstatic.
2. Accumulate the maximum number of A incoming links from accessible pages to the spam farm, thus maximizing the incoming score rin.
3. Suppress links pointing outside the spam farm, thus setting rout to zero.
4. Avoid sink pages within the farm, assuring that every page (including t) has some outlinks. This sets rsink to zero.
Within the spam farm, the link structure maximizes the score of page t by following these rules:
1. Make all accessible and own pages point directly to the target, thus maximizing its incoming score.
2. Add links from t to all other own pages. Without such links, t would lose a significant part of its score by being a sink, and the own pages would be unreachable from outside the spam farm. The resulting short cycles help the score leaving t flow back into it. Note that it would not be wise to create similar cycles between t and the accessible pages, as those would decrease the total score of the spam farm.
Setting up more sophisticated link structures within a spam farm does not improve the ranking of the target page beyond this simple scheme. A spammer can achieve high PageRank by accumulating many incoming links from accessible pages, and/or by creating large spam farms with all the pages pointing to the target. The corresponding spamming techniques are presented next.
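
As a rough illustration of this multiplier effect, the sketch below builds the structure described above (accessible and own pages all pointing to t, and t linking back to the own pages) and runs a simple power-iteration PageRank. The farm sizes, damping factor, and iteration count are illustrative assumptions, not a statement about any particular search engine.

def pagerank(graph, beta=0.85, iterations=60):
    # simple power-iteration PageRank over a dict: page -> list of outlinks
    pages = set(graph) | {q for links in graph.values() for q in links}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - beta) / n for p in pages}
        for p in pages:
            outlinks = graph.get(p, [])
            if outlinks:
                share = beta * rank[p] / len(outlinks)
                for q in outlinks:
                    nxt[q] += share
            else:
                # dangling (sink) page: spread its score uniformly
                for q in pages:
                    nxt[q] += beta * rank[p] / n
        rank = nxt
    return rank

# farm built per the rules above: accessible and own pages point to t,
# and t points back to the own pages, so no score leaks out or sinks
own_pages = ["own%d" % i for i in range(10)]
farm = {"acc%d" % i: ["t"] for i in range(3)}      # accessible pages -> t
farm.update({p: ["t"] for p in own_pages})         # own pages -> t
farm["t"] = list(own_pages)                        # t -> own pages
farm["normal"] = ["acc0"]                          # an ordinary outside page
ranks = pagerank(farm)
print(round(ranks["t"], 3), round(ranks["normal"], 3))  # t far outranks "normal"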
3.2.2 Techniques
We group link spamming techniques based on whether they add numerous outgoing links to popular pages or gather many incoming links to a single target page or group of pages.
Outgoing links
A spammer might manually add a number of outgoing links to well known pages, hoping to increase the page's hub score. At the same time, the most widespread method for creating a massive number of outgoing links is directory cloning: One can find on the World Wide Web a number of directory sites, some larger and better known (e.g., the DMOZ Open Directory, dmoz.org, or the Yahoo! directory, dir.yahoo.com), some others smaller and less famous (e.g., the Librarian's Index to the Internet, lii.org). These directories organize web content around topics and subtopics, and list relevant sites for each. Spammers then often simply replicate some or all of the pages of a directory, and thus create massive outlink structures quickly.
Incoming links
In order to accumulate a number of incoming links to a single target page or set of pages, a spammer might adopt some of the following strategies:
Create a honey pot, a set of pages that provide some useful resource (e.g., copies of some Unix documentation pages), but that also have (hidden) links to the target spam page(s). The honey pot then attracts people to point to it, boosting the ranking of the target page(s). Please note that the previously mentioned directory clones could act as honey pots.
Infiltrate a web directory. Several web directories allow webmasters to post links to their sites under some topic in the directory. It might happen that the editors of such directories do not verify and control link additions strictly, or get misled by a skilled spammer. In these instances, spammers may be able to add to directory pages links that point to their target pages. As directories tend to have both high PageRank and hub scores, this spamming technique is useful in boosting both the PageRank and authority scores of target pages.
Post links to unmoderated message boards or guest books. As mentioned earlier, spammers may include URLs to their spam pages as part of the seemingly innocent messages they post. Without a moderator to oversee the submitted messages, pages of the message board or guest book end up linking to spam.
Participate in link exchange. Oftentimes, a group of spammers sets up a link exchange structure so that their sites point to each other.
Create own spam farm. These days spammers can control a large number of sites and create arbitrary link structures that boost the ranking of some target pages. While this approach was prohibitively expensive a few years ago, today it is very common as the costs of domain registration and web hosting have declined dramatically.
Chapter 4
HIDING TECHNIQUES
It is usual for spammers to conceal the telltale signs (e.g., repeated terms, long lists of links) of their activities. They use a number of techniques to hide their abuse from regular web users visiting spam pages, or from the editors at search engine companies who try to identify spam instances. This section offers an overview of the most common spam hiding techniques.
4.1 Content Hiding
Spam terms or links on a page can be made invisible when the browser renders the page. One common technique is using appropriate color schemes: terms in the body of an HTML document are not visible if they are displayed in the same color as the background. We show a simple example next:
<body bgcolor="white">
<font color="white">hidden text</font>
. . .
</body>
In a similar fashion, spam links can be hidden by avoiding anchor text. Instead, spammers often create tiny, 1×1-pixel anchor images that are either transparent or background-colored:
<a href=target.html><img src=tinyimage.gif></a>
A spammer can also use scripts to hide some of the visual elements on the page, for instance, by setting an element's CSS visibility property to hidden.
4.2 Cloaking
If spammers can clearly identify web crawler clients, they can adopt the following strategy, called cloaking: given a URL, spam web servers return one specific HTML document to a regular web browser, while they return a different document to a web crawler. This way, spammers can present the ultimately intended content to the web users (without traces of spam on the page), and, at the same time, send a spammed document to the search engine for indexing.
The identification of web crawlers can be done in two ways. On one hand, some spammers maintain a list of IP addresses used by search engines, and identify web crawlers based on their matching IPs. On the other hand, a web server can identify the application requesting a document based on the user-agent field in the HTTP request message.
For instance, in the following simple HTTP request message the user-agent name is the one used by the Microsoft Internet Explorer 6 browser:
GET /db pages/members.html HTTP/1.0
Host: www-db.stanford.edu
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
The user-agent names are not strictly standardized, and it is really up to the requesting application what to include in the corresponding message field. Nevertheless, search engine crawlers usually identify themselves by a name distinct from the ones used by traditional web browser applications, in order to allow well-intended, legitimate optimizations. For instance, some sites serve to search engines a version of their pages that is free from navigational links, advertisements, and other visual elements related to the presentation but not to the content. This kind of activity is welcomed by search engines, as it helps them index the useful information.
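
Since cloaking hinges on this client identification, a simple way to probe a URL for it is to fetch the same page under a browser-like and a crawler-like User-Agent and compare the responses. The sketch below does this with placeholder User-Agent strings and an example URL; real detection must also tolerate legitimate per-client differences such as ads or session tokens.

import hashlib
import urllib.request

def fetch(url, user_agent):
    # request the URL while presenting the given User-Agent string
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

def looks_cloaked(url):
    browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"     # placeholder
    crawler_ua = "ExampleCrawler/1.0 (+http://example.org/bot)"  # placeholder
    as_browser = fetch(url, browser_ua)
    as_crawler = fetch(url, crawler_ua)
    # identical content almost certainly means no cloaking; differing content
    # is only a hint and needs further (e.g., structural) comparison
    return hashlib.sha256(as_browser).digest() != hashlib.sha256(as_crawler).digest()

print(looks_cloaked("http://example.com/"))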
4.3 Redirection
Another way of hiding the spam content on a page is by automatically redirecting the browser to another URL as soon as the page is loaded. This way the page still gets indexed by the search engine, but the user will never see it; pages with redirection act as intermediates (or proxies, doorways) for the ultimate targets, which spammers try to serve to users reaching their sites through search engines.
Redirection can be achieved in a number of ways. A simple approach is to take advantage of the refresh meta tag in the header of an HTML document. By setting the refresh time to zero and the refresh URL to the target page, spammers can achieve redirection as soon as the page gets loaded into the browser:
<meta http-equiv="refresh" content="0;url=target.html">
While the previous approach is not hard to implement, search engines can easily identify such redirection attempts by parsing the meta tags. More sophisticated spammers achieve redirection as part of some script on the page, as scripts are not executed by the crawlers:
<script type="text/javascript">
<!--
location.replace("target.html");
-->
</script>
Chapter 5
WEB SPAM STATISTICS
5.1 Statistics
While we have a good understanding of spamming techniques, the publicly available statistical data describing the amount and nature of web spam is very limited. In this section we review some of what is known.
Two papers discuss the prevalence of web spam, presenting results from three experiments. Fetterly et al. [3] manually evaluated sample pages from two different data sets. The first data set (DS1) represented 150 million URLs that were crawled repeatedly, once every week over a period of 11 weeks, from November 2002 to February 2003. The authors retained 0.1% of all crawled pages, chosen based on a hash of the URLs. A manual inspection of 751 pages sampled from the set of retained pages yielded 61 spam pages, indicating a prevalence of 8.1% spam in the data set, with a confidence interval of 1.95% at 95% confidence.
The second data set (DS2) was the result of a single breadth-first search started at the Yahoo! Home page, conducted between July and September 2002. The search covered about 429 million pages. During a later manual evaluation, from a random sample of 1,000 URLs, the authors were able to download 535 pages, of which 37 (6.9%) were spam. A third, independent set of statistics is provided by Gyongyi et al. [5]. In this case, the authors used the complete set of pages crawled and indexed by the AltaVista search engine as of August 2003.
The several billion web pages were grouped into approximately 31 million web sites (DS3), each corresponding roughly to an individual web host. Instead of random sampling, the following strategy was adopted: the authors segmented the list of sites in decreasing PageRank order into 20 buckets. Each of the buckets contained a different number of sites, with PageRank scores summing up to 5 percent of the total PageRank. Accordingly, the first bucket contained the 86 sites with the highest PageRank scores, bucket 2 the next 665, while the last bucket contained 5 million sites that were assigned the lowest PageRank scores. The upper part of Figure 4 shows the size of each bucket on a logarithmic scale. First, an initial sample of 1000 sites was constructed by selecting 50 sites at random from each bucket.
Then, the sample was reduced to 748 existing sites that could be categorized clearly. A manual inspection discovered that 135 (18%) of these sites were spam. The lower part of Figure 4 presents the fraction of spam in each bucket. It is interesting to note that almost 20% of the second PageRank bucket is spam, indicating that some sophisticated spammers can achieve high importance scores. Also, note that there is a high prevalence of spam (almost 50%) in buckets 9 and 10. This fact seems to indicate that "average" spammers can generate a significant amount of spam with mid-range logarithmic PageRank. Table 1 summarizes the results from the three presented experiments. The differences between the reported prevalence figures could be due to an interplay of several factors:
The crawls were performed at different times. It is possible that the amount of spam increased over time.
Different crawling strategies were used.
There could be a difference between the fraction of sites that are spam and the fraction of pages that are spam. In other words, it could be the case that the average number of pages per site is different for spam and non-spam sites.
Classification of spam could be subjective; individuals may have broader or narrower definition of what constitutes spam.
Despite the discrepancies, we can probably safely estimate that 10-15% of the content on the Web is spam. As the previous discussion illustrates, our statistical knowledge of web spam is sparse. It would be of interest to have data not only on what fraction of pages or sites is spam, but also on the relative sizes (as measured in bytes) of spam and non-spam on the Web.
This would help us estimate what fraction of a search engine's resources (disk space, crawling/indexing/query processing time) is wasted on spam. Another important question is how spam evolves over time. Finally, we do not yet know much about the relative frequencies of different spamming techniques, and the co-occurrence patterns between them. It is suspected that currently almost all spammers use link spamming, usually combined with anchor text spamming, but there are no published research results supporting this hypothesis. It is our hope that future research in the field will provide some of the answers.
CONCLUSION
On one hand, it is possible to address each of the boosting and hiding techniques presented in Chapters 3 and 4 separately. Accordingly, one could:
Identify instances of spam, i.e., find pages that contain specific types of spam, and stop crawling and/or indexing such pages. Search engines usually take advantage of a group of automatic or semi-automatic, proprietary spam detection algorithms and the expertise of human editors to pinpoint and remove spam pages from their indices.
Prevent spamming, that is, making specific spamming techniques impossible to use. For instance, a search engineâ„¢s crawler could identify itself as a regular web browser application in order to avoid cloaking.
Counterbalance the effect of spamming. Today's search engines use variations of the fundamental ranking methods that feature some degree of spam resilience.
REFERENCES

1. http://citeseerx.ist.psu.edu/viewdoc/sum....1.59.5304
2. http://ilpubs.stanford.edu:8090/771/
3. http://mpiinf.mpg.de/departments/d5/teac...Masood.pdf
4. http://iwaw.europarchive08/IWAW2008-Benczur.pdf
5. http://airweb.cse.lehigh.edu/2005/gyongyi.pdf
