18-03-2010, 12:35 PM
Geographically Distributed Web Crawler
Introduction
Web crawling is resource intensive, both in processing and in communication. Distributing the crawl across multiple machines spreads the processing load, and placing those machines in different geographic locations can significantly reduce the communication cost. Communication is reduced for two reasons:
• When a crawler near the target web server is chosen, the HTTP fetch of that server's content travels a shorter distance.
• Each crawler can compress the index before sending it back to the central indexing location, whereas the raw content would otherwise have traveled uncompressed over HTTP.
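The two points above can be illustrated with a rough sketch. This is not the authors' implementation: the crawler sites, their coordinates, and the function names are all hypothetical, and great-circle distance is used here only as a simple stand-in for "nearness" (a real system might measure network latency or hop count instead). zlib is likewise just one possible compression choice.

```python
import math
import zlib

# Hypothetical crawler sites: name -> (latitude, longitude) in degrees.
CRAWLERS = {
    "ithaca": (42.44, -76.50),
    "london": (51.51, -0.13),
    "tokyo": (35.68, 139.69),
}

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def nearest_crawler(server_location, crawlers=CRAWLERS):
    """Pick the crawler site geographically closest to the web server."""
    return min(crawlers,
               key=lambda name: haversine_km(crawlers[name], server_location))

def pack_index(index_text):
    """Compress index data before sending it to the central indexer."""
    return zlib.compress(index_text.encode("utf-8"), 9)

def unpack_index(blob):
    """Reverse of pack_index, run at the central indexing location."""
    return zlib.decompress(blob).decode("utf-8")
```

For example, a server located in Paris (about 48.86, 2.35) would be assigned to the "london" crawler in this made-up setup, and a repetitive index string passed through pack_index comes out much smaller than its raw UTF-8 encoding. How much is saved in practice depends on the content being indexed.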
Presented By:
Aseem Bajaj and Emin Gun Sirer
Cornell University, Ithaca, NY
For more, please read:
http://research.yahoofiles/paper_0.pdf
http://aseempapers/GeoDistCrawler.pdf