18-02-2011, 02:06 PM
INTRODUCTION
The “URL TRACKER” is a multithreaded Windows application that downloads and stores Web pages’ Uniform Resource Identifiers (URIs) for a Web search engine. Roughly, a crawler starts by placing an initial set of seed URLs in a queue, where all URLs to be retrieved are kept and prioritized. From this queue, the crawler takes a URL (in some order), downloads the page, extracts any URLs from the downloaded page, and puts the new URLs in the queue. This process is repeated until the crawler decides to stop. The collected pages are later used by other applications, such as a Web search engine or a Web cache.
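The queue-driven loop described above can be sketched as follows. This is a minimal illustration, not the project’s actual source: the class and method names (`CrawlLoopSketch`, `FetchPage`, `ExtractUrls`) and the seed URL are assumptions, and the download step is a placeholder where a real crawler would use `System.Net.WebClient` or `HttpWebRequest`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public class CrawlLoopSketch
{
    // Naive href extraction with a regular expression; a real crawler
    // would use a proper HTML parser.
    public static List<string> ExtractUrls(string html)
    {
        List<string> urls = new List<string>();
        foreach (Match m in Regex.Matches(html, "href=\"([^\"]+)\""))
            urls.Add(m.Groups[1].Value);
        return urls;
    }

    // Placeholder download step standing in for a real HTTP request.
    public static string FetchPage(string url)
    {
        return "<a href=\"http://example.com/a\">a</a>";
    }

    public static void Main()
    {
        Queue<string> queue = new Queue<string>();            // the URL frontier
        Dictionary<string, bool> seen = new Dictionary<string, bool>();
        queue.Enqueue("http://example.com/");                 // seed URL
        seen["http://example.com/"] = true;

        int budget = 10;                                      // stop condition for the sketch
        while (queue.Count > 0 && budget-- > 0)
        {
            string url = queue.Dequeue();                     // take a URL, in some order
            string page = FetchPage(url);                     // download the page
            foreach (string link in ExtractUrls(page))
            {
                if (!seen.ContainsKey(link))                  // enqueue only new URLs
                {
                    seen[link] = true;
                    queue.Enqueue(link);
                }
            }
        }
        Console.WriteLine(seen.Count + " URLs discovered");
    }
}
```

The `seen` dictionary prevents the same URL from being queued twice, which is what keeps the loop from revisiting pages indefinitely.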
As the size of the Web grows, it becomes more difficult to retrieve the whole Web, or even a significant portion of it, with a single process. Therefore, many search engines run multiple crawling processes in parallel so that the download rate is maximized.
PROJECT OVERVIEW
“URL Tracker” aims to provide a user interface that brings up information about a particular given website. It is a multithreaded Windows application that downloads and stores the Uniform Resource Identifiers of a given website. The application is intended as a back-end processing component for a search engine: the results gathered by the website fetcher are passed to the indexer, which indexes the page data so that search queries return results faster. Once implemented, the proposed system can connect to websites and download data which, once indexed, can be supplied to the search engine.
PROJECT DESCRIPTION
The URL Tracker is a multithreaded Windows application that downloads and stores Web pages’ Uniform Resource Identifiers (URIs) for a Web search engine, following the crawling process outlined in the introduction: seed URLs are placed in a prioritized queue, each URL is fetched in turn, new URLs extracted from the downloaded page are enqueued, and the process repeats until the crawler decides to stop. The collected pages are later used by other applications, such as a Web search engine or a Web cache.
As the size of the Web grows, it becomes more difficult to retrieve the whole Web, or even a significant portion of it, with a single process. Therefore, many search engines run multiple processes in parallel to maximize the download rate. We refer to this type of fetcher as a parallel crawler. Applications of this type are often used in search engines, where all the URLs relevant to a query must be collected and indexed by priority.
MODULES
CRAWLER VIEW
This is the primary module, which initiates tracking of URIs from a given URL. First, a URI is given as input; the crawler view then takes that URI and finds the URIs it contains. Found URIs are assigned to threads; if all threads are occupied, the remaining URIs are placed in a queue. When a URI’s data has been fetched, the finished thread is released and a queued URI takes its place.
The module provides two kinds of functionality:
1. Threads view
2. Crawler view.
Threads View and Requests View
First, the system establishes a connection; the user then gives one URL (Uniform Resource Locator) as input. The application starts searching for and fetching information about that URL by spawning threads. In this process, 10 threads run continuously to gather all the URI information and store it in a queue.
While a URI is being downloaded, it is shown in the threads view; once the download completes, the URI is moved to the requests view. If any difficulty or error occurs while fetching a URI belonging to the URL, it is listed in the error view.
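The thread behavior described above—10 workers draining a shared URI queue, each thread ending once the queue is empty—can be sketched like this. The class name, the `RunAll` helper, and the example URIs are illustrative, not taken from the project; the actual download and the per-view bookkeeping are reduced to comments.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public class ThreadsViewSketch
{
    static Queue<string> pending = new Queue<string>();
    static object gate = new object();     // guards the shared queue
    static int fetched = 0;

    // Enqueues `count` URIs, runs 10 worker threads until the queue is
    // drained, and returns how many URIs were processed.
    public static int RunAll(int count)
    {
        fetched = 0;
        lock (gate)
        {
            pending.Clear();
            for (int i = 0; i < count; i++)
                pending.Enqueue("http://example.com/page" + i);
        }

        Thread[] workers = new Thread[10];             // the 10 crawler threads
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = new Thread(Worker);
            workers[i].Start();
        }
        foreach (Thread t in workers)
            t.Join();                                  // wait for every download
        return fetched;
    }

    static void Worker()
    {
        while (true)
        {
            string uri;
            lock (gate)
            {
                if (pending.Count == 0) return;        // queue drained: thread ends
                uri = pending.Dequeue();
            }
            // A real worker would download `uri` here (threads view), then
            // move it to the requests view; any failure would be reported
            // to the error view instead.
            Interlocked.Increment(ref fetched);
        }
    }

    public static void Main()
    {
        Console.WriteLine(RunAll(25) + " URIs fetched");
    }
}
```

The `lock` around the queue is what makes it safe for 10 threads to dequeue concurrently; without it, two threads could dequeue the same URI or corrupt the queue.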
CONFIGURATOR MODULE
MIME Types
This setting specifies which kinds of data to extract from a particular URI, for example whether or not to store text data, binary data, and image information.
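A MIME-type setting of this kind can be sketched as a lookup table: content is stored only if its type is present and enabled in the configuration. The class name, the helper `ShouldStore`, and the particular types enabled below are illustrative assumptions, not the project’s actual configuration format.

```csharp
using System;
using System.Collections.Generic;

public class MimeFilterSketch
{
    // Illustrative configuration: which MIME types the user chose to store.
    static Dictionary<string, bool> allowed = NewConfig();

    static Dictionary<string, bool> NewConfig()
    {
        Dictionary<string, bool> cfg = new Dictionary<string, bool>();
        cfg["text/html"] = true;    // store page text
        cfg["image/png"] = false;   // skip images in this configuration
        return cfg;
    }

    // A type is stored only if it appears in the configuration and is enabled.
    public static bool ShouldStore(string mimeType)
    {
        bool keep;
        return allowed.TryGetValue(mimeType, out keep) && keep;
    }

    public static void Main()
    {
        Console.WriteLine(ShouldStore("text/html"));   // True
        Console.WriteLine(ShouldStore("image/png"));   // False
        Console.WriteLine(ShouldStore("video/mpeg"));  // False (unknown type)
    }
}
```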
Output Settings
This setting specifies the output folder in which the content fetched from the website is stored.
Advanced Settings
These are settings made by the user to restrict certain kinds of websites, for example those whose domain names end in .net or .ac.in.
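A domain restriction like the one just described can be sketched as a suffix check against the URL’s host. The class name, the `IsBlocked` helper, and the blocked-suffix list are assumptions for illustration only.

```csharp
using System;

public class DomainFilterSketch
{
    // Illustrative user configuration: domain suffixes to exclude.
    static string[] blockedSuffixes = { ".net", ".ac.in" };

    // Returns true when the URL's host ends with a blocked suffix.
    public static bool IsBlocked(string url)
    {
        string host = new Uri(url).Host.ToLowerInvariant();
        foreach (string suffix in blockedSuffixes)
            if (host.EndsWith(suffix))
                return true;
        return false;
    }

    public static void Main()
    {
        Console.WriteLine(IsBlocked("http://example.net/page"));   // True
        Console.WriteLine(IsBlocked("http://univ.ac.in/home"));    // True
        Console.WriteLine(IsBlocked("http://example.com/"));       // False
    }
}
```

Checking the parsed `Uri.Host` rather than the raw URL string avoids false matches against suffixes that merely appear somewhere in the path or query string.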
MULTITHREADED DOWNLOADER
The multithreaded downloader is responsible for starting the threads and obtaining the information about the website being fetched. It starts the threads and pushes all URIs into one queue. Each thread begins with one URI from the queue; after completing it, the thread moves on to the next URI in the queue. In this module, a folder is created at the user’s chosen path, and files named after the URIs are created in it to hold the fetched static content.
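The file-saving step above—creating the output folder and writing each URI’s content to a file named after the URI—can be sketched as follows. The class name, the `SavePage` helper, and the sample folder and URI are assumptions; replacing illegal file-name characters with underscores is one simple way to turn a URI into a legal file name.

```csharp
using System;
using System.IO;

public class OutputWriterSketch
{
    // Writes fetched content into `folder`, naming the file after the URI
    // with characters that are illegal in file names replaced by underscores.
    // Returns the full path of the file written.
    public static string SavePage(string folder, string uri, string content)
    {
        Directory.CreateDirectory(folder);           // create the folder if missing
        string name = uri;
        foreach (char c in Path.GetInvalidFileNameChars())
            name = name.Replace(c, '_');
        string path = Path.Combine(folder, name + ".txt");
        File.WriteAllText(path, content);
        return path;
    }

    public static void Main()
    {
        string outputFolder = Path.Combine(Path.GetTempPath(), "UrlTrackerOutput");
        string saved = SavePage(outputFolder, "http://example.com/index.html",
                                "<html>...</html>");
        Console.WriteLine("saved to " + saved);
    }
}
```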
SYSTEM CONFIGURATION
SOFTWARE SPECIFICATIONS
Microsoft .NET Framework
Microsoft Visual C# .NET
Microsoft Windows 2000
Microsoft Visual Studio 2005
HARDWARE SPECIFICATIONS
PROCESSOR : Pentium 4 or higher
RAM : 512 MB
HARD DISK : 1 GB