E-MINE: A novel web mining approach



ABSTRACT

In recent years, government agencies and industrial enterprises have been using the web as a medium of publication. As a result, a large collection of documents, images, text files and other data in structured, semi-structured and unstructured forms is available on the web. Identifying relevant pieces of information has become increasingly difficult, since pages are often cluttered with irrelevant content such as advertisements and copyright notices surrounding the main content. We therefore propose a technique that mines the relevant data regions from a web page. The technique is based on three important observations about data regions on the web.

Introduction

Extracting regularly structured data records from web pages is an important problem, and several attempts have been made to solve it. The main disadvantage of the existing automatic approaches is their assumption that the relevant information of a data record is contained in a contiguous segment of HTML code, which is not always true. We therefore propose a more effective method to mine the data region in a web page. The algorithm, eMine, uses visual cues to find the data regions formed by all types of tags.

Related Work

The main related work on mining data records from a web page is MDR (Mining Data Records). MDR is a well-known approach that directly exploits regularities in the HTML tag structure. The MDR algorithm uses the HTML tag tree of the web page to extract data records from the page. However, misuse of HTML tags can produce an incorrect tag tree, which in turn makes it impossible to extract data records correctly.

The Proposed Technique

We propose a novel and effective method, eMine, to mine the data region from a web page automatically. The basic criterion eMine uses is the location on the screen at which each tag is rendered, i.e. visual information.

How the Algorithm Works

In step 1, the algorithm takes the HTML source of the web page as input. In step 2, we scan the HTML document for tags and identify the height and width of each bounding rectangle, which gives the area of each bounding rectangle. Step 3 finds the largest of all the bounding rectangles. Step 4 identifies the container that holds most of the relevant data region (along with some irrelevant regions). Step 5 identifies the actual relevant data region by filtering out the irrelevant regions.
The following sections provide more details about the individual modules associated with the algorithm.
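
Steps 2 and 3 can be illustrated with a minimal Python sketch. It assumes the bounding rectangles have already been obtained from a rendering engine; the `Rect` type, the `largest_rectangle` helper and the sample coordinates are hypothetical and not part of eMine itself. Steps 4 and 5 are not described here in enough detail to sketch, so they are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Bounding rectangle of a rendered tag (hypothetical helper type)."""
    tag: str
    left: int
    top: int
    width: int
    height: int

    @property
    def area(self) -> int:
        # Step 2: the area follows directly from height and width.
        return self.width * self.height


def largest_rectangle(rects):
    """Step 3: return the bounding rectangle with the largest area."""
    return max(rects, key=lambda r: r.area)


# Made-up sample data standing in for the output of a rendering engine.
rects = [
    Rect("table", 0, 0, 800, 600),     # page-level layout table
    Rect("table", 0, 0, 800, 90),      # banner
    Rect("table", 20, 100, 560, 450),  # candidate data region
    Rect("td", 600, 100, 180, 450),    # sidebar cell
]

largest = largest_rectangle(rects)
print(largest.tag, largest.area)
```

Because each rectangle is reduced to a handful of integers, the selection is a plain numeric comparison rather than a string or tree comparison.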

Determining the Height and Width of All Bounding Rectangles

The first module of the proposed technique determines the dimensions of all the bounding rectangles in the web page. A <table> tag in a web page may specify height and width attributes; where they are present, we extract them. Where they are not specified, the MSHTML parsing and rendering engine of Microsoft Internet Explorer 6.0 can be used, since it gives us the coordinates of a bounding rectangle. We scan the HTML file for tags and, for each tag encountered, determine the coordinates of the bounding rectangle of that tag and plot it.

Conclusion

In this paper, we have proposed a new approach to extracting structured data from web pages. Although the problem has been studied by several researchers, existing techniques make many strong assumptions. eMine is a purely visual, structure-oriented method that can correctly identify the data regions. Most current algorithms fail to determine the data region correctly when it consists of only one data record. Most approaches also fail when a series of data records is separated by an advertisement and followed by a single data record; eMine handles this case correctly. Further, eMine makes its comparisons on numbers, unlike other methods, which compare strings or trees. Thus eMine overcomes the drawbacks of existing methods and performs significantly better than they do.


