07-04-2011, 04:37 PM
E-MINING: A NOVEL WEB MINING APPROACH
DEFINITION
It is a technique that mines relevant data regions from a web page.
THE PROPOSED TECHNIQUE
E-Mine – An effective method to mine the data region from a web page automatically
The technique relies on visual information from the rendered page, which enables the system to identify gaps that separate records and so helps to segment data records correctly.
The visual information also contains information about the hierarchical structure of the tags.
Observation of typical web pages shows that the relevant data region occupies the major central part of the page.
SYSTEM OF THE e-Mine TECHNIQUE
HOW THE ALGORITHM WORKS
Determining the height and width of all bounding rectangles.
Identification of the largest rectangle.
Identification of the container within the largest rectangle.
Identification of the data region containing data records within the container.
STEP 1
DETERMINING HEIGHT AND WIDTH OF ALL BOUNDING RECTANGLES
Determine the dimensions of all the bounding rectangles in the web page.
If the dimensions are not specified in the HTML source, the MSHTML parsing and rendering engine of Microsoft Internet Explorer 6.0 can be used to obtain them.
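A minimal Python sketch of the data this step produces. It assumes the rendering engine has already reported pixel coordinates for every tag. The `Box` class, the `dimensions` helper and the sample coordinates are hypothetical and are not part of e-Mine or MSHTML.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Box:
    """Bounding rectangle of one rendered tag, in pixels (hypothetical model)."""
    tag: str
    left: int
    top: int
    right: int
    bottom: int
    children: List["Box"] = field(default_factory=list)

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def dimensions(root: Box) -> List[Tuple[str, int, int]]:
    """Walk the tag tree depth-first and return (tag, width, height) per box."""
    out = []
    stack = [root]
    while stack:
        box = stack.pop()
        out.append((box.tag, box.width, box.height))
        # Push children in reverse so they are visited in document order.
        stack.extend(reversed(box.children))
    return out


# Made-up coordinates standing in for what a rendering engine would report.
page = Box("body", 0, 0, 1000, 800, [
    Box("div#header", 0, 0, 1000, 100),
    Box("table#main", 0, 100, 1000, 700, [Box("tr", 0, 100, 1000, 250)]),
])

for tag, w, h in dimensions(page):
    print(tag, w, h)
```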
STEP 2
IDENTIFICATION OF THE LARGEST RECTANGLE
Based on the height and width of the bounding rectangles obtained in the previous step, we compute the area of each bounding rectangle.
Among these rectangles, we determine the one with the largest area.
PROCEDURE FOR IDENTIFICATION OF LARGEST RECTANGLE
Procedure getMaxRect
Input: <body> of the HTML source
Output: Maximum Rectangle
Maximum Rectangle = none (area 0)
for each child of <body> tag
begin
  Find the coordinates of the bounding rectangle of the child
  if area of the bounding rectangle of child > area of Maximum Rectangle
  then Maximum Rectangle = child
  endif
end
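The procedure above could be sketched in Python roughly as follows. The `Node` class and the sample sizes are assumptions made for illustration. A real implementation would take widths and heights from the rendering engine.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A tag with the size of its bounding rectangle (hypothetical model)."""
    tag: str
    width: int
    height: int
    children: List["Node"] = field(default_factory=list)

    @property
    def area(self) -> int:
        return self.width * self.height


def get_max_rect(body: Node) -> Optional[Node]:
    """Return the direct child of <body> with the largest bounding-rectangle area."""
    max_rect = None
    for child in body.children:
        if max_rect is None or child.area > max_rect.area:
            max_rect = child
    return max_rect  # None when <body> has no children


# Made-up page layout: header, main content, footer.
body = Node("body", 1000, 800, [
    Node("div#header", 1000, 100),
    Node("table#main", 1000, 600),
    Node("div#footer", 1000, 80),
])

print(get_max_rect(body).tag)  # table#main
```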
STEP 3
IDENTIFICATION OF THE CONTAINER WITHIN THE LARGEST RECTANGLE
Once the largest rectangle is obtained, we examine the bounding rectangles of its child tags and select as the container the child whose area exceeds half the area of the largest rectangle.
The largest rectangle within this set is chosen because only it is expected to contain the data records.
Procedure getContainer
Input: Maximum Rectangle, the largest of all bounding rectangles
Output: container
List_of_Children = list of all the child tags of Maximum Rectangle
for each tag in List_of_Children
begin
  if area of bounding rectangle of tag > half the area of Maximum Rectangle
  then container = tag
  endif
end
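A rough Python sketch of the container search, under the same assumptions as the pseudocode: only direct children of the largest rectangle are inspected, and a child qualifies when its area exceeds half of the parent's area. The `Node` class and the example sizes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A tag with the size of its bounding rectangle (hypothetical model)."""
    tag: str
    width: int
    height: int
    children: List["Node"] = field(default_factory=list)

    @property
    def area(self) -> int:
        return self.width * self.height


def get_container(max_rect: Node) -> Optional[Node]:
    """Return the child covering more than half of max_rect, if there is one."""
    container = None
    for child in max_rect.children:
        if child.area > max_rect.area / 2:
            container = child
    return container  # None when no child is large enough


# Made-up layout: a narrow sidebar next to a wide results column.
main = Node("table#main", 1000, 600, [
    Node("td#sidebar", 200, 600),
    Node("td#results", 800, 600),
])

print(get_container(main).tag)  # td#results
```

When no child passes the half-area test, this sketch returns `None`. The source does not say what e-Mine does in that case.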
IRRELEVANT PORTION TO BE FILTERED
STEP 4
IDENTIFICATION OF THE DATA REGION CONTAINING DATA RECORDS WITHIN THE CONTAINER
A filter is used to remove the irrelevant data from the container.
PROCEDURE FOR FILTER
Input: The container obtained from the previous step.
totalHeight = 0
for each child tag within container
  totalHeight += height of the bounding rectangle of child
end for
averageHeight = totalHeight / number of children of container
for each child within container
  if height of child’s bounding rectangle < averageHeight
  then discard child from container
  endif
end for
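The filter above might look like this in Python. It is only a sketch. The `Node` class, the function name `filter_container` and the sample heights are assumptions, and an empty container is left untouched to avoid dividing by zero.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tag with the height of its bounding rectangle (hypothetical model)."""
    tag: str
    height: int
    children: List["Node"] = field(default_factory=list)


def filter_container(container: Node) -> Node:
    """Discard children whose height is below the average child height."""
    if not container.children:
        return container  # nothing to filter
    total_height = sum(child.height for child in container.children)
    average_height = total_height / len(container.children)
    container.children = [
        child for child in container.children
        if child.height >= average_height
    ]
    return container


# Made-up container: three tall data records between two short rows.
results = Node("td#results", 400, [
    Node("nav-row", 20),
    Node("record1", 120),
    Node("record2", 130),
    Node("record3", 110),
    Node("pager-row", 20),
])

filter_container(results)
print([child.tag for child in results.children])
# ['record1', 'record2', 'record3']
```

Here the average height is 400 / 5 = 80, so the two 20-pixel rows are discarded and the three records are kept.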
ADVANTAGES
Overcomes the disadvantages of existing automated approaches, e.g. the MDR algorithm.
The use of visual information enables the system to identify gaps that separate records, which helps to segment data records correctly.
The visual information also contains information about the hierarchical structure of the tags.
DISADVANTAGES
It may extract a large amount of unwanted data.
The relevant data region extracted from a web page may not be of interest to the user.
CONCLUSION
This is a new approach to extracting structured data from web pages.
eMine is a purely visual-structure-oriented method that can correctly identify the data regions.
eMine overcomes the drawbacks of existing methods and performs significantly better than them.