An Ontology-Supported Web Focused-Crawler for Java Programs
Abstract
This paper proposes an ontology-supported web focused-crawler, OntoCrawler III, for Java programs: the user only has to enter a few keywords, and the system, supported by a domain ontology, actively compares and verifies those keywords so as to raise the precision and recall rates of webpage searching. The technique has been applied on top of the Google and Yahoo search engines to search for and filter out non-duplicated, relevant Java open-source webpages, which are then downloaded and stored in a database for advanced processing by back-end systems. Preliminary experimental results show that the ontology-supported OntoCrawler III proposed in this paper not only raises the precision and recall rates of webpage searching but also successfully downloads the related webpage information.
1. Introduction
In the era of information explosion on the Internet, finding useful Java programs in the vast stream of online content is like searching for a needle in a haystack, and users are often left at a loss. How to retrieve useful information has therefore become an essential topic for users and even for the search engine vendors themselves [12]. Active webpage searching techniques are thus an important basic component of contemporary information systems.
Google was officially launched in October 1998 and soon became the most satisfying search engine for its users. Google adopted keyword-based querying, which can return an overwhelming number of results even for a small set of keywords. Faced with such long and complicated result lists, users not only spend more time looking through the information, but the query system itself also fails to grasp the users' true query intention. The main cause is that the keywords entered by users are incomplete and cannot clearly express their query demands. Furthermore, many keywords are the same words with different meanings in different fields; when the system does not classify the query request and restrict it to a specific field, it produces many confusing cross-field results [11], which often prevents users from picking out the information they really need. Hence, how to make webpage crawling more precise and how to clarify what information the user really wants have become essential issues between users and information systems.
With the growing popularity of the Internet, people who want to find information have to visit several independent search engines and enter keywords on each of them. To help users obtain useful information and knowledge from the huge amount of Web content in a faster and more effective way, we designed an integrated focused-crawler that relieves users of part of the query workload and supports the core of webpage search systems, thereby improving overall system performance.
To sum up, the main purpose of this paper is to employ ontology techniques to design an ontology of Java programming codes, to construct its ontology classes with the ontology construction tool Protégé [4], and to set up an ontology sharing platform of Java programming keywords on an MS SQL Server database. Finally, we use Java [7] to build OntoCrawler III, an ontology-supported focused-crawler. In other words, introducing keyword comparison and judgment based on the Java programming ontology not only excludes the extra webpages caused by words that are spelled the same but carry different meanings, but also raises the precision and recall rates of webpage queries. In addition, the crawler stores the retrieved Java programs and their related webpage information, saving users a great deal of search time and supporting both the users' query work and the core of webpage search systems.
2. Background Knowledge and Techniques
2.1. Ontology

Ontology originated as a branch of philosophy concerned with the nature of existence and of real objects. In information systems it provides complete semantic models with sharing and reuse characteristics. Describing the structure of knowledge content through an ontology establishes the knowledge core of a specified domain and enables related information to be learned, communicated, accessed, and even used to induce new knowledge automatically; hence, an ontology is a powerful tool for constructing and maintaining an information system [14]. Figure 1 illustrates the ontology structure of Java programming codes, which defines the related basic knowledge of Java together with its conceptual hierarchy and relevant features.
Figure 1. Part of the ontology structure of Java programming codes
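To make the idea concrete, a concept in such an ontology can be thought of as a node that carries a name, a set of synonyms, and its sub-concepts. The small Java sketch below assumes this simple structure; the class and method names (OntologyConcept, matches, and so on) are illustrative only and are not taken from the paper.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one node in the Java-programming ontology:
// a concept name, its synonyms, and its sub-concepts (hierarchy).
public class OntologyConcept {
    private final String name;
    private final List<String> synonyms = new ArrayList<>();
    private final List<OntologyConcept> subConcepts = new ArrayList<>();

    public OntologyConcept(String name) { this.name = name; }

    public void addSynonym(String synonym) { synonyms.add(synonym); }
    public void addSubConcept(OntologyConcept child) { subConcepts.add(child); }

    // True if the keyword matches this concept, one of its synonyms,
    // or any concept below it in the hierarchy.
    public boolean matches(String keyword) {
        if (name.equalsIgnoreCase(keyword)) return true;
        for (String s : synonyms) {
            if (s.equalsIgnoreCase(keyword)) return true;
        }
        for (OntologyConcept child : subConcepts) {
            if (child.matches(keyword)) return true;
        }
        return false;
    }
}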
2.2. Regular Expression
Regular expression is a character queue to describe
specified order. The descriptive style, so to call pattern,
could be used to search matched pattern in another
character queue. Regular expression can use universal
words, set of words, and some quantifiers as
specifying ways [10]. There were two supported
classes for this expression: Pattern and Matcher, and
we would use Pattern to define a Regular expression.
If we want to conduct pattern matching with other
character queue, we would use Matcher. Figure 2
showed an example program adapting regular
expression in this system.
(a) Hyperlink format in HTML
(b) Corresponding regular expression
Figure 2. Example on regular expression
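Since the actual expression of Figure 2 is not reproduced here, the following is a minimal sketch, assuming a simplified pattern that pulls the href value out of HTML anchor tags; it shows how Pattern defines the expression and Matcher scans the character sequence.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractorExample {
    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/Sample.java\">Sample</a>";

        // Simplified pattern for <a href="..."> hyperlinks; the actual
        // expression used by the system (Figure 2) may differ.
        Pattern linkPattern = Pattern.compile(
                "<a\\s+[^>]*href\\s*=\\s*\"([^\"]+)\"",
                Pattern.CASE_INSENSITIVE);

        Matcher matcher = linkPattern.matcher(html);
        while (matcher.find()) {
            System.out.println("Found link: " + matcher.group(1));
        }
    }
}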
2.3. Developing Techniques
The developing tool of this system is Borland
JBuilder. It is an integrated development environment
of Java, which have a fine human-machine interface
and code debugging mechanism to make a fast
integration of each code block when the system was
developed, and accordingly reduce the time of system
development. In addition, Java [7] provides lots of
functions and methods to integrate web applications
and databases. In the view of extensibility, Java is
absolutely the optimal choice for solving the problem
of cross platform.
This system adopts MS SQL Server as the back-end, ontology-based knowledge-sharing database platform. MS SQL Server is one of the most widely used relational database management systems, and SQL (Structured Query Language) is the query language used to retrieve data from it. The ontology construction tool Protégé is free software developed by SMI (Stanford Medical Informatics). Protégé is not only one of the most important platforms for constructing ontologies but also the most frequently adopted one [5,6]. Its most distinctive feature is that it uses multiple components to edit and build ontologies and guides knowledge workers in constructing ontology-based knowledge management systems; furthermore, users can export the ontology to different formats such as RDF(S), OWL, and XML, or store it directly in databases such as MySQL and MS SQL Server, which gives it better support than other tools [15].
3. System Architecture
3.1. Construction of Ontology Database

Current research on ontology can be divided into two branches: one builds large ontologies for a specified domain and uses them to assist knowledge analysis in that domain; the other studies how to construct and express ontologies precisely. In this paper we adopt the former and use a pre-built ontology database of Java programming codes to support the whole system operation. Building this ontology database involves two stages [13,15]: statistics and analysis of the concepts related to Java programming codes, and construction of the ontology database itself. First, we surveyed and analyzed related Java programs to extract the relevant concepts and their synonyms, and then employed the ontology construction tool Protégé to build the ontology. The second stage is the construction of the ontology database of Java programming codes, in which the main work is to transfer the ontology built with Protégé into an MS SQL database so that the system can process it conveniently.
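The paper does not spell out the table layout obtained after transferring the Protégé ontology into MS SQL Server. As an illustration only, the sketch below assumes a hypothetical Concept table with name and synonym columns and shows how the crawler side could look up the synonyms of a keyword over JDBC; the connection string, credentials, and table and column names are all assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: look up the synonyms of a keyword in the
// ontology database exported from Protégé into MS SQL Server.
// The table and column names (Concept, name, synonym), the JDBC URL,
// and the credentials are assumptions, not the paper's actual schema.
public class OntologyLookup {
    private final String url =
            "jdbc:sqlserver://localhost:1433;databaseName=JavaOntology";

    public List<String> findSynonyms(String keyword) throws Exception {
        List<String> synonyms = new ArrayList<>();
        String sql = "SELECT synonym FROM Concept WHERE name = ?";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, keyword);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    synonyms.add(rs.getString("synonym"));
                }
            }
        }
        return synonyms;
    }
}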
3.2. System Structure of OntoCrawler III
Figure 3 illustrates the operational structure of OntoCrawler III; the related techniques and functions of each part are detailed below, and a combined code sketch of these steps is given after the Figure 4 excerpt at the end of this section.
(1) Keyword & Download Directory: contains the
preprocess of executing query webpage,
including to empty the output area, transfer input
characters into URI code and then embedded into
Google’s/Yahoo’s query URL, transfer the input
string of the default downloading location into
the file name of the storage location and empty
that field, and finally the system would remind
users to input related default operations [14].
(2) LinkToGoogle & Yahoo: declares an URL and
add Google/Yahoo query URL on well
transferred URI code, and then used a
BufferedReader to read and used while loop to
add String variable “line” line by line. Finally,
output “line” as text file as final analysis
reference. The file content was the html source
file of the webpage.
(3) RetrieveLinks: uses a regular expression to search the variable "line" for matching URLs. The matched URLs are written to a txt file for the system's further processing.
Figure 3. System architecture of OntoCrawler III (components: Keyword & Download Directory, LinkToGoogle & Yahoo, RetrieveLinks, RetrieveContent, SearchMatches, RemoveHTMLLabel, Ontology Database)
(4) RetrieveContent: because RetrieveContent takes a long time and its execution order could freeze the entire system interface, the crawling work is run in a separate thread, detached from the Swing event-dispatch thread, so that the interface can still be updated while webpages are being queried. A BufferedReader then reads the RetrieveLinks output in a while loop, line by line, meaning one URL per iteration, and actually connects to that URL. After the encoding of the webpage is determined, the HTML source is read with the correct encoding and written out as a text file for further processing. Once these procedures are complete, the SearchMatches method judges whether the webpage falls within the range we wish to query; if so, RemoveHTMLLabel deletes the HTML tags from the source file and keeps only the text content for further processing and analysis. Finally, the number of webpages already processed is divided by the total number of webpages to obtain the progress percentage of the query; the detailed procedure is shown in Figure 4.
CrawlerThread = new Thread(new Runnable() {
    public void run() {
        Crawling = true;
        ProgressBar.setValue(0);
        EnterButton.setText("