Signed Approach for Mining Web Content Outliers
#1

Abstract

The emergence of the Internet has revolutionized information storage and retrieval. Because most data on the web is unstructured and contains a mix of text, video, audio, etc., information must be mined to meet the specific needs of users without losing important hidden information. Developing user-friendly, automated tools that provide relevant information quickly is therefore a major challenge in web mining research. Most existing web mining algorithms concentrate on finding frequent patterns while neglecting the less frequent ones, which are likely to contain outlying data such as noise and irrelevant or redundant data. This paper focuses on a signed approach with full-word matching against an organized domain dictionary for mining web content outliers. The signed approach identifies both the relevant web documents and the outlying ones. Because the dictionary is organized by the number of characters in a word, searching and retrieving documents takes less time and less space.

Presented By
G. Poonkuzhali, K. Thiagarajan, K. Sarukesi and G. V. Uma


I. INTRODUCTION

With the exponential growth of information available on the Internet, updating incoming data and retrieving relevant information from the web quickly and efficiently is a growing concern. Most web search engines employ conventional information retrieval and data mining techniques to automatically discover useful and previously unknown information from web content. With the enormous growth of the web, users easily get lost in its rich hyperstructure. In addition, because most data on the web is unstructured and contains a mix of text, video, audio, etc., information must be mined to meet the specific needs of users [9]. Efforts are being made to make such data available, usually in some structured form, such as matrix form, for further manipulation.

Author affiliations:
G. Poonkuzhali is Assistant Professor in the Department of Computer Science and Engineering, Rajalakshmi Engineering College, affiliated to Anna University Chennai, Tamil Nadu, India; phone: 9444836861; email: Kuzhal_s[at]yahoo.co.in
K. Thiagarajan is Senior Lecturer in the Department of Mathematics, Rajalakshmi Engineering College, affiliated to Anna University Chennai, Tamil Nadu, India; email: vidhyamannan[at]yahoo.com
K. Sarukesi is Vice Chancellor of Hindusthan University, Chennai; email: profsaru[at]yahoo.com
G. V. Uma is Professor in the Department of Computer Science and Engineering, Anna University Chennai; email: gvuma[at]annauniv.edu

Web mining is an emerging research area focused on resolving these problems. The proposed work aims to develop a new methodology for mining useful knowledge from web documents quickly and effectively. In general, web mining tasks fall into three major categories: web structure mining, web usage mining and web content mining. Web structure mining tries to discover useful knowledge from the structure of hyperlinks.
Web usage mining refers to the discovery of user access patterns from web usage logs. Web content mining aims to extract useful information from web pages based on their contents [1],[4],[10],[11]. Web content mining approaches fall into two groups: those that directly mine the content of documents and those that improve on the content search of other tools such as search engines. For web content mining, the data can be image, audio, text and video [15]-[16]. Existing web mining algorithms do not consider documents whose contents vary from others in the same category; such documents are called web content outliers.

Generally, outliers are data that clearly deviate from the rest, disobey the general mode or behavior of the data and disagree with other existing data. Outliers may also reflect true properties of the data, such as rare disastrous weather recorded in a meteorological database, which often has one or more properties whose values deviate seriously from normal values. Such data may, however, contain more valuable information than normal data. Research on outlier detection broadly falls into the following categories:

A. Distribution-based methods, developed by the statistics community, assume a known distribution model and flag as outliers the points that deviate from the model.
B. Depth-based algorithms organize objects in convex hull layers in the data space according to peeling depth; outliers are expected to have shallow depth values [13].
C. Deviation-based techniques detect outliers by examining the characteristics of objects and identify as an outlier any object that deviates from these features.
D. Distance-based algorithms rank all points by their distance from their k-th nearest neighbor; the top n points in the ranked list are identified as outliers. Alternative approaches compute the outlier factor as the sum of distances from the k nearest neighbors.
E. Density-based methods rely on the local outlier factor (LOF) of each point, which depends on the local density of its neighborhood. Points with a high factor are flagged as outliers.

Unlike traditional outlier mining algorithms, which are designed only for numeric data sets, web outlier mining algorithms should be applicable to various types of data, including text, hypertext, image, video, etc. Web pages whose contents differ from the category from which they were taken constitute web content outliers [7]-[8]. Web content outlier mining concentrates on finding outliers such as noise and irrelevant or redundant pages among web documents [10]-[11]. It can also be used to identify pages whose contents are entirely different from their parent web sites.

In the proposed system, web documents are extracted from search engines in response to a user query. The resulting set of web documents D is then preprocessed: stop words are removed, words are reduced to their stems, and all data other than text, such as hyperlinks, sound and images, is removed. The output is a set of documents consisting of whitespace-separated words, indexed in a two-dimensional format (i,j), where 'i' represents the web page and 'j' represents the word. Thus the first word of the first web page is indexed as (1,1), the second word of the first page as (1,2), and so on. The domain dictionary is arranged so that all 1-letter words are indexed first, followed by 2-letter words, then 3-letter words, and so on up to 15-letter words, which is a reasonable upper bound on the number of characters in a word. Each page is mined individually using the signed approach to detect relevant and irrelevant documents. Finally, the relevant web documents are obtained, containing the required information that meets the user's needs.
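
The preprocessing, (i,j) indexing and length-organized dictionary described above can be sketched roughly as follows. This is only an illustration under assumptions, not the authors' implementation: the stop-word list, the regex-based tag stripping and all function names are hypothetical, stemming is left out for brevity, and the signed scoring step itself is not shown because this section does not define it.

```python
import re
from collections import defaultdict

# Assumed, illustrative stop-word list; the paper does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}
MAX_WORD_LEN = 15  # upper bound on word length stated in the text


def preprocess(html):
    """Strip markup, lowercase the text and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", html)         # crude tag removal (assumption)
    words = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    return [w for w in words if w not in STOP_WORDS]


def index_documents(pages):
    """Map (i, j) -> word, where i is the page and j the word position (1-based)."""
    index = {}
    for i, page in enumerate(pages, start=1):
        for j, word in enumerate(preprocess(page), start=1):
            index[(i, j)] = word
    return index


def build_dictionary(domain_words):
    """Group domain words by length so a lookup only touches same-length words."""
    by_length = defaultdict(set)
    for word in domain_words:
        word = word.lower()
        if 1 <= len(word) <= MAX_WORD_LEN:
            by_length[len(word)].add(word)
    return by_length


def in_dictionary(word, dictionary):
    """Full-word match against the bucket for this word's length."""
    return word in dictionary.get(len(word), set())


# Hypothetical sample pages and domain words, for illustration only.
pages = [
    "<p>The mining of <a href='x'>web</a> data</p>",
    "<p>Cooking recipes and food</p>",
]
index = index_documents(pages)
dictionary = build_dictionary(["web", "mining", "data", "outlier"])
```

Bucketing the dictionary by word length means that a lookup for a 3-letter word never touches the 7-letter entries, which is one plausible reading of why the organized dictionary saves search time.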


full report
http://wasetjournals/waset/v56/v56-150.pdf
#2
Hello,
I want to implement this paper, but I have some problems and questions:
How can I preprocess the web content?
How can I provide a dataset, and how can I use it?
Please help me.

Thank you
#3

You can refer to the page linked below for details of "Signed Approach for Mining Web Content Outliers":

http://studentbank.in/report-signed-appr...5#pid51835