15-03-2011, 03:04 PM
INTRODUCTION
What is web mining?
Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web.
Why web usage mining?
E-commerce
E-business
How to perform web usage mining?
Web server log files were used initially by webmasters and system administrators to answer questions such as:
1. How much traffic is the site getting?
2. How many requests fail?
3. What kinds of errors are being generated?
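These three questions can be answered directly from an access log. The following is a minimal sketch, assuming the log is written in Common Log Format; the sample entries and the function name `summarize_log` are invented for illustration and do not come from a real server.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status bytes
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)$'
)

def summarize_log(lines):
    """Count all requests, failed requests (status >= 400) and errors per status code."""
    total = 0
    errors = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the assumed format
        total += 1
        status = int(match.group("status"))
        if status >= 400:
            errors[status] += 1
    return {"total": total, "failed": sum(errors.values()), "errors": dict(errors)}

# Hypothetical sample entries (assumption, for illustration only).
sample = [
    '10.0.0.1 - - [15/Mar/2011:15:04:01 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [15/Mar/2011:15:04:05 +0000] "GET /missing.html HTTP/1.0" 404 209',
    '10.0.0.1 - - [15/Mar/2011:15:04:09 +0000] "POST /cart HTTP/1.0" 500 512',
    '10.0.0.3 - - [15/Mar/2011:15:04:12 +0000] "GET /products.html HTTP/1.0" 200 1840',
]

summary = summarize_log(sample)
print(summary)
```

Treating every status of 400 or above as a failure is a simplification; a real analysis might separate client errors (4xx) from server errors (5xx).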
TAXONOMY OF WEB MINING
Web content Mining:
Web crawler: The problems a crawler faces when searching Web pages are:
Scale, Variety, Duplicates, Domain Name Resolution
Types of crawler:
1. Traditional Crawler
2. Periodic Crawler
3. Incremental Crawler
4. Focused Crawler
Harvest system:
1. Collector-Internet Service Provider
2. Broker-Index and query interface
Virtual Web View:
This approach is based on a database.
Personalization:
With Web personalization, users can get more information on the Internet faster because Web sites know their interests and needs.
The Web site then uses the database to match users’ needs to the products or information provided at the site, with middleware facilitating the process.
Web Structure Mining
Two techniques for structure mining are:
1. Page Rank: PR is one of the methods Google uses to determine a page’s relevance or importance. The PR value for a page is calculated from the pages that point to it, each weighted by its own PR and its number of outgoing links. PR is displayed on the toolbar of your browser if you have installed the Google toolbar.
Page Rank: The actual page rank for each page, as calculated by Google.
Toolbar PR: The page rank displayed in the Google toolbar in your browser. This ranges from 0 to 10.
Back link: If page A links out to page B, then page B is said to have a “back link” from page A.
Definition by Google:
We assume page A has pages T1…Tn which point to it. The parameter d is a damping factor, which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A.
The PR of a page A is given as follows:
PR(A)=(1-d)+d(PR(T1)/C(T1)+…+PR(Tn)/C(Tn))
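The formula can be evaluated by repeated substitution, starting from PR = 1 for every page. The following is a minimal sketch under stated assumptions: the four-page graph, the function name `page_rank` and the iteration count are hypothetical, and every page is assumed to have at least one outgoing link so that C(T) is never zero.

```python
def page_rank(out_links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over all pages T linking to A."""
    pr = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        new_pr = {}
        for page in out_links:
            # Add up the contribution PR(T) / C(T) of every page T that links to `page`.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in out_links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical link graph: each key links out to the pages in its list.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = page_rank(graph)
for page, value in sorted(ranks.items()):
    print(page, round(value, 4))
```

In this graph page D has no back links, so its PR settles at the minimum value 1 - d, while page C, which has the most back links, ends up with the highest PR.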
2. Important Pages:
A page is important if important pages link to it.
Assume that the Web consists of only three pages, say Netscape, Microsoft and Amazon, with links among them as shown in the original figure (not reproduced here).
In the limit, the solution is n = a = 6/5; m = 3/5. That is, Netscape and Amazon each have the same importance, and twice the importance of Microsoft.
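The stated limit can be checked numerically. Because the link figure is not available here, the sketch below assumes one link structure that is consistent with the solution n = a = 6/5, m = 3/5: Netscape links to itself and to Amazon, Microsoft links to Amazon, and Amazon links to Netscape and Microsoft. That structure and the function name `importance` are assumptions, not taken from the text.

```python
def importance(out_links, iterations=200):
    """Repeatedly let every page split its importance equally among its successors."""
    score = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        new_score = {page: 0.0 for page in out_links}
        for page, successors in out_links.items():
            for target in successors:
                new_score[target] += score[page] / len(successors)
        score = new_score
    return score

# Assumed link structure (not given in the text), chosen to match the stated limit.
links = {
    "Netscape": ["Netscape", "Amazon"],
    "Microsoft": ["Amazon"],
    "Amazon": ["Netscape", "Microsoft"],
}

result = importance(links)
print(result)
```

Starting every page at importance 1 keeps the total at 3, which is why the limit values sum to 6/5 + 3/5 + 6/5 = 3.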
The following problems arise on the real Web:
a. Dead ends: A page that has no successors has nowhere to send its importance. Eventually all importance will “leak out of” the web.
b. Spider traps: A group of one or more pages that have no links out of the group will eventually accumulate all the importance of the web.
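Both problems can be reproduced on a three-page web. The following is a minimal sketch: the link structures are assumptions made for illustration (Microsoft is turned first into a dead end, then into a one-page spider trap), and the function name `propagate` is hypothetical.

```python
def propagate(out_links, iterations=200):
    """Let every page split its importance equally among its successors, repeatedly."""
    score = {page: 1.0 for page in out_links}
    for _ in range(iterations):
        new_score = {page: 0.0 for page in out_links}
        for page, successors in out_links.items():
            # A dead end has no successors, so its importance is passed to nobody.
            for target in successors:
                new_score[target] += score[page] / len(successors)
        score = new_score
    return score

# Assumed structure: Microsoft has no outgoing links (dead end).
dead_end_web = {
    "Netscape": ["Netscape", "Amazon"],
    "Microsoft": [],
    "Amazon": ["Netscape", "Microsoft"],
}

# Assumed structure: Microsoft links only to itself (spider trap).
spider_trap_web = {
    "Netscape": ["Netscape", "Amazon"],
    "Microsoft": ["Microsoft"],
    "Amazon": ["Netscape", "Microsoft"],
}

leaked = propagate(dead_end_web)
trapped = propagate(spider_trap_web)
print("dead end total:", sum(leaked.values()))
print("spider trap:", trapped)
```

With the dead end, the total importance shrinks toward zero; with the spider trap, the total stays at 3 but ends up entirely on Microsoft.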
Web usage Mining:
Web usage mining has three activities given below:
Preprocessing activities center on reformatting the web log data before processing.
Pattern discovery activities form the major portion of the mining activities, because they look for hidden patterns within the log data.
Pattern analysis is the process of examining and interpreting the results of the discovery activities.
This application differs substantially from traditional data mining applications such as the “Goods Basket” model. The problem can be viewed from two aspects:
1. Weak Relations between user and site
2. Complicated behaviors
WEB MINING ARCHITECTURE
WebMiner system:
This system divides the Web usage mining process into three main parts: preprocessing, pattern discovery and pattern analysis. Its input data are the server logs (access, referrer and agent) and the HTML files that make up the site.
Data cleaning
Transaction Identification
Data integration
User identification
Session identification
Preprocessing:
Preprocessing in web usage mining includes the following:
Collection of usage data for web visitors: Some services require user registration.
User identification: With registration it is easy to identify different users, but private registration information may be misused by hackers.
Session construction: A session is a visit. Two time constraints are needed for session construction: neither the time gap between any two consecutively accessed pages nor the duration of any session may exceed a defined threshold.
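The two time constraints used in session construction can be sketched as follows. This is a minimal illustration, not the method of any particular system: the function name `build_sessions`, the threshold values (a 600-second gap and an 1800-second duration) and the sample access times are all assumptions.

```python
def build_sessions(timestamps, max_gap=600, max_duration=1800):
    """Split one user's sorted access times (in seconds) into sessions.

    A new session starts when the gap to the previous access exceeds max_gap,
    or when adding the access would make the session longer than max_duration.
    """
    sessions = []
    current = []
    for t in timestamps:
        if current and (t - current[-1] > max_gap or t - current[0] > max_duration):
            sessions.append(current)
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# Hypothetical access times for a single user, in seconds since the first request.
accesses = [0, 120, 400, 2000, 2100, 2500, 3000, 3500, 3900]

sessions = build_sessions(accesses)
for number, session in enumerate(sessions, start=1):
    print(number, session)
```

Here the first break is caused by the gap constraint (400 to 2000) and the second by the duration constraint (2000 to 3900), so the sample splits into three sessions.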
Behavior recovery:
User behavior is recovered from the session for this user and defined as b = (S’, R), where S’ is the set of unique pages accessed in session S and R is the relation among them. An example session:
<0,292,300,304,350,326,512,510,512,515,513,292,319,350,517,286>
Two kinds of behavior can be recovered.
The first represents user behavior with only the unique accessed pages.
S’= <0,292,300,304,326,510,512, 513,515,319,350,517,286>
The second represents user behavior with the unique accessed pages and also the access sequence among these pages.
<0-292-300-304-350-326-512-510-513-515-319-517-286>
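Both kinds of behavior can be derived from the example session with a few lines of code. The following is a minimal sketch; the function names are hypothetical, and ordering pages by first access is only one reading of "access sequence", so the result need not match the sequence listed above element for element.

```python
# Example session from the text: page identifiers in the order they were accessed.
session = [0, 292, 300, 304, 350, 326, 512, 510, 512, 515, 513, 292, 319, 350, 517, 286]

def unique_pages(session):
    """First kind of behavior: the distinct pages, ignoring order."""
    return set(session)

def access_sequence(session):
    """Second kind of behavior: the distinct pages, kept in order of first access."""
    seen = set()
    ordered = []
    for page in session:
        if page not in seen:
            seen.add(page)
            ordered.append(page)
    return ordered

pages = unique_pages(session)
sequence = access_sequence(session)
print(sorted(pages))
print("-".join(str(page) for page in sequence))
```

The session contains 16 accesses but only 13 distinct pages, because pages 512, 292 and 350 are each visited twice.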
Applications
Intelligent Web services
Log analysis for security applications
Contextual information access and retrieval
Recommendation and personalization systems
Fraud and misuse detection, such as credit-card fraud and network intrusion detection.
Services:
User Modeling and Profiling
Enabling Technologies
Web content, usage, structure mining
Conclusion:
In this paper we proposed a definition of Web mining and developed a taxonomy of the various ongoing efforts related to it.
Companies find a new and better way to do business.
However, an e-business cannot simply build a web site and then sit back and expect to reap the benefits; in most cases that approach is fruitless.
Companies have to implement Web mining systems to understand their customers’ profiles and to identify the strengths and weaknesses of their e-marketing efforts on the web, so that they can improve continuously.