data mining full report

DATAMINING
PROJECT REPORT
Submitted by SHYAM KUMAR S, MTHIN GOPINADH, AJITH JOHN ALIAS, RITO GEORGE CHERIAN
CHAPTER 1 INTRODUCTION
1.1 ABOUT THE TOPIC
Data Mining is the process of discovering new correlations, patterns, and trends by digging into (mining) large amounts of data stored in warehouses, using artificial intelligence, statistical and mathematical techniques. Data mining can also be defined as the process of extracting hidden knowledge from large volumes of raw data, i.e. the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Alternative names for data mining include knowledge discovery (mining) in databases (KDD), knowledge extraction and data/pattern analysis.
Data mining is the practice of sorting through large amounts of data and picking out relevant information. It is usually used by business intelligence organizations and financial analysts, but it is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. It has also been described as "the science of extracting useful information from large data sets or databases".
1.2 ABOUT THE PROJECT
The Project has been developed in our college in an effort to identify the most frequently visited sites, the site from where the most voluminous downloading has taken place and the sites that have been denied access when referred to by the users.
Our college uses the Squid proxy server, and our aim is to extract useful knowledge from one of its log files. After a combined scrutiny of the log files, the log named access.log was chosen as the database. Hence our project was to mine the contents of access.log.
Finally the PERL programming language was used for manipulating the contents of the log file. PERL EXPRESS 2.5 was the platform used to develop the mining application.
The log file content is a standard text file, requiring extensive and quick string manipulation to retrieve the necessary contents. The programs were required to sort the mined contents in descending order of frequency of usage and of size.
CHAPTER 2 REQUIREMENT ANALYSIS
2.1 INTRODUCTION
Requirement analysis is the process of gathering and interpreting facts, diagnosing problems and using the information to recommend improvements to the system. It is a problem-solving activity that requires intensive communication between the system users and system developers.
Requirement analysis or study is an important phase of any system development process. The system is studied in the minutest detail and analyzed. The system analyst plays the role of an interrogator and delves deep into the working of the present system. The system is viewed as a whole and the inputs to the system are identified. The outputs from the organization are traced through the various processing steps that the inputs pass through in the organization.
A detailed study of these processes must be made using techniques such as interviews and questionnaires. The data collected from these sources must be scrutinized to arrive at a conclusion. The conclusion is an understanding of how the system functions. This system is called the existing system. The existing system is then subjected to close study and the problem areas are identified. The designer now functions as a problem solver and tries to sort out the difficulties that the enterprise faces. The solutions are given as a proposal.
The proposal is then weighed against the existing system analytically and the best one is selected. The proposal is presented to the user for endorsement. The proposal is reviewed on user request and suitable changes are made. This loop ends as soon as the user is satisfied with the proposal.
2.2 PROPOSED SYSTEM
In order to make the programming strategy optimal, complete and as simple as possible, a detailed understanding of data mining, related concepts and associated algorithms is required. This is to be followed by an effective implementation of the algorithm using the best possible alternative.
2.3 DATA MINING (KDD PROCESS)
The Knowledge Discovery from Data (KDD) process starts from relevant prior knowledge and the goals of the application. It includes creating a large dataset, preprocessing the data (filtering or cleaning), transforming the data, and identifying dimensionality and useful features. It also involves classification, association, regression, clustering and summarization. Choosing the mining algorithm is the most important decision in the process.
The final stage is pattern evaluation, which includes visualization, transformation and removal of redundant patterns, followed by use of the discovered knowledge.
DM Technology and System: Data mining methods include neural networks, evolutionary programming, memory-based programming, decision trees, genetic algorithms and nonlinear regression methods. This work also involves fuzzy logic, a superset of conventional Boolean logic that has been extended to handle the concept of partial truth, that is, truth values between completely true and completely false.
The term data mining is often used to apply to the two separate processes of knowledge discovery and prediction. Knowledge discovery provides explicit information that has a readable form and can be understood by a user. Forecasting, or predictive modeling provides predictions of future events and may be transparent and readable in some approaches (e.g. rule based systems) and opaque in others such as neural networks. Moreover, some data mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery.
Metadata, or data about a given data set, are often expressed in a condensed data mine-able format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.
The importance of collecting data that reflect your business or scientific activities to achieve competitive advantage is now widely recognized. Powerful systems for collecting data and managing it in large databases are in place in all large and mid-range companies.
[Figure: log files -> preprocessing (data cleaning, session identification, data conversion) -> frequent itemset discovery, frequent sequence discovery and frequent subtree discovery (each with a minsup threshold) -> pattern analysis -> results]
Figure 2.3.1 : Process of web usage mining
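The frequent itemset discovery step in the figure can be illustrated with a minimal sketch. It is written in Python rather than the project's Perl, it uses brute-force counting rather than any particular algorithm from the report, and the session data and the names frequent_itemsets and max_size are assumptions made for illustration. An itemset is kept only if it occurs in at least minsup sessions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(sessions, minsup, max_size=2):
    """Return {itemset: count} for itemsets found in at least minsup sessions."""
    counts = Counter()
    for session in sessions:
        pages = sorted(set(session))  # a page counts once per session
        for size in range(1, max_size + 1):
            for itemset in combinations(pages, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= minsup}


# Hypothetical sessions: the sites one user visited in one sitting.
sessions = [
    ["mail", "news", "search"],
    ["mail", "search"],
    ["news", "search"],
    ["mail", "search", "video"],
]
frequent = frequent_itemsets(sessions, minsup=3)
for itemset, count in sorted(frequent.items()):
    print(itemset, count)
```

Brute-force enumeration is only practical for small sessions; for real logs a pruning approach such as Apriori would normally replace the inner loops.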
However, the bottleneck in turning this data into success is the difficulty of extracting knowledge about the system under study from the collected data. Decision support systems (DSS) are computerized tools developed to assist decision makers through the process of making decisions. They are inherently prescriptive, in that they enhance decision making in some way. DSS are closely related to the concept of rationality, which means the tendency to act in a reasonable way to make good decisions. The key decisions for an organization involve the product or service, its distribution through different channels, computation of the output over different times and places, prediction of the output trend for each individual product or service within an estimated time frame and, finally, scheduling of production on the basis of demand, capacity and resources.
The main aim of the work is to develop a system for dynamic decision making that depends on the characteristics of individual product life cycles. Graph analysis has been done to give better insight into the pattern of each product. The system has been reviewed in terms of both local and global aspects.
2.4 WORKING OF DATA MINING
While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major elements:
• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data by application software.
• Present the data in a useful format, such as a graph or table.
Several analysis techniques are commonly used:
• Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID): CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits, while CHAID segments using chi-square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. Sometimes called the k-nearest neighbor technique.
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
• Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
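As an illustration of the nearest neighbor method listed above, the following is a minimal Python sketch (not the project's Perl code). The historical records, the Euclidean distance measure and the majority-vote rule are assumptions chosen for the example; knn_classify is a hypothetical helper name.

```python
import math
from collections import Counter


def knn_classify(history, query, k=3):
    """Label `query` by majority vote among the k closest historical records."""
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset: (features, class label).
history = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"),
    ((5.2, 4.9), "high"),
    ((4.8, 5.1), "high"),
]
label = knn_classify(history, (1.1, 1.0), k=3)
print(label)
```

Sorting the whole history is the simplest approach; for large datasets a spatial index would usually be preferred.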
2.5 DATA MINING ALGORITHMS
The data mining algorithm is the mechanism that creates mining models. To create a model, an algorithm first analyzes a set of data, looking for specific patterns and trends. The algorithm then uses the results of this analysis to define the parameters of the mining model.
The mining model that an algorithm creates can take various forms, including:
• A set of rules that describe how products are grouped together in a transaction.
• A decision tree that predicts whether a particular customer will buy a product.
• A mathematical model that forecasts sales.
• A set of clusters that describe how the cases in a dataset are related.
Microsoft SQL Server 2005 Analysis Services (SSAS) provides several algorithms for use in your data mining solutions. These algorithms are a subset of all the algorithms that can be used for data mining. You can also use third-party algorithms that comply with the OLE DB for Data Mining specification.
Analysis Services includes the following algorithm types:
• Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset. An example of a classification algorithm is the Decision Trees Algorithm.
• Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset. An example of a regression algorithm is the Time Series Algorithm.
• Segmentation algorithms divide data into groups, or clusters, of items that have similar properties. An example of a segmentation algorithm is the Clustering Algorithm.
• Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.
• Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow. An example of a sequence analysis algorithm is the Sequence Clustering Algorithm.
2.6 SOFTWARE REQUIREMENTS
OPERATING SYSTEM : WINDOWS XP SP2
PERL COMPILER : ACTIVE PERL
PERL SCRIPT EDITOR : PERL EXPRESS
SERVER SOFTWARE : IIS SERVER
2.7 FUZZY LOGIC
Fuzzy logic is a form of multi-valued logic derived from fuzzy set theory to deal with reasoning that is approximate rather than precise. Just as in fuzzy set theory the set membership values can range (inclusively) between 0 and 1, in fuzzy logic the degree of truth of a statement can range between 0 and 1 and is not constrained to the two truth values {true, false} as in classic predicate logic. And when linguistic variables are used, these degrees may be managed by specific functions, as discussed below.
Both fuzzy degrees of truth and probabilities range between 0 and 1 and hence may seem similar at first. However, they are distinct conceptually; fuzzy truth represents membership in vaguely defined sets, not likelihood of some event or condition as in probability theory. For example, if a 100-ml glass contains 30 ml of water, then, for two fuzzy sets, Empty and Full, one might define the glass as being 0.7 empty and 0.3 full.
Note that the concept of emptiness would be subjective and thus would depend on the observer or designer. Another designer might equally well design a set membership function where the glass would be considered full for all values down to 50 ml. A probabilistic setting would first define a scalar variable for the fullness of the glass, and second, conditional distributions describing the probability that someone would call the glass full given a specific fullness level. Note that the conditioning can be achieved by having a specific observer that randomly selects the label for the glass, a distribution over deterministic observers, or both. While fuzzy logic avoids talking about randomness in this context, this simplification at the same time obscures what exactly is meant by the statement "the glass is 0.3 full".
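The glass example can be written down as membership functions. The sketch below is in Python and is only one possible design: the linear shape of full and empty follows the 0.3 / 0.7 figures in the text, while the behavior of the alternative function below 50 ml is an assumption, since the text only says the glass counts as full down to 50 ml.

```python
def full(volume_ml, capacity_ml=100.0):
    """Degree of membership in Full: proportional to the fill level."""
    return max(0.0, min(1.0, volume_ml / capacity_ml))


def empty(volume_ml, capacity_ml=100.0):
    """Degree of membership in Empty: the complement of Full."""
    return 1.0 - full(volume_ml, capacity_ml)


def full_alternative(volume_ml, capacity_ml=100.0):
    """Another designer's Full: 1.0 from half capacity upward.

    The linear ramp below half capacity is an assumption."""
    return max(0.0, min(1.0, volume_ml / (capacity_ml / 2.0)))


print(full(30), empty(30))   # approximately 0.3 and 0.7
print(full_alternative(50))  # fully "full" under the second design
```

The same 30 ml reading receives a different degree of fullness under each design, which is the subjectivity the paragraph above describes.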
2.7.1 APPLYING FUZZY TRUTH VALUES
A basic application might characterize sub-ranges of a continuous variable. For instance, a temperature measurement for anti-lock brakes might have several separate membership functions defining particular temperature ranges needed to control the brakes properly. Each function maps the same temperature value to a truth value in the 0 to 1 range. These truth values can then be used to determine how the brakes should be controlled.
In this image, cold, warm, and hot are functions mapping a temperature scale. A point on that scale has three "truth values", one for each of the three functions. The vertical line in the image represents a particular temperature that the three arrows (truth values) gauge. Since the red arrow points to zero, this temperature may be interpreted as "not hot". The orange arrow (pointing at 0.2) may describe it as "slightly warm" and the blue arrow (pointing at 0.8) "fairly cold".
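A minimal Python sketch of three such membership functions follows. The piecewise-linear shapes and the breakpoints (10, 20, 25 and 35 degrees) are assumptions chosen so that one temperature yields roughly the truth values described above; they are not taken from the original image.

```python
def falling(x, start, end):
    """1.0 at or below start, 0.0 at or above end, linear in between."""
    if x <= start:
        return 1.0
    if x >= end:
        return 0.0
    return (end - x) / (end - start)


def rising(x, start, end):
    """0.0 at or below start, 1.0 at or above end, linear in between."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)


# Breakpoints below are illustrative assumptions.
def cold(t):
    return falling(t, 10.0, 20.0)


def warm(t):
    return min(rising(t, 10.0, 20.0), falling(t, 25.0, 35.0))


def hot(t):
    return rising(t, 25.0, 35.0)


t = 12.0
truth = {"cold": cold(t), "warm": warm(t), "hot": hot(t)}
print(truth)
```

At the chosen temperature the reading is mostly cold, slightly warm and not hot at all, so a controller can weigh all three rules instead of switching abruptly between them.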
2.7.2 FUZZY LINGUISTIC VARIABLES
While variables in mathematics usually take numerical values, in fuzzy logic applications, the non-numeric linguistic variables are often used to facilitate the expression of rules and facts.
A linguistic variable such as age may have a value such as young or its opposite, old. However, the great utility of linguistic variables is that they can be modified via linguistic operations on the primary terms. For instance, if young is associated with the value 0.7, then very young is automatically deduced as having the value 0.7 * 0.7 = 0.49, and not very young gets the value (1 - 0.49), i.e. 0.51.
In this example, the operator very(X) was defined as X * X. In general, however, these operators may be defined uniformly but flexibly to fit the application, resulting in a great deal of power for the expression of both rules and fuzzy facts.
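The arithmetic above can be checked with a short Python sketch. The operators very and fuzzy_not follow the definitions in the text; somewhat (a square root) is an additional, assumed hedge included only to show that other operators can be defined in the same way.

```python
import math


def very(x):
    """Concentration hedge from the text: very(X) = X * X."""
    return x * x


def fuzzy_not(x):
    """Fuzzy complement: not(X) = 1 - X."""
    return 1.0 - x


def somewhat(x):
    """Assumed dilation hedge, not from the text: somewhat(X) = sqrt(X)."""
    return math.sqrt(x)


young = 0.7
very_young = very(young)                 # about 0.49
not_very_young = fuzzy_not(very_young)   # about 0.51
print(round(very_young, 2), round(not_very_young, 2))
```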
CHAPTER 3 SYSTEM DESIGN
System design is the solution to the creation of a new system. This phase is composed of several steps. It focuses on the detailed implementation of the feasible system, with emphasis on translating design specifications into performance specifications. System design has two phases of development: logical and physical design.
During the logical design phase the analyst describes inputs (sources), outputs (destinations), databases (data stores) and procedures (data flows), all in a format that meets the user's requirements. The analyst also specifies the user needs at a level that virtually determines the information flow into and out of the system and the data resources. Here the logical design is done through data flow diagrams and database design.
The logical design is followed by physical design or coding. Physical design produces the working system by defining the design specifications, which tell the programmers exactly what the candidate system must do. The programmers write the necessary programs that accept input from the user, perform the necessary processing on the accepted data and produce the required report on hard copy or display it on the screen.
3.1 DATABASE DESIGN
The data mining process involves the manipulation of large data sets. Hence, a large database is a key requirement in the mining operation. Ordered set of information is now to be extracted from this database.
The overall objective in the development of database technology has been to treat data as an organizational resource and as an integrated whole. DBMS allow data to be protected and organized separately from other resources.
Database is an integrated collection of data. The most significant form of data as seen by the programmers is data as stored on the direct access storage devices. This is the difference between logical and physical data.
Database design is the process of designing database files, which are the key source of information to the system. The files should be properly designed and planned for collection, accumulation, editing and retrieval of the required information.
The organization of data in a database aims to achieve three major objectives:
• Data integration.
• Data integrity.
• Data independence.
A large data set is difficult to parse, and the knowledge contained in it is difficult to interpret. Since the database used in this project is the log file of a proxy server called Squid, a detailed study of Squid-style transaction logging is also required.
3.2 PROXY SERVER
A proxy server is a server (a computer system or an application program) which services the requests of its clients by forwarding requests to other servers. A client connects to the proxy server, requesting some service, such as a file, connection, web page, or other resource, available from a different server. The proxy server provides the resource by connecting to the specified server and requesting the service on behalf of the client. A proxy server may optionally alter the client's request or the server's response, and sometimes it may serve the request without contacting the specified server. In this case, it would 'cache' the first request to the remote server, so it could save the information for later, and make everything as fast as possible.
A proxy server that passes all requests and replies unmodified is usually called a gateway or sometimes tunneling proxy. A proxy server can be placed in the user's local computer or at specific key points between the user and the destination servers or the Internet.
• Caching proxy server
A proxy server can service requests without contacting the specified server, by retrieving content saved from a previous request, made by the same client or even other clients. This is called caching.
• Web proxy
A proxy that focuses on WWW traffic is called a "web proxy". The most common use of a web proxy is to serve as a web cache. Most proxy programs (e.g. Squid, NetCache) provide a means to deny access to certain URLs in a blacklist, thus providing content filtering.
• Content Filtering Web Proxy
A content filtering web proxy server provides administrative control over the content that may be relayed through the proxy. It is commonly used in commercial and non-commercial organizations (especially schools) to ensure that Internet usage conforms to acceptable use policy.
• Anonymizing proxy server
An anonymous proxy server (sometimes called a web proxy) generally attempts to anonymize web surfing. These can easily be overridden by site administrators, and thus rendered useless in some cases. There are different varieties of anonymizers.
• Hostile proxy
Proxies can also be installed by online criminals, in order to eavesdrop upon the dataflow between the client machine and the web. All accessed pages, as well as all forms submitted, can be captured and analyzed by the proxy operator.
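The caching and content-filtering behavior described in the list above can be sketched in a few lines of Python. This is a toy model, not how Squid is implemented: TinyProxy, the fetch callable, the host blacklist and the HIT / MISS / DENIED labels are all illustrative assumptions, and no real network traffic is involved.

```python
from urllib.parse import urlsplit


class TinyProxy:
    """Toy proxy: denies blacklisted hosts and caches fetched responses."""

    def __init__(self, fetch, blacklist=()):
        self.fetch = fetch                # callable(url) -> response body
        self.blacklist = set(blacklist)   # host names that are denied
        self.cache = {}                   # url -> cached response body

    def request(self, url):
        host = urlsplit(url).hostname
        if host in self.blacklist:
            return "DENIED", None         # content filtering
        if url in self.cache:
            return "HIT", self.cache[url] # served without contacting origin
        body = self.fetch(url)            # forward on behalf of the client
        self.cache[url] = body
        return "MISS", body


origin_calls = []


def fake_origin(url):
    """Stand-in for the remote server; records each contact."""
    origin_calls.append(url)
    return "content of " + url


proxy = TinyProxy(fake_origin, blacklist=["blocked.example"])
first = proxy.request("http://site.example/page")
second = proxy.request("http://site.example/page")
blocked = proxy.request("http://blocked.example/page")
print(first[0], second[0], blocked[0], len(origin_calls))
```

The second request is answered from the cache, so the origin is contacted only once, which is the bandwidth saving a caching proxy provides.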
3.3 THE SQUID PROXY SERVER
Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more. It reduces bandwidth and improves response times by caching and reusing frequently-requested web pages. Squid has extensive access controls and makes a great server accelerator. It runs on Unix and Windows and is licensed under the GNU GPL. Squid is used by hundreds of Internet Providers world-wide to provide their users with the best possible web access.
Squid optimizes the data flow between client and server to improve performance and caches frequently-used content to save bandwidth. Squid can also route content requests to servers in a wide variety of ways to build cache server hierarchies which optimize network throughput.
Thousands of web-sites around the Internet use Squid to drastically increase their content delivery. Squid can reduce your server load and improve delivery speeds to clients. Squid can also be used to deliver content from around the world - copying only the content being used, rather than inefficiently copying everything. Finally, Squid's advanced content routing configuration allows you to build content clusters to route and load balance requests via a variety of web servers.
Squid is a fully-featured HTTP/1.0 proxy which is almost HTTP/1.1 compliant. Squid offers a rich access control, authorization and logging environment to develop web proxy and content serving applications. Squid is one of the projects which grew out of the initial content distribution and caching work in the mid-90s.
It has grown to include extra features such as powerful access control, authorization, logging, content distribution/replication, traffic management and shaping and more. It has many, many workarounds, new and old, to deal with incomplete and incorrect HTTP implementations.
Squid allows Internet Providers to save on their bandwidth through content caching. Cached content means data is served locally and users will see this through faster download speeds with frequently-used content.
A well-tuned proxy server (even without caching!) can improve user speeds purely by optimizing TCP flows. It's easy to tune servers to deal with the wide variety of latencies found on the internet, something that desktop environments just aren't tuned for.
Squid allows ISPs to avoid needing to spend large amounts of money on upgrading core equipment and transit links to cope with ever-demanding content growth. It also allows ISPs to prioritize and control certain web content types where dictated by technical or economic reasons.
3.3.1 SQUID STYLE TRANSACTION-LOGGING
Transaction logs allow administrators to view the traffic that has passed through the Content Engine. Typical fields in the transaction log are the date and time when a request was made, the URL that was requested, whether it was a cache-hit or a cache-miss, the type of request, the number of bytes transferred, and the source IP.
High-performance caching presents additional challenges other than how to quickly retrieve objects from storage, memory, or the web. Administrators of caches are often interested in what requests have been made of the cache and what the results of these requests were. This information is then used for such applications as:
• Problem identification and solving
• Load monitoring
• Billing
• Statistical analysis
• Security problems
• Cost analysis and provisioning
The Squid log file format is:
time elapsed remotehost code/status bytes method URL rfc931 peerstatus/peerhost type
A Squid log format example looks like this:
1012429341.115 100 172.16.100.152 TCP_REFRESH_MISS/304 1100 GET http://cisco.com/images/homepage/news.gif - DIRECT/cisco.com -
Squid logs are a valuable source of information about cache workloads and performance. The logs record not only access information but also system configuration errors and resource consumption, such as memory and disk space.
Field: Description
Time: UNIX time stamp as Coordinated Universal Time (UTC) seconds with a millisecond resolution.
Elapsed: Length of time in milliseconds that the cache was busy with the transaction. Note: entries are logged after the reply has been sent, not during the lifetime of the transaction.
Remote Host: IP address of the requesting instance.
Code/Status: Two entries separated by a slash. The first entry contains information on the result of the transaction: the kind of request, how it was satisfied, or in what way it failed. The second entry contains the HTTP result codes.
Bytes: Amount of data delivered to the client. This does not constitute the net object size, because headers are also counted. Also, failed requests may deliver an error page, the size of which is also logged here.
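To show how the fields described above map onto a log line, here is a minimal Python sketch that splits one native-format entry into named fields. The project itself used Perl; SquidEntry and parse_native are hypothetical names, the sample line is a cleaned-up version of the example entry, and the parser assumes whitespace-separated fields in the order listed in the format string.

```python
from typing import NamedTuple


class SquidEntry(NamedTuple):
    time: float          # UNIX time stamp, seconds with milliseconds
    elapsed_ms: int      # time the cache was busy with the transaction
    remote_host: str     # client IP address
    code: str            # result of the transaction, e.g. TCP_MISS
    status: int          # HTTP result code
    size: int            # bytes delivered to the client, headers included
    method: str
    url: str
    rfc931: str
    peer_status: str
    peer_host: str
    content_type: str


def parse_native(line):
    """Parse one whitespace-separated native-format access.log line."""
    f = line.split()
    if len(f) < 10:
        raise ValueError("expected 10 fields, got %d" % len(f))
    code, status = f[3].split("/", 1)
    peer_status, _, peer_host = f[8].partition("/")
    return SquidEntry(float(f[0]), int(f[1]), f[2], code, int(status),
                      int(f[4]), f[5], f[6], f[7],
                      peer_status, peer_host, f[9])


sample = ("1012429341.115 100 172.16.100.152 TCP_REFRESH_MISS/304 1100 "
          "GET http://cisco.com/images/homepage/news.gif - DIRECT/cisco.com -")
entry = parse_native(sample)
print(entry.remote_host, entry.code, entry.status, entry.size)
```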
3.3.2 SQUID LOG FILES
The logs are a valuable source of information about Squid workloads and performance. The logs record not only access information, but also system configuration errors and resource consumption (e.g. memory, disk space). There are several log files maintained by Squid. Some have to be explicitly activated at compile time, others can safely be deactivated at run time.
There are a few basic points common to all log files. The time stamps logged into the log files are usually UTC seconds unless stated otherwise. The initial time stamp usually contains a millisecond extension.
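A time stamp of this form can be converted to a readable UTC date as in the following Python sketch. The function name squid_time_to_utc is an assumption, and the stamp is parsed as text rather than as a float so that the millisecond part is preserved exactly.

```python
from datetime import datetime, timedelta, timezone


def squid_time_to_utc(stamp):
    """Convert 'seconds.milliseconds' since the UNIX epoch to a UTC datetime."""
    seconds, _, fraction = stamp.partition(".")
    millis = int((fraction + "000")[:3])  # pad or trim to three digits
    base = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return base + timedelta(milliseconds=millis)


when = squid_time_to_utc("1012429341.115")
print(when.isoformat())
```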
SQUID.OUT
If you run Squid from the RunCache script, a file squid.out contains the Squid startup times and also all fatal errors, e.g. as produced by an assert() failure. If you are not using RunCache, you will not see such a file.
CACHE.LOG
The cache.log file contains the debug and error messages that Squid generates. If you start Squid using the default RunCache script, or start it with the -s command line option, a copy of certain messages will go into your syslog facilities. It is a matter of personal preference whether to use a separate file for the Squid log data.
From the area of automatic log file analysis, the cache.log file does not have much to offer. We will usually look into this file for automated error reports, when programming Squid, testing new features, or searching for reasons of a perceived misbehavior, etc.
USERAGENT.LOG
The user agent log file is only maintained if
1. You configured the compile time --enable-useragent-log option, and
2. You pointed the useragent_log configuration option to a file.
From the user agent log file you are able to find out about distribution of browsers of your clients. Using this option in conjunction with a loaded production squid might not be the best of all ideas.
STORE.LOG
The store.log file covers the objects currently kept on disk or removed ones. As a kind of transaction log it is usually used for debugging purposes. A definitive statement, whether an object resides on your disks is only possible after analyzing the complete log file. The release (deletion) of an object may be logged at a later time than the swap out (save to disk).
The store.log file may be of interest to log file analysis which looks into the objects on your disks and the time they spend there, or how many times a hot object was accessed. The latter may be covered by another log file, too. With knowledge of the cache_dir configuration option, this log file allows for a URL to filename mapping without recursing your cache disks. However, the Squid developers recommend treating store.log primarily as a debug file, and so should you, unless you know what you are doing.
HIERARCHY.LOG
This log file exists for Squid-1.0 only. The format is
[date] URL peerstatus peerhost
ACCESS.LOG
Most log file analysis programs are based on the entries in access.log. Currently, there are two file formats possible for the log file, depending on your configuration for the emulate_httpd_log option. By default, Squid will log in its native log file format. If the above option is enabled, Squid will log in the common log file format as defined by the CERN web daemon.
The Common Logfile Format is used by numerous HTTP servers. This format consists of the following seven fields:
remote host rfc931 authuser [date] "method URL" status bytes
It is parsable by a variety of tools. The common format contains different information than the native log file format. The HTTP version is logged, which is not logged in the native log file format.
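The seven fields of the common format can be extracted with a regular expression, as in this Python sketch. The pattern, the helper name parse_common and the sample line are assumptions made for illustration; real logs may need a more tolerant pattern, and the month name is parsed with the default (English) locale.

```python
import re
from datetime import datetime

# One named group per field: remotehost rfc931 authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<remote_host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)(?: (?P<protocol>[^"]+))?" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)


def parse_common(line):
    """Parse one Common Logfile Format line into a dictionary."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("not a common-format line: %r" % line)
    entry = match.groupdict()
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


# Hypothetical sample line, not taken from the project's log.
sample_clf = ('172.16.100.152 - - [30/Jan/2002:22:22:21 +0000] '
              '"GET http://example.com/index.html HTTP/1.0" 304 1100')
clf = parse_common(sample_clf)
print(clf["remote_host"], clf["method"], clf["protocol"], clf["status"])
```

The protocol group captures the HTTP version, which is the piece of information the native format does not record.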
The log contents include the site name, the IP address of the requesting instance, the date and time in UNIX time format, the bytes transferred, the request method and other such features. Log files are usually large, large enough to be mined. However, the values in an entire line of input change with a change in the header.
The common log file format contains some information that the native log file lacks, but less information overall. The native format contains more information for the administrator interested in cache evaluation. The access.log is the Squid log that has been made use of in this project. The log file was in the form of a text file shown below:
[Screenshot of access.log opened in a text editor; the individual records are not legible in this copy. Each record is a single line in Squid's native format, with whitespace-separated fields: Unix timestamp, elapsed time, client IP address, Squid result code and HTTP status (for example TCP_MISS/200), bytes transferred, request method (GET, POST, HEAD or CONNECT in the records shown), requested URL, ident, hierarchy code and peer address (for example DIRECT/205.188.153.121), and content type (for example text/html).]
Figure 3.3.2.1 : Access.log used as database
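The record layout of the native format can be illustrated with a short parsing sketch. It is written in Python rather than the Perl used in the project, and it assumes the default ten-field native format; the `LogRecord` type, its field names and the sample record (which uses documentation-only addresses) are made up for illustration.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class LogRecord(NamedTuple):
    """One record of Squid's native access.log (field names are our own)."""
    time: datetime    # completion time, converted from Unix time to UTC
    elapsed_ms: int   # how long the request took, in milliseconds
    client: str       # IP address of the requesting client
    result: str       # Squid result code and HTTP status, e.g. "TCP_MISS/200"
    size: int         # bytes delivered to the client
    method: str       # request method, e.g. GET or CONNECT
    url: str          # requested URL (host:port for CONNECT)
    ident: str        # ident lookup result, usually "-"
    hierarchy: str    # hierarchy code and peer address
    mime: str         # content type, "-" if unknown


def parse_record(line: str) -> LogRecord:
    """Split one whitespace-separated record into typed fields."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    ts, elapsed, client, result, size, method, url, ident, hier, mime = fields
    when = datetime.fromtimestamp(float(ts), tz=timezone.utc)
    return LogRecord(when, int(elapsed), client, result, int(size),
                     method, url, ident, hier, mime)


# Made-up record; 192.0.2.x and 198.51.100.x are documentation address ranges.
SAMPLE = ("1204073887.231   1198 192.0.2.7 TCP_MISS/200 143 CONNECT "
          "login.example.com:443 - DIRECT/198.51.100.121 -")
```

Calling `parse_record(SAMPLE)` yields a record whose `time` falls on 27 February 2008 (UTC), which shows how the Unix timestamp in the first column maps to a calendar date.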
3.3.3 SQUID RESULT CODES
The TCP_ codes refer to requests on the HTTP port (usually 3128). The UDP_ codes refer to requests on the ICP port (usually 3130). If ICP logging was disabled using the log_icp_queries option, no ICP replies will be logged.
TCP_HIT
A valid copy of the requested object was in the cache.
TCP_MISS
The requested object was not in the cache.
TCP_REFRESH_HIT
The requested object was cached but STALE. The IMS (If-Modified-Since) query for the object resulted in "304 Not Modified".
TCP_REF_FAIL_HIT
The requested object was cached but STALE. The IMS query failed and the stale object was delivered.
TCP_REFRESH_MISS
The requested object was cached but STALE. The IMS query returned the new content.
TCP_CLIENT_REFRESH_MISS
The client issued a "no-cache" pragma, or some analogous cache control command along with the request. Thus, the cache has to re-fetch the object.
TCP_IMS_HIT
The client issued an IMS request for an object which was in the cache and fresh.
TCP_SWAPFAIL_MISS
The object was believed to be in the cache, but could not be accessed.
TCP_NEGATIVE_HIT
Request for a negatively cached object, e.g. "404 Not Found", for which the cache believes to know that it is inaccessible. Also refer to the explanations for negative_ttl in your squid.conf file.
TCP_MEM_HIT
A valid copy of the requested object was in the cache and it was in memory, thus avoiding disk accesses.
TCP_DENIED
Access was denied for this request.
TCP_OFFLINE_HIT
The requested object was retrieved from the cache during offline mode. The offline mode never validates any object.
UDP_HIT
A valid copy of the requested object was in the cache.
UDP_MISS
The requested object is not in this cache.
UDP_DENIED
Access was denied for this request.
UDP_INVALID
An invalid request was received.
UDP_MISS_NOFETCH
During "-Y" startup, or during frequent failures, a cache in hit-only mode will return either UDP_HIT or this code. Neighbors will thus only fetch hits.
NONE
Seen with errors and cache manager requests.
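In the log, the result code is written together with the HTTP status as a single field such as TCP_MISS/200. A minimal Python sketch of how that field could be split and the codes above grouped into hits, misses and denials follows; the function names and the grouping into four labels are our own simplification, not part of Squid.

```python
from typing import Tuple


def split_result(field: str) -> Tuple[str, int]:
    """Split a field such as "TCP_MISS/200" into ("TCP_MISS", 200)."""
    code, _, status = field.partition("/")
    return code, int(status)


def classify(code: str) -> str:
    """Group a Squid result code into hit, miss, denied, none or other."""
    if code == "NONE":
        return "none"
    if code.endswith("DENIED"):
        return "denied"
    if code.endswith("HIT"):
        return "hit"        # TCP_HIT, TCP_MEM_HIT, UDP_HIT, ...
    if "MISS" in code:
        return "miss"       # TCP_MISS, UDP_MISS_NOFETCH, ...
    return "other"          # e.g. UDP_INVALID
```

For example, `classify(split_result("TCP_REFRESH_HIT/304")[0])` gives `"hit"`, so a cache hit ratio can be estimated by counting labels over all records.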
3.4 HTTP RESULT CODES
These are taken from RFC 2616 and verified for Squid. Squid-2 uses almost all codes except 307 (Temporary Redirect), 416 (Request Range Not Satisfiable), and 417 (Expectation Failed). Extra codes include 0, for a result code being unavailable, and 600, to signal an invalid header (a proxy error). Also, some definitions were added as per RFC 2518. Yes, there are really two entries for status code 424; compare with http_status in src/enums.h:
000 USED MOSTLY WITH UDP TRAFFIC
100 CONTINUE
101 SWITCHING PROTOCOLS
102 PROCESSING
200 OK
201 CREATED
202 ACCEPTED
203 NON-AUTHORITATIVE INFORMATION
204 NO CONTENT
205 RESET CONTENT
206 PARTIAL CONTENT
207 MULTI STATUS
300 MULTIPLE CHOICES
301 MOVED PERMANENTLY
302 MOVED TEMPORARILY
304 NOT MODIFIED
305 USE PROXY
307 TEMPORARY REDIRECT
400 BAD REQUEST
401 UNAUTHORIZED
402 PAYMENT REQUIRED
403 FORBIDDEN
404 NOT FOUND
405 METHOD NOT ALLOWED
406 NOT ACCEPTABLE
407 PROXY AUTHENTICATION REQUIRED
408 REQUEST TIMEOUT
409 CONFLICT
410 GONE
411 LENGTH REQUIRED
412 PRECONDITION FAILED
413 REQUEST ENTITY TOO LARGE
414 REQUEST URI TOO LARGE
415 UNSUPPORTED MEDIA TYPE
416 REQUEST RANGE NOT SATISFIABLE
417 EXPECTATION FAILED
424 LOCKED
424 FAILED DEPENDENCY
433 UNPROCESSABLE ENTITY
500 INTERNAL SERVER ERROR
501 NOT IMPLEMENTED
502 BAD GATEWAY
TABLE 3.4.1 : HTTP result codes
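When summarising a log it is often enough to know the class of a status code rather than its exact value. The following Python sketch groups codes by their first digit and treats the two Squid-specific values mentioned above (0 and 600) separately; the function name and the labels are our own choices.

```python
def status_class(status: int) -> str:
    """Map a logged status code to a coarse class."""
    if status == 0:
        return "unavailable"   # Squid logs 000 when no result code exists
    if status == 600:
        return "proxy error"   # Squid-specific: invalid header
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    # Anything outside 100..599 is reported as nonstandard.
    return classes.get(status // 100, "nonstandard")
```

Values that servers invent outside the standard ranges fall into the "nonstandard" bucket instead of raising an error, which keeps a mining script running on untidy real-world logs.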
3.5 HTTP REQUEST METHODS
Squid recognizes several request methods as defined in RFC 2616. Newer versions of Squid also recognize the RFC 2518 "HTTP Extensions for Distributed Authoring -- WEBDAV" extensions.
GET
OBJECT RETRIEVAL AND SIMPLE SEARCHES.
HEAD
METADATA RETRIEVAL.
POST
SUBMIT DATA (TO A PROGRAM).
PUT
UPLOAD DATA (E.G. TO A FILE).
DELETE
REMOVE RESOURCE (E.G. FILE).
TRACE
APPLN LAYER TRACE OF REQUEST ROUTE.
OPTIONS
REQUEST AVAILABLE COMM. OPTIONS.
CONNECT
TUNNEL SSL CONNECTION.
PROPFIND
RETRIEVE PROPERTIES OF AN OBJECT.
PROPPATCH
CHANGE PROPERTIES OF AN OBJECT.
COPY
CREATE A DUPLICATE OF SRC IN DST.
MOVE
ATOMICALLY MOVE SRC TO DST.
LOCK
LOCK AN OBJECT AGAINST MODIFICATIONS.
UNLOCK
UNLOCK AN OBJECT.
TABLE 3.4.2 : HTTP request methods
CHAPTER 4
CODING
4.1 FEATURES OF LANGUAGE (PERL)
Practical Extraction and Reporting Language (Perl) is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It is also a good language for many system management tasks.
• The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal).
• It combines (in the author's opinion, anyway) some of the best features of C, sed, awk, and sh, so people familiar with those languages should have little difficulty with it. (Language historians will also note some vestiges of Pascal and even BASIC-PLUS.)
• Unlike most UNIX utilities, Perl does not arbitrarily limit the size of our data: if we have the memory, Perl can slurp in our whole file as a single string. Recursion is of unlimited depth.
• The hash tables used by associative arrays grow as necessary to prevent degraded performance. Perl uses sophisticated pattern matching techniques to scan large amounts of data very quickly.
• Although optimized for scanning text, Perl can also deal with binary data, and can make dbm files look like associative arrays (where dbm is available). Setuid Perl scripts are safer than C programs through a dataflow tracing mechanism which prevents many stupid security holes.
• The overall structure of Perl derives broadly from C. Perl is procedural in nature, with variables, expressions, assignment statements, brace-delimited code blocks, control structures, and subroutines.
• Perl also takes features from shell programming. All variables are marked with leading sigils, which unambiguously identify the data type (scalar, array, hash, etc.) of the variable in context. Importantly, sigils allow variables to be interpolated directly into strings.
• Perl has many built-in functions which provide tools often used in shell programming (though many of these tools are implemented by programs external to the shell), like sorting and calling on system facilities.
• Perl takes lists from Lisp, associative arrays (hashes) from AWK, and regular expressions from sed. These simplify and facilitate many parsing, text handling, and data management tasks.
• In Perl 5, features were added that support complex data structures, first-class functions (i.e., closures as values), and an object-oriented programming model. These include references, packages, class-based method dispatch, and lexically scoped variables, along with compiler directives.
• All versions of Perl do automatic data typing and memory management. The interpreter knows the type and storage requirements of every data object in the program; it allocates and frees storage for them as necessary using reference counting (so it cannot deallocate circular data structures without manual intervention). Legal type conversions, for example conversions from number to string, are done automatically at run time; illegal type conversions are fatal errors.
• Perl has a context-sensitive grammar which can be affected by code executed during an intermittent run-time phase. Therefore Perl cannot be parsed by a straight Lex/Yacc lexer/parser combination. Instead, the interpreter implements its own lexer, which coordinates with a modified GNU bison parser to resolve ambiguities in the language.
• The execution of a Perl program divides broadly into two phases: compile time and run time. At compile time, the interpreter parses the program text into a syntax tree. At run time, it executes the program by walking the tree.
4.2 PERL CODE FOR MINING
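[Screenshot of the Perl script in the Perl Express editor; the listing is not legible in this copy. The recognizable fragments are: opening the log file and reading it into an array, a foreach loop over the lines with chop() on each line, splitting each line into variables (including $ET, $IP, $BYTE, $MT and $NAME), printing $NAME, pushing $NAME into an array, incrementing a count in a hash for each array element, and a final foreach over the keys of the hash.]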
FIGURE 4.2.1: PERL Program for mining
The Perl code to mine access.log makes use of the split() function, which is required to split a line of text in the log file into its fields. The extracted site name is pushed into an array for comparison purposes. After the required comparison to determine the number of times that a site has been repeated, both the site and its corresponding count are inserted into a hash.
The hash is then used to sort the site names in descending order of their counts. The count and the corresponding site name are displayed as the output.
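The counting and sorting steps described above can be sketched as follows. This is an illustrative Python version, not the authors' Perl script; it assumes the native log layout in which the URL is the seventh whitespace-separated field, and the sample records are made up.

```python
from collections import Counter


def site_counts(lines):
    """Count how many times each site (URL field) appears in the log."""
    counts = Counter()
    for line in lines:
        fields = line.split()        # same role as Perl's split()
        if len(fields) < 7:
            continue                 # skip blank or truncated records
        counts[fields[6]] += 1       # seventh field is the requested URL
    return counts


def sorted_by_frequency(counts):
    """Return (site, count) pairs, most frequent first; ties sorted by name."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up records in the native format, for illustration only.
SAMPLE_LOG = [
    "1204073887.2 645 192.0.2.1 TCP_MISS/200 466 POST http://a.example/eye.php - DIRECT/198.51.100.1 text/html",
    "1204073887.3 649 192.0.2.2 TCP_MISS/200 467 POST http://a.example/eye.php - DIRECT/198.51.100.1 text/html",
    "1204073887.4 951 192.0.2.3 TCP_MISS/200 139 CONNECT b.example:443 - DIRECT/198.51.100.2 -",
]
```

With the sample above, `sorted_by_frequency(site_counts(SAMPLE_LOG))` lists `http://a.example/eye.php` with a count of 2 ahead of `b.example:443` with a count of 1. A `Counter` plays the role of the Perl hash, so no separate comparison pass over an array is needed.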
4.3 DISPLAYED OUTPUT
[Screenshot of the program output in Perl Express: the requested sites printed one per line in the order in which they occur in the log, for example login.icq.com:443 and http://nuhost.info/eye.php. Most of the remaining entries are not legible in this copy.]
FIGURE 4.2.2 : VISITED SITES
This is the output of the program in Figure 4.2.1. It displays all the sites that have been requested, both those that were visited and those that were denied access by the proxy server. Hence, the log records all the transactions that have been successful and those that have failed.
TOTAL SITES VISITED : 5238
SITES SORTED IN ORDER OF FREQUENCY OF USAGE:
[Screenshot of program output: 35 sites listed with their request counts in descending order, from 200 down to 10. Most of the site names are not legible in this copy. The three most frequently requested entries are:]
200 http://megafasthost.info/eye.php
93 http://nuhost.info/eye.php
80 login.icq.com:443
[The screenshot ends with Perl warnings of the form: Name "main::MT" used only once: possible typo at sortedSites.pl line 20. The same warning is reported for main::ET and main::IP.]
Figure 4.2.3 : Sites sorted in frequency of usage
BYTES DOWNLOADED    SITE NAME
[Screenshot of program output: sites listed in descending order of bytes downloaded. The largest entry is 606811 bytes, for a .zip update file; the remaining entries range from roughly 90,000 bytes down to roughly 25,000 bytes. Most of the site names are not legible in this copy. The screenshot ends with Perl warnings that the variables main::ET, main::MT and main::IP are each used only once.]
Figure 4.2.4 : Sites sorted in terms of bytes downloaded
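A ranking of this kind can be produced by adding up the bytes field per site and sorting the totals. The Python sketch below is illustrative only: the report does not say whether its script sums the bytes of repeated requests or lists each request separately, so the summing is an assumption, as are the function name and the sample records.

```python
from collections import defaultdict


def bytes_per_site(lines):
    """Total the bytes field per URL and return (url, bytes), largest first."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        # Field 5 is the byte count, field 7 the URL (native format assumed).
        if len(fields) < 7 or not fields[4].isdigit():
            continue
        totals[fields[6]] += int(fields[4])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up records, for illustration only.
SAMPLE_LOG = [
    "1204073887.2 645 192.0.2.1 TCP_MISS/200 500 GET http://a.example/x - DIRECT/198.51.100.1 text/html",
    "1204073887.3 649 192.0.2.2 TCP_MISS/200 9000 GET http://b.example/y.zip - DIRECT/198.51.100.2 application/zip",
    "1204073887.4 651 192.0.2.3 TCP_MISS/200 700 GET http://a.example/x - DIRECT/198.51.100.1 text/html",
]
```

For the sample records, `bytes_per_site(SAMPLE_LOG)` places `http://b.example/y.zip` (9000 bytes) ahead of `http://a.example/x` (1200 bytes in total).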
[Screenshot of program output in Perl Express: the result code/status column of the log (TCP_MISS/200, TCP_MISS/302 and so on), followed by the list of denied sites. Some host names below are only partly legible in this copy.]
NUMBER OF SITES THAT WERE DENIED ACCESS
ACCESS DENIED SITES ***
ms94.UEl.com.tw:25, TCP_DENIED/403
proxyzone.ru:8030, TCP_DENIED/403
proxyvay.net:60, TCP_DENIED/403
Cup.mail.xmte.net:25, TCP_DENIED/403
www.ftp8.co.uk:80, TCP_DENIED/403
google.com:80, TCP_DENIED/403
[The screenshot ends with Perl warnings that several variables, including main::MT and main::IP, are used only once at line 12 of the script.]
Figure 4.2.5 : Sites that were denied access
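Denied requests can be picked out by testing the result code field for TCP_DENIED. The following Python sketch shows one way to do this under the same native-format assumption; the function name and the sample records are made up, and each site is reported once even if it was denied several times.

```python
def denied_sites(lines):
    """Return (site, result) pairs for requests the proxy refused."""
    seen = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue
        result, url = fields[3], fields[6]
        if result.startswith("TCP_DENIED"):
            seen.setdefault(url, result)   # keep the first occurrence only
    return list(seen.items())


# Made-up records, for illustration only.
SAMPLE_LOG = [
    "1204073887.2 5 192.0.2.1 TCP_DENIED/403 1366 CONNECT mail.example:25 - NONE/- text/html",
    "1204073887.3 649 192.0.2.2 TCP_MISS/200 467 GET http://a.example/ - DIRECT/198.51.100.1 text/html",
    "1204073887.4 4 192.0.2.3 TCP_DENIED/403 1366 CONNECT mail.example:25 - NONE/- text/html",
]
```

Here `denied_sites(SAMPLE_LOG)` returns a single entry for `mail.example:25`, and `len(...)` of the result gives the number of distinct denied sites.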
CHAPTER 5
TESTING
5.1 SYSTEM TESTING
Testing is a set of activities that can be planned and conducted systematically. Testing begins at the module level and works toward the integration of the entire computer-based system. Nothing is complete without testing, as it is vital to the success of the system.
Testing Objectives:
There are several rules that can serve as testing objectives:
• Testing is a process of executing a program with the intent of finding an error.
• A good test case is one that has a high probability of finding an undiscovered error.
• A successful test is one that uncovers an undiscovered error.
If testing is conducted successfully according to the objectives stated above, it will uncover errors in the software. Testing also demonstrates that the software functions appear to be working according to the specification and that performance requirements appear to have been met.
There are three ways to test a program:
• For correctness
• For implementation efficiency
• For computational complexity
Tests for correctness are supposed to verify that a program does exactly what it was designed to do. This is much more difficult than it may at first appear, especially for large programs.
Tests for implementation efficiency attempt to find ways to make a correct program faster or use less storage. It is a code-refining process, which reexamines the implementation phase of algorithm development.
Tests for computational complexity amount to an experimental analysis of the complexity of an algorithm, or an experimental comparison of two or more algorithms which solve the same problem.
Testing Correctness
The following ideas should be a part of any testing plan:
• Preventive measures
• Spot checks
• Testing all parts of the program
• Test data
• Looking for trouble
• Time for testing
• Retesting
The data was entered in all forms separately, and whenever an error occurred it was corrected immediately. A quality team deputed by the management verified all the necessary documents and tested the software while entering the data at all levels. The entire testing process can be divided into three phases:
• Unit testing
• Integration testing
• Final/system testing
5.1.1 UNIT TESTING
As this system was partially a GUI-based Windows application, the following were tested in this phase:
Tab order
Reverse tab order
Field length
Front-end validations
In our system, unit testing has been successfully handled. Test data was given to each and every module, and the desired output was obtained. Each module has been tested and found to be working properly.
5.1.2 INTEGRATION TESTING
Test data should be prepared carefully, since the data alone determines the efficiency and accuracy of the system. Artificial data are prepared solely for testing. Every program validates the input data.
5.1.3 VALIDATION TESTING
In this phase, all the code modules were tested individually, one after the other. The following were tested in all the modules:
Loop testing
Boundary value analysis
Equivalence partitioning testing
In our case all the modules were combined and given the test data. The combined module works successfully without any side effects on other programs. Everything was found to be working fine.
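Boundary value analysis and equivalence partitioning can be illustrated with a small Python sketch. It is not the authors' test suite: the validator `is_valid_status`, the accepted range of 100 to 599 and the chosen test values are all hypothetical, picked only to show how the two techniques select inputs.

```python
def is_valid_status(status: int) -> bool:
    """Hypothetical validator: accept standard HTTP status codes 100..599."""
    return 100 <= status <= 599


# Boundary value analysis: test just below, on, and just above each limit.
BOUNDARY_CASES = [
    (99, False), (100, True), (101, True),
    (598, True), (599, True), (600, False),
]

# Equivalence partitioning: one representative per class of inputs
# (below the range, inside the range, above the range).
EQUIVALENCE_CASES = [(-5, False), (302, True), (999, False)]


def run_cases(cases):
    """Return the inputs whose actual result differs from the expected one."""
    return [value for value, expected in cases
            if is_valid_status(value) != expected]
```

An empty list from `run_cases(BOUNDARY_CASES)` and `run_cases(EQUIVALENCE_CASES)` means every case passed; any value returned identifies a failing input directly.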
5.1.4 OUTPUT TESTING
This is the final step in testing. In this the entire system was tested as a whole with all forms, code, modules and class modules. This form of testing is popularly known as Black Box testing or system testing.
Black Box testing methods focus on the functional requirement of the software. That is, Black Box testing enables the software engineer to derive sets of input conditions that will fully exercise all functional requirements for a program.
Black Box testing attempts to find errors in the following categories: incorrect or missing functions, interface errors, errors in data structures or external database access, performance errors, and initialization and termination errors.
CHAPTER 6
CONCLUSION
The project report entitled "DATAMINING USING FUZZY LOGIC" has come to its final stage. The system has been developed with much care so that it is free of errors and at the same time efficient and less time consuming. The important thing is that the system is robust. We have tried our level best to complete the project with all its required features.
However, due to time constraints the fuzzy implementation over the mined data has not been possible. Since the queries related to mining require the proper retrieval of data, the actual count is preferred over applying fuzziness to the count.
APPENDICES
OVERVIEW OF PERL EXPRESS 2.5
PERL EXPRESS 2.5 is a free integrated development environment (IDE) for Perl with multiple tools for writing and debugging scripts. It features multiple CGI scripts for editing, running, and debugging; multiple input files; full server simulation; queries created from an internal Web browser or query editor; testing of MySQL and MS Access scripts; interactive I/O; a directory window; a code library; and code templates.
Perl Express allows us to set the environment variables used for running and debugging a script. It has a customizable code editor with syntax highlighting, unlimited text size, printing, line numbering, bookmarks, column selection, a search-and-replace engine, and multilevel undo/redo operations. Version 2.5 adds command line support and bug fixes.
RESUME
The developed system is flexible and changes can be made easily. The system is developed with an insight into the necessary modification that may be required in the future. Hence the system can be maintained successfully without much rework.
One of the main future enhancements of our system is to include fuzzy logic which is a form of multi-valued logic derived from fuzzy set theory to deal with reasoning that is approximate rather than precise.
#2
Data mining introduction
Ghada H. El-Khawaga
Marwa M. El-Sadeeq
2007
Agenda
What is data mining ?
Why data mining?
Data mining types
Data mining tasks
Knowledge discovery in databases (KDD) processes
Data mining processes
Data mining techniques
Data mining and Data warehousing
Data Mining System Components
Data Mining Applications
Data Mining Tools






What is data mining ?
Non-trivial extraction of implicit, previously unknown and potentially useful information from data.

A step in the knowledge discovery process consisting of particular algorithms (methods) that under some acceptable objective, produces a particular enumeration of patterns (models) over the data.

Why data mining ?
Data volumes are too large for classical analysis approaches:
Large number of records
High dimensional data
Leverage organization's data assets
Only a small portion of the collected data is ever analyzed
Data that may never be analyzed continues to be collected, at a great expense, out of fear that something which may prove important in the future is missing.
As databases grow, the ability to support the decision support process using traditional query languages becomes infeasible
Query formulation problem








Data mining types
Predictive data mining: which produces the model of the system described by the given data. It uses some variables or fields in the data set to predict unknown or future values of other variables of interest.

Descriptive data mining: which produces new, nontrivial information based on the available data set. It focuses on finding patterns describing the data that can be interpreted by humans.

Data mining tasks
Data processing [descriptive]
Prediction [predictive]
Regression [predictive]
Clustering [descriptive]
Classification [predictive]
Link analysis/ associations [descriptive]
Evolution and deviation analysis [predictive]

Knowledge Discovery in Databases (KDD) processes
Data mining processes
Data mining techniques
Statistical methods
Case-based reasoning
Neural networks
Decision trees


Data mining and Data warehousing
Data warehousing + data mining =

increased performance of decision making process
+
knowledgeable decision makers

SQL Vs. Data mining Vs. OLAP

Data Mining Applications
Data Mining For Financial Data Analysis
Data Mining For Telecommunications Industry
Data Mining For The Retail Industry
Data Mining In Healthcare and Biomedical Research
Data Mining In Science and Engineering

Data Mining System Components
The Function of the data mining system is to
assign scores to various profiles.

Data Mart
Data Mining System(Processing)
Operational Data Store
Scoring Software
Reporting System

Data Mining Applications
Data Mining For Financial Data Analysis
In the banking industry, data mining is used for:
1- predicting credit fraud
2- evaluating risk
3- performing trend analysis
4- analyzing profitability
5- helping with direct marketing campaigns

In financial markets, data mining with neural networks is used for:
1- forecasting stock prices
2- forecasting commodity prices
3- forecasting financial disasters


Data Mining Applications
Data Mining For Telecommunications Industry
- Answering some strategic questions through data-mining applications such as:
1- How does one retain customers and keep them loyal as competitors offer special offers and reduced rates?
2- When is a high-risk investment, such as new fiber optic lines, acceptable?
3- How does one predict whether customers will buy additional products like cellular services, call waiting, or basic services?
4- What characteristics differentiate our products from those of our competitors?










Data Mining Applications
Data Mining For The Retail Industry
- The retail industry is a major application area for data mining since it collects huge amounts of data on sales, customer-shopping history, goods transportation, consumption patterns, and service records.
- Retailers are interested in creating data-mining models to answer questions such as:
1- What are the best types of advertisements to reach certain segments of customers?
2- What is the optimal timing at which to send mailers?
3- What types of products can be sold together?
4- How does one retain profitable customers?
5- What are the significant customer segments that buy products?
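The question of which products can be sold together is usually approached with association analysis. The following is a minimal sketch, not a method taken from the slides: the transactions and the helper names `pair_support` and `confidence` are made up for illustration. Support is the share of baskets containing a pair, and confidence is the share of baskets with one item that also contain the other.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: count / len(baskets) for pair, count in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(transactions)
print(support[("bread", "milk")])                   # share of baskets with both
print(confidence(transactions, "diapers", "beer"))  # rule: diapers -> beer
```

Pairs with high support and high confidence are candidates for joint promotion; real tools extend the same idea to larger item sets.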




Data Mining Applications
Data Mining In Healthcare and Biomedical Research
- The storage of patients' records in electronic format and the development of medical-information systems have made a large amount of clinical data available online. Regularities and surprising events extracted from these data by data-mining methods help clinicians make informed decisions, thereby improving health services.
- Data mining has been used in many successful medical applications, including data validation in intensive care, the monitoring of children's growth, the analysis of diabetic patients' data, and the monitoring of heart-transplant patients.


Data Mining Applications
Data Mining In Science and Engineering

- Data mining has been applied to a number of engineering problems. For example, Pavilion Technologies' Process Insights, an application-development tool that combines neural networks, fuzzy logic, and statistical methods, was used to develop chemical manufacturing and control applications that reduce waste, improve product quality, and increase plant throughput.



Data Mining Tools
Data Mind
Agent Base/Marketer
DB Miner
Decision Series
IBM Intelligent Miner
Data Mining Suite
Darwin (now part of Oracle)
Business Miner
Data Engine




Data Mining Tools
Agent Base/Marketer
It is based on emerging intelligent-agent technology.
It can access data from all major sources, and it runs on Windows 95, Windows NT, and the Solaris operating system.
Business Miner
It is a single-strategy, easy-to-use tool based on decision trees.
It can access data from multiple sources including Oracle, Sybase, SQL Server, and Teradata.
It runs on all Windows platforms.
Data Engine
It is a multiple-strategy data-mining tool for data modeling, combining conventional data-analysis methods with fuzzy technology, neural networks, and advanced statistical techniques.
It works on the Windows platform.



Problems of Data Mining Tools
Difficult to use
Need an expert to run the tool
Difficult to add new functionality
Difficult to interface
Short lifetime
Limited number of algorithms
Need a lot of resources
References
Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2003. ISBN 0471228524.
DHS Privacy Office, Privacy data mining report, 2005.
Zhaohui Tang, Jamie Maclennan, Peter Pyungchul Kim, Building Data Mining Solutions with OLE DB for DM and XML for Analysis, SIGMOD Record, Vol. 34, No. 2, June 2005.
Reply
#3
Mining Complex Types of Data


Multidimensional analysis and descriptive mining of complex data objects
Mining spatial databases
Mining time-series and sequence data
Mining the World-Wide Web (to be covered Dec. 4, if time permits)
Summary

Mining Complex Data Objects: Generalization of Structured Data

Set-valued attribute
Generalization of each value in the set into its corresponding higher-level concepts
Derivation of the general behavior of the set, such as the number of elements in the set, the types or value ranges in the set, or the weighted average for numerical data
E.g., hobby = {tennis, hockey, chess, violin, nintendo_games} generalizes to {sports, music, video_games}
List-valued or a sequence-valued attribute
Same as set-valued attributes except that the order of the elements in the sequence should be observed in the generalization
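The hobby example above can be sketched in a few lines. This is only an illustration under stated assumptions: the concept hierarchy `CONCEPT_OF` is invented (including the choice to place chess under sports so that the result matches the slide), and collapsing adjacent repeats in the sequence version is one possible design choice, not something the slides prescribe.

```python
# Hypothetical concept hierarchy: low-level value -> higher-level concept.
CONCEPT_OF = {
    "tennis": "sports",
    "hockey": "sports",
    "chess": "sports",
    "violin": "music",
    "nintendo_games": "video_games",
}

def generalize_set(values, concept_of):
    """Replace each value by its higher-level concept; duplicates merge."""
    return {concept_of.get(value, value) for value in values}

def generalize_sequence(values, concept_of):
    """Generalize a list-valued attribute while preserving element order.

    Adjacent elements that map to the same concept are collapsed into one.
    """
    result = []
    for value in values:
        concept = concept_of.get(value, value)
        if not result or result[-1] != concept:
            result.append(concept)
    return result

hobby = {"tennis", "hockey", "chess", "violin", "nintendo_games"}
print(generalize_set(hobby, CONCEPT_OF))
print(generalize_sequence(["tennis", "hockey", "violin", "chess"], CONCEPT_OF))
```

The set version loses order and multiplicity, while the sequence version keeps the order of concepts, which is the distinction the slide draws between set-valued and list-valued attributes.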

Generalizing Spatial and Multimedia Data

Spatial data:
Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
Require the merge of a set of geographic areas by spatial operations
Image data:
Extracted by aggregation and/or approximation
Size, color, shape, texture, orientation, and relative positions and structures of the contained objects or regions in the image
Music data:
Summarize its melody: based on the approximate patterns that repeatedly occur in the segment
Summarize its style: based on its tone, tempo, or the major musical instruments played

An Example: Plan Mining by Divide and Conquer

Plan: a variable sequence of actions
E.g., Travel (flight): <traveler, departure, arrival, d-time, a-time, airline, price, seat>
Plan mining: extraction of important or significant generalized (sequential) patterns from a planbase (a large collection of plans)
E.g., Discover travel patterns in an air flight database, or
find significant patterns from the sequences of actions in the repair of automobiles
Method
Attribute-oriented induction on sequence data
A generalized travel plan: <small-big*-small>
Divide & conquer: mine characteristics for each subsequence
E.g., big*: same airline, small-big: nearby region
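A rough sketch of how a generalized plan such as <small-big*-small> could be checked against a planbase follows. The airport size classes, the sample plans, and the helper names are assumptions made for this example. Each airport is generalized to a size class, the plan is encoded as a string, and a regular expression plays the role of the generalized pattern.

```python
import re

# Hypothetical size classes for a handful of airports (illustration only).
AIRPORT_SIZE = {
    "ALB": "small", "BTV": "small", "SBA": "small",
    "JFK": "big", "ORD": "big", "LAX": "big",
}

# small-big*-small: a small airport, any number of big ones, a small airport.
SMALL_BIG_SMALL = re.compile(r"^SB*S$")

def encode(airports, size_of):
    """Generalize each airport to 'S' (small) or 'B' (big)."""
    return "".join("S" if size_of[a] == "small" else "B" for a in airports)

def matches_small_big_small(airports, size_of):
    """Check whether a plan fits the generalized pattern."""
    return bool(SMALL_BIG_SMALL.match(encode(airports, size_of)))

# A tiny, made-up planbase: each plan is the sequence of airports visited.
plans = [
    ["ALB", "JFK", "LAX", "SBA"],
    ["BTV", "ORD", "SBA"],
    ["JFK", "LAX"],
    ["ALB", "SBA"],
]

share = sum(matches_small_big_small(p, AIRPORT_SIZE) for p in plans) / len(plans)
print(share)  # fraction of plans that fit small-big*-small
```

Once the matching plans are isolated, each subsequence (for example the run of big airports) can be mined separately for characteristics such as a shared airline.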

For more information about this article, please follow the link:
http://googleurl?sa=t&source=web&cd=1&ve...m_mct1.ppt&ei=VOa7TLWaDISycNGB4cIM&usg=AFQjCNGk3GjWb40JdBGWClNnbV41-NgqvA

Reply
#4


DATA MINING IN TELECOMMUNICATIONS

Gary M. Weiss
Department of Computer and Information Science
Fordham University



Abstract:

Telecommunication companies generate a tremendous amount of data. These
data include call detail data, which describes the calls that traverse the
telecommunication networks, network data, which describes the state of the
hardware and software components in the network, and customer data, which
describes the telecommunication customers. This chapter describes how data
mining can be used to uncover useful information buried within these data
sets. Several data mining applications are described and together they
demonstrate that data mining can be used to identify telecommunication fraud,
improve marketing effectiveness, and identify network faults.


INTRODUCTION


The telecommunications industry generates and stores a tremendous
amount of data. These data include call detail data, which describes the calls
that traverse the telecommunication networks, network data, which describes
the state of the hardware and software components in the network, and
customer data, which describes the telecommunication customers. The
amount of data is so great that manual analysis of the data is difficult, if not
impossible. The need to handle such large volumes of data led to the
development of knowledge-based expert systems. These automated systems
performed important functions such as identifying fraudulent phone calls and
identifying network faults. The problem with this approach is that it is time-consuming
to obtain the knowledge from human experts (the “knowledge
acquisition bottleneck”) and, in many cases, the experts do not have the
requisite knowledge. The advent of data mining technology promised
solutions to these problems and for this reason the telecommunications
industry was an early adopter of data mining technology.
Telecommunication data pose several interesting issues for data mining.
The first concerns scale, since telecommunication databases may contain
billions of records and are amongst the largest in the world. A second issue
is that the raw data is often not suitable for data mining. For example, both
call detail and network data are time-series data that represent individual
events. Before this data can be effectively mined, useful “summary” features
must be identified and then the data must be summarized using these
features. Because many data mining applications in the telecommunications
industry involve predicting very rare events, such as the failure of a network
element or an instance of telephone fraud, rarity is another issue that must be
dealt with. The fourth and final data mining issue concerns real-time
performance: many data mining applications, such as fraud detection, require
that any learned model/rules be applied in real-time. Each of these four
issues is discussed throughout this chapter, within the context of real data mining applications.
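The step of turning raw call detail records into customer-level "summary" features can be illustrated with a small sketch. The record layout (caller, duration in minutes, hour of day), the sample numbers, and the three chosen features are assumptions for this example, not the features used in the chapter.

```python
from collections import defaultdict

# Hypothetical call detail records: (caller, duration_minutes, hour_of_day).
calls = [
    ("555-0101", 2.0, 9),
    ("555-0101", 4.0, 14),
    ("555-0101", 6.0, 22),
    ("555-0202", 1.0, 23),
    ("555-0202", 3.0, 2),
]

def summarize(records, day_start=8, day_end=18):
    """Aggregate per-call events into one feature row per caller."""
    grouped = defaultdict(list)
    for caller, duration, hour in records:
        grouped[caller].append((duration, hour))

    features = {}
    for caller, rows in grouped.items():
        n = len(rows)
        features[caller] = {
            "num_calls": n,
            "avg_duration": sum(d for d, _ in rows) / n,
            # Share of calls placed during assumed business hours.
            "daytime_share": sum(day_start <= h < day_end for _, h in rows) / n,
        }
    return features

features = summarize(calls)
print(features["555-0101"])
```

Each caller is now described by one fixed-length row, which is the form most mining algorithms expect; a mostly daytime caller, for instance, might be a candidate business customer.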

For more information:

http://citeseerx.ist.psu.edu/viewdoc/dow...1&type=pdf
Reply
#5


Abhishek M. Mehta

[attachment=7990]

TOPICS
Why do we need data mining?
What is DATA MINING?
Standards Of Data Mining
Methods Of Data Mining
What is KNOWLEDGE DISCOVERY?
Process of KNOWLEDGE DISCOVERY
Input Data For Knowledge Discovery
Output Format For Discovered Knowledge

Requirement Of Data Mining
Wal-Mart is reported to have a 24-terabyte database
AT&T handles billions of calls per day, so much data that it cannot all be stored
Mobil Oil: 100 TB of oil exploration data
NASA: EOS generates 50 GB/hr of remotely sensed image data

What is DATA MINING?

Data mining is the process of analyzing data from different perspectives and summarizing it into useful information.




Reply
#6
[attachment=10224]
1. INTRODUCTION:
Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction.
Data mining can be performed on data represented in quantitative, textual, or multimedia forms. Data mining applications can use a variety of parameters to examine the data. They include association (patterns where one event is connected to another event, such as purchasing a pen and purchasing paper), sequence or path analysis, classification, clustering (finding and visually documenting groups of previously unknown facts, such as geographic location and brand preferences), and forecasting (discovering patterns from which one can make reasonable predictions regarding future activities).
Reflecting this conceptualization of data mining, some observers consider data mining to be just one step in a larger process known as knowledge discovery in databases (KDD). Other steps in the KDD process, in progressive order, include data cleaning, data integration, data selection, data transformation, (data mining), pattern evaluation, and knowledge presentation.
1.1 FEATURE SELECTION:
Data mining is the process of finding interesting patterns in data, and it often involves datasets with a large number of attributes. Many of the attributes in most real-world data are redundant or simply irrelevant to the purpose of discovering interesting patterns. Attribute reduction selects the relevant attributes in the dataset before data mining is performed. This matters for the accuracy of further analysis as well as for performance. Because redundant and irrelevant attributes can mislead the analysis, including all of the attributes in the data mining procedure not only increases the complexity of the analysis but also degrades the accuracy of the result. For instance, clustering techniques, which partition entities into groups with a maximum level of homogeneity within each cluster, may produce inaccurate results: when the population is spread over irrelevant dimensions, the clusters found in the higher-dimensional space may not be well separated.
Attribute reduction improves the performance of data mining techniques by reducing dimensions so that data mining procedures process data with a reduced number of attributes. With dimension reduction, improvement in orders of magnitude is possible. Attribute selection and reduction aim at choosing a small subset of attributes that is sufficient to describe the data set; it is the process of identifying and removing as much of the irrelevant and redundant information as possible. The intrinsic dimensionality of data is the minimum number of parameters needed to account for the observed properties of the data. Attribute reduction is important in many domains, since it facilitates classification, visualization, and compression of high-dimensional data by mitigating the curse of dimensionality and other undesired properties of high-dimensional spaces. It is beneficial not only for computational efficiency but also because it can improve the accuracy of the analysis: by working with the reduced representation, tasks such as classification or clustering can often yield more accurate and more readily interpretable results, while computational costs may also be significantly reduced. The identification of a reduced set of features that are predictive of outcomes can be very useful from a knowledge discovery perspective. For many learning algorithms, the training and/or classification time increases directly with the number of features. Sophisticated attribute selection methods have been developed to tackle three problems: reducing classifier cost and complexity, improving model accuracy, and improving the visualization and comprehensibility of induced concepts.
In machine learning and statistics, feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique of selecting a subset of relevant features for building robust learning models.
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features. Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. This is of particular importance for classifiers that, unlike Naive Bayes (NB), are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features. A noise feature is one that, when added to the document representation, increases the classification error on new data.
This has been an active research area in the pattern recognition, statistics, and data mining communities. The main idea of feature selection is to choose a subset of input variables by eliminating features with little or no predictive information. Feature selection can significantly improve the comprehensibility of the resulting classifier models and often yields a model that generalizes better to unseen points. Further, finding the correct subset of predictive features is often an important problem in its own right. For example, a physician may decide, based on the selected features, whether a dangerous surgery is necessary for treatment or not.
Feature selection in supervised learning has been well studied; there, the main goal is to find a feature subset that produces higher classification accuracy. In unsupervised learning, learning algorithms are designed to find a natural grouping of the examples in the feature space, so feature selection aims to find a good subset of features that forms high-quality clusters for a given number of clusters. By removing most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models by:
• Alleviating the effect of the curse of dimensionality.
• Enhancing generalization capability.
• Speeding up the learning process.
• Improving model interpretability.
Feature selection also helps people acquire a better understanding of their data by telling them which features are important and how they are related to each other.
Feature selection has several advantages, such as:
• Improving the performance of the machine learning algorithm.
• Data understanding: gaining knowledge about the process and perhaps helping to visualize it.
• Data reduction: limiting storage requirements and perhaps helping to reduce costs.
• Simplicity: the possibility of using simpler models and gaining speed.
In this project, information gain and Bayes' theorem are employed to identify the redundant and irrelevant attributes in a dataset and remove them, thereby reducing the attribute set, increasing the classification accuracy, and reducing the computational time. Bayes' theorem is used for the task of attribute reduction. The Naive Bayes classifier is a simple but effective classifier that has been used in numerous information processing applications such as image recognition, natural language processing, and information retrieval. The Naive Bayes algorithm affords fast, highly scalable model building and scoring. It scales linearly with the number of predictors and rows. The build process for Naive Bayes is parallelized. (Scoring can be parallelized irrespective of the algorithm.) Naive Bayes can be used for both binary and multiclass classification.
Given a set of objects, each of which belongs to a known class and has a known vector of variables, our aim is to construct a rule that will allow us to assign future objects to a class, given only the vectors of variables describing those objects. Problems of this kind, called problems of supervised classification, are ubiquitous, and many methods for constructing such rules have been developed. One very important method is Naive Bayes, also called idiot's Bayes, simple Bayes, and independence Bayes. This method is important for several reasons. It is very easy to construct, since it does not need any complicated iterative parameter estimation schemes, which means it may be readily applied to huge datasets. It is easy to interpret, so users unskilled in classifier technology can understand why it makes the classification it makes. And finally, it often does surprisingly well: it may not be the best possible classifier in any particular application, but it can usually be relied on to be robust and to do quite well.
Reasons for using Naïve Bayes:
• Handles quantitative and discrete data
• Robust to isolated noise points
• Handles missing values by ignoring the instance during probability estimate calculations
• Fast and space efficient
• Not sensitive to irrelevant features
• Quadratic decision boundary
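To make the Naive Bayes rule concrete, here is a minimal sketch for categorical attributes. It is not the project's implementation: the toy weather data, the function names, and the use of add-one (Laplace) smoothing are assumptions chosen for illustration. Training only counts values per class, which is why no iterative parameter estimation is needed.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count classes and per-class attribute values (the whole 'training')."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    attribute_values = defaultdict(set)   # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            attribute_values[i].add(value)
    return class_counts, value_counts, attribute_values

def predict(model, row):
    """Return the class with the highest log-posterior under independence."""
    class_counts, value_counts, attribute_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Add-one smoothing avoids zero probabilities for unseen values.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(attribute_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (outlook, windy) -> play?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"),
        ("rainy", "no"), ("overcast", "no"), ("overcast", "yes")]
labels = ["yes", "yes", "no", "no", "yes", "yes"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("sunny", "no")))
print(predict(model, ("rainy", "yes")))
```

Working in log space keeps the product of many small probabilities numerically stable, and a single pass over the data is enough to build the model.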
One of the most important components of a decision tree algorithm is the criterion used to select which attribute will become the test attribute in a given branch of the tree. There are different criteria; one of the best known is information gain. Information gain is usually a good measure for deciding the relevance of an attribute. This approach minimizes the expected number of tests needed to classify a given tuple.
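Information gain can be computed directly from class counts. The sketch below uses the standard entropy-based definition (entropy of the class labels minus the weighted entropy after splitting on an attribute); the tiny dataset and the function names are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Reduction in class entropy obtained by splitting on one attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

# Hypothetical data: attribute 0 determines the class, attribute 1 is noise.
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes")]
labels = ["yes", "yes", "no", "no"]

gains = [information_gain(rows, labels, i) for i in range(2)]
ranked = sorted(range(2), key=lambda i: gains[i], reverse=True)
print(gains, ranked)
```

An attribute whose gain is close to zero carries almost no information about the class, so it is a candidate for removal during attribute reduction.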
Reply
#8
Presented by:
Chris Nelson

[attachment=10871]
Data Mining
• New buzzword, old idea.
• Inferring new information from already collected data.
• Traditionally the job of data analysts.
• Computers have changed this.
It is far more efficient to comb through data using a machine than to eyeball statistical data.
Data Mining – Two Main Components
• Wikipedia definition: “Data mining is the entire process of applying computer-based methodology, including new techniques for knowledge discovery, from data.”
• Knowledge Discovery
Concrete information gleaned from known data: information you may not have known, but which is supported by recorded facts.
(e.g., the diapers-and-beer example from the previous presentation)
• Knowledge Prediction
Uses known data to forecast future trends, events, etc. (e.g., stock market predictions)
• Wikipedia note: “some data mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery.” These include applications in AI and symbol analysis.
Data Mining vs. Data Analysis
• In terms of software and the marketing thereof:
Data Mining != Data Analysis
• Data mining implies that the software uses some intelligence, beyond simple grouping and partitioning of data, to infer new information.
• Data analysis is more in line with standard statistical software (e.g., web stats). Such tools usually present information about subsets and relations within the recorded data set (e.g., browser/search engine usage, average visit time, etc.)
Data Mining Subtypes
• Data Dredging
The process of scanning a data set for relations and then coming up with a hypothesis for the existence of those relations.
• Metadata
Data that describes other data. It can describe an individual element or a collection of elements.
Wikipedia example: “In a library, where the data is the content of the titles stocked, metadata about a title would typically include a description of the content, the author, the publication date and the physical location.”
• Applications of data dredging in business include market and risk analysis, as well as trading strategies.
Applications in science include disaster prediction.
Propositional vs. Relational Data
• Older data mining methods relied on propositional data: data related to a single, central element that can be represented in vector format (e.g., the purchasing history of a single user; Amazon uses such vectors in its related-item suggestions [a multidimensional dot product]).
• Current, advanced data mining methods rely on relational data: data that can be stored and modeled easily using relational databases. An example would be data used to represent interpersonal relations.
• Relational data is more interesting than propositional data to miners in the sense that an entity, and all the entities to which it is related, factor into the data inference process.
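The dot product over purchase vectors mentioned above can be sketched as follows. This is a toy illustration, not a description of any retailer's actual system: the item catalogue, the users, and the `suggest` helper are invented for the example. Each user is a 0/1 vector over the catalogue, and the dot product counts the items two users have in common.

```python
# Hypothetical item catalogue; each purchase vector follows this order.
ITEMS = ["book", "dvd", "camera", "tripod"]

# Hypothetical purchase histories as 0/1 vectors (propositional data).
purchases = {
    "alice": [1, 0, 1, 1],
    "bob":   [1, 0, 1, 0],
    "carol": [0, 1, 0, 0],
}

def dot(u, v):
    """Dot product; for 0/1 vectors it counts the items bought by both users."""
    return sum(a * b for a, b in zip(u, v))

def suggest(user, purchases):
    """Suggest items owned by the most similar other user but not by `user`."""
    mine = purchases[user]
    others = [name for name in purchases if name != user]
    nearest = max(others, key=lambda name: dot(mine, purchases[name]))
    theirs = purchases[nearest]
    return [item for item, m, t in zip(ITEMS, mine, theirs) if t and not m]

print(dot(purchases["alice"], purchases["bob"]))
print(suggest("bob", purchases))
```

Because every user is reduced to a single vector, relations between users (for example, who knows whom) are invisible to this approach, which is the limitation the relational methods address.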
Key Component of Data Mining
• Whether for knowledge discovery or knowledge prediction, data mining takes information that was once quite difficult to detect and presents it in an easily understandable format (e.g., graphical or statistical).
• Data mining techniques involve sophisticated algorithms, including decision tree classification, association detection, and clustering.
• Since data mining is not on the test, I will keep things superficial.
Uses of Data Mining
• AI/Machine Learning
Combinatorial/Game Data Mining
Good for analyzing winning strategies in games, and thus for developing intelligent AI opponents (e.g., chess).
• Business Strategies
Market Basket Analysis
Identify customer demographics, preferences, and purchasing patterns.
• Risk Analysis
Product Defect Analysis
Analyze product defect rates for given plants and predict possible complications (read: lawsuits) down the line.
• User Behavior Validation
Fraud Detection
In the realm of cell phones: comparing phone activity to calling records can help detect calls made on cloned phones.
Similarly, with credit cards, comparing purchases with historical purchases can detect activity with stolen cards.
• Health and Science
Protein Folding
Predicting protein interactions and functionality within biological cells. Applications of this research include determining causes and possible cures for Alzheimer's, Parkinson's, and some cancers (caused by protein "misfolds").
Extra-Terrestrial Intelligence
Scanning satellite receptions for possible transmissions from other planets.
• For more information, see Stanford's Folding@home and Berkeley's SETI@home projects. Both involve participation in a widely distributed computer application.
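The credit card case under Fraud Detection, comparing a new purchase with historical purchases, can be sketched with a simple z-score check. This is only one possible approach and a deliberately naive one: the purchase amounts and the threshold of three standard deviations are assumptions for the example, and real systems use far richer features.

```python
from statistics import mean, stdev

def is_suspicious(history, amount, threshold=3.0):
    """Flag a purchase that lies far from the cardholder's usual amounts.

    `history` needs at least two past amounts so that the sample
    standard deviation is defined.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past purchases were identical; any different amount stands out.
        return amount != mu
    return abs(amount - mu) / sigma > threshold

# Hypothetical purchase history for one card, in dollars.
history = [20.0, 25.0, 22.0, 30.0, 27.0, 24.0]

print(is_suspicious(history, 26.0))   # in line with past behaviour
print(is_suspicious(history, 500.0))  # far outside past behaviour
```

The same comparison of current against historical behaviour applies to the cloned-phone case, with call patterns in place of purchase amounts.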
Sources of Data for Mining
• Databases (most obvious)
• Text Documents
• Computer Simulations
• Social Networks
Privacy Concerns
• Mining of public and government databases is done, though people have raised, and continue to raise, concerns.
• Wiki quote:
"data mining gives information that would not be available otherwise. It must be properly interpreted to be useful. When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics."
Prevalence of Data Mining
• Your data is already being mined, whether you like it or not.
• Many web services require that you allow access to your information [for data mining] in order to use the service.
• Google mines email data in Gmail accounts to present account owners with ads.
• Facebook requires users to allow access to info from non-Facebook pages. Facebook privacy policy:
"We may use information about you that we collect from other sources, including but not limited to newspapers and Internet sources such as blogs, instant messaging services and other users of Facebook, to supplement your profile."
• This allows access to your blog RSS feed (rather innocuous), as well as information obtained through partner sites (worthy of concern).
Data Mining Controversies
• Latest one: Facebook's Beacon advertising program (it popped up on Slashdot within the last week)
• What Beacon does:
“when you engage in consumer activity at a [Facebook] partner website, such as Amazon, eBay, or the New York Times, not only will Facebook record that activity, but your Facebook connections will also be informed of your purchases or actions.” [taken from http://trickytrickywhiteboy.blogspot2007...eacon.html]
Controversies continued
• Implications: “Thus where Facebook used to be collecting data only within the confines of its own website, it will now extend that ability to harvest data across other websites that it partners with. Some of the companies that have signed on to participate on the advertising side include Coca-Cola, Sony, Verizon, Comcast, Ebay — and the CBC. The initial list of 44 partner websites participating on the data collection side include the New York Times, Blockbuster, Amazon, eBay, LiveJournal, and Epicurious.”
[Remember the privacy policy on the previous slide]
• The verdict is still out. This may violate an old (100+ years) New York law prohibiting advertising using endorsements without the endorsee’s consent.
• Facebook currently offers users no way to opt out of Beacon (once it has been activated?). Users can close their accounts, but account data is never deleted.
Bottom Line
• Data obtained through data mining is incredibly valuable.
• Companies are understandably reluctant to give up data they have obtained.
• Expect to see the prevalence of data mining and (possibly subversive) methods increase in years to come.
Reply
#10
All the information here is very useful.
Thanks for sharing.
Reply
#11

To get information about the topic "data mining full report", refer to the page links below:
http://studentbank.in/report-data-mining-full-report

http://studentbank.in/report-data-mining...ort?page=2
Reply
#12
Data mining (the analysis step of the knowledge discovery in databases process), a relatively young and interdisciplinary field of computer science, is the process of discovering new patterns from large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems.

The goal of data mining is to extract knowledge from a data set in a human-understandable structure and involves database and data management, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of found structure, visualization and online updating.

Reply
#13
Data Mining

[attachment=16836]

What Is Data Mining?

Data mining (knowledge discovery from data)
Extraction of interesting patterns or knowledge from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, business intelligence, etc.


Predictive Modeling

A model is developed using a supervised learning approach, which has two phases: training and testing.
– Training builds a model using a large sample of historical data called a training set.
– Testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics.
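The two phases can be shown with a deliberately simple model. Everything in this sketch is an assumption for illustration: the records, the fixed split into training and test sets, and the "model" itself, which is just a cut-off placed halfway between the mean feature value of each class (and assumes the positive class has the higher values).

```python
def train_threshold(training_set):
    """Training phase: learn a cut-off halfway between the two class means."""
    positives = [x for x, y in training_set if y == 1]
    negatives = [x for x, y in training_set if y == 0]
    return (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2

def predict(threshold, x):
    """Classify a record as 1 if its feature value exceeds the cut-off."""
    return 1 if x > threshold else 0

def accuracy(threshold, test_set):
    """Testing phase: share of unseen records the model classifies correctly."""
    correct = sum(predict(threshold, x) == y for x, y in test_set)
    return correct / len(test_set)

# Hypothetical historical records: (income in thousands, bought the product?).
records = [(20, 0), (25, 0), (30, 0), (35, 0), (60, 1), (65, 1), (70, 1), (75, 1),
           (28, 0), (68, 1), (50, 0), (72, 1)]

# The first eight records form the training set; the rest are held out.
training_set, test_set = records[:8], records[8:]

threshold = train_threshold(training_set)
print(threshold, accuracy(threshold, test_set))
```

The held-out record (50, 0) falls on the wrong side of the learned cut-off, which shows why accuracy has to be measured on data the model has not seen during training.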


Predictive Modeling - Classification

Used to establish a specific predetermined class for each record in a database from a finite set of possible class values.

Two specializations of classification: tree induction and neural induction.


Predictive Modeling - Value Prediction

Used to estimate a continuous numeric value that is associated with a database record.

Uses the traditional statistical techniques of linear regression and nonlinear regression.
Relatively easy to use and understand.
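Value prediction with linear regression can be sketched using the ordinary least-squares formulas for a single predictor. The data points and function names below are made up for the example; the points lie exactly on a line so that the fitted coefficients are easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_value(slope, intercept, x):
    """Estimate the continuous value for a new record."""
    return slope * x + intercept

# Hypothetical records: x = years as a customer, y = yearly spend (hundreds).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict_value(slope, intercept, 5))
```

With noisy real data the points would not fall on the line, and the fit minimizes the sum of squared differences between predicted and observed values.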
Reply
#14
Overview
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both.
Data mining software is one of a number of analytical tools for analyzing data.

It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Reply
#17
I need more seminar reports on data mining so that I can choose one of them.

Reply
#18
data mining full report

[attachment=18131]

Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
(Deductive) query processing.
Expert systems or small ML/statistical programs


Real Example from the NBA

Play-by-play information recorded by teams
Who is on the court
Who shoots
Results
Coaches want to know what works best
Plays that work well against a given team
Good/bad player matchups
Advanced Scout (from IBM Research) is a data mining tool to answer these questions

Why Data Mining?—Potential Applications

Data analysis and decision support
Market analysis and management
Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other Applications
Text mining (newsgroups, email, documents) and Web mining
Stream data mining
DNA and bio-data analysis


Reply
#19
to get information about the topic "data mining" full report ,ppt and related topic refer the page link bellow

http://studentbank.in/report-data-mining-project-ideas

http://studentbank.in/report-data-mining-full-report

http://studentbank.in/report-data-mining...ars-report

http://studentbank.in/report-data-mining-project-topics

http://studentbank.in/report-java-based-...ject-ideas

http://studentbank.in/report-data-mining...a-proposal
Reply
