DATA MINING FULL REPORT
1. INTRODUCTION:
Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data; it also includes analysis and prediction.
Data mining can be performed on data represented in quantitative, textual, or multimedia forms. Data mining applications can use a variety of parameters to examine the data. They include association (patterns where one event is connected to another event, such as purchasing a pen and purchasing paper), sequence or path analysis, classification, clustering (finding and visually documenting groups of previously unknown facts, such as geographic location and brand preferences), and forecasting (discovering patterns from which one can make reasonable predictions regarding future activities).
Reflecting this conceptualization of data mining, some observers consider data mining to be just one step in a larger process known as knowledge discovery in databases (KDD). Other steps in the KDD process, in progressive order, include data cleaning, data integration, data selection, data transformation, (data mining), pattern evaluation, and knowledge presentation.
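To make these steps concrete, a minimal sketch of a KDD-style pipeline is given below. The file names, column names, and the use of the pandas and scikit-learn libraries are illustrative assumptions, not part of this report.

# Minimal sketch of the KDD steps listed above; "sales.csv", "customers.csv",
# and the column names are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans

# 1. Data cleaning: drop records with missing values.
raw = pd.read_csv("sales.csv")
clean = raw.dropna()

# 2. Data integration: combine two sources on a shared key.
extra = pd.read_csv("customers.csv")
merged = clean.merge(extra, on="customer_id")

# 3. Data selection: keep only the attributes relevant to the task.
selected = merged[["age", "income", "region"]]

# 4. Data transformation: encode categorical attributes so algorithms can use them.
transformed = pd.get_dummies(selected, columns=["region"])

# 5. Data mining: here, clustering customers into groups.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(transformed)

# 6. Pattern evaluation / 7. Knowledge presentation: summarise each cluster.
merged["cluster"] = labels
print(merged.groupby("cluster")[["age", "income"]].mean())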
1.1 FEATURE SELECTION:
Data mining is the process of finding interesting patterns in data. Data mining often involves datasets with a large number of attributes. Many of the attributes in most real-world data are redundant and/or simply irrelevant to the purposes of discovering interesting patterns. Attribute reduction selects relevant attributes in the dataset prior to performing data mining. This is important for the accuracy of further analysis as well as for performance, because redundant and irrelevant attributes could mislead the analysis, and including all of the attributes in the data mining procedures not only increases the complexity of the analysis but also degrades the accuracy of the result. For instance, clustering techniques, which partition entities into groups with a maximum level of homogeneity within each cluster, may produce inaccurate results: when the population is spread over irrelevant dimensions the clusters become weak, so clustering in a higher-dimensional space that includes irrelevant attributes can yield poor partitions.
Attribute reduction improves the performance of data mining techniques by reducing dimensions so that data mining procedures process data with a reduced number of attributes. With dimension reduction, performance improvements of orders of magnitude are possible. Attribute selection and reduction aim at choosing a small subset of attributes that is sufficient to describe the data set. It is the process of identifying and removing, as far as possible, the irrelevant and redundant information. The intrinsic dimensionality of data is the minimum number of parameters needed to account for the observed properties of the data. Attribute reduction is important in many domains, since it facilitates classification, visualization, and compression of high-dimensional data by mitigating the curse of dimensionality and other undesired properties of high-dimensional spaces. Attribute reduction can be beneficial not only for reasons of computational efficiency but also because it can improve the accuracy of the analysis. By working with this reduced representation, tasks such as classification or clustering can often yield more accurate and readily interpretable results, while computational costs may also be significantly reduced. The identification of a reduced set of features that are predictive of outcomes can be very useful from a knowledge discovery perspective. For many learning algorithms, the training and/or classification time increases directly with the number of features. Sophisticated attribute selection methods have been developed to tackle three problems: reduce classifier cost and complexity, improve model accuracy, and improve the visualization and comprehensibility of induced concepts.
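As a rough illustration of removing irrelevant and redundant attributes, the sketch below drops near-constant columns and one of each pair of almost perfectly correlated columns. The thresholds, column names, and use of pandas are assumptions made only for this example.

# Minimal sketch of attribute reduction on a numeric tabular dataset;
# the data and thresholds below are illustrative, not from the report.
import pandas as pd

def reduce_attributes(df: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    # Irrelevant attributes: (near-)constant columns carry no information.
    variances = df.var()
    constant_cols = variances[variances < 1e-12].index.tolist()
    reduced = df.drop(columns=constant_cols)

    # Redundant attributes: drop one of every pair that is almost
    # perfectly correlated with an attribute we are keeping.
    corr = reduced.corr().abs()
    to_drop = set()
    cols = corr.columns
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a not in to_drop and corr.loc[a, b] > corr_threshold:
                to_drop.add(b)
    return reduced.drop(columns=sorted(to_drop))

# Example: "duplicate" repeats "income" and "const" never varies, so both go.
data = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "income": [90, 20, 85, 40],
    "duplicate": [90, 20, 85, 40],
    "const": [1, 1, 1, 1],
})
print(reduce_attributes(data).columns.tolist())  # ['age', 'income']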
In machine learning and statistics, feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique of selecting a subset of relevant features for building robust learning models.
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features. Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. This is of particular importance for classifiers that, unlike Naive Bayes (NB), are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features. A noise feature is one that, when added to the document representation, increases the classification error on new data.
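A minimal sketch of such term-level feature selection for text classification is shown below; the toy corpus, the chi-squared selection criterion, and the use of scikit-learn are illustrative choices rather than anything prescribed in this report.

# Minimal sketch of selecting a subset of terms before training a text classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

docs = [
    "cheap pills buy now", "limited offer buy cheap",          # spam
    "meeting agenda for monday", "please review the report",   # ham
]
labels = [1, 1, 0, 0]

# Full vocabulary: every term in the training set becomes a feature.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Keep only the k terms most associated with the class label;
# the remaining terms are treated as noise features and dropped.
selector = SelectKBest(chi2, k=4)
X_reduced = selector.fit_transform(X, labels)
kept_terms = vectorizer.get_feature_names_out()[selector.get_support()]
print("selected terms:", list(kept_terms))

# Train the classifier on the reduced vocabulary only.
clf = MultinomialNB().fit(X_reduced, labels)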
This has been an active research area in pattern recognition, statistics, and data mining communities. The main idea of feature selection is to choose a subset of input variables by eliminating features with little or no predictive information. Feature selection can significantly improve the comprehensibility of the resulting classifier models and often build a model that generalizes better to unseen points. Further, it is often the case that finding the correct subset of predictive features is an important problem in its own right. For example, physician may make a decision based on the selected features whether a dangerous surgery is necessary for treatment or not.
Feature selection in supervised learning has been well studied, where the main goal is to find a feature subset that produces higher classification accuracy. For feature selection in unsupervised learning, learning algorithms are designed to find natural groupings of the examples in the feature space. Thus feature selection in unsupervised learning aims to find a good subset of features that forms high-quality clusters for a given number of clusters. By removing most irrelevant and redundant features from the data, feature selection helps improve the performance of learning models by:
• Alleviating the effect of the curse of dimensionality.
• Enhancing generalization capability.
• Speeding up the learning process.
• Improving model interpretability.
Feature selection also helps people acquire a better understanding of their data by telling them which features are important and how they are related to each other.
Feature selection has several advantages, such as:
• Improving the performance of the machine learning algorithm.
• Data understanding, gaining knowledge about the process and perhaps helping to visualize it.
• Data reduction, limiting storage requirements and perhaps helping in reducing costs.
• Simplicity, possibility of using simpler models and gaining speed.
In this project, information gain and Bayes' theorem are employed for determining the redundant and irrelevant attributes in a dataset and removing those attributes, thereby reducing the attribute set to increase the classification accuracy and reduce the computational time. Bayes' theorem is used for the task of attribute reduction. The Naive Bayes classifier is a simple but effective classifier which has been used in numerous applications of information processing such as image recognition, natural language processing, information retrieval, etc. The Naive Bayes algorithm affords fast, highly scalable model building and scoring. It scales linearly with the number of predictors and rows. The build process for Naive Bayes is parallelized. (Scoring can be parallelized irrespective of the algorithm.) Naive Bayes can be used for both binary and multiclass classification. Given a set of objects, each of which belongs to a known class, and each of which has a known vector of variables, our aim is to construct a rule which will allow us to assign future objects to a class, given only the vectors of variables describing the future objects. Problems of this kind, called problems of supervised classification, are ubiquitous, and many methods for constructing such rules have been developed. One very important one is the Naive Bayes method, also called idiot's Bayes, simple Bayes, and independence Bayes. This method is important for several reasons. It is very easy to construct, not needing any complicated iterative parameter estimation schemes. This means it may be readily applied to huge datasets. It is easy to interpret, so users unskilled in classifier technology can understand why it makes the classifications it makes. And finally, it often does surprisingly well: it may not be the best possible classifier in any particular application, but it can usually be relied on to be robust and to do quite well.
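A minimal sketch of this pipeline, ranking attributes by information gain and then classifying with Naive Bayes, is given below. The synthetic dataset and the use of scikit-learn (with mutual information as the information-gain estimate) are assumptions made for illustration only, not the report's prescribed implementation.

# Minimal sketch: select attributes by information gain, then classify with Naive Bayes.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data: 20 attributes, of which only a few are informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Information gain (mutual information) between each attribute and the class;
# keep only the top 5 attributes.
selector = SelectKBest(mutual_info_classif, k=5).fit(X_tr, y_tr)

# Naive Bayes on all attributes vs. on the reduced attribute set.
full = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
reduced = GaussianNB().fit(selector.transform(X_tr), y_tr).score(
    selector.transform(X_te), y_te)
print(f"accuracy with all 20 attributes: {full:.3f}")
print(f"accuracy with 5 selected attributes: {reduced:.3f}")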
Reasons for choosing Naïve Bayes:
• Handles quantitative and discrete data
• Robust to isolated noise points
• Handles missing values by ignoring the instance's missing attribute during probability estimate calculations (see the sketch after this list)
• Fast and space efficient
• Not sensitive to irrelevant features
• Quadratic decision boundary
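The missing-value behaviour mentioned in the list above can be sketched as follows: the missing attribute is simply skipped during the probability-estimate calculations. The priors, conditional probabilities, and attribute names below are made-up illustrative numbers, not taken from this report.

# Minimal sketch of Naive Bayes scoring that skips missing attribute values.
priors = {"spam": 0.4, "ham": 0.6}
cond = {  # P(attribute value | class), as if estimated from training counts
    ("contains_offer", True):  {"spam": 0.70, "ham": 0.10},
    ("contains_offer", False): {"spam": 0.30, "ham": 0.90},
    ("from_contact", True):    {"spam": 0.05, "ham": 0.80},
    ("from_contact", False):   {"spam": 0.95, "ham": 0.20},
}

def classify(instance: dict) -> str:
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, value in instance.items():
            if value is None:          # missing value: ignore this attribute
                continue
            score *= cond[(attr, value)][c]
        scores[c] = score
    return max(scores, key=scores.get)

# "from_contact" is unknown, so only the observed attribute contributes.
print(classify({"contains_offer": True, "from_contact": None}))  # spam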
One of the most important components of a decision tree algorithm is the criterion used to select which attribute will become a test attribute in a given branch of the tree. There are different criteria; one of the most well known is information gain. Information gain is usually a good measure for deciding the relevance of an attribute, and this approach minimizes the expected number of tests needed to classify a given tuple.
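A minimal sketch of this criterion is given below: information gain is computed for each candidate attribute at a node, and the attribute with the largest gain is chosen as the test attribute. The toy tuples and attribute names are hypothetical.

# Minimal sketch of using information gain to pick a decision tree's test attribute.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    # Gain(A) = H(labels) - sum over values v of |S_v|/|S| * H(labels where A = v)
    base = entropy(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return base - remainder

# Attributes: (outlook, windy); class label: play tennis yes/no.
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["yes", "yes", "no", "no"]

gains = {name: information_gain(rows, labels, i)
         for i, name in enumerate(["outlook", "windy"])}
print(gains)  # outlook has the larger gain, so it becomes the test attribute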