30-08-2011, 09:38 AM
Abstract
Association rules represent an important class of knowledge that can be discovered from data warehouses. Current research efforts focus on efficient ways of discovering these rules from large databases. As a database grows, the rules discovered earlier need to be verified and new rules need to be added to the knowledge base. Since mining afresh every time the database grows is inefficient, algorithms for incremental mining are being investigated. Their primary aim is to avoid or minimize scans of the older database by reusing intermediate data constructed during the earlier mining. In this paper, we present one such algorithm. We make use of the large and candidate itemsets and their counts from the older database, and scan the increment to find which rules continue to hold and which fail in the merged database. We are also able to find new rules for the incremental and updated databases. The algorithm is adaptive: it infers the nature of the increment and, where possible, avoids multiple scans of the incremental database altogether. Another salient feature is that it does not need multiple scans of the older database. We also report some results on its performance against synthetic data.
1. Introduction
Data mining, also referred to as knowledge discovery in databases, is the process of nontrivial extraction of implicit, previously unknown, and potentially useful information from data in a database. It has recently attracted considerable attention from the database user community, who realize that the information locked inside large organizational databases built over many years can provide knowledge for enhancing their organization's effectiveness and competitiveness. Data mining yields knowledge in the form of rules and patterns derived from statistical analysis of the data. The process is challenging because the source databases from which the knowledge is extracted are large and growing. The knowledge itself is time-varying: rules and patterns that hold now may not hold in the future, and vice versa. Mining techniques must therefore scale well to very large and growing databases, and should permit efficient maintenance of the extracted knowledge.

One of the most studied data mining problems is mining for association rules. Given a collection of items and a set of records (i.e., transactions), each of which contains some of the items from the collection, association rules indicate affinities that exist among the items. These affinities can be expressed by rules such as "62% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences is called the confidence factor of the rule. A database may yield a very large number of association rules. Much work has been done on finding association rules [1] [2] [3] [8] [6]. These efforts are directed at devising algorithms that mine the rules efficiently in large databases; they commonly require multiple scans of the given database.
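To make the confidence factor concrete, the following sketch computes the confidence of a rule over a small set of toy transactions. The function name and the transaction data are our own illustration, not part of the paper:

```python
def confidence(transactions, antecedent, consequent):
    """Confidence = fraction of the transactions containing the
    antecedent that also contain the consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent <= set(t)]
    return len(with_both) / len(with_antecedent)

# Hypothetical transaction database (each transaction is a set of items).
transactions = [
    {"A", "B", "C", "D", "E"},
    {"A", "B", "C", "D"},
    {"A", "B", "C"},
    {"B", "C"},
]

# {A, B, C} occurs in 3 transactions; {D} occurs in 2 of those.
print(confidence(transactions, {"A", "B", "C"}, {"D"}))  # 0.666...
```

A rule "A, B, C => D, E" with confidence 0.62 corresponds to the 62% example in the text.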
As databases grow over time, mining must be undertaken again, both to maintain (i.e., verify) the rules discovered earlier and to discover new rules. However, it has been realized that applying the proposed algorithms to the entire updated database (i.e., the older and incremental databases together) may be too costly. Researchers are therefore investigating ways to maintain rules by processing the incremental part separately, scanning the older database only if necessary. To achieve this, incremental mining algorithms generally reuse intermediate information collected during the earlier mining process.
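The count-merging idea behind such incremental approaches can be sketched as follows: itemset counts retained from mining the older database are combined with counts obtained by scanning only the increment, so large itemsets in the merged database can be identified without rescanning the old data (assuming the retained counts cover the needed itemsets). This is a simplified illustration of the general idea, not the paper's specific algorithm; all names and data are assumptions:

```python
from collections import Counter
from itertools import combinations

def itemset_counts(transactions, size):
    """Count occurrences of every itemset of the given size."""
    counts = Counter()
    for t in transactions:
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1
    return counts

def update_large_itemsets(old_counts, old_n, increment, min_support, size=2):
    """Merge retained counts from the older database with counts from
    the increment; only the increment is scanned here."""
    inc_counts = itemset_counts(increment, size)
    total_n = old_n + len(increment)
    merged = Counter(old_counts) + inc_counts
    return {s: c for s, c in merged.items() if c / total_n >= min_support}

# Counts kept from an earlier mining run over 4 old transactions,
# plus a 2-transaction increment (all values hypothetical).
old_counts = Counter({("A", "B"): 3})
large = update_large_itemsets(old_counts, 4, [{"A", "B"}, {"C"}], 0.5)
print(large)  # {('A', 'B'): 4} — still large at support 4/6
```

In a real incremental miner, itemsets not retained from the earlier run may still require a scan of the older database; avoiding or minimizing such scans is precisely what the paper's algorithm addresses.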