
seminar project explorer · 14-02-2011, 05:29 PM

Application of Rough Sets in Data Mining
A Project Report
Submitted in partial fulfilment of
the requirements for the award of the degree of
Master of Technology
in
Computer Science and Engineering
by
Abdul Nassar . A.A.
M105101
Department of Computer Science & Engineering
College of Engineering Trivandrum
Kerala - 695016
2010-11

Contents
1 Introduction
1.1 Data Mining
1.1.1 Clustering
1.1.2 Association Rules
1.1.3 Problem Statement
1.2 Objective
1.3 Advantages of RST Approach in Clustering
2 An Algorithm For Clustering Using Similarity - Measure In RST
2.0.1 Functional Requirements
3 Data Flow Diagram
4 Analysis of the Algorithm
5 Conclusion
Abstract
Data mining is the technique of extracting meaningful information from large and mostly
unorganized data banks.
Rough set theory is a mathematical approach proposed by Z. Pawlak in the early 1980s. It deals
with the classificatory analysis of information systems. The basic concepts of rough set theory
discussed in this project include the indiscernibility relation, reduct, core, upper approximation
and lower approximation.
Clustering is a major task in data mining. Clustering enables you to create
new groups and classes based on the study of patterns and relationships between values of
data in a data bank. Rough set based indiscernibility relations can be used for clustering by
measuring the similarity among the data items. In the proposed approach the strict notion of
indiscernibility is relaxed and classes are formed on the basis that objects are similar rather
than identical.


1 Introduction
Organizations worldwide generate large amounts of data, most of it unorganized. This unorganized
data must be processed to generate meaningful and useful information. To
organize large amounts of data, you can use database management
systems such as Oracle and SQL Server, which require you to use SQL, a specialized query
language, to retrieve data from a database. However, SQL is not always adequate to
meet end-user requirements for specialized and sophisticated information from a large
unorganized data bank.
1.1 Data Mining
Data mining is the technique of extracting meaningful information from large and mostly
unorganized data banks. It is the process of performing automated extraction and generating
predictive information from large data banks.
The extraction of meaningful information from a large data bank is otherwise known as knowledge
discovery. One school of thought considers data mining a step in the process of knowledge
discovery in databases (KDD), while another school of thought considers data mining
synonymous with KDD.
Data mining makes use of various algorithms to perform a variety of tasks. These algorithms
examine the sample data of a problem and determine a model that comes close to solving the
problem. These models are classified as predictive and descriptive models. A predictive model
enables you to predict the values of data by making use of known results from a different set
of sample data. The data mining tasks that form part of the predictive model are:
1. Classification
2. Regression
3. Time series analysis
A descriptive model enables you to determine the patterns and relationships in sample data.
The data mining tasks that form part of the descriptive model are:
1. Clustering
2. Summarization
3. Association Rules
4. Sequence discovery.
1.1.1 Clustering
Clustering enables you to create new groups and classes based on the study of
patterns and relationships between values of data in a data bank.
1.1.2 Association Rules
The use of association rules enables you to establish association and relationships between large
and unclassified data items based on certain attributes and characteristics. Association rules
define certain rules of associativity between data items and then use those rules to establish
relationships.
1.1.3 Problem Statement
How can the concepts of Rough Set Theory (indiscernibility, reduct and core) be used in the
clustering area of data mining?
A rough set is a formal approximation of a crisp set in terms of a pair of sets, which give the
lower and upper approximations of the original set. Rough set theory is an emerging soft computing
tool with a wide range of applications, including problems in machine learning.
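As a small illustration of the two approximations, the Python sketch below partitions a toy table into indiscernibility classes and computes the lower and upper approximation of a target set. The table, the attribute names and the function names are assumptions made for this example and do not come from the report.

```python
from collections import defaultdict

def indiscernibility_classes(table, attributes):
    """Group object ids whose values agree on every listed attribute."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(obj_id)
    return list(classes.values())

def approximations(table, attributes, target):
    """Return the (lower, upper) approximation of the set `target`."""
    lower, upper = set(), set()
    for cls in indiscernibility_classes(table, attributes):
        if cls <= target:   # class lies entirely inside the target
            lower |= cls
        if cls & target:    # class overlaps the target
            upper |= cls
    return lower, upper

# Hypothetical toy data: four patients described by two attributes.
table = {
    "p1": {"fever": "yes", "cough": "yes"},
    "p2": {"fever": "yes", "cough": "yes"},
    "p3": {"fever": "no", "cough": "yes"},
    "p4": {"fever": "no", "cough": "no"},
}
flu = {"p1", "p3"}  # the crisp set to approximate
lower, upper = approximations(table, ["fever", "cough"], flu)
print(sorted(lower), sorted(upper))
```

Here p1 and p2 are indiscernible, so p1 cannot be placed in the lower approximation with certainty, while both appear in the upper approximation.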
Data mining is one of the areas in which rough sets are widely used. Data mining is the process
of automatically searching large volumes of data for patterns using tools such as classification,
association rule mining and clustering. Rough set theory is a well-understood formal
framework for building data mining models in the form of logic rules, on the basis of which it
is possible to issue predictions that allow new cases to be classified. The indiscernibility relation of RST
can be used as a measure of similarity for clustering objects without any distance function.
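One way to read this similarity measure is as a count of the attribute positions at which two objects disagree. The Python sketch below assumes objects are stored as equal-length tuples; the names difference and similar are illustrative choices, not taken from the report.

```python
def difference(a, b):
    """Count the attribute positions at which two objects differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(1 for x, y in zip(a, b) if x != y)

def similar(a, b, r):
    """Relaxed indiscernibility: at most r attribute values differ."""
    return difference(a, b) <= r

# Hypothetical objects with three categorical attributes.
x = ("red", "small", "round")
y = ("red", "large", "round")
print(difference(x, y))                     # the objects differ only in size
print(similar(x, y, 0), similar(x, y, 1))
```

With r = 0 the test reduces to strict indiscernibility; larger values of r admit objects that are similar rather than identical.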
1.2 Objective
The objective is to apply the concepts of Rough Set Theory to develop or propose innovative algorithms
and approaches for clustering and rule mining. The project mainly concentrates on the application of
rough sets to clustering in data mining. The project is divided into two phases:
1. In the first phase of the project, the indiscernibility relation of RST is used for the
generation of clusters and an algorithm is developed for clustering data.
2. In the second phase, the algorithm developed is to be implemented and tested on a
variety of databases of different sizes and for different applications.
1.3 Advantages of RST Approach in Clustering
1. Cluster formation is natural and easy.
2. The RST approach provides definitions and methods for finding which attribute separates one
classification from another.
3. It uses only internal information for the formation of clusters.
4. It relies on attribute reduction.
5. This approach handles uncertainty in clustering process.
6. It is rather easy to implement and can handle any volume of data.
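To make the attribute reduction point concrete, the sketch below finds the attributes whose removal changes the indiscernibility partition, which is the usual definition of the core for a table without a decision attribute. The data and the function names are assumptions for illustration only.

```python
def partition(rows, attributes):
    """Indiscernibility partition of row indices over the given attributes."""
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(tuple(row[a] for a in attributes), []).append(i)
    return {frozenset(g) for g in groups.values()}

def core(rows, attributes):
    """Attributes whose removal changes the partition (indispensable ones)."""
    full = partition(rows, attributes)
    return [a for a in attributes
            if partition(rows, [b for b in attributes if b != a]) != full]

# Hypothetical table: "shape" never separates any two objects.
rows = [
    {"colour": "red", "size": "small", "shape": "round"},
    {"colour": "red", "size": "large", "shape": "round"},
    {"colour": "blue", "size": "small", "shape": "round"},
]
print(core(rows, ["colour", "size", "shape"]))  # → ['colour', 'size']
```

Dropping "shape" leaves the partition unchanged, so it is dispensable, while "colour" and "size" each separate at least one pair of objects.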
2 An Algorithm For Clustering Using Similarity - Measure In RST
Basically, there are two requirements. The first is to group all identical objects together
to form base clusters. In a base cluster, all the attribute values of the objects that
belong to the cluster are identical. This forms the first functional requirement of our
algorithm. The process is to identify and club objects having the same attribute values, which
in turn form the base clusters.
For the second requirement, the strict notion of indiscernibility is relaxed. With
an r-value of 'n', objects that differ in at most 'n' attribute values are
clubbed together to form a cluster. The process starts from the base clusters, in which
identical objects are clubbed together. These base clusters are compared with each other and clubbed
to form new clusters when the objects differ in at most 'n' attribute values.
2.0.1 Functional Requirements
Requirement R1 - Generate Clusters with r = 0
1. Input: a database that contains data records.
2. Output: a database that contains groups of data records with the same attribute values.
Process
1. Identify data records with the same attribute values (r = 0) and store them together;
these form identical groups.
2. Continue the above process until all such groups are generated.
3. Identify each remaining distinct record and form it into a separate group.
4. Store all the data groups.
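The R1 process can be sketched in Python by using each record's full tuple of attribute values as a dictionary key. This is a minimal sketch that assumes records are hashable tuples held in memory; base_clusters is an illustrative name.

```python
from collections import defaultdict

def base_clusters(records):
    """Group records whose attribute values are all identical (r = 0)."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[tuple(record)].append(index)
    # Records with no identical partner end up as single-member clusters.
    return list(groups.values())

# Hypothetical records with three attributes each.
records = [
    ("a", 1, "x"),
    ("b", 2, "y"),
    ("a", 1, "x"),
    ("b", 2, "z"),
]
print(base_clusters(records))  # → [[0, 2], [1], [3]]
```

Grouping by key needs a single pass over the data, whereas comparing every pair of records needs a number of comparisons that grows quadratically.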
Requirement R2 - Generate Clusters with r = k
1. Input: a data file that contains groups of data records with r = 0.
2. Output: a database that contains groups of data records with r = k.
Process
1. From the database with r = 0, generate groups of data records by clubbing groups whose
attribute values differ in at most 'k' positions.
2. Repeat the process until the minimum number of clusters is formed.
3. Store all the data groups.
A Data Flow Diagram has been developed for the above process. There are five processes
in the DFD, each of which can be refined further.
3 Data Flow Diagram
Figure 5: Data Flow Diagram.
An algorithm is proposed for clustering data based on the above approach. The algorithm
is very simple. When this algorithm is implemented and tested with various databases of small
and medium size, we expect to get encouraging results.
Algorithm - Basic Steps
1. Classify the objects with the same attribute values (indiscernibility with r value = 0)
to form base clusters. Form all such base clusters.
2. From the clusters thus formed, identify and club groups with indiscernibility r value = k
between them to form new groups.
3. Repeat step 2 so that as many groups as possible are clubbed together, thus attaining the
minimum number of clusters with r = k.
Procedure BaseClusters( object[] );
Declare baseclust[size];
Begin
    K := 0;    { number of base clusters formed so far }
    For I := 1 to totalobjects do
    Begin
        Found := false;
        { compare object[I] with the representative of each existing base cluster }
        For J := 1 to K do
            If ( difference( baseclust[J].object, object[I] ) == 0 ) then
            Begin
                Addtobasecluster( baseclust[J], object[I] );
                Found := true;
            End;
        { a distinct object starts a new base cluster of its own }
        If ( not Found ) then
        Begin
            K := K + 1;
            Addtobasecluster( baseclust[K], object[I] );
        End;
    End;
End; { procedure ends }
The above pseudocode generates base clusters, in which the indiscernibility r value is 0.
The pseudocode given below generates clusters in which the indiscernibility r value is 'n'.
Procedure ClusterwithDifferN( baseclust[], n )
Declare clustN[size];
Begin
    Repeat
        Clear clustN;
        K := 0;
        For I := 1 to last record do
            If ( baseclust[I] is not in clustN[] ) then
            Begin
                { baseclust[I] starts a new group }
                K := K + 1;
                Addtoclustn( clustN[K], baseclust[I] );
                For J := I + 1 to last record do
                    If ( baseclust[J] is not in clustN[] ) and
                       ( difference( baseclust[I].object, baseclust[J].object ) <= n ) then
                        Addtoclustn( clustN[K], baseclust[J] );
            End;
        Changed := ( K < number of groups in baseclust );
        baseclust := clustN;    { the next pass works on the groups just formed }
    Until ( not Changed );    { no two groups in clustN have difference <= n }
End;
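The ClusterwithDifferN procedure above can be sketched in Python as a greedy merge. The report does not say how the difference between two groups that each hold several distinct objects should be measured, so this sketch assumes two groups may be clubbed only if every cross pair of objects differs in at most n attributes. That rule, and the names used here, are assumptions.

```python
def difference(a, b):
    """Number of attribute positions at which two objects differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def can_merge(c1, c2, n):
    """Assumed rule: every cross pair of objects differs in at most n attributes."""
    return all(difference(a, b) <= n for a in c1 for b in c2)

def cluster_with_differ_n(base, n):
    """Greedily merge clusters until no two clusters can be merged."""
    clusters = [list(c) for c in base]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if can_merge(clusters[i], clusters[j], n):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break   # restart the scan after every merge
    return clusters

# Each base cluster is represented by its distinct object (hypothetical data).
base = [[("a", 1, "x")], [("a", 1, "y")], [("b", 2, "z")], [("b", 2, "y")]]
result = cluster_with_differ_n(base, 1)
print(result)
```

With n = 1 the four base clusters collapse into two, and with n = 0 nothing is merged because the base clusters are already distinct. The result of a greedy merge can depend on the order in which pairs are examined.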
4 Analysis of the Algorithm
The above algorithm for the generation of clusters is quite easy to implement. The RST
approach provides definitions and methods for finding which attribute separates one
classification from another, and hence cluster formation is easy and natural. It uses only
internal information to form clusters.
Even though the algorithm can handle small and medium-sized databases effectively, there
may be some restrictions in the case of large databases due to system limitations
(available memory). In such cases, the following modification is suggested.
In the case of large databases, files may be used for the storage of intermediate clusters. A
part of the data structure/database is loaded into memory, processed to generate clusters and
stored to a file. The next part is then loaded and processed to generate clusters, the file is
updated, and so on for the whole database. The file thus created may then be refined again until
no more refinement is possible.
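A rough sketch of this file-backed variant for the r = 0 stage is given below. It uses Python's standard shelve module as the intermediate file, which is a choice made for this illustration and not something the report prescribes. The chunk size, the key format and the function names are likewise assumptions.

```python
import itertools
import os
import shelve
import tempfile

def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk

def base_clusters_on_disk(records, store_path, chunk_size=2):
    """Build r = 0 clusters chunk by chunk, keeping them in a disk-backed store."""
    position = 0
    with shelve.open(store_path) as store:
        for chunk in chunks(records, chunk_size):    # only one chunk is in memory
            for record in chunk:
                key = "|".join(map(str, record))     # assumes "|" never occurs in a value
                members = store.get(key, [])
                members.append(position)
                store[key] = members                 # write the updated cluster back
                position += 1
        return {key: store[key] for key in store}    # loaded here only for display

# Hypothetical records; in practice `records` could be a generator over a large file.
records = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("b", 2)]
with tempfile.TemporaryDirectory() as tmp:
    result = base_clusters_on_disk(records, os.path.join(tmp, "clusters"))
print(result)
```

Only the current chunk and one cluster entry are held in memory at a time; the refinement pass for r = k would then read the stored groups back in parts in the same way.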
5 Conclusion
Rough Set Theory can be used for data mining applications such as clustering and rule generation.
In clustering, the indiscernibility relation of rough set theory is utilized for the
generation of clusters. Using this concept, clusters are generated without making use of any
additional information such as a probability distribution or a membership function as in fuzzy set
theory. An algorithm has been developed based on this concept and can be implemented easily.
Encouraging results are expected when the algorithm is tested on a variety of databases.
The rough set concepts of reduct, core, lower approximation and upper approximation are
used for rule mining. Important rules can be generated by considering the lower approximation
of the target set. By considering the generated rules as attributes and constructing a
new decision table, a reduct rule set can be generated. The reduct rules thus generated are
more important, and the set does not contain any rules with low rule importance.
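As an illustration of generating rules from the lower approximation, the sketch below emits a rule for every class of condition values whose members all carry the target decision. The decision table, the attribute names and the function name are hypothetical, and the reduct rule set and rule importance steps are not covered.

```python
from collections import defaultdict

def certain_rules(rows, conditions, decision, target):
    """Rules from condition classes lying wholly inside the target decision class."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[c] for c in conditions)].append(row)
    rules = []
    for values, members in classes.items():
        # The class belongs to the lower approximation only if it is consistent.
        if all(m[decision] == target for m in members):
            rules.append((dict(zip(conditions, values)), target))
    return rules

# Hypothetical decision table with two condition attributes.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
]
rules = certain_rules(rows, ["outlook", "windy"], "play", "yes")
for antecedent, outcome in rules:
    print(antecedent, "=>", outcome)
```

The class (rainy, no) contains both decisions, so it lies in the boundary region and yields no certain rule.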
