16-02-2017, 10:51 AM
Although attempts have been made to cluster categorical data via cluster ensembles, with results competitive with conventional algorithms, these techniques generate the final data partition from incomplete information. The underlying ensemble-information matrix records only cluster-to-data-point relations, leaving many entries unknown. This article presents an analysis suggesting that this problem degrades the quality of the clustering result, and it proposes a new link-based approach that improves the conventional matrix by discovering unknown entries through the similarity between clusters in an ensemble. In particular, we propose an efficient link-based algorithm for assessing this underlying similarity. Subsequently, to obtain the final clustering result, a graph partitioning technique is applied to a weighted bipartite graph formulated from the refined matrix. Experimental results on multiple real data sets suggest that the proposed link-based method almost always outperforms both conventional clustering algorithms for categorical data and well-known cluster ensemble techniques.
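To make the idea of filling in unknown matrix entries more concrete, here is a minimal sketch in Python. It is not the article's algorithm: the similarity measure (shared neighbours weighted by Jaccard overlap), the `decay` factor, and all function names are assumptions chosen for illustration. It builds the point-by-cluster matrix from a few base clusterings and replaces each unknown (zero) entry with the similarity between the cluster the point actually belongs to and the cluster in question. The final graph partitioning step is not shown.

```python
from itertools import combinations

def clusters_of(partitions):
    """Flatten base clusterings into a list of (partition_index, member_set)."""
    out = []
    for p, labels in enumerate(partitions):
        for lab in sorted(set(labels)):
            out.append((p, {i for i, l in enumerate(labels) if l == lab}))
    return out

def jaccard(a, b):
    """Overlap of two member sets, in [0, 1]."""
    return len(a & b) / len(a | b)

def refined_matrix(partitions, decay=0.9):
    """Point-by-cluster matrix whose unknown entries are estimated from
    cluster-to-cluster similarity (illustrative measure, see lead-in)."""
    clusters = clusters_of(partitions)
    k = len(clusters)
    # Link weights between clusters: overlap of their member sets.
    w = [[jaccard(clusters[i][1], clusters[j][1]) if i != j else 0.0
          for j in range(k)] for i in range(k)]
    # Two clusters are similar when they link to the same third clusters.
    shared = [[0.0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        s = sum(min(w[i][t], w[j][t]) for t in range(k) if t not in (i, j))
        shared[i][j] = shared[j][i] = s
    top = max(max(row) for row in shared) or 1.0  # avoid division by zero
    n = len(partitions[0])
    matrix = [[0.0] * k for _ in range(n)]
    for x in range(n):
        for c, (p, members) in enumerate(clusters):
            if x in members:
                matrix[x][c] = 1.0  # known entry: x belongs to this cluster
            else:
                # Cluster of the same base clustering that does contain x.
                own = next(i for i, (q, m) in enumerate(clusters)
                           if q == p and x in m)
                matrix[x][c] = decay * shared[own][c] / top
    return matrix

# Two base clusterings of four data points, given as label lists.
base = [[0, 0, 1, 1], [0, 0, 0, 1]]
for row in refined_matrix(base):
    print([round(v, 2) for v in row])
```

Keeping `decay` below 1 means an estimated entry never reaches the value of a known membership, so the original cluster assignments remain distinguishable in the refined matrix.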
Data mining is the practice of automatically searching large data stores to discover patterns and trends that go beyond simple analysis. Data mining models (predictive and descriptive) are built using the following main tasks: classification, regression, clustering, summarization, dependency modelling, and change and deviation detection. Clustering groups the elements of a data set according to their similarity, so that elements within a cluster are similar to one another while elements of different clusters are dissimilar. It is used to analyse or process multivariate data, for example to characterise customer groups based on purchasing patterns, classify web documents, group genes and proteins that have similar functionality, or group spatial locations prone to earthquakes based on seismological data.
A cluster ensemble integrates the results of several clustering algorithms using a consensus function to obtain stable results. The idea of combining different clustering results (cluster ensemble or cluster aggregation) emerged as an alternative approach to improving the quality of clustering results. In this work we have designed and implemented a cluster ensemble approach that uses the divide-and-conquer technique to handle mixed data sets, i.e., those with both numerical and categorical attributes. First, the initial data set is divided into a numerical subset and a categorical subset. Next, clustering algorithms designed for numerical and categorical data are applied to the respective subsets to produce corresponding clusters. Finally, the clustering results from the previous step are combined into a categorical data set, to which the same categorical clustering algorithm, or any other, can be applied to produce the final output clusters.
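The divide-and-conquer steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the text does not name the clustering algorithms, so a plain k-means is assumed for the numerical part and a k-modes style procedure for the categorical part, both seeded deterministically with the first k distinct records. All function names are hypothetical.

```python
def split_mixed(rows):
    """Split each record into its numerical and categorical attributes.
    Assumes numbers are int/float and categories are strings."""
    numeric = [[v for v in r if isinstance(v, (int, float))] for r in rows]
    categorical = [[v for v in r if isinstance(v, str)] for r in rows]
    return numeric, categorical

def first_distinct(items, k):
    """Deterministic seeding: the first k distinct records."""
    seeds = []
    for it in items:
        if it not in seeds:
            seeds.append(list(it))
        if len(seeds) == k:
            break
    return seeds

def cluster(items, k, dist, center, iters=20):
    """Assign-then-update loop shared by the k-means and k-modes variants."""
    centers = first_distinct(items, k)
    labels = [0] * len(items)
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda c: dist(it, centers[c]))
                  for it in items]
        for c in range(len(centers)):
            members = [it for it, l in zip(items, labels) if l == c]
            if members:
                centers[c] = center(members)
    return labels

def sq_euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(members):
    return [sum(col) / len(col) for col in zip(*members)]

def mismatch(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def mode(members):
    """Most frequent value per attribute (ties broken alphabetically)."""
    return [max(sorted(set(col)), key=col.count) for col in zip(*members)]

def mixed_clustering(rows, k):
    numeric, categorical = split_mixed(rows)               # step 1: divide
    num_labels = cluster(numeric, k, sq_euclid, mean)      # step 2: numerical
    cat_labels = cluster(categorical, k, mismatch, mode)   # step 2: categorical
    # Step 3: the two label vectors form a new categorical data set.
    combined = [[f"n{a}", f"c{b}"] for a, b in zip(num_labels, cat_labels)]
    return cluster(combined, k, mismatch, mode)

rows = [
    [1.0, 1.2, "red", "small"],
    [0.9, 1.0, "red", "small"],
    [8.0, 9.1, "blue", "large"],
    [8.2, 9.0, "blue", "large"],
]
print(mixed_clustering(rows, 2))
```

Because the combined labels are themselves categorical, the last step can reuse the same categorical routine, which is the point of the final sentence above.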