I am doing a project on effective pattern discovery in text mining.
ABSTRACT
Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopt term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than term-based ones, but many experiments do not support this hypothesis. This paper presents an innovative and effective pattern discovery technique, which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.
INTRODUCTION
Due to the rapid growth of digital data made available in
recent years, knowledge discovery and data mining have
attracted a great deal of attention with an imminent need for
turning such data into useful information and knowledge.
Many applications, such as market analysis and business
management, can benefit from the use of the information and
knowledge extracted from a large amount of data.
Knowledge discovery can be viewed as the process of
nontrivial extraction of information from large databases,
information that is implicitly presented in the data,
previously unknown and potentially useful for users. Data
mining is therefore an essential step in the process of
knowledge discovery in databases.
RELATED WORK
Many types of text representations have been proposed in
the past. A well-known one is the bag of words, which uses
keywords (terms) as elements in the vector of the feature
space. In addition to TFIDF, the global IDF and entropy
weighting scheme is proposed in [9] and improves
performance by an average of 30 percent. Various weighting
schemes for the bag of words representation approach were
given in [1],[14]. The problem of the bag of words approach
is how to select a limited number of features among an
enormous set of words or terms in order to increase the
system’s efficiency and avoid overfitting. In order to reduce
the number of features, many dimensionality reduction
approaches have been applied using feature
selection techniques, such as Information Gain, Mutual
Information, Chi-Square, Odds Ratio, and so on.
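The bag of words weighting and feature selection ideas above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the whitespace tokenizer, the plain tf * log(N / df) weighting, and the toy documents are all assumptions made for illustration. The chi-square score uses the usual 2x2 term/class contingency table.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def chi_square(docs, labels, term, target):
    """Chi-square score of `term` for class `target` from a 2x2 contingency table."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc.lower().split()
        in_class = label == target
        if has_term and in_class:
            a += 1      # term present, document in class
        elif has_term:
            b += 1      # term present, document not in class
        elif in_class:
            c += 1      # term absent, document in class
        else:
            d += 1      # term absent, document not in class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

# Toy corpus (hypothetical data, for illustration only).
docs = ["oil price rises", "oil market falls",
        "football match result", "football team wins"]
labels = [1, 1, 0, 0]

weights = tfidf(docs)
vocabulary = sorted({t for doc in docs for t in doc.split()})
# Keep the two terms that score highest for class 1.
selected = sorted(vocabulary, key=lambda t: chi_square(docs, labels, t, 1),
                  reverse=True)[:2]
print(selected)
```

On this toy corpus, terms that occur in every document of one class and in none of the other score highest, so feature selection keeps them and discards the rest of the vocabulary.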
CONCLUSION
Many data mining techniques have been proposed in the last
decade. These techniques include association rule mining,
frequent itemset mining, sequential pattern mining,
maximum pattern mining, and closed pattern mining.
However, using this discovered knowledge (or these patterns) in
the field of text mining is difficult and ineffective. The
reason is that some useful long patterns with high specificity
lack support (i.e., the low-frequency problem). We argue
that not all frequent short patterns are useful. Hence,
misinterpretations of patterns derived from data mining
techniques lead to ineffective performance.
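The low-frequency problem can be seen in a small Python sketch of frequent and closed pattern mining. This is an illustrative brute-force version, not the paper's algorithm: the function names, the treatment of each paragraph as a set of terms, and the toy data are assumptions. It enumerates every termset, so it is only practical for very short paragraphs.

```python
from itertools import combinations

def frequent_termsets(paragraphs, min_support):
    """Map each termset to its absolute support, keeping those >= min_support."""
    counts = {}
    for paragraph in paragraphs:
        terms = sorted(set(paragraph))
        # Brute force: count every non-empty subset of the paragraph's terms.
        for size in range(1, len(terms) + 1):
            for combo in combinations(terms, size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {p: s for p, s in counts.items() if s >= min_support}

def closed_termsets(frequent):
    """Keep patterns that have no proper superset with the same support."""
    return {p: s for p, s in frequent.items()
            if not any(p < q and s == frequent[q] for q in frequent)}

# Toy paragraphs (hypothetical data, for illustration only).
paragraphs = [
    ["text", "mining"],
    ["text", "mining", "pattern"],
    ["text", "pattern"],
    ["text", "mining", "pattern", "taxonomy"],
]

frequent = frequent_termsets(paragraphs, min_support=2)
closed = closed_termsets(frequent)
for pattern, support in sorted(closed.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(pattern), support)
```

With a minimum support of 2, the most specific pattern (text, mining, pattern, taxonomy) is discarded because it occurs only once, while the short and rather general pattern (text) has the highest support. This is the trade-off between specificity and support described above.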