09-06-2012, 11:28 AM
Text Classification from Labeled and Unlabeled
Documents using EM
Text Classification.PDF (Size: 600.96 KB / Downloads: 0)
Abstract.
This paper shows that the accuracy of learned text classifiers can be improved by
augmenting a small number of labeled training documents with a large pool of unlabeled documents.
This is important because in many text classification problems obtaining training labels
is expensive, while large quantities of unlabeled documents are readily available.
Introduction
Consider the problem of automatically classifying text documents. This problem
is of great practical importance given the massive volume of online text available
through the World Wide Web, Internet news feeds, electronic mail, corporate
databases, medical patient records and digital libraries. Existing statistical text
learning algorithms can be trained to approximately classify documents, given a
sufficient set of labeled training examples. These text classification algorithms have
been used to automatically catalog news articles (Lewis & Gale, 1994; Joachims,
1998) and web pages (Craven, DiPasquo, Freitag, McCallum, Mitchell, Nigam, &
Slattery, 1998; Shavlik & Eliassi-Rad, 1998), automatically learn the reading interests
of users (Pazzani, Muramatsu, & Billsus, 1996; Lang, 1995), and automatically sort electronic mail (Lewis & Knowles, 1997; Sahami, Dumais, Heckerman, &
Horvitz, 1998).
Argument for the Value of Unlabeled Data
How are unlabeled data useful when learning classification? Unlabeled data alone
are generally insufficient to yield better-than-random classification because there is
no information about the class label (Castelli & Cover, 1995). However, unlabeled
data do contain information about the joint distribution over features other than
the class label. Because of this, they can sometimes be used together with a sample
of labeled data to significantly increase classification accuracy in certain problem
settings.
To see this, consider a simple classification problem, one in which instances are
generated using a Gaussian mixture model. Here, data are generated according to
two Gaussian distributions, one per class, whose parameters are unknown. Figure 1
illustrates the Bayes-optimal decision boundary (x > d), which classifies instances
into the two classes shown by the shaded and unshaded areas. Note that it is
possible to calculate d from Bayes' rule if we know the Gaussian mixture distribution
parameters (i.e., the mean and variance of each Gaussian, and the mixing parameter
between them).
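To make that last point concrete, the small sketch below (not from the paper) computes the threshold d for the special case of two equal-variance Gaussians, where the boundary has a closed form; the function name and the toy parameter values are assumptions of this illustration.

import math

def bayes_threshold(mu1, mu2, sigma, prior1, prior2):
    # Bayes-optimal boundary d between two classes whose instances follow
    # equal-variance Gaussians N(mu1, sigma^2) and N(mu2, sigma^2):
    # solves prior1 * N(d; mu1, sigma) = prior2 * N(d; mu2, sigma) for d.
    return (mu1 + mu2) / 2.0 + sigma ** 2 * math.log(prior1 / prior2) / (mu2 - mu1)

# With equal priors, the boundary falls midway between the two means.
print(bayes_threshold(mu1=-1.0, mu2=1.0, sigma=1.0, prior1=0.5, prior2=0.5))  # 0.0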
The Probabilistic Framework
This section presents a probabilistic framework for characterizing the nature of
documents and classifiers. The framework defines a probabilistic generative model
for the data, and embodies two assumptions about the generative process: (1) the
data are produced by a mixture model, and (2) there is a one-to-one correspondence
between mixture components and classes. The naive Bayes text classifier we will
discuss later falls into this framework, as does the example in Section 2.
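Written out in standard mixture-model notation (the symbols here are this restatement's, not necessarily the paper's), assumptions (1) and (2) say that a document d_i is generated with likelihood

P(d_i \mid \theta) = \sum_{j} P(c_j \mid \theta) \, P(d_i \mid c_j; \theta),

where c_j denotes the j-th class (and, by the one-to-one correspondence, the j-th mixture component) and \theta collects the model parameters.

The short sketch below (not the paper's code, and a simplification of its method) shows the EM idea from the title applied to this framework with a naive Bayes classifier: train on the labeled documents, then alternate between probabilistically labeling the unlabeled pool (E-step) and refitting on everything (M-step). The toy documents, the labels, and the use of scikit-learn are assumptions of this illustration.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: a few labeled documents plus a pool of unlabeled ones.
labeled_docs = ["cheap pills buy now", "meeting agenda attached"]
labels = np.array([1, 0])  # 1 = spam, 0 = ham (illustrative only)
unlabeled_docs = ["buy cheap meds", "agenda for the next meeting", "pills on sale"]

vec = CountVectorizer()
X_lab = vec.fit_transform(labeled_docs).toarray()
X_unl = vec.transform(unlabeled_docs).toarray()

# Initialize the classifier from the labeled documents alone.
clf = MultinomialNB()
clf.fit(X_lab, labels)

n_unl = len(unlabeled_docs)
for _ in range(10):
    # E-step: posterior class probabilities for the unlabeled pool.
    post = clf.predict_proba(X_unl)
    # M-step: refit on labeled docs (weight 1) plus each unlabeled doc
    # counted once per class, weighted by its posterior probability.
    X_all = np.vstack([X_lab, X_unl, X_unl])
    y_all = np.concatenate([labels, np.zeros(n_unl, int), np.ones(n_unl, int)])
    w_all = np.concatenate([np.ones(len(labels)), post[:, 0], post[:, 1]])
    clf.fit(X_all, y_all, sample_weight=w_all)

print(clf.predict(vec.transform(["cheap pills on sale"])))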