Text Classification from Labeled and Unlabeled Documents using EM

Abstract.

This paper shows that the accuracy of learned text classifiers can be improved by
augmenting a small number of labeled training documents with a large pool of unlabeled documents.
This is important because in many text classification problems obtaining training labels
is expensive, while large quantities of unlabeled documents are readily available.

Introduction

Consider the problem of automatically classifying text documents. This problem
is of great practical importance given the massive volume of online text available
through the World Wide Web, Internet news feeds, electronic mail, corporate
databases, medical patient records and digital libraries. Existing statistical text
learning algorithms can be trained to approximately classify documents, given a
sufficient set of labeled training examples. These text classification algorithms have
been used to automatically catalog news articles (Lewis & Gale, 1994; Joachims,
1998) and web pages (Craven, DiPasquo, Freitag, McCallum, Mitchell, Nigam, &
Slattery, 1998; Shavlik & Eliassi-Rad, 1998), automatically learn the reading interests
of users (Pazzani, Muramatsu, & Billsus, 1996; Lang, 1995), and automatically sort electronic mail (Lewis & Knowles, 1997; Sahami, Dumais, Heckerman, &
Horvitz, 1998).

Argument for the Value of Unlabeled Data

How are unlabeled data useful when learning classification? Unlabeled data alone
are generally insufficient to yield better-than-random classification because there is
no information about the class label (Castelli & Cover, 1995). However, unlabeled
data do contain information about the joint distribution over features other than
the class label. Because of this they can sometimes be used, together with a sample
of labeled data, to significantly increase classification accuracy in certain problem
settings.
To see this, consider a simple classification problem, one in which instances are
generated using a Gaussian mixture model. Here, data are generated according to
two Gaussian distributions, one per class, whose parameters are unknown. Figure 1
illustrates the Bayes-optimal decision boundary (x > d), which classifies instances
into the two classes shown by the shaded and unshaded areas. Note that it is
possible to calculate d from Bayes rule if we know the Gaussian mixture distribution
parameters (i.e., the mean and variance of each Gaussian, and the mixing parameter
between them).
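
As a minimal sketch of that calculation, the snippet below computes d in closed form under one simplifying assumption that the text does not make: both Gaussians share the same standard deviation, so there is a single boundary. The function names, parameter names, and example values are hypothetical and chosen only for illustration.

```python
import math


def bayes_boundary(mu0, mu1, sigma, pi0):
    """Return d where the two class posteriors are equal.

    Assumes (for illustration only) two 1-D Gaussians with means mu0 and mu1,
    a shared standard deviation sigma, and mixing parameter pi0 = P(class 0).
    """
    if not 0.0 < pi0 < 1.0:
        raise ValueError("pi0 must lie strictly between 0 and 1")
    if mu0 == mu1:
        raise ValueError("the means must differ for a boundary to exist")
    pi1 = 1.0 - pi0
    # Setting pi0 * N(x; mu0, sigma) = pi1 * N(x; mu1, sigma) and solving for x:
    return (mu0 + mu1) / 2.0 + sigma ** 2 * math.log(pi0 / pi1) / (mu1 - mu0)


def posterior_class1(x, mu0, mu1, sigma, pi0):
    """P(class 1 | x) by Bayes' rule; the shared normalizing constant cancels."""
    p0 = pi0 * math.exp(-((x - mu0) ** 2) / (2.0 * sigma ** 2))
    p1 = (1.0 - pi0) * math.exp(-((x - mu1) ** 2) / (2.0 * sigma ** 2))
    return p1 / (p0 + p1)


# Hypothetical parameters: class 0 is three times as likely as class 1.
d = bayes_boundary(mu0=0.0, mu1=2.0, sigma=1.0, pi0=0.75)
print(d, posterior_class1(d, 0.0, 2.0, 1.0, 0.75))
```

With equal mixing weights the boundary falls at the midpoint of the two means; a larger prior for one class pushes d toward the other class's mean. If the variances differ, the equal-posterior condition becomes a quadratic in x and may have two solutions, which this sketch does not handle.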

The Probabilistic Framework

This section presents a probabilistic framework for characterizing the nature of
documents and classifiers. The framework defines a probabilistic generative model
for the data, and embodies two assumptions about the generative process: (1) the
data are produced by a mixture model, and (2) there is a one-to-one correspondence
between mixture components and classes. The naive Bayes text classifier we will
discuss later falls into this framework, as does the example in Section 2.
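
The two assumptions can be illustrated with a small sampling sketch. It is not the paper's implementation: the per-class word distributions, the class names, and the function `sample_document` are hypothetical, and drawing words independently is an added assumption in the spirit of naive Bayes. The point is only that a class is picked from the mixture weights first, and the document is then drawn from the single component tied to that class.

```python
import random


def sample_document(class_priors, word_probs, length, rng):
    """Draw one (label, words) pair from a toy mixture model.

    class_priors: dict mapping class name -> mixture weight.
    word_probs:   dict mapping class name -> {word: probability}; exactly one
                  component per class (the one-to-one correspondence).
    """
    classes = list(class_priors)
    # Step 1: choose a mixture component, which is also the class label.
    label = rng.choices(classes, weights=[class_priors[c] for c in classes])[0]
    # Step 2: generate the document from that component alone.
    vocab = list(word_probs[label])
    weights = [word_probs[label][w] for w in vocab]
    words = rng.choices(vocab, weights=weights, k=length)
    return label, words


# Hypothetical two-class example.
priors = {"sports": 0.6, "politics": 0.4}
components = {
    "sports": {"goal": 0.5, "team": 0.4, "vote": 0.1},
    "politics": {"goal": 0.1, "team": 0.2, "vote": 0.7},
}
rng = random.Random(0)
print(sample_document(priors, components, length=5, rng=rng))
```

An unlabeled document corresponds to keeping `words` and discarding `label`; the words still carry information about which component produced them.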