07-06-2012, 01:44 PM
Fuzzy c-Means Clustering of Incomplete Data
FCM clustering of incomplete data.pdf (Size: 241.71 KB / Downloads: 2)
INTRODUCTION
WE ARE interested in clustering a set of objects
represented by a numerical object data set
into clusters, . The
numerical data describes the objects by specifying values for
particular features. For example, if the th component of each
datum corresponds to the feature of weight, then the th component
of , denoted by , gives the weight of object . In
some situations, feature vectors in can have missing components
[1], [2]. Any datum with some (but not all) missing feature
values is referred to as an incomplete datum. An example of an
incomplete datum is , where
and are missing. A data set with at least one incomplete
datum is referred to as an incomplete data set; otherwise, it is
called complete.
Previous Work
The problem of doing pattern recognition with incomplete
data is certainly not new. Theoretically-oriented approaches
based on probabilistic assumptions date back to at least 1962
[3]. An important empirically-oriented study was done in 1979
[1], and those results are summarized in [2]. It may be the
case that predicted values of the missing data are desired [4],
Partial Distance Strategy (PDS)
The top strategy recommended by Dixon [1] for the case that
is sufficiently large that the WDS cannot be justified consists
of calculating partial (squared Euclidean) distances using
all available (i.e., nonmissing) feature values, and then scaling
this quantity by the reciprocal of the proportion of components
used. We will call this the partial distance strategy (PDS) and
illustrate the PDS form of the calculation in (5) using the
example
NUMERICAL EXAMPLES
In this section we compare the WDS, PDS, OCS, and NPS
versions of FCM using artificially generated incomplete data
sets. The scheme for artificially generating an incomplete data
set is now described. First a complete data set is chosen.
Then is modified to obtain an incomplete data set by randomly
selecting a specified percentage of its components
to be designated as missing.
CONCLUSION
We considered four different approaches for doing FCM
clustering of incomplete data sets. The PDS FCM uses an
approach recommended by Dixon to alter the calculations
in a way that uses all available information. The OCS FCM
treats the missing data as variables that are to be optimized
in order to obtain the smallest possible value of the FCM
functional . The NPS FCM dynamically estimates the
missing data values based on component values of the nearest
current prototype. The convergence properties of PDSFCM
and OCSFCM exactly parallel those for standard FCM, and
they follow from existing alternating optimization [20], [23]
and Zangwill’s [22] general theory. In practice, all techniques
terminated in all cases, but theoretical convergence results
for NPSFCM are missing.