12-04-2017, 09:52 AM
Text categorization (also called text classification) is the process of identifying the class to which a text document belongs. This article proposes a KNN algorithm over simple unweighted features for text categorization. We propose a feature selection method that uses feature interaction (based on word interdependencies) to find the features relevant to the learning task at hand. This significantly reduces the number of selected features from which to learn, making our KNN algorithm applicable in contexts where both the volume of documents and the size of the vocabulary are high, such as the World Wide Web. As we demonstrate, the proposed KNN algorithm therefore classifies text documents efficiently in that context, in terms of both predictive performance and interpretability. Its simplicity, with respect to implementation and fine-tuning, is its main asset for in-the-field applications.

Text categorization is the process of grouping text documents into one or more predefined categories based on their content. Several statistical classification and machine learning techniques have been applied to text categorization, including regression models, Bayesian classifiers, decision trees, nearest neighbor classifiers, neural networks, and support vector machines.
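To make the idea concrete, here is a minimal Python sketch of KNN over unweighted (binary word-presence) features restricted to a small selected vocabulary. It is only an illustration under stated assumptions, not the article's method: the function names (tokenize, select_features, knn_classify) and the toy documents are hypothetical, and the selector is a simple stand-in that ranks words by how differently they are distributed across classes. It does not implement the feature-interaction selection the article proposes.

```python
from collections import Counter

def tokenize(text):
    # Unweighted features: a word is either present in a document or not.
    return set(text.lower().split())

def select_features(docs, labels, top_n):
    # Stand-in selector (an assumption, NOT the feature-interaction method):
    # rank words by the spread of their per-class document frequency.
    classes = sorted(set(labels))
    class_sizes = Counter(labels)
    doc_freq = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        doc_freq[label].update(tokenize(doc))
    vocab = set().union(*(doc_freq[c].keys() for c in classes))

    def spread(word):
        rates = [doc_freq[c][word] / class_sizes[c] for c in classes]
        return max(rates) - min(rates)

    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(vocab, key=lambda w: (-spread(w), w))
    return set(ranked[:top_n])

def knn_classify(text, docs, labels, features, k=3):
    # Similarity = number of selected words shared with a training document.
    query = tokenize(text) & features
    scored = [(len(query & tokenize(doc)), label)
              for doc, label in zip(docs, labels)]
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep training order
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus, for illustration only.
train_docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the match ended after a penalty goal",
    "the new phone has a fast processor",
    "the laptop processor runs the software",
    "the software update fixed the phone",
]
train_labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

features = select_features(train_docs, train_labels, top_n=5)
print(sorted(features))
print(knn_classify("a late goal decided the match", train_docs, train_labels, features))
print(knn_classify("this phone needs a software update", train_docs, train_labels, features))
```

With only five selected words, each neighbor comparison touches a handful of features rather than the whole vocabulary, which is the efficiency argument made above. The selected words also make each decision easy to inspect.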