José María Gómez Hidalgo
Departamento de Inteligencia Artificial
Universidad Europea de Madrid, 28670 - Villaviciosa de Odón, Madrid,
Spain
jmgomez@dinar.esi.uem.es
http://www.esi.uem.es/~jmgomez/
Automatic Text Categorization (ATC), the automatic assignment of text documents to predefined classes, is a language engineering task relevant to many applications, including automatic content and knowledge management in corporations and on the Internet, and information access and filtering. With the first work dating back to the 1960s and research intensifying over the last decade, there is now a solid ATC model based on Information Retrieval and Machine Learning techniques.
Today's learning-based ATC systems reach nearly human effectiveness in thematic classification, i.e. applications in which categories are defined in terms of theme or topic (e.g. economics, arts). However, there are a number of applications in which this model is less successful, mainly because classification should be based not only on the semantics of a set of selected words but also on other, stylistic properties of the text. These applications include genre detection, authorship identification, pornographic Web content detection and spam e-mail filtering. There are also a number of approaches that aim to increase ATC effectiveness through better modelling of text semantics, including the use of deeper text processing techniques (e.g. using phrases or concepts instead of terms for representing/indexing text documents, or applying Information Extraction techniques to identify better representation concepts).
This tutorial describes a number of approaches to text representation for ATC, focusing on stylistic and deeper semantic modelling of text. The approaches covered include stylistic text modelling for genre and author identification, heuristics for spam classification, and the use of syntactic phrases, concepts and text patterns for text indexing in thematic classification and pornographic Web content detection, among others. The techniques are presented after a description of the general, state-of-the-art model for ATC.
See http://www.esi.uem.es/~jmgomez/tutorials/eacl03/index.html