Stan Matwin
Text classification is a building block for many intelligent systems, in particular in Information Extraction and Text Mining. Classifier induction is often the technique of choice for building text classifiers. Past experience in text mining shows that, by and large, existing classifier induction systems differ little in performance. There is therefore interest in research on other aspects of text classification, in particular on the representation of the text given to the classifier, and on limiting the effort the user spends labeling examples. In this presentation, we will give an account of two experiments following this line of research. In the first experiment, we investigated the use of phrases, synonyms and hypernyms in representing the text. The results did not show a significant improvement in performance from any of the new representations. There was, however, a performance improvement when an ensemble of classifiers obtained from different representations was arranged into a majority-vote committee. In the second experiment, we used the co-training idea, in which two classifiers obtained from mutually redundant representations are used to train each other. In the email classification task, using SVM as the classifier induction system, we obtained a significant performance improvement over the use of a single classifier. We will conclude by outlining how we plan to use some of the above techniques in the Information Extraction system Caderige at Université Paris XI. Caderige works on abstracts of scientific papers on genomics (e.g. Medline) and extracts from them all information pertaining to interactions between specific proteins.
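To make the majority-vote committee from the first experiment concrete, here is a minimal sketch in Python. It is not the system used in the experiments: the three toy classifiers (`word_clf`, `phrase_clf`, `hypernym_clf`), the `HYPERNYMS` table and the labels are hypothetical stand-ins for classifiers induced from word, phrase and hypernym representations of the same text.

```python
from collections import Counter
from typing import Callable, Sequence

# A classifier maps a raw text to a label. Real committee members would be
# induced from labeled data; these hand-written rules are illustrative only.
Classifier = Callable[[str], str]

# Hypothetical hypernym table: maps a specific term to a more general one.
HYPERNYMS = {"kinase": "protein", "enzyme": "protein"}


def word_clf(text: str) -> str:
    """Toy classifier over a bag-of-words representation."""
    return "genomics" if "protein" in text.lower().split() else "other"


def phrase_clf(text: str) -> str:
    """Toy classifier over a phrase representation."""
    return "genomics" if "gene expression" in text.lower() else "other"


def hypernym_clf(text: str) -> str:
    """Toy classifier over a representation with words replaced by hypernyms."""
    tokens = [HYPERNYMS.get(tok, tok) for tok in text.lower().split()]
    return "genomics" if "protein" in tokens else "other"


def majority_vote(classifiers: Sequence[Classifier], text: str) -> str:
    """Return the label chosen by the largest number of committee members."""
    votes = Counter(clf(text) for clf in classifiers)
    # most_common(1) yields the top label; ties go to the label seen first.
    return votes.most_common(1)[0][0]


committee = [word_clf, phrase_clf, hypernym_clf]
label = majority_vote(committee, "The kinase regulates gene expression")
```

In this example the word-based classifier votes `other`, but the phrase-based and hypernym-based classifiers both vote `genomics`, so the committee outputs `genomics`. This is the intended effect of combining representations: one member's miss is outvoted by the others.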
Research interests: http://www.site.uottawa.ca/~stan/contents/research.html
Stan Matwin is a Professor of Information Technology and Engineering.
He is the former President of the Canadian Society for Computational
Studies of Intelligence.
His research interests are in Data and Text Mining and Knowledge-based Systems. He is Programme Chair of the 12th International Conference on Inductive Logic Programming in Sydney, Australia, in July 2002.
Contacts: Patrick Paroubek & William Turner. Last updated: 29 January 2002