LIMSI-CNRS

CHM Seminar


Non-standard Text Representations and Co-training in Text Classification

Stan Matwin
University of Ottawa and Universite de Paris XI
CHM Seminar of 26 February

Abstract

Text classification is a building block for many intelligent systems, in particular in Information Extraction and Text Mining. Classifier induction is often the technique of choice for building text classifiers. Past experience in text mining shows that, by and large, existing classifier induction systems differ little in performance. There is therefore interest in research on other aspects of text classification, in particular the representation of the text given to the classifier and ways of limiting the user's effort in labeling examples. In this presentation, we will give an account of two experiments that follow this line of research. In the first experiment, we investigated the use of phrases, synonyms and hypernyms in representing the text. The results did not show a significant improvement in performance from any of the new representations. There was, however, a performance improvement when an ensemble of classifiers obtained from different representations was arranged into a majority-vote committee. In the second experiment, we used the co-training idea, in which two classifiers obtained from mutually redundant representations are used to train each other. In the email classification task, using SVM as the classifier induction system, we obtained a significant performance improvement with respect to the use of a single classifier. We will conclude by outlining how we plan to use some of the above techniques in the Information Extraction system Caderige at Universite Paris XI. Caderige works on abstracts of scientific papers on genomics (e.g. Medline) and extracts from them all information pertaining to interactions between specific proteins.
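
The majority-vote committee mentioned in the abstract can be illustrated with a minimal sketch. It is not the speaker's implementation: the three toy classifiers (`by_words`, `by_phrases`, `by_hypernyms`), the `HYPERNYMS` table and the spam/ham labels are hypothetical stand-ins for classifiers induced from different text representations.

```python
from collections import Counter

# Hypothetical hypernym table, standing in for a lexical resource.
HYPERNYMS = {"discount": "offer", "deal": "offer"}


def by_words(doc):
    """Toy classifier over a bag-of-words representation (assumption)."""
    return "spam" if "offer" in doc.lower().split() else "ham"


def by_phrases(doc):
    """Toy classifier over a phrase representation (assumption)."""
    return "spam" if "free offer" in doc.lower() else "ham"


def by_hypernyms(doc):
    """Toy classifier that first maps each token to its hypernym (assumption)."""
    tokens = [HYPERNYMS.get(t, t) for t in doc.lower().split()]
    return "spam" if "offer" in tokens else "ham"


def committee_predict(classifiers, doc):
    """Return the label chosen by most committee members.

    With an odd number of members and two labels there can be no tie.
    """
    votes = Counter(clf(doc) for clf in classifiers)
    return votes.most_common(1)[0][0]


COMMITTEE = [by_words, by_phrases, by_hypernyms]
```

For example, `committee_predict(COMMITTEE, "Great deal today")` collects one vote for "spam" (from the hypernym view) and two for "ham", so the committee answers "ham" even though one member disagrees.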


Research interests: http://www.site.uottawa.ca/~stan/contents/research.html

Stan Matwin is a Professor of Information Technology and Engineering,
Director of Graduate Studies in Computer Science, and
Director of the Graduate Certificate on Electronic Commerce at the University of Ottawa.

He is the former President of the Canadian Society for Computational Studies of Intelligence, and
former Head of IFIP WG 12.2 (Machine Learning).

His research interests are in Data and Text Mining and Knowledge-based Systems.
He has authored and co-authored some 100 research papers in refereed conferences and journals.
Currently on sabbatical, he is a Visiting Professor at the Laboratoire de recherche en informatique, Universite Paris XI and CNRS.

He is Programme Chair of the 12th International Conference on Inductive Logic Programming in Sydney, Australia, in July 2002.


Contacts: Patrick Paroubek & William Turner
Last updated: 29 January 2002