Corpora and representations

P. Paroubek, C. Grouin, M. Asadullah, A. Fraisse, M. Delaborde, A. Braffort, M. Filhol, T. Hamon, A. Max, V. Moriceau, A. Névéol, X. Tannier, A. Vilnat, P. Zweigenbaum

The theme Corpora and Representations studies the linguistic events that occur in the graphic or signed representation systems used by humans to communicate. Our approach is corpus-based. We study documents from various origins: books, newspapers, speech transcriptions, technical reports, scientific articles, web pages, blogs, microblogs, sign language videos, etc. These documents are collected according to a specific working hypothesis, with the aim of producing models usable for language processing. Defining the necessary and sufficient representation required to perform a particular Natural Language Processing task, such as part-of-speech tagging, parsing, named entity recognition, or opinion mining, is a fundamental step in designing language processing functions. Building annotated corpora, i.e., sets of documents enriched with linguistic representations, gives us the means to develop, train, test, and compare our algorithms and to organize evaluation campaigns. These campaigns have proved to be key events for identifying and supporting emerging research trends, both nationally and internationally.

The ILES group has long-standing experience in producing annotated corpora and organizing evaluation campaigns based on them. This expertise has enabled us to collaborate on research projects with an ever-growing number of partners from both academia (CHIST-ERA uComp project, yearly DEFT series of text mining challenges) and industry (Big Data REQUEST, SONAR, Systematic PROJESTIMATE). For information specific to sign language representations, please refer to the theme Sign Language.