Multilingualism and Paraphrasing

T. Lavergne, T. Hamon, A.-L. Ligozat, A. Max, A. Névéol, P. Zweigenbaum with the participation of C. Grouin, G. Illouz, B. Marie, V. Moriceau, P. Paroubek, X. Tannier, A. Vilnat

By working on language productions with similar meanings but different forms, this research theme provides handles on semantics, the core of human language. At the same time, cross-language portability is a recurring issue in system development. This theme cuts across each of the other three research themes of the ILES group and connects with the Machine Translation activity of the TLP group.

Work in this research theme is mainly conducted in the following areas:

  • Production of multilingual corpora: Corpora are essential for training and assessing NLP systems, and collecting and annotating corpora in a multilingual setting is an ongoing activity in the ILES group. For instance, annotation has been an important activity in the European uComp project, where sentiment and opinion information is associated with tweets in French and German.
  • Transfer of information across languages: Identifying and extracting interlingual semantic representations, similar to abstract ontological representations of domain concepts, is an essential step towards knowledge representation. Transfer of information, in particular using lightly supervised approaches, is the focus of the ILES-led Digicosme research group Multilingual Semantic Representations, which involves other researchers from LIMSI/TLP, CEA/LIST, INRA/MIG, LRI/LaHDAK, LTCI/DBWeb, and E3S.
  • Adapting existing NLP systems to new languages: Once an NLP system has been developed for one language, it is useful to consider approaches for adapting it to new languages. For instance, work has been conducted on tuning the HeidelTime temporal tagger to the biomedical domain in both English and French and to the general domain in French.
  • Acquisition of monolingual units: Various natural language expressions may hold similar or related meanings in context, a phenomenon at the heart of natural language semantics that represents a major difficulty for NLP applications. The ILES group has been developing techniques for the acquisition and use of related monolingual units and studying criteria that can motivate text rewriting. Joint projects with the universities of Strasbourg, Marseilles, and Leuven have produced results in the domain of text simplification, including lexical and syntactic simplification as well as assessment of text readability, while another collaboration with Lille University has focused on the acquisition of simplified variants of medical terms and the automatic determination of the complexity of medical words.
  • Acquisition of bilingual units: NLP systems need access to resources describing equivalences across languages (bilingual pairs of words, terms, segments, rules, etc.) in order to provide access to information available in foreign languages and to automatically transform a text into an equivalent text in another language. For instance, work has been conducted on the construction of specialized bilingual lexicons that make use of large-scale background knowledge. Research on Statistical Machine Translation is also conducted in ILES in collaboration with the related activity in the TLP group, most notably on the development of a fully discriminative translation framework.


