TLP - Automatic translation and machine learning

Research activities in this theme are focused on developing, testing, and adapting proven and sound statistical Machine Learning algorithms to the peculiarities of Natural Language and Speech Processing data and problems. The main testbed is a final application, Machine Translation, which implies many intermediate tasks (part-of-speech (POS) tagging, chunking, named-entity recognition (NER), etc.) that can also be approached with ML tools. Besides their intrinsic complexity, these problems involve dealing with (i) very large and (ii) heterogeneous datasets, containing both (iii) annotated and non-annotated data; further, linguistic data is often (iv) structured and can be described by (v) myriads of linguistic features, involving (vi) complex statistical dependencies. These are the six main scientific challenges that are being addressed. However, contrary to many teams working in this lively domain, improving the current state of the art in Machine Translation, as measured in international evaluation campaigns, is also a major objective; hence the need to develop and maintain our own Machine Translation software.

Statistical Machine Translation (SMT)

Statistical Machine Translation systems rely on the statistical analysis of large bilingual corpora to train stochastic models describing the mapping between a source language (SL) and a target language (TL). In their simplest form, these models express probabilistic associations between source and target strings of words, as initially formulated in the famous IBM models in the early nineties. More recently, these models have been extended to more complex representations (e.g. chunks, trees, or dependency structures) and to probabilistic mappings between these representations. Translation models are typically trained using parallel corpora containing examples of source texts aligned with their translation(s), where the alignment is defined at a sub-sentential level.
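As an illustration of the word-level probabilistic associations mentioned above, the following is a minimal sketch of EM training in the style of IBM Model 1. It is a textbook simplification, not LIMSI's software: the function name `train_ibm1`, the toy data format, and the omission of the NULL source word are all assumptions made for this example.

```python
from collections import defaultdict

def train_ibm1(bitext, iterations=10):
    """Estimate lexical probabilities t(target_word | source_word) with EM.

    bitext: list of (source_words, target_words) sentence pairs.
    Returns a dict mapping (target_word, source_word) -> probability.
    Simplification (assumption): no NULL source word is modelled.
    """
    # Uniform start over co-occurring pairs; the E-step normalises per
    # target word, so the actual constant does not matter.
    t = {(tw, sw): 1.0 for src, tgt in bitext for tw in tgt for sw in src}
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (target, source) links
        total = defaultdict(float)   # expected count of links per source word
        for src, tgt in bitext:
            for tw in tgt:
                # E-step: distribute one unit of mass for tw over source words.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalise expected counts per source word.
        t = {(tw, sw): c / total[sw] for (tw, sw), c in count.items()}
    return t
```

On a toy corpus such as `[(["la", "maison"], ["the", "house"]), (["la", "fleur"], ["the", "flower"])]`, the estimate for `("the", "la")` grows at the expense of `("house", "la")`, because "the" co-occurs with "la" in both sentence pairs.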

In this context, LIMSI is developing its research activities in several directions, ranging from the design of word and phrase alignment models to the conception of novel translation or language models, and from the exploration of new training or tuning methodologies to the development of new decoding strategies. All these innovations need to be properly evaluated, and significant efforts are devoted to the vexing issue of measuring the quality of MT outputs (Marie & Apidianaki, 2015). These research activities have been published in a number of international conferences and journals. Finally, LIMSI is involved in a number of national and international projects.

Regarding alignment models, the most recent work deals with the design and training of discriminative alignment techniques (Allauzen & Wisniewski, 2009; Tomeh et al, 2010, 2011a, 2011b, 2012) in order to improve both word alignment and phrase extraction. Lardilleux et al (2011a, 2011b, 2012, 2013) explore alternative alignment techniques, based on statistical association measures between phrases (see our implementation of anymalign).

LIMSI's decoder, N-code, belongs to the class of n-gram based systems. In this approach, translation is defined as a two-step process, in which an input SL sentence is first non-deterministically reordered, yielding a large word lattice of possible reorderings. This lattice is then translated monotonically using a bilingual n-gram model; as in the more standard approach, hypotheses are scored using several probabilistic models, the weights of which are discriminatively optimized with minimum error rate training. Recent evolutions of this approach are described in (Crego & Yvon, 2009, 2010a, 2010b). This system is now released as open source software (Crego & Yvon, 2011); an online demo is also available. As an alternative training strategy, a CRF-based translation model (Lavergne et al, 2011) has recently been proposed, which builds on our in-house CRF toolkit (Lavergne et al, 2010). We are also exploring more agile and adaptive approaches to training for MT in which the model parameters are computed on the fly (Li et al, 2012).
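The scoring step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not N-code's implementation or API: a hypothesis is represented as a sequence of bilingual tuples (source phrase, target phrase), the bilingual n-gram model is reduced to a bigram table, and the names `tuple_ngram_logprob`, `loglinear_score`, `best_hypothesis`, and the feature names `"tuple_lm"` and `"length"` are hypothetical.

```python
import math

def tuple_ngram_logprob(tuples, bigram_probs, floor=1e-6):
    """Log-probability of a sequence of bilingual tuples under a bigram model.

    bigram_probs maps (previous_tuple, tuple) -> probability; unseen
    bigrams receive a small floor probability (an assumption of this sketch).
    """
    logp, prev = 0.0, "<s>"
    for unit in tuples:
        logp += math.log(bigram_probs.get((prev, unit), floor))
        prev = unit
    return logp

def loglinear_score(features, weights):
    """Weighted sum of feature values, as in a log-linear model."""
    return sum(weights[name] * value for name, value in features.items())

def best_hypothesis(hypotheses, bigram_probs, weights):
    """Pick the tuple sequence with the highest log-linear score.

    Each hypothesis stands for one path through the reordering lattice.
    """
    def score(tuples):
        features = {
            "tuple_lm": tuple_ngram_logprob(tuples, bigram_probs),
            "length": float(len(tuples)),
        }
        return loglinear_score(features, weights)
    return max(hypotheses, key=score)
```

In a real system the weights would be tuned on held-out data (for instance with minimum error rate training) rather than set by hand, and the search would run over the lattice instead of an explicit list of paths.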

LIMSI's activities are not restricted to these core modules, and many other aspects of SMT are also investigated, such as "tuning" (Sokolov & Yvon, 2011), multi-source machine translation (Crego et al, 2010a, 2010b), diagnostic evaluation of MT, notably via the computation of oracle scores (Max et al, 2010; Wisniewski et al, 2010, 2013; Sokolov et al, 2012), confidence estimation (Zhang et al, 2012), word sense disambiguation for MT (Sokolov et al, 2012; Apidianaki et al, 2012; Apidianaki & Gong, 2015), extraction of parallel sentences from comparable corpora (Braham-Ghabiche et al, 2011), sentence alignment (Yu et al, 2012a, 2012b), etc.

LIMSI's MT systems have taken part in several international MT evaluation campaigns. This includes a yearly participation in the WMT evaluation series (2006-2015), where LIMSI has consistently been ranked amongst the top systems, especially when translating from and into French. We have also taken part in the 2009 NIST MT evaluation for the Arabic-English task, as well as in the 2010, 2011 and 2014 IWSLT evaluations for translation of speech.

Machine Learning

LIMSI's activities in the area of Machine Learning bridge a gap between Machine Translation and Machine Learning. On the one hand, MT is a difficult application which provides us with a realistic testbed for many ML innovations. On the other hand, the development of efficient, large-scale MT systems poses problems whose solutions can also be used in other contexts or give rise to generic solutions.

A major achievement is the development of Wapiti, an open source package for linear-chain Conditional Random Fields (CRFs) tailored for very large scale tasks (Lavergne et al, 2010). Owing to a very careful implementation of the core routines (gradient computation and optimization procedure) and to the selection of very sparse models through l1 regularization, allied with a very expressive language for representing feature patterns, this package is able to handle very large feature sets (up to billions of features), very large label sets (up to hundreds of labels), and very large datasets (up to millions of instances). This software has achieved state-of-the-art performance for many NLP tasks (grapheme-to-phoneme conversion, POS tagging, Named Entity Recognition, etc.) in a variety of languages.
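To make the linear-chain structure concrete, here is a minimal sketch of Viterbi decoding, the standard way to find the best label sequence in such a model. It is a generic textbook version, not Wapiti's code: the function name `viterbi` and the dense score matrices are assumptions chosen for readability, whereas a large-scale toolkit would rely on sparse feature representations.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence as a list of label indices.

    emissions: list over positions; emissions[t][k] is the score of label k
        at position t.
    transitions: transitions[i][j] is the score of label i followed by label j.
    Scores are additive (log-domain), as in a linear-chain CRF.
    """
    n_labels = len(transitions)
    scores = list(emissions[0])      # best score of a path ending in each label
    backptrs = []                    # best previous label, per position and label
    for emit in emissions[1:]:
        new_scores, ptrs = [], []
        for j in range(n_labels):
            best_i = max(range(n_labels),
                         key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
            ptrs.append(best_i)
        scores = new_scores
        backptrs.append(ptrs)
    # Follow back-pointers from the best final label.
    path = [max(range(n_labels), key=lambda j: scores[j])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path
```

The cost is quadratic in the number of labels for each position, which is one reason why label sets of several hundred labels are already demanding for this family of models.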

Another recent achievement is the development of original architectures for training and using very large neural networks having millions of neurons on their output layer. These architectures are especially useful in the context of Neural Network Language Models (NNLMs), a theme to which LIMSI has been contributing since (Gauvain & Schwenk, 2002) and which has provided us with consistent performance improvements in many tasks. The recent work of (Le et al, 2010, 2011, 2013) has led to the development of the first NNLMs capable of predicting very large output vocabularies and of taking advantage of large contexts (up to 10-grams). These models have been successfully used to rescore n-best lists for speech recognition and for machine translation. This work has been generalized to Neural Network Translation Models (Lavergne et al, 2011; Le et al, 2012; Do et al, 2014, 2015), which are even more demanding in terms of their output vocabulary, since they manipulate bilingual segments.
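The n-best rescoring use mentioned above can be sketched in a few lines. This is an illustrative outline only: `rescore_nbest`, the `lm_logprob` callable standing in for a neural language model, and the single interpolation weight `lm_weight` are assumptions of this example, not the interface of any LIMSI tool.

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=0.5):
    """Re-rank an n-best list with an additional language model score.

    nbest: list of (hypothesis_text, baseline_score) pairs, where the
        baseline score comes from the first-pass system.
    lm_logprob: callable mapping a hypothesis string to a log-probability
        (a stand-in for a neural network language model).
    Returns (hypothesis_text, combined_score) pairs, best first.
    """
    rescored = [(hyp, base + lm_weight * lm_logprob(hyp)) for hyp, base in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

Rescoring keeps the expensive model out of the first-pass search: it is evaluated only on the few hundred or thousand hypotheses that the baseline system has already produced.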

Some current and past projects:

Samar demo

