Research on machine translation is primarily oriented towards improving existing statistical machine translation (SMT) systems, or more generally data-driven machine translation engines. In a nutshell, SMT systems rely on the statistical analysis of large bilingual corpora to train stochastic models of the mapping between a source and a target language. In their simplest form, these models correspond to probabilistic rational relations between source and target strings of words, as initially formulated in the famous IBM models in the early nineties. More recently, these models have been extended to capture more complex representations (e.g. chunks, trees, or dependency structures) and the possible probabilistic relationships between these representations. Such models are typically trained from parallel corpora, i.e. from examples of source texts aligned with their translation(s), where the alignment is typically defined at the subsentential level.
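
As a rough illustration of the simplest word-based models mentioned above, the following sketch estimates lexical translation probabilities with a few iterations of EM, in the spirit of IBM Model 1. It is a toy version written under several assumptions: the function name `train_ibm1`, the two-sentence corpus, and the omission of the NULL source word are all choices made for this example, and it is not taken from any LIMSI system.

```python
from collections import defaultdict

# Toy parallel corpus (an assumption for this example): (source words, target words).
TOY_CORPUS = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
]

def train_ibm1(corpus, iterations=10):
    """Estimate t(target_word | source_word) by EM, IBM Model 1 style.

    Simplification: no NULL source word is added to the source sentences.
    """
    source_vocab = {sw for src, _ in corpus for sw in src}
    target_vocab = {tw for _, tgt in corpus for tw in tgt}
    # Uniform initialisation over the target vocabulary.
    t = {(tw, sw): 1.0 / len(target_vocab)
         for sw in source_vocab for tw in target_vocab}
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (target, source) links
        total = defaultdict(float)   # expected count of links per source word
        # E-step: distribute each target word over the source words of its sentence.
        for src, tgt in corpus:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalise the expected counts per source word.
        t = {(tw, sw): count[(tw, sw)] / total[sw] for (tw, sw) in t}
    return t

if __name__ == "__main__":
    table = train_ibm1(TOY_CORPUS, iterations=20)
    for (tw, sw), p in sorted(table.items()):
        print(f"t({tw} | {sw}) = {p:.3f}")
```

On this toy corpus the co-occurrence of "la" with "the" in both sentence pairs pulls probability mass towards that pair, which in turn frees "maison" and "fleur" to explain "house" and "flower".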

In this context, LIMSI is developing its research activities in several directions, from the design of word and phrase alignment models to the conception of novel translation or language models, and from the exploration of new training or tuning methodologies to the development of new decoding strategies. All these innovations need to be evaluated and diagnosed, and we also devote a significant fraction of our efforts to the vexing issue of quality measurement for MT outputs. The results of these activities have been published in a number of international conferences and journals (see the Publications section). Finally, we are involved in a number of national and international projects (see the Projects section below).

Regarding alignment models, most of our recent work deals with the design and training of discriminative alignment techniques (Tomeh et al., 2011a, 2011b, 2010b; Allauzen & Wisniewski, 2009), which can be used to compute word alignments, to symmetrize existing word alignments, or to refine the extraction process. Recent work (Lardilleux et al., 2011; 2012; 2013) explores alternative alignment techniques based on phrase association measures; the goal is to explore flexible, on-demand alignment strategies.

Our main decoder, N-code, belongs to the class of n-gram based systems. In a nutshell, these systems define translation as a two-step process: an input source sentence is first reordered non-deterministically, yielding an input word lattice containing several possible reorderings. This lattice is then translated monotonically using a bilingual n-gram model; as in the more standard approach, hypotheses are scored using a battery of probabilistic models, whose weights are tuned with minimum error rate training. Recent evolutions of this approach are described in (Crego & Yvon 2009, 2010a, 2010b). This system is now released as open source software (see the Ncode web pages and Crego et al. 2012); an online demo is also available. As an alternative training strategy, we have recently proposed a CRF-based translation model (Lavergne et al. 2011; 2013).
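
To make the scoring step more concrete, here is a minimal sketch of how competing hypotheses can be ranked by a weighted combination of model scores, the usual log-linear formulation in SMT. It does not reproduce N-code: the feature names (`bilingual_ngram`, `target_lm`, `word_penalty`), the weights, and the candidate translations are all invented for this example, and in a real system the weights would come from a tuning procedure rather than being set by hand.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: score(h) = sum_k weights[k] * features[k]."""
    return sum(weights[name] * value for name, value in features.items())

def best_hypothesis(hypotheses, weights):
    """Return the hypothesis with the highest log-linear score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))

# Hypothetical weights; assumed to have been tuned on a development set.
WEIGHTS = {"bilingual_ngram": 1.0, "target_lm": 0.5, "word_penalty": -0.1}

# Hypothetical candidates with log-probabilities from each model and a length feature.
HYPOTHESES = [
    {"text": "the house is small",
     "features": {"bilingual_ngram": -4.0, "target_lm": -6.0, "word_penalty": 4}},
    {"text": "the home is little",
     "features": {"bilingual_ngram": -5.0, "target_lm": -9.0, "word_penalty": 4}},
]

if __name__ == "__main__":
    for hyp in HYPOTHESES:
        print(hyp["text"], loglinear_score(hyp["features"], WEIGHTS))
    print("best:", best_hypothesis(HYPOTHESES, WEIGHTS)["text"])
```

Changing the weights changes the ranking, which is why the tuning step matters: it searches for the weight vector under which the top-ranked hypotheses score best on a held-out set according to an automatic metric.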

Our activities are not restricted to these core modules of SMT systems, and we are investigating many other aspects of SMT, such as tuning (Sokolov & Yvon, 2011; Wisniewski & Yvon, 2013), multi-source machine translation (Crego et al., 2010a, 2010b), evaluation of MT (Max et al., 2010; Wisniewski et al., 2010), confidence estimation for MT (Wisniewski et al., 2012, 2013, 2014), word sense disambiguation (WSD) in SMT (Apidianaki et al., 2011), extraction of parallel sentences from comparable corpora (Gahbiche-Braham et al., 2011), etc.

Finally, our activities in SMT are closely related to the work carried out on language modeling, a theme to which LIMSI has been contributing for many years. A major recent contribution is the work on neural network language models, initiated in (Gauvain & Schwenk, 2002), and recently revisited in (Le et al., 2010, 2011, 2012).

Our research activities are conducted in close relationship with several academic and industrial partners in the context of several national and international projects. A partial list of these projects is given below.

LIMSI's systems have taken part in several international MT evaluation campaigns. This includes a yearly participation in the WMT evaluation series (2006-2014), where LIMSI has consistently been among the top-ranking systems, especially for translation into French. We have also taken part in the 2009 NIST MT evaluation for the Arabic-English task, as well as in the IWSLT evaluations in 2010 and 2011.

LIMSI has recently been actively involved in the organization of various scientific events: EAMT 2010 in St Raphaël and IWSLT 2010 in Paris, as well as the Tralogy series. A. Allauzen has launched the series of ACL workshops on learning representations (2013 in Sofia, 2014 in Gothenburg). F. Yvon is again chairing the scientific committee of IWSLT 2014 in Lake Tahoe.

The LIMSI system performed best in the SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking for English.

Permanent staff

Temporary staff

Past Members

  • Linlin Li was Post-doctoral research associate (2014-2015)
  • Natalia Segal was Post-doctoral research associate (2014-2015), and is now with Systran, in downtown Paris
  • Souhir Gahbiche-Braham was ATER at Paris-Sud University
  • Marco Dinarelli was Post-doctoral research associate (2011-2013), now CNRS researcher at LATTICE/Paris
  • Anil Kumar Singh was Post-doctoral research associate (2012-2013), working on confidence estimation
  • Le Hai Son did his Ph.D. at LIMSI (2009-2012), now researcher at the Vietnamese Academy of Science
  • Thomas Lavergne was Post-doctoral research associate (2009-2012), now assistant professor at Univ. Paris Sud
  • Artem Sokolov was Post-doctoral research associate (2010-2012), now research associate at Univ. of Heidelberg with S. Riezler
  • Nadi Tomeh did his Ph.D. at LIMSI (2008-2012), is now assistant professor at Univ. Paris Nord
  • Qian Yu was research associate, working on sentence alignments
  • Adrien Lardilleux did a post-doc with us, now working at Affinity-Engine
  • Josep Maria Crego did a post-doc at LIMSI, is now with Systran, in downtown Paris
  • Ilknur Durgar did a post-doc at LIMSI in 2010, and is now with Tübitak in Turkey
  • Alassane Seck was research associate, working on spell checking and normalization
  • Daniel Déchelotte did his Ph.D. at LIMSI (2005-2008), is now with Bing, near Paris
  • Holger Schwenk is now Full Professor at Univ. du Maine, Le Mans


Visitors and collaborators

  • June 20, 2014: Daniel Ortiz (Univ. Valencia), Online Learning Techniques for Machine Translation
  • March 12, 2014: Jan Niehues (Karlsruhe Institute of Technology), Adaptation in Machine Translation
  • January 29, 2014: Stefan Riezler (Univ. Heidelberg)
  • February 26, 2013: Sylvain Raybaud (LORIA), Confidence measures for machine translation: evaluation, post edition and application to speech translation
  • February 1, 2013: Pascale Fung (HKUST), Rare Word Translation Extraction from Aligned Comparable Documents
  • November 12, 2012: Anil Kumar Singh (LIMSI), Machine Translation as a Problem of Estimating Linguistic Similarity and the Specific Problem of Translating TAM Markers
  • July 4, 2012: Simon Lacoste-Julien (Inria, Winnow), Structured alignment methods in machine learning
  • June 19, 2012: Kashif Shah (LIUM, Le Mans), Domain adaptation in SMT
  • May 30, 2012: Hermann Ney (IMMI), Bayes Decision Rule and the Classification Error in Systems for HLTPR (Human Language Technology and Pattern Recognition): Results and Open Problems
  • March 3, 2012: Adrien Lardilleux (LIMSI), Amélioration de l'alignement sous-phrastique par échantillonnage
  • February 28, 2012: Charlotte Lecluze (GREYC), Alignement de documents multilingues sans présupposé de parallélisme
  • January 24, 2012: Marianna Apidianaki (LIMSI), Clustering : Sémantique pour la désambiguïsation lexicale interlingue et l'évaluation de la traduction automatique
  • December 11, 2011: Hugo Larochelle (University of Sherbrooke), Training Restricted Boltzmann Machines on Word Observations
  • July 7, 2011: Marco Turchi (JRC), Multi-linguality via Statistical Machine Translation: SMT activities carried out at the EC’s Joint Research Centre
  • June 9, 2011: Dekai Wu (HKUST), Inversion Transduction Grammars, Linear Transduction Grammars, and Linear Inversion Transduction Grammars for SMT
  • May 2, 2011: Nicola Cancedda (XRCE), Confidence-Weighted Learning of Factored Discriminative Language Models
  • December 14, 2010: Hermann Ney (RWTH), Revisiting the principles of the KN method for language modelling
  • November 23, 2010: Hai Son Le (LIMSI), Continuous space neural network language models
  • October 26, 2010: Nadi Tomeh (LIMSI), Word Alignment for Statistical Machine Translation
  • June 29, 2010: Adrien Lardilleux (GREYC), Contribution des basses fréquences à l'alignement sous-phrastique multilingue
  • March 2, 2010: Dimitra Vergyri (SRI), SRI's 2-way S2S Translation system: summary of the TRANSTAC project
  • December 18, 2009: Marine Carpuat (Columbia), Désambiguïsation lexicale pour une approche sémantique de la traduction automatique statistique
  • December 15, 2009: Jia Xu (RWTH), Sequence segmentation and alignment for statistical machine translation
  • November 3, 2009: Ilknur Durgar (LIMSI), A prototype English-Turkish statistical machine translation system
  • October 27, 2009: Vassilina Nikoulina (XRCE), Syntax-Augmented Phrase-Based Translation
  • April 29, 2009: Loïc Barrault (LIUM), Combinaison de systèmes (application à la reconnaissance automatique de la parole et à la traduction statistique)

Some current and past projects:

