Research on machine translation is primarily oriented towards improving existing statistical machine
translation (SMT) systems, or more generally data-driven machine translation engines. In a
nutshell, SMT systems rely on the statistical analysis of large bilingual corpora to train
stochastic models of the mapping between a source and a target language. In their simplest form,
these models correspond to probabilistic rational relations between source and target strings of
words, as initially formulated in the famous IBM models in the early nineties. More recently, these models have been
extended to capture more complex representations (e.g. chunks, trees, or dependency structures) and the
possible probabilistic relationships between these representations. Such models are typically
trained from parallel corpora, i.e. from examples of source texts aligned with their
translation(s), where the alignment is typically defined at the subsentential level.
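To make the idea of training a stochastic word-level mapping from a parallel corpus concrete, here is a minimal sketch in the spirit of IBM Model 1, estimated with EM. The three-sentence corpus, the function name train_ibm1 and the omission of the NULL source word are simplifying assumptions made for illustration; this is not the code of any LIMSI system.

```python
from collections import defaultdict

def train_ibm1(bitext, iterations=10):
    """Estimate t(target_word | source_word) with EM (no NULL word, toy scale)."""
    src_vocab = {s for src, _ in bitext for s in src}
    # Uniform initialisation; any positive constant works because of the E-step normalisation.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (target, source) links
        total = defaultdict(float)   # expected count of links per source word
        for src, tgt in bitext:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm   # posterior that w is aligned to s
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalise the expected counts per source word.
        for (w, s), c in count.items():
            t[(w, s)] = c / total[s]
    return t

# Hypothetical toy corpus: (source sentence, target sentence) pairs.
bitext = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("maison bleue".split(), "blue house".split()),
]
t = train_ibm1(bitext)
for tgt_word in ("the", "house", "blue"):
    print(tgt_word, round(t[(tgt_word, "maison")], 3))
```

On this corpus, the probability mass for "maison" concentrates on "house" as the iterations proceed, which is the behaviour the lexical models described above rely on.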
In this context, LIMSI is developing its research activities in several
directions, from the design of word and phrase alignment models, to the conception of novel
translation or language models; from the exploration of new training or tuning methodologies to the development
of new decoding strategies. All these innovations need to be evaluated and diagnosed, and we also
devote a significant fraction of our efforts to addressing the vexing issue of quality measurement in
MT outputs. The results of these activities have been published in a number of international conferences and
journals (see the Publications section). Finally, we are involved in a
number of national and international projects (see the Project section below).
Regarding alignment models, most of our recent work deals with the design and training of
discriminative alignment techniques (Tomeh et al, 2011a, 2011b, 2010b; Allauzen & Wisniewski, 2009) to
be used either to actually compute word alignments, to symmetrize existing word alignments, or to
refine the extraction process. Recent work (Lardilleux et al, 2011) explores alternative
alignment techniques, based on a phrase association measure.
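As an illustration of what symmetrizing word alignments means, the sketch below combines two directional alignments using the standard intersection and union heuristics. These are simple baselines chosen for illustration, not the discriminative techniques cited above; the function name symmetrize and the example links are hypothetical.

```python
def symmetrize(src2tgt, tgt2src):
    """Combine two directional word alignments.

    src2tgt: iterable of (source_index, target_index) links.
    tgt2src: iterable of (target_index, source_index) links.
    Returns (intersection, union), both as sets of (source_index, target_index).
    """
    forward = set(src2tgt)
    backward = {(i, j) for (j, i) in tgt2src}  # flip to (source, target) order
    return forward & backward, forward | backward

# Hypothetical alignments for a three-word sentence pair.
forward_links = [(0, 0), (1, 1), (1, 2)]
backward_links = [(0, 0), (1, 1), (2, 2)]
high_precision, high_recall = symmetrize(forward_links, backward_links)
print(sorted(high_precision))
print(sorted(high_recall))
```

The intersection keeps only links on which both directions agree (favouring precision), while the union keeps every proposed link (favouring recall); a learned model can then decide which of the intermediate links to retain.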
Our main decoder, N-code, belongs to the class of n-gram based systems. In a nutshell, these systems
define translation as a two-step process, where an input source sentence is first reordered
non-deterministically, yielding an input word lattice containing several possible reorderings. This
lattice is then translated monotonically using a bilingual n-gram model; as in the more standard
approach, hypotheses are scored using a battery of probabilistic models, whose weights are tuned
with minimum error rate training. Recent evolutions of this approach are described in (Crego &
Yvon, 2009, 2010a, 2010b). This system is now released as open source software (see Ncode
web pages); an online demo is also available. As an
alternative training strategy, we have recently proposed a CRF-based translation model (Lavergne et al, 2011).
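The two-step process described above (reorder, then translate monotonically and score with weighted models) can be sketched on a toy example. Everything here is an assumption made for illustration: the explicit enumeration of permutations stands in for a word lattice, the translation table, bigram probabilities and feature weights are invented, and the weights are set by hand rather than tuned. It does not reflect the actual N-code implementation.

```python
import math
from itertools import permutations

def reorderings(words, max_jump=1):
    """Yield source reorderings in which no word moves more than max_jump positions."""
    for perm in permutations(range(len(words))):
        if all(abs(new - old) <= max_jump for new, old in enumerate(perm)):
            yield [words[i] for i in perm]

def tuple_lm_logprob(units, bigrams, floor=1e-4):
    """Log-probability of a sequence of (source, target) tuples under a bigram model."""
    logprob, prev = 0.0, "<s>"
    for unit in units:
        logprob += math.log(bigrams.get((prev, unit), floor))
        prev = unit
    return logprob

def decode(words, table, bigrams, weights):
    """Translate each reordering monotonically and keep the best weighted score."""
    best_score, best_target = float("-inf"), None
    for candidate in reorderings(words):
        units = [(w, table[w]) for w in candidate]
        features = {
            "tuple_lm": tuple_lm_logprob(units, bigrams),
            # Penalise each position whose word differs from the original order.
            "distortion": -sum(a != b for a, b in zip(candidate, words)),
        }
        score = sum(weights[name] * value for name, value in features.items())
        if score > best_score:
            best_score, best_target = score, [tgt for _, tgt in units]
    return best_target, best_score

# Hypothetical one-to-one translation table and bilingual bigram probabilities.
table = {"maison": "house", "bleue": "blue"}
bigrams = {
    ("<s>", ("bleue", "blue")): 0.5,
    (("bleue", "blue"), ("maison", "house")): 0.6,
    ("<s>", ("maison", "house")): 0.3,
}
weights = {"tuple_lm": 1.0, "distortion": 0.1}
best_target, best_score = decode("maison bleue".split(), table, bigrams, weights)
print(best_target, round(best_score, 3))
```

In this sketch the bilingual n-gram model is defined over (source, target) tuples, so the reordered source "bleue maison" is preferred because its tuple sequence is far more probable than the monotone one, despite the small distortion penalty.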
Our activities are not restricted to these core modules of SMT systems, and we are investigating
many other aspects of SMT systems, such as tuning (Sokolov & Yvon, 2011), multi-source
machine translation (Crego & al 2010a, 2010b), evaluation of MT (Max & al 2010, Wisniewski
& al, 2010), extraction of parallel sentences from comparable corpora (Gahbiche-Braham & al,
2011), etc.
Finally, our activities in SMT are closely related to the work carried out on language modeling, a
theme to which LIMSI has been contributing for many years. A major recent contribution is the work
on Neural Network Language models, initiated in (Gauvain & Schwenk, 2002), and recently
revisited in (Le & al, 2010, 2011, 2012).
Our research activities are conducted in close relationship with several academic and industrial
partners in the context of several national and international projects. A partial list of these
projects is given below.
LIMSI's systems have taken part in several international MT evaluation campaigns. This includes a
yearly participation in the WMT evaluation series (2006-2012), where LIMSI has consistently been
amongst the top-ranking systems, especially when translation into French is concerned.
We have also taken part in the 2009 NIST MT evaluation for the Arabic-English task, as well as
the IWSLT evaluations in 2010 and 2011.
LIMSI has recently been actively involved in the organization of various scientific events:
EAMT 2010 in St Raphaël and IWSLT 2010 in Paris, as well
as the Tralogy series.
Alexandre Allauzen, François Yvon. Textual Information Access. In Statistical Methods for Machine Translation, Eric Gaussier, François Yvon (eds.), Chap. 7, pp. 223-304, ISTE/Wiley, Paris, 2012.
International Conferences
Wang Ling, Nadi Tomeh, Guang Xiang, Alan Black, Isabel Trancoso. Improving Relative-Entropy Pruning using Statistical Significance. Proceedings of the 24th International Conference on Computational Linguistics (COLING-2012), 8-15 December, Mumbai, (2012)
Marianna Apidianaki. Measuring the adequacy of cross-lingual paraphrases in a Machine Translation setting. Proceedings of the 24th International Conference on Computational Linguistics (COLING-2012), 8-15 December, Mumbai, India, pp. 63--72. 2012.
Artem Sokolov, Guillaume Wisniewski and François Yvon. Non-linear n-best List Reranking with Few Features. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), San Diego (CA), 2012.
Marianna Apidianaki, Guillaume Wisniewski, Artem Sokolov, Aurélien Max, François Yvon. WSD for n-best reranking and local language modeling in SMT. In Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation, Pages 1-9, Jeju, Republic of Korea, July 2012.
Artem Sokolov. LIMSI: Learning Semantic Similarity by Selecting Random Word Subsets. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Pages 543-546, Montréal, Canada, 2012.
Markus Freitag, Stephan Peitz, Matthias Huck, Hermann Ney, Jan Niehues, Teresa Herrmann, Alex Waibel, Le Hai-Son, Thomas Lavergne, Alexandre Allauzen, Bianka Buschbeck, Josep Maria Crego, Jean Senellart. Joint WMT 2012 Submission of the QUAERO Project. In Proceedings of the Seventh Workshop on Statistical Machine Translation, Pages 322-329, Montréal, Canada, June 2012.
Adrien Lardilleux, François Yvon, Yves Lepage. Hierarchical Sub-sentential Alignment with Anymalign. In Proceedings of the annual meeting of the European Association for Machine Translation, 2012.
Qian Yu, Aurélien Max, François Yvon. Revisiting sentence alignment algorithms for alignment visualization and evaluation. In Proceedings of the 5th Workshop on Building and Using Comparable Corpora, Istanbul, Turkey, 2012.
Artem Sokolov, Guillaume Wisniewski, François Yvon. Computing Lattice BLEU Oracle Scores for Machine Translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Pages 120-129, Avignon, France, April 2012.
Josep Maria Crego, José M. Mariño, François Yvon. N-code: an open-source Bilingual N-gram SMT Toolkit. Prague Bulletin of Mathematical Linguistics, 96: pages 49-58, 2011.
Adrien Lardilleux, Yves Lepage, François Yvon. The Contribution of Low Frequencies to Multilingual Sub-sentential Alignment: a Differential Associative Approach. International Journal of Advanced Intelligence, 3(2):189-217, 2011.
International Conferences
Thomas Lavergne, Hai-Son Le, Alexandre Allauzen, François Yvon. LIMSI's experiments in domain adaptation for IWSLT11. In Proceedings of the eighth International Workshop on Spoken Language Translation (IWSLT), Mei-Yuh Hwang, Sebastian Stüker (eds.), San Francisco, CA, 2011.
Nadi Tomeh, Marco Turchi, Guillaume Wisniewski, Alexandre Allauzen, François Yvon. How Good Are Your Phrases? Assessing Phrase Quality with Single Class Classification. In Proceedings of the eighth International Workshop on Spoken Language Translation (IWSLT), Mei-Yuh Hwang, Sebastian Stüker (eds.), San Francisco, CA, 2011.
Markus Freitag, Gregor Leusch, Joern Wuebker, Stephan Peitz, Hermann Ney, Teresa Herrmann, Jan Niehues, Alex Waibel, Alexandre Allauzen, Gilles Adda, Josep Maria Crego, Bianka Buschbeck, Tonio Wandmacher, Jean Senellart. Joint WMT Submission of the QUAERO Project. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Pages 358-364, Edinburgh, Scotland, 2011.
Souhir Gahbiche-Braham, Hélène Bonneau-Maynard, François Yvon. Two Ways to Use a Noisy Parallel News Corpus for Improving Statistical Machine Translation. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, Pages 44-51, Portland, Oregon, 2011.
Thomas Lavergne, Alexandre Allauzen, Josep Maria Crego, François Yvon. From n-gram-based to CRF-based Translation Models. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Pages 542-553, Edinburgh, Scotland, 2011.
Hai-Son Le, Ilya Oparin, Abdel Messaoudi, Alexandre Allauzen, Jean-Luc Gauvain, François Yvon. Large Vocabulary SOUL Neural Network Language Models. In Proceedings of InterSpeech 2011, 2011.
Artem Sokolov, François Yvon. Minimum Error Rate Semi-Ring. In Proceedings of the European Conference on Machine Translation, Mikel Forcada, Heidi Depraetere (eds.), Pages 241-248, Leuven, Belgium, 2011.
Nadi Tomeh, Alexandre Allauzen, François Yvon. Discriminative Weighted Alignment Matrices for Statistical Machine Translation. In Proceedings of the European Conference on Machine Translation, Mikel Forcada, Heidi Depraetere (eds.), Pages 305-312, Leuven, Belgium, 2011.
Nadi Tomeh, Alexandre Allauzen, Thomas Lavergne, François Yvon. Designing an Improved Discriminative Word Aligner. In Proceedings of the 12th International Conference on Intelligent Text Processing and Computational Linguistics, Alexander Gelbukh (ed.), CICLING, Waseda, Japan, 2011.
Proceedings of the 14th Annual Conference of the European Association for Machine Translation. François Yvon, Viggo Hansen (eds.), Saint-Raphaël, France, 2010.
Journals
Josep Maria Crego, François Yvon. Factored bilingual n-gram language models for statistical machine translation. Machine Translation, pages 1-17, 2010.
Alexandre Allauzen, Josep Maria Crego, Ilknur Durgar El-Kahlout, Hai-Son Le, Guillaume Wisniewski, François Yvon. LIMSI @ IWSLT 2010. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), Marcello Federico, Ian Lane, Michael Paul, François Yvon (eds.), Pages 105-112, 2010.
Alexandre Allauzen, Josep Maria Crego, Ilknur Durgar El-Kahlout, François Yvon. LIMSI's Statistical Translation Systems for WMT'10. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and Metrics MATR, Pages 54-59, Uppsala, Sweden, 2010.
Josep Maria Crego, François Yvon. Improving Reordering with Linguistically Informed Bilingual n-grams. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010: Posters), Pages 197-205, Beijing, China, 2010.
Hai Son Le, Alexandre Allauzen, Guillaume Wisniewski, François Yvon. Training Continuous Space Language Models: Some Practical Issues. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Pages 778-788, Cambridge, MA, 2010.
Nadi Tomeh, Alexandre Allauzen, Guillaume Wisniewski, François Yvon. Refining Word Alignment with Discriminative Training. In Proceedings of the ninth Conference of the Association for Machine Translation in the Americas (AMTA), Denver, CO, 2010.
Guillaume Wisniewski, Alexandre Allauzen, François Yvon. Assessing Phrase-Based Translation Models with Oracle Decoding. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Pages 933-943, Cambridge, MA, 2010.
. The pay-offs of preprocessing for German-English Statistical Machine Translation. In Proceedings of the seventh International Workshop on Spoken Language Translation (IWSLT), Pages 251-258, 2010.
Alexandre Allauzen, Guillaume Wisniewski. Modèles discriminants pour l'alignement mot à mot. Traitement Automatique des Langues, 50(3):173-203, 2009.
International Conferences
Philippe Langlais, François Yvon, Pierre Zweigenbaum. Improvements in Analogical Learning: Application to Translating multi-Terms of the Medical Domain. In Proceedings of the European Conference on Computational Linguistics (EACL'09), Pages 487-495, Athens, Greece, 2009.
Josep Maria Crego, François Yvon. Gappy translation units under left-to-right SMT decoding. In Proceedings of the meeting of the European Association for Machine Translation (EAMT), Pages 66-73, Barcelona, Spain, 2009.
Philippe Langlais, François Yvon, Pierre Zweigenbaum. Translating Medical Words by Analogy. In Proceedings of the workshop on Intelligent Data Analysis in bioMedicine and Pharmacology (IDAMAP) 2008, Washington, DC, 2008.
Philippe Langlais, François Yvon. Scaling up analogical learning. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), Pages 49-52, Manchester, UK, 2008.
Philippe Langlais, François Yvon, Pierre Zweigenbaum. Analogical translation of medical words in different languages. In Proceedings of the 6th International Conference on Natural Language Processing, GoTAL 2008 - Advances in Natural Language Processing, Lecture Notes in Computer Science, Pages 284-295, 2008.
Daniel Dechelotte, Holger Schwenk, Jean-Luc Gauvain. The 2006 LIMSI Statistical Machine Translation System for TC-STAR. In TC-STAR Workshop on Speech-to-Speech Translation, Pages 25-30, Barcelona, Spain, 2006.
Holger Schwenk, Daniel Dechelotte, Jean-Luc Gauvain. Continuous Space Language Models for Statistical Machine Translation. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 723-730, Sydney, Australia, 2006.
Daniel Dechelotte, Holger Schwenk, Jean-Luc Gauvain, Olivier Galibert, Lori Lamel. Investigating Translation of Parliament Speeches. In Proceedings of the IEEE Workshop on Automatic Speech Recognition, San Juan, Puerto Rico, November 2005.
They have visited LIMSI in the past, so why don't you? If you are interested and happen to visit Paris, just drop us an email!
February 26, 2013: Sylvain Raybaud (LORIA)
February 4, 2013: Pascal Fung (HK-UST)
November 12, 2012: Anil Kumar-Singh (LIMSI) Machine Translation as a Problem of Estimating Linguistic Similarity and the Specific Problem of Translating TAM Markers
July 4, 2012: Simon Lacoste-Julien (Inria, Winnow) Structured alignment methods in machine learning
June 19, 2012: Kashif Shah (LIUM, Le Mans) Domain adaptation in SMT
May 30, 2012: Hermann Ney (IMMI) Bayes Decision Rule and the Classification Error in Systems for HLTPR (Human Language Technology and Pattern Recognition): Results and Open Problems
December 15, 2009: Jia Xu (RWTH), Sequence segmentation and alignment for statistical machine translation
November 3, 2009: Ilknur Durgar (LIMSI), A prototype English-Turkish statistical machine translation system
October 27, 2009: Vassilina Nikoulina (XRCE), Syntax-Augmented Phrase-Based Translation
April 29, 2009: Loïc Barrault (LIUM), Combinaison de systèmes (application à la reconnaissance automatique de la parole et à la traduction statistique)