TLP - Robust analysis of spoken language and dialog systems

This research topic addresses the analysis of spoken and written language for open-domain interactive information retrieval. Given this wide scope, the analysis must be fast and also robust enough to handle any kind of input (prepared speech, spontaneous speech, written texts, web documents, etc.). Spoken documents are difficult to process, in part because spoken language includes phenomena such as repairs, false starts and hesitations, and in part because they may be subject to errors introduced by automatic transcription. Searching for information in speech data poses new challenges compared with the newspaper-type documents typically used in Information Retrieval and Question-Answering tasks. The syntax of speech differs from, and in particular is far less rigid than, the syntax of written language. Structure tends to be more local: individual chunks generally follow traditional grammar rules, and the relations between them are established at a sometimes syntactic, but more often semantic, level. Analysis tools that expect rigid long-range structure, such as syntactic analyzers built for well-formed written text, tend to fail when applied to speech transcripts.

Dealing with speech data

Analyzing speech calls for the same quality that is needed to analyze not-so-well-formed, or simply unusual, texts: robustness. A robust analyzer does not attempt a complete analysis of a given sentence or document. It works on a best-effort basis, trying to produce as much structured information as possible, as reliably as possible, without expecting completeness. In speech transcriptions, some parts can be interpreted within a formalism (POS, syntax, Named Entities, Dialog Acts...), but other parts cannot. Automatic speech recognition exacerbates the need for robustness: not only are errors introduced in the transcription, but some useful clues present in text become unreliable. For example, the lack of reliable punctuation entails the loss of the sentence concept. Text-based analysis modules usually treat sentences as self-contained units, which limits the context size for rule-based systems and the search space for stochastic ones. In text, commas separate chunks where ambiguity may arise; in speech, prosody may give a human transcriber the equivalent context. However, reliably providing such information is still beyond the capabilities of today's automatic speech recognition systems. Speech contains intrinsic ambiguities that speech recognition systems cannot resolve; therefore dealing with recognition errors (which can range from one word in ten to one in two being wrong, depending on the language and the task) is part of any analysis task for spoken data.
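
Error proportions such as "one word in ten" are commonly reported as a word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of reference words. The text above does not name the metric it uses, so reading its figures as WER values of roughly 0.1 to 0.5 is an assumption. A minimal sketch, with a hypothetical function name:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words matches the empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

With this definition, one wrong word in a ten-word reference gives 0.1, and the value can exceed 1.0 when the transcript contains many inserted words.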

Speech has its advantages, though: a speech recognition system produces an output in which words are clearly delimited, abbreviations (with their inherent ambiguity) are not used, and uppercase, when present, is limited to proper nouns and acronyms. As a result, most of the tokenization issues that arise with texts have already been handled by the system developers. Ideally, the text entering an analysis step would combine the advantages of both speech and written text: words separated from punctuation marks and from each other, uppercase only on proper nouns and acronyms, punctuation that allows splitting into sentence units, etc.

In order to streamline the processing of both text and automatic (ASR) transcripts, all inputs to the analysis are converted to a common normalized form. This work builds upon extensive experience with normalizing data for language model building for ASR systems. However, processing speech presents additional challenges related to spoken language and the inherent imperfection of ASR systems. A related activity thus aims to study and understand the different kinds of errors produced by ASR systems in order to deal with them better. The primary objective of two projects launched in 2013 (ANR-CONTINT VERA and PEPS ERRARE) is to study the impact of ASR errors on systems that rely on a semantic analysis of automatic transcriptions, such as Named Entity detection, Spoken Language Understanding, Spoken Question-Answering and Dialog Systems. A PhD thesis (Mohamed Ben Jannet) began in September 2012 on this subject as part of a CIFRE contract between LNE, LPP-Paris 3, and LIMSI. Another aim is to classify ASR errors according to their impact on different systems so as to lead to more robust analysis systems. These studies are done in collaboration with, and based on work produced by, other topics of the group (ASR, Perception and Corpus).
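
As an illustration of what a common normalized form might look like, the sketch below separates words from punctuation marks and splits the result into sentence units when sentence-final punctuation is available. It is not the group's normalization procedure: the token pattern, the set of sentence-final marks and the function name are assumptions made for this example.

```python
import re

# A word (possibly with internal apostrophes or hyphens) or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")
SENTENCE_END = {".", "!", "?"}  # assumed sentence-final marks


def normalize(text: str) -> list[list[str]]:
    """Return sentence units, each a list of tokens with punctuation split off."""
    sentences, current = [], []
    for token in TOKEN_RE.findall(text):
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # unpunctuated tail, as in a raw ASR transcript
        sentences.append(current)
    return sentences
```

Written text such as "Hello, world. How are you?" yields two sentence units, whereas an unpunctuated transcript comes back as a single unit, which is the loss of the sentence concept described earlier.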

Methodology and approaches for system development

A robust analyzer is one that extracts as much structured information as possible from the data. Two complementary methodologies are studied, each applied to a class of applications. The first class of applications requires that a set of loosely defined information be extracted from the speech. A typical example is spoken language understanding for open-domain dialog systems. Here the ability to experiment with how information is classified is paramount, so symbolic approaches are preferred. An efficient rule-based engine designed for generalized incremental analysis was implemented and used to develop a wide-domain, mostly semantic, multi-level analysis for French, English and Spanish. These analyzers serve as a basis for all of our dialog systems. These experiments on multi-level analysis were further developed within the framework of the ANR-CONTINT GVLex project in collaboration with the AA group, for which our objective was to produce a multi-level analysis at the document level.
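
The idea of best-effort, multi-level rule-based analysis can be illustrated with a toy example. The sketch below is not the engine described above: the two rule levels, the rule contents and all names are hypothetical. A first level labels individual words, a second level merges label sequences into larger units, and anything no rule covers is kept with no label rather than causing a failure.

```python
import re

# Level 1 (hypothetical): word-level rules, pattern on the word -> label.
WORD_RULES = [
    (re.compile(r"\d+$"), "number"),
    (re.compile(r"(january|february|march|april|june|july)$", re.I), "month"),
]

# Level 2 (hypothetical): rules over the label sequences produced by level 1.
SEQUENCE_RULES = [
    (("number", "month"), "date"),
]


def label_words(words):
    """Attach the first matching level-1 label to each word, or None."""
    spans = []
    for word in words:
        label = next((lab for pat, lab in WORD_RULES if pat.match(word)), None)
        spans.append((word, label))
    return spans


def merge_sequences(spans):
    """Merge adjacent spans whose labels match a level-2 rule."""
    out, i = [], 0
    while i < len(spans):
        for pattern, label in SEQUENCE_RULES:
            window = spans[i:i + len(pattern)]
            if tuple(lab for _, lab in window) == pattern:
                out.append((" ".join(text for text, _ in window), label))
                i += len(pattern)
                break
        else:  # no rule applied here: keep the span unchanged and move on
            out.append(spans[i])
            i += 1
    return out


def analyze(words):
    """Run both levels; unlabeled words pass through untouched."""
    return merge_sequences(label_words(words))
```

On the word sequence "uh i want a flight on 12 march please", only "12 march" receives a label ("date"); the hesitation and the remaining words are left unlabeled, which is the partial result a robust analyzer is expected to return.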

The second methodology applies when the task is well defined and annotated corpora are available, in which case stochastic approaches are preferred. A typical case is Named Entity recognition and classification. Different machine learning approaches, including Decision Trees, Support Vector Machines and Hidden Markov Models, have been used for sequence labeling tasks such as Named Entity Recognition. Conditional Random Fields (CRFs) have been growing in popularity over the last few years, their main advantage being their capacity to include a variety of symbolic and stochastic features. The standard features used for training a CRF model for NE detection include word prefixes and suffixes, various predicates such as "Does the word start with a capital letter?", and morphosyntactic labels. The originality of our approach is to leverage the rule-based multi-level analysis to provide a series of word-level features for the CRF model. This information was used to predict single-layer named entities (Ester project). When handling semantically driven tree-structured named entities (Quaero project), this approach was used to bootstrap tree rebuilding via Probabilistic Context Free Grammars (PCFG). These analyzers are used for Question-Answering (Quaero project) and spoken dialog systems (Ritel project), and have led to collaborations with other laboratories.
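
The feature set described above can be sketched as one feature dictionary per word. This is an illustration only: the feature names, the affix length of three characters, the context features and the `analysis_labels` argument (standing in for the output of the rule-based analysis) are assumptions, and no particular CRF toolkit is implied.

```python
def word_features(words, pos_tags, analysis_labels, i, affix_len=3):
    """Build the feature dictionary for the word at position i."""
    word = words[i]
    features = {
        "word.lower": word.lower(),
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
        # Predicate: does the word start with a capital letter?
        "starts_with_capital": word[:1].isupper(),
        # Morphosyntactic label supplied by an external tagger.
        "pos": pos_tags[i],
        # Label from the rule-based analysis, or "none" where no rule applied.
        "analysis": analysis_labels[i] or "none",
    }
    # Minimal left and right context (an assumption, not stated in the text).
    if i == 0:
        features["BOS"] = True
    else:
        features["prev.word.lower"] = words[i - 1].lower()
    if i == len(words) - 1:
        features["EOS"] = True
    else:
        features["next.word.lower"] = words[i + 1].lower()
    return features


def sentence_features(words, pos_tags, analysis_labels):
    """Feature dictionaries for every word of one sentence or segment."""
    return [
        word_features(words, pos_tags, analysis_labels, i)
        for i in range(len(words))
    ]
```

Because capitalization is unreliable or absent in some ASR output, the `analysis` feature gives the model a cue that does not depend on case, which is one way to read the motivation for combining the two methodologies.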

The REPERE challenge

The objective of the REPERE challenge (see Topic 1) is to recognize who is speaking and who is seen. As it is common practice for anchors to introduce their guests by stating their names, different person entity detection systems were developed based on CRF models. In addition to multimodal person identification, one of our aims is to study different features and their robustness against ASR errors.

Dialog Systems

Our research on spoken language dialog systems mainly concerns open-domain interactive search and intelligent assistants. Our main scientific interest is in dialog management and, more specifically, in managing dialog context and history. Our model is based on the use of semantically motivated clusters and a three-step algorithm which manages explicit and implicit user and system confirmations. The Ritel system is used as an experimental platform to validate our approaches. We recently launched two projects in order to explore new research directions. The first one aims at developing a cognitive assistant, which involves developing a multi-task model for dialog management. A first version of the multi-task dialog management model has been implemented in a demonstration system within the Compagnon Numérique, a Futur En Seine project (Cap Digital and Région Ile de France). The second one aims at developing a system able to learn a task via interaction with a user, and involves work on automatic learning of dynamic task models.

All these activities are carried out in collaboration with the ILES and AA groups.

Current projects

  • ANR VERA (Speech recognition error analysis)
  • CHIST-ERA JOKER (to begin in January 2014)
  • FUI Patient GeneSys (to begin in January 2014)
  • Quaero project
