TLP - Speech Recognition

Speech recognition is concerned with converting the speech waveform, an acoustic signal, into a sequence of words. Today's best-performing approaches are based on statistical modeling of the speech signal. Our research addresses most of the main components of state-of-the-art speech recognizers, that is, language modeling (in close collaboration with Topic 5), lexical representation, acoustic-phonetic modeling and decoding. The realization of any individual word is highly dependent on the individual speaker, the social context and the acoustic environment (cf. Topic 2). Automatic speech recognizers, also called speech-to-text systems, must be able to handle such time-varying contextual effects. In addition to coping with changes in acoustic context, the system must be able to evolve over time to handle changes in style and topic, and to dynamically update the vocabulary. Language model adaptation aims to compensate for such changes in style, topic and dialect. For almost two decades, large vocabulary continuous speech recognition has served as a focus for the evaluation of models and algorithms. Over time the tasks have become more challenging, and the number of languages and tasks addressed has grown.
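
As a reminder of how these components fit together, the standard statistical formulation of speech recognition can be written as follows. The notation is assumed for illustration and does not appear in the text above: X is the sequence of acoustic feature vectors, W a candidate word sequence, and Q a pronunciation (phone sequence) of W given by the lexicon.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \; p(X \mid W)\, P(W),
\qquad
p(X \mid W) = \sum_{Q} p(X \mid Q)\, P(Q \mid W)
```

Here P(W) is the language model, P(Q | W) the lexical (pronunciation) model, p(X | Q) the acoustic-phonetic model, and the search for the maximizing word sequence is the decoding step.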

One of the recent trends in speech-to-text systems is the use of discriminative techniques with large corpora to build more accurate models. The discriminative property can be included in the feature extraction by using discriminative classifiers such as multi-layer perceptrons (MLPs). By covering a wide temporal context, MLP features can potentially capture different speech properties than the widely used short-term cepstral features. In addition, MLPs can be trained to deliver estimates of class posteriors, which can be used as features for Gaussian mixture acoustic models. Training an MLP on large corpora requires efficient algorithms to remain computationally manageable. One of the important properties of MLP features is their complementarity to cepstral features. Research has addressed how best to include both feature types in a transcription system. Without adaptation, the MLP features perform better than standard cepstral features. However, once speaker adaptive training and unsupervised adaptation are used, the two feature types have comparable performance. Feature concatenation is an efficient combination method, providing the best gain at the lowest decoding cost. Ongoing work is exploring how best to adapt the probabilistic features across tasks, variants and languages. Exploration of alternative features (Spectrally Reduced Speech, Boosted Binary Features) is underway. To date these features have been investigated for small tasks such as phone recognition, but have never been validated on larger tasks with a state-of-the-art system.
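
The idea of feature concatenation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual front end: the function names (stack_context, concatenate_features), the edge-padding scheme, the log transform of the posteriors and the flooring constant are all hypothetical choices, and real systems typically also apply a dimensionality reduction step that is omitted here.

```python
import math


def stack_context(frames, left, right):
    """Build a wide temporal context for each frame (hypothetical helper).

    Each output vector is the concatenation of the frames from t-left to
    t+right; frames beyond the utterance edges are replaced by the nearest
    edge frame.
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for k in range(t - left, t + right + 1):
            window.extend(frames[min(max(k, 0), n - 1)])
        stacked.append(window)
    return stacked


def concatenate_features(cepstral, posteriors, floor=1e-10):
    """Append log class posteriors to the cepstral vector of each frame.

    `posteriors` stands in for the per-frame output of a trained MLP.
    The floor (an assumed value) avoids taking the log of zero.
    """
    combined = []
    for cep, post in zip(cepstral, posteriors):
        log_post = [math.log(max(p, floor)) for p in post]
        combined.append(list(cep) + log_post)
    return combined


if __name__ == "__main__":
    cepstral = [[1.0, 2.0], [3.0, 4.0]]      # toy 2-dimensional cepstra
    posteriors = [[0.5, 0.5], [0.9, 0.1]]    # toy 2-class MLP outputs
    for vector in concatenate_features(cepstral, posteriors):
        print(vector)
```

The combined vectors would then be modeled by the Gaussian mixture acoustic models in the same way as cepstral features alone, which is why concatenation adds little decoding cost.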

Research on speech recognition is carried out in a multilingual context, investigating and developing models for a variety of languages and variants. Typically ideas are first explored in one language, with successful methods then transferred to other languages. An example is the incorporation of pitch features in the recognition feature vector, which was first explored for Mandarin Chinese in the context of research for the GALE program and has since led to improved systems for all languages. In the context of the Quaero program, research addresses 9 European languages, with plans to cover all 23 official European languages by the end of the program. In this context various approaches for unsupervised training continue to be studied, extending previous work on semi-supervised and unsupervised acoustic model training and on unsupervised language model adaptation, which is most effective for addressing task domain mismatch. Unsupervised methods are also being used to address the automatic generation of pronunciations and variants for the English language. The group continues to collaborate with BBN, being a partner on the Babelon team (iARPA Babel program) addressing speech recognition and spoken term detection for low-resourced languages. In particular, work is exploring the automatic discovery of acoustic and lexical units for speech recognition and multilingual acoustic modeling.

A closely related research topic is language recognition, including language identification (that is, identifying the language and/or dialect of an audio document) and language detection. The language recognition research applies the parallel phonotactic approach, with recent studies aimed at improving the estimation of the phonotactic models (exploring various acoustic models, pitch and MLP features, model adaptation), as well as score normalization and fusion. Language recognition is investigated for varied data types within the context of the Quaero program and for telephone speech, with participation in the 2011 language recognition evaluation organized by NIST.
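
Score normalization and fusion across parallel phonotactic systems can be illustrated with a small sketch. It assumes that each phone recognizer, paired with language-specific phonotactic models, produces one log-likelihood score per target language for an utterance. The particular normalization (zero mean, unit variance over the candidate languages) and the weighted linear fusion are illustrative assumptions rather than the methods actually used, and all function names and scores are hypothetical.

```python
import math


def normalize_scores(scores):
    """Normalize one system's scores to zero mean and unit variance.

    `scores` maps each candidate language to a log-likelihood. The
    normalization is computed over the candidate languages for a
    single utterance.
    """
    values = list(scores.values())
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 if all scores are equal
    return {lang: (s - mean) / std for lang, s in scores.items()}


def fuse_scores(system_scores, weights):
    """Weighted linear fusion of the normalized scores of several systems."""
    fused = {}
    for scores, weight in zip(system_scores, weights):
        for lang, s in normalize_scores(scores).items():
            fused[lang] = fused.get(lang, 0.0) + weight * s
    return fused


def identify_language(system_scores, weights):
    """Return the language with the highest fused score."""
    fused = fuse_scores(system_scores, weights)
    return max(fused, key=fused.get)


if __name__ == "__main__":
    # Toy scores from two hypothetical phone-recognizer front ends.
    system_a = {"en": -100.0, "fr": -120.0, "de": -130.0}
    system_b = {"en": -50.0, "fr": -45.0, "de": -60.0}
    print(identify_language([system_a, system_b], [0.5, 0.5]))
```

Normalizing before fusion puts the systems on a common scale, so that a front end with a larger dynamic range of log-likelihoods does not dominate the combined decision.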

Speech recognition is a core technology for the processing of audio and audiovisual documents, and is one of the central research topics in the Quaero program, serving several application projects (Voxalead, Yacast, Orange, Systran OMTP). For such applications, the speech-to-text output must meet two needs: a representation that is easily searchable by machine and a representation that can be easily read by humans. Concerning the latter, reliable punctuation is needed. We are developing algorithms to identify punctuation (periods, question marks, commas, etc.) and disfluency markers, using a combination of language and acoustic/prosodic models (features such as pitch contours, duration, energy, pause lengths, etc.). Given the long-term goal of developing speech recognition technology that is as good as or even better than a human on the same task, it is important to assess human performance on speech recognition tasks (in collaboration with Topic 3). Human performance serves to provide target performance levels as well as to identify potential technological weaknesses.

We also support the MediaEval 2010-2013 and TrecVid 2007-2013 evaluations by providing automatic transcripts for several hundred hours of audio data.

Recent Projects

  • DGA RAPMAT (Speech translation)
  • iARPA Babel (Agile and robust speech recognition)
  • Matrice (Equipex)
  • QCompere
  • Quaero (Multimedia and multilingual indexing)
