TLP - Speaker characterization in a multimodal context

Speaker recognition consists of determining who spoke when, where the identity may be the speaker's true identity or an identity specific to one document or a set of documents. Different sources of information can be used to identify the speaker in multimedia documents: the speaker's voice, what is said, or what is written.

Speaker verification

Timbre, prosody, accent and idiomatic expressions can all provide cues for speaker characterization. Most state-of-the-art automatic speaker recognition systems, however, rely on modeling a short-term spectral analysis of the speech signal, focusing mainly on timbre. Gaussian Mixture Models (GMM) adapted from a generic speech model are often a building block of such systems, even when combined with other modelings such as Support Vector Machines (SVM) or, more recently, the i-vector representation. Since 2002 we have participated in the international speaker verification evaluations organized by NIST (the National Institute of Standards and Technology, USA), with studies on feature and score normalization, unsupervised adaptation, prosodic features, and speaker adaptation methods used as features, in particular MLLR (Maximum Likelihood Linear Regression) adaptation.

Speaker diarization

Speaker recognition has applications in security, access control and forensics, but also in audiovisual document analysis and multimodal applications. Speaker diarization, defined as the automatic acoustic segmentation of a recording and its clustering into speaker turns, can enrich an automatic transcription, improve its readability and, more generally, facilitate search in audiovisual archives. Our approach to speaker diarization is a multi-stage architecture: a first agglomerative clustering stage based on the BIC (Bayesian Information Criterion), optimized to provide pure clusters, is followed by a second stage using the CLR (Cross Log-likelihood Ratio) criterion as the cluster distance, which exploits the increased amount of data per cluster to use more complex models. Integrating speaker recognition approaches into the diarization system proved fruitful, yielding state-of-the-art performance in several national and international evaluations.
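As an illustration of the first clustering stage, the sketch below computes the classic delta-BIC merge criterion for two clusters of acoustic feature vectors. The full-covariance Gaussian cluster models and the penalty weight `lam` are standard choices in BIC-based clustering, not specifics of our system; the function name is ours.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC between two clusters of feature vectors (one row per frame).

    A positive value means the two clusters are better modeled separately,
    so the merge is rejected; a merge is accepted when the value is <= 0.
    Each cluster is modeled by a single full-covariance Gaussian.
    """
    n1, n2 = len(x), len(y)
    n, d = n1 + n2, x.shape[1]
    z = np.vstack([x, y])
    # log-determinant of the sample covariance of a cluster
    ld = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
    # penalty: free parameters of one full-covariance Gaussian (mean + covariance)
    p = d + d * (d + 1) / 2
    return 0.5 * (n * ld(z) - n1 * ld(x) - n2 * ld(y)) - 0.5 * lam * p * np.log(n)
```

In an agglomerative loop, the pair with the lowest delta-BIC is merged at each step until no pair scores below zero; `lam` trades cluster purity against fragmentation.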

Cross-show diarization

In the framework of the Quaero program, we aim to improve speaker diarization for multimedia broadcast data. We combined speaker diarization with the identification of known speakers for broadcast news and conversations, using either GMM speaker models adapted from a generic model or an SVM classifier for speaker segmentation and tracking. We also considered the situation where a collection of shows from the same source has to be processed. This is frequent for digital libraries and multimedia archives, where some speakers (journalists, actors, frequent guests, ...) are likely to appear in several shows; a given speaker should then share the same identifier across all the shows. With our partners in the Quaero program, we addressed this cross-show diarization task and experimented with different architectures, either with a global clustering or with an incremental presentation of the shows.
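The incremental architecture can be sketched as follows, under simplifying assumptions of our own: each within-show cluster is summarized by a single speaker embedding (e.g. an i-vector), and cosine similarity against a fixed threshold decides whether a cluster reuses an existing global identifier. The function name and threshold are illustrative.

```python
import numpy as np

def link_across_shows(shows, threshold=0.8):
    """Assign global speaker identifiers to per-show clusters.

    `shows` is a list of dicts mapping a local cluster name to a speaker
    embedding (numpy vector).  A cluster whose cosine similarity to an
    already-seen global speaker exceeds `threshold` reuses that speaker's
    identifier; otherwise a new identifier is created.  Shows are
    processed incrementally, in order.
    """
    global_speakers = []          # list of (global_id, unit-norm embedding)
    results = []
    for show in shows:
        mapping = {}
        for name, emb in show.items():
            e = emb / np.linalg.norm(emb)
            best, best_sim = None, threshold
            for gid, g in global_speakers:
                sim = float(e @ g)
                if sim > best_sim:
                    best, best_sim = gid, sim
            if best is None:      # unseen speaker: open a new global identity
                best = f"spk{len(global_speakers)}"
                global_speakers.append((best, e))
            mapping[name] = best
        results.append(mapping)
    return results
```

A global-clustering architecture would instead pool the clusters of all shows and re-cluster them jointly; the incremental variant shown here is cheaper and suits archives that grow over time.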

Multimodal person identification

There are cases where voice is not the only cue available to identify a speaker. In TV broadcast news or talk shows, for instance, the identity of guests or reporters is often provided as overlaid text that can be automatically extracted using video optical character recognition (OCR). Similarly, it is common practice for anchors to introduce their guests by name. In the framework of the QCOMPERE consortium for the REPERE challenge, we were able to greatly improve the performance of our supervised speaker identification system by combining these multimodal cues. Most importantly, this lays the groundwork for unsupervised speaker identification, which is useful when little or no data is available to train speaker models. We proposed a completely unsupervised multimodal speaker identification system that uses video OCR to name the clusters produced by our state-of-the-art speaker diarization system. Results show that this unsupervised multimodal approach can come very close to the performance of a supervised acoustic-only speaker identification system.
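A minimal sketch of the naming step, under an assumption of our own that temporal overlap alone decides the mapping: each diarization cluster receives the OCR name whose on-screen intervals overlap its speech turns the longest. Real systems combine further cues (spoken names, co-occurrence statistics), and the function name is illustrative.

```python
def name_clusters(cluster_turns, ocr_names):
    """Propagate OCR-extracted names to speaker clusters.

    `cluster_turns` maps a cluster label to a list of (start, end) speech
    turns in seconds; `ocr_names` is a list of (start, end, name) intervals
    during which a name is displayed on screen.  Each cluster is assigned
    the name with the largest total overlap; clusters with no overlapping
    name remain anonymous.
    """
    def overlap(a, b):
        return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

    named = {}
    for label, turns in cluster_turns.items():
        scores = {}
        for s, e, name in ocr_names:
            dur = sum(overlap((s, e), t) for t in turns)
            if dur > 0:
                scores[name] = scores.get(name, 0.0) + dur
        if scores:
            named[label] = max(scores, key=scores.get)
    return named
```

Since the overlaid name usually appears while the person is speaking, this simple vote already resolves most clusters; remaining anonymous clusters are where acoustic models or spoken-name detection can help.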

Audiovisual content structuring

Speaker diarization has also proved very helpful for several multimedia applications, and for audiovisual content structuring in particular. Through a collaboration with the Institut de Recherche en Informatique de Toulouse, we rely on speaker diarization for most building blocks of a novel approach to the automatic summarization of TV shows. First, a graph-based system for temporal segmentation into scenes relies on the multimodal fusion of color information, speaker diarization and automatic speech transcription. As modern TV shows usually contain multiple intertwined stories, a subsequent semantic plot de-interlacing step relies on speaker diarization and other cues to group semantically similar scenes into coherent stories. The last, ongoing step aims at summarizing each detected story into a short self-contained video excerpt for easier browsing.
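To give an idea of how diarization feeds the plot de-interlacing step, here is a deliberately simplified stand-in of our own: scenes are grouped greedily by the Jaccard similarity of their speaker sets, ignoring the other cues the actual system uses. Function name and threshold are illustrative.

```python
def group_scenes(scene_speakers, threshold=0.5):
    """Greedy grouping of scenes into stories by shared speakers.

    `scene_speakers` is a list of speaker-label sets, one per scene, taken
    from the diarization output.  A scene joins the most recent story whose
    accumulated speaker set has Jaccard similarity >= `threshold` with its
    own; otherwise it starts a new story.  Returns stories as lists of
    scene indices.
    """
    stories = []   # list of (accumulated speaker set, [scene indices])
    for i, spk in enumerate(scene_speakers):
        placed = False
        for sset, idxs in reversed(stories):   # prefer the most recent story
            union = spk | sset
            if union and len(spk & sset) / len(union) >= threshold:
                sset |= spk                    # grow the story's speaker set
                idxs.append(i)
                placed = True
                break
        if not placed:
            stories.append((set(spk), [i]))
    return [idxs for _, idxs in stories]
```

Interleaved stories then come out as separate index lists even though their scenes alternate in broadcast order.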

Current Projects

  • QCompere (Person recognition in audiovisual documents)
  • Quaero (Multimedia and multilingual indexing)
  • CHIST-ERA CAMOMILE (Collaborative Annotation of multi-modal, multi-Lingual and multi-media documents)
