TLP - Language Resources

Statistical models and comparative evaluation have been a driving force in speech processing for over 30 years. Corpora are central to these two major paradigms. While in the past the use of large corpora was limited to a few domains and languages, the last decade has witnessed a real expansion towards multilinguality and multimodality. Developing corpora and organizing evaluations are crucial for the language community, and in turn pose scientific problems which need to be solved, such as what corpora to collect and how they should be annotated, as well as questions on how to reward their promoters and how to ensure that the collection process is ethical.

This topic deals with theoretical and practical problems concerning the collection, annotation and diffusion of large multilingual corpora.

The domain of speech processing requires the development of large corpora in order to train and evaluate models. The nature of the data, and therefore the type of annotation, varies with the type of application: automatic speech recognition requires the collection and normalization of written texts as well as the manual transcription of oral corpora. Translation systems essentially need parallel texts during the training stage, whereas annotation with named entities (which consists of marking word groups corresponding to, for example, surnames, first names, dates, units, organizations, etc.) is preferred for interactive information retrieval systems.

Specific problems usually lead to specific corpora. For instance, punctuating speech transcriptions is a controversial subject, as many linguists deny the validity of punctuation conventions for spoken language. Nevertheless, punctuating automatic speech transcripts is very useful for many applications. Therefore, to evaluate the automatic production of punctuation for ASR, a 100h multilingual corpus of punctuated speech transcripts was developed along with specific annotation guidelines to maximize the inter-annotator agreement. The evaluation of methods to deal with out-of-vocabulary (OOV) words in ASR is another example: classical corpora contain a very small proportion of OOV words (typically less than 1%), which makes WER inappropriate for comparing methods. For the EDyLex project, a specific corpus of 20h of French and English broadcast news was therefore selected, containing a high proportion (>4%) of OOV words.
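The OOV proportion mentioned above can be illustrated with a minimal sketch. It assumes the rate is counted over running tokens of a reference transcript against a fixed recognizer vocabulary; the function name, the toy vocabulary and the sentence are invented for illustration and are not taken from the EDyLex corpus.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of running tokens that are absent from the vocabulary."""
    if not tokens:
        raise ValueError("empty token sequence")
    vocab = set(vocabulary)
    n_oov = sum(1 for tok in tokens if tok not in vocab)
    return n_oov / len(tokens)


# Toy (invented) recognizer vocabulary and reference transcript.
vocab = {"the", "minister", "visited", "city", "of", "today"}
reference = "the minister visited the city of orsay today".split()

rate = oov_rate(reference, vocab)
print(f"OOV rate: {rate:.1%}")
```

With a rate below 1%, OOV words account for only a small share of the reference tokens, so two OOV-handling methods can differ a lot on those words and still show nearly identical WER. A corpus selected for a higher OOV proportion makes such differences visible.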

Error classification, diagnosis and impact measurement via perceptual tests constitute important steps in identifying weaknesses in the models of state-of-the-art transcription systems and preparing for future generations of spoken language processing systems. We address this important matter in close coordination with other topics of the group: Topics 3 (perception), 4 (robust analysis), and 6 (speech recognition). In this topic the focus is on the problem of multilevel annotation of errors in speech corpora.

Corpus diversity is also due to the nature of the data, which is related to the domain of application. For example, texts can be newspaper extracts, transcriptions of European Parliament debates or even taken from blogs. The same is true for oral corpora: the multiplication of radio, television and Internet media provides easy access to a wide variety of content such as broadcast news shows or conversational broadcasts. In addition, regional broadcasters can offer programs with speakers having numerous accents.

During the preparation of a corpus, this diversity pushes us to define annotation guidelines precisely in order to guarantee corpus homogeneity. In the case of manual audio transcriptions, the guidelines define, among other things, how to annotate overlapping speech, hesitations or truncated words, but their main purpose is to define how to give a normalized orthographic form to each oral realization. Individual language appendices are written so as to take into account the particularities of each language. Afterwards, ensuring that the established guidelines are followed throughout the process is often a matter of collaboration between the researchers, the organization responsible for evaluating the systems and the annotators. Techniques such as cross-validation are systematically applied when the number of transcribers allows for it. Another technique to ensure consistency within annotations is to compute inter-annotator agreement (IAA) when possible.

In collaboration with the ILES group, an extended definition for Named Entities was proposed. These extended Named Entities are hierarchical (with types and components) and compositional (with recursive type inclusion and metonymy annotation). Following these guidelines, two different corpora, one from contemporary broadcast news and the other from old OCRed newspapers (December 1890), have been annotated, each one containing about 1.5 million words.

Because human annotation is an interpretation process, there is no "truth" to rely on. It is therefore impossible to really evaluate the validity of an annotation. All we can do is evaluate its reliability, which is achieved by computing the inter-annotator agreement. The best way to compute it is to use one of the Kappa family coefficients, namely Cohen's Kappa or Scott's Pi. However, these coefficients imply a comparison with a "random baseline" which depends on the number of "markables". In the case of Named Entities, this "baseline" is known to be difficult to identify. A study was carried out in collaboration with LNE and INIST, in which different hypotheses were examined. This study allowed validation of the overall quality of the two corpora, which were made available to the research community.
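As an illustration of the Kappa family coefficients, here is a minimal sketch of Cohen's Kappa and Scott's Pi for two annotators. It assumes, purely for the example, that every token is a markable carrying exactly one label; this is one possible hypothesis among others, not necessarily the one retained in the study. The label sequences are invented.

```python
from collections import Counter


def observed_agreement(a, b):
    """Proportion of items on which the two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def chance_corrected(po, pe):
    """Shared form of the Kappa family: (Po - Pe) / (1 - Pe)."""
    if pe == 1.0:
        raise ValueError("expected agreement is 1; coefficient undefined")
    return (po - pe) / (1.0 - pe)


def cohen_kappa(a, b):
    """Expected agreement from each annotator's own label distribution."""
    n = len(a)
    po = observed_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum((ca[k] / n) * (cb[k] / n) for k in set(ca) | set(cb))
    return chance_corrected(po, pe)


def scott_pi(a, b):
    """Expected agreement from the label distribution pooled over both annotators."""
    n = len(a)
    po = observed_agreement(a, b)
    pooled = Counter(a) + Counter(b)
    pe = sum((count / (2 * n)) ** 2 for count in pooled.values())
    return chance_corrected(po, pe)


# Invented token-level labels; "O" marks tokens outside any entity.
ann1 = ["PER", "PER", "LOC", "O", "O", "O", "ORG", "O", "LOC", "O"]
ann2 = ["PER", "O", "LOC", "O", "O", "O", "ORG", "O", "O", "O"]

kappa = cohen_kappa(ann1, ann2)
pi = scott_pi(ann1, ann2)
print(f"kappa = {kappa:.3f}, pi = {pi:.3f}")
```

The expected agreement term is dominated by the "O" label, so its value depends directly on how many non-entity tokens are counted as markables. This is the baseline problem described above.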

Specific tools have been developed with the intention of automating several stages of the processing chain. For the acquisition of oral data, a podcast recording platform was set up which retrieves the audio files to be transcribed on a daily basis, renames them, and normalizes the signal. After manual transcription, orthography is verified using automatic correction or tools interfaced with online resources. The format of the generated files is also checked semi-automatically by scripts. Several methods were tried to improve transcription speed, such as correcting texts available on the Internet or produced by a recognition system. A different approach to producing fast transcriptions is to apply the partitioner of an automatic speech recognition system to the audio file to be transcribed. This is a very fast step, and the approximate time segments output by the partitioner can be used by transcribers as a starting point.
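The renaming and signal normalization steps of such an acquisition chain might look like the sketch below. Everything in it is an assumption made for illustration: the naming scheme, the choice of peak normalization, the target level and the function names do not describe the actual platform.

```python
from datetime import date


def canonical_name(source, day, ext="wav"):
    """Build a uniform file name from a source label and a recording date (hypothetical scheme)."""
    slug = source.strip().lower().replace(" ", "_")
    return f"{slug}_{day.isoformat()}.{ext}"


def peak_normalize(samples, target_peak=0.9):
    """Scale floating-point samples so that the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent or empty signal: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


# Invented source label and samples.
name = canonical_name("Radio Example", date(2012, 3, 5))
normalized = peak_normalize([0.1, -0.45, 0.3])
print(name, max(abs(s) for s in normalized))
```

Peak normalization is only one possible choice. A production chain could equally normalize loudness or resample to a common rate, which this sketch does not cover.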

As part of the Quaero programme, 35 transcribers were hired on fixed-term contracts to annotate large multilingual corpora: over 1,700 hours of varied broadcast audio data as well as seminars were manually transcribed. This work covered 25 different languages, some of which are under-resourced, such as Luxembourgish or Lithuanian, for which few language resources are currently available. These data contribute to the development of automatic speech recognition systems, to the improvement of speaker diarization and to annual evaluations.

For some purposes, such as person recognition in broadcast news shows, semantic information is found not only in the speech but in all the modalities present in the media. In this context, we are developing collaborative tools and guidelines to annotate multilingual, multimodal, multimedia data.

In addition to activities in corpus production, more general investigations on Language Resources (LR) are conducted, where the term "resource" includes data, tools, evaluation and meta-resources (guidelines, methodologies, metadata, Best Practices), for both spoken and written language. These activities are mostly conducted in connection with the FLaReNet and META-NET European Networks. They address the compilation of the LRs mentioned in papers presented at conferences (LRE Map), the comparison of the status across languages (Language Matrices and Tables), the detection of gaps for some languages (Less-Resourced Languages), the unique identification of an LR and the computation of its impact factor. They also concern the ethical dimension of LR production and distribution, in the context of growing international interest in Data Sharing and Crowdsourcing. In particular, after some preliminary studies concerning the ethical and legal issues of the use of Amazon Mechanical Turk for Language Resource production, a charter of good practice, "Ethics and Big Data", has been developed in collaboration with Aproged, Cap Digital, AFCP and ATALA.

Recent Projects

  • ANR VERA (Speech recognition error analysis)
  • CHIST-ERA CAMOMILE (Collaborative Annotation of multi-modal, multi-Lingual and multi-media documents)
  • ANR EDyLex (Enrichissement Dynamique de ressources Lexicales)

