Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Transversal Action: Speech/Text Corpora and Evaluation

COPTE Action 

Corpora speech-text & evaluation

COrpus Parole/Texte & Evaluation

Gilles Adda, Martine Adda-Decker, Claude Barras, Philippe Boula de Mareüil, Benoît Habert, Patrick Paroubek
Objective
The transversal action COPTE bridges two domains of Natural Language Processing: Speech Processing on the one hand and Written Language Analysis on the other. COPTE was launched in August 2004 through the merger of two complementary transversal actions of the Man Machine Communication Department of LIMSI: Archimed and CORVAL. Archimed, started in 2001, aimed at building on experience gained in both speech processing and written language processing. CORVAL, active since October 1997, was intended to foster scientific and technical exchange among LIMSI staff involved in corpus-based research (spoken and written), including the evaluation of language processing systems. Current research activities of COPTE include:
  1. participation of different LIMSI teams (TLP, LIR, PS) in the TECHNOLANGUE evaluation program of the French Ministry of Research.
  2. study of combining sibling resources (related audio and text data) with automatic speech transcription in order to automatically produce enriched, high-quality exact audio transcripts.
Description
TECHNOLANGUE aims at developing a national infrastructure for evaluating language processing technology, improving the availability of open-source language processing packages, and promoting the use of standards. All of these activities are key to the development of man-machine communication. LIMSI participates in five evaluation campaigns involving the French language (ESTER, EQUER, MEDIA, EVASY and EASY), which were launched three years ago within the EVALDA project:
  1. ESTER concerns rich speech transcription evaluation (TLP),
  2. EQUER is about evaluation of information extraction (LIR),
  3. MEDIA deals with the evaluation of language understanding for dialog processing (LIR, TLP),
  4. EVASY evaluates speech synthesis (PS),
  5. EASY addresses parsing evaluation (LIR).
LIMSI acts as a participant in the first four campaigns and as a co-organizer of the last one. Developing an evaluation paradigm, as TECHNOLANGUE does, is crucial for the future of Language Engineering, for both spoken and written language processing. It requires shared evaluation metrics and well-documented language resources for training and testing. Evaluations give a better understanding of the advantages and drawbacks of the different approaches, methods and systems, which are discussed, in light of the results achieved, at the dedicated workshops that conclude the evaluation campaigns.
Multilinguality becomes an important issue when the evaluation paradigm is deployed in an international context, especially in the European context of the information society. In the US, the evaluation paradigm has been widely used within DARPA and NIST actions and programs since 1987, mostly for American English. More recently it has been extended to other languages (Multilingual TREC). A long-term goal of COPTE is to advocate the deployment of the evaluation paradigm on an international basis, in preparation for a possible Human Language Technologies evaluation infrastructure within the next EC Framework Programme.
Figure 1: TECHNOLANGUE in the research/technology/application context.

On a shorter time scale, the current activities of COPTE address the combination of spoken and written language processing techniques on sibling resources. We use press-oriented transcriptions of televised political interviews, as provided by INA, to improve automatic speech transcription. A special focus is put on the processing of speech disfluencies (repetitions, revisions, fillers, etc.) within a corpus of 10 hours of TV shows from the 1990s. During each show, a politician or public figure is interviewed by several journalists. Press-oriented (bona fide) transcripts are available for these shows. These transcripts provide an almost exact transcription of the recorded speech: they are intended to reflect the meaning exactly rather than to report the exact wording. In particular, hesitations, reformulations and incomplete utterances tend to be omitted or reworded. Nevertheless, these transcripts remain close overall to what was said, since they serve as a basis for exact quotation of the most striking sentences. One goal of automatic speech transcription is to produce transcripts without the portions corresponding to disfluent speech. Annotating disfluencies is a first step towards evaluation campaigns such as those organized by NIST on enriched transcriptions (http://nist.gov/speech/tests/rt/index.htm). Such material, free of disfluencies and segmented into complete but short information chunks (simple sentences), is a very useful resource for further content processing.

As a first step, the press transcripts were aligned with the speech signal. Then, ten percent of the whole corpus (10,000 words) was hand-corrected to provide an exact transcription (including all audible speech events).

image Sibling resources - informed manual annotation
Figure 2: From top to bottom, different ways of producing exact audio transcripts from the audio signal: (A) from scratch, 60 times real-time; (B) using sibling written material, 12 times real-time; (C) from informed automatic transcripts, 8 times real-time.
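To make the real-time factors in the Figure 2 caption concrete, the short sketch below converts them into hours of manual work for a given amount of audio. It assumes that effort scales linearly with audio duration, which the page does not state; the function names are illustrative only.

```python
# Real-time factors quoted in the Figure 2 caption
# (hours of manual work per hour of audio).
REAL_TIME_FACTORS = {
    "A: from scratch": 60,
    "B: using sibling written material": 12,
    "C: from informed automatic transcripts": 8,
}


def annotation_hours(audio_hours: float, real_time_factor: float) -> float:
    """Estimated manual effort, ASSUMING cost is linear in audio duration."""
    return audio_hours * real_time_factor


def effort_table(audio_hours: float) -> dict:
    """Effort for each production method, keyed by the method name."""
    return {
        name: annotation_hours(audio_hours, factor)
        for name, factor in REAL_TIME_FACTORS.items()
    }


if __name__ == "__main__":
    # One hour of audio, i.e. roughly the hand-corrected tenth of the corpus.
    for name, hours in effort_table(1.0).items():
        print(f"{name}: {hours:.0f} h of manual work")
```

Under this linear assumption, method (C) requires 7.5 times less manual work than transcription from scratch (60 / 8).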


Mark-up for disfluencies was added following the Linguistic Data Consortium (LDC) guidelines. A specific feature of political debates and controversial interviews is that speakers quite often compete for the floor. As a consequence, overlapping speech and related disfluencies are relatively frequent in our corpus. Disfluencies were categorized as "fillers" (filler words such as "hum"), discourse markers, editing marks by the speaker about their own speech, "asides", repetitions, revisions, false starts, etc.
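As an illustration of how such annotations can be used to derive a disfluency-free transcript, here is a minimal sketch. The token representation and label names are hypothetical: they mirror the categories listed above but do not reproduce the actual LDC mark-up syntax, and the example utterance is invented.

```python
from typing import List, NamedTuple, Optional

# Hypothetical label set mirroring the categories named in the text.
DISFLUENCY_LABELS = {
    "filler", "discourse_marker", "edit", "aside",
    "repetition", "revision", "false_start",
}


class Token(NamedTuple):
    word: str
    label: Optional[str] = None  # None means fluent speech


def strip_disfluencies(tokens: List[Token]) -> str:
    """Keep only the words that carry no disfluency label."""
    return " ".join(t.word for t in tokens if t.label not in DISFLUENCY_LABELS)


# Invented utterance: a filler followed by a repeated article.
# The first "le" is labeled as the repeated (discarded) occurrence.
utterance = [
    Token("euh", "filler"),
    Token("nous"),
    Token("avons"),
    Token("le", "repetition"),
    Token("le"),
    Token("budget"),
]

if __name__ == "__main__":
    print(strip_disfluencies(utterance))
```

In this sketch only the discarded part of a repetition or revision is labeled, so the repaired wording survives the clean-up.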

image Transcriber + disfluency annotations
Figure 3: Screenshot of Transcriber with an excerpt of manually checked exact audio transcripts and manually annotated events specific to spontaneous speech.
Results and perspectives
Our study shows that Filled Pauses can be found almost anywhere. More precisely, 35% of Filled Pauses occur at a sentence boundary indicated by a full stop (14%) or at a major phrase boundary indicated by a comma (21%), with respect to the punctuation in the press transcripts. For the remaining 65% of Filled Pauses, Table 1 gives the distribution of the most frequent left and right contexts, considered independently. Even in the middle of a sentence, Filled Pauses frequently precede a determiner or a preposition, and they tend to follow a conjunction or a preposition. This asymmetry suggests that Filled Pauses (transcribed as "euh" in French) are avoided within noun phrases, especially between a determiner and a noun. In this situation, other mechanisms such as final lengthening or repetitions are preferred.
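The left and right context counts described above can be gathered with a few lines of code. The sketch below is only an illustration on an invented token sequence: it counts the word immediately before and after each "euh", independently, and ignores the punctuation-based split (full stop or comma) that the study also takes into account.

```python
from collections import Counter
from typing import List, Tuple

FILLED_PAUSE = "euh"  # transcription of the filled pause in French


def filled_pause_contexts(tokens: List[str]) -> Tuple[Counter, Counter]:
    """Count the words immediately left and right of each filled pause.

    Left and right contexts are counted independently of each other.
    """
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != FILLED_PAUSE:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1
    return left, right


# Invented example, not taken from the corpus.
sample = "je pense que euh la réforme et euh les mesures de euh la loi".split()

if __name__ == "__main__":
    left, right = filled_pause_contexts(sample)
    print("left contexts: ", left.most_common())
    print("right contexts:", right.most_common())
```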
Repetitions and Revisions have some features in common. First, they both involve 1 or 2 words on average, and there is a high correlation (0.8) across speakers between their counts of Repetitions and Revisions: speakers who produce many repetitions also tend to make many revisions. Second, the most frequent Repetitions and Revisions tend to be monosyllabic function words: de ("of", 72 Repetitions + 45 Revisions), le ("the/him", 40 Repetitions + 39 Revisions), etc. For all speakers, the first two places are occupied, in the same order, by very frequent French words. The form "le" is far more often a determiner than a pronoun, even though nothing prevents a subject pronoun such as "je" ("I") from being one of the most repeated or revised words. In Table 2, most words are shared between Repetitions and Revisions. This is not surprising under the following interpretation: while searching for words, a speaker may repeat a bootstrap word such as the masculine singular article "le" in French (or "the", pronounced [Di:], in English) if it agrees grammatically with what follows, and correct it otherwise. The fact that there are more masculine nouns than feminine nouns in French (16k vs. 12k in the BDLEX dictionary) does not seem sufficient to explain why "le" outweighs "la" in both Repetitions and Revisions. By contrast, the conjunction "et" ("and") hardly lends itself to Revision, and we only find it among Repetitions.
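The across-speaker correlation mentioned above is presumably a standard correlation coefficient over per-speaker counts; the page does not say which one. The sketch below assumes Pearson's coefficient and uses invented counts purely to show the computation, so its output says nothing about the actual corpus.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL per-speaker counts, one entry per speaker.
repetitions_per_speaker = [12, 30, 7, 22, 15]
revisions_per_speaker = [9, 25, 10, 20, 11]

if __name__ == "__main__":
    r = pearson(repetitions_per_speaker, revisions_per_speaker)
    print(f"correlation between repetition and revision counts: {r:.2f}")
```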
Inspection of the right part of Table 1 shows that the most frequent words that follow Revision-labeled words are "d'" ("of") and "l'" ("the"): precisely the elided forms of the most frequently revised words. This means that the most frequent repairs are of the form "de d'", before a word beginning with a vowel. Next comes "la" (more frequent than "le"), which is consistent with what we have just seen. Finally, the presence of "vous" ("you") or "on" ("someone") is striking, since these personal pronouns are absent from Table 2: they really represent syntactic breaks, following abandoned phrases.


Filled Pause, Left Context    Filled Pause, Right Context    Revision, Right Context
Word      #     %             Word      #     %              Word      #     %
que       40    4.2           de        53    5.5            d'        34    4.7
et        27    2.8           la        41    4.3            l'        30    4.1
pour      26    2.7           des       38    4.0            la        29    4.0
de        21    2.2           les       33    3.4            vous      25    3.4
avec      19    2.0           l'        26    2.7            de        23    3.2
à         13    1.4           le        23    2.4            on        21    2.9
qui       12    1.3           un        21    2.2            le        19    2.6
Table 1: Distribution of the most frequent contexts, considered independently for Filled Pauses and Revisions. Frequency counts and percentages of the most frequent disfluency context words are given.

Discourse Markers               Repetitions                  Revisions
Word            #      %        Word      #     %            Word      #     %
et              214    9.8      de        72    4.3          de        45    2.2
alors           141    6.5      le        40    2.4          le        39    1.9
je crois que    50     2.3      et        33    2.0          à         15    0.7
mais            44     2.0      je        29    1.7          que       14    0.7
donc            36     1.6      un        23    1.4          la        13    0.6
eh bien         33     1.5      à         23    1.4          les       11    0.5
hein            32     1.5      les       23    1.4          je        11    0.5
Table 2: Most frequent words involved in disfluencies (Discourse Markers, Repetitions and Revisions). The table gives the occurrence counts and percentages of the most frequent disfluency words.
Despite the size of our corpus, the conclusions we draw are tied to its genre, that of broadcast interviews, and would benefit from a comparison with conversational speech. To this end, the probabilities of discourse markers such as "je crois que" and "je pense que" ("I think that") were computed and compared with those obtained in other finely transcribed French corpora: Broadcast News (3.6M words) and Telephone Conversational Speech (1M words). For interviewees, the value is close to that estimated for conversational speech, whereas for journalists it is even below the value estimated for Broadcast News. In the near future, we plan to study the relationship between disfluencies and turn taking, the position of disfluencies within sentence-like units, and the influence that competition for the floor has on disfluencies. Finally, this type of analysis would arguably benefit from being related to the study of eye movements and body gestures, since we have video recordings at our disposal.

Lexicon updates and language model interpolation using the a priori related written resources already allow for automatic transcription with relatively low error rates. Taking disfluencies into account can further improve spontaneous speech modeling, since disfluencies are responsible for about half of the alignment offsets between the press transcripts and the exact transcription, although their impact on the number of recognition errors remains small.
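The idea of language model interpolation with related written material can be sketched as follows. This is a toy example under strong assumptions: it uses unigram models and a single fixed weight `lam`, whereas the page does not describe the model order, the weight, or how the actual system was built. Taking the union of both vocabularies stands in for the lexicon update.

```python
from collections import Counter
from typing import Dict, List


def unigram_model(tokens: List[str]) -> Dict[str, float]:
    """Maximum-likelihood unigram probabilities estimated from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(p_general: Dict[str, float],
                p_sibling: Dict[str, float],
                lam: float) -> Dict[str, float]:
    """Linear interpolation: P(w) = lam * P_sibling(w) + (1 - lam) * P_general(w).

    The vocabulary is the union of both models, so words that only occur
    in the sibling text are added to the lexicon.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    vocab = set(p_general) | set(p_sibling)
    return {
        w: lam * p_sibling.get(w, 0.0) + (1.0 - lam) * p_general.get(w, 0.0)
        for w in vocab
    }


# Invented toy data: a general model and a model built from the press transcript.
general = unigram_model("le gouvernement a dit".split())
sibling = unigram_model("le ministre a dit que".split())
mixed = interpolate(general, sibling, lam=0.5)

if __name__ == "__main__":
    for word, prob in sorted(mixed.items()):
        print(f"{word}: {prob:.3f}")
```

Because both inputs are probability distributions and the weights sum to one, the interpolated model is again a distribution over the merged vocabulary.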
Sibling corpora that can be made "parallel", like the one used in [Adda et al. 03], can be found rather easily (interviews of public figures, public debate archives such as parliamentary debates). Our work shows that it is relatively straightforward to align automatically produced transcriptions of the audio corpus with approximate manual transcripts by relying on "confidence islands" made of identical subsequences. We can thus provide draft transcripts to human annotators at relatively low cost, in which the portions of audio to be checked are automatically highlighted for extra attention (cf. Figure 2).
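A minimal sketch of the "confidence islands" idea is given below. It is not the alignment procedure used in the work described here: it simply relies on Python's standard `difflib.SequenceMatcher` to find identical word subsequences between an automatic transcript and a press transcript, keeps those of at least `min_len` words (an arbitrary threshold chosen for illustration), and reports the remaining stretches of the automatic transcript as regions to check. Time codes are left out, and the example sentences are invented.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def confidence_islands(auto_words: List[str],
                       press_words: List[str],
                       min_len: int = 3) -> List[Tuple[int, int, int]]:
    """Return (auto_start, press_start, size) for identical word runs.

    Only runs of at least `min_len` words are kept as islands.
    """
    matcher = SequenceMatcher(a=auto_words, b=press_words, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]


def regions_to_check(auto_words: List[str],
                     islands: List[Tuple[int, int, int]]) -> List[Tuple[int, int]]:
    """Half-open index ranges of the automatic transcript outside any island."""
    gaps, cursor = [], 0
    for auto_start, _, size in islands:  # islands come sorted by position
        if auto_start > cursor:
            gaps.append((cursor, auto_start))
        cursor = auto_start + size
    if cursor < len(auto_words):
        gaps.append((cursor, len(auto_words)))
    return gaps


# Invented example: the automatic transcript keeps a filler and a repetition
# that the press transcript omits.
auto = "je crois que euh le le gouvernement a pris des mesures".split()
press = "je crois que le gouvernement a pris des mesures".split()

if __name__ == "__main__":
    islands = confidence_islands(auto, press)
    for start, end in regions_to_check(auto, islands):
        print("to check:", " ".join(auto[start:end]))
```

The regions outside the islands are the natural candidates for highlighting in a draft handed to a human annotator.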

References
[1] Philippe Boula de Mareüil, Benoît Habert, Frédérique Bénard, Martine Adda-Decker, Claude Barras, Gilles Adda, and Patrick Paroubek. A quantitative study of disfluencies in French broadcast interviews. In Proceedings of Disfluency In Spontaneous Speech (DISS) Workshop, Aix-en-Provence, September 2005.
[2] Claude Barras, Gilles Adda, Martine Adda-Decker, Benoît Habert, Philippe Boula de Mareüil and Patrick Paroubek. Automatic Audio and Manual Transcripts Alignment, Time-code Transfer and Selection of Exact Transcripts. In LREC, Lisbon, May 2004.
[3] Martine Adda-Decker, Benoît Habert, Claude Barras, Gilles Adda, Philippe Boula de Mareüil, and Patrick Paroubek. Une étude des disfluences pour la transcription automatique de la parole spontanée et l'amélioration des modèles de langage. In JEP, Fez, April 2004.
[4] Martine Adda-Decker, Benoît Habert, Claude Barras, Gilles Adda, Philippe Boula de Mareüil, and Patrick Paroubek. A disfluency study for cleaning spontaneous speech automatic transcripts and improving speech language models. In Disfluency in Spontaneous Speech Workshop, Robert Eklund (ed.), pp. 67-70, Göteborg, Sweden, 2003.
[5] Stéphanie Strassel. Simple Metadata Annotation Specification, Version 5.0. Annotation guide, Linguistic Data Consortium, 2003. http://www.ldc.upenn.edu/Projects/MDE/

