Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

On a shorter time scale, the current activities of COPTE address the combination of spoken and written language processing techniques on sibling resources. We use press-oriented transcripts of broadcast political TV interviews, provided by INA, to improve automatic speech transcription. Special attention is given to the processing of speech disfluencies (repetitions, revisions, fillers, etc.) within a corpus of 10 hours of TV shows from the 1990s. In each show, a politician or public figure is interviewed by several journalists. Press-oriented (bona fide) transcripts are available for these shows. These transcripts are close to the recorded speech without being verbatim: they aim to reflect the meaning exactly rather than to report the exact wording. In particular, hesitations, reformulations and incomplete utterances tend to be omitted or reworded. Nevertheless, these transcripts remain close overall to what was said, since they serve as a basis for exact quotation of the most striking sentences.
One goal of automatic speech transcription is to produce transcripts from which the portions corresponding to disfluent speech have been removed. Annotating disfluencies is a first step towards evaluation campaigns on enriched transcription such as those organized by NIST (http://nist.gov/speech/tests/rt/index.htm). Such material, free of disfluencies and segmented into complete but short information chunks (simple sentences), is a very useful resource for further content processing.
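To make the idea of a transcript "without the disfluent portions" concrete, here is a minimal Python sketch. It is only an illustration under strong assumptions, not the annotation procedure used for the corpus: the filler list `FILLERS` is a hypothetical toy list, the function names are invented for this example, and only immediate word repetitions are detected (revisions, false starts and asides are not handled).

```python
# Hypothetical toy filler list; a real system would use a curated,
# language-specific inventory.
FILLERS = {"hum", "uh", "um"}

def tag_tokens(tokens):
    """Label each token as 'filler', 'repetition' or 'fluent'.

    A token is a 'repetition' when the next token is the same word,
    so the last copy of a repeated word is the one that is kept.
    """
    labels = []
    for i, tok in enumerate(tokens):
        word = tok.lower()
        if word in FILLERS:
            labels.append("filler")
        elif i + 1 < len(tokens) and tokens[i + 1].lower() == word:
            labels.append("repetition")
        else:
            labels.append("fluent")
    return labels

def clean(tokens):
    """Return the tokens that were not tagged as disfluent."""
    return [t for t, lab in zip(tokens, tag_tokens(tokens)) if lab == "fluent"]

exact = "I I think hum that the the reform is necessary".split()
cleaned = clean(exact)
print(" ".join(cleaned))
```

A repetition interrupted by a filler (for instance "the hum the") is not caught by this heuristic, which is one reason why hand annotation following explicit guidelines remains necessary.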
As a first step, the press transcripts were aligned with the speech signal. Then, ten percent of the whole corpus (10,000 words) was hand-corrected to provide an exact transcription (including all audible speech events).
Mark-up for disfluencies was added following the Linguistic
Data Consortium (LDC) guidelines. A characteristic
of political debates and controversial interviews is that speakers
frequently compete for the floor. As a consequence, overlapping
speech and the related disfluencies are relatively frequent in our corpus.
Disfluencies have been categorized as "fillers" (filler words like hum),
discourse markers, editing marks made by speakers about their own speech,
"asides", repetitions, revisions, false starts, etc.
Table 1: Distribution of the most frequent contexts, considered independently for Filled Pauses and Revisions. Frequency counts and percentages of the most frequent disfluency context words are given.
Table 2: Most frequent words involved in disfluencies (Discourse Markers, Repetitions and Revisions). The table gives the number of occurrences and the percentages of the most frequent disfluency words.
Lexicon update and language model interpolation, using the a priori
related written resources, already yield automatic transcription with a
relatively low error rate. Taking disfluencies into account can nevertheless
further improve spontaneous speech modeling: disfluencies are responsible
for about half of the alignment offsets between the press transcripts and
the exact transcription, although their impact on the number of recognition
errors remains small.
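As an illustration of what lexicon update and language model interpolation mean, the following Python sketch mixes two toy unigram models. This is a simplified sketch under stated assumptions: the text does not specify the model order, the smoothing or the interpolation weight actually used, so the unigram models, the weight `lam` and the helper names are hypothetical.

```python
from collections import Counter

def unigram_model(words):
    """Maximum-likelihood unigram probabilities estimated from a word list."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(p_general, p_press, lam):
    """Linear interpolation: lam * P_press(w) + (1 - lam) * P_general(w).

    The vocabulary is the union of both models, so words seen only in the
    press transcripts enter the lexicon with a non-zero probability.
    """
    vocab = set(p_general) | set(p_press)
    return {
        w: lam * p_press.get(w, 0.0) + (1 - lam) * p_general.get(w, 0.0)
        for w in vocab
    }

# Toy data standing in for a general model and the related press transcripts.
p_general = unigram_model("the government has decided the budget".split())
p_press = unigram_model("the minister has announced the reform".split())
mixed = interpolate(p_general, p_press, lam=0.5)
print(sorted(mixed.items()))
```

Because both inputs are probability distributions and the weights sum to one, the interpolated model is again a probability distribution over the merged vocabulary.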
Sibling corpora that can be made "parallel", like the one used in [Adda
et al. 03], are rather easy to find (interviews of public figures,
public debate archives, e.g. parliamentary debates). Our work shows that
it is relatively straightforward to align automatically produced transcriptions
of the audio corpus with approximate manual transcripts by relying on "confidence
islands" made of identical subsequences. We can thus provide draft transcripts
to human annotators at relatively low cost, with the portions of the audio
that need checking automatically highlighted for extra
attention (cf. Figure 2).
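The "confidence islands" idea can be sketched in a few lines of Python. This is a minimal sketch and not the alignment system described in the references: it uses the standard library `difflib.SequenceMatcher` as a stand-in aligner, and the minimum island length `min_len`, the function names and the example sentences are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def confidence_islands(asr_words, press_words, min_len=3):
    """Return (asr_start, press_start, length) for identical word runs
    of at least min_len words shared by both transcripts."""
    matcher = SequenceMatcher(None, asr_words, press_words, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]

def spans_to_check(asr_words, islands):
    """Return (start, end) index ranges of the automatic transcript that
    fall outside every island and should be shown to the annotator."""
    spans, cursor = [], 0
    for a_start, _, size in islands:
        if a_start > cursor:
            spans.append((cursor, a_start))
        cursor = a_start + size
    if cursor < len(asr_words):
        spans.append((cursor, len(asr_words)))
    return spans

asr = "well i i think that the the government has uh made a mistake".split()
press = "i think that the government has made a mistake".split()

islands = confidence_islands(asr, press)
for start, end in spans_to_check(asr, islands):
    print("check:", " ".join(asr[start:end]))
```

Requiring a minimum island length is a design choice: short matches such as a single function word are likely to be coincidental, so only longer identical runs are trusted as anchors and everything between them is flagged.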
References
[1] Philippe Boula de Mareüil, Benoît Habert, Frédérique
Bénard, Martine Adda-Decker, Claude Barras, Gilles Adda, and Patrick
Paroubek. A quantitative study of disfluencies in French broadcast interviews.
In Proceedings of Disfluency In Spontaneous Speech (DISS) Workshop, Aix-en-Provence,
September 2005.
[2] Claude Barras, Gilles Adda, Martine Adda-Decker, Benoît Habert,
Philippe Boula de Mareüil and Patrick Paroubek. Automatic Audio and
Manual Transcripts Alignment, Time-code Transfer and Selection of Exact
Transcripts. In LREC, Lisbon, May 2004.
[3] Martine Adda-Decker, Benoît Habert, Claude Barras, Gilles
Adda, Philippe Boula de Mareüil, and Patrick Paroubek. Une étude
des disfluences pour la transcription automatique de la parole spontanée
et l'amélioration des modèles de langage. In JEP, Fez, April
2004.
[4] M. Adda-Decker, B. Habert, C. Barras, G. Adda,
P. Boula de Mareüil and P. Paroubek. A disfluency study for
cleaning spontaneous speech automatic transcripts and improving speech
language models. In Disfluency in Spontaneous Speech Workshop,
Robert Eklund (ed.), pp. 67-70, Göteborg, Sweden, 2003.
[5] Stéphanie Strassel. Simple Metadata Annotation Specification,
Version 5.0. Annotation Guide, Linguistic Data Consortium, 2003.
http://www.ldc.upenn.edu/Projects/MDE/