Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
Spoken Language Processing Group (TLP)

Spoken language systems

This research area is concerned with designing systems that integrate the various aspects of the research and technology development carried out in the group. Practical issues in system design are addressed, such as the integration of system components and efficient decoding strategies that reduce computational needs. The research can be classified along three primary axes: 1) speaker-independent, continuous speech recognition; 2) identification of non-linguistic speech features, such as speaker and language identification; and 3) spoken language understanding and dialog systems.

Speech Recognition

In continuous speech recognition, our research focuses on developing basic technology for the recognition of spontaneous speech that is independent of the speaker and the task, and robust to noise and acoustic conditions. One application of particular interest is the transcription and indexing of radio and television news broadcasts. Performance on such a task depends essentially on the quality of the acoustic and linguistic models used. These models should take into account phenomena found in spontaneous speech (hesitations, breath noise, restarts, syntax different from written text, ...) as well as the wide acoustic variability (microphone, background noise, transmission channel, music, ...). This work is carried out in a multilingual context (English, German, French) in close coordination with the research activities in themes 2 and 3. Adapting the speech recognizer to a new language requires speech and text corpora, and a lexicon with phone transcriptions. It can also be necessary to modify the model structure to account for language specificities at the phonological or syntactic level.

Speaker-independent, continuous speech recognition systems have been developed for French, American and British English, and German, with vocabularies of up to 65k words.
Our system for American English has also been evaluated in annual ARPA benchmark tests since 1992 on the "Wall Street Journal", "North American Business News" and "Broadcast News" tasks. Despite the increase in task complexity, system performance has improved, owing to more accurate acoustic models trained on larger speech corpora and to a more efficient decoder, which made it possible to increase the size of the recognition vocabulary and the complexity of the language model. The same decoder is used for all languages and applications.

Radio and television broadcasts are particularly difficult to transcribe because they contain segments of different acoustic and linguistic natures, with gradual or rapid transitions between segments. The first step is to partition the continuous stream of data into homogeneous segments by detecting changes (acoustic conditions, speaker changes, ...) and identifying certain characteristics of each segment (such as speech/non-speech, telephone band or wideband, background music or noise, the speaker identity, ...). We have developed a segmentation algorithm that produces such a partition for a given show using only the acoustic signal (i.e., no other a priori information is available). This partition is used by the recognizer to select the most appropriate models and to adapt them to each segment's particular characteristics.
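
As an illustration only, the change-detection step can be sketched as follows. The actual segmentation algorithm is not described here, so everything in the sketch is an assumption: each candidate boundary is scored with a generalized likelihood ratio between two adjacent windows, each modeled by a single Gaussian, over a one-dimensional feature stream. The names `gaussian_loglik`, `change_score` and `segment`, and the `win` and `threshold` values, are hypothetical.

```python
import math
from statistics import pvariance


def gaussian_loglik(xs):
    """Maximized log-likelihood of xs under one 1-D Gaussian (ML mean and variance)."""
    var = max(pvariance(xs), 1e-6)  # floor the variance to avoid log(0)
    return -0.5 * len(xs) * (math.log(2 * math.pi * var) + 1.0)


def change_score(frames, t, win):
    """Likelihood ratio for a change at frame t: two Gaussians versus one."""
    left, right = frames[t - win:t], frames[t:t + win]
    return (gaussian_loglik(left) + gaussian_loglik(right)
            - gaussian_loglik(left + right))


def segment(frames, win=20, threshold=15.0):
    """Return frame indices that are local score peaks above the threshold.

    Assumption: frames is a list of scalar features; real systems use
    multidimensional cepstral vectors.
    """
    scores = {t: change_score(frames, t, win)
              for t in range(win, len(frames) - win + 1)}
    boundaries = []
    for t, s in scores.items():
        if s < threshold:
            continue
        # Keep t only if no frame within +/- win scores higher.
        if all(s >= scores.get(u, -math.inf)
               for u in range(t - win, t + win + 1)):
            boundaries.append(t)
    return boundaries


# Synthetic stream with one abrupt change in level at frame 50.
frames = [0.0, 0.1] * 25 + [5.0, 5.1] * 25
print(segment(frames))
```

A real partitioner would also label each segment (speech/non-speech, bandwidth, speaker identity), which this sketch does not attempt.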
Identification of non-linguistic speech features

The second main research activity is the identification of non-linguistic speech features. This activity is a logical extension of the continuous speech recognition research, as the same modeling techniques can be used. A statistical modeling approach is taken, where the talker is viewed as a source of phones, modeled by a fully connected Markov chain. The basic idea consists of constructing a set of acoustic models for each value of the non-linguistic feature to be identified (language, gender, speaker, ...), and evaluating the likelihood of the speech signal for each of the model sets. Instead of retaining the recognized string, as is done in recognition, here we are interested in the feature value corresponding to the model set with the highest likelihood.

The identification of the sex of the speaker was initially used to reduce the computation needed for sex-dependent models in word recognition. Sex identification accuracy on the corpora used for the continuous speech recognition (CSR) work is very close to 100% for speech from adult speakers, whether recorded under laboratory conditions or over the telephone.

This method has also been applied to the problem of speaker identification using speech corpora in American English and French. In the context of a research contract with France Telecom, and in collaboration with the Vecsys company, we have designed and recorded a telephone corpus for the development and evaluation of speaker verification algorithms as a function of the quantity and type of data used for training and test. This corpus contains recordings from one hundred users, each having provided 10 training calls and 25 test calls, and from 1000 impostors. The experiments carried out with this corpus enabled us to measure the identification error rate as a function of a variety of parameters (the type of data, the duration of the test utterance, model aging, the number of training sessions, the call location, ...).
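
The decision rule described in this section (score the signal against one model set per feature value and keep the best-scoring value) can be sketched as below. This is a deliberately simplified illustration, not the phone-based Markov models used in the group's systems: each model set is reduced to a single one-dimensional Gaussian, and the function names, the feature values and the model parameters are all hypothetical.

```python
import math


def log_likelihood(frames, model):
    """Log-likelihood of scalar frames under a 1-D Gaussian (mean, variance).

    Assumption: a single Gaussian stands in for a full set of phone models.
    """
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
               for x in frames)


def identify(frames, model_sets):
    """Return the label whose model set gives the highest likelihood, plus all scores."""
    scores = {label: log_likelihood(frames, model)
              for label, model in model_sets.items()}
    return max(scores, key=scores.get), scores


# Hypothetical models over a pitch-like feature, for illustration only.
model_sets = {"female": (220.0, 400.0), "male": (120.0, 400.0)}
label, scores = identify([118.0, 125.0, 130.0], model_sets)
print(label)
```

The same `identify` call works unchanged whether the labels are genders, speakers or languages; only the model sets differ.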
The text-dependent speaker verification equal error rate (i.e., the operating point at which the false rejection rate for users equals the false acceptance rate for impostors) is 1.0% when a maximum of two verification attempts is allowed per trial and each utterance lasts at least 1.2 s.

The same technique has been used for language identification, for which a variety of potential applications can be envisioned, such as automatically routing telephone calls to an operator, or serving as a component of information servers or translation systems. In the context of a research contract with the CNET we have recorded a multilingual telephone corpus (French, British English, German and Spanish) with over 300 calls per language. These data are being used to carry out language identification experiments under conditions in which the recording setup and the linguistic content of the data are controlled. The main problems to be addressed are the modeling of the telephone channel and of noise, and the interaction between the acoustic and phonotactic levels for the different languages.

Our original phone-based approach required large orthographically transcribed corpora for each language of interest. In order to treat languages for which such transcribed corpora are not available, we are working on developing language-independent acoustic models. These models should allow new languages to be modeled with only the acoustic data. In the context of a research project with the DGA, we are evaluating this approach with 10 languages.
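
For readers unfamiliar with the equal error rate quoted above for speaker verification, the following minimal sketch shows one common way to estimate it from verification scores. It is not the evaluation code used for the corpus described here: the score values are invented, the function names are hypothetical, and the sketch ignores the two-attempt rule and the minimum utterance duration.

```python
def error_rates(genuine, impostor, threshold):
    """False rejection and false acceptance rates when accepting scores >= threshold."""
    frr = sum(s < threshold for s in genuine) / len(genuine)
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return frr, far


def equal_error_rate(genuine, impostor):
    """Sweep the observed scores as thresholds and return (EER estimate, threshold).

    The EER is estimated at the threshold where the two error rates are closest,
    as the mean of the two rates at that point.
    """
    best = None
    for threshold in sorted(set(genuine) | set(impostor)):
        frr, far = error_rates(genuine, impostor, threshold)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2, threshold)
    return best[1], best[2]


# Invented scores: higher means "more likely the claimed speaker".
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.5]
eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)
```

With only a handful of scores the two error curves rarely cross exactly, which is why the sketch reports the mean of the two rates at the closest point.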
Spoken language understanding and dialog

The third research area is that of spoken language systems. In these systems the aim is not to transcribe what was said but to understand the spoken message and to carry out an interactive dialog with the user to accomplish a task. Our objective is to develop spoken language systems that provide vocal access for information retrieval. Task and domain knowledge must be used to define the vocabulary and the concepts specific to the application in order to construct appropriate acoustic, language and semantic models. Often no application-specific training data (acoustic or textual) are available.

Modeling spontaneous speech is particularly important, and new problems are encountered in developing the understanding component or when integrating the speech recognizer with other modalities such as a touch screen, keyboard, speech or other audio output, etc. In our system the output of the speech recognizer is passed to a natural language component which carries out a case frame analysis to extract the meaning of the spoken query. The main work in developing the understanding module is writing the grammar rules, and defining the concepts and keywords relevant for the task. A dialog module manages the interaction with the user, prompting the user to supply the information needed for database access. Natural language responses are generated from a semantic frame, the dialog history and retrieved DBMS information, and synthesized using concatenated speech from stored dictionary units.

We are applying statistical methods to relate the recognized word string to the task-specific concepts, in order to facilitate porting to new applications. The results thus far have been quite satisfactory (that is, comparable to those obtained with the rule-based system), and we envisage integrating this approach into systems such as Mask and Arise.
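
To make the roles of the understanding and dialog modules concrete, here is a toy sketch of keyword-driven frame filling followed by a prompt for missing information. It is an assumption-laden illustration, not the grammar or dialog strategy of Mask or Arise: the marker words, slot names, city and day lists, and prompt wording are all hypothetical.

```python
# Hypothetical task vocabulary for a train-information query.
CITIES = {"paris", "lyon", "marseille"}
DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}

# Hypothetical case markers: a keyword introduces a slot,
# and the word after it must be a legal value for that slot.
CASE_MARKERS = {"from": "departure_city", "to": "arrival_city", "on": "travel_day"}
SLOT_VALUES = {"departure_city": CITIES, "arrival_city": CITIES, "travel_day": DAYS}
REQUIRED = ("departure_city", "arrival_city", "travel_day")


def parse_query(words):
    """Fill a semantic frame from the recognized word string."""
    tokens = words.lower().split()
    frame = {}
    for marker, value in zip(tokens, tokens[1:]):
        slot = CASE_MARKERS.get(marker)
        if slot and slot not in frame and value in SLOT_VALUES[slot]:
            frame[slot] = value
    return frame


def next_prompt(frame):
    """Return a prompt for the first missing slot, or None if the frame is complete."""
    for slot in REQUIRED:
        if slot not in frame:
            return f"Please give the {slot.replace('_', ' ')}."
    return None


frame = parse_query("I want to go from Paris to Lyon")
print(frame)
print(next_prompt(frame))
```

Requiring the word after a marker to be a known slot value is what keeps "to go" from being read as a destination; a real grammar would handle multi-word city names and far more varied phrasing.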
We are developing spoken language systems in the context of three European projects (Mask, Arise and Home) and in the Aupelf-Uref Concerted Action B1.

The Mask kiosk allows users to obtain train travel information for over 500 cities in France, such as timetables, prices and reservations. This system was developed in the context of the Esprit Mask (Multimodal-Multimedia Automated Service Kiosk) project, in which Limsi was responsible for the vocal interface. The system is undergoing user trials in the Saint-Lazare train station in Paris.

In the LE Arise project we are developing a prototype telephone information service for train information. One particularity of telephone information services is that all information must be exchanged vocally. As a result, response generation and dialog management are crucial aspects of the system design. Compared with our initial system (explored in the LE Mlap Railtel project), the main improvements concern dialog management, the use of confidence measures in the recognizer, and the possibility of interrupting the system (barge-in). The dialog improvements are based on an analysis of the strategies used by human operators carrying out the same task.

The Tide Home-AOM project aims to develop a user-friendly multimodal multimedia interface to help disabled and elderly persons control household appliances. The user interface combines a touch screen, gesture recognition, speech recognition and synthesis. We are developing a spoken language understanding component in close collaboration with the Vecsys company. The system will allow users to control their environment using naturally spoken commands, avoiding navigation menus. The prototype system will be extensively evaluated at the Garches Hospital.
Evaluation

We devote substantial effort to the evaluation of our systems, to technology transfer and to the development of spoken language corpora. Concerning our evaluation activities, LIMSI has participated in the last 7 benchmark tests organized by the U.S. DARPA program: DARPA RM (Sep'92), ARPA WSJ (Nov'92, Nov'93), ARPA NAB (Nov'94, Nov'95), BN (Nov'96, Nov'97). The evaluations enable an international comparison of different systems on the same data (American English) using common corpora for training and test, and a common test protocol.

In collaboration with the SNCF we are currently evaluating the Mask kiosk in the Saint-Lazare train station. In order to develop the Mask acoustic and language models, we recorded over 700 subjects interacting with the system.

We also participate in the Aupelf-Uref concerted actions B1 and B2, which address the evaluation of dictation and dialog systems in French. In this context we participated in the first evaluation campaign of Arc B1, held in 1997. Concerning Arc B2, we have ported our Mask spoken language system to a more general tourist information task, selected as a common task for this action. This system has been used to collect the dialog corpus ("ParisCorp"), which contains 3400 utterances by 44 subjects. The corpus is being studied to determine the methodologies that will be used in the second phase of the action, that is, in the evaluation of spoken dialog systems. A study to annotate the corpus semantically and at the dialog level is also underway.

LIMSI participates in the Disc project, a long-term concerted action in the Esprit programme, which aims to codify current best practice in spoken dialog systems development and evaluation. There is no existing reference methodology for dialog system development even though there has been growing interest in language engineering.
The Disc project will develop concepts and guidelines, based on a "grid" and "lifecycle" analysis of exemplars, producing recommendations and tools for dialog system development and evaluation.
Last modified: Sunday, 11-December-05 06:13:34 CET