& Elsnet
| Time | Speaker | Topic |
| 14.00 - 14.15 | - | Welcome & Introduction |
| 14.15 - 14.45 | J. Mariani (Ministry of Research) | The Evaluation Paradigm in Speech and Language Technology Programs across the world (invited talk) |
| 14.45 - 15.30 | Herman Steeneken (TNO Human Factors) | Assessment Activities of Speech Technology: Methodology and International Standardisation |
| 15.30 - 15.45 | - | Questions & Recap. |
| 15.45 - 16.00 | - | Coffee Break |
| 16.00 - 16.45 | Dave Pallett (NIST) | The Role of the National Institute of Standards and Technology (NIST) in Benchmark Testing for Automatic Speech Recognition Systems |
| 16.45 - 17.00 | - | Questions & Recap. |
| 17.00 - 17.45 | Valerie Mapelli (ELRA) | Language Resource Repositories and Standards for Evaluation |
| 17.45 - 18.00 | - | Questions & Day Recap. |
| Time | Speaker | Topic |
| 09.00 - 09.45 | Philip Resnik (UMIACS) | Evaluations of Meaning: Word Sense Disambiguation and Machine Translation |
| 09.45 - 10.00 | - | Questions & Recap. |
| 10.00 - 10.45 | Patrick Paroubek (Limsi-CNRS) | The contribution of the Evaluation Paradigm to Research, Industry and Language Resource Stock |
| 10.45 - 11.00 | - | Questions & Recap. |
| 11.00 - 11.15 | - | Coffee Break |
| 11.15 - 12.00 | Beth Sundheim (SPAWAR Systems Center) | Assessment of Text Analysis Technologies: How "Message Understanding" came to mean "Information Extraction" |
| 12.00 - 12.15 | - | Questions & Recap. |
| 12.15 - 12.45 | - | Panel discussion & Morning Recap. |
| 12.45 - 14.00 | - | Lunch break |
| Time | Speaker | Topic |
| 14.00 - 14.45 | Kathleen Stibler | A Three-tiered Evaluation Approach for Interactive Spoken Language Dialog Systems |
| 14.45 - 15.00 | - | Questions & Recap. |
| 15.00 - 15.45 | John Garofolo | Integrating Human Language Technologies via Common Evaluations at NIST: The TREC Spoken Document Retrieval Track and the Automatic Meeting Transcription Project. |
| 15.45 - 16.00 | - | Questions & Recap. |
| 16.00 - 16.30 | - | Coffee Break |
| 16.30 - 17.15 | Niels Ole Bernsen | User Oriented Evaluation for Spoken Language Dialog Systems |
| 17.15 - 17.30 | - | Questions & Recap. |
| 17.30 - 18.00 | - | Panel discussion & Course Recap. |
The comparative evaluation paradigm has been used for R&D in Human Language Technologies (HLT) for more than 15 years now, in different areas of HLT, including speech dictation, spoken language understanding, broadcast news transcription, named entity extraction, topic detection and tracking, text retrieval, message understanding, machine translation, speaker verification and character recognition. The USA has been very active in this area, especially through DARPA, which used this paradigm to accompany its research programs. Although no equivalent infrastructure has been set up in Europe, several HLT evaluation initiatives can be reported, for both spoken and written language processing. Some have been supported within the programs of the European Commission, either as actual evaluation campaigns, such as the ones conducted within SQALE or CLEF, or as accompanying projects such as DISC, EAGLES, ELSE or CLASS. Elsnet, the European Language and Speech Network, has a specific Working Group on that topic. Other activities have been conducted within national programs, such as, in France, the GRACE action at CNRS or the Aupelf-Uref ARC, or in an international framework (Senseval/Romanseval, Aurora, Cocosda...). There appears to be a need to conduct evaluation at the international level, for various languages, and to set up a corresponding international HLT evaluation infrastructure.
The complexity of the human-computer interface, and the subtle role of speech and language processing within it, has been a source of difficulty in deploying speech and language systems in many applications. Not only are field conditions very different from laboratory conditions, but there has also been a serious lack of agreed protocols for specifying such systems and for assessing their overall effectiveness. This introduction will therefore focus on the methodology and on the relation between the various assessment paradigms. International standardisation of assessment methods is in progress. In the past, assessment activities were sponsored by, among others, DARPA (NIST annual benchmark tests) and the European Union (SAM, SQALE, and EAGLES). An ISO technical report is currently in preparation. What goals do we want to achieve with speech and language oriented systems? Do we aim at a performance similar to human performance? We will compare some human and system results to mark the state of the art.
This presentation will review the development and implementation of benchmark tests for automatic speech recognition technology. A "benchmark test" consists of: (1) developing agreement within a research community on details of permissible training materials, dates, submission format, etc., (2) provision of training materials, (3) release of previously unseen test material, (4) submission of results to an organization such as NIST for objective scoring, (5) distribution of results to participants, and (6) discussion of the attributes of the systems at workshops. Benchmark tests have been implemented within the automatic speech recognition research community by NIST since 1987, and these tests have provided valuable insights into the technology.
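As an illustration of the objective scoring in step (4), the sketch below computes a word error rate, the word-level edit distance between a reference transcript and a system hypothesis divided by the number of reference words. The abstract does not name a scoring metric; word error rate is a common choice for speech recognition and is assumed here. The function name and the sample transcripts are hypothetical, and the code is not the scoring software used in the benchmark tests.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only: real scoring tools also normalise text
    and report substitutions, deletions and insertions separately.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because every participant's output is scored against the same previously unseen test material with the same procedure, the resulting figures can be compared across systems.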
Language technology has at long last reached the marketplace, at least in part because we have learned to build successful applications (such as speech recognition systems and search engines) that manipulate language without "understanding" the meaning in any deep sense. What are the issues involved in creating a technology of meaning, and does the market need it? I will look at this question by focusing on language technology at two levels. First, I will look at the technical sub-task of word sense disambiguation (WSD), that is, the problem of determining which meaning is intended by a word in text -- if my seat on the airplane is "free", is it 'disponible' or is it 'gratuite'? Second, I will look at machine translation as the ultimate technology of meaning, with a focus on the question of what it takes for machine translation technology to succeed and how such technology can be evaluated.
The presentation will show how the evaluation paradigm can provide a meeting ground for all the actors in the domain, exhibiting the benefits that each can expect from their involvement. In particular, we will see, through the example of the GRACE evaluation campaign for Part Of Speech tagging of French, which benefits industry can draw from its involvement in the deployment of the evaluation paradigm, and how the evaluation paradigm can function as a producer of high-quality, low-cost validated language resources. First the paradigm of evaluation will be presented, with emphasis on its recent history in Europe (ELSE, CLASS, etc.). Then the implementation of the evaluation paradigm in GRACE will be detailed, and the consequences that the campaign had on the domain will be analyzed. Finally, the method used to produce high-quality validated language resources at low cost from the by-products of evaluation will be presented and illustrated with the MULTITAG project (which exploits the corpus produced in GRACE).
A series of seven U.S. government-sponsored evaluations of text analysis
technologies was carried out between 1987 and 1998. They are known
as the Message Understanding Conference (MUC) evaluations. These
task-based evaluations and the technology developers' responses to them
contributed significantly to the definition of a field of endeavor
labeled information extraction, which is different in orientation
from message understanding. Some of the reactions to the
evaluations have been positive, such as the excitement about some
relatively simple extraction tasks on which systems have either already
reached very high levels of performance or on which they are at least
perceived to be within reach of high performance levels. Negative
feelings have also been expressed, including doubts about the "grand
challenge" extraction task that has been the common thread
throughout the series of MUC evaluations. This talk will highlight
some of the constants and some of the changes in the nature of the MUC
evaluations over the years, will summarize lessons learned, and will
provide pointers to sources of additional information.
I will describe our three-tiered approach to the evaluation of spoken dialogue systems. The three tiers measure user satisfaction, system support of mission success, and component performance. We have applied this approach in numerous fielded user studies conducted with the U.S. military. I will discuss the role that the findings of these studies played in the further development of the spoken dialogue system and of the metrics themselves.
The great challenge of the twenty-first century will be to integrate multiple core language technologies into coherent, synergistic applications. The National Institute of Standards and Technology (NIST) has begun designing and administering common evaluations requiring the fusion of two or more core language technologies. The NIST common evaluation model has helped to accelerate the development of such technologies by providing a forum for the integration and exchange of approaches among diverse research communities. The Spoken Document Retrieval (SDR) Track demonstrated that automatic speech recognition technologies can be combined successfully with information retrieval technologies to create highly effective audio indexing/search systems. The new Automatic Meeting Transcription Project will help to create technologies for the automatic production of meeting minutes using a combination of video and audio sensors and language technologies. This talk will provide an overview of how we design the infrastructure for a multi-technology evaluation, through a discussion of the evolution of the SDR Track. We will also discuss plans for the new Automatic Meeting Transcription Project.