ELSE Executive Summary (short version)
Executive Summary of a Blueprint for a General Infrastructure
for
Natural Language Processing Systems Evaluation
Using Semi-Automatic Quantitative Black Box Approach
in a Multilingual Environment.
Authors: Patrick Paroubek (Limsi-CNRS)
& Marc Blasband (Compuleer).
June 10th 1999.
Version 3.0
The ELSE project (Evaluation in Language and Speech Engineering) has a contract from the European Commission to study the possible implementation of comparative evaluation in Europe. This short management summary introduces the concepts, describes existing comparative evaluations (e.g. US DARPA, GRACE, SENSEVAL) and proposes approaches for implementation in Europe.
1.1 What is Comparative Evaluation?
Comparative evaluation in Language Engineering has been used as a basic paradigm in the US DARPA program on human language technology since 1984. Since then, other enterprises based on the same paradigm have been conducted in Europe, both at national and at European level, but on a smaller scale and over a limited time.
Comparative evaluation consists of a set of participants comparing the results of their systems on similar tasks and related data, using metrics that were agreed upon. Usually this evaluation is performed in a number of successive evaluation campaigns, with more complex tasks to perform at each campaign.
The ELSE proposal departs from the US DARPA approach first by considering usability criteria in the evaluation, and second by trading competitive aspects for more contrastive and collaborative ones through the use of multidimensional results.
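The paradigm described above can be illustrated with a minimal sketch, assuming a toy tagging task. The system names, the data, the accuracy metric and the response-time dimension are all hypothetical and chosen only for illustration; they are not taken from any ELSE, DARPA or GRACE campaign. Every participant is scored on the same reference data with the same agreed metric, and the results are reported along several dimensions rather than as a single ranking.

```python
from dataclasses import dataclass


@dataclass
class Result:
    system: str
    accuracy: float  # agreed-upon metric: fraction of items matching the reference
    seconds: float   # hypothetical usability-related dimension (response time)


def accuracy(outputs, reference):
    """Share of system outputs that match the common reference."""
    if not reference or len(outputs) != len(reference):
        raise ValueError("outputs and reference must be non-empty and of equal length")
    matches = sum(o == r for o, r in zip(outputs, reference))
    return matches / len(reference)


def evaluate(submissions, reference):
    """Score every participant on the same reference data with the same metric."""
    return [
        Result(name, accuracy(outputs, reference), seconds)
        for name, (outputs, seconds) in submissions.items()
    ]


# Hypothetical common test data and participant submissions.
reference = ["N", "V", "N", "ADJ"]
submissions = {
    "system_a": (["N", "V", "N", "N"], 0.8),
    "system_b": (["N", "V", "V", "N"], 0.2),
}

results = evaluate(submissions, reference)
for r in results:
    # Report each dimension side by side instead of collapsing them into one rank.
    print(f"{r.system}: accuracy={r.accuracy:.2f}, seconds={r.seconds:.1f}")
```

In this toy setting one system is more accurate and the other is faster, which is the kind of contrast that a multidimensional presentation keeps visible and a single ranking would hide.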
1.2 The Advantages for the Developers
The comparative element provides a psychological incentive to the participants to deliver the best results possible.
The results are presented and compared in special workshops where the methods used by the participants are discussed and contrasted. The performance of the systems is not the major output of the evaluation exercise. More importantly, the common metric used and the knowledge gained during the evaluation are shared by the participants and by the funding agencies during these workshops. It may happen that some participants obtain better results because they have used better quality data. Evaluation therefore helps identify better quality data, not only better techniques. It also contributes to assessing the impact that data quality has on system performance.
The developers benefit indirectly from evaluation because complete evaluation toolkits and by-product data become available afterwards.
Objective evaluation helps assess the pros and cons of a solution. It advantageously complements paper publication by pitting scientific ideas against real data common to all the participants. The results reported in papers have sometimes been obtained on specific data, or with specific measures, which are hard to generalize and do not always meet the common evaluation requirements that the metric provides.
1.3 Advantages for the Community at Large
At the same time, the paradigm of evaluation allows the funding agencies to measure if the money they have invested in technology development has led to significant progress and to identify areas where the technology needs further improvement.
The commercial deployers and the end-users will be able to understand where the technology can help them and provide new solutions to the problems they face. The full evaluation program will also provide indications of the applicability of the technology to practical solutions and show the importance of the technology for the society at large.
The US DARPA example is very informative in this regard, as the results obtained for text dictation over the previous years show that it was possible to put dictation systems on the market. For more difficult tasks, such as unconstrained telephone dialogues, the poor results measured during evaluation campaigns show that more investment is still needed for improving speech recognition and dialog robustness.
1.4 Resources
A side effect of evaluation is often the production of high quality resources. Data are distributed to the participants in order to help them with training and testing their systems. As the participants need the data, there is an imperative to provide data of good quality and in due time. After a campaign the data become available to the community.
The availability of metrics and measurement tools alongside the data used for training and testing the systems allows the participants to measure their progress. It also gives institutions that have not participated in a campaign the means to evaluate their own technology. Actors in the domain can thus easily position themselves with respect to the state of the art. The initial effort for newcomers is lessened, making the technology under consideration available to a larger community.
1.5 Role of Evaluation
The experience of DARPA and others shows that the comparative evaluation paradigm should be considered a very powerful tool for research in the field of Language Engineering: the performance of the evaluated technologies and the understanding of the phenomena were significantly improved. The ELSE consortium thinks that this will remain an important factor in the future.
In Language Engineering a shift is taking place from theoretical and rule-based approaches to more empirical and data-driven approaches. Systematic observation of corpora of real-life speech and language is becoming more and more important. The comparative evaluation paradigm fits naturally into this development.
1.6 Technology and Usage evaluation
In ELSE we are interested in the deployment of comparative evaluation for:
1.7 Criticism
Some criticisms have been raised, saying that comparative evaluation may kill innovative ideas because the focus is put on a single approach and the researchers will take only short-term considerations into account. This problem is more likely to occur when a strong competitive element is introduced in the comparative evaluation. The ELSE project, however, attempts to reduce the competitive element of the comparative evaluation.
A second criticism argues that the comparative evaluation can hinder the development of newer technologies because, initially, newer technologies will provide worse results before showing their value. This can be solved by using evaluation campaigns of longer duration and by funding long term research projects with specific meeting points located far enough in the future to allow for the development of new ideas.
A third criticism is related to the choice of the comparison task. It may not necessarily be related to the key function that is needed to determine the practical value of the technologies. Using the evaluation paradigm is like using a very powerful lamp to help find an object in a dark room, but focusing the light on a place where the object is not. A counter-argument would be that not using the evaluation paradigm is like trying to find this object without a lamp.
2. Why
2.1 Why does Language Engineering need Comparative Evaluation?
The ELSE consortium has reached the conclusion that the Language Engineering research that has the potential to lead to commercial success in the short term is based on data. At this moment we have no applicable theory that allows us to deduce properties from first principles. Therefore, evaluation is required for validating hypotheses, for assessing progress and for choosing between alternatives.
The choice of criteria and metrics for the comparison is empirically deduced from the needs of the field and not from a theory. The successful usage of a metric is determined by the agreement of the actors in the field upon that choice. Many examples are needed to establish the key parameters of the metric. Comparative evaluation forces the agreement and provides the examples. The major success of the US DARPA evaluation campaign, the recognition of the value of the HMM approach, is due to such an agreement on such a metric.
Furthermore, Language Engineering displays a paradoxical property. In many areas the state of the technology has reached a level barely sufficient to be usable in practice. Nevertheless, many commercial language based applications do exist (e.g. machine translation, text summarization, dictation, spoken dialogue systems). Comparative evaluation could help clear up the issues, where the advertised performance claims are difficult to assess and to compare objectively.
2.2 Why at European Level?
The major reason to have an international dimension is that the technology is international and multilingual. All the major developers and suppliers work on several languages, even if there are few real multilingual applications. Furthermore, all the major suppliers operate world-wide. Even if they do not take part in the evaluation campaign, they need the adequate infrastructure that comparative evaluation provides.
As the applications that are built for the end-users are often monolingual, one could argue that evaluation campaigns should be organized either nationally or in a linguistic region. The French national evaluation programs (e.g. GRACE and FRANCIL) have achieved positive results but an international dimension is necessary to obtain the desired impact. It is also interesting to see that national research programs like VERBMOBIL in Germany or the NWO priority program in The Netherlands show very clearly that co-operation at European and international level is necessary as far as research is concerned. Also, the current development of language technologies makes the porting of technology across languages more and more frequent in the field, leading to a greater need for multilingual comparative evaluation:
Moreover, most European language markets are too small to allow for proper evaluation programs. A language with relatively few speakers (e.g. Danish, Dutch) can only rely on European co-operation to organize the evaluation campaign that it needs. It is obvious that a language with no or inferior supporting computer systems will suffer in the competition between cultures. After movies, television and music, computer systems could become the next battlefield for the expansion or contraction of the European cultures. By organizing evaluation campaigns with an emphasis on multilinguality, the European Commission will support multiculturalism as it has done with the support of research programs and resources. Note also that these markets will advance to the forefront of the economic battlefield once the markets of more widespread languages have been saturated. It is important to be ready before competition arrives in force, and comparative evaluation is a very good way to stimulate the field.
Finally, comparative evaluation is another way to make researchers of different countries communicate and so forge a stronger European community for language research.
3. What kind of evaluation?
3.1. Concepts
It is important to mention the differences that the ELSE project sees between competition, validation and evaluation in relation to the specification activities. The purpose of specification in this context is to determine, before implementation, the set of criteria used in the assessment activities and the reasons behind their choice.
Because many different criteria must play an important role, the ELSE consortium feels that the evaluations are multidisciplinary and that strict competitions are counterproductive.
The difficulty of a competition is best exemplified by the comparative evaluation of translation systems. First of all, the quality of a translation is subjective. Secondly, it is not clear how to compare a cheap translator giving ungrammatical output with an expensive one which produces correct output. One can only validate their fitness for their intended use through explicit criteria.
Validation is usually performed for one product or one application with one set of requirements, whereas in evaluation several technologies or systems are compared with a set of criteria. Depending on the situation, both can be conducted on the same system. ELSE addresses the evaluation of technologies.
3.2 Different Types of Evaluation
Evaluation on a large scale is needed, but which kind? Looking at the whole development lifecycle of a technology, a few stages exist, each requiring the use of a particular type of evaluation. The ELSE consortium has identified the following five types (the first four are related to a stage, the fifth to all stages):
User-oriented evaluation is used in all five types when considerations of end-user perception and behavior are included in the evaluation, e.g. the speed of speech, the acceptance of an interaction mode.
3.3 Relationship between Basic Research Evaluation and Technology Evaluation
Basic research evaluation can best be performed for new concepts that are expected to replace older ones. It tries to show whether a concept is viable and whether it provides a significant improvement over existing methods. When possible, basic research evaluation can use previous results of technology evaluation to validate that the novelty under consideration brings an improvement.
3.4 Relationship between Technology Evaluation and Usage Evaluation
Both the technology and the usage evaluation use the results of a control task to perform the evaluation. Their main difference lies in the presence or absence of end-user considerations in this task: both try to establish how a technology performs this task, but usage evaluation is also concerned with the usefulness of the task for the end-user and the related usability aspects for the systems under assessment.
The task used in technology evaluation is simplified and a number of its features are abstracted with the obvious risk that this abstraction has gone too far: what is measured then becomes irrelevant for any deployment. Usage evaluation must take into consideration the attributes of the system that are essential for usability but not necessarily related to the technology itself: some measures then become irrelevant for the technology.
Significant problems remain after the technology evaluation. These problems are not technological at their core. Because natural language is close to the human psyche, the behavior of the users and their reaction to the technology have a significant influence on the performance in actual field conditions. In order to be reliable, the usage evaluation must be exercised in real conditions with different environments, applications, languages and cultures.
The more one looks towards field aspects, the more the number of parameters to take into account increases, and the parameters themselves become more and more context specific. As we move from technology-oriented considerations towards field considerations, the complexity and size of the search space defined by the possible interactions of the input parameters increase drastically. One of the key results of the usage evaluation will be a reduction of this complexity, with an understanding of which parameters have the largest impact on the users as far as usability is concerned.
The performance of a technology (as measured by the technology evaluation) tends to evolve through thresholds: an improvement of a technology inside an interval between two thresholds is not perceptible in usage evaluation. This interaction between technology and usage evaluation determines, through these thresholds, decision points for the industrial deployments of the technology: every new threshold that is passed defines a new class of applications that may successfully be deployed (of course marketability is another issue).
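The threshold behavior described above can be sketched as follows. This is only an illustration under stated assumptions: the threshold values and the application class names are hypothetical placeholders, not figures from ELSE or from any evaluation campaign. A technology score is mapped to the application classes whose threshold it has passed, so two scores inside the same interval yield the same outcome.

```python
from bisect import bisect_right

# Hypothetical thresholds on a technology score in [0, 1], in increasing order,
# and the application class that each threshold unlocks (illustrative names only).
THRESHOLDS = [0.70, 0.90, 0.97]
APPLICATION_CLASSES = ["class 1", "class 2", "class 3"]


def deployable_classes(score):
    """Return the application classes whose threshold the score has reached."""
    passed = bisect_right(THRESHOLDS, score)  # number of thresholds <= score
    return APPLICATION_CLASSES[:passed]


# 0.75 and 0.85 lie between the same two thresholds: the improvement is real
# at the technology level but changes nothing at the deployment level.
print(deployable_classes(0.75))
print(deployable_classes(0.85))
# Passing the next threshold opens a new class of applications.
print(deployable_classes(0.92))
```

A step function is the simplest model consistent with the description; a real usage evaluation would have to establish where the thresholds lie, and marketability remains a separate question.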
Ideally technology evaluation should be able to predict the results of usage evaluation, especially in those cases where it is cheaper.
Technology evaluation and usage evaluation are complementary. Both are needed as they each provide one part of the technical answer for assessing progress and for selecting a technology for a given application.
At the beginning of the life cycle of a technology, one first expects to perform only technology evaluation, then technology evaluation and usage evaluation together, and, when the technology has matured, only usage evaluation, until it is replaced by validation once a standard has been established.
3.5 Relationship between Technology Evaluations and Impact Evaluation
The relationship between technology and impact evaluation is difficult to appreciate. These types of evaluation occur at distant points in the development lifecycle.
However, usage evaluation is closer in time to the full deployment of the technology. Through the involvement of the end-users, it can predict the possible impact of the technology on the end-user as consumer or citizen. The link, however, remains difficult and risky to establish. It is only after several years that the socio-economic consequences which follow the recognition of an emerging technology can be fully appreciated.
Very often the prediction proves to be wrong, e.g. the predicted decrease in paper consumption as a result of adopting computer technology for office work. On that score, the caution with which the main holders of speech recognition technology approached the market once the technology had passed the first trial of technology evaluation is characteristic. They knew the market was there, they knew that the technology had reached a sufficient level of performance (at least for a single speaker in office conditions), but they also knew that the wrong market approach would kill the goose that lays the golden eggs.
Depending on where we are located in the development lifecycle of a given technology, it would be more appropriate to talk in the early stages of Impact Prospective Analysis and later on of Impact Assessment.
3.6 Relationship between Technology Evaluations and Program Evaluation
As program evaluation contains a sum of all the other types of evaluation, the relationship between Technology and Usage Evaluations on the one hand and program evaluation on the other hand is a straightforward one. Technology Evaluation is one term of the sum, and Usage Evaluation is another. They contribute one part of the general picture, namely the progress achieved by a given technology during the program. The progress can be quantified by several aspects of the results produced by Technology and Usage evaluation, e.g. the performance improvement, the increase in the number of participants, the higher diversity of their origins, the increase in the number of languages handled for a given control task, the number of applications and environments where systems are deployed.
Naturally the relationship between progress and program quality is not linear. In conclusion, we could say that technology and usage evaluations provide some useful indicators for program evaluation, but not all of them.