ELSE LE4-8340

Evaluation in Language and Speech Engineering

Executive Summary of a Blueprint for a General Infrastructure for Natural Language Processing Systems Evaluation Using Semi-Automatic Quantitative Black Box Approach in a Multilingual Environment.

Editors: Patrick Paroubek, Marc Blasband

Contributors: Niels Ole Bernsen, Marc Blasband, Nicoletta Calzolari, Jean-Pierre Chanod, Khalid Choukri, Laila Dybkjær, Robert Gaizauskas, Steven Krauwer, Isabelle de Lamberterie, Joseph Mariani, Klaus Netter, Patrick Paroubek, Martin Rajman, Antonio Zampolli

Document first version date:

May 5th 1998

Document date:

June 10th 1999

Document ID:

EXEC-SUM-3

Version:

3.2

Document type:

Executive Summary

Document status:

Final

Contents

1. Introduction

1.1 The ELSE Project.

1.2 Preamble

1.2.1 What is Comparative Evaluation?

1.2.2 The Advantages of Comparative Evaluation.

1.2.3 Resources.

1.2.4 Role of Evaluation.

1.2.5 Technology and Usage Evaluation.

2. Background

2.1 Why?

2.1.1 Why does Language Engineering need Comparative Evaluation?

2.1.2 Why at European Level?

2.2 What Kind of Evaluation?

2.2.1 Concepts.

2.2.2 Different Types of Evaluation.

2.2.3 Relationship between Basic Research Evaluation and Technology Evaluation.

2.2.4 Relationship between Technology Evaluation and Usage Evaluation.

2.2.5 Relationship between Technology Evaluation and Impact Evaluation.

2.2.6 Relationship between Technology Evaluation and Program Evaluation.

2.3 A Bit of History

2.3.1 The Evaluation of Speech in the USA.

2.3.2 Tools as By-products of the Technology Evaluation Campaigns.

2.3.3 Contrasts between USA and Europe.

2.3.4 Subsidiary Control Tasks, Secondary Tasks, Hubs and Spokes.

2.3.5 Lessons from the History.

2.4 Criticism

2.4.1 Innovative Ideas.

2.4.2 The Control Task.

3. Proposal

3.1 The Objectives of Evaluation.

3.1.1 The Objectives of Evaluation for the Developers.

3.1.2 The Objectives of Evaluation for the Community at Large.

3.1.3 Language Engineering's Current Need for Data.

3.1.4 The Contribution of Usage Evaluation.

3.2 Structure of a Campaign.

3.2.1 The Control Task.

3.2.2 Baseline and Metrics.

3.2.3 Basic Requirements for Evaluation Data.

3.3 What?

3.3.1 Six Candidate Control Tasks for Technology Evaluation.

3.3.2 Data Resources.

3.3.3 Complementary Usage Evaluation.

3.3.4 One Control Task: NODE (News On Demand Evaluation).

3.4 How?

3.4.1 Clustering Considerations.

3.4.2 Multilingualism.

3.4.3 Phases of Evaluation.

3.4.5 Results Computation.

3.5 Resources.

3.5.1 Evaluation Data Lifecycle.

3.5.2 Budget Estimates for Technology Evaluation.

3.5.3 Budget Estimates for Usage Evaluation.

Annex 1 Some European Examples of Comparative Quantitative Black Box Evaluation

A1.1 The ARCs (Actions de Recherche Concertées) of the Aupelf-Uref

A1.2 GRACE (Grammars and Resources for Analysers of Corpora and their Evaluation)

A1.3 SENSEVAL/ROMANSEVAL (Word Sense Disambiguators Evaluation)

Annex 2 - Thirty-one Candidate Control Tasks

Annex 3 - Practical Considerations for Implementation

A3.1 The Need for a Permanent Infrastructure

A3.2 Selection of Evaluators and Participants

A3.3 Integrating Evaluation in the Call for Proposals

A3.4 Evaluation in a Multilingual Context

A3.5 Proactive or Reactive Approach?

References

 

1. Introduction

1.1 The ELSE Project.

The ELSE project (Evaluation in Language and Speech Engineering) has a contract from the European Commission to study the possible implementation of comparative evaluation in Europe. This management summary introduces the concepts, describes existing comparative evaluations (e.g. USA DARPA, GRACE, SENSEVAL) and proposes approaches for implementation in Europe.

1.2 Preamble

1.2.1 What is Comparative Evaluation?

Comparative evaluation in language engineering has been used as a basic paradigm in the USA DARPA program on human language technology since 1984. Since then, other enterprises based on the same paradigm have been conducted in Europe, both at national and at European level, but on a smaller scale and over a limited time.

In comparative evaluation, a set of participants compare the results of their systems on the same or similar control tasks and related data, using metrics that have been agreed upon. Usually this evaluation is performed in a number of successive evaluation campaigns, with more complex data in each campaign. For every campaign, the results are presented and compared in special workshops, where the methods used by the participants are also discussed and contrasted.
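As a purely illustrative sketch of this scheme, the Python fragment below scores several systems against one shared reference with one agreed metric. The tagging task, the system names, the data and the choice of accuracy as the metric are all assumptions made for the example; none of them comes from an actual campaign.

```python
from typing import Callable, Dict, List


def accuracy(reference: List[str], hypothesis: List[str]) -> float:
    """Fraction of items on which a system output agrees with the reference."""
    if len(reference) != len(hypothesis):
        raise ValueError("output must align with the reference item by item")
    if not reference:
        raise ValueError("reference must not be empty")
    matches = sum(r == h for r, h in zip(reference, hypothesis))
    return matches / len(reference)


def score_submissions(
    reference: List[str],
    submissions: Dict[str, List[str]],
    metric: Callable[[List[str], List[str]], float] = accuracy,
) -> Dict[str, float]:
    """Apply the same agreed metric to every participant's output."""
    return {name: metric(reference, output) for name, output in submissions.items()}


# Hypothetical control task: part-of-speech tagging of a five-word sentence.
reference = ["DET", "NOUN", "VERB", "DET", "NOUN"]
submissions = {
    "system_a": ["DET", "NOUN", "VERB", "DET", "VERB"],
    "system_b": ["DET", "NOUN", "VERB", "DET", "NOUN"],
    "system_c": ["NOUN", "NOUN", "VERB", "DET", "VERB"],
}

scores = score_submissions(reference, submissions)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Because every output is scored on the same data with the same function, the figures are directly comparable. The sketch deliberately returns a table of scores rather than a ranking, since the workshop discussion of methods matters more than the order of the participants.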

The ELSE proposal differs from the USA DARPA-competition in three ways:

1.2.2 The Advantages of Comparative Evaluation.

The experience with comparative evaluation in the USA and in Europe has shown that the approach has significant advantages.

The performance of the systems is not the major output of the evaluation exercise. More importantly, the evaluation yields common knowledge for the participants and the funding agencies about the tasks, the metrics and the techniques with which the problems can best be solved. The objectivity of the evaluation helps in assessing the pros and cons of a technique.

For the stakeholders, evaluation has the following advantages:

The USA DARPA example is very informative in this regard, as the results obtained for text dictation over the previous years show that it was possible to put dictation systems on the market. For more difficult tasks, such as unconstrained telephone dialogues, the poor results measured during evaluation campaigns show that more investment is still needed.

1.2.3 Resources.

A side effect of evaluation is often the production of high quality resources. Data are distributed to the participants in order to help them with training and testing their systems. As the participants need the data, there is an imperative to provide data of good quality and in due time.

The availability of metrics and measurement tools alongside the data used for training and testing the systems allows the participants to measure their progress. The data can be distributed in the community after the campaign is completed and re-used as training material in other campaigns.

1.2.4 Role of Evaluation.

The experience of DARPA and others shows that the comparative evaluation paradigm should be considered a very powerful tool for research in the field of language engineering: both the performance of the evaluated technologies and the understanding of the phenomena were significantly improved. The ELSE project expects that this will remain an important factor in the future.

In language engineering a shift is taking place from theoretical and model-based approaches to more empirical and data-driven approaches. Systematic observation of corpora of real-life speech and language is becoming more and more important. The comparative evaluation paradigm fits naturally into this development.

1.2.5 Technology and Usage Evaluation.

In ELSE, we are interested in the deployment of comparative evaluation for both Technology Evaluation and Usage Evaluation.

2. Background

2.1 Why?

2.1.1 Why does Language Engineering need Comparative Evaluation?

The ELSE consortium has reached the conclusion that language engineering research with the potential to lead to commercial success in the near future is based on data. At this moment we have no applicable theory that allows us to deduce properties from first principles. Therefore, evaluation is required for validating hypotheses, for assessing progress and for choosing between alternatives.

The choice of criteria and metrics for the comparison is empirically deduced from the needs of the field and not from a theory. The successful usage of a metric is determined by the agreement of the actors in the field upon that choice. Many examples are needed to establish the key parameters of the metric. Comparative evaluation forces the agreement and provides the examples. The major success of the USA DARPA evaluation campaign, the recognition of the value of the HMM approach, is due to such an agreement on such a metric.

Furthermore, language engineering displays a paradoxical property. In many areas the state of the technology has reached a level barely sufficient to be usable in practice. Nevertheless, many commercial language-based applications do exist (e.g. machine translation, text summarization, dictation, spoken dialogue systems). Comparative evaluation could help clear up this situation, in which advertised performance claims are difficult to assess and compare objectively.

2.1.2 Why at European Level?

The major reason to have an international dimension is that the technology is international and multilingual. All the major developers and suppliers work on several languages, even if there are few truly multilingual applications. Furthermore, all the major suppliers operate world-wide. Even if they do not take part in the evaluation campaigns, they need the adequate infrastructure that comparative evaluation provides.

As the applications that are built for the end-users are often monolingual, one could argue that evaluation campaigns should be organized either nationally or within a linguistic region. The French national evaluation programs (e.g. GRACE and FRANCIL) have achieved positive results. Nevertheless, an international dimension is necessary to obtain the desired impact at the European level. National research programs like VERBMOBIL in Germany or the NWO priority program in The Netherlands also show very clearly that co-operation at the European and international levels is necessary for research.

Moreover, most European language markets are too small to allow for strictly national evaluation. A language with relatively few speakers (e.g. Danish, Dutch) can only rely on European co-operation to organize the evaluation campaigns that it needs. It is obvious that a language with no or inferior supporting computational tools will suffer in the competition between cultures. After film, television and music, computer systems could become the next battlefield for the expansion or contraction of the European cultures. By organizing evaluation campaigns with an emphasis on multilinguality, the European Commission will support multiculturalism, as it has done through its support of research programs and resources.

The porting of technology across languages becomes more and more prominent in the industrial field, leading to a greater need for multilingual comparative evaluation.

Finally, comparative evaluation is another way to make researchers of different countries communicate and so forge a stronger European community for language research.

2.2 What Kind of Evaluation?

In analyzing the situation of evaluation, a clear terminology proved to be essential. In this section the ELSE project proposes definitions and key words that will be used throughout this report.

2.2.1 Concepts.

It is important to mention the differences that the ELSE project sees between competition, validation and evaluation in relation to the specification activities. The purpose of specification in this context is to determine, before implementation, the set of criteria used in the assessment activities and the reasons behind their choice.

Depending on the situation, both validation and evaluation can be conducted on the same system. ELSE addresses the evaluation of technologies, as required in the contract with the European Commission.

Because many different criteria must play an important role, the ELSE consortium feels that the evaluations are multidisciplinary and that strict competitions are counterproductive.

Every competition generates popularity and raises interest. The publication of the ranking of the participants in a comparative evaluation campaign has a similar effect, but it brings mainly short term benefits. Only the descriptions the participants make of their systems give an idea of the methods that were used to achieve the result. A system that performs well, because it has been hyped-up, is of much less interest than a system that does not perform as well, but shows a better conception or uses a promising, but not yet mature technology.

Even when the compared systems perform the same function, the differences in environment are such that any form of strict competition is meaningless. The ARISE project showed this very clearly with four systems having the same function: automatically providing train schedule information by telephone. The comparison of the different requirements and the different implementations is far more significant and important for the future of the field.

The difficulty of a competition is best exemplified by the comparative evaluation of translation systems. First of all, the quality of a translation is subjective. Secondly, it is not clear how to compare a cheap translator that gives grammatically incorrect output with an expensive one that produces correct output. One can only validate the fitness of each system for its intended use through explicit criteria.

2.2.2 Different Types of Evaluation.

Evaluation on a large scale is needed, but which kind? Looking at the whole development life-cycle of a technology, a few stages exist, each requiring the use of a particular type of evaluation. The ELSE consortium has identified the following five types (the first four are related to one stage, the fifth to all stages):

  1. basic research evaluation;
  2. Technology Evaluation;
  3. Usage Evaluation;
  4. impact evaluation;
  5. program evaluation.

User-oriented evaluation is used in all five types when considerations of end-user perception and behavior are included in the evaluation, e.g. the speed of speech, the acceptance of a mode of interaction.

In the following chapters, we will only address Technology and Usage Evaluation. In the remainder of this chapter, we shall study the relation between the Technology and Usage Evaluation on the one hand and the impact and program evaluation on the other hand.

 

2.2.3 Relationship between Basic Research Evaluation and Technology Evaluation.

Basic research evaluation is best performed for new concepts that are expected to replace older ones. It tries to determine whether a concept is viable and whether it provides a significant improvement over existing methods. When possible, basic research evaluation can use previous results of Technology Evaluation to validate that the novelty under consideration brings an improvement. Basic research evaluation can then measure:

It does not matter if the corpus and the metric are relatively old as the purpose of the basic research evaluation is to indicate a direction in an objective way.

2.2.4 Relationship between Technology Evaluation and Usage Evaluation.

Both the Technology and the Usage Evaluation use the results of a control task to perform the evaluation. Their main difference lies in the presence or absence of end-user considerations in this task: both try to establish how a technology performs the control task, but Usage Evaluation is also concerned with the usefulness of the task for the end-user and the related usability aspects for the systems under assessment.

The task used in Technology Evaluation is simplified and a number of its features are abstracted from the intended deployment environment. The key problem here is to abstract enough to get rid of the noise introduced in the measures by the specificity of the deployment environment while still remaining faithful to the real issues at stake for the progress of the underlying technology.

Usage Evaluation must take into consideration the attributes of the system that are essential for usability but not necessarily related to the technology itself: some measures then become irrelevant for the technology itself but are more related to ergonomic or even market related issues.

Significant problems remain after the Technology Evaluation. These problems are not technological in their core. Because natural language is close to the human psyche, the behavior of the users and their reaction to the technology have a significant influence on the performance in actual field conditions.

In order to be reliable, Usage Evaluation must be exercised in real life situations with different environments, applications, languages and cultures.

The more one looks towards field aspects, the more the number of parameters to take into account increases, and the parameters themselves become more and more context specific. As we move from technology-oriented considerations towards field considerations, the complexity and size of the search space defined by the possible interactions of the input parameters increase drastically. One of the key results of Usage Evaluation will be a reduction of this complexity, through an understanding of which parameters have the largest impact on usability for the users.

The performance of a technology (as measured by Technology Evaluation) tends to evolve through thresholds: an improvement of a technology within an interval between two thresholds is not perceptible in Usage Evaluation. Through these thresholds, the interaction between Technology Evaluation and Usage Evaluation determines decision points for the industrial deployment of the technology: every new threshold that is reached defines a new class of applications that may successfully be deployed (marketability is another issue).
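The threshold behavior can be pictured with a small Python sketch. The threshold values and the application class names below are hypothetical and chosen only for illustration; the point is that two scores lying between the same pair of thresholds map to the same class, so the improvement between them would not be perceptible at the usage level.

```python
from bisect import bisect_right

# Hypothetical performance thresholds (e.g. recognition accuracy), sorted ascending.
THRESHOLDS = [0.70, 0.85, 0.95]

# Hypothetical application classes: one more entry than there are thresholds.
APPLICATION_CLASSES = [
    "no deployable application",
    "application class A",
    "application class B",
    "application class C",
]


def deployable_class(score: float) -> str:
    """Return the application class unlocked by a Technology Evaluation score."""
    # bisect_right counts how many thresholds have been reached by the score.
    return APPLICATION_CLASSES[bisect_right(THRESHOLDS, score)]


# An improvement from 0.72 to 0.84 stays within one interval: same class.
print(deployable_class(0.72))
print(deployable_class(0.84))
# Reaching the next threshold opens a new class of applications.
print(deployable_class(0.85))
```

In practice the thresholds are not known in advance; locating them is one of the things that joint Technology Evaluation and Usage Evaluation would have to establish.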

Comparative Technology Evaluation ignores the interface of the participating systems. But Usage Evaluation takes these interfaces into consideration. For Usage Evaluation, the measured performances and appreciation do not provide a clear distinction between the influences of the packaging of the system and of its core functionality. On this score, Technology Evaluation can be said to provide a glass box look on aspects that Usage Evaluation tends to handle in a black box oriented manner.

Ideally, Technology Evaluation should be able to predict some results of Usage Evaluation, because it takes place earlier in the lifecycle of a technology. But a large number of experiments are needed in earlier Usage Evaluations to show when these predictions are correct.

Technology Evaluation and Usage Evaluation are complementary. Both are needed as they each provide one part of the technical answer for assessing progress and for selecting a technology for a given application.

In the beginning of the lifecycle of a technology, one first expects to perform only Technology Evaluation, then Technology Evaluation and Usage Evaluation together, and when the technology has matured, only Usage Evaluation until it is replaced by validation once a standard has been established.

2.2.5 Relationship between Technology with Usage Evaluation and Impact Evaluation.

The relationship between technology and impact evaluation is difficult to appreciate. These types of evaluation occur at distant points in the development lifecycle.

However, Usage Evaluation is closer in time to the full deployment of the technology. Through the involvement of the end-users, it can predict the possible impact of the technology on the end-user as consumer or citizen. Such a prediction, however, remains difficult and dangerous to make. It is only after several years that the socio-economic consequences which follow the recognition of an emerging technology can be fully appreciated.

Very often the prediction proves to be wrong, e.g. the decrease in paper consumption expected from adopting computer technology for office work. The caution with which the main holders of speech recognition technology approached the market, once the technology had passed the first trial of Technology Evaluation, is characteristic. They knew the market was there, they knew that the technology had reached a sufficient level of performance (at least for a single speaker in office conditions), but they also knew that the wrong market approach would kill the goose that lays the golden eggs.

Depending on where we are located in the development lifecycle of a given technology, it is more appropriate to speak of Impact Prospective Analysis in the early stages and of Impact Assessment later on.

2.2.6 Relationship between Technology with Usage Evaluation and Program Evaluation.

As program evaluation contains a sum of all the other types of evaluation, the relationship between Technology and Usage Evaluation and program evaluation is a straightforward one. Technology is one term of the sum, field usage is another and socio-economic impact a third term. Technology and Usage Evaluation contribute one part of the general picture, namely the progress achieved by a given technology during the program. The progress can be quantified by several aspects of the results they produce, e.g. the performance improvement, the increase in the number of participants, the higher diversity of their origins, the increase in the number of languages handled for a given control task, the number of applications and environments where systems are deployed.

Naturally the relationship between progress and program quality is not linear. In conclusion, we could say that technology and Usage Evaluations provide some useful indicators for program evaluation, but not all of them.

2.3 A Bit of History

Some key points in the history of Technology Evaluation are presented here (Usage Evaluation has never been performed in a comparative way). This history helps in understanding the present state of affairs, both in the USA and in Europe. The ELSE project hopes to emulate the achieved results in the future.

2.3.1 The Evaluation of Speech in the USA.

In the 1970s and early 1980s, there was no standard measure or protocol for assessing the quality of speech recognition systems. Most technology developers claimed a 99%+ recognition rate, and several very different approaches coexisted in the research community, none offering a visible advantage over the others.

When DARPA started its new campaign in 1984, the Evaluation Paradigm (the comparative quantitative black box approach) was chosen as the backbone of the program. Three years were required to design and implement the first evaluations, which were run in 1987. The first formal tests showed that the general quality of the recognition systems was well below what was actually claimed, and that knowledge-based methods were bested by statistically based ones.

This cleared up the issues, and investment began to flow towards the methods identified as the ones that could lead to usable systems. In the mid-1990s, such systems finally arrived on the market with great success.

The series of evaluation campaigns has demonstrated a global increase of performance (from separate word recognition to continuous speech recognition) in parallel with a full deployment of the technology, which has been taken up by industry and now occupies a whole market sector. To give an idea of the progress made since 1987, it suffices to note that Philips, a firm that actively participated in the DARPA campaigns, is now, ten years later, advertising the FREESPEECH speech recognition system for 39 US$ on the Internet, and one of its competitors, Dragon Systems (also involved in the DARPA evaluations), offers a similar product at 49.99 US$ (August ’98 data).

2.3.2 Tools as By-products of the Technology Evaluation Campaigns.

Whenever comparative evaluation takes place, there is a need to deploy an evaluation software toolkit for validating the input data, for computing the performance results and for displaying them. Building such a toolkit is often costly, particularly when done on an individual basis. Comparative evaluation represents a strong incentive for making such a toolkit available to a given community. For instance, the SMART indexing tool and its successors PRISE and ZPRISE, produced during the TREC campaigns, have now become almost standard toolkits for information retrieval, while the SCLITE tool for speech transcription comparison is widely used. These by-products of evaluations are a very important contribution for industrialists, who do not wish to devote specific resources to developing evaluation tools.
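To give an idea of what the core of a transcription comparison tool computes, the sketch below derives a word error rate from the word-level edit distance between a reference transcription and a system output. It is a minimal illustration written for this summary, not the SCLITE implementation: the function name and the example sentences are assumptions, and a real scoring tool would typically also handle text normalization, alignment reports and per-speaker statistics.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "the train leaves at nine"
print(word_error_rate(reference, "the train leave at nine"))        # one substitution
print(word_error_rate(reference, "train leaves at nine o'clock"))   # one deletion, one insertion
```

Distributing one such scoring function to all participants is what makes their reported figures comparable, which is the role the toolkits described above play in a campaign.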

2.3.3 Contrasts between USA and Europe.

For text related evaluation in the USA, the first MUC evaluation took place in 1987 and the TIPSTER program started in 1991. For speech processing, the first large-scale campaign (Continuous Speech Recognition and Large Vocabulary Continuous Recognition) also dates back to 1987. DARPA and NIST were the two funding agencies behind those campaigns. The American government provided an important budget that came along with evaluation objectives strongly inspired by military or political considerations.

The American campaigns have inspired similar efforts in Europe (e.g. SQALE). But the picture in Europe is less homogeneous for several reasons. The amount of resources which has been devoted to evaluation until now is much less and comes from many different sources:

Furthermore, the diversity of goals and infrastructures behind the different evaluation efforts in Europe is an extra factor adding to this heterogeneity.

It seems that the USA evaluation-based programs followed a top-down approach (the government strongly influenced the campaigns, but provided abundant funding and a long-lasting infrastructure). In Europe, the strategy has been more of a bottom-up one, with various efforts for which ELSE could be a converging point, leading to a more ambitious deployment of the paradigm of evaluation in FP5 as described in chapter 3 of this document.

2.3.4 Subsidiary Control Tasks, Secondary Tasks, Hubs and Spokes.

As the campaigns progress, the control tasks evolve: subsidiary and secondary control tasks emerge. Refining a control task into optional subsidiary control tasks is something that generally develops over the years, as the issues associated with a control task become clearer. New problems, generally arising as satellite issues, are identified and come to the fore, sometimes replacing older ones as focal points of the evaluation.

For instance, in MUC-6 (1995), the sixth of the Message Understanding Conferences sponsored by DARPA, evaluation expanded in three new directions [LH98]:

The subsequent evaluation campaigns have continued to develop and refine their structure through subtasks called hub and spoke organizations.

2.3.5 Lessons from the History.

The most important lesson is that Technology Evaluation can bring significant results by identifying the most promising technology and by showing the rate at which it improves.

The impetus that evaluation gives to all the participants generates better systems, fosters agreements on metrics and measures, and ensures the existence of substantial corpora. The tools generated as by-products of the campaigns have provided significant support to the field.

In the past, the campaigns were successful when the technology had the potential to be improved. Some campaigns showed that the technology had reached a ceiling and that further evaluation would not show enough improvement.

It also became obvious that in order to be successful, a campaign needed the co-operation of the participants and agreements on all major points: protocols, metrics and measures as well as data.

Proposing an organization that refines the control task along various specific dimensions, in a manner that is non-contractual for the participants, is a good way to broaden the scope of a campaign and to prepare the next evaluation campaign. It also contributes to refining the picture that is drawn of the current state of a given technology.

Evaluations were successful in the USA because strong political and governmental support and involvement drove the process and financed a large part of it. It is not obvious whether the same political support can be obtained in Europe.

2.4 Criticism

The evaluation campaigns, especially the campaigns performed in the USA, have been criticized on two major points: the reduction of innovative ideas and the choice of the control task.

2.4.1 Innovative Ideas.

Some critics argue that the comparative evaluation can hinder the development of newer technologies because by necessity newer technologies will provide worse results before showing their value.

This can be solved by using evaluation campaigns of longer duration (two years seems to be a good compromise) and by funding long-term research projects with meeting points that are located far enough in the future to allow for the development of new ideas.

The ELSE project proposes to handle basic research separately, with a specific evaluation mechanism working towards longer-term objectives. In addition, Technology Evaluation results and by-products will also support the development and evaluation of new ideas.

Other critics argue that comparative evaluation may also stifle innovative ideas because the focus is put on a single approach and only short-term considerations will be taken into account by the researchers. This problem tends to occur when a strong competitive element is introduced in the comparative evaluation. The ELSE project, however, attempts to reduce the competitive element of comparative evaluation.

Finally, we think it is not comparative evaluation that kills new ideas, but the use one makes of comparative evaluation. The same could be said of any other scientific tool. To convince oneself of this, it suffices to look back on the dampening effect that the first publication in 1969 of Marvin L. Minsky and Seymour A. Papert’s book entitled "Perceptrons" [MP90] had on the field of neural network computing (this excellent book addressed the limitation of one type of connectionist model which people unfortunately generalized to all connectionist models). The problem lay more in the way we conduct science than in evaluation.

2.4.2 The Control Task.

A last criticism is related to the choice of the control task used for comparison. It may not necessarily be related to the key function that is needed to determine the practical value of the technologies. Using the evaluation paradigm is like using a very powerful lamp to help find an object in a dark room, but focusing the light on a place where the object is not. A counter argument would be that not using the evaluation paradigm is like trying to find this object without a lamp. Moreover, Usage Evaluation will tell towards which corner of the room the lamp should be directed.

3. Proposal

3.1 The Objectives of Evaluation.

With this proposal, the ELSE project attempts to build upon the results already obtained in the past by comparative evaluation and to incorporate new demands from the field while reaching for a wider audience. The underlying factors that could inspire a large scale evaluation effort in Europe are partly scientific but, most of all, economic.

3.1.1 The Objectives of Evaluation for the Developers.

The comparative element provides a particular psychological incentive to the participants to deliver the best results possible.

After each evaluation, a workshop will be held in which the participants will explain their analysis of the task and the techniques with which they have solved it. The performance of the systems is not the major output of the evaluation exercise: more importantly, the common metric used and the knowledge gained during the evaluation will be shared by the participants and the funding agencies during these workshops.

It may happen that better results are obtained by some participants because they have used data of better quality. Therefore, evaluation will help identify better data, not only better techniques. It also contributes to assessing the impact that the quality of data has on system performance.

Objective evaluation will usefully complement paper publications by weighing scientific ideas using real data common to all the participants. The results reported in papers have sometimes been obtained on specific data, or with specific measures, that are hard to generalize and do not always meet the common evaluation requirements of the metric.

The developers will also benefit indirectly from evaluation because complete evaluation toolkits and by-product data become available after the completion of an evaluation campaign. Institutions that have not participated in a campaign can evaluate their own technology in relation to the state of the art by using the resources of that completed campaign.

Another important by-product of the evaluation campaigns is the broad agreement about metrics and measures.

3.1.2 The Objectives of Evaluation for the Community at Large.

The evaluations allow the funding agencies to determine if their investment has led to significant progress. They will help identify areas where the technology needs further improvement.

The commercial deployers and the end-users will be able to understand where the technology can help them and provide new solutions to the problems they face. The full evaluation program will also provide indications of the applicability of the technology to practical solutions and show the importance of the technology for the society at large.

Evaluation is a way to identify promising technology and to show its value to industry, thus speeding up the time required for a concept to become a mass-market product.

3.1.3 Language Engineering's Current Need for Data.

Language engineering needs resources to progress. Right now it is obvious that, for languages other than English, there is a lack of:

  1. Part-Of-Speech tagged corpora and treebanks;
  2. ontologies;
  3. lexicons;
  4. corpora tagged with word senses (taken from a reference dictionary);
  5. large corpora of speech transcriptions (aligned with voice data).

Evaluation provides a partial solution to the problem through the production of standardized, annotated and validated linguistic resources at a low cost, from the data processed by the participants during an evaluation campaign. Evaluation also contributes, from a practical viewpoint, to the definition of the associated standards. This is an essential asset, because without standards the resources would not be usable.

Of course, evaluation can directly contribute to the support of evaluation itself. Once the best technology has been identified in a given domain, it can be harnessed to produce annotated data for future evaluation campaigns.

3.1.4 The Contribution of Usage Evaluation.

The ELSE project proposes to complement Technology Evaluation with Usage Evaluation in order to take more market-oriented considerations into account as well. The specific goals that the ELSE consortium thinks Usage Evaluation should achieve are listed below.

Usage Evaluation will clearly show the value of a technology for the user and will make it possible to measure progress in this direction over several campaigns.

Usage Evaluation will provide clear directions to the choice of criteria for Technology Evaluation. Indeed, this choice is empirical and should be determined by the usage of the technology and not by the technology itself. Usage Evaluation will clarify the relation between technology and usage by answering a number of questions:

The Usage Evaluation that the ELSE project proposes will break new ground, because of its innovative aspect. It will answer questions like:

Because it is closer to the usage of the technology, Usage Evaluation will support its commercial deployment and show which key elements influence successful deployment.

3.2 Structure of a Campaign.

Within one campaign, the same control task is performed by all participants. Their results are the basis for the comparisons.

To define these control tasks, ELSE proposes to use the abstract architecture of a generic application that covers all the aspects that language technology needs to address nowadays.

The generic application we will use as a reference frame for defining specific evaluations will be a cross-language intelligent information extraction system. Here information extraction is meant in a broad sense, encompassing both the classical meaning of Information Extraction (IE), i.e. template filling from documents, and Information Retrieval (IR), i.e. document selection. Such a system would have multi-modal input and output and would be able to adapt its behavior intelligently to a particular query. We will use the architecture of this generic application to communicate and explain the relationship between the various evaluation tasks that we intend to propose, each evaluation task corresponding to an abstract functionality or module of the architecture. The various components of the architecture will be developed along the following three activities, which correspond to different segments of the loop that language information would follow when a user interacts with the generic application (from the user to the application, then back to the user):

  1. Information Profiling (data analysis, e.g. input speech signal transcription);
  2. Information Querying (dialog management issues and mapping results to query, e.g. document selection);
  3. Information Presentation (output modality selection, language generation, e.g. speech synthesis).

Evaluation points can be selected at the input and output of individual modules of such an architecture and also at any point along arbitrary chains of modules. Thus, new evaluation tasks can be defined by linking various modules of the abstract architecture in a braided fashion [KNRGRC95]. The global functionality achieved by the evaluation chain so selected will define the control task to be used for Technology Evaluation. The characteristics (usability requirements) of the Usage Evaluation that could be performed to complement the Technology Evaluation would be drawn from the characteristics of the environment existing at the end points of the chain of modules. Of course, these environment parameters are quite different depending on whether one sees the chain of modules as an embedded module in a larger system or as a stand-alone application. In the latter case, the parameters are more numerous in order to cover additional ergonomic issues.

Our abstract architecture is very much like the new DARPA COMMUNICATOR evaluation paradigm [COM98], where real information software (derived from the JUPITER system [JPSSJGTH98], developed at MIT) will be distributed to all the participants for module development or improvement. However, in our case the architecture is not a real one but an abstract one, and is used only as a reference framework for linking the various evaluation tasks.

3.2.1 The Control Task.

A control task is the function that the participating systems perform during evaluation with the conditions under which this function must be performed (e.g. for parser evaluation a control task could be bracketing of the constituents in texts of minimal size). The common evaluation protocol will use quantitative black box metrics deployed around the control task.

In addition, we put the following generic requirements on the definition of the control task:

3.2.2 Baseline and Metrics.

For each control task, a baseline performance level may be determined either by a straightforward implementation of a basic algorithm or on the basis of economic considerations (e.g. currently, for optical character recognition the economic threshold is 99.7% error-free performance; below this level, it is cheaper to resort to keyboarding). Because it is representative of the state of the field and of the difficulty of the task, we think that a baseline approach should always be part of the results, to provide a contrastive point of view on the systems' performance.

Sometimes human intervention in the assessment of result quality cannot be avoided. But whatever the assessment procedure, the production of evaluation results should be automated as much as possible, so that it is easily reproducible (a guarantee of the transparency of the protocol). Several different testers should be used, as they will not always agree even when they apply the same evaluation criteria (inter-tester agreement statistics such as the "kappa" statistic are extremely useful for measuring this). Note that the less technical a control task is, the harder it is to provide a reliable metric for it, as quality criteria and performance measures tend to be based on subjective value scales (e.g. text summarization, translation, speech synthesis). In general, Usage Evaluation requires many more subjective parameters, and better quality results are obtained when the testers are drawn from the intended end-user population.
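As an illustration of an inter-tester agreement statistic, the following minimal sketch computes Cohen's kappa for two testers who judged the same items. The choice of Cohen's two-tester variant, the function name and the sample judgements are assumptions made for illustration; the text only refers to "kappa" statistics in general.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two testers who judged the same list of items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both testers must judge the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both testers.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both testers pick the same label
    # independently, estimated from each tester's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both testers used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical quality judgements on eight system outputs.
tester_1 = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
tester_2 = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]
print(cohen_kappa(tester_1, tester_2))  # 0.5
```

Here the testers agree on six items out of eight (0.75), while chance alone would give 0.5, so kappa is 0.5. A value of 1 means perfect agreement and a value near 0 means agreement no better than chance.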

3.2.3 Basic Requirements for Evaluation Data.

Linguistic resources are needed to build and annotate the reference data set used to compare the participating systems. For some tasks, the resources may already exist, but if we were to perform competitive evaluation, they could only be considered as training material and not as reference material (it is very likely that the potential participants will already have had access to the material).

Concerning the annotations themselves, human intervention will necessarily be required. Otherwise, it would mean that the task could be properly carried out by an automaton and that evaluation would then be unnecessary. This last remark puts the focus on one of the reasons behind the high cost of evaluation for language engineering. To build the reference data set, we need human intervention. These data are used for comparing data produced by automatic means. As computer capabilities keep on increasing in terms of speed and the amount of data handled, comparing several automatic approaches requires more and more data in order to exhibit significant differences. Despite the help that can be brought by dedicated software, the rate at which humans can produce data is almost constant because of inherent biological limitations (e.g. there is a maximum rate at which one can transcribe audio data).

A way to limit human intervention in building reference data sets consists of:

The size of the data sampled for evaluation is then defined by two contradictory requirements. It should be:

3.3 What?

3.3.1 Six Candidate Control Tasks for Technology Evaluation.

The main areas of language engineering that are current central preoccupations of researchers and developers, are [MLIM98]:

The RTD priorities for Human Language Technology in FP5 listed in [HLT98] are:

Considering the current state of the domain, annex 2 shows a list of 30 possible candidate control tasks, out of which we have pre-selected the following six control tasks. They could be used for the first Technology Evaluation campaigns. We would like to see this list validated by the actors of the domain in general and in particular by key representatives of the language engineering industrial sector.

 

The criteria for selecting these tasks were:

Finer selection criteria will have to be applied when implementing these control tasks. The criteria should at least be based on the potential number of participating systems and on the linguistic resources available when starting the evaluation campaign.

  1. Broadcast News Transcription;
  2. Cross-Lingual Information Retrieval / Extraction [GG97];
  3. Text To Speech Synthesis [RS97];
  4. Text Summarisation;
  5. Language Model Evaluation. (Word Prediction Task);
  6. All or a selection of the following techniques: Part-Of-Speech tagging, Lemmatisation, Analysis of Syntactic Functional Relations, Word Sense Disambiguation.

Comparing these six tasks with the researchers' preoccupations and the priorities of FP5 gives the following two tables. These points of interest are reasonably well addressed.

 

 

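Control Task                                      Multilinguality  Interactivity  Digital Content
------------------------------------------------  ---------------  -------------  ---------------
Broadcast News                                    X                -              X
Cross-Lingual Information Retrieval / Extraction  X                X              X
Text To Speech Synthesis                          -                X              X
Text Summarisation                                X                -              X
Language Model Evaluation                         -                -              X
Technique                                         X                -              X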

 

 

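Control Task                                      Text  Speech  Image  Mono/Multilingual
------------------------------------------------  ----  ------  -----  -----------------
Broadcast News                                    -     X       -      Mono
Cross-Lingual Information Retrieval / Extraction  X     -       -      Multi
Text To Speech Synthesis                          -     X       -      Mono
Text Summarisation                                X     -       -      Mono
Language Model Evaluation                         X     X       -      Mono
Technique                                         X     -       -      Mono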

 

This table shows the relation between the six control tasks and their multimedia and multilinguality aspects.

Naturally, the previous list contains very broadly scoped control tasks. According to the needs, the tasks could be refined into more specific subtasks, or implemented in conjunction with other correlated subsidiary control tasks.

3.3.2 Data Resources.

 

Control Task                                      By-product Data Resources
------------------------------------------------  ---------------------------------------------------------------------------------
Broadcast News                                    Text transcription of speech signal (possibly time-aligned).
Cross-Lingual Information Retrieval / Extraction  Multilingual query/document pairs.
Text To Speech Synthesis                          Speech signal for a text.
Text Summarisation                                Document and summary pairs.
Language Model Evaluation                         Word predictions (e.g. probability tagging).
Technique                                         Text with Part-Of-Speech tags, Lemmas, Syntactic annotations and Word Sense tags.

 

The evaluation of these six tasks will produce data resources (see table above). Out of the 30 possible reuses of these resources between the six control tasks, 17 are actually feasible: in each of these cases, the data produced by one evaluation are of interest to another evaluation. When data are reused between two control tasks, it is important to remember that this entails scheduling constraints for the two evaluation campaigns: the second evaluation cannot start until the data produced by the first evaluation have been completely processed.

 

Producer / Consumer                               BNT    CLIR   TTS    SUMZ   LM     TECH
------------------------------------------------  -----  -----  -----  -----  -----  -----
Broadcast News                                    Reuse  Reuse  -      Reuse  Reuse  Reuse
Cross-Lingual Information Retrieval / Extraction  Reuse  -      -      -      -      -
Text To Speech Synthesis                          -      -      Reuse  -      -      -
Text Summarisation                                -      Reuse  Reuse  Reuse  Reuse  Reuse
Language Model Evaluation                         -      Reuse  Reuse  Reuse  Reuse  Reuse
Technique                                         -      Reuse  Reuse  Reuse  Reuse  Reuse

 

It is obvious that the data of one campaign can be used to start the following one for the same control task (diagonal of the previous table).
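The figures of 30 possible and 17 actual reuses can be checked with a short sketch. The reuse sets below are one reading of the producer/consumer table, in which each producer row lists the tasks that can consume its data; the placement of the single entry in the Cross-Lingual Information Retrieval / Extraction row is an assumption, chosen because it is the reading that matches the stated total.

```python
# Abbreviations follow the column headers of the producer/consumer table.
TASKS = ["BNT", "CLIR", "TTS", "SUMZ", "LM", "TECH"]

# reuse[producer] = set of consumer tasks (an assumed reading of the table).
reuse = {
    "BNT": {"BNT", "CLIR", "SUMZ", "LM", "TECH"},
    "CLIR": {"BNT"},
    "TTS": {"TTS"},
    "SUMZ": {"CLIR", "TTS", "SUMZ", "LM", "TECH"},
    "LM": {"CLIR", "TTS", "SUMZ", "LM", "TECH"},
    "TECH": {"CLIR", "TTS", "SUMZ", "LM", "TECH"},
}

# Reuse *between* tasks excludes the diagonal (a task feeding its own next campaign).
possible = len(TASKS) * (len(TASKS) - 1)
actual = sum(
    1 for producer in TASKS for consumer in reuse[producer] if consumer != producer
)
print(possible, actual)  # 30 17

# Each cross-task reuse is also a scheduling constraint: the consumer campaign
# cannot start before the producer campaign's data have been processed.
constraints = sorted(
    (producer, consumer)
    for producer in TASKS
    for consumer in reuse[producer]
    if consumer != producer
)
```

The `constraints` list makes the scheduling dependencies explicit as (producer, consumer) pairs, which could serve as input when ordering campaigns.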

3.3.3 Complementary Usage Evaluation.

Related to four of the six control tasks for Technology Evaluation, meaningful stand-alone control tasks for Usage Evaluation could be the following:

 

Control Tasks for Technology Evaluation           Control Tasks for Usage Evaluation
------------------------------------------------  ------------------------------------------
Broadcast News                                    Transcription of Virtual Meetings
Cross-Lingual Information Retrieval / Extraction  Multimodal tourist information
Text To Speech Synthesis                          Text-to-speech for the blind
Text Summarisation                                Text summarisation of financial newspapers
Language Model Evaluation                         -
Technique                                         -

 

 

These tasks were chosen with the knowledge that applications or prototypes exist. A major difficulty will be to find the participants for these stand-alone control tasks for Usage Evaluation.

If participants cannot be found, the value of these stand-alone tasks should be reconsidered: they may not represent any real demand, although it is also possible that the developers of the systems simply do not wish to participate.

For the two remaining control tasks (Language Model Evaluation and Technique), specific Usage Evaluation criteria will have to be found, based solely on embedded module functionality (e.g. processing speed, language coverage, etc.).

The differences between language, culture, environment and application will be parameters of the comparison process.

1. Transcription of Minutes of Virtual Meetings

We propose to use the following usage criteria as comparison points:

Ideally the test will be the transcriptions of real useful meetings with participants who want to achieve something during that meeting.

 

 

2. Multimedia Tourist Information

Multimedia tourist information systems can be compared with the following user-oriented criteria:

3. Text to Speech for the Blind

Text to speech systems for the blind can be compared with the following user-oriented criteria:

4. Summarization

Deployed summarization systems for a given domain or application can be compared with the following user-oriented criteria:

As in the other cases, the performance of deployed systems is expected to be evaluated with real users and real demands.

General points

In all cases multidimensional comparisons are necessary to cover the complexity of the tasks. What is proposed is comparison, not competition on a single criterion.

To complement Technology Evaluation, other stand-alone control tasks can be devised for usage-oriented evaluation; the ones above are given as indications of what could be done.

3.3.4 One Control Task: NODE ( News On Demand Evaluation).

Instead of these six control tasks, one task can be proposed that covers the whole spectrum of activities: news on demand. This task searches multimedia material for information that is relevant for a given query. The purpose of news on demand is to provide archived broadcast news material. A query is formulated and video excerpts from past material are extracted from archive databases.

News on demand encompasses the major research directions that were identified:

This control task will also show how the priorities of FP5 are worked out in the field:

Natural interactivity to handle the details of the query and the navigation in the space of the response;

Finally, this control task is itself a stand-alone control task for Usage Evaluation and thus allows Usage Evaluation to be performed in a very natural way.

3.4 How?

Within the objectives of the ELSE project, the embedding of evaluation in the practical organization of the HLT research projects had to be examined. The project recognizes two modes of operation:

The proactive evaluation can be organized and prepared before the proposals of the first call of FP5 are accepted. The reactive evaluation campaigns can only be started when the accepted proposals are known and the clustering of projects is completed.

3.4.1 Clustering Considerations.

Project clustering was initiated in the course of FP4 in order to contribute to the objectives of the program, and to the achievement of the performance criteria laid down for the Telematics Application Program [JD95] (as FP5 is not very explicit on clustering, the data of FP4 is used here). The purposes of the program were:

In a broader perspective, project clustering was also meant to support the long-term objectives of the LE sector, which are mostly motivated by market considerations:

Possible factors which have been considered to organize the clustering of projects, are [LGLK98]:

Of these five factors, market opportunity was identified as the most appropriate basis for clustering. Grouping by technology had been termed to be "very attractive" in terms of project cross-fertilization and quality improvement. However, it was deemed impractical because of the large number of technology combinations, and not rewarding enough as concerns the target user community [JD95].

Evaluation can provide an important inter-project and inter-cluster link and an exchange medium that would contribute to the objectives of the program and the long-term objectives of the sector. It will bring the projects of a cluster together in one common activity, which will force more inter-project communication.

Once a control task has been drafted after identifying a need for validation in a cluster, a careful selection of the features of the control task should be done in order to allow the largest number of systems to participate. As the purpose is comparison and not competition, the metrics should only be defined to measure what the application performs and should not be calibrated to provide fairness between projects or even clusters.

3.4.2 Multilingualism.

For a given evaluation campaign, the problem we face is the following: how to compare the solutions proposed by N different systems for a given problem in M different languages. Performing an evaluation for all possible N×M combinations of languages and systems is not realistic, because such an undertaking would imply producing the evaluation data for all M languages and porting each of the systems to the other M-1 languages.

The idea is to reduce the number of languages to process without losing the significance of the evaluation results.

The ELSE consortium has identified two means to reduce the number of languages one needs to consider to run an evaluation:

The generalization of the evaluation results to languages different from the ones used in the control task could be helped if detailed comparative information about language-specific features and their correspondence across languages were available (e.g. language lineage: French, Spanish, Italian, Portuguese and Romanian all derive from Latin).

The cross-language requirement scheme has the advantage of a more flexible architecture and avoids the problem of choosing a pivotal language. The drawback of this scheme is that it always requires an extra functionality akin to translation. The translation may not be part of the initial functionality under test, when the task is basically monolingual. It may also significantly increase the noise in the evaluation measurements. Furthermore, cross-lingual requirements cannot be successfully applied to tasks which are intrinsically monolingual like speech recognition, speech synthesis, or lemmatization.

The key issue is to find a way to distinguish methodological aspects (which are generic across languages) from the linguistic knowledge (which is specific to a particular language). The evaluation protocol could require that all the participating systems separate these kinds of information (e.g. using a clear-cut distinction between architecture, program and linguistic data). This would still not be sufficient to compare the different programs and is not always practical from an implementation viewpoint.

 

For Usage Evaluation, no solution to multilinguality exists other than a careful selection, across different languages, of the application characteristics, the end-user population types, the specifics of the deployment environment and the usability requirement analysis (particularly performance).

 

 

3.4.3 Phases of Evaluation.

An appropriate duration for the completion of an evaluation campaign seems to be two years. With less, the participants do not have time to capitalize on the results of the previous campaign; with more, their motivation might falter. When starting the first campaign of a new evaluation program, it would be wise to plan for a preliminary one-year period devoted to advertisement, control task awareness build-up, community establishment, preliminary definition of metrics and data selection.

Almost all the American evaluation campaigns have followed the organization described below, which some refer to as the paradigm of evaluation. This is also true for most of the previous European efforts based on a quantitative black-box evaluation protocol.

Ideally, the running of an evaluation campaign should comprise four phases:

Phase 1 - Training

Phase 2 - Dry run

Phase 3 - Tests

Phase 4 - Impact study

 

3.4.5 Results Computation.

Every evaluation campaign should produce multidimensional results in order to better appreciate all the possibilities of the technologies tested (see [MK96] for a discussion of the issue of measurement validity). It also means that the appropriate criteria have to be identified and defined first, e.g. information retrieval uses the measures precision and recall.
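As an illustration of such criteria, the following minimal sketch computes precision and recall for a document selection task, using the standard set-based definitions. The function name and the document identifiers are hypothetical; a real campaign would fix its own reference judgements and scoring protocol.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    retrieved: documents returned by the system under test
    relevant:  documents marked relevant in the reference data
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of the returned documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical system output and reference judgements for one query.
system_output = ["d1", "d2", "d3", "d4"]
reference = ["d2", "d4", "d5"]
p, r = precision_recall(system_output, reference)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

The two figures are deliberately kept separate rather than merged into one score: a system can trade one for the other, which is one reason why a single measure does not suffice.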

Multidimensionality is one of the essential requirements needed for comparing different systems from different application domains. The reality of natural languages and in particular their usage is too complex to be represented by one single measure.

Ideally, the reference formalism ought to be hierarchical in nature, because partial results can then always be computed through incremental filtering, going up the hierarchy towards more generic representations.

3.5 Resources.

3.5.1 Evaluation Data Lifecycle.

To support evaluation on a large scale, we need to develop sound logistics for data collection, construction, annotation, storage, distribution and reuse. There are multiple problems to solve, which are rather remote from the preoccupations of technology developers, e.g. format encoding, distribution copyrights, distribution media and infrastructure, validation, etc.

The number of potential data providers and consumers is very large and keeps on increasing. Therefore, it is impractical to develop ad hoc solutions for every campaign separately. In Europe, ELRA currently plays an important role [KC98] concerning the collection, production, validation and distribution of linguistic resources. Because of its current activities and assets, ELRA is a good candidate for playing an essential role in the envisioned evaluation infrastructure.

3.5.2 Budget Estimates for Technology Evaluation.

There should be a 100% funding policy by the EC, as the evaluation campaigns are infrastructural by nature. The average cost for each of the six candidate control tasks (tasks 1 and 2 require many more resources than tasks 5 and 6) is estimated at 600 KEURO, corresponding to a two-year campaign. The estimates of the averages per task are:

Activity per task                                               KEURO
--------------------------------------------------------------  -----
Production of necessary language resources (up to 4 languages)  180
Organisation                                                    90
Participants (estimated at 10 participants at 30 KEURO each)    300
Supervision                                                     30
Total for one task                                              600

 

For the six tasks proposed, the total cost would therefore be 3.6 MEURO. These estimates were made while taking the American and three European evaluation campaigns into account.
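The arithmetic behind these figures can be retraced with a few lines. The amounts are the per-task averages given in the budget table; the variable names are illustrative only.

```python
# Average cost items for one Technology Evaluation task, in KEURO.
task_costs_keuro = {
    "language resources (up to 4 languages)": 180,
    "organisation": 90,
    "participants (10 at 30 KEURO each)": 10 * 30,
    "supervision": 30,
}

per_task_keuro = sum(task_costs_keuro.values())
number_of_tasks = 6
total_meuro = per_task_keuro * number_of_tasks / 1000

print(per_task_keuro, total_meuro)  # 600 3.6
```

Since tasks 1 and 2 are expected to cost more than tasks 5 and 6, the 600 KEURO figure is an average and the per-task amounts would differ in an actual plan.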

Note that the effort devoted by DARPA to finance the Human Language Technology program is much larger than our proposal for an evaluation infrastructure in Europe. It is estimated that the funding reaches about 20 M$ per year. On average, five different tasks are conducted in parallel, both for spoken and written language processing, each costing about 4 M$ per year. But it should be stressed that DARPA fully finances the development of the systems for some participants, while in other cases DARPA restricts its financing to the organisation of the campaign.

3.5.3 Budget Estimates for Usage Evaluation.

For the first campaign:

 

Activity per task                                            EC participation  KEURO
-----------------------------------------------------------  ----------------  -----
Establish the framework                                      100%              30
Organisation                                                 100%              90
Participants (estimated at 5 participants at 30 KEURO each)  50%               150
Supervision                                                  100%              50
Total for one task                                           -                 320

 

It is expected that there will be five participants in the first campaign and that their costs will be 60 KEURO each. The Commission will be asked to contribute 50% of these costs (i.e. 30 KEURO per participant, as in the table). The other costs will be covered by a 100% contribution.

For the following campaigns, the 30 KEURO needed to establish the framework are not necessary any more and the 50% contribution can be reduced (or even be negative if the participant pays to the campaign) depending on the success with industrial users.
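The EC contribution for a Usage Evaluation task can be retraced as follows. The amounts come from the table and the paragraphs above; the function and parameter names are illustrative, and the figure for later campaigns assumes that the 50% share is kept unchanged, which the text leaves open.

```python
PARTICIPANTS = 5
COST_PER_PARTICIPANT_KEURO = 60

# Items funded at 100% by the EC, in KEURO.
FULLY_FUNDED_KEURO = {"framework": 30, "organisation": 90, "supervision": 50}


def ec_contribution(include_framework=True, participant_share=0.5):
    """EC contribution in KEURO for one Usage Evaluation task."""
    fixed = sum(
        amount
        for item, amount in FULLY_FUNDED_KEURO.items()
        if include_framework or item != "framework"
    )
    # The EC only pays a share of the participants' own costs.
    return fixed + PARTICIPANTS * COST_PER_PARTICIPANT_KEURO * participant_share


first_campaign = ec_contribution()
later_campaign = ec_contribution(include_framework=False)
print(first_campaign, later_campaign)  # 320.0 290.0
```

Lowering `participant_share` models the reduced contribution foreseen for later campaigns; a negative value would correspond to participants paying into the campaign.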

Annex 1 Some European Examples of Comparative Quantitative Black Box Evaluation

A1.1 The ARCs (Actions de Recherche Concertées) of the Aupelf-Uref

The International Association of French Speaking Universities launched the Francil research network on language engineering coordinated by J. Mariani, in 1994. It began in parallel with the ARC program based on the evaluation paradigm for both spoken and written language for 7 different control tasks organized as follows [JM98]:

Written Language Resources and System Evaluation (ILEC)

Spoken Language Resources and Evaluation (ILOR)

The work for the first campaign started in July 1994 with the publication of a call for proposals which resulted in November 1994 in the selection of 50 proposals out of 89. An international advisory committee evaluates the program each year and includes experts of both ILEC and ILOR. Proposals were issued from 34 different laboratories.

The evaluation campaigns have a two year time span (1996-1997 and 1998-1999). Each control task has the same organizational structure, comprising an evaluator in charge of the management, a scientific committee whose members are the participants, one or more corpus providers and the participants. Except for ARC A4 (the number of replies to the call for proposals was not large enough to launch a complete program and resulted in a working group) all the ARCs were completed at the end of the first campaign, which was for some of them, an exploratory phase.

If we exclude A4, the total budget for the 6 ARCs was about 2 MEURO over 4 years, which averages roughly to 167 KEURO, per campaign, per control task (each with one evaluator and, on average, seven participants from three different countries) and for one language, French (A2 addressed French-English alignment). Note that only the evaluators and the corpus provider were funded. The participants only received a token subsidy to cover a part of the cost of adapting their system to the test conditions and the travel expenses they incurred.

All the campaigns used quantitative black box evaluation metrics except for ARC A3, for which qualitative assessment by domain experts was used (the evaluation metrics for B2 are still being defined, but they will very likely be inspired by the PARADISE framework [WLKA97]). The results of the first campaign were presented and discussed at a series of workshops organised as satellite events of the Journées Scientifiques et Techniques du Réseau FRANCIL in April 1997.

 

Some essential information about the first campaign:

 

ARC  # Participants + # Evaluators  Countries        Approximate Corpus Size                      Metrics
---  -----------------------------  ---------------  -------------------------------------------  -----------------------------------------
A1   8 + 1                          CA, CH, FR, USA  330,000 documents, indexed by 28 topics      Precision & Recall
A2   6 + 1                          CA, CH, FR, UK   2.8M words aligned FR & Eng                  Precision & Recall at various
                                                                                                  levels of granularity
A3   8 + 1                          CA, FR           3,800 journal pages indexed + thesauri       Qualitative assessment by domain experts
B1   5 + 1                          CA, FR           100 hours by 120 speakers                    Word Error Rate (NIST/Sclite V3.0)
                                                     + 40 M words text corpus
                                                     + 64 K words phon. lexicon
                                                     4 Language Models (approx. tot. 170K words)
B2   5 + 1                          CA, FR           30 hours of dialogue                         under development
B3   9 + 1                          B, CA, CH, FR    2,100 sentences (27.3 K words)               Phoneme Error Rate (modified NIST/Sclite)
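
To make the Word Error Rate metric listed for B1 concrete, the following minimal sketch computes it in its usual form: the word-level edit distance (substitutions, deletions and insertions) between a reference transcription and a system hypothesis, divided by the number of reference words. This is not the NIST Sclite tool itself, which also handles text normalisation and alignment reporting; the function name and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words (dynamic programming, one row at a time).
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution ("was" -> "is") and one deletion ("at") over six words.
wer = word_error_rate("the news was read at noon", "the news is read noon")
print(round(wer, 3))  # 0.333
```

The same procedure applied to phoneme sequences instead of word sequences gives a phoneme error rate of the kind listed for B3.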

 

The first evaluation campaign resulted in:

A1.2 GRACE (Grammars and Resources for Analysers of Corpora and their Evaluation)

Started upon the initiative of Joseph Mariani from Limsi-CNRS and Robert Martin from INaLF-CNRS, GRACE was part of the French program CCIIL (Cognition, Intelligent Communication and Language Engineering), jointly promoted by the Engineering Sciences and Human Sciences departments of the CNRS.

Initially GRACE [ALMPR98] was intended to run over four years (1994-1997) in two phases: the first dedicated to Part-of-Speech tagging for French text, and the second, which has since been abandoned, intended to tackle syntactic analysis, also for French. The first year was devoted to bootstrapping the program by:

The call for tenders was published in November 1995. The training corpus was distributed globally to all the participants in January 1996, while the dry run corpus was distributed individually to each participant in an encrypted form during the fall of 1996. The results were discussed during a workshop restricted to the participants, a satellite event to the Journées Scientifiques et Techniques du Réseau FRANCIL, in April 1997 [ALMPR97]. The test corpus was distributed in the same manner as for the dry run, at the end of December 1997. The preliminary results of the tests were discussed with the participants in a workshop in May 1998. The final results were disclosed on the Web during the fall of 1998, as soon as they had been validated by the organizers (cross validation with two different processing chains based on different algorithms and developed at two different sites) and the participants.

At the beginning there were 18 participants from 5 different countries (CA, USA, D, CH, FR), from both public research and industry, and 3 evaluators (Martin Rajman, at first from ENST then EPFL, had joined the initial members of the coordinating committee who were from INaLF and Limsi). The 2 corpus providers were also the initial evaluators (Limsi and INaLF).
Out of the 21 initial participants, only 17 took part in the dry run and only 13 completed the tests.

The size of the training corpus was around 10 million words and it consisted of texts evenly distributed between literary works and newspaper articles. For the dry run, the participants tagged a corpus of roughly 450,000 words with a similar genre distribution and the performance measure was computed over 20,000 words to which a reference description had been manually assigned. For the tests, the participants had to mark a corpus of 650,000 words and the measure was taken over 40,000 words.

The real cost of GRACE is difficult to estimate because:

Nevertheless, it is possible to give the following assessment. Over the 4.75 years that the project lasted, the travel and consumable expenses can be roughly estimated at 100 KEURO. A minimal estimate of the evaluator’s work is one person working full-time during the whole project. If we assume a yearly overall cost of 150 KEURO for that person, we come up with a total in the order of 800 KEURO over 4.75 years. In GRACE only the evaluators were funded (the participants were reimbursed only for their travel expenses), and the above cost does not include any cost for the data, as the corpus providers were the evaluators themselves.

We can estimate that a participant who followed the project from the beginning contributed a minimum of 2 person-weeks. If we compute the cost over a two-year period, we find a total of 335 KEURO for one control task, one language, 3 evaluators and 13 participants from 5 different countries. Note that this cost is double the estimated cost of one ARC campaign whose characteristic numbers are half of those of GRACE, but it would be hasty to infer a linear relationship between the cost of a program and the number of participants from such scarce data.
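The order of magnitude of the evaluator cost quoted above can be checked with a few lines of arithmetic. The sketch below uses only the figures given in the text; the function names and the linear pro-rata rule are assumptions made for illustration, and the text does not say how the 335 KEURO two-year figure was derived (a linear share of the rounded 800 KEURO total comes out close to it, but not identical).

```python
# Figures quoted in the text, in KEURO; names are illustrative only.
TRAVEL_AND_CONSUMABLES = 100.0   # whole project
YEARLY_EVALUATOR_COST = 150.0    # one full-time person, overall yearly cost
PROJECT_YEARS = 4.75


def total_cost(years: float = PROJECT_YEARS) -> float:
    """Evaluator cost over the project plus travel and consumables."""
    return YEARLY_EVALUATOR_COST * years + TRAVEL_AND_CONSUMABLES


def pro_rata(total: float, project_years: float, window_years: float) -> float:
    """Scale a total linearly to a shorter window (an assumed convention)."""
    return total * window_years / project_years


if __name__ == "__main__":
    total = total_cost()
    print(f"Total over {PROJECT_YEARS} years: {total:.1f} KEURO")
    share = pro_rata(800.0, PROJECT_YEARS, 2.0)
    print(f"Two-year share of the rounded 800 KEURO: {share:.1f} KEURO")
```

The first figure is consistent with "in the order of 800 KEURO"; the second is only a plausibility check on the two-year estimate.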

GRACE used two quantitative black box metrics, Decision and Precision, which were derived especially for GRACE from the metrics used in Information Retrieval (Precision and Recall). One of the lessons to draw from the GRACE experience is that, ideally, results should be cross-validated with two different processing chains, based on different algorithms (when this is possible) and developed at two different sites, in order to ensure their accuracy and quality.
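The cross-validation lesson can be illustrated with a minimal sketch. It assumes that each processing chain produces one score per participant; the function name, the data layout and the tolerance are hypothetical and are not taken from the GRACE tooling.

```python
def cross_validate(scores_a, scores_b, tolerance=1e-6):
    """Return the participants whose scores differ between two chains.

    scores_a and scores_b map a participant identifier to the score
    computed by each independently developed processing chain.
    """
    mismatches = {}
    for participant in sorted(set(scores_a) | set(scores_b)):
        a = scores_a.get(participant)
        b = scores_b.get(participant)
        # A missing score on either side is also reported as a mismatch.
        if a is None or b is None or abs(a - b) > tolerance:
            mismatches[participant] = (a, b)
    return mismatches


if __name__ == "__main__":
    chain_1 = {"P1": 0.90, "P2": 0.80}   # made-up values
    chain_2 = {"P1": 0.90, "P2": 0.85}
    print(cross_validate(chain_1, chain_2))
```

Any participant reported by such a check would need to be investigated before the results are published.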

 

| Project | # Participants + # Evaluators | Countries | Approximate Corpus Size | Metrics |
|---|---|---|---|---|
| GRACE | 13 + 3 | CA, USA, D, CH, FR | 10M words + 60K words hand tagged + 350K word lexicon | Precision & Decision (an adaptation of I.R. Precision & Recall) |

 

The results of GRACE are:

A1.3 SENSEVAL/ROMANSEVAL (Word Sense Disambiguators Evaluation)

SENSEVAL [AK98a] is a pilot evaluation campaign for Word Sense Disambiguation systems [IV98] working in English. It was coordinated by Adam Kilgarriff (who kindly provided the cost information below) and was run in collaboration and in parallel with the ROMANSEVAL evaluation campaign, the same exercise as SENSEVAL but applied to the French and Italian languages. ROMANSEVAL was coordinated by Jean Véronis (LPL-University of Aix-en-Provence), Frédérique Segond (XRCE-Grenoble) and Nicoletta Calzolari (CPR-Pisa).

The SENSEVAL exercise proposed two distinct tasks: one for those who need sense-tagged training data, and one for those who do not. For both, tagging was only performed on a few selected words, which were supposed to be tagged with the senses defined by HECTOR (both a dictionary and the associated corpus). HECTOR is the result of a collaboration between Oxford University Press and Digital. Initially, a third task had been proposed, for systems not requiring training data and which would be asked to tag all the words of the test material (using WordNet [CF98] senses). However, it was abandoned, partly because the resources to support it were lacking. Note that the texts to be tagged were excerpts and not full documents. There was no distribution of untagged corpus material of the same genre as that to be used for evaluation, but the evaluation material was taken from a similar spread of genres to that found in the British National Corpus.

SENSEVAL/ROMANSEVAL ran over 8 months, from December 1997, when the first expressions of interest were registered, to the final workshop in September 1998 in Herstmonceux (UK).

The dry run data samples were distributed in March 1998 to the participants, who had to return a formal agreement to participate, which included the license for research use of the HECTOR data (copyright Oxford University Press, which provided the data for free). The dry run data consisted of a sense-tagged mini-corpus of 40 word types: 20 nouns, 10 adjectives and 10 verbs, all the HECTOR instances (e.g. between 300 and 1000) of each type, as well as the HECTOR dictionary definitions for these 40 word types and another 200 for the porting of programs which take Machine Readable Dictionaries as input.

Test training data (word samples and lexical entries) were distributed in June 1998. The tests were done on 20 nouns, 20 adjectives and 20 verbs. At this stage, legitimate activities included developing, maybe semi-automatically, the algorithm-specific lexical representations for the target words, or manually identifying sense-mappings between HECTOR and another resource (e.g. WordNet [CF98]).

In early July, the participants were asked to freeze their software and the test data for all tasks were distributed. The taggings were returned during the first half of August. The results were made available to the participants at the end of August and disclosed at the final workshop in September 1998. Note that the participants were given until mid-October 1998 to improve their score if they wished to do so, provided they did not modify their system. This gave them the chance to correct spurious errors (such as those due to format discrepancies) or to optimize the learning of their system.

Initially, about 35 teams claimed interest in participating in SENSEVAL, and in the end, the results of the evaluation of 21 systems (including derived versions) were presented at the final workshop, along with the results obtained with different baseline approaches.

A very rough estimate of the cost of the pilot-SENSEVAL (which dealt only with one language, English) is the following:

 

 

| Cost item | KEURO |
|---|---|
| Coordinator gross salary (6 person-months) | 23 |
| Overheads on coordinator salary (approx. 45%) | 11 |
| English manual tagging: grant from UK EPSRC | 16 |
| English manual tagging: support in kind from Cambridge University Press | 3 |
| English lexicon and corpus, provided free by Oxford University Press | 0 |
| Results computation (paid in kind by paying travel and workshop attendance) | 1 |
| Student assistants (paid in kind by paying travel and workshop attendance) | 2 |
| Hardware and computing | 0 |
| Workshop: venue hire | 5 |
| Workshop: printing, photocopying, workshop subsidies | 2 |
| Total | 61 |

 

Note that the biggest share is the coordinator’s salary (data, hardware and computing were provided for free), and that a lot depends on how well-paid, and how efficient, the coordinator is. The participants were not funded. According to the organizers, they were constrained in task definition by the availability of resources, particularly the dictionary. Cost estimates are not available for ROMANSEVAL, but the data were provided by ELRA at a very low cost and small off-the-shelf electronic dictionaries were used.

Both SENSEVAL and ROMANSEVAL used quantitative black box metrics. In this respect SENSEVAL marks an important milestone: it was during the final workshop that a cross entropy measure used in conjunction with a penalty value (based on sense hierarchy distance or functional communicative distance between the correct and the proposed sense for a token), as proposed in [RY97], was recognized as more relevant than the metrics generally used in the literature (Boolean Tag Error Rates), because of its higher discriminating power.
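To make the contrast with Boolean tag error rates concrete, here is a minimal sketch of the two ingredients mentioned above, assuming that each system returns a probability distribution over senses for every token. The function names, the probability floor and the toy distance function are illustrative assumptions; how the two quantities were actually combined in the SENSEVAL scoring is not specified in this document.

```python
import math


def cross_entropy(predictions, gold, floor=1e-6):
    """Mean negative log2 probability assigned to the correct sense."""
    total = 0.0
    for dist, correct in zip(predictions, gold):
        # The floor (an assumption) avoids log(0) when the correct
        # sense receives no probability mass at all.
        p = max(dist.get(correct, 0.0), floor)
        total += -math.log2(p)
    return total / len(gold)


def expected_penalty(predictions, gold, distance):
    """Mean probability-weighted distance between proposed and correct sense."""
    total = 0.0
    for dist, correct in zip(predictions, gold):
        total += sum(p * distance(correct, sense) for sense, p in dist.items())
    return total / len(gold)


def toy_distance(sense_a, sense_b):
    """Placeholder distance: 0 for identical senses, 1 otherwise."""
    return 0.0 if sense_a == sense_b else 1.0


if __name__ == "__main__":
    # Made-up example with two tokens of the same word.
    predictions = [
        {"bank/river": 0.5, "bank/money": 0.5},
        {"bank/money": 1.0},
    ]
    gold = ["bank/river", "bank/money"]
    print(cross_entropy(predictions, gold))
    print(expected_penalty(predictions, gold, toy_distance))
```

Unlike a Boolean error rate, both quantities give partial credit to a system that hesitates between the correct sense and another one, and a distance based on a sense hierarchy would further penalize a confusion between distant senses more than one between close senses.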

 

| Project | # Participants + # Evaluators | Countries | Approximate Corpus Size | Metrics |
|---|---|---|---|---|
| SENSEVAL | 21 + 1 | FR, USA, IT, UK, CH, KO, MA, CA, SP, NL | 60 lemmas in 8,448 contexts | Weighted Cross Entropy |
| ROMANSEVAL | 7 + 1 | FR, IT, CH | 60 lemmas in 3,724 contexts | Precision/Recall per form/lemma pair |

 

The results of SENSEVAL/ROMANSEVAL are:

Annex 2 - Thirty-one Candidate Control Tasks

Given the current state of the domain, here is a list of 31 possible candidate control tasks which could be easily specialized into subsidiary tasks. (T) = Text, (S) = Speech, (G) = Generic to both Text and Speech. Where an evaluation has already been performed, an example is provided with the name of the exercise, the sponsor, its nationality and the year.

| No. | Type | Candidate control task | Previous evaluation |
|---|---|---|---|
| 1 | G | Language Models | [ARC B1/Aupelf/FR/95] |
| 2 | G | Translation Memories (sub-sentence level matching and partial clause analysis) | |
| 3,4 | S&T | Machine Translation | [DARPA/USA/92/93/94] |
| 5,6 | S&T | Multilingual data alignment | [ARC A2/Aupelf/FR/95] |
| 7 | T | Terminology Extraction | [ARC A3/Aupelf/FR/95] |
| 8,9 | S&T | Document Extraction | [TREC/DARPA/USA/92-98] |
| 10,11 | S&T | Language Understanding | [MUC/DARPA/USA/87-97] |
| 12 | T | Text Generation (from information templates) | |
| 13 | T | Summary Generation | [DARPA/USA/98] |
| 14 | T | Text Segmenting | |
| 15 | S | Speech Segmenting | |
| 16 | S | Speech Recognition | [DARPA/USA/84-98] and [ARC B1/Aupelf/FR/95] |
| 17 | S | Speech Synthesis | [ARC B3/Aupelf/FR/95] |
| 18,19 | S&T | Topic detection & Tracking | [DARPA/USA/98] |
| 20 | T | Part-Of-Speech Tagging | [GRACE-CNRS/FR/94-98] |
| 21 | T | Parsing | [Parseval/USA/92] and [SPARKLE/EU/96] |
| 22 | T | Lemmatisers | [Morpholympics/Germany/94] |
| 23 | T | Word Sense Disambiguation | [SENSEVAL/98] |
| 24 | T | Predicate Argument Structure | |
| 25,26 | S&T | Coreference Identification | [DARPA/USA/95+98] |
| 27,28 | S&T | Named Entity Extraction | [DARPA/USA/95+98] |
| 29 | S | Database Dialogue Querying | [EuroSpeech97/ELSNET/97] and [ARC B2/Aupelf/FR/95] |
| 30 | T | Hand Written Recognition | [NIST/USA/92] |
| 33 | S | Speaker Verification / Recognition | [NIST/USA/96/97/98] |
| 31 | S | Language Identification | |

 
The next table groups the tasks listed above according to three activities: Information Profiling, Information Querying and Information Presentation.
 

Information Profiling

Information Querying

Information Presentation

Speech

Speech Recognition

Database Dialogue Querying

Speech Synthesis

Speech Segmenting

Coreference Resolution

Information Extraction

Named Entities Extraction

Topic Detection & Tracking

Multilingual Data Alignment

Language Identification

Speaker Verification

Machine Translation

Language Understanding

Generic

Language Models

Translation Memories

 

Text

Text Segmenting

Machine Translation

Text Generation

Part-Of-Speech tagging

Lemmatising

Predicate Argument Structure

Multilingual Data Alignment

Parsing

Named Entity Extraction

Word Sense Disambiguation

Information Extraction

Coreference Resolution

Topic Detection & Tracking

Language Understanding

Summary Generation

Hand Writing Recognition

   

 

Annex 3 - Practical Considerations for Implementation

A3.1 The Need for a Permanent Infrastructure

Implementing the comparative evaluation paradigm in EC programs is difficult, as they are based on a call for proposals mechanism, with limited duration projects and usually a share of the cost supported by the participants. There is a need for a permanent evaluation organization of European scope, in order to cover a time scale larger than the duration of a Framework Program (FP). An ideal solution for capitalizing on the know-how gained throughout the course of several programs would be to include in the plan of action this permanent European evaluation organization, which could be responsible for defining and updating the general policy for language Technology Evaluation, for the strategic issues, for the ethical aspects, as well as for the practical organization of the evaluation campaigns (measure methodology, results computation and publication, software development, evaluation label attribution, quality control, etc.). It could either be created from scratch or by extending the mission of an existing organization. On that score, useful insights on how to take into account practical requirements like profit or not-for-profit constraints can be drawn from a parallel with successful existing organizations like the pair ELRA/ELDA. Note that ELRA already offers the means for long term capitalization on the LR produced for training and testing the systems by openly distributing them after each evaluation campaign.

A3.2 Selection of Evaluators and Participants

For each control task and for each evaluation campaign, there is a need for:

In addition to the production of specific resources, the running of an evaluation campaign will require the recruiting of both evaluators and participants. The consortium thinks that membership for both classes should be as open as possible. Evaluators ought to be selected first since they should be involved in the organization of the evaluation campaign. The proposal of any potential evaluator ought to contain at least:

No restriction should be imposed on participation in the dry run phase, but a selection based on the dry run results should be performed afterwards in order to limit the number of participants in the tests to a reasonable number. This number should be fixed in advance according to the amount of resources available and advertised at the beginning of the evaluation campaign. This way of proceeding would also ensure that sufficient time is provided to solve any administrative problems that could be caused by non-EU participants, if they manage to pass the dry run phase.
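The selection step described above amounts to ranking the dry run results and keeping a number of participants that was announced in advance. The sketch below assumes a single score per participant for which higher is better; the function name and the tie-breaking rule (alphabetical order) are hypothetical.

```python
def select_for_tests(dry_run_scores, max_participants):
    """Keep the best dry-run participants, up to a limit fixed in advance.

    dry_run_scores maps a participant identifier to a score for which
    higher is assumed to be better; ties are broken alphabetically.
    """
    ranked = sorted(dry_run_scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:max_participants]]


if __name__ == "__main__":
    scores = {"A": 0.90, "B": 0.95, "C": 0.80}   # made-up values
    print(select_for_tests(scores, 2))
```

In practice the ranking criterion would itself have to be agreed with the participants before the dry run, like the limit on the number of participants.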

A3.3 Integrating Evaluation in the Call for Proposals

In order to include evaluation in the FP5 agenda, it is proposed to include this topic in the first call for proposals. Evaluation campaigns would have a 2-year duration, in order to allow for more progress and research work between two campaigns than in the DARPA ones. If evaluation is deployed on a large scale during FP5, the consortium strongly advises that an installation period of 6 months at the very least should take place at the beginning of the program. We expect a certain amount of delay in deploying the evaluation paradigm because it will be the first time that it is used on such a scale and in the context of EU programs (3 years were necessary for DARPA to go from the drawing board to a real implementation for the speech recognition campaigns). This preliminary period must be planned with care to preserve the synchrony between the evaluation campaigns and the framework programs.

In a proactive scheme, the topics (related to both written and spoken language processing) should be selected beforehand and included in the call for proposals. Those topics should cover both complete systems and system components, and should be linked to one another, thus allowing the progress obtained in one field to influence the development of another. A straightforward way of implementing these links is to make part of the evaluation data common to related topics, and therefore in the same language. The topics should be of interest for LE research, but also for LE industry. To that end they should be proposed by a scientific committee and submitted to the appreciation of an industrial panel. A first selection of 6 topics could be:

  1. Broadcast news transcription;
  2. Cross lingual information retrieval;
  3. Text-to-speech synthesis;
  4. Text summarization;
  5. Language models;
  6. Morphosyntactic tagging, lemmatization, word sense disambiguation.

As a fallback option, there exists the possibility of including an evaluation task in each candidate project. The evaluation task would constitute a sort of concertation activity where provision would be made for the needs of an evaluation campaign. The resources needed could be contracted out or produced by a subset of the concerned projects. The possible evaluation topics would be determined by the nature of the research and technology projects running at a given time, according to technological clusters, different from the existing project clusters, which are inspired by market considerations. In that case, management becomes more difficult because it is more distributed. It still requires a coordinating entity, which could, as a last resort, be a specific project. In this reactive scheme driven by the content of the accepted proposals, we may lose the benefits of capitalizing on the evaluation expertise over a long period of time.

A3.4 Evaluation in a Multilingual Context

A specific difficulty for using the evaluation paradigm in a European framework is the multilingual nature of Europe. The proposal is to require that each participant address at least two languages (their own and another European language), and that for any evaluation campaign there be at least one language common to all participants, and at least two participants for any language. It would be even better if all the evaluation campaigns shared a common language, because the evaluation of different kinds of technologies, including complete systems and components, on the same data would then be possible. English is a strong candidate: it is spoken and understood by a large number of people, it represents a large market, and it would ease possible co-operation activities between the EU and the US in the field of LE evaluation.
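The three language requirements stated above can be checked mechanically once the languages addressed by each participant are known. The following sketch assumes a simple mapping from participant to language codes; the function name, the data layout and the wording of the messages are hypothetical.

```python
from collections import Counter


def check_campaign(languages_by_participant):
    """Return the list of violated language requirements (empty if none)."""
    problems = []

    # 1. Each participant addresses at least two languages.
    for participant, languages in sorted(languages_by_participant.items()):
        if len(set(languages)) < 2:
            problems.append(f"{participant} addresses fewer than two languages")

    # 2. At least one language is common to all participants.
    language_sets = [set(langs) for langs in languages_by_participant.values()]
    if language_sets and not set.intersection(*language_sets):
        problems.append("no language is common to all participants")

    # 3. Every language has at least two participants.
    counts = Counter(lang for langs in language_sets for lang in langs)
    for lang, count in sorted(counts.items()):
        if count < 2:
            problems.append(f"language {lang} has fewer than two participants")

    return problems


if __name__ == "__main__":
    campaign = {                      # made-up example
        "P1": ["en", "fr"],
        "P2": ["en", "de"],
        "P3": ["en", "fr", "de"],
    }
    print(check_campaign(campaign))   # an empty list means all rules hold
```

Such a check could be run when proposals are assembled, so that a language with a single candidate participant is detected before the campaign starts.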

The proposal is to select, in a first step, a list of languages (up to 4, possibly including English) for which there is a large enough number of potential participants, as identified by the consortia in charge of evaluation. This is in agreement with the fact that the goal is to evaluate technology, not specific applications in a given language. In a second step (future FPs), other languages could be addressed, both by domestic laboratories and by those who participated in the previous evaluation campaign and gained enough know-how in developing systems for the evaluation task to be able to easily tune their system to a new language.

A3.5 Proactive or Reactive Approach?

The strategy differs depending on whether a proactive or a reactive solution is sought, reflecting the different requirements imposed by each type of solution. With the former option, a list of topics is defined in advance and published in a single call for proposals (asking for both evaluators and participants). With the latter option, the evaluation topics are determined by the contents of the projects selected from a first call, and a subsequent call is needed to select the evaluators.

If the proactive solution is chosen, the call for proposals should ask either for consortia for each of the evaluation topics, or for larger consortia covering the full set of topics. The first solution is lighter to implement; the second allows for a better overall infrastructure, more apt to co-ordinate the various evaluations of components and complete systems, but is harder to manage (70 participants or more). The consortia should include a set of organizers for managing the evaluation campaigns in one (their own) or several languages. Each proposal should consider the common language and up to 3 other languages. It should describe how the consortium plans to organize the campaign; the Language Resources (LR) that will be used for training and testing the systems, their cost, and their providers (who will participate as subcontractors); and the list of potential participants for each language (at least two), who will also participate as subcontractors. We strongly suggest that the permanent European evaluation organization mentioned above should be a partner in the final consortia in order to capitalize on the results of the different evaluation campaigns. Having the LR providers and the participants as subcontractors allows for more flexibility (if the number of participants drops to two, or if the participant list changes). Alternatively, if the cost of supporting all the potential participants is too high, the consortia could first select a set of participants based on the evaluation results obtained in a dry run, and then finance only the best systems for the final test, up to a certain number. Each participant would receive a fixed amount of resources corresponding to the estimated cost of participating in the evaluation campaign (typically for adapting their system to the test conditions).

If the reactive solution is chosen, the evaluation topics are determined by the content of the selected projects, which perforce address evaluation as a complementary issue. The selected projects are subsequently grouped into technology clusters (based either on the technology used or developed in systems or components, or on the overall technological function implemented by the project result). A second call for proposals is then necessary in order to select organizers for the evaluation campaign, since selecting an organizer beforehand, or among the projects already selected, runs a very high risk of recruiting an organizer lacking the domain knowledge needed to appreciate the issues at stake, or of having an organizer whose opinion is biased by involvement in their own project. Moreover, organizing an evaluation is a time-consuming activity, which is poorly supported by the amount of resources generally allotted to complementary issues within a project.

Although a parallel of some sort could be drawn, evaluation activities should not be mistakenly put on the same footing as concertation and dissemination activities. In particular, organizing an evaluation requires the ability to maintain high bandwidth communication with the participants on highly technical grounds, e.g. in order to finalize the evaluation metrics, whereas concertation activities can be successfully achieved with much lighter means judiciously distributed through time.

A mixed solution between the purely proactive and purely reactive solutions is possible. Some evaluation topics could be selected beforehand and published in the first call for proposals, while others could be defined according to the projects selected after the first call. The ELSE consortium favors the proactive approach and a single consortium, constituted as an evaluation organizers network, with the support of the permanent European evaluation organization.

Note that if the classical EU contract scheme is used to implement evaluation, and if the participants are funded with the evaluator as the only contractor (all the participants and the corpus providers being subcontractors of the evaluator), then the usual limitation imposed on the amount of resources devoted to "third party assistance" should be modified or waived, since the amount of resources could exceed the allowed value simply because of the number of participants or the cost of the linguistic data needed for evaluation.

References

  1. [ALMPR97] Gilles Adda, Josette Lecomte, Joseph Mariani, Patrick Paroubek, Martin Rajman, "Les procédures de mesure automatique de l’action GRACE pour l’évaluation des assignateurs de Parties du Discours pour le Français", Actes des 1ères Journées Scientifiques et Techniques du Réseau Francophone de l’Ingénierie de la Langue de l’Aupelf-Uref, Avignon, Avril 1997.
  2. [ALMPR98] Gilles Adda, Josette Lecomte, Joseph Mariani, Patrick Paroubek, Martin Rajman, "The GRACE French Part-of-Speech Tagging Evaluation Task", in Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
  3. [CF98] Christiane Fellbaum (Editor), "WordNet: An Electronic Lexical Database", MIT Press, 1998.
  4. [COM98] URL: http://fofoca.mitre.org/index.html
  5. [DH98] Donna Harman, "The Text REtrieval Conference (TRECs) and the Cross-Language Track", in Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
  6. [GG97] Gregory Grefenstette (Editor), "Cross-Language Information Retrieval", Kluwer Academic Publishers, ISBN 0-7923-8122-X.
  7. [HLT97] European Commission, Human Language Technology, "Living and Working Together in the Information Society", Discussion Document, Luxembourg, July 1997. URL: http://www2.echo.lu/langeng/ist/hlt/paper.html
  8. [HLT98] European Commission, Human Language Technology, "Proposal Concerning The IST Program 1998-2002 (excerpts)", COM (98) 305 Final, 13 May 1998. URL: http://www.linglink.lu/le/ist/ist/excerpts_ist_pgme.htm
  9. [IV98] Nancy Ide and Jean Véronis, "Introduction to the special issue on word sense disambiguation: the state of the art", Computational Linguistics, 24(1), 1998.
  10. [AK98a] Adam Kilgarriff, "SENSEVAL: An Exercise in Evaluating Word Sense Disambiguation Programs", in Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
  11. [JD95] John Diver, "Principles and Practice of Language Engineering Project Clustering", Version 1.1 - Final, LINGLINK-LE1-1951, November 10th 1995.
  12. [JM98] Joseph Mariani, "The Aupelf-Uref Evaluation-Based Language Engineering Actions and Related Projects",
  13. [JP97] Jeremy Peckham, "Bringing Language Engineering to Market", Language Engineering Concertation and Project Review, Mondorf-les-Bains, March 1998.
  14. [JPSSJGTH98] Joseph Polifroni, Stephanie Seneff, James Glass, and Timothy Hazen, "Evaluation Methodology for a Telephone-Based Conversational System", in Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
  15. [KC98] Khalid Choukri, "The European Language Resource Association", in Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
  16. [KNRGRC95] Klaus Netter, Richard Crouch, Robert Gaizauskas, et al., "Interim Report of the Study Group on Assessment and Evaluation", April 1995.
  17. [KSJ95] Karen Sparck Jones, Julia R. Galliers, "Evaluating Natural Language Processing Systems", Springer-Verlag, 1995.
  18. [LH98] Lynette Hirschman, "Language Understanding Evaluations: Lessons Learned from MUC and ATIS", in Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
  19. [LGLK98] European Commission, "Thematic Clustering", Language Engineering Harnessing the Power of Language, URL: http://www.linglink.lu/le/concert/clusprin.html
  20. [MAT98] European Commission, URL: http://www2.echo.lu/langeng/projects/mate.
  21. [MP90] Marvin Minsky and Seymour Papert, "Perceptrons - Expanded Edition", MIT Press, 1990.
  22. [ML98] Mark Liberman, Christopher Cieri, "The Creation, Distribution and Use of Linguistic Data: The Case of The Linguistic Data Consortium", in Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
  23. [MK96] Maghi King et al., "Evaluation of Natural Language Processing Systems - EAGLES Final Report", EAG-WEG-PR.2, October 1996, ISBN 87-90708-00-8.
  24. [MLIM98] Eduard Hovy, Nancy Ide, Robert Frederking, Joseph Mariani, Antonio Zampolli, Editors, "Multilingual Information Management: Current Levels and Future Abilities", a study commissioned by the US National Science Foundation and also delivered to the European Commission’s Language Engineering Office and the US Defense Advanced Research Projects Agency, July 1998. URL: http://www.cs.cmu.edu/~ref/mlim/
  25. [RL98] Rose Lockwood, "Language Technology: Understanding the Market", Language Engineering Concertation and Project Review, Mondorf-les-Bains, March 1998.
  26. [RS97] Richard Sproat (Editor), "Multilingual Text-To-Speech Synthesis - The Bell Labs Approach", Kluwer Academic Publishers, ISBN 0-7923-8027-4.
  27. [RY97] Philip Resnik and David Yarowsky, "A perspective on word sense disambiguation methods and their evaluation", position paper presented at the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, held in conjunction with ANLP-97 in Washington, D.C., USA, April 4-5, 1997.
  28. [SYLC98] Steve Young, Lin Chase, "Speech Recognition Evaluation, A Review of the US CSR and LVCSR Programs", Journal of Computer Speech and Language, 1998.
  29. [SYLL97] S. J. Young, M. Adda-Decker, X. Aubert, C. Dugast, J.-L. Gauvain, D.J. Kershaw, L. Lamel, D. Leeuwen, D. Pye, A.J. Robinson, H.J.M. Steeneken, and P.C. Woodland, "Multilingual Large Vocabulary Speech Recognition: The European SQALE Project", Computer Speech and Language, Vol. 11, 1997.
  30. [WGIMNSYZ98] Steven Wegman, Larry Gillick, Yoshiko Ito, Linda Manganaro, Michael Newman, Francesco Scattone, Jon Yamron, Puming Zhan, "Dragon System’s Automatic Transcription System for the New TDT Corpus", in Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
  31. [WLKA97] Marilyn Walker, Diane Litman, Candace Kamm, Alicia Abella, "PARADISE: A Framework for Evaluating Spoken Dialogue Agents", in Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, ACL 97, 1997.
  32. [WHFFM97] Marilyn Walker, Donald Hindle, Jeanne Fromer, Giuseppe Di Fabbrizio, Craig Mestel, "Evaluating Competing Agent Strategies For A Voice Email Agent", in Proceedings of the 5th European Conference On Speech Communication And Technology, Rhodes, September 1997.