| Project ref. no.: | LE4-8340 |
| Project title: | Evaluation in Language and Speech Engineering. |
| Deliverable status: | Public |
| Contractual date of delivery: | April 30th 1999 |
| Actual date of delivery: | |
| Deliverable number: | D1.2 |
| Deliverable title: | "Follow-up Evaluation Proposal" |
| Type: | Report |
| Status: | Pre-Final version 1.1 |
| Number of pages: | 18 |
| WP contributing to the deliverable: | WP1 |
| WP / Task responsible: | Limsi - CNRS, Bâtiment 508 Université Paris XI Dépt. Communication Homme Machine, BP 133 - 91403 ORSAY Cedex |
| Author(s): | Editor: Patrick Paroubek. Contributors: Niels Ole Bernsen, Marc Blasband, Nicoletta Calzolari, Jean-Pierre Chanod, Khalid Choukri, Laila Dybkjær, Robert Gaizauskas, Steven Krauwer, Isabelle de Lamberterie, Joseph Mariani, Klaus Netter, Patrick Paroubek, Martin Rajman, Antonio Zampolli |
| EC Project Officer: | Giovanni Battista Varile |
| Keywords: | EVALUATION, QUANTITATIVE, BLACK BOX, NATURAL LANGUAGE PROCESSING, MULTILINGUALITY, CONTROL TASK, PROPOSAL, FP5. |
| Abstract: In the following, we retrace the efforts made by the ELSE consortium to propose follow-up evaluation activities to the ELSE project. We start with the initial list of 31 control tasks identified at the beginning of the project as worthy of attention given the current level of development achieved by Language Technologies. Then we consider the integration of evaluation in the call for proposals scheme used by the Commission. After that, we explain the reasons that made us favor six control tasks out of the previous 31 as the best candidates to start an evaluation program, and how these control tasks could be merged into a single one: NODE (News On Demand Evaluation). Finally, we present the EvaLE proposal, to which ELSE contributed and which has been submitted to the first call of FP5. |
2. Thirty-one Candidate Control Tasks.
3. Evaluation in a Multilingual Context.
4. Proactive or Reactive Approach?
5. Integrating Evaluation in the Call for Proposals.
6. Six Candidate Control Tasks for Technology Evaluation.
7. Complementary Usage Evaluation.
8. One Control Task: NODE (News On Demand Evaluation).
9. Contribution to the EvaLE proposal (first FP5 call).
Let us first recall that Technology Evaluation revolves around the concept of Control Task, which is the function that the participating systems perform during an evaluation campaign under the conditions imposed by the evaluation protocol (e.g. for parser evaluation, a control task could be the bracketing of the constituents in texts of minimal size). The evaluation protocol uses quantitative black box metrics applied to the output of the control task to measure the performance of the systems against a reference frame made of data validated by humans. A more detailed discussion of the notion of control task is available in [ELSED1], which also presents the evaluation infrastructure blueprint and the related issues in more detail.
In the following, we retrace the efforts made by the ELSE consortium to propose follow-up evaluation activities to the ELSE project. We start with the initial list of control tasks identified at the beginning of the project as worthy of attention given the current level of development achieved by Language Technologies. Then we consider the integration of evaluation in the call for proposals scheme used by the Commission. After that, we explain the reasons that made us favor six control tasks out of the previous 31 as the best candidates to start an evaluation program, and how these control tasks could be merged into a single one: NODE (News On Demand Evaluation). Finally, we present the EvaLE proposal, to which ELSE contributed and which has been submitted to the first call of FP5. Where necessary for clarity, we recall some key points about issues developed in [ELSED1].
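To make the notion of a quantitative black box metric concrete for the parser example above, here is a minimal Python sketch that scores the constituent brackets output by a system against a human-validated reference. The span representation, the function name `bracket_scores` and the toy data are assumptions made for this illustration only; they are not taken from the ELSE protocol.

```python
from collections import Counter


def bracket_scores(system, reference):
    """Return (precision, recall, f1) for unlabelled constituent brackets.

    Each argument is an iterable of (start, end) token offsets, one pair
    per bracket. Only the system OUTPUT is inspected (black box view).
    """
    sys_counts = Counter(system)
    ref_counts = Counter(reference)
    # Multiset intersection: a bracket is matched at most as many times
    # as it occurs in the reference.
    matched = sum((sys_counts & ref_counts).values())
    n_sys = sum(sys_counts.values())
    n_ref = sum(ref_counts.values())
    precision = matched / n_sys if n_sys else 0.0
    recall = matched / n_ref if n_ref else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical five-token sentence: the system gets three brackets right
# and proposes (2, 4) where the reference has (3, 5).
reference = [(0, 5), (0, 2), (2, 5), (3, 5)]
system = [(0, 5), (0, 2), (2, 5), (2, 4)]
precision, recall, f1 = bracket_scores(system, reference)
print(precision, recall)  # 0.75 0.75
```

A real campaign would fix many more details in its protocol (labelled versus unlabelled brackets, tokenization, treatment of punctuation); the sketch only shows the general shape of such a measure.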
Considering the current state of development of the domain, the consortium identified, at the beginning of ELSE, a list of 31 possible candidate control tasks. If the need arises, each could easily be specialized into subsidiary control tasks. The following table presents these 31 control tasks, indicating for each one whether it is more a Text processing oriented task (T), a Speech processing oriented task (S) or a generic task (G), used for both Text and Speech processing (e.g. statistical language modeling). When an evaluation has already been performed, an example is provided with the name of the exercise, the sponsor, its country and the year.
| No. | Type | Control Task | Previous Evaluation [exercise/sponsor/country/year] |
| 1 | G | Language Models | [ARC B1/Aupelf/FR/95] |
| 2 | G | Translation Memories (sub-sentence level matching and partial clause analysis) | . |
| 3,4 | S&T | Machine Translation | [DARPA/USA/92/93/94] |
| 5,6 | S&T | Multilingual data alignment | [ARC A2/Aupelf/FR/95] |
| 7 | T | Terminology Extraction | [ARC A3/Aupelf/FR/95] |
| 8,9 | S&T | Document Extraction | [TREC/DARPA/USA/92-98] |
| 10,11 | S&T | Language Understanding | [MUC/DARPA/USA/87-97] |
| 12 | T | Text Generation (from information templates) | . |
| 13 | T | Summary Generation | [DARPA/USA/98] |
| 14 | T | Text Segmenting | . |
| 15 | S | Speech Segmenting | . |
| 16 | S | Speech Recognition | [DARPA/USA/84-98] and [ARC B1/Aupelf/FR/95] |
| 17 | S | Speech Synthesis | [ARC B3/Aupelf/FR/95] |
| 18,19 | S&T | Topic detection & Tracking | [DARPA/USA/98] |
| 20 | T | Part-Of-Speech Tagging | [GRACE-CNRS/FR/94-98] |
| 21 | T | Parsing | [Parseval/USA/92] and [SPARKLE/EU/96] |
| 22 | T | Lemmatizers | [Morpholympics/Germany/94] |
| 23 | T | Word Sense Disambiguation | [SENSEVAL/98] |
| 24 | T | Predicate Argument Structure | . |
| 25,26 | S&T | Coreference Identification | [DARPA/USA/95+98] |
| 27,28 | S&T | Named Entity Extraction | [DARPA/USA/95+98] |
| 29 | S | Database Dialogue Querying | [EuroSpeech97/ELSNET/97] and [ARC B2/Aupelf/FR/95] |
| 30 | T | Handwriting Recognition | [NIST/USA/92] |
| 33 | S | Speaker Verification / Recognition | [NIST/USA/96/97/98] |
| 31 | S | Language Identification | . |
ELSE uses as its reference frame the generic abstract architecture of a cross-language intelligent information extraction system, which can access both local and distributed databases. In this context, information extraction is meant in a broad sense, encompassing both the classical meaning of Information Extraction (IE), i.e. template filling from documents, and that of Information Retrieval (IR), i.e. document selection. Such a system would have multi-modal input and output and would be able to adapt its behavior intelligently to a particular query, for instance by choosing between classical IE and IR functionality, or by deciding whether to consult a local database or the Web to access information sources from various media. Our abstract application can be seen as a sort of super information browser/finder.
This generic architecture was helpful for determining the list of control tasks that we propose here. Each evaluation task corresponds to an abstract functionality or module of the architecture. The various components of the architecture can be developed along the following 3 dimensions:
Evaluation points can be selected at the input and output of individual modules of such an architecture and also at any point along arbitrary module chains. Thus, new evaluation tasks can be defined by linking various modules of the abstract architecture in a braided fashion [KN95]. Our abstract architecture is very much like the new DARPA COMMUNICATOR evaluation paradigm [http://fofoca.mitre.org/index.html], where real information software is distributed to identified partners for technology development (software derived from the JUPITER system [JP98], developed at MIT). In our case, however, the architecture is not a real one but an abstract one, and it was used only as a reference framework for identifying the various evaluation tasks and relating them to Multilinguality and Natural Interactivity issues. In the future, we hope that evaluation will be deployed on a large scale in Europe, and we would like the complete functionality set of the abstract architecture to be addressed by evaluation activities.
| | Information Profiling | Information Querying | Information Presentation |
| Speech | Speech Recognition | Database Dialogue Querying | Speech Synthesis |
| | Speech Segmenting | | |
| | Coreference Resolution | Information Extraction | |
| | Named Entity Extraction | | |
| | Topic Detection & Tracking | Multilingual Data Alignment | |
| | Language Identification | | |
| | Speaker Verification | Machine Translation | |
| | Language Understanding | | |
| Generic | Language Models | Translation Memories | |
| Text | Text Segmenting | Machine Translation | Text Generation |
| | Part-Of-Speech Tagging | | |
| | Lemmatizing | | |
| | Predicate Argument Structure | Multilingual Data Alignment | |
| | Parsing | | |
| | Named Entity Extraction | | |
| | Word Sense Disambiguation | Information Extraction | |
| | Coreference Resolution | | |
| | Topic Detection & Tracking | | |
| | Language Understanding | Summary Generation | |
| | Handwriting Recognition | | |
A specific difficulty in using the evaluation paradigm in a European framework is the multilingual nature of Europe: the European Union has 11 official working languages. In Language Engineering, evaluating whether a technology is adequate for solving a particular task requires separating language-specific aspects from task-specific aspects, or at least having a fair assessment of the language-specific issues. Given the high cost of evaluation, running an evaluation campaign for every language spoken in Europe, for each given technology, would be totally impractical.
Thus we must either make a drastic and arbitrary selection (by necessity not entirely based on scientific criteria) or find a way to generalize the results obtained for a given technology in one language to other languages. An alternative solution, which reduces the number of languages addressed while retaining roughly the same language coverage for an evaluation campaign, would be to set cross-language functionality requirements for the control task (specifying different input and output languages, e.g. in Information Retrieval). But this solution does not apply to tasks which are intrinsically monolingual, like speech generation.
The solution that was finally adopted in the follow-up proposal (see section ???) was to require that each participant addresses at least two languages (their own and another European language), that for any evaluation campaign there is at least one language common to all participants, and that there are at least two participants for any language. Such a scheme was implemented in the SQALE [SYLL97] project. It enables some generalization of the evaluation results: if a system A obtains the best results for its own language and has better results than a system B on the pivotal language, then A is expected to have better results than B on any other language in the language lineage (e.g. French, Spanish, Italian, Portuguese and Romanian all derive from Latin) of either the pivotal language or the language specific to system A.
An extra factor of homogeneity would be for all the evaluation campaigns to share a common language. (American) English is a strong candidate: it is spoken and understood by a large number of people, it represents a large market, and it would fit possible co-operation activities between the EU and the US in the field of LE evaluation.
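The three language coverage rules stated above (at least two languages per participant, at least one language common to all participants, at least two participants per language) can be checked mechanically when a campaign is being set up. The following Python sketch is only an illustration under stated assumptions: the function name `check_language_coverage`, the dictionary layout and the site names are hypothetical and do not come from the SQALE or ELSE documents.

```python
def check_language_coverage(participants):
    """Check a campaign set-up against the three language rules.

    `participants` maps a participant name to the set of language codes
    it addresses. Returns a list of messages, one per violated rule
    (an empty list means that all three rules hold).
    """
    problems = []

    # Rule 1: each participant addresses at least two languages.
    for name, langs in sorted(participants.items()):
        if len(langs) < 2:
            problems.append(f"{name} addresses fewer than two languages")

    # Rule 2: at least one language is common to all participants.
    lang_sets = [set(langs) for langs in participants.values()]
    common = set.intersection(*lang_sets) if lang_sets else set()
    if not common:
        problems.append("no language is common to all participants")

    # Rule 3: every language is addressed by at least two participants.
    counts = {}
    for langs in lang_sets:
        for lang in langs:
            counts[lang] = counts.get(lang, 0) + 1
    for lang in sorted(counts):
        if counts[lang] < 2:
            problems.append(f"language {lang} has fewer than two participants")

    return problems


# Hypothetical campaign: English is the pivotal language, but German is
# addressed by a single site only.
campaign = {
    "site_A": {"en", "fr"},
    "site_B": {"en", "fr"},
    "site_C": {"en", "de"},
}
print(check_language_coverage(campaign))
# ['language de has fewer than two participants']
```

Adding a fourth site that addresses English and German would make the same check return an empty list.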
The strategy differs depending on whether a proactive or a reactive solution is sought, because each type of solution imposes different requirements. With the former option, a list of topics is defined in advance and published in a single call for proposals (asking for both evaluators and participants). With the latter option, the evaluation topics are determined by the contents of the projects selected from a first call, and a subsequent call is needed to select the evaluators.
If the proactive solution is chosen, the call for proposals should ask either for consortia for each of the evaluation topics, or for larger consortia covering the full set of topics. The first solution is lighter to implement. The second one allows for a better overall infrastructure, more apt to co-ordinate the various evaluations of components and complete systems, but it is harder to manage (70 participants or more). The consortia should include a set of organizers for managing the evaluation campaigns in one (their own) or several languages. Each proposal should consider the common language and up to 3 other languages. It should describe the way the consortium plans to organize the campaign, the Language Resources (LR) that will be used for training and testing the systems, their cost and their providers (who will participate as subcontractors), and the list of potential participants for each language (at least two), who will also participate as subcontractors. We strongly suggest that the permanent European evaluation organization mentioned before should be a partner in the final consortia in order to capitalize on the results of the different evaluation campaigns. Having the LR providers and the participants as subcontractors allows for more flexibility (in case the number of participants is reduced, down to a minimum of two, or the participant list changes). Alternatively, if the cost of supporting all the potential participants is too high, the consortia could first select a set of participants based on the evaluation results obtained in a dry run, and then finance only the best systems for the final test, up to a certain number. Each participant would receive a fixed amount of resources corresponding to the estimated cost of participating in the evaluation campaign (typically for adapting its system to the test conditions).
If the reactive solution is chosen, the evaluation topics are determined by the content of the selected projects, which perforce address evaluation as a complementary issue. The selected projects could subsequently be grouped into clusters based on technological similarities (either the technology used or developed in systems or components, or the overall technological function implemented by the project result). A second call for proposals is then necessary to select organizers for the evaluation campaign, since selecting an organizer beforehand, or among the projects already selected, runs a very high risk of recruiting an organizer who lacks the domain knowledge needed to appreciate the issues at stake, or one whose opinion is biased by involvement in their own project. Moreover, organizing an evaluation is a time-consuming activity, which is poorly supported by the amount of resources generally allotted to complementary project issues. Last but not least, according to EU regulations all the partners of a project need to be identified before signature of the contract, so it is impossible to have a project for organizing an evaluation campaign where only some of the partners are identified (e.g. the evaluation organizer but not the participants, or vice versa).
Concerning the possibility of clustering projects for evaluation purposes, let us recall that project clustering was initiated in the course of FP4 in order to contribute to the objectives of the program, and to the achievement of the performance criteria laid down for the Telematics Application Program [JD95]. The purposes of the program were:
In a broader perspective, project clustering was also meant to support the long-term objectives of the LE sector, which are mostly motivated by market considerations:
Possible factors which have been considered for organizing project clusters were [LGLK98]:
Of these five factors, market opportunity was identified as the most appropriate basis for clustering. Grouping by technology was termed "very attractive" in terms of project cross-fertilization and quality improvement. However, it was deemed impractical because of the large number of technology combinations, and not rewarding enough as concerns the target user community [JD95].
Evaluation can provide an important inter-project and inter-cluster link and an exchange medium that would contribute to the objectives of the program and the long-term objectives of the sector. It will bring the projects of a cluster together in one common activity, which will force more inter-project communication.
Once a control task has been drafted after identifying a need for validation in a cluster, the features of the control task should be carefully selected so as to allow the largest possible number of systems to participate.
Although a parallel of some sort could be drawn, evaluation activities should not be mistakenly put on the same footing as concertation and dissemination activities. In particular, organizing an evaluation requires the ability to maintain high bandwidth communication with the participants on highly technical grounds, e.g. in order to finalize the evaluation metrics, whereas concertation activities can be successfully achieved with much lighter means judiciously distributed over time.
A mixed solution between the purely proactive and purely reactive solutions is possible. Some evaluation topics could be selected beforehand and published in the first call for proposals, while others could be defined according to the projects selected after the first call. The ELSE consortium favors the proactive approach and a single consortium, constituted as a network of evaluation organizers, with the support of the permanent European evaluation organization.
Note that if the classical EU contract scheme is used to implement evaluation, and if the participants are funded with the evaluator as the only contractor (all the participants and the corpus providers being its subcontractors), then the usual limitation imposed on the amount of resources devoted to "third party assistance" should be modified or waived. The amount of resources could exceed the allowed value simply because of the number of participants or the cost of the linguistic data needed for evaluation.
In order to include evaluation in the FP5 agenda, it is proposed to include this topic in the first call for proposals. Evaluation campaigns would have a 2-year duration, in order to allow for more progress and research work between two campaigns than in the DARPA ones. We expect a certain amount of delay in deploying the paradigm of evaluation because it will be the first time that it is used on such a scale and in the context of EU programs (3 years were necessary for DARPA to go from the drawing board to a real implementation for the speech recognition campaigns).
In a proactive scheme, the topics (related to both written and spoken language processing) should be selected beforehand and included in the call for proposals. These topics should cover both complete systems and system components, and should have links between them, thus allowing the progress obtained in one field to influence the development of another field. A straightforward way of implementing these links between topics is to have part of the evaluation data common to related topics, and therefore in the same language. The topics should be of interest to LE research, but also to LE industry. To that end, they should be proposed by a scientific committee and submitted to the appreciation of an industrial panel.
As a fallback option, there exists the possibility of including an evaluation task in each candidate project. The evaluation task would constitute a sort of concertation activity where provision would be made for the needs of an evaluation campaign. The resources needed could be contracted out or produced by a subset of the concerned projects. The possible evaluation topics would be determined by the nature of the research and technology projects running at a given time, maybe according to project clusters (possibly technology based) different from the existing project clusters, which are inspired by market considerations. In that case, management becomes more difficult because it is more distributed. It still requires a coordinating entity, which could, as a last resort, be a specific project. In this reactive scheme driven by the content of the accepted proposals, we may lose the benefits of capitalizing on the evaluation expertise over a long period of time, as there is no guarantee of continuity of the project content across framework programs.
The main areas of language engineering that are current central preoccupations of researchers and developers are [MLIM98]:
The RTD priorities for Human Language Technology in FP5 listed in [HLT98] are:
Out of the list of 31 possible candidate control tasks presented previously, we have pre-selected six.
The criteria for selecting these tasks were:
Finer selection criteria will have to be applied when implementing these control tasks. The criteria should at least be based on the potential number of participating systems and on the linguistic resources available when starting the evaluation campaign.
A comparison of these tasks with the researchers' preoccupations and the priorities of FP5 is summarized in the following two tables.
| | Multilinguality | Interactivity | Digital Content |
| Broadcast News | X | X | |
| Cross-Lingual Information Retrieval / Extraction | X | X | X |
| Text To Speech Synthesis | X | X | |
| Text Summarization | X | X | |
| Language Model Evaluation | X | | |
| Technique | X | X | |
| | Text | Speech | Image | Mono/Multilingual |
| Broadcast News | | X | | Mono |
| Cross-Lingual Information Retrieval / Extraction | X | | | Multi |
| Text To Speech Synthesis | | X | | Mono |
| Text Summarization | X | | | Mono |
| Language Model Evaluation | X | X | | Mono |
| Technique | X | | | Mono |
This table shows the relation between the six control tasks and their multimedia and multilinguality aspects.
Naturally, the previous list contains very broadly scoped control tasks. According to the needs, the tasks could be refined into more specific subtasks, or implemented in conjunction with other correlated subsidiary control tasks.
The following table presents the different types of data needed to implement the six control tasks under consideration.
| Control Task | By-product Data Resources |
| Broadcast News | Text transcription of speech signal (possibly time-aligned). |
| Cross-Lingual Information Retrieval / Extraction | Multilingual query/document pairs. |
| Text To Speech Synthesis | Speech signal for a text |
| Text Summarization | Document and summary pairs. |
| Language Model Evaluation | Word predictions (e.g. probability tagging). |
| Technique | Text with Part-Of-Speech tagging, Lemmatization, Syntactic annotation and Word Sense tagging. |
The evaluation of these six tasks will produce data resources (see table above). Out of the 36 possible cases of reuse of these resources between the six control tasks (counting reuse within the same task), 22 are actually practicable: in each of them, the data produced by one evaluation is of interest within the scope of another evaluation. Where data can be reused between two control tasks, it is important to remember that such reuse is subject to scheduling constraints when the tasks concerned are addressed in consecutive evaluation campaigns: the subsequent evaluation cannot start until the data produced by the preceding evaluation have been completely processed.
| Producer/Consumer | BNT | CLIR | TTS | SUMZ | LM | TECH |
| Broadcast News | Reuse | Reuse | Reuse | Reuse | Reuse | |
| Cross-Lingual Information Retrieval / Extraction | | Reuse | | | | |
| Text To Speech Synthesis | | | Reuse | | | |
| Text Summarization | Reuse | Reuse | Reuse | Reuse | Reuse | |
| Language Model Evaluation | Reuse | Reuse | Reuse | Reuse | Reuse | |
| Technique | Reuse | Reuse | Reuse | Reuse | Reuse |
Obviously, the data of one campaign can be used to start the following one for the same control task. This is why the diagonal of the previous table is filled in.
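The figures quoted above can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: it uses nothing but the number of "Reuse" entries in each producer row of the table, and the variable names are hypothetical.

```python
# Column abbreviations used in the producer/consumer table.
TASKS = ["BNT", "CLIR", "TTS", "SUMZ", "LM", "TECH"]

# Number of "Reuse" entries in each producer row of the table.
reuse_per_producer = {
    "BNT": 5,
    "CLIR": 1,
    "TTS": 1,
    "SUMZ": 5,
    "LM": 5,
    "TECH": 5,
}

# Every (producer, consumer) pair, including a task reusing its own data.
total_pairs = len(TASKS) ** 2
# Pairs for which the table indicates that reuse is possible.
reusable_pairs = sum(reuse_per_producer.values())
# Leaving out the diagonal (one campaign feeding the next campaign of the
# same control task) gives the number of genuine cross-task reuses.
cross_task_pairs = reusable_pairs - len(TASKS)

print(total_pairs, reusable_pairs, cross_task_pairs)  # 36 22 16
```

In other words, 16 of the 22 reuse cases involve two different control tasks, and these are the ones subject to the scheduling constraint between consecutive campaigns.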
Technology Evaluation tries to assess the performance and appropriateness of a technology for solving a problem that is well defined, simplified and abstracted. Usage Evaluation tries to assess the usability of a technology for solving a real problem in the field. It involves the end-users in the environment intended for the deployment of the system under test. In relation to four of the six previous control tasks for Technology Evaluation, we propose below a list of meaningful stand-alone control tasks for Usage Evaluation, whose results would complement the ones produced by Technology Evaluation. Two Technology Evaluation control tasks were left aside because they were deemed unsuitable for Usage Evaluation: they are associated with in-core functionality in existing language processing systems and concern generic functionality that could be contained in any of the Usage Evaluation control tasks proposed below.
| Control Tasks for Technology Evaluation | Control Tasks for Usage Evaluation |
| Broadcast News | Transcription of Virtual Meetings |
| Cross-Lingual Information Retrieval / Extraction | Multimodal tourist information |
| Text To Speech Synthesis | Text-to-speech for the blind |
| Text Summarization | Text summarization of financial newspapers |
| Language Model Evaluation | |
| Technique | |
These tasks were chosen with the knowledge that applications or prototypes exist. A major difficulty will be to find the participants for these stand-alone control tasks for Usage Evaluation.
If participants cannot be found, the value of these stand-alone tasks should be reconsidered: they may not correspond to any real demand, although it is also possible that the developers of the systems simply do not wish to participate.
For the two remaining control tasks (Language Model Evaluation and Technique), specific Usage Evaluation criteria will have to be found, based solely on their functionality as embedded modules (e.g. processing speed, language coverage etc.).
The differences between language, culture, environment and application will be parameters of the comparison process.
1. Transcription of Minutes of Virtual Meetings
We propose to use the following usage criteria as comparison points:
Ideally, the test material will be transcriptions of real, useful meetings whose participants want to achieve something during the meeting.
2. Multimedia Tourist Information
Multimedia tourist information systems can be compared with the following user-oriented criteria:
3. Text to Speech for the Blind
Text to speech systems for the blind can be compared with the following user-oriented criteria:
4. Summarization
Deployed summarization systems for a given domain or application can be compared with the following user-oriented criteria:
Instead of these six control tasks for Technology Evaluation and the four control tasks for Usage Evaluation, it is reasonable to propose a synthesis control task, which covers the whole spectrum of activities. The one we think could fulfill this goal is: news on demand, where multimedia material is searched for information that is relevant to a given query. The purpose of news on demand is to provide search and retrieve facilities for archived broadcast news material. A query is formulated and audio/video excerpts from past material are extracted from an archive database.
News on demand encompasses the major research directions that were identified:
This control task will also show how the priorities of FP5 are addressed in the field:
Natural interactivity to handle the details of the query and the navigation in the space of the response;
Finally this control task is itself a stand-alone control task for Usage Evaluation and thus allows for the implementation of Usage Evaluation in a very straightforward and natural way.
The work of ELSE was used for building the EvaLE (Evaluation in Language Engineering) proposal in answer to the first call of the Fifth Framework program of the European Commission. Some of the ELSE consortium members provided help for writing the proposal. The EvaLE consortium counts one ELSE participant among its members besides ELDA, which was also a contributor to ELSE. EvaLE was submitted in the IST program, in action line 1.1.2.-3.4 (Human Language Technologies), as a Research and Technological Development Project. The EvaLE participants are:
The objective of the project is to design and validate evaluation packages for several Human Language Technologies. Initially, the technologies under consideration are the ones recommended by the ELSE project. These evaluation packages will be made available at specific project milestones for organizing larger evaluation campaigns, involving other participants either among the partners of the projects or project clusters of the FP5/HLT program, or from laboratories developing technologies targeted for evaluation. The evaluation packages will also be made commercially available upon request for government agencies or industries wishing to organize evaluation campaigns. Finally, they will be distributed for industrial or public research entities wishing to evaluate a technology (possibly the one they develop) and compare it to the state-of-the-art. An evaluation package comprises the following items:
The control task which has been selected as presenting interesting features for both Speech and Text analysis is "News on Demand", i.e. indexing, search and interactive browsing of news material, using raw broadcast data (either radio or TV). In a very coarse-grained description, this task includes sub-tasks such as:
To address Multilinguality, the project plans to handle a common pivotal language (American English) on which all partners will test their technologies besides 3 other languages (French, Italian and Dutch).
The Evaluation Packages (EPs) will be validated by running full-fledged evaluation campaigns among the project participants, who will provide baseline reference results while testing the technologies which are part of their recognized domain of expertise.
To initiate the deployment of evaluation on a larger scale at the European level, EvaLE plans to open its second phase of evaluation campaigns to a limited number of participants from outside the project, on the basis of one per consortium member. This will complement dissemination activities in making the paradigm of evaluation known outside of the project, and will hopefully initiate a wider deployment of evaluation that could be supported by FP5 clustering activities. The objective of EvaLE is not only to produce Evaluation Packages, but also to initiate the spreading of the paradigm of evaluation in Language Engineering throughout Europe.
[ALMPR98] Gilles Adda, Josette Lecomte, Joseph Mariani, P. Paroubek, M. Rajman, "The GRACE French Part-of-Speech Tagging Evaluation Task", in Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
[AK98] Adam Kilgarriff, "SENSEVAL: An Exercise in Evaluating Word Sense Disambiguation Programs", in Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
[ATW96] Eric Atwell, "Comparative Evaluation of Grammatical Annotation Models", in Richard Sutcliffe, Heinz-Detlev Koch, and Anne McElligott (eds), "Industrial Parsing of Technical Manuals", Amsterdam, Rodopi, 1996.
[BLA94] E. Black, "A New Approach to Evaluating Broad-Coverage Parsers/Grammars of English", Proceedings of the International Conference on New Methods in Language Processing (NEMLAP'94), UMIST, Manchester, September 1994.
[DARPA99] Proceedings of the DARPA Broadcast News Workshop, February 28th-March 3rd 1999, Herndon, Virginia, USA.
[ELSED1] Patrick Paroubek, Niels Ole Bernsen, Marc Blasband, Nicoletta Calzolari, Jean-Pierre Chanod, Khalid Choukri, Laila Dybkjær, Robert Gaizauskas, Steven Krauwer, Isabelle de Lamberterie, Joseph Mariani, Klaus Netter, Martin Rajman, Antonio Zampolli, "Blueprint for a General Infrastructure for Natural Language Processing Systems Evaluation Using Semi-Automatic Quantitative Black Box Approach in a Multilingual Environment", ELSE LE4-8340 Deliverable D1.1, June 1999.
[GG97] Gregory Grefenstette (Editor), "Cross-Language Information Retrieval", Kluwer Academic Publishers, ISBN-0-7923-8122-X.
[HLT98] European Commission, Human Language Technology, "Proposal Concerning The IST Programme 1998-2002 (excerpts)", COM (98) 305 Final, 13 May 1998. http://www.linglink.lu/le/ist/ist/excerpts_ist_pgme.htm
[HAUS94] R. Hauser, "The Coordinators' Final Report on the First Morpholympics", LDV-FORUM, vol. 11-1, June 1994, ISSN 0172-9926.
[JARD98] Michèle Jardino, Frédéric Bimbot, Stéphane Igounet, Kamel Smaili, Imhed Zitouni, Marc El-Bèze, "A first evaluation campaign for language models", in Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada, May 1998.
[JD95] John Diver, "Principles and Practice of Language Engineering Project Clustering", Version 1.1 - Final, LINGLINK-LE1-1951, November 10th 1995.
[JP98] Joseph Polifroni et al., "Evaluation Methodology for a Telephone-Based Conversational System", LREC'98, ELRA, Granada, May 1998.
[KN95] Klaus Netter, Richard Crouch, Robert Gaizauskas, et al., "Interim Report of the Study Group on Assessment and Evaluation", April 1995.
[MITR98] Inderjeet Mani, David House, Gary Klein, Lynette Hirschman, Leo Obrst, Therese Firmin, Michael Chrzanowski, Beth Sundheim, "The TIPSTER SUMMAC Text Summarization Evaluation", Final Report, October 1998, MTR98W0000138, MITRE Corporation, McLean, Virginia, USA. http://www.itl.nist.gov/div894/894.02/related_projects/tipster_summac/final_rpt.html
[MLIM98] Eduard Hovy, Nancy Ide, Robert Frederking, Joseph Mariani, Antonio Zampolli, Editors, "Multilingual Information Management – Current Levels and Future Abilities", A study commissioned by the US National Science Foundation and also delivered to the European Commission’s Language Engineering Office and the US Defense Advanced Research Projects Agency, July 1998. http://www.cs.cmu.edu/~ref/mlim
[RS97] Richard Sproat (Editor), "Multilingual Text-To-Speech Synthesis - The Bell Labs Approach", Kluwer Academic Publishers, ISBN 0-7923-8027-4.
[SYLL97] S. J. Young, M. Adda-Decker, X. Aubert, C. Dugast, J.-L. Gauvain, D.J. Kershaw, L. Lamel, D. Leeuwen, D. Pye, A.J. Robinson, H.J.M. Steeneken, and P.C. Woodland, "Multilingual Large Vocabulary Speech Recognition: The European SQALE Project", Computer Speech and Language, Vol. 11, 1997.