POS Tagger Evaluation Protocol and Evaluation Workbench Demonstration.
As a sample illustration of how the evaluation paradigm can be applied, we provide here a demonstration of
the various steps composing a quantitative black-box evaluation procedure for Part-Of-Speech (POS) taggers
for 3 European languages: French, German and Italian. This procedure uses the evaluation workbench for POS taggers
that ELSE has developed (http://www.limsi.fr/TLP/ELSE/else-0.33/).
In the following we present:
- the time constraints,
- the detailed phases of the procedure,
- a demonstration of the POS tagger evaluation workbench,
- information about the resources needed and some potential participants.
1. Duration and Calendar
Since POS tagging is a language processing task that is well documented in the literature, for which numerous tools exist
and which has already been the object of a formal evaluation campaign (see the GRACE
evaluation campaign that was organized by CNRS for POS taggers of French), it should be possible to perform this evaluation
campaign within a one-year time frame, a period that was sufficient for SENSEVAL/ROMANSEVAL to perform a prototype evaluation of word sense tagging systems for English, French and Italian.
A possible calendar over 1 year could be the following:
| Training | September to November. |
| Dry Run | December to March. |
| Tests (up to collecting results) | April to June. |
| Tests cont. (from result computation and validation to the closing workshop) | July to September. |
| Post Evaluation Phase | - |
2. Detailed Protocol Phases
The deployment of the evaluation procedure comprises the following, now well-known, detailed steps:
The training phase.
- Recruiting the organizers.
- Recruiting the data providers (corpora and lexicons).
- Recruiting the participants.
- Collecting the data (corpora and lexicons). The roles that lexicons and corpora play in a quantitative black-box evaluation scheme are discussed respectively in ELSE deliverables D4 Lexicons for Evaluation and
D5 Corpora for Evaluation.
- The organizers communicate the evaluation protocol to the participants
(including the evaluation software, the data format validation tools and the first draft for
gold standard definition, i.e. the definition of the reference tagging embodied in the coding manual intended
for the human annotators who will build the reference material).
- The organizers obtain the formal agreement of the participants on the conditions of their
participation, as defined by the evaluation protocol (for legal aspects see ELSE deliverable
D2 Evaluation in the Field of Linguistic Engineering,
a Legal Approach).
- The organizers distribute the training data to the participants. The data need to be representative
of those that will be used for the tests, and must cover a large enough set of intra-domain variations,
with a size of at least 10 million words (untagged) for each language.
- The organizers collect the first feedback from the participants about the evaluation protocol
(metrics, tools and training data).
The dry run phase.
- The organizers have the reference material (a part of the dry run data selected
by the organizer and kept confidential) tagged by human annotators and validated.
- The organizers send the dry run data to the participants for tagging.
- The organizers collect the tagged dry run data and compute the results.
- Each participant receives their own results, which remain confidential.
Note that, in principle, the dry run results have no formal status, to prevent
a participant from using them for advertisement although they were obtained
with an intermediate version of the protocol (not necessarily validated by all participants).
A good way to synchronize the participants is to send the dry run data encrypted
and to give the password afterward to all the participants simultaneously; they are then
allowed a limited time to process the data. Synchrony between the participants is
important to ensure the fairness and transparency of the protocol.
- The organizers collect feedback from the participants and modify the evaluation
protocol accordingly; the protocol is then frozen at this stage.
- Each participant formally declares to the organizers their mode of participation (public or confidential)
in the final tests.
The Tests.
- The organizers have the reference material (a part of the test data selected
by the organizers and kept confidential) tagged by human annotators and validated.
- The organizers send the test data to the participants for tagging.
- The organizers collect the tagged test data and compute the results.
- Each participant receives their own results, which remain confidential
until they have been validated by the participant.
- The organizers collect the participants' validation of their results.
At this stage, the results of the participants who have chosen the public participation mode
can be disclosed.
- The participants compare their methods and exchange ideas during the workshop that closes the campaign
and whose attendance is restricted to the participants.
Post Evaluation Phase.
- All the training and reference material that can be distributed is packaged along with all the
by-product data of the evaluation (e.g. raw or transformed participant data) and made available for
distribution (e.g. through ELRA).
- An impact study is made to assess the benefits brought by the evaluation
campaign to the field: the increase in amount of annotated and validated
data, the identification of promising directions and new algorithms, new
products whose creation was a consequence of the evaluation campaign, new
actors and the progress made by the participants.
3. Demonstration of the POS tagger evaluation workbench
The processing of the data tagged by a participant comprises the following steps:
- Validation of the data format according to a syntax agreed between the participants and the organizers.
A tabular format, where data fields are organized in columns, is simpler to handle but requires specific parsers for
checking and for preserving consistency across manipulations. A better solution, which requires SGML-aware tools, is to define a DTD
(Document Type Definition) for the data. Formats specific to a database management system are not advised, as it is very
unlikely that all the participants would use the same system. Special attention must be given to the representation of special characters,
e.g. accented characters, and to the robustness of the format (adding redundant information in two different forms
is useful for integrity checking). For a more detailed discussion of data formats, please see ELSE deliverable D3 Data Formats for Evaluation.
- Validation of the mapping table between a participant tagset and the reference tagset. This table is provided by the participant. As the participant tagset is a priori confidential, the validation will concern only the part of the table containing the reference tags.
- Validation of the tags used in the tagged data. All the tags used in the data should be described in the mapping table.
- Aligning the data tagged by the participant and the reference data using the words defined by an atomic tokenization procedure. This step is required to guard against potential modifications of the original text material (word expansions, contractions or normalizations that could perturb the performance measure).
- Computing the results.
- Printing the results.
- Checking the statistical relevance of the results: are the differences observed between the systems the result of chance, or the consequence of real differences between the systems evaluated?
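As an informal illustration of the tag validation, alignment and result computation steps listed above, here is a minimal Python sketch. It is not the code of the ELSE workbench: the function names, the in-memory (word, tag) representation, the sample data and the assumption that each participant tag maps to exactly one reference tag are all hypothetical, and plain tagging accuracy stands in for whatever metrics the protocol finally adopts.

```python
from difflib import SequenceMatcher


def unknown_tags(tagged, mapping):
    """Return the participant tags used in the data but absent from the mapping table."""
    return sorted({tag for _, tag in tagged if tag not in mapping})


def align(reference, tagged):
    """Pair reference and participant tokens whose word forms match.

    Stretches where the participant altered the text (expansions,
    contractions, normalizations) are left out of the pairing.
    """
    ref_words = [word for word, _ in reference]
    sys_words = [word for word, _ in tagged]
    matcher = SequenceMatcher(a=ref_words, b=sys_words, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            pairs.append((reference[block.a + offset], tagged[block.b + offset]))
    return pairs


def accuracy(pairs, mapping):
    """Fraction of aligned tokens whose mapped participant tag equals the reference tag."""
    if not pairs:
        return 0.0
    correct = sum(
        1 for (_, ref_tag), (_, sys_tag) in pairs if mapping.get(sys_tag) == ref_tag
    )
    return correct / len(pairs)


# Hypothetical sample: reference tagging, participant output and mapping table.
reference = [("le", "DET"), ("chat", "NOUN"), ("dort", "VERB"), (".", "PUNCT")]
tagged = [("le", "d"), ("chat", "n"), ("dort", "n"), (".", "p")]
mapping = {"d": "DET", "n": "NOUN", "v": "VERB", "p": "PUNCT"}

missing = unknown_tags(tagged, mapping)  # tags to report back to the participant
pairs = align(reference, tagged)
score = accuracy(pairs, mapping)
print(missing, len(pairs), score)
```

Scoring only the aligned tokens keeps a participant's changes to the text from shifting every later comparison; how the unaligned stretches should be counted is a protocol decision that this sketch leaves open.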
We will now illustrate the previous steps using the ELSE evaluation workbench for POS taggers with data samples in the following languages (please follow the appropriate link):
- French
- German (soon available)
- Italian (soon available)
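The last processing step listed earlier in this section, checking the statistical relevance of the results, can be sketched as follows. The protocol does not say which test to use; the exact two-sided sign test below, applied to the tokens on which exactly one of two systems is right, is our assumption, and the function names are hypothetical.

```python
from math import comb


def discordant_counts(reference_tags, tags_a, tags_b):
    """Count the tokens that only system A, or only system B, tagged correctly."""
    only_a = sum(
        1 for r, a, b in zip(reference_tags, tags_a, tags_b) if a == r and b != r
    )
    only_b = sum(
        1 for r, a, b in zip(reference_tags, tags_a, tags_b) if b == r and a != r
    )
    return only_a, only_b


def sign_test_p_value(only_a, only_b):
    """Exact two-sided sign test on the discordant tokens.

    Under the hypothesis that both systems are equally good, each
    discordant token favours A or B with probability 1/2.
    """
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical tags for three tokens.
counts = discordant_counts(["N", "V", "D"], ["N", "V", "X"], ["N", "X", "X"])
print(counts, sign_test_p_value(*counts))
```

A small p-value suggests that the difference between the two systems is unlikely to be due to chance alone; the threshold to apply would have to be agreed in the protocol.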
4. Resources and potential participants
In the table below, we propose a preliminary list of potential participants in the evaluation campaign.
The taggers marked with (*) are freely available and downloadable from the URL indicated in the table.
For each system we indicate the language(s) it handles among the 3 selected for the demonstration.
Concerning the data needed for the evaluation, the training material can be obtained from ELRA. For instance, here is an excerpt of its catalog listing the corpora available (today, November 25th 1999) for French, Italian and German.
| Ref. ELRA | Name | Type & No. of entries | Language | Date |
| W0004 | ECI/MCI European Corpus Initiative | Multilingual Corpus 98 million words | Major European languages + Turkish, Japanese, Russian, Chinese, Malay, etc. | 01/09/96 |
| W0005 | ECI-ELSNET Italian & German tagged sub-corpus | Economy 17,000 words Politics 14,000 words Culture 18,000 words Sports 9,000 words Local Events 8,500 words | Italian & German | 01/09/96 |
| W0006 | MLCC - Multi-lingual corpus | Het Financieele Dagblad (8.5 million words) The Financial Times (30 million words) Le Monde (10 million words) Handelsblatt (33 million words) Il sole 24 Ore (1.88 million words) Expansion (10 million words) | Dutch, English, French, German, Italian, Spanish | 01/09/96 |
| W0007 | MLCC - Office of Official Publications of the European Communities (Parliamentary Debates + OJ) | Parallel corpus of translated documents in the nine European official languages, divided into 2 sub-corpora: written questions and parliamentary debates | Multilingual | 01/09/96 |
| W0008 | MTP annotated German Corpus (500000 Words from FAZ/ Die Zeit) | 500,000 words untagged: 2000 tagged: 8000 untagged: 3500 tagged: 12000 | German | 01/09/96 |
| W0011 | Tagged text in French (MEMODATA) Typographic tagging | 170 books | French | 23/01/97 |
| W0012 | Tagged text in French (MEMODATA) Morphologic tagging | 170 books | French | 23/01/97 |
| W0015 | Text corpus of "Le Monde" | Corpus from "Le Monde" newspaper. From 1 to 5 years of data are available. Each tape/year contains some 10 Mbytes of data per month (circa 120 Mbytes per year). | French | 15/09/97 |
| W0016 | Karl May Korpus (KMK) | Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May and consists of around 1.6 million words (divided into 9 sub-corpora of about 180,000 words each). | German | 28/11/97 |
| W0017 | MULTEXT JOC Corpus | This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains ca. 5 million words in English, French, German, Italian and Spanish (ca. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level. | English, French, German, Italian, Spanish | 23/11/98 |
| W0018 | ARCADE/ROMANSEVAL corpus | The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Commission). The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3700 contexts altogether. It comprises: semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian; and word-level alignment of all the occurrences of the test words between French and English. | English, French, Italian | 23/11/98 |
Lexicons (listing word forms and their associated tags) are not required for the campaign (the GRACE campaign did not use any),
but they help greatly in the tagging of the reference material and are very useful for providing a baseline performance metric
(tagging with only lexical lookup). The needed resources could likewise be obtained from ELRA. As an illustration, here is
an excerpt of its catalog listing the monolingual lexicons available (today, November 25th 1999) for French, Italian and German.
| Ref. ELRA | Name | Type & No. of entries | Language | Date |
| L0001 | DICO-MORPH_lemme. MEMODATA | Morpho-syntactic information 400,000 entries | French | 23/01/97 |
| L0006 | ILC Italian Morphological lexicon | Lexicon About 60,000 lemmas/lexical entries | Italian | 15/09/97 |
| L0010 | MULTEXT lexicons | This CD-ROM contains a set of lexicons developed in the MULTEXT project financed by the European Commission (LRE 62-050). The set contains the following languages: English, French, German, Italian and Spanish. English 66,214 Word forms French 306,795 Word forms German 233,861 Word forms Italian 145,530 Word forms Spanish 510,710 Word forms | English, French, German, Italian, Spanish | 23/11/98 |
| L0013 | THAMUS. Generic Italian dictionary (Consorzio per la linguistica computazionale) | i) Generic (canonical forms) 87,000 ii) Generic (inflected forms) 612,000 iii) Technical (canonical forms) 48,000 iv) Technical (inflected forms) 96,000 | Italian | 13/05/97 |
| L0018 | German lexicon (CORA) | Lexicon 466,300 | German | 23/01/97 |
| L0020 | DST Dictionary (CORA) 1) String dictionary 2) Optional extra sets: i) Part of speech (optional) ii) Gender, number, conjugation (optional) iii) Lemma (optional) iv) Semantical information (optional) v) Syntactical information (optional) vi) Prep/adv. phrases (optional) vii) Compound nouns (optional) 3) The whole dictionary | Generic Dictionary 550,000 inflected forms | French | 23/01/97 |
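To make the baseline mentioned above (tagging with only lexical lookup) more concrete, here is a minimal Python sketch. It assumes, hypothetically, that the lexicon has been loaded as a dictionary from word form to a list of candidate tags, that the first tag listed is the one to use, and that unknown words receive a default tag; none of these choices comes from the ELRA lexicons themselves.

```python
def baseline_tag(words, lexicon, default_tag="UNKNOWN"):
    """Tag each word with the first tag of its lexicon entry, ignoring context."""
    tagged = []
    for word in words:
        # Try the form as written, then its lowercase variant, then give up.
        candidates = lexicon.get(word) or lexicon.get(word.lower()) or [default_tag]
        tagged.append((word, candidates[0]))
    return tagged


# Hypothetical lexicon fragment: word form -> candidate tags.
lexicon = {
    "la": ["DET", "PRON", "NOUN"],
    "porte": ["NOUN", "VERB"],
    "ferme": ["VERB", "ADJ", "NOUN"],
}

baseline = baseline_tag(["La", "porte", "ferme", "mal"], lexicon)
print(baseline)
```

Scoring this output against the reference corpus gives a floor against which the participating taggers can be compared.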
For the annotation of the reference data, we propose to use the guidelines developed by EAGLES
and refined in MULTEXT for German and Italian, and for French to use the version of these guidelines
which was refined a step further in GRACE.
Note that the EAGLES/MULTEXT specifications for morpho-syntactic descriptions have been further
refined in the course of the PAROLE project and that the
latest improvements to the specifications added by this project or by later projects should
be considered before finalizing the morpho-syntactic description used for the reference formalism.
The reference corpus (for both the dry run and the test phase) should be annotated by at least two different
annotators, and the inter-annotator agreement should be measured and validated using the Kappa statistic,
as was done in SENSEVAL/ROMANSEVAL.
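As an illustration of the agreement measure, here is a minimal Python sketch of Cohen's kappa for two annotators who tagged the same tokens. The function name and the sample tags are hypothetical, and the exact variant of the Kappa statistic to be used in the campaign is not specified here.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("both annotators must tag the same non-empty token sequence")
    n = len(tags_a)
    # Proportion of tokens on which the two annotators chose the same tag.
    observed = sum(1 for a, b in zip(tags_a, tags_b) if a == b) / n
    # Agreement expected by chance, from each annotator's own tag frequencies.
    freq_a = Counter(tags_a)
    freq_b = Counter(tags_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same tag throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of four tokens.
annotator_1 = ["N", "N", "V", "V"]
annotator_2 = ["N", "N", "V", "N"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance; with more than two annotators a multi-rater variant would be needed.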
Note that the building of the reference corpus by the annotators will be greatly helped if a coding manual stating the basic rules and validation tests for assigning POS tags in context is used. Such a document was, for instance, produced for French during the
GRACE project (see ).
[Last updated: Tue Dec 26th 2000]