L I M S I - C N R S

LIMSI Spoken Language Processing Group (TLP)



E L S E

MIP
DFKI
ILC
EPFL
XRCE
SHEF
LIMSI
CECOJI
ELRA
ELSNET





POS Tagger Evaluation Protocol and Evaluation Workbench Demonstration.

As a sample illustration of how the evaluation paradigm can be applied, we provide here a demonstration of the various steps composing a quantitative black-box evaluation procedure for Part-Of-Speech (POS) taggers for three European languages: French, German and Italian. This procedure uses the evaluation workbench for POS taggers that ELSE has developed (http://www.limsi.fr/TLP/ELSE/else-0.33/). In the following we present:

  1. the time constraints,
  2. the detailed phases of the procedure,
  3. a demonstration of the POS tagger evaluation workbench,
  4. information about the resources needed and some potential participants.


1. Duration and Calendar

POS tagging is a language processing task that is well documented in the literature, for which numerous tools exist, and which has already been the object of a formal evaluation campaign (see the GRACE evaluation campaign that was organized by CNRS for POS taggers of French). It should therefore be possible to perform this evaluation campaign within a one-year time frame, a period that was sufficient for SENSEVAL/ROMANSEVAL to perform a prototype evaluation of word sense tagging systems for English, French and Italian.

A possible calendar over 1 year could be the following:

Training: September to November.
Dry Run: December to March.
Tests (up to collecting results): April to June.
Tests, continued (from result computation and validation to the closing workshop): July to September.
Post Evaluation Phase: -

2. Detailed Protocol Phases

The deployment of the evaluation procedure comprises the following, now well-known, detailed steps:

    The training phase.
  1. Recruiting the organizers.
  2. Recruiting the data providers (corpora and lexicons).
  3. Recruiting the participants.
  4. Collecting the data (corpus and lexicons). The roles that lexicons and corpora play in a quantitative black-box evaluation scheme are discussed respectively in ELSE deliverables D4 Lexicons for Evaluation and D5 Corpora for Evaluation.
  5. The organizers communicate the evaluation protocol to the participants (including the evaluation software, the data format validation tools and the first draft for gold standard definition, i.e. the definition of the reference tagging embodied in the coding manual intended for the human annotators who will build the reference material).
  6. The organizers obtain the formal agreement of the participants on the conditions of their participation, as defined by the evaluation protocol (for legal aspects see ELSE deliverable D2 Evaluation in the Field of Linguistic Engineering, a Legal Approach).
  7. The organizers distribute the training data to the participants. The data need to be representative of the data that will be used for the tests, and must cover a large enough set of intra-domain variations, with a size of at least 10 million words (untagged) for each language.
  8. The organizers collect the first feedback from the participants about the evaluation protocol (metrics, tools and training data).
    The dry run phase.
  9. The organizers have the reference material (a part of the dry run data selected by the organizer and kept confidential) tagged by human annotators and validated.
  10. The organizers send the dry run data to the participants for tagging.
  11. The organizers collect the tagged dry run data and compute the results.
  12. Each participant is sent his own results, which remain confidential. Note that in principle the dry run results have no formal status, to prevent a participant from using them for advertisement although they were obtained with an intermediary version of the protocol (not necessarily validated by all participants).
    A good way to synchronize the participants is to send the dry run data encrypted and then to give the password simultaneously to all the participants, who are allowed a limited time to process the data. Synchrony between the participants is important to ensure the fairness and transparency of the protocol.
  13. The organizers collect feedback from the participants and modify the evaluation protocol accordingly; the protocol is frozen at this stage.
  14. Each participant formally declares to the organizers his mode of participation (public or confidential) in the final tests.
    The Tests.
  15. The organizers have the reference material (a part of the test data selected by the organizers and kept confidential) tagged by human annotators and validated.
  16. The organizers send the test data to the participants for tagging.
  17. The organizers collect the tagged test data and compute the results.
  18. Each participant is sent his own results, which will remain confidential until they have been validated by the participant.
  19. The organizers collect the participants' validation of their results. At this stage, the results of the participants who have chosen the public participation mode can be disclosed.
  20. The participants compare their methods and exchange ideas during the workshop that closes the campaign and whose attendance is restricted to the participants.
    Post Evaluation Phase.
  21. All the training and reference material that can be distributed is packaged along with all the by-product data of the evaluation (e.g. raw or transformed participant data) and made available for distribution (e.g. through ELRA).
  22. An impact study is made to assess the benefits brought by the evaluation campaign to the field: the increase in amount of annotated and validated data, the identification of promising directions and new algorithms, new products whose creation was a consequence of the evaluation campaign, new actors and the progress made by the participants.

3. Demonstration of the POS tagger evaluation workbench

The processing of the data tagged by a participant comprises the following steps:
  1. Validation of the data format according to a syntax agreed between the participants and the organizers. A tabular format, where data fields are organized in columns, is simpler to handle but requires specific parsers for checking and for preserving consistency across manipulations. A better solution, which requires SGML-aware tools, is to define a DTD (Document Type Definition) for the data. Formats specific to a database management system are not advised, as it is very unlikely that all the participants would use the same system. Special attention must be given to the representation of special characters (e.g. accented characters) and to the robustness of the format (adding redundant information in two different forms is useful for integrity checking). For a more detailed discussion of data formats, please see ELSE deliverable D3 Data Formats for Evaluation.
  2. Validation of the mapping table between a participant tag set and the reference tagset. This table is provided by the participant. As the participant tagset is a priori confidential, the validation will concern only the part of the table containing the reference tags.
  3. Validation of the tags used in the tagged data. All the tags used in the data should be described in the mapping table.
  4. Aligning the data tagged by the participant with the reference data, using the words defined by an atomic tokenization procedure. This step is required to guard against potential modifications of the original text material (word expansions, contractions or normalization) that could perturb the performance measure.
  5. Computing the results.
  6. Printing the results.
  7. Checking the statistical relevance of the results: are the differences observed between the systems the result of chance, or the consequence of real differences between the systems evaluated?
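
Steps 3, 4, 5 and 7 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ELSE workbench itself: the function names, the one-to-one mapping from participant tags to reference tags, the use of `difflib` for the alignment and the exact sign test for significance are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher
from math import comb

def undeclared_tags(tagged, mapping):
    """Step 3: participant tags used in the data but absent from the mapping table."""
    return sorted({tag for _, tag in tagged if tag not in mapping})

def align(reference, tagged):
    """Step 4: pair reference and participant tags on identical word sequences.

    Both inputs are lists of (word, tag). Words that the participant added,
    removed or rewrote fall outside the matching blocks and are skipped.
    """
    ref_words = [word for word, _ in reference]
    sys_words = [word for word, _ in tagged]
    matcher = SequenceMatcher(a=ref_words, b=sys_words, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pairs.append((reference[block.a + k][1], tagged[block.b + k][1]))
    return pairs

def accuracy(pairs, mapping):
    """Step 5: share of aligned words whose mapped tag equals the reference tag."""
    if not pairs:
        return 0.0
    correct = sum(1 for ref_tag, sys_tag in pairs if mapping.get(sys_tag) == ref_tag)
    return correct / len(pairs)

def sign_test(only_a, only_b):
    """Step 7: two-sided exact sign test.

    only_a / only_b count the words tagged correctly by one system only.
    Returns the probability of a split at least this uneven under chance.
    """
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy data (invented for the example): reference tags and one participant's output.
reference = [("Le", "DET"), ("chat", "NOUN"), ("dort", "VERB"), (".", "PUNCT")]
tagged = [("Le", "D"), ("chat", "N"), ("dort", "N"), (".", "PCT")]
mapping = {"D": "DET", "N": "NOUN", "V": "VERB", "PCT": "PUNCT"}

print(undeclared_tags(tagged, mapping))
print(accuracy(align(reference, tagged), mapping))
print(sign_test(0, 10))
```

A real campaign would probably need a richer mapping (one participant tag may correspond to several reference tags) and an alignment that works on the atomic tokens defined by the protocol rather than on whole words.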

We will now illustrate the previous steps using the ELSE evaluation workbench for POS taggers, with data samples in the following languages (please follow the appropriate link):

  • French
  • German (soon available)
  • Italian (soon available)

4. Resources and potential participants

In the table below, we propose a preliminary list of potential participants in the evaluation campaign. The taggers marked with (*) are freely available and can be downloaded from the URL indicated in the table. For each system we indicate the language(s) it handles among the three selected for the demonstration.
System | Institution | Language
- | AT&T Bell Laboratories | Fr
- | GREYC, URA 1526 CNRS, U. de Caen | Fr
SYLEX | INGENIA S.A. | Fr
CRISTAL | CRISTAL-GRESEC, U. Stendhal (Grenoble) | Fr
CNET1 | CNET (Lannion) | Fr
XPOST | XRCE (Grenoble) | Fr, It, Ge
FipsTag | LATL, U. Genève | Fr
ECSta | LIA U. Avignon & LPL CNRS & U. de Provence (Aix en P.) | Fr
- | SYNAPSE S.A. | Fr
SPIRIT | TGID | Fr
CORDIAL | SYNAPSE Développement | Fr
PILAF | CLIPS-TRILAN, IMAG (Grenoble) | Fr
Tree Tagger (*) | ILR & IMS, U. Stuttgart | Fr, It, Ge
TAL | IBM | Fr
SAM-2 | LEXIQUEST | Fr
Reac | IRO, U. Montréal | Fr
winbrill | InaLF-CNRS | Fr
- | LIMSI | Fr
MULTEXT tagger (*) & TATOO (*) | ISSCO, ftp://issco-ftp.unige.ch (/pub/multext/ & /staff/robert/tatoo) | Fr, Ge
Brill (*) | http://www.cs.jhu.edu/~brill/code.html | En
TnT | U. des Saarlandes (Saarbrücken), http://www.coli.uni-sb.de/~thorsten/tnt/ | Ge
- | CL, U. Zürich, http://www.ifi.unizh.ch/CL/tagger/ | Ge

Concerning the data needed for the evaluation, the training material can be obtained from ELRA. For instance, here is an excerpt of its catalog listing the corpora available (today, November 25th 1999) for French, Italian and German.

Ref. ELRA: W0004
Name: ECI/MCI European Corpus Initiative Multilingual Corpus
Type & No of entries: 98 million words
Language: Major European languages + Turkish, Japanese, Russian, Chinese, Malay, etc.
Date: 01/09/96

Ref. ELRA: W0005
Name: ECI-ELSNET Italian & German tagged sub-corpus
Type & No of entries: Economy 17,000 words; Politics 14,000 words; Culture 18,000 words; Sports 9,000 words; Local Events 8,500 words
Language: Italian & German
Date: 01/09/96

Ref. ELRA: W0006
Name: MLCC - Multi-lingual corpus
Type & No of entries: Het Financieele Dagblad (8.5 million words); The Financial Times (30 million words); Le Monde (10 million words); Handelsblatt (33 million words); Il sole 24 Ore (1.88 million words); Expansion (10 million words)
Language: Dutch, English, French, German, Italian, Spanish
Date: 01/09/96

Ref. ELRA: W0007
Name: MLCC - Office of Official Publications of the European Communities (Parliamentary Debates + OJ)
Type & No of entries: Parallel corpus of translated documents in the nine European official languages, divided into 2 sub-corpora: written questions and parliamentary debates
Language: Multilingual
Date: 01/09/96

Ref. ELRA: W0008
Name: MTP annotated German Corpus (500000 Words from FAZ/ Die Zeit)
Type & No of entries: 500,000 words; untagged: 2000; tagged: 8000; untagged: 3500; tagged: 12000
Language: German
Date: 01/09/96

Ref. ELRA: W0011
Name: Tagged text in French (MEMODATA)
Type & No of entries: Typographic tagging; 170 books
Language: French
Date: 23/01/97

Ref. ELRA: W0012
Name: Tagged text in French (MEMODATA)
Type & No of entries: Morphologic tagging; 170 books
Language: French
Date: 23/01/97

Ref. ELRA: W0015
Name: Text corpus of "Le Monde"
Type & No of entries: Corpus from "Le Monde" newspaper. From 1 to 5 years of data are available. Each tape/year contains some 10 Mbytes of data per month (circa 120 Mbytes per year).
Language: French
Date: 15/09/97

Ref. ELRA: W0016
Name: Karl May Korpus (KMK)
Type & No of entries: Karl-May-Korpus is a German monolingual corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May and consists of around 1.6 million words (divided into 9 sub-corpora of about 180,000 words each).
Language: German
Date: 28/11/97

Ref. ELRA: W0017
Name: MULTEXT JOC Corpus
Type & No of entries: This CD-ROM contains a part of the corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains ca. 5 million words in English, French, German, Italian and Spanish (ca. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level.
Language: English, French, German, Italian, Spanish
Date: 23/11/98

Ref. ELRA: W0018
Name: ARCADE/ROMANSEVAL corpus
Type & No of entries: The corpus contains raw data from the JOC corpus developed in the MULTEXT project financed by the European Commission (LRE 62-050), composed of 1 million words in English and four Romance languages: French, Italian, Spanish and Portuguese (Written Questions and Answers from the Official Journal of the European Commission). The annotation concerns all the contexts of 60 different test words (20 nouns, 20 adjectives, 20 verbs), i.e. ca. 3700 contexts altogether. It comprises: semantic tagging of all the occurrences of the test words in the JOC corpus for French and Italian; and word-level alignment of all the occurrences of the test words between French and English.
Language: English, French, Italian
Date: 23/11/98

Lexicons (listing word forms and their associated tags) are not required for the campaign (the GRACE campaign did not use any), but they help greatly in the tagging of the reference material and are very useful for providing a baseline performance metric (tagging with only lexical lookup). The needed resources could also be obtained from ELRA. As an illustration, here is an excerpt of its catalog listing the monolingual lexicons available (today, November 25th 1999) for French, Italian and German.

Ref. ELRA: L0001
Name: DICO-MORPH_lemme. MEMODATA
Type & No of entries: Morpho-syntactic information; 400,000 entries
Language: French
Date: 23/01/97

Ref. ELRA: L0006
Name: ILC Italian Morphological lexicon
Type & No of entries: Lexicon; about 60,000 lemmas/lexical entries
Language: Italian
Date: 15/09/97

Ref. ELRA: L0010
Name: MULTEXT lexicons
Type & No of entries: This CD-ROM contains a set of lexicons developed in the MULTEXT project financed by the European Commission (LRE 62-050). The set contains the following languages: English, French, German, Italian and Spanish.
    English 66,214 word forms
    French 306,795 word forms
    German 233,861 word forms
    Italian 145,530 word forms
    Spanish 510,710 word forms
Language: English, French, German, Italian, Spanish
Date: 23/11/98

Ref. ELRA: L0013
Name: THAMUS. Generic Italian dictionary (Consorzio per la linguistica computazionale)
Type & No of entries:
    i) Generic (canonical forms) 87,000
    ii) Generic (inflected forms) 612,000
    iii) Technical (canonical forms) 48,000
    iv) Technical (inflected forms) 96,000
Language: Italian
Date: 13/05/97

Ref. ELRA: L0018
Name: German lexicon (CORA)
Type & No of entries: Lexicon; 466,300
Language: German
Date: 23/01/97

Ref. ELRA: L0020
Name: DST Dictionary (CORA)
    1) String dictionary
    2) Optional extra sets:
        i) Part of speech (optional)
        ii) Gender, number, conjugation (optional)
        iii) Lemma (optional)
        iv) Semantical information (optional)
        v) Syntactical information (optional)
        vi) Prep/adv. phrases (optional)
        vii) Compound nouns (optional)
    3) The whole dictionary
Type & No of entries: Generic Dictionary; 550,000 inflected forms
Language: French
Date: 23/01/97
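
The lexical-lookup baseline mentioned before the catalog excerpt can be sketched as follows. This is a minimal illustration under assumptions of our own: the lexicon is taken to be a dictionary from word form to a list of tags, the first listed tag is chosen, and unknown words receive a placeholder tag. The function name `lookup_baseline`, the `UNK` tag and the toy lexicon are hypothetical and do not come from any of the listed resources.

```python
def lookup_baseline(words, lexicon, unknown_tag="UNK"):
    """Tag each word with the first tag its lexicon entry lists.

    `lexicon` maps a word form to a list of possible tags. The exact form is
    tried first, then its lowercase variant; words found in neither way
    receive `unknown_tag`.
    """
    tagged = []
    for word in words:
        tags = lexicon.get(word) or lexicon.get(word.lower())
        tagged.append((word, tags[0] if tags else unknown_tag))
    return tagged

# Toy lexicon invented for the example.
toy_lexicon = {"le": ["DET", "PRON"], "chat": ["NOUN"], "dort": ["VERB"]}
print(lookup_baseline(["Le", "chat", "dort", "vite"], toy_lexicon))
```

Scoring this output against the reference material in the same way as a participant's output would give the baseline figure; no context is used, so ambiguous forms such as "le" are always resolved the same way.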

For the annotation of the reference data, we propose to use, for German and Italian, the guidelines developed by EAGLES and refined in MULTEXT and, for French, the version of these guidelines that was refined a step further in GRACE.

Language   Morpho-syntactic description specification
French     http://www.limsi.fr/TLP/grace/www/testevaltags.html
German     http://www.lpl.univ-aix.fr/projects/multext/LEX/LEX.LangSpec.de.html
Italian    http://www.lpl.univ-aix.fr/projects/multext/LEX/LEX.LangSpec.it.html

Note that the EAGLES/MULTEXT specifications for morpho-syntactic descriptions have been further refined in the course of the PAROLE project. The latest improvements made by this project or by later projects should be considered before finalizing the morpho-syntactic description used for the reference formalism.

The reference corpus (for both the dry run and the test phase) should be annotated by at least two different annotators, and the inter-annotator agreement should be measured and validated using the Kappa statistic, as was done in SENSEVAL/ROMANSEVAL.

SENSEVAL/ROMANSEVAL short presentation
SENSEVAL http://www.itri.brighton.ac.uk/events/senseval/
ROMANSEVAL http://www.lpl.univ-aix.fr/projects/romanseval/
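
As an illustration of the agreement measure mentioned above, here is a minimal sketch of Cohen's kappa for two annotators who tagged the same sequence of words. The choice of Cohen's variant (rather than another kappa statistic) and the function name `cohen_kappa` are assumptions made for this example; the text above does not specify which variant was used.

```python
from collections import Counter

def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two annotators over the same items.

    kappa = (observed - expected) / (1 - expected), where `observed` is the
    share of items given the same tag and `expected` is the agreement that
    the two annotators' tag frequencies would produce by chance.
    """
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("both annotators must tag the same non-empty item list")
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one single identical tag everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy annotations invented for the example.
annotator_1 = ["N", "V", "N", "D"]
annotator_2 = ["N", "V", "D", "D"]
print(cohen_kappa(annotator_1, annotator_2))
```

In this toy case the annotators agree on 3 of 4 words, chance agreement is 5/16, and kappa is therefore (3/4 - 5/16) / (1 - 5/16) = 7/11.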

Note that the building of the reference corpus by the annotators will be greatly helped if a coding manual is used, stating the basic rules and validation tests for assigning POS tags in context. Such a document was, for instance, produced for French during the GRACE project (see ).


[Last updated: Tue Dec 26th 2000]