Consolidating and Harmonising Treebank Annotation

Report on the CLARA Thematic Course on Consolidating and Harmonising Treebank Annotation, held in Prague on December 13–17, 2010.

Universität Tübingen led the scientific organization, in cooperation with Charles University Prague, which was the local host.

The course offered 5 ECTS credits and was open to external participants.

The program consisted of the following components:

  • Erhard Hinrichs and Kathrin Beck: Tübingen Treebank Resources
    The Tübingen suite of treebanks includes semi-automatically and fully automatically annotated corpora for spoken and written language. TüBa-D/S, TüBa-E/S and TüBa-J/S are treebanks of spoken German, English, and Japanese, which were annotated semi-automatically at the levels of POS tagging, syntax, and grammatical functions. The most elaborate Tübingen treebank is TüBa-D/Z, based on articles from the German newspaper taz. With 55 000 sentences (in release 6, released at the end of 2010), TüBa-D/Z is one of the largest and richest semi-automatically annotated treebanks created for German. It has been continuously extended, corrected and enriched with additional annotation layers since 2001. Annotation includes morphology, POS tags and syntax (including grammatical functions), named entities, and coreference. Further annotation layers for discourse connectors and the classification of named entities are currently in progress. The course provided an overview of all the treebanks mentioned above, focusing on the annotation principles, the annotation process, and the information contained in each annotation layer.
  • Aravind Joshi: Discourse Relations: Going Beyond the Domain of Sentences
    How does one go beyond the domain of sentences? An introduction was given to Discourse Relations (DRs), which are signaled by Discourse Connectives (DCs); these can be thought of as higher-level predicates taking abstract objects (such as events, situations, propositions) as their arguments. Then an overview was given of the Penn Discourse Treebank (PDTB), a corpus of about 1 million words annotated with DCs (explicit and implicit) and their arguments, the senses of the DRs, and the attributions of the arguments, among other pieces of information. A discussion followed on the dependency relations at the level of discourse relations and their comparison to syntactic dependencies. It was shown that the set of DRs does not appear to be a closed class, yet not completely an open class either. There are expressions which behave as DCs and can be thought of as Alternate Lexicalizations (AltLex) of discourse relations. These are annotated in PDTB. Finally, some applications of PDTB were presented.
  • Sandra Kübler: Querying Treebanks
    Treebanks are useful for finding specific syntactic phenomena that are difficult to detect in raw text. One example would be finding ditransitive verbs and their objects. But treebanks are only helpful if the annotation contains the information we are looking for and if there is a way of finding out how specific phenomena are annotated in the treebank. In this course, we looked at which phenomena we can find in treebanks and how to find them. Two search tools were demonstrated: TigerSearch and Steven Bird’s Treebank Search. The query languages used in the two tools were discussed with respect to their strengths and limitations as well as the phenomena that we can find in different corpora. This course was a practical one and students were encouraged to bring their laptops. (There is also an updated version of TigerSearch available here.)
  • Paul Meurer, Victoria Rosén and Koenraad de Smedt: Tools for Automatically Analyzed Corpora
    This course started with a short motivation for parsing corpora, based on construction principles and use cases. This was followed by a presentation of the context of the TREPIL and XPAR projects and the INESS infrastructure. The LFG Parsebanker, a comprehensive toolkit for the interactive, incremental construction of a treebank as a parsed corpus using the XLE parsing tool, was demonstrated. This web-based toolkit offers an environment for batch and interactive parsing, versioning, inspection of structures, discriminant-based disambiguation and a structural search facility. The tool is suited for any language with an XLE-based LFG grammar. Parallel treebanks can also be constructed, and a dependency treebank mode is under development.
  • Detmar Meurers: Detecting Errors in Corpus Annotation
    Large corpora that are annotated with various types of linguistic annotation are central to computational linguistics and arguably also to theoretical linguistics. They play a crucial role as training and testing data for a wide range of natural language processing algorithms, and they provide access to natural examples relevant for creating and testing linguistic theories. At the same time, the “gold standard” annotations used for these purposes contain a significant number of errors, which have been shown to negatively affect both kinds of use. As a step towards addressing this situation, we can deploy an automatic method for detecting errors in annotated corpora that is generally applicable to corpora with a wide range of annotation schemes. An introduction was given to the approach developed by Detmar Meurers in collaboration with Markus Dickinson and Adriane Boyd, based on the idea that data recurring within a comparable context should be annotated the same way in all occurrences. Variation in the annotation within similar contexts is thus likely to be erroneous. The applicability of this variation n-gram method was illustrated by the fact that it can detect errors with high precision for a range of annotation types, including positional (part-of-speech), tree-based syntactic, discontinuous syntactic, and dependency annotation.
  • Martha Palmer: From Propositions to Event Descriptions
    The PropBank annotated data has contributed significantly to the improvement in our ability to detect semantic roles. However, detecting semantic roles in individual predicate-argument structures is just the first step towards realizing actual event descriptions, including coreference with previous mentions of the same event. VerbNet provided a key resource for the recent ACL paper on recovering implicit arguments, and further steps in this direction and towards building richer event descriptions were discussed. The latest enhancements to VerbNet, which include greater coverage, a simplification and regularization of syntactic frame descriptions and thematic roles, and plans for generalizing semantic predicates, were presented. The talk also described SemLink, an effort to map between complementary lexical resources: WordNet, FrameNet, VerbNet and PropBank. The goal is to develop a broad-coverage, unified English resource that has the fine granularity and rich semantics of WordNet and FrameNet, that is a platform for syntactically based semantic generalizations derived from VerbNet, and that provides PropBank-like broad-coverage training data for supervised machine learning techniques. SemLink should provide a necessary foundation for building richer event descriptions.
  • Jan Hajič, Eva Hajičová, Silvie Cinková, Martin Popel, Jan Štěpánek and Zdeněk Žabokrtský: Prague Dependency Treebank Tutorial: Annotation and Technology
    The tutorial introduced the Prague Dependency Treebank project, which aims at a complex manual annotation of a substantial amount of naturally occurring sentences in continuous Czech texts. The Prague Dependency Treebank has three layers of annotation: morphological, analytical (describing surface syntax in a dependency fashion) and tectogrammatical, which combines syntax and sentence semantics into a language meaning representation, keeping the dependency structure as the core of the annotation structure but adding basic coreferential links, topic/focus annotation, and a detailed semantic labeling of every sentence unit. The Prague Czech-English Dependency Treebank was introduced as well. In addition to the data, the treebank and data processing tools were discussed.
    This tutorial was intended for students, researchers, and practitioners in natural language processing who want to see how the broadly annotated data and the annotation and data processing tools have been built in the Prague treebanking projects. The fact that the annotations and tools can be used in a general way was a strong motivation for all attendees.
  • Barbora Hladká and Jiří Mírovský: Play the Language. An Alternative Way of Annotation
    Collecting high-quality data is resource-demanding regardless of the area of research and the type of data. This course presented Internet games and applications whose purpose is to enrich text data with various types of annotation. In addition, a game competition was organized.
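
The idea behind the variation n-gram method summarized under "Detecting Errors in Corpus Annotation" above can be illustrated with a small Python sketch. This is a simplified illustration for part-of-speech annotation only, not the authors' implementation: the function name variation_ngrams, the toy corpus and its tags are assumptions made for this example. It records, for every word n-gram, the tags seen at each position and flags positions where the same words in the same context received different tags.

```python
from collections import defaultdict


def variation_ngrams(tagged_sentences, n=3):
    """Map (word n-gram, position) to the set of tags seen at that position,
    keeping only positions whose tag varies across occurrences."""
    seen = defaultdict(set)
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        tags = [t for _, t in sent]
        for i in range(len(sent) - n + 1):
            gram = tuple(words[i:i + n])
            for k in range(n):
                seen[(gram, k)].add(tags[i + k])
    # Identical words in identical context with more than one tag are suspicious.
    return {key: tagset for key, tagset in seen.items() if len(tagset) > 1}


# Toy corpus (hypothetical data): "fish" is tagged VB in one sentence
# and NN in another, within the same trigram "can fish here".
corpus = [
    [("I", "PRP"), ("can", "MD"), ("fish", "VB"), ("here", "RB")],
    [("They", "PRP"), ("can", "MD"), ("fish", "NN"), ("here", "RB")],
    [("We", "PRP"), ("can", "MD"), ("swim", "VB"), ("here", "RB")],
]

flagged = variation_ngrams(corpus, n=3)
for (gram, k), tagset in flagged.items():
    print(" ".join(gram), "| position", k, "| tags", sorted(tagset))
```

A flagged position is only a candidate: genuine ambiguity can also produce variation, so longer n-grams (more shared context) make a true annotation error more likely, at the cost of fewer hits.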

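The ditransitive example mentioned under "Querying Treebanks" above can be sketched in plain Python. This sketch does not use TigerSearch or its query language; it assumes Penn-style bracketed trees and a crude heuristic (a VP that directly dominates a verb and at least two NPs), and all function names and example sentences are hypothetical.

```python
def parse(bracketed):
    """Parse one bracketed tree into nested (label, children) tuples;
    leaves are plain strings."""
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words dominated by a node, in order."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]


def subtrees(node):
    """Yield every non-leaf node of the tree."""
    if isinstance(node, str):
        return
    yield node
    for child in node[1]:
        yield from subtrees(child)


def ditransitives(tree):
    """Find VPs that directly dominate a verb and two NP objects."""
    hits = []
    for label, children in subtrees(tree):
        if label != "VP":
            continue
        kids = [c for c in children if not isinstance(c, str)]
        verbs = [c for c in kids if c[0].startswith("VB")]
        nps = [c for c in kids if c[0] == "NP"]
        if verbs and len(nps) >= 2:
            objects = [" ".join(leaves(np)) for np in nps[:2]]
            hits.append((" ".join(leaves(verbs[0])), objects))
    return hits


examples = [
    "(S (NP (PRP She)) (VP (VBD gave) (NP (DT the) (NN student)) (NP (DT a) (NN book))))",
    "(S (NP (PRP He)) (VP (VBD read) (NP (DT a) (NN book))))",
]
results = [ditransitives(parse(t)) for t in examples]
print(results)
```

Whether such a pattern finds the intended phenomenon depends entirely on how the treebank annotates it, which is the point made in the course description: a treebank that encodes objects through function labels or dependency edges would need a different query.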

All presentations are available in a single zip file here.

Monday, Dec 13 (Mamaison Hotel Riverside Prague)
9:20-9:30 Erhard Hinrichs and Jan Hajič: Opening
9:30-11:00 Erhard Hinrichs and Kathrin Beck: Tübingen Treebank Resources (Slides 1, Slides 2)
11:30-13:00 Paul Meurer, Victoria Rosén and Koenraad de Smedt: Tools for Automatically Analyzed Corpora (Slides 1, Slides 2)
14:30-16:00 Paul Meurer, Victoria Rosén and Koenraad de Smedt
16:30-18:00 Detmar Meurers: Detecting Errors in Corpus Annotation (Slides 1, Slides 2, Slides 3)
Tuesday, Dec 14 (Mamaison Hotel Riverside Prague)
9:30-11:00 Erhard Hinrichs and Kathrin Beck
11:30-13:00 Sandra Kübler: Querying Treebanks (Slides)
14:30-16:00 Detmar Meurers
16:30-18:00 Paul Meurer, Victoria Rosén and Koenraad de Smedt
Wednesday, Dec 15 (Mamaison Hotel Riverside Prague)
9:30-11:00 Sandra Kübler
11:30-12:30 Barbora Hladká and Jiří Mírovský: Play the Language. An Alternative Way of Annotation (Slides)
14:00-15:30 Prague Dependency Treebank Tutorial: Technology
Jan Štěpánek: Tred Editor and PML-TQ Query Engine and Query Language (Slides)
16:00-17:00 Prague Dependency Treebank Tutorial: Technology
Zdeněk Žabokrtský and Martin Popel: Introduction to TectoMT (Slides)
from 19:30 Workshop dinner (Konírna restaurant)
Thursday, Dec 16 (Refectory, School of Computer Science)
9:30-12:50 Prague Dependency Treebank Tutorial: Data
9:30-10:50   Jan Hajič: Introduction; Three Layers of Annotation: Morphology, Surface And Deep Syntax  (Slides)
11:20-11:50   Eva Hajičová: Topic-Focus Articulation  (Slides)
11:50-12:20   Zdeněk Žabokrtský: Grammatemes; Coreference (Slides)
12:20-12:50   Silvie Cinková: Prague Czech-English Dependency Treebank (Slides)
12:50-13:00 A Competition before Christmas 2010: A Medal Ceremony
14:30-16:00 Martha Palmer: From Propositions to Event Descriptions (Slides)
16:30-18:00 Aravind Joshi: Discourse corpus (Slides 1, Slides 2, Slides 3)

