Recent advances in technology and widespread research efforts have expanded both the size of corpora and the extent of their annotations, e.g. in the areas of deep syntactic annotation (treebanks), semantic and pragmatic annotation, and multilingual (parallel) corpora; various speech and multimodal corpora are also becoming available. From corpora as basic resources, other resources are being derived, e.g. lexicons, frequency lists, word nets and term banks. Although a large number of language resources have been produced to date, many scientific and organizational challenges remain, including the following:
- Theories and modeling approaches have not yet been applied to a wide range of languages, and some languages or language types (e.g. morphologically rich languages) may present special challenges.
- Parsers and other tools tend to be language-specific (for English in particular), and many tools for creating modules, resources and applications impose restrictions on their further use by SMEs and researchers.
- The gap between academic models and the needs of industrial actors who aim at real-life applications remains to be bridged.
- The standardization and compatibility of language resources are still inadequate, despite the existence of metadata and integration initiatives like IMDI and DAM-LR, coding and annotation practices like XML and the TEI guidelines, and semantic interoperability initiatives like ISO TC37/SC4 and LIRICS.
- Many resources lack appropriate documentation, and there is no good overview of the resources available for all the European languages.
- Since some resources are developed for specific purposes, converting them so that they can be reused for other purposes is a challenge.
- Access and use in R&D are subject to a multitude of differing conditions and restrictions.
- The long-term preservation of language resources needs to be secured.
- Efficiency issues in accessing language resources in very large repositories must be addressed.
CLARA has the following scientific and technological objectives in the area of language infrastructures:
- further work on standardization of coding and annotation practices and promotion of standards
- development of registries and documentation systems for language resources
- transfer and integration of single-purpose resources to interoperable, reusable and extendable forms
- development of transnational legal and organizational frameworks for simplified access
- conceptual and technical models for efficient access and preservation of language resources
CLARA furthermore aims at achieving the following objectives:
- Transfer and extension of modeling approaches and tools to different languages and language types, through synergies between partners with complementary approaches or between a partner that has tools and a partner that has data.
- New insights into commonalities and differences between languages through data-based comparative and cross-lingual studies, through synergies between partners working on different languages.
- Development and testing of systems in real-life settings, through synergies between academic and industrial partners.
- Encouragement of researchers and SMEs to produce parsers and language processing modules that are mutually compatible for all relevant languages.
The training provided by CLARA is thematically linked to the goals and methods of the CLARIN project, an EC-funded infrastructure project on the ESFRI roadmap.