Joint Research Project
PECO2824
"Language Processing Technologies for Slavic Languages"
(LaTeSlav)

Summary of the Joint Research Project

Computer processing of natural language is one of the prerequisites of advanced information processing systems. One of the main application areas for language technology is document preparation. High-level word processing technology has been developed mainly for English and partly for other Western languages and Japanese, but is almost completely missing for Slavic languages. The objective of this project is to transfer state-of-the-art natural language technology to two of them: Czech (representing Western Slavic) and Bulgarian (Southern Slavic). The research in the project will be almost purely application-driven: the practical outcome of the project will be prototypes of grammar checkers for Bulgarian and Czech.

To meet the main goal of the project, the relevant differences between the investigated Slavic languages and the languages for which such processing technologies already exist (mainly English) will be explored where necessary. The primary focus will be the interplay between syntax and word order, in which the two groups of languages differ typologically.

The academic partner in Germany will transfer descriptive formalisms and processing methods for free word order languages to the academic partners in Bulgaria and the Czech Republic. These formalisms and methods, which have already been applied experimentally to small grammars of Bulgarian and Czech, will then be combined in a joint effort with the linguistic theories developed by the academic partners in Prague and Sofia. This includes an application-driven investigation of the role of free word order in syntax and semantics and of other problems specific to Slavic languages.

The outcome of these efforts will be pilot implementations of grammar checkers for Bulgarian and Czech, which will be handed over to the industrial partners in Prague and Sofia. The industrial partners will use these pilot implementations together with existing low-level language technology (dictionaries with morphology and spelling checkers) to develop prototypes of grammar checkers for Bulgarian and Czech.

Project Sites and Their Respective Tasks

Here we summarize the main tasks of each project site, in particular those that the site will coordinate. It must be taken into account, however, that several tasks will involve close collaboration among the partners. For a detailed description of the work packages and the connections among them, please refer to Section B.6.

University of Saarland (Hans Uszkoreit)

Coordination of the project;
Application-driven research in the subfields of syntactic structure and word order of Bulgarian and Czech (together with their comparison to English and German, if needed for the transfer of language processing technologies);
Application-driven research concerning descriptive devices and parsing methods for partially free word order;
Pilot implementation of grammar checkers for Bulgarian and Czech.

Charles University (Eva Hajicová)

Application-driven research in the areas of word order and syntactic structure of Czech (together with comparison to English, if needed for the transfer of language processing technologies);
Practical study of the most common syntactic errors and their sources in Czech;
Data-driven error analysis and error recovery strategies;
Application of the Prague School theories and analyses to the tasks of the project;
Programming support of the parsers.

Bulgarian Academy of Sciences (Iordan Penchev)

Application-driven research in the areas of word order and syntactic structure of Bulgarian (together with comparison to English and German, if needed for the transfer of language processing technologies);
Practical study of the most common syntactic errors and their sources in Bulgarian;
Data-driven error analysis and error recovery strategies;
Application of previously developed theories and analyses of Bulgarian to the tasks of the project.

Autonomous University of Barcelona (Sergio Balari-Ravera)

Comparison of existing grammar checkers;
Design of error recovery strategies.

Bulgarian Business Systems, Ltd. (Nikolai Savov)

Development of a prototype of a commercial version of a grammar checker for Bulgarian and its linking to the Bulgarian version of a word processor to be selected according to the market requirements.

Macron Praha, Ltd. (Pavel Novák)

Development of a prototype of a commercial version of a grammar checker for Czech and its linking to the Czech version of a word processor to be selected according to the market requirements.

Motivation for the Application Domain

Grammar checkers constitute an excellent application domain for high-level language technology. This observation is based on the following five reasons:

A useful application does not require a degree of reliability that cannot be guaranteed at the current state of the art. There is no program today that can derive a complete and correct syntactic analysis for every sentence it may be confronted with. As long as a grammar checker is robust in the sense that it does not break down if a full analysis cannot be obtained, it can be useful -- in particular if it is able to detect potential errors in partially analyzed sentences.
A grammar checker does not require linguistic competence in semantics and world knowledge in order to be useful. Many other prospective linguistic applications, such as text understanding or abstracting systems, have not yet made it into useful products because they would need better semantic capabilities and processing of world knowledge before they could satisfy users' needs. The limitations of current language technology in the areas of semantics and world knowledge turn out to be a true bottleneck for many application domains. Clearly, grammar checker technology will also benefit immensely from progress in computational semantics, but even the detection of errors on purely syntactic grounds is a very useful functionality.
Since the usefulness of a grammar checker grows as the number of detected error types increases and the number of "false alarms" decreases, this product type can be improved gradually from version to version.
The application does not require language generation capabilities. Parsing technology is much better developed than generation.
Researchers do not need to convince industry of the feasibility and market chances of the application since there are already some well-selling products.

The last reason is particularly important given the special conditions in the new or establishing free market economies of Central and Eastern Europe, where most software companies are young enterprises that need fast returns. Word processing technology is the software application domain with the most immediate growth potential.

Topics of the Joint Research Project

The goal of the project is to provide state-of-the-art high-level language technology for Slavic languages. In the context of this project, high-level language technology refers to technology that exploits syntactic and semantic linguistic knowledge, as opposed to low-level technology that is based on the spelling system and dictionaries with morphology.

As feasible applications, grammar checkers for Bulgarian and Czech have been selected. The applications will provide the focus of the proposed research. They will also serve as an evaluation measure for the success of the project. Finally, they are a means of strengthening the connections between the academic research partners and the software industry.

Building a grammar checker as an industrial product would go far beyond the scope of a research project of the size proposed. It would also violate the precompetitive nature of the program. The goal is therefore to provide prototypes that can be used by the commercial partners as a starting point for product development.

Thus, the research in the project will be restricted to tasks that contribute directly to the prototypes of grammar checkers. In the long run, however, high-level language technology includes the adaptation of methods of syntactic and semantic computer processing (parsing as well as generation) to the target languages, which is a necessary prerequisite for the development of industrial systems that also include:

Grammar-based word processing (a grammar-based editor capable of, e.g., replacing all occurrences of a given noun or nominal phrase in all its morphological forms (cases) with another nominal phrase in the appropriate form, including any resulting changes in grammatical agreement);
Means for style checking (e.g., occurrence of excessively long phrases, lexical repetitions) and testing text coherence (e.g., fluency of theme/rheme articulation between any two consecutive sentences) of documents written in a natural language;
Natural language interfaces to different types of software systems;
Natural language based information retrieval;
Automatic abstracting;
Possibly other aids for application areas such as computer-driven or computer-assisted machine translation, computer-assisted language teaching and learning, etc.
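The style-checking item in the list above (excessively long phrases, lexical repetitions) can be illustrated by a minimal sketch. It is not part of the project's software: the function names, the thresholds, and the naive sentence splitter are assumptions made purely for illustration. It also compares surface word forms only, so for inflected languages such as Czech or Bulgarian a real checker would first have to map word forms to lemmas using the low-level morphological technology.

```python
import re
from collections import Counter

# Hypothetical thresholds, chosen for illustration only.
MAX_WORDS_PER_SENTENCE = 30
MAX_REPEATS_PER_SENTENCE = 2


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Return lowercased word tokens (Unicode-aware, so Cyrillic and Czech letters are kept)."""
    return re.findall(r"\w+", sentence.lower())


def check_style(text, min_word_length=5):
    """Return a list of (sentence_index, warning_type, detail) tuples."""
    warnings = []
    for index, sentence in enumerate(split_sentences(text)):
        words = tokenize(sentence)
        # Warning 1: the sentence is longer than the chosen limit.
        if len(words) > MAX_WORDS_PER_SENTENCE:
            warnings.append((index, "long_sentence", len(words)))
        # Warning 2: a longer word is repeated too often within the sentence.
        # Short words are skipped as a crude stand-in for a function-word list.
        counts = Counter(w for w in words if len(w) >= min_word_length)
        for word, count in sorted(counts.items()):
            if count > MAX_REPEATS_PER_SENTENCE:
                warnings.append((index, "repetition", word))
    return warnings
```

Calling `check_style(document_text)` returns a list of warnings that an editor could display next to the offending sentences. The sketch needs no syntactic analysis, which is why style checks of this kind are comparatively easy to build on top of existing low-level technology.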

For its immediate goals, the project relies on the fact that the necessary low-level word processing technologies (support for national alphabets, morphological analysis and synthesis, spelling checkers, automatic dictionaries) have already been developed independently for the languages in question by the academic and commercial partners. These partners have also expressed a firm interest in extending the currently available industrial word processing software in the directions mentioned above.

Research Areas

research in computer formalisms suitable for processing free word order languages, with particular respect to Slavic languages
research in computer processing of ill-formed natural language input
linguistic research of relevant phenomena of Czech
linguistic research of relevant phenomena of Bulgarian

Expected Implementation Results

prototype of a grammar checker for Czech.
prototype of a grammar checker for Bulgarian.

Technical and Financial Issues

This project is being funded within the PECO framework of the Commission of the European Communities, with an overall contribution of the Commission amounting to 429.999,99 ECU. The supervision of the project on the side of the Commission has been assigned to DG XIII in its Brussels headquarters. The responsible project officer is Ms. Josephine Reimann-Pijls; the reviewers are Prof. Anna Sagvall-Hein and Prof. Gerard Kempen.

Important Addresses

Sergio Balari-Ravera
Departament de Filologia Catalana
Facultat de Lletres, Edifici B
Universitat Autònoma de Barcelona
Campus de Bellaterra
08193 Bellaterra (Barcelona)
ilftg@cc.uab.es

Eva Hajicová
Institute of Formal and Applied Linguistics,
Faculty of Mathematics and Physics, Charles University,
Malostranske nam. 25,
CZ-118 00 Praha 1 - Mala Strana
hajicova@ufal.mff.cuni.cz

Gerard Kempen
University of Leiden
Cognitive Psychology
Pieter de la Court Building
Postbus 9555
NL-2300 Leiden
kempen@rulfsw.leidenuniv.nl

Pavel A.C. Novak
Macron Ltd.
Nad Petruskou 1
CZ-120 00 Praha 2 - Vinohrady
AC@macron.cz

Iordan Penchev
Institute of Bulgarian Language
Bulgarian Academy of Sciences
acad. G. Bonchev St. bl. 25A
BG1113 - Sofia
Bulgaria
jpen@bgearn.bitnet

Josephine Reimann-Pijls
EC Brussels
BU31 2/58
200, Rue de la Loi
B-1049 Bruxelles
jpi@dg13.cec.be

Anna Sagvall-Hein
Dept. of Linguistics
University Uppsala
BGX 513
S-75120 Uppsala
UDUAS@mvs.udac.uu.se

Nikolai Savov
Bulgarian Business Systems Ltd.
Dragan Tsankov Blvd. 36
BG - 1057 Sofia
BULG.GM@AppleLink.Apple.COM

Hans Uszkoreit
Computational Linguistics
University of Saarland
P.O. Box 15 11 50
D-66041 Saarbrücken
Germany
uszkoreit@coli.uni-sb.de
