Uplug - a modular corpus tool for parallel corpora

Jörg Tiedemann

The Uplug system was developed at Uppsala University within the PLUG project. The project's aim is to develop, evaluate, and apply several approaches to generating translation data from bilingual text [PLUG98a, PLUG98b].

A common project corpus was established and aligned at the sentence level. It includes text resources from three genres: technical documentation, political/administrative text, and literary text. Each part of the corpus has Swedish as either its source language or its target language [PLUG98c].

The Uplug system is Uppsala's contribution to the common word alignment system, which is currently under development.

The purpose of this software is to provide a modular platform for the integration of text processing tools. Special attention is given to developing a general, multi-purpose system that supports further extensions. The current version of the Uplug system, however, is intended for processing bilingual texts from the project's corpus. It includes three components: an extensible I/O library that provides a transparent interface for working with textual data, a tool for combining data processing modules into sequentially executable systems, and a graphical user interface, based on Perl/Tk, for working with Uplug-like modules.
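The idea of combining data processing modules into a sequentially executable system can be illustrated with a minimal sketch. It is written in Python for brevity and does not reproduce Uplug's actual module interface: the record layout, the `tokenize` and `lowercase` modules, and the `run_pipeline` function are all hypothetical names chosen for this example.

```python
# Illustrative sketch only: a "module" is assumed to be any function that
# takes a list of records (dicts) and returns a new list of records.
# None of these names come from the Uplug system itself.


def tokenize(records):
    """Hypothetical module: split source and target sentences on whitespace."""
    return [
        {**r, "src_tokens": r["src"].split(), "trg_tokens": r["trg"].split()}
        for r in records
    ]


def lowercase(records):
    """Hypothetical module: lowercase the tokens produced by `tokenize`."""
    return [
        {
            **r,
            "src_tokens": [t.lower() for t in r["src_tokens"]],
            "trg_tokens": [t.lower() for t in r["trg_tokens"]],
        }
        for r in records
    ]


def run_pipeline(modules, records):
    """Run the modules in order and keep each step's output for inspection."""
    intermediate = []
    for module in modules:
        records = module(records)
        intermediate.append((module.__name__, records))
    return records, intermediate
```

Calling `run_pipeline([tokenize, lowercase], bitext)` on a list of sentence pairs returns the final records together with the output of every step, so a later run could start from a stored intermediate result instead of the raw text.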

All of these components are still under development. A prototype is currently used at Uppsala University to generate different kinds of data from parallel texts and to examine and evaluate the results. The main task is the extraction of translation equivalents, in the form of word and phrase alignments, from sentence-aligned bitexts. A project-specific XML format was developed, and the system's I/O library supports it.

The extraction system consists of a set of modules and sub-systems, so the extraction process can be executed in several steps and intermediate results can be re-used. Each module produces data that can be stored in an appropriate format. Data can also be linked to a local database management system through a transparent database interface, which is part of the I/O component of the Uplug system. All data can be examined easily, and the system can convert it to the other data formats supported by the I/O library. The extraction system is an example of a complex text processing application that can be easily integrated into the Uplug system: experimental results can be produced efficiently, and new parameters and modules can be added with simple modifications.
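This text does not say which method the extraction modules use. Purely as an illustration of how candidate translation equivalents could be derived from a sentence-aligned bitext, the sketch below scores word pairs with the Dice coefficient over sentence-level co-occurrence. This is a standard association measure chosen for the example, not a description of the Uplug modules; the function name `dice_scores` and the input layout are assumptions.

```python
from collections import Counter
from itertools import product


def dice_scores(bitext):
    """Score word pairs by the Dice coefficient of sentence-level co-occurrence.

    `bitext` is assumed to be an iterable of (source_tokens, target_tokens)
    pairs, one pair per aligned sentence. Each word is counted at most once
    per sentence.
    """
    src_freq, trg_freq, pair_freq = Counter(), Counter(), Counter()
    for src_tokens, trg_tokens in bitext:
        src_types, trg_types = set(src_tokens), set(trg_tokens)
        src_freq.update(src_types)
        trg_freq.update(trg_types)
        # Every source word co-occurs with every target word in the pair.
        pair_freq.update(product(src_types, trg_types))
    # Dice(s, t) = 2 * cooc(s, t) / (freq(s) + freq(t))
    return {
        (s, t): 2 * n / (src_freq[s] + trg_freq[t])
        for (s, t), n in pair_freq.items()
    }
```

Pairs that always appear together in aligned sentences receive a score of 1.0, and pairs that never co-occur are absent from the result. A real extraction step would need more data and additional filtering before such scores could be treated as alignments.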

References

PLUG98a
Lars Ahrenberg, Magnus Merkel, Katarina Mühlenbock, Daniel Ridings, Anna Sågvall Hein, and Jörg Tiedemann, Parallel corpora in Linköping, Uppsala and Göteborg (PLUG). Project Application, Uppsala University, 1998
PLUG98b
Lars Ahrenberg, Magnus Merkel, Katarina Mühlenbock, Daniel Ridings, Anna Sågvall Hein, and Jörg Tiedemann, Parallel corpora in Linköping, Uppsala and Göteborg (PLUG). Project plan 980401 - 991231. Internal report, Uppsala University, 1998
PLUG98c
Jörg Tiedemann, Parallel corpora in Linköping, Uppsala and Göteborg (PLUG). Work package 1. Internal report, Uppsala University, October 1998