The Uplug home page
Uplug is a collection of tools for linguistic corpus processing, word
alignment and term extraction from parallel corpora. It includes two main
components:
- Corpus Manager - Monolingual and bilingual corpora can be
added to your personal repository. The corpus manager includes tools
for updating the repository and inspecting corpus data in your collection.
- Task Manager - The task manager lets you run applications on
registered corpora. Several integrated tools can be used to process
monolingual and bilingual corpora. Jobs are queued on the local
system; results are sent by mail and added to your personal data
collection.
Several tools have been integrated in Uplug. Pre-processing tools
include a sentence splitter, a tokenizer, and external part-of-speech
taggers and shallow parsers.
The following external tools are used:
- the TreeTagger for English, French, Italian, and German,
- the TnT tagger for English, German and Swedish,
- the Grok system for English (tagging and chunking), and
- the morphological analyzer ChaSen for Japanese.
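As a rough illustration of the sentence splitting and tokenization steps mentioned above, here is a minimal Python sketch. It is not Uplug's own code: the function names `split_sentences` and `tokenize` and the regular expressions are assumptions chosen for this example, and the rules are deliberately naive (abbreviations such as "e.g." would be split incorrectly).

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by
    whitespace and an uppercase ASCII letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Split a sentence into word tokens (keeping internal hyphens and
    apostrophes) and single punctuation tokens."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

if __name__ == "__main__":
    for sent in split_sentences("Uplug aligns words. It also tags text!"):
        print(tokenize(sent))
```

The output of such a step (one token list per sentence) is the kind of input that taggers and aligners typically expect.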
Translated documents can be sentence-aligned using the length-based
approach of Gale & Church.
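The idea behind length-based sentence alignment can be sketched in a few lines of Python. This is a simplified sketch, not the code used in Uplug: the function names are made up for this example, and the constants (`C`, `S2`, and the bead priors) are taken as assumptions from the values usually quoted for Gale & Church's method. Sentences are represented only by their lengths in characters, and dynamic programming picks the cheapest sequence of 1-1, 1-0, 0-1, 2-1, 1-2 and 2-2 beads.

```python
import math

# Assumed constants, following the values usually quoted for Gale & Church.
C = 1.0    # expected number of target characters per source character
S2 = 6.8   # variance of that ratio
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def match_cost(len_src, len_tgt, prior):
    """Negative log probability of pairing spans with these lengths."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal variable.
    tail = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(tail, 1e-300) * prior)

def align_by_length(src_lens, tgt_lens):
    """Return a list of beads (source indices, target indices)."""
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = match_cost(sum(src_lens[i:ni]),
                                  sum(tgt_lens[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Follow the back pointers from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads

if __name__ == "__main__":
    # Two short source sentences that correspond to one target sentence.
    print(align_by_length([30, 30, 50], [60, 50]))
```

Because only lengths are compared, the method needs no lexicon, which is why it is a common first step before word alignment.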
Words and phrases can be aligned using the clue alignment approach and
GIZA++, a toolbox for statistical machine translation.
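To give a feel for the kind of evidence a word aligner can work with, the following Python sketch computes Dice co-occurrence scores over a sentence-aligned corpus. It is an illustration only: the function name `dice_scores` and the toy data are assumptions, and this page does not say which association measures Uplug or GIZA++ actually use.

```python
from collections import Counter
from itertools import product

def dice_scores(bitext):
    """Compute Dice co-occurrence scores for word pairs.

    bitext is a list of (source_tokens, target_tokens) sentence pairs.
    Each word type is counted at most once per sentence pair.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in bitext:
        src_types, tgt_types = set(src_sent), set(tgt_sent)
        src_freq.update(src_types)
        tgt_freq.update(tgt_types)
        pair_freq.update(product(src_types, tgt_types))
    return {(s, t): 2.0 * n / (src_freq[s] + tgt_freq[t])
            for (s, t), n in pair_freq.items()}

if __name__ == "__main__":
    toy_bitext = [
        (["the", "house"], ["das", "haus"]),
        (["the", "book"], ["das", "buch"]),
        (["a", "house"], ["ein", "haus"]),
    ]
    scores = dice_scores(toy_bitext)
    for pair in sorted(scores, key=scores.get, reverse=True)[:5]:
        print(pair, scores[pair])
```

Pairs that always occur together score 1.0, while pairs that only co-occur by chance score lower; a real aligner would combine several such signals rather than rely on one.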
Publications
- Tiedemann, J. 2003. Recycling Translations - Extraction of Lexical
Data from Parallel Corpora and their Application in Natural Language
Processing. Doctoral Thesis, Studia Linguistica Upsaliensia 1,
ISSN 1652-1366, ISBN 91-554-5815-7.
[pdf, 2.1MB] [html] [errata, pdf]
- Tiedemann, J. 2003. Combining Clues for Word Alignment. In
Proceedings of the 10th Conference of the European Chapter of the ACL
(EACL03), Budapest, Hungary, April 12-17, 2003.
[pdf, 90 kB] [ps, 93 kB]
- Ahrenberg, L., Merkel, M., Sågvall Hein, A., Tiedemann, J. 2000.
Evaluation of Word Alignment Systems. In Proceedings of LREC 2000,
Athens, Greece.
[pdf, 406kB] [ps, 757kB] [gzipped ps, 236kB]