ACTA UNIVERSITATIS UPSALIENSIS
Studia Linguistica Upsaliensia
1
Recycling Translations
Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing
BY
JÖRG TIEDEMANN
ACTA UNIVERSITATIS UPSALIENSIS
UPPSALA 2003
Dissertation at Uppsala University to be publicly examined in
Lecture Hall IX, University Hall, on Friday, December 12, 2003, at 10:15,
for the Degree of Doctor of Philosophy. The examination will
be conducted in English.
Abstract
Tiedemann, J. 2003.
Recycling Translations. Extraction of Lexical Data from
Parallel Corpora and their Application in Natural Language Processing. Acta Universitatis
Upsaliensis. Studia Linguistica Upsaliensia 1. 130 pp.
Uppsala. ISBN 91-554-5815-7
The focus of this thesis is on re-using translations in natural language
processing. It involves the collection of documents and their translations
in an appropriate format, the automatic extraction of translation data, and
the application of the extracted data to different tasks in natural
language processing.
Five parallel corpora containing more than 35 million words in 60 languages
have been collected within co-operative projects. All corpora are sentence
aligned and parts of them have been analyzed automatically and annotated
with linguistic markup.
Lexical data are extracted from the corpora by means of word alignment. Two
automatic word alignment systems have been developed, the Uppsala Word
Aligner (UWA) and the Clue Aligner. UWA implements an iterative
"knowledge-poor" word alignment approach using association measures and
alignment heuristics. The Clue Aligner provides an innovative framework for
the combination of statistical and linguistic resources in aligning single
words and multi-word units. Both aligners have been applied to several
corpora. Detailed evaluations of the alignment results have been carried
out for three of them using fine-grained evaluation techniques.
A corpus processing toolbox, Uplug, has been developed. It includes the
implementation of UWA and is freely available for research purposes. A
new version, Uplug II, includes the Clue Aligner. It can be used via
an experimental web interface (UplugWeb).
Lexical data extracted by the word aligners have been applied to different
tasks in computational lexicography and machine translation. The use of
word alignment in monolingual lexicography has been investigated in two
studies. In a third study, the feasibility of using the extracted data in
interactive machine translation has been demonstrated. Finally, extracted
lexical data have been used for enhancing the lexical components of two
machine translation systems.
Keywords: word alignment, parallel corpora, translation corpora,
computational lexicography, machine translation, computational linguistics
Jörg Tiedemann, Department of Linguistics, Uppsala University,
Villavägen 4, Box 527, 751 20 Uppsala
© Jörg Tiedemann 2003
ISBN 91-554-5815-7
ISSN 1652-1366
urn:nbn:se:uu:diva-3791 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-3791)
Jörg Tiedemann
2004-01-03