Word alignment is the basic task in extracting lexical data from parallel corpora. In this chapter, we will concentrate on techniques for automatic word alignment that have been investigated, developed and implemented in the thesis.
The task of word alignment was introduced in section 0.6. We will emphasize the combination of multiple resources for achieving alignment accuracy. In the first section, word alignment techniques are described. It includes an overview of alignment resources and association measures, and a presentation of two word alignment systems that have been developed in the thesis (the Uppsala Word Aligner (UWA) and the Clue Aligner). The second section of this chapter describes metrics for word alignment evaluation including a proposal of a new refined measure for the evaluation of partially correct alignment results. Finally, the last section, before summarizing the chapter, presents experimental results that have been achieved with the word alignment techniques as described in the sections before.
In this section we will discuss in detail, word alignment techniques used in our investigations. The first part summarizes resources and measures. The second part presents the principles of the Uppsala Word Aligner (UWA) and its ``greedy'' alignment approach. The third part contains a detailed description of the clue alignment approach including definitions of the terminology and explanatory examples.
The main resource in word alignment is the bitext itself. Correspondences between words and phrases in the two languages can be derived directly from the corpus using statistics of word distributions and measures of similarity. Another group of resources for word alignment are external knowledge sources which can be divided into machine-readable data collections (such as bilingual dictionaries) and expert knowledge (such as association heuristics between word positions or part-of-speech tags).
Statistical approaches to word alignment are usually built upon the co-occurrence of words and their translations in bitext segments. Statistical techniques as introduced in section 0.6.2 capture the co-occurrence property by estimating alignment parameters that describe translation relations according to a translation model. The estimated parameters maximize the likelihood of the data given the model, i.e. recurrent word pairs with similar contextual features obtain high probabilities when maximizing the likelihood function. More details about statistical estimation approaches can be found in chapter 0.3. Association approaches to word alignment use measures of correspondence for the purpose of identifying translation relations. These association measures are often derived from statistics using co-occurrence frequencies as their main parameter. Section 0.6.1 gives some background to such measures that have been applied in alignment tasks. In our approaches, we apply three co-occurrence measures, the Dice coefficient, t-scores, and point-wise mutual information. More information about these measures can be found in section 0.6.1.
String similarity measures are used to find cognates in bitexts. Two common measures have been presented in section 0.6.1, the Dice coefficient and the longest common sub-sequence ratio (LCSR). In our alignment approaches, we apply the LCSR score.
LCSR measures string similarity based on common characters. However, the spelling of cognates often differs systematically between languages even if they are close to each other phonologically. Loan words that enter a language are usually altered in order to match language specific structures and rules. In many languages certain modifications are consistently applied when borrowing words from other languages. For example, the consonant 'c' in English usually corresponds to the letter 'k' in Swedish (and German) and vice versa when representing the phoneme /k/. Other examples of systematic spelling changes are the French word ``chauffeur'', which became ``chafför'' in Swedish, and the English word ``mail'', which is often spelled ``mejl'' in Swedish. Sometimes language specific characters are simply replaced by similar characters for being consistent with the alphabet of the borrowing language, for instance, ``smorgasbord'' in English, which is borrowed from Swedish (``smörgåsbord'').
A cognate recognizer should be aware of systematic spelling differences in order to identify a large number of possible cognates. Weighted string matching algorithms are discussed in [Tie99a] which can capture similarities even between non-identical characters. The same algorithm as for standard LCSR is used together with a character mapping function. This function defines weights for matching pairs of characters, even non-identical ones. In [Tie99a], three approaches the automatic construction of weighted string similarity measures are proposed. Weighted character mapping functions are learned automatically from a collection of cognate candidates. In the first approach pairs of vowels and pairs of consonants at similar positions are mapped to each other and co-occurrence scores are calculated for each character pair using the Dice coefficient. These scores represent the weights in the character mapping function. Similarly, in the second approach, pairs of vowel sequences and pairs of consonant sequences are mapped to each other and Dice scores are calculated in order to fill the character mapping function. Note that the unit to be matched is now a sequence of characters, which may consist of more than one character. This is also the case for the third approach, which yields the best result in the presented experiments. It is based on the co-occurrence of non-matching parts at similar positions in previously extracted cognate candidates. The following example illustrates non-matching parts from a Swedish-English cognate candidate:
| Swedish | k | r | i | t | i | sk | a | |
| English | c | r | i | t | i | c | a | l |
Non-matching parts in this example are the pairs ('k'
'c'), ('sk'
'c'), and (''
'l').
In [Tie99a], it has been demonstrated
experimentally for Swedish and English, that a weighted string
similarity measure with a mapping function that has been learned
using the third approach
improves recall and even precision for cognate extraction tasks.
In section 0.6.1, external resources for word alignment have been discussed. Several resources have been applied in our word alignment approaches. Machine-readable bilingual dictionaries (MRBD) and alignment heuristics such as position weights have been used in the Uppsala Word Aligner which is presented in the following section. Pre-defined relations between morphosyntactic features are applied in the Clue Aligner which is presented in section 0.16.3.
UWA implements a ``greedy'' word alignment approach based on association measures and alignment heuristics. The basic idea in this approach is to combine different resources in an iterative alignment procedure. The principles of this approach are inspired by Martin Kay's quote on machine translation [Kay97, p. 13]:
The keynote will be modesty. At each stage, we will do only what we know we can do reliably. Little steps for little feet!
Word alignment is a non-trivial task and also walks on ``little feet'', which leads to the principles of ``baby-steps'' for word alignment as formulated in [Tie99c, p. 218]:
Using these principles and the modular design of the Uplug system, the Uppsala Word Aligner (UWA) was designed as sketched in figure 11.
The aligner starts with a pre-processing step (principle 1), which includes the recognition and extraction of MWUs for both languages in the bitext. Collocations are identified using co-occurrence statistics (point-wise mutual information) and language-specific classified stop word lists for phrase boundary detection. In the second step, alignment candidates by means of type links are collected from any appropriate source (principle 2). Candidates are ranked according to their reliability (principle 3), i.e. their association score. Alignment candidates include cognates (found by means LCSR scores), associated pairs (found by means of Dice scores, point-wise mutual information or t-scores), single-word bitext segments5.1, pairs of rare words (which occur significantly less frequently than all other words in the bitext segment), and machine-readable bilingual dictionaries (MRBD). ``Risky'' candidates should be avoided (principle 4); several heuristics help to keep the collection of link candidates as clean as possible: Link distance thresholds (assuming that corresponding words do not occur arbitrarily far apart from each other), string length thresholds (short words are unreliable for string similarity measures), constraints on length difference ratios (assuming that related words do not exceed a certain length difference), and frequency thresholds (sparse data produce unreliable statistics). Word alignment includes several steps (principle 7) in the form of linking cascades. The process is implemented as a greedy, best-first (principle 4) search in the manner of competitive linking. Competitive linking implies that aligned words cannot be aligned again which allows the removal of linked words (principle 5). Unlike Melamed's approach [Mel96a], we do not restrict the greedy search to one-to-one alignments but include also (possibly overlapping) multi-word-units (MWUs) in the word alignment competition.
The word alignment cascades are not the end of the process. Type links are extracted from the aligned bitext and return to the alignment process as an additional resource (principle 6). Due to the competitive linking process the size of the corpus is reduced by the number of aligned words. The reduced corpus enters another alignment iteration starting, again, with a ``careful'' preparation of the first step in the process, i.e. the collocation extraction. The MWU recognizer finds adjacent collocations only. However, having aligned words removed from the bitext it is possible to find long-distance collocations, assuming that interrupting items have been removed completely from the corpus in the previous iteration. This motivates the repeated use of the MWU recognizer. Furthermore, the contents of the corpus has been changed significantly with respect to word occurrence and co-occurrence of word pairs. Statistically significant associations between words (and new MWUs) may arise in the reduced bitext, which could not been identified previously in the corpus. Finally, the reduced bitext may contain additional single-word bitext segments that can be directly aligned. The iterative process can be continued as long as new links are found (principle 8).
Word alignment using a combination of association clues has been introduced in [Tie03]. It is inspired by previous work (as described in section 0.16.2) but applies refined techniques for the combination of different resources. The main idea about the clue alignment approach is to incorporate several knowledge sources into the word alignment process in a systematic way. Empirical data shall be complemented with linguistic information, statistical parameters shall be combined with alignment heuristics.
Most word alignment models apply knowledge-poor approaches, i.e. alignment based on plain texts disregarding all additional information that can be gathered for the languages involved. Simple tools such as stemming functions and language-specific stop word lists are often added to improve the performance of simple association approaches. Recently, initial studies on the impact of linguistic information, such as part-of-speech, on word alignment have been carried out as described in chapter 0.3. The clue alignment approach represents a flexible framework for such a combination of language-specific, heuristic, and statistical resources in word alignment.
For many languages, reliable tools are available for enriching corpus data with linguistic information. Examples of such tools are part-of-speech taggers, lemmatizers, named-entity recognizers, and (shallow) syntactic parsers. Other kinds of language-specific information may be found in dictionaries including idiomatic expressions, multi-word terms, word meanings, word translations etc.
It may be assumed that such data and tools have a positive impact on the identification of translation relations between lexical units in bitexts. For an illustration of this intuition, see for instance the simple example in figure 12 where translation correspondences are expressed between automatically tagged chunks rather than between words.
Linguistic information and contextual features may be regarded as clues to relations between words. In particular, these clues may indicate translational correspondences between words in bitexts. Contextual clues are frequently used by lexicographers for the identification of word senses. Similarly, the clue alignment approach tries to utilize information about the context for the identification of translation equivalents. Many types of features can be explored: Spelling similarities may indicate cognates. Information about word classes may support linking decisions. Word position can be another key to alignment preferences. Morphosyntactic features and information about syntactic functions may help to find structural similarities and translational correspondences between lexical items within these structures. Some of these alignment clues may be very specific (e.g. default translations taken from a dictionary) and others may be very general (e.g. pairs of chunk labels such as 'NP-NP' in figure 12). Depending on their impact, clues may be of different importance for the identification of translational correspondence.
The following definitions are used in the clue alignment model:
The definition of word alignment clues is very general and allows a large variety of clue types. Clues are (as in real life) not always helpful. Bad clues can be misleading. This has to be considered when choosing clue resources and designing clue patterns. The impact of certain clues in combination with others is another issue, which becomes very important when setting weights for individual clue patterns and resources. The optimal solution for the compilation of alignment clues would be to find a way of automatic approval of possible clues for the task of word alignment including an automatic setting of optimal weights for ``good'' clues. This could be achieved using representative test data and an optimization procedure. However, this is very time consuming due to the large number of possible options and parameters. An intermediate solution is to adjust the clue settings experimentally according to the alignment results using intuitively reasonable settings and clue patterns.
The distributional property of clues as defined above enables the combination of clues on a word-to-word level according to the clue combination rule. In this way, a value for the total clue can be calculated for each pair of words for a given bitext segment using all available clues, even for overlapping items. One-to-one relations between words of a bitext segment can be represented in a two-dimensional space, which we will call the clue matrix as it is filled with clue values. The following example is constructed from a bitext segment taken from a translation of Saul Bellow's ``To Jerusalem and back: a personal account'' [Bel76] to Swedish [Bel77] (henceforth the Bellow corpus):
Then hand luggage is opened. Sedan öppnas handbagaget.
Let us assume clues derived from co-occurrence measures (Dice) and string similarity measures (LCSR) in table 2.
Using the clues from table 2 we can fill each cell of the following clue matrix (figure 13) with corresponding values (without weighting scores).
The example in figure 13 illustrates how overlapping clues may strengthen the reliability of word links. Most of the clues in the example do not strongly suggest links on their own (especially clues below 0.5). However, in combination they create a clearer picture about the alignment that should be preferred.
In the following section, word alignment strategies using association clues collected in clue matrices are presented and discussed.
A clue matrix summarizes information from various sources that can be used for the identification of translation relations. However, there is no obvious way to utilize this information for word alignment as we explicitly include multi-word units in our approach. The clue matrix in figure 14 has been obtained for a bitext segment from the Bellow corpus using a set of weighted declarative and estimated clues.
There are many ways of ``clustering'' words together and there is no obvious maximization procedure for finding the alignment optimum. The alignment procedure depends very much on the definition of an optimal alignment. The best alignment for our example would probably be the set of the following links:
A typical procedure for automatic word alignment is to start with
one-to-one word links. Links
that have a common source or target language words are called
overlapping links.
Sets of overlapping
links, which do not
overlap with any other link outside the set, are called link
clusters.
Aligning words one by one often
produces overlaps and in
this way implicitly creates aligned multi-word-units as part of link clusters.
A general word-to-word alignment
for a given bitext segment with
source language words (
) and
target language
words (
) can be formally
described as a set of links
This general definition allows varying numbers of
links (
) within possible alignments
. There is no
obvious function
that can be maximized for finding the optimal alignment. The notion of
an alignment optimum can be defined in several ways.
One such word-to-word alignment approach is to assume a
directional word alignment model
similar to the models in statistical
machine translation (see section 0.6.2). The directional alignment
model assumes that there is at most one link for each source language
word. Using alignment clues, this can be expressed as the following
optimization problem:
where
is a set of links
with
and
is the ``total clue value''
for the linked items
and
. In
other words, word alignment is the search for the best
link for each source language word. Directional models do not allow
multiple links from one item to several target items. However, target
items can be linked to multiple source language words as several
source language words can be aligned to the same target language word.
The direction of alignment can easily be
reversed, which leads to the inverse directional
alignment:
with links
and
. In the inverse directional alignment,
source language words can be linked to multiple words but not the
other way around.
The following figure illustrates directional alignment models applied to the
example in figure 14:
Directional link sets can be combined in several ways. The
union of
link sets (
) usually causes many
overlaps and, hence, very large link
clusters. On the other hand, an intersection of link sets
(
) removes all overlaps and
leaves only highly confident one-to-one word links behind.
Using the same example from above we obtain the following alignments:
The union and the intersection of links do not produce
satisfactory results.
Another
possibility is a refined combination of link sets
(
)
as suggested by Och
and Ney [ON00b]. In this approach, the intersection of links
is iteratively extended by additional links
which pass
one of the following two constraints:
Applying this approach to the example, we get:
Another alignment approach is the competitive linking
approach
proposed by Melamed [Mel96a]. In this approach, one assumes
that there are only one-to-one word links. The alignment is done
in a ``best-first'' search manner where links with the highest
association scores are aligned first, and the aligned items are then
immediately
removed from the search space. This process is repeated until no more
links can be found. In this way, the optimal alignment (
) for
non-overlapping one-to-one
links is found. The number of possible links in an alignment is
reduced to
. Using competitive linking with our example we yield:
Another iterative alignment approach is
proposed in [Tie03]. In this approach,
the link
with the highest
score in the clue matrix
is added to the set
of link clusters if it fulfills certain constraints. The top score is
removed from the matrix (i.e. set to zero) and the link search is
repeated until no more links can be found. This is basically a
constrained best-first search.
Several constraints are possible. In [Tie03] an
adjacency check is suggested, i.e. overlapping links are accepted only
if they are adjacent to other
links in one and only one existing link cluster. Non-overlapping links
are always accepted (i.e. a non-overlapping links creates a new link
cluster). Other possible constraints are clue value
thresholds, thresholds for clue score differences between adjacent
links, or syntactic constraints (e.g. that link clusters may not cross
phrase boundaries). Using a best-first search strategy with the adjacency
constraint we obtain the following alignment:
None of the alignment approaches described above produces the manual reference alignment in our example using the given clue matrix. However, simple iterative procedures come very close to the reference and produce acceptable alignments even for multi-word units, which is promising for an automatic clue alignment system. Directional alignment models depend very much on the relation between the source and the target language. One direction usually works better than the other, e.g. an alignment from English to Swedish is better than Swedish to English because in English terms and concepts are often split into several words whereas Swedish tends to contain many compositional compounds. Symmetric approaches to word alignment are certainly more appropriate for general alignment systems than directional ones.
Using association measures for estimating alignment clues lead to
another complication, the proper transformation of association scores
into association probabilities. The definition of estimated clues
using arbitrary association measures assumes a linear correlation
between the scores of the measure and the probability of the
corresponding association. For example, the degree of similarity
of two strings by means of LCSR assumed to be linearly
correlated with the probability of the two strings being cognates. The
correlation factor is expressed in the value of
where
is used to normalize
the association score (i.e. to transform the score into a value
between 0 and 1)
and
is used to weight the score according to the
correlation between the measure and the corresponding probability.
The
normalization factor can be found by setting
,
for instance
(for identical
strings
and
). However, for
measures like point-wise mutual information, such a maximum is not
defined for arbitrary arguments
and
. In the case of point-wise
mutual information it is possible to make
dependent on
specific items
and
, as the maximum of point-wise mutual
information is given with
and similarly
[MS99]. Normalizing point-wise mutual information
can then be done by setting
.
Other measures like t-scores may even produce negative values that
can be interpreted as statistical evidence for items to avoid each other
[CGHH91]. A simple way of handling negative scores is
to disregard them. However, the value of general t-scores can also be
arbitrarily large.
The maximum of a t-score given certain items
and
also depends on the estimation of the standard error, which involves
the size of the corpus as one parameter.
Normalization in
these cases is very inefficient because it depends on several
parameters and has to be re-calculated for individual word pairs.
An approximate solution for normalization is to use a
large value
for highly correlated items
and
.
This approximation is sufficient as long as the score
does not exceed
.
Another important issue is the adjustment of an appropriate weighting scheme. Weights have to resemble the importance of specific clues. However, an optimization of weights has not been carried out in the present study.
In [Tie03], word alignment experiments are presented using two clues, which have been derived from association scores: cognate clues using LCSR scores and co-occurrence clues using the Dice coefficient. LCSR and Dice produce values between 0 and 1, thus, simplifying the normalization. The clue weights have been set uniformly to 0.5. One of the advantages of the clue alignment approach is that clues may even refer to overlapping multi-word units. This makes it possible to compute association scores for various words and word groups from the corpus. Two approaches to the identification of MWU candidates have been tested in [Tie03]: Pre-defined chunks using shallow parsers and arbitrary bi- and tri-grams5.3.
As mentioned earlier, clues can be learned from aligned data.
Learning
clues is based on the assumption that an association clue
between
lexical items is
linearly correlated with the likelihood of
and
being
features of aligned source and target language items. In this way,
word alignment relations are represented as generalized clues
from aligned
training data using
clue patterns
that define the features that
are assumed to carry the
relational information.
One way to
estimate the correlation between features of aligned items is to use a
conditional probability
, which is the likelihood of
the features
of target language items given the features
of aligned source language items.
This likelihood can be estimated using
the frequency
of co-occurring features in aligned
training data, the
feature frequencies
and
in the same data, and
a maximum likelihood estimation:
| (5.1) |
The correlation factor is expressed as the
weight
. Features can be any kind of information attached to
the lexical items or to any contextual data.
A conditional probability is ``directional'', i.e. estimations of
may be very different to estimations of
especially for features with very different
occurrence frequencies. One possibility of a symmetric estimation of
the association between features is to use the Dice
coefficient for
combining both conditional probabilities.
| (5.2) |
The Dice coefficient is in fact the harmonic mean of the two
conditional probabilities
and
as
shown in equation 2.
Unfortunately, word aligned training data is not available and therefore, conditional feature probabilities cannot be estimated directly. However, other clues such as association measure clues may be used to create word alignments as described earlier. The idea now is to use automatically aligned data as training data for the estimation of feature clues. In this way, alignment clues can be found in a boot-strapping procedure, starting with a basic alignment and learning new clues from (noisy) alignments that have been produced using previous clues. Such self-learning techniques are known to increase noise. However, experimental results show that the alignment gains from dynamically learned clues, as is described in [Tie03] and section 0.18. The main reason for the success of this method is that previously unused information and contextual dependencies can be integrated in the alignment process. For example, relations between part-of-speech tags can be found, which can be used to generalize the translation relation between words that belong to certain word classes. Translational dependencies between word positions can be learned and syntactic relations can be identified. Many features are possible for the estimation of dynamic clues. Four dynamic clue patterns have been defined in [Tie03]:
Many more examples than the ones above could be shown. There is a large variety of possible feature sets and many possible combinations among them. Some other examples of dynamic clue patterns are listed in table 3.
|
|
Clue patterns do not have to be symmetric like the ones described here. A clue pattern can also refer to features of different complexity for source and target languages. However, it is important to bear in mind that the design of clue patterns is important for the reliability of clues that have been learned using these patterns. Learning clues assumes that associations between translated words can be generalized as relations between certain features. As mentioned earlier, clues can be misleading, which is certainly true for dynamic clues that may ``over-generalize'' relations, i.e. they may express relations that do not correlate with associations we are looking for. Another crucial assumption is the independence assumption between clues, which is needed for the combination of clues. Overlapping associations indicators violate this assumption, which may influence the approach in a negative way.
Word alignment clues may be derived from any source that ``promises'' to find associations between words and phrases. Alignment approaches used in statistical machine translation are such a source producing, among other things, bilingual lexical translation probabilities. SMT uses refined translation models for estimating lexical probabilities that are directly applicable as a resource in the clue alignment system. SMT alignment models include various kinds of dependencies, which makes them a valuable resource, evident in the experimental results (section 0.18.2). There are several results of SMT alignments, beside the lexical probabilities, that can be applied in clue alignment. SMT itself produces a word aligned bitext suitable for training. Type links can be extracted directly from the aligned corpus and may be applied as lexical resources in clue alignment. Distortion parameters can be used as an additional clue. Even fertility parameters and mappings to the ``empty word'' could be added to improve the clue alignment approach. However, the main resource that has been used in our experiments is the set of lexical translation probabilities, which can be directly included in clue alignment experiments.
Declarative clues are pre-defined and originate from language sources such as bilingual dictionaries or contain expert knowledge such as pre-defined relations between certain feature pairs. Declarative clues have not been used in the experiments presented in [Tie03]. Further experiments (see section 0.18.3) show that simple declarative clues improve alignment results. An example of a simple declarative clue for aligning English and Swedish texts is the part-of-speech clue in figure 155.5.
Similarly, declarative clues can be defined between chunk labels, relative word positions, common function words etc. The following declarative clues have been defined for the experiments in section 0.18:
The clues discussed above are all used as ``positive''
indicators for associations between lexical items. It has already been
mentioned before that certain clues may be misleading
depending on the reliability of the source from which they have been
derived. Some clues may indicate exactly the opposite of positive
clues, i.e. high clue
values may signal a ``negative'' relation between items such as
negative t-scores indicate words which ``avoid'' each other in a
corpus. Such negative clues (
) could be
utilized in alignment in order to prevent incorrect links.
One way would be to use the complement of a negative clue (
)
in the existing framework.
However, this does not correspond to the actual function of a negative
example as it does not reduce already existing association scores but
only improves scores for items that do not match the example. A proper
integration of negative clues would be a change in the
clue alignment model in order to allow the reduction of association
scores according to the values of negative clues. One could, for
example, define the total clue as the joint probability of positive
clues and the complement of negative clues. Using the independence
assumption for alignment clues the total clue could be computed as the
following product:
. However, in this way, negative
clues would have a very strong influence on the indicated
associations as they affect all scores. Consequently, negative clues
have to be chosen very
carefully in order to lead the alignment program to the correct
decisions. Strong negative clues could be, for example, pre-defined links of
words that should absolutely not be aligned. Examples of such
links would be pairs of non-related function words, which tend to obtain rather
large association scores because they co-occur frequently. Other
examples would be word pairs that are very similar but are in fact
``false friends''. The impact of negative clues has not yet been
tested and a proper way of integrating them in the system has not yet
been found. It would be interesting to investigate this
further in the future.
In section 0.6.4, two types of measures for word
alignment evaluation have been presented, the
measures for
translation spotting and the
measures for exhaustive
word-to-word alignments.
Another measure has been proposed by the author of this thesis in
[AMST00].
The measures in this approach (the
measures) are tailored
toward gold standards which include complex
MWU links of sampled words from the bitext. In the following I will
review this approach, revise the evaluation measures, and propose
refined metrics for word alignment evaluation (the
measures).
For the computation of the PWA measures, the following partiality value Q is calculated for partially correct link proposals5.6 for each reference link of the gold standard.
The set of
includes all source language
words of all proposed links if at least one of them is partially
correct with
respect to the reference link
from the gold standard. Similarly,
refers to all the proposed target language
words.
and
refer to
the sets of source and target language words in link
of the gold
standard. Using the partiality value
, we can defined the
recall and precision metrics as follows:
The value of
refers to the number of links
for which
there is at least one link proposal by the system, i.e. the number
of links
for which there is a link proposal
with
or
. The value of
is the size of the
gold standard (
).
The measures above take the correct portions of partially correct links
into account. A consequence of this definition is that partially correct links
will be scored as completely correct
if they cover a MWU link completely and nothing else than
words of the MWU link. For example, if the reference link represents
the linked MWUs ``United Nations'' and the Swedish translation
``Förenta nationerna'' the following sets of links will be considered to be
completely correct with respect to the reference:
(United
Förenta, Nations
nationerna),
(United Nations
Förenta nationerna),
but also (United
nationerna, Nations
Förenta).
The first two link sets are perfectly acceptable whereas the third set is
unlikely to be accepted in isolation of the MWU alignment.
This problem is a consequence of the fact that the gold standard does not give
any information about how the internal parts of a MWU link should be
covered. However, cases like the constructed one above are
usually very rare and acceptable with respect to the underlying
MWU link.
Another problem with the partiality measure
is that its
value is the same for both precision and recall. Hence, precision and
recall are always identical in cases where at least one partially
correct link has been proposed for each reference link,
i.e.
.
In order to capture the difference between precision and recall I propose the following refinements of the PWA measures for precision and recall using MWU links5.7:
These measures will be called the MWU measures for word
alignment evaluation in the following.
Let us review the example from table 1 on page
in order to illustrate the differences between the evaluation
metrics which have been introduced so far.
Table 4
contains different
evaluation scores given the links from the sample gold
standard5.8 (column 1 and 2) and two link proposals by an
imaginary alignment system (column 3).
The differences between precision scores are quite
large.
The example illustrates the dependence of the evaluation on particular
metrics, which have been tailored towards specific alignment purposes
and styles of gold standards.
Note that the precision measure
is only possible for
completely aligned gold
standard documents. Otherwise it is impossible to judge how many
alignments the system proposes for existing gold standard links. Compare this
to a word sampling method for creating gold standards. Using only
samples of words makes it impossible to say if a proposed link is
to be counted as incorrect with respect to the sampled link or just as
another link that is not to be considered in the evaluation.
The difference between the PWA measures and the MWU measures can be
mainly seen in precision. In our example, the PWA measure ``punishes''
the proposed links because of their partiality in both, precision and
recall. However, they are not necessarily wrong but
incomplete. The MWU measures capture this fact by a clear difference
between precision and recall.
In our experiments, we applied word sampling gold standards
with complex MWUs. Hence, the MWU measures are a natural choice for
evaluation of the alignment performance.
A large number of experiments were carried out mainly using the corpus data from the PLUG project. Results have been evaluated using the gold standards created in the same project [MA99]. An extensive summary of alignment results, using the greedy word aligner described in section 0.16.2, can be found in [AMST99]5.9. Results from initial experiments with the clue alignment approach are presented in [Tie03].
In this section I will present alignment experiments which were carried out recently on three sub-corpora from the PLUG corpus, the novel ``To Jerusalem and back: A personal account'' by Saul Bellow in English and Swedish (Bellow), the Scania 1995 corpus in Swedish and English (Scania95en), and the Scania 1995 corpus in Swedish and German (Scania95de). These three sub-corpora were chosen for comparing alignment results for different text types (literary and technical text) and for different language pairs (Swedish-English, English-Swedish and Swedish-German). We will concentrate on the clue alignment approach as it allows many interesting combinations of techniques and resources.
First, I will describe the setup of the presented experiments. This includes a brief overview of the gold standards which have been used for evaluation. Secondly, I will present experiments using basic alignment clues. Thirdly, I will discuss the effect of declarative and dynamic clues on alignment performances. Then, I will compare different search strategies and their impact on alignment quality. Finally, I will discuss the overall alignment performance in comparison with previously achieved results.
The three corpora that have been chosen are fairly small and therefore well suited for extensive experiments using a large number of settings. Some characteristics are presented in table 5.
The clue alignment approach was chosen as the basic methodology. A large number of settings have been explored and results using these settings are presented in the following sections.
The first series of experiments comprises word alignment attempts using so called ``basic clues'' referring to alignment clues that have been derived directly from the bitext using empirical techniques such as association measures and statistical alignment. Two types of basic clues will be distinguished: Base1 clues refers to word associations that have been found using simple association measures based on co-occurrence and string similarity. Base2 clues refers to basic clues, including the translation probabilities that are produced by statistical alignment.
The second series of experiments comprises alignment attempts, in which declarative clues have been added to basic clues from the section before. Three types of declarative clues have been applied as described in section 0.16.3. Different combinations of basic and declarative clues are investigated on the example of aligning one of the three test corpora, the Bellow corpus. Declarative clues have been added to both types of basic clues, base1 and base2.
The third series of experiments comprises word alignment attempts using dynamic clues as described in table 3 in section 0.16.3. Dynamic clues are learned from previously aligned corpora. Aligned training data are produced by the clue aligner using basic and declarative clues. The impact of the quality of training data on dynamic clues is investigated as well as the performance of the aligner when adding learned clues to basic and declarative clues. Experiments have been carried out for all three test corpora. Results are presented and discussed.
In the fourth series of alignment experiments the effect of different search strategies on the alignment performance is investigated. Seven strategies have been introduced in section 0.16.3. Their impact on aligning the Bellow corpus is discussed on the example of three different clue alignment settings.
Finally, the overall performance of the clue aligner is presented for each of the three test corpora in comparison with results which have been yielded by the Uppsala Word Aligner (UWA) and the statistical alignment tool GIZA++5.10. Six types of alignment settings are used to summarize the achieved results: Word alignment using base1 clues, base2 clues, base1 and base2 clues in combination with declarative clues, and, finally, a combination of the latter two with dynamic clues. For each type the best result in terms of F-values has been chosen.
Within the PLUG project, gold standards were created for a majority of sub-corpora of the project corpus. They were produced using a word sampling method and the PLUG Link Annotator (PLA) [MAA02]. 500 randomly sampled words were chosen for texts from three different genres (literary, technical, political documents) and two different language pairs (Swedish-English and Swedish-German). Gold standards include ``fuzzy'' links, ``null'' links, and complex MWU links. They were produced by several annotators using PLA and detailed guidelines [Mer99a]. The average agreement between annotators was measured for the creation of some of the gold standards and yielded about 92% [AMST99]. Figure 16 shows some examples of links in the gold standard of the English-Swedish Bellow corpus5.11.
The same gold standards have been used throughout the experiments. However, the clue aligner and UWA use different formats. Several links have been lost in the automatic conversion from the old format to the clue aligner format due to tokenization differences. However, the majority of links could be converted without difficulties. Table 5 summarizes the sub-corpora and their corresponding gold standards that have been used to evaluate the experiments presented here.
Basic clues are directly derived from parallel corpora without word aligned training material or external knowledge sources. The following basic clue types have been used in the experiments:
I have used a very simple method for normalizing and
weighting scores.
and
scores have been weighted by a factor of 0.05.
and
clues have been weighted by 0.005 and 0.01,
respectively5.12.
The
clues are considered to be more reliable than, for instance,
clues because they are
estimated with consideration of contextual dependencies. Therefore, a
slightly higher weight of 0.1 is used for the
clue. The same
applies to the
clue.
Table 6 illustrates several alignment experiments
using different sets of basic clues for the three bitexts.
A large difference can be observed between the alignment results using different sets of basic clues. Generally, a simple string matching clue is clearly the worst indicator for word alignment even for closely related languages such as Swedish, German and English. GIZA, on the other hand, is the best single clue of our basic clues. Alignments using these probabilities as alignment clues yield the highest precision among all settings for all three bitexts.
Combinations of co-occurrence clues such as the Dice clues, point-wise mutual information clues and t-score clues do not improve the overall performance when measured in terms of F-values. Such combinations generally yield higher recall at the expense of precision. However, for all three corpora, the loss in precision is so high that the overall performance drops significantly for the combination of the three measures in comparison with the single best measure among them (Dice). This can be explained by the fact that closely related clues such as co-occurrence clues violate the independence assumption made when combining them. However, a combination of the probabilistic GIZA clue and the Dice co-occurrence clue produces a slight improvement for the Bellow corpus. The loss in precision is compensated by a clear increase in recall. This is not the case for the technical Scania corpora. Here, the precision decreases considerably without noticeable improvements in recall. The recall rate goes even down significantly in case of the Swedish-English Scania bitext.
Surprisingly, the LCSR clue based on string similarity does not add much to the aligner. On the contrary, performance drops in most of the cases when adding string similarity clues. This was unexpected especially in the case of technical Swedish-German text. A small improvement can be observed when adding LCSR scores to Dice based alignment clues when aligning the Scania bitexts. However, the improvement is very little and does not justify the expensive task of computing string matching clues.
The overall picture of basic clues and their influence on alignment results is very consistent even though the bitexts represent two different genres and include two different language pairs. Therefore, we may conclude that probabilistic clues yield reliable results independent of the text genre and possibly even independent of the language pair under consideration. However, these claims are too strong to be drawn from our data as they include only three related languages and two types of text. The experiments prove that probabilistic clues can be successfully applied to a variety of languages and text types. Furthermore, they imply that string matching techniques do not work well for word alignment at least according to the present investigations. However, adjustments in the weighting scheme may change this behavior.
Declarative clues are pre-defined relations as described in section 0.16.3, i.e. they are static and independent of the particular bitext. Three declarative clue sets for English and Swedish bitexts have been tested in the present study; part-of-speech clues in form of pairs of part-of-speech tags (pos), a second set of part-of-speech clues based on pairs of tag sequences that match multi-word units (pos-mwu), and a set of chunk label pairs (chunk). The pos-mwu clues are shown in figure 15. The other two clue sets are listed in appendix .4. Declarative clues can be combined with any other set of clues. In this study, the three sets have been combined with some of the basic clues which have been presented in the previous section. Naturally, declarative clues apply only to bitexts that include language pairs matching the defined clue features and which include necessary data such as part-of-speech tags and chunk labels. We concentrate our investigations on the Bellow corpus. The English part of the corpus has been tagged and parsed using the open-source library for natural language processing Grok [Bal02]. The Swedish portion has been tagged using the TnT tagger [Bra00,Meg02] and parsed using Beáta Megyesi`s context-free grammar parser for Swedish [Meg02]. All markup has been done automatically and no corrections have been made. Hence, errors are to be expected in the markup. The declarative clues have been weighted uniformly. Figure 17 illustrates the results produced when combining declarative clues with different sets of basic clues.
|
The effect of declarative clues on the alignment performance depends very much on the set of basic clues to which they are added. Surprisingly, combinations with low performances in their basic settings (such as the combinations of three co-occurrence measures and string similarity clues (dice+I+tscore+lcsr)) outperform the basic alignment with the best performance (dice) when combined with declarative clues such as the pos clues. This indicates the importance of an appropriate weighting scheme, which has not been optimized in our experiments. The clue patterns seem to be in a better ``balance'' for the complex clue combinations in comparison with less complex ones. The declarative clues are very strong generalizations of relations between words and phrases and their influence on alignment is probably too strong in cases of less complex alignment clue sets. Similar behavior can be observed when combining declarative clues with base2 clues as shown in figure 18.
|
It is also interesting to observe that the performance decreases in most cases where all available clues are combined together. The performance drop is due to lower precision values that are not compensated by their corresponding recall scores. Generally, the pos clues seem to add the most valuable link indications to the word alignment procedure. The performance increase is mainly due to a great improvement in terms of recall. This is the case for both configurations, base1 plus declarative clues and base2 plus declarative clues.
A large variety of dynamic clues can be learned from previous alignments (see section 0.16.3, table 3). Furthermore, they can be combined in many ways. In table 7, alignment results for several settings of dynamic clues are presented.
| ||||||||||||||||||||||||||||||||||||||||
The quality of the training data is decisive for the performance of dynamic clues, i.e. the use of different basic clues for producing aligned material affects the set of dynamic clues that can be learned. The scores in table 7 illustrate the influence of training data on the performance of the word aligner using dynamic clues. Two different basic alignments have been used in order to learn several types of dynamic clues from the Bellow corpus, i.e. alignments using the Dice coefficient (Dice) and alignments using the GIZA clue. As shown in table 6 the Dice clues perform worse than GIZA clues. The difference amounts to more than 11% in terms of F-values (about 66% for Dice clues and more than 77% for the GIZA clues). In other words, both alignments represent very different training sets. As expected, dynamic clues learned from these two link sets reflect the difference in that way, that clues learned from the GIZA alignment generally perform better than clues learned from Dice alignments. However, the performance differences are relatively small when considering the large difference between the two training sets. This shows that valuable alignment clues can be learned even from small and noisy alignments as the ones produced by the the Dice clues. Precision seems to be the main factor for inferring generalized dynamic clues whereas recall is a secondary aspect.
Alignment clues that involve features such as parts-of-speech and phrase types (p3, c3, c3p, eacl, pp, p3p) behave very similar for both experimental settings because they generalize relations between words and phrases on a morphological and syntactic level rather than on the lexical level. These generalizations seem to be quite similar for both training sets if one considers the resulting word alignments. The experiments in table 7 also show that certain generalizations are better suited for word alignment than other ones. The combination of part-of-speech tags and relative positions (pp and p3p) seem to be the best clue patterns on a morphosyntactic level for our test corpus. Other morphosyntactic clues perform much worse on their own (p3, c3, c3p) indicating that they do not capture much useful information for word alignment. However, a combination of very simple morphosyntactic clues as in the eacl setting performs much better than similar clues on their own.
On the other hand, learned clues including lexical features (lex, lexp) perform very differently. For them, the amount of training data (i.e. recall of the base alignment) seems to be more decisive for their alignment quality than for the others. These lexical clues perform significantly better when learned from the giza-alignment than from the dice-alignment.
Finally, a combination of lexical and morphosyntactic clues (lex+p3p+c3p) clearly out-performs all the other dynamic clue settings.
Table 8 summarizes alignment results for different bitexts and clue sets. Basic clues (as indicated in the first column) were used to create aligned training data for learning dynamic clues listed in the remaining columns. Each column contains the performances of the word aligner when adding the corresponding dynamic clues to the set of basic clues. The top scores are indicated in bold style for each row in the table.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
In general, dynamic clues that do not perform well on their own do not perform well in combination with basic clues either. Another general observation is that basic clues with low alignment qualities are ``easier'' to improve than clues with high alignment qualities. Most of the dynamic clues that have been added to base1 clues (dice, dice+lcsr+static) boost the alignment performance for all three test corpora. On the other hand, only some dynamic clues help to improve alignments based on base2 clues (giza, giza+lcsr+static). However, certain improvements can be observed for all three test corpora when adding dynamic clues even to base2 clues. The only exception is the giza clue alignment of the Swedish-German Scania95 corpus, which did not improve in combination with any dynamic clue. In most cases, dynamic clues that perform well on their own produce the largest improvements when combined with basic and declarative clues. In several cases, the difference between the best score and the second best score is very low and a general conclusion about the quality of specific dynamic clues in comparison with others cannot be drawn from this result. In the present experiments the possibilities of incorporating learned clues for word alignment have been investigated. The results prove that it is possible to improve word alignment results using such clues. An optimized weighting scheme is needed to get the best out of each learned clue when combined with others.
In section 0.16.3 (pp
) several
alignment search strategies have been discussed. The clue aligner
implements different strategies in order to test their impact on the
alignment performance. In figure 19, the
alignment results for three clue settings and the
Bellow corpus are presented for
the seven search strategies described earlier.
|
The figure illustrates the relation between precision and recall when using different algorithms. As expected, the intersection of directional alignment strategies yields the highest precision at the expense of recall, which is generally lower than for the other approaches. Contrary to the intersection, the union of directional links produces alignments with the highest recall values but lower precision than all other search algorithms. Directional alignment strategies generally yield lower F-values than other refined symmetric alignment strategies. The differences between the two alignment directions are surprisingly inconsistent. Competitive linking is somewhat in between the intersection approach and the two symmetric approaches, ``best-first'' and ``refined''. This could also be expected as competitive linking only allows non-overlapping one-to-one word links. The refined bi-directional alignment approach and the constrained best-first approach are almost identical in our examples with a more or less balanced relation between precision and recall.
According to figure 19, alignment strategies can be chosen to suit particular needs. Concluding from our experiments, restrictive methods like the intersection approach or competitive linking should be chosen if results with high precision are required (which is mostly found among one-to-one word links). This is, for example, the case in automatic extraction of bilingual lexicons where noise should be avoided as much as possible. Other strategies should be chosen for applications, which require a comprehensive coverage as, for example, machine translation.
In the previous sections, the clue alignment approach has been explored using a variety of different settings and resources. One of the main advantages of this approach is its modularity, which makes it possible to integrate many different alignment resources. In this section we will look at the clue alignment results in comparison with other alignment approaches, i.e. the greedy word aligner implemented in the Uppsala Word Aligner (UWA) and the statistical alignment approach implemented in the GIZA++ toolbox. We will look at the three test corpora which have been used throughout our investigations separately.
We will use the following name conventions to refer to different settings: The best result for a clue alignment using basic clues only, except GIZA clues, are denoted base1. The best result of an alignment with basic clues including the GIZA clues are called base2. Alignments including declarative clues contain the string ``+stat'' in their labels and alignments including dynamic clues include the string ``+dyn''. Identical names do not necessarily refer to the same setting for each of the three corpora. There are usually several settings that fit into the given category. Here, we will always use the best of our results for the given category and each specific corpus, which is not necessarily taken from results presented in previous sections. Note that all clues are weighted as described in the previous sections. An optimization of weights for the combination of clues has not been carried out. Alignment results using GIZA++ are denoted with GIZA++ and refer to standard settings of the GIZA++ system (IBM model 1 + HMM + IBM model 3 + IBM model 4, five iterations for each model). GIZA++ inverse refers to alignments in the opposite translation direction. The best result using the Uppsala Word Aligner is denoted UWA.
Figure 20 illustrates clue alignment results for different settings together with alignment performances of UWA and GIZA++ for the English-Swedish Bellow corpus.
The clue aligner clearly outperforms the simple UWA alignment approach even for basic setups. The main improvement can be found in recall, which is much higher than for the restrictive UWA approach. UWA yields high scores in terms of precision (about 86%) but low recall scores (only 52% in our example) resulting in an F-value of about 65% as indicated in figure 20. The precision of the clue aligner is below 81% for all settings but with corresponding recall values way above 52%. The clue alignment results are generally below the GIZA++ alignments as long as GIZA clues are not included in the alignment procedure. However, using the GIZA++ dictionaries implies a major improvement of the aligner performance and, combined with declarative and dynamic clues, it goes beyond the GIZA++ alignment result. The top result with an F-value of over 83.9% is clearly an improvement compared to the other approaches including GIZA++.
Figure 21 illustrates alignment results for the Swedish-English Scania 1995 corpus.
GIZA++ performs very well on the Swedish-English Scania corpus. Both, precision and recall are high above 80% (86.15% in precision and 85.85% in recall). Swedish and English are two very closely related languages and technical manuals such as the documents in the Scania corpus are very consistent in their terminology. Statistical alignment approaches profit from consistency as it can be seen in the excellent alignment that is produced by GIZA++. Certainly, the alignment direction suits the statistical aligner. GIZA++ links Swedish words to each English word in the corpus, which makes it possible to align several English words to the same Swedish word. The GIZA++ alignment in the other direction does not reach the same performance (only about 80.21%). This is consistent with the results of the alignments of the Bellow corpus. The highest performance there is also obtained when aligning Swedish words to their English counterparts. However, the performance difference between the two GIZA++ alignments is even larger for the Scania corpus than for the Bellow corpus. Similarly to the alignments of the Bellow corpus, GIZA++ outperforms the clue alignment approach when using base1 clues only. Using the GIZA dictionary (base2 clues) improves the clue alignment significantly. However, it does not reach the performance of the statistical alignment even though it touches its result (85.26% for a combination of base2, static and dynamic clues). This may be explained by the importance of distortion and fertility parameters estimated by GIZA++ but not used in the clue aligner. Scania documents contain many short sentences and sentence fragments with a large number of compound terms. Many alignment problems can be solved with the help of positional dependencies and relations between compositional compounds (in Swedish) and non-compositional compounds (in English).
Finally, in figure 22 alignment results for the Swedish-German Scania 1995 corpus are illustrated.
Similarly to the other two bitexts, the clue alignment performance increases when adding clues to the process and outperforms the Uppsala Word Aligner. Basic alignments using association measures (base1 clues) are worse than the alignments produced by GIZA++. However, adding the GIZA clue to the aligner increases the performance beyond the result of the statistical alignment. An interesting fact is that the base2 clue alignment performs better than both GIZA++ alignments. The difference between the two directional alignments performed by GIZA++ is minimal. However, symmetric alignment approaches, such as the clue alignment strategy used here, outperform directional strategies. The relation between Swedish and German is different from the relation between Swedish and English. Compounding is common in both languages (Swedish and German) and, therefore, the impact of the alignment direction is not very great. The quality of the statistical alignment is much lower than for the Swedish-English bitext, probably due to syntactic and morphological differences between Swedish and German, which are larger than between Swedish and English. It would be interesting to investigate these differences in detail by means of alignment examples.
This chapter includes one of the main contributions of the thesis. Word alignment techniques have been discussed in detail and their applications to different data collections have been investigated. Two word alignment systems have been presented, the Uppsala Word Aligner (UWA), with its iterative alignment strategy, and the matrix-based Clue Aligner. Both aligners have a modular design and integrate several alignment resources and techniques. UWA (section 0.16.2) implements a knowledge-poor approach using association measures based on co-occurrence and string similarity. It combines empirical data with alignment heuristics in an iterative linking procedure. Multi-word units can be handled in two ways: They can be identified either by means of n-gram statistics prior to the word linking step or dynamically within the linking procedure. The Clue Aligner (section 0.16.3) is a modular word alignment tool that supports the probabilistic combination of various resources. It is based on so-called alignment clues, which indicate translational relations between words and phrases according to their associated features. These features may be any set of linguistic annotations, word position, contextual information, and the words and phrases themselves. Alignment clues can be derived from association statistics (as in UWA), from statistical alignment models, and from aligned training data (dynamic clues). They can also be pre-defined (declarative clues). Dynamic clues can be learned even from noisy training data such as automatically word-aligned bitexts.
Both systems mentioned above have been evaluated using manually produced reference alignments (so-called gold standards). Fine-grained metrics for precision and recall have been defined for this purpose (section 0.17). UWA and the Clue Aligner have been applied to three different bitexts from two different genres (technical and literary texts). They include two language pairs, Swedish-English (in both translation directions) and Swedish-German (only in that direction). Different settings of the Clue Aligner have been tested and results have been compared to UWA alignments and to statistical alignments using the external toolkit GIZA++ (section 0.18). The experiments have demonstrated that the Clue Aligner is superior to UWA for all three test corpora. It has also been shown that the clue alignment approach can be used to improve statistical alignments produced by, for instance, GIZA++. The performance of the Clue Aligner depends on the quality of the alignment clues and on the weighting scheme used for the combination of clues. The optimization of the clue aligner settings and weights remains to be investigated.