next up previous contents index
Next: Applications of extracted bilingual Up: thesis_html Previous: Development of tools   Contents   Index

Subsections


Word alignment strategies and experiments

Word alignment is the basic task in extracting lexical data from parallel corpora. In this chapter, we will concentrate on techniques for automatic word alignment that have been investigated, developed and implemented in the thesis.

The task of word alignment was introduced in section 0.6. We will emphasize the combination of multiple resources for achieving alignment accuracy. In the first section, word alignment techniques are described. It includes an overview of alignment resources and association measures, and a presentation of two word alignment systems that have been developed in the thesis (the Uppsala Word Aligner (UWA) and the Clue Aligner). The second section of this chapter describes metrics for word alignment evaluation including a proposal of a new refined measure for the evaluation of partially correct alignment results. Finally, the last section, before summarizing the chapter, presents experimental results that have been achieved with the word alignment techniques as described in the sections before.

Alignment strategies

In this section we will discuss in detail, word alignment techniques used in our investigations. The first part summarizes resources and measures. The second part presents the principles of the Uppsala Word Aligner (UWA) and its ``greedy'' alignment approach. The third part contains a detailed description of the clue alignment approach including definitions of the terminology and explanatory examples.


Alignment resources

The main resource in word alignment is the bitext itself. Correspondences between words and phrases in the two languages can be derived directly from the corpus using statistics of word distributions and measures of similarity. Another group of resources for word alignment are external knowledge sources which can be divided into machine-readable data collections (such as bilingual dictionaries) and expert knowledge (such as association heuristics between word positions or part-of-speech tags).

Statistical resources

Statistical approaches to word alignment are usually built upon the co-occurrence of words and their translations in bitext segments. Statistical techniques as introduced in section 0.6.2 capture the co-occurrence property by estimating alignment parameters that describe translation relations according to a translation model. The estimated parameters maximize the likelihood of the data given the model, i.e. recurrent word pairs with similar contextual features obtain high probabilities when maximizing the likelihood function. More details about statistical estimation approaches can be found in chapter 0.3. Association approaches to word alignment use measures of correspondence for the purpose of identifying translation relations. These association measures are often derived from statistics using co-occurrence frequencies as their main parameter. Section 0.6.1 gives some background to such measures that have been applied in alignment tasks. In our approaches, we apply three co-occurrence measures, the Dice coefficient, t-scores, and point-wise mutual information. More information about these measures can be found in section 0.6.1.


String matching techniques

String similarity measures are used to find cognates in bitexts. Two common measures have been presented in section 0.6.1, the Dice coefficient and the longest common sub-sequence ratio (LCSR). In our alignment approaches, we apply the LCSR score.

LCSR measures string similarity based on common characters. However, the spelling of cognates often differs systematically between languages even if they are close to each other phonologically. Loan words that enter a language are usually altered in order to match language specific structures and rules. In many languages certain modifications are consistently applied when borrowing words from other languages. For example, the consonant 'c' in English usually corresponds to the letter 'k' in Swedish (and German) and vice versa when representing the phoneme /k/. Other examples of systematic spelling changes are the French word ``chauffeur'', which became ``chafför'' in Swedish, and the English word ``mail'', which is often spelled ``mejl'' in Swedish. Sometimes language specific characters are simply replaced by similar characters for being consistent with the alphabet of the borrowing language, for instance, ``smorgasbord'' in English, which is borrowed from Swedish (``smörgåsbord'').

A cognate recognizer should be aware of systematic spelling differences in order to identify a large number of possible cognates. Weighted string matching algorithms are discussed in [Tie99a] which can capture similarities even between non-identical characters. The same algorithm as for standard LCSR is used together with a character mapping function. This function defines weights for matching pairs of characters, even non-identical ones. In [Tie99a], three approaches the automatic construction of weighted string similarity measures are proposed. Weighted character mapping functions are learned automatically from a collection of cognate candidates. In the first approach pairs of vowels and pairs of consonants at similar positions are mapped to each other and co-occurrence scores are calculated for each character pair using the Dice coefficient. These scores represent the weights in the character mapping function. Similarly, in the second approach, pairs of vowel sequences and pairs of consonant sequences are mapped to each other and Dice scores are calculated in order to fill the character mapping function. Note that the unit to be matched is now a sequence of characters, which may consist of more than one character. This is also the case for the third approach, which yields the best result in the presented experiments. It is based on the co-occurrence of non-matching parts at similar positions in previously extracted cognate candidates. The following example illustrates non-matching parts from a Swedish-English cognate candidate:



Swedish k r i t i sk a
English c r i t i c a l


Non-matching parts in this example are the pairs ('k' $\rightarrow$ 'c'), ('sk' $\rightarrow$ 'c'), and ('' $\rightarrow$ 'l'). In [Tie99a], it has been demonstrated experimentally for Swedish and English, that a weighted string similarity measure with a mapping function that has been learned using the third approach improves recall and even precision for cognate extraction tasks.

Other resources

In section 0.6.1, external resources for word alignment have been discussed. Several resources have been applied in our word alignment approaches. Machine-readable bilingual dictionaries (MRBD) and alignment heuristics such as position weights have been used in the Uppsala Word Aligner which is presented in the following section. Pre-defined relations between morphosyntactic features are applied in the Clue Aligner which is presented in section 0.16.3.


Greedy word alignment - step by step

UWA implements a ``greedy'' word alignment approach based on association measures and alignment heuristics. The basic idea in this approach is to combine different resources in an iterative alignment procedure. The principles of this approach are inspired by Martin Kay's quote on machine translation [Kay97, p. 13]:

The keynote will be modesty. At each stage, we will do only what we know we can do reliably. Little steps for little feet!

Word alignment is a non-trivial task and also walks on ``little feet'', which leads to the principles of ``baby-steps'' for word alignment as formulated in [Tie99c, p. 218]:

  1. Prepare carefully before taking the first step.
  2. Use all available tools that can help.
  3. Check alternatives before taking the next step.
  4. Take safe steps first; try risky steps later.
  5. Remove everything that is in the way.
  6. Improve walking by learning from previous steps.
  7. Reach the goal with many steps rather than one big one.
  8. Continue trying until you cannot get any closer.

Using these principles and the modular design of the Uplug system, the Uppsala Word Aligner (UWA) was designed as sketched in figure 11.

Figure 11: The Uppsala Word Aligner.

The aligner starts with a pre-processing step (principle 1), which includes the recognition and extraction of MWUs for both languages in the bitext. Collocations are identified using co-occurrence statistics (point-wise mutual information) and language-specific classified stop word lists for phrase boundary detection. In the second step, alignment candidates by means of type links are collected from any appropriate source (principle 2). Candidates are ranked according to their reliability (principle 3), i.e. their association score. Alignment candidates include cognates (found by means LCSR scores), associated pairs (found by means of Dice scores, point-wise mutual information or t-scores), single-word bitext segments5.1, pairs of rare words (which occur significantly less frequently than all other words in the bitext segment), and machine-readable bilingual dictionaries (MRBD). ``Risky'' candidates should be avoided (principle 4); several heuristics help to keep the collection of link candidates as clean as possible: Link distance thresholds (assuming that corresponding words do not occur arbitrarily far apart from each other), string length thresholds (short words are unreliable for string similarity measures), constraints on length difference ratios (assuming that related words do not exceed a certain length difference), and frequency thresholds (sparse data produce unreliable statistics). Word alignment includes several steps (principle 7) in the form of linking cascades. The process is implemented as a greedy, best-first (principle 4) search in the manner of competitive linking. Competitive linking implies that aligned words cannot be aligned again which allows the removal of linked words (principle 5). Unlike Melamed's approach [Mel96a], we do not restrict the greedy search to one-to-one alignments but include also (possibly overlapping) multi-word-units (MWUs) in the word alignment competition.

The word alignment cascades are not the end of the process. Type links are extracted from the aligned bitext and return to the alignment process as an additional resource (principle 6). Due to the competitive linking process the size of the corpus is reduced by the number of aligned words. The reduced corpus enters another alignment iteration starting, again, with a ``careful'' preparation of the first step in the process, i.e. the collocation extraction. The MWU recognizer finds adjacent collocations only. However, having aligned words removed from the bitext it is possible to find long-distance collocations, assuming that interrupting items have been removed completely from the corpus in the previous iteration. This motivates the repeated use of the MWU recognizer. Furthermore, the contents of the corpus has been changed significantly with respect to word occurrence and co-occurrence of word pairs. Statistically significant associations between words (and new MWUs) may arise in the reduced bitext, which could not been identified previously in the corpus. Finally, the reduced bitext may contain additional single-word bitext segments that can be directly aligned. The iterative process can be continued as long as new links are found (principle 8).


Combining clues for word alignment

Word alignment using a combination of association clues has been introduced in [Tie03]. It is inspired by previous work (as described in section 0.16.2) but applies refined techniques for the combination of different resources. The main idea about the clue alignment approach is to incorporate several knowledge sources into the word alignment process in a systematic way. Empirical data shall be complemented with linguistic information, statistical parameters shall be combined with alignment heuristics.

Most word alignment models apply knowledge-poor approaches, i.e. alignment based on plain texts disregarding all additional information that can be gathered for the languages involved. Simple tools such as stemming functions and language-specific stop word lists are often added to improve the performance of simple association approaches. Recently, initial studies on the impact of linguistic information, such as part-of-speech, on word alignment have been carried out as described in chapter 0.3. The clue alignment approach represents a flexible framework for such a combination of language-specific, heuristic, and statistical resources in word alignment.


Word alignment clues

For many languages, reliable tools are available for enriching corpus data with linguistic information. Examples of such tools are part-of-speech taggers, lemmatizers, named-entity recognizers, and (shallow) syntactic parsers. Other kinds of language-specific information may be found in dictionaries including idiomatic expressions, multi-word terms, word meanings, word translations etc.

It may be assumed that such data and tools have a positive impact on the identification of translation relations between lexical units in bitexts. For an illustration of this intuition, see for instance the simple example in figure 12 where translation correspondences are expressed between automatically tagged chunks rather than between words.

Figure 12: Linguistically enriched bitext segments.

Linguistic information and contextual features may be regarded as clues to relations between words. In particular, these clues may indicate translational correspondences between words in bitexts. Contextual clues are frequently used by lexicographers for the identification of word senses. Similarly, the clue alignment approach tries to utilize information about the context for the identification of translation equivalents. Many types of features can be explored: Spelling similarities may indicate cognates. Information about word classes may support linking decisions. Word position can be another key to alignment preferences. Morphosyntactic features and information about syntactic functions may help to find structural similarities and translational correspondences between lexical items within these structures. Some of these alignment clues may be very specific (e.g. default translations taken from a dictionary) and others may be very general (e.g. pairs of chunk labels such as 'NP-NP' in figure 12). Depending on their impact, clues may be of different importance for the identification of translational correspondence.

The following definitions are used in the clue alignment model:

Word alignment clue:
A word alignment clue $C_{i}(s,t)$ is a probability
$P(a_{i}\vert s,t)$ which indicates an association $a_{i}$ between two lexical items $s$ and $t$ in parallel texts. The association is used as a binary function, i.e. the probability distribution includes only the two values $P(a_{i}\vert s,t)$ (representing the probability of $s$ and $t$ being associated) and $P(\neg a_{i}\vert s,t)=1-P(a_{i}\vert s,t)$ (representing the probability of $s$ and $t$ not being associated).
Lexical item:
A lexical item $x$ is a set of words5.2with associated features $f_{x}$ attached to it. Features may include any information attached to $x$ or to the context of $x$ (word position may also be a feature).
Declarative clue:
Clues which have been pre-defined are declarative clues, i.e. declarative clues are independent of the training data. Typical examples of declarative clues are pre-defined associations between word pairs from related word classes (represented by their part-of-speech tag) or translations derived from machine-readable dictionaries.
Estimated clue:
Clues which have been derived from training data are called estimated clues. They can be derived from association measures:
$C_{i}(s,t)=w_{i}A_{i}(s,t)$ with $w_{i}$ being a factor for weighting and normalizing the association score and $A_{i}(s,t)$ being a measure such as the Dice coefficient $A_{Dice}(s,t)=w_{Dice}Dice(s,t)$ or the longest common sub-sequence ratio $A_{LCSR}(s,t)=w_{LCSR}LCSR(s,t)$. Other estimated clues can be learned from aligned training data: $C_{j}(s,t)=w_{j}A_{j}(f_{s},f_{t})$ where $f_{s}$ and $f_{t}$ are features of source and target language items, and $w_{j}$ is a weight.
Clue resources :
A source of alignment clues or an association metric for clue estimation is called a clue resource.
Clue patterns :
Patterns that match lexical or contextual features are called clue patterns. They are used for learning clues from aligned training data.
Total clue :
Clues can be combined and the combination of all available clues is called the total clue: $C_{all}(s,t)=P(a_{all}\vert s,t)=P(a_{1}\cup a_{2}\cup .. \cup a_{x}\vert s,t)$ Clues are not mutually exclusive. Several association types can be found together, e.g., an association based on co-occurrence can be found together with an association based on string similarity. The total clue combines these associations in order to strengthen the evidence of certain translation relations indicated by different clues. The union of two clues is defined as follows: $P(a_{1}\cup a_{2}\vert s,t)=P(a_{1}\vert s,t)+P(a_{2}\vert s,t)-P(a_{1}\cap
a_{2}\vert s,t)$. For simplicity, we assume that clues are independent of each other:
$P(a_{1}\cap a_{2}\vert s,t)=P(a_{1}\vert s,t)P(a_{2}\vert s,t)$.
Clue value distribution:
A clue indicates associations between all its member token pairs. The distribution of clue values to member token pairs can formally be expressed as $C_{i,s^{N},t^{M}}(s_{n},t_{m})=C_{i}(s^{N},t^{M})$ for the source item $s^{N}=s_{1}s_{2}..s_{N}$ ($n\in\{1..N\}$) and the target item $t^{M}=t_{1}t_{2}..t_{M}$ ($m\in\{1..M\}$).

The definition of word alignment clues is very general and allows a large variety of clue types. Clues are (as in real life) not always helpful. Bad clues can be misleading. This has to be considered when choosing clue resources and designing clue patterns. The impact of certain clues in combination with others is another issue, which becomes very important when setting weights for individual clue patterns and resources. The optimal solution for the compilation of alignment clues would be to find a way of automatic approval of possible clues for the task of word alignment including an automatic setting of optimal weights for ``good'' clues. This could be achieved using representative test data and an optimization procedure. However, this is very time consuming due to the large number of possible options and parameters. An intermediate solution is to adjust the clue settings experimentally according to the alignment results using intuitively reasonable settings and clue patterns.


The clue matrix

The distributional property of clues as defined above enables the combination of clues on a word-to-word level according to the clue combination rule. In this way, a value for the total clue can be calculated for each pair of words for a given bitext segment using all available clues, even for overlapping items. One-to-one relations between words of a bitext segment can be represented in a two-dimensional space, which we will call the clue matrix as it is filled with clue values. The following example is constructed from a bitext segment taken from a translation of Saul Bellow's ``To Jerusalem and back: a personal account'' [Bel76] to Swedish [Bel77] (henceforth the Bellow corpus):

Then hand luggage is opened.
Sedan öppnas handbagaget.

Let us assume clues derived from co-occurrence measures (Dice) and string similarity measures (LCSR) in table 2.


Table 2: Word alignment clues.
co-occurrence (Dice) string similarity (LCSR)
then sedan 0.38 hand luggage handbagaget 0.67
is opened öppnas 0.30 opened öppnas 0.33
is opened sedan öppnas 0.30 then sedan 0.40
opened öppnas 0.65 hand sedan 0.40
luggage handbagage 0.45 hand handbagaget 0.36


Using the clues from table 2 we can fill each cell of the following clue matrix (figure 13) with corresponding values (without weighting scores).

Figure 13: A clue matrix (all values in %).
\begin{figure}\begin{center}
\footnotesize\begin{tabular}{\vert l\vert ccc\vert}...
...ened & 30 & 83 & 0 \\
\hline
\end{tabular}\normalsize \end{center} \end{figure}

The example in figure 13 illustrates how overlapping clues may strengthen the reliability of word links. Most of the clues in the example do not strongly suggest links on their own (especially clues below 0.5). However, in combination they create a clearer picture about the alignment that should be preferred.

In the following section, word alignment strategies using association clues collected in clue matrices are presented and discussed.


Clue alignment

A clue matrix summarizes information from various sources that can be used for the identification of translation relations. However, there is no obvious way to utilize this information for word alignment as we explicitly include multi-word units in our approach. The clue matrix in figure 14 has been obtained for a bitext segment from the Bellow corpus using a set of weighted declarative and estimated clues.

Figure 14: Another clue matrix (all values in %).
\begin{figure}\begin{center}
\footnotesize\begin{tabular}{\vert l\vert ccccc\ver...
... 2 & 1 & 4 & 12 & 6\\
\hline
\end{tabular}\normalsize \end{center} \end{figure}

There are many ways of ``clustering'' words together and there is no obvious maximization procedure for finding the alignment optimum. The alignment procedure depends very much on the definition of an optimal alignment. The best alignment for our example would probably be the set of the following links:


\begin{displaymath}
\footnotesize\textrm{links}=\left\{
\begin{array}{ll}
\textr...
... & \textrm{särskilt mycket} \\
\end{array}\right\}
\normalsize\end{displaymath}



A typical procedure for automatic word alignment is to start with one-to-one word links. Links that have a common source or target language words are called overlapping links. Sets of overlapping links, which do not overlap with any other link outside the set, are called link clusters. Aligning words one by one often produces overlaps and in this way implicitly creates aligned multi-word-units as part of link clusters. A general word-to-word alignment $L$ for a given bitext segment with $N$ source language words ( $s_{1}s_{2}...s_{N}$) and $M$ target language words ( $t_{1}t_{2}...t_{M}$) can be formally described as a set of links $L_{x}$


\begin{displaymath}
L=\{L_{1},L_{2},...,L_{x}\}\textrm{ with }
L_{x}=\left[s_{x_{1}},t_{x_{2}}\right],
x_{1}\in\{1..N\}, x_{2}\in\{1..M\}
\end{displaymath}

This general definition allows varying numbers of links ( $0\leq x\leq N*M$) within possible alignments $L$. There is no obvious function that can be maximized for finding the optimal alignment. The notion of an alignment optimum can be defined in several ways.

One such word-to-word alignment approach is to assume a directional word alignment model similar to the models in statistical machine translation (see section 0.6.2). The directional alignment model assumes that there is at most one link for each source language word. Using alignment clues, this can be expressed as the following optimization problem: $\hat{L}^{D}=argmax_{L^{D}}\prod_{n=1}^{N}C_{all}(L_{n}^{D})$ where $L^{D}=\{L_{1}^{D},L_{2}^{D},..,L_{N}^{D}\}$ is a set of links $L_{n}^{D}=\left[s_{n},t_{a_{n}^{D}}\right]$ with $a_{n}^{D}\in \{1..M\}$ and $C_{all}(L_{n}^{D})$ is the ``total clue value'' for the linked items $s_{n}$ and $t_{a_{n}^{D}}$. In other words, word alignment is the search for the best link for each source language word. Directional models do not allow multiple links from one item to several target items. However, target items can be linked to multiple source language words as several source language words can be aligned to the same target language word. The direction of alignment can easily be reversed, which leads to the inverse directional alignment: $\hat{L}^{I}=argmax_{L^{I}}\prod_{m=1}^{M}C_{all}(L_{m}^{I})$ with links $L_{m}^{I}=\left[s_{a_{m}^{I}},t_{m}\right]$ and $a_{m}^{I}\in
\{1..N\}$. In the inverse directional alignment, source language words can be linked to multiple words but not the other way around. The following figure illustrates directional alignment models applied to the example in figure 14:




\begin{displaymath}
\footnotesize\begin{array}{ll}
\hat{L}^{D}=\left\{
\begin{a...
...trm{tålamod} \\
\end{array}\right\} \\
\end{array}\normalsize\end{displaymath}



Directional link sets can be combined in several ways. The union of link sets ( $\hat{L}^{\cup}=\hat{L}^{D}\cup\hat{L}^{I}$) usually causes many overlaps and, hence, very large link clusters. On the other hand, an intersection of link sets ( $\hat{L}^{\cap}=\hat{L}^{D}\cap\hat{L}^{I}$) removes all overlaps and leaves only highly confident one-to-one word links behind. Using the same example from above we obtain the following alignments:




\begin{displaymath}
\footnotesize\begin{array}{ll}
\hat{L}^{\cup}=\left\{
\begin...
...link clusters}^{\cap}=\hat{L}^{\cap} \\
\end{array}\normalsize\end{displaymath}



The union and the intersection of links do not produce satisfactory results. Another possibility is a refined combination of link sets
( $\hat{L}^{R}=\{\hat{L}^{D}\cap\hat{L}^{I}\}\cup\{L_{1}^{R},...,L_{r}^{R}\}$) as suggested by Och and Ney [ON00b]. In this approach, the intersection of links is iteratively extended by additional links $L_{r}^{R}$ which pass one of the following two constraints:

Applying this approach to the example, we get:




\begin{displaymath}
\footnotesize\hat{L}^{R}=\left\{
\begin{array}{ll}
\textrm{n...
...patient} & \textrm{tålamod} \\
\end{array}\right\}
\normalsize\end{displaymath}



Another alignment approach is the competitive linking approach proposed by Melamed [Mel96a]. In this approach, one assumes that there are only one-to-one word links. The alignment is done in a ``best-first'' search manner where links with the highest association scores are aligned first, and the aligned items are then immediately removed from the search space. This process is repeated until no more links can be found. In this way, the optimal alignment ($\hat{L}^{C}$) for non-overlapping one-to-one links is found. The number of possible links in an alignment is reduced to $min(N,M)$. Using competitive linking with our example we yield:




\begin{displaymath}
\footnotesize\hat{L}^{C}=\left\{
\begin{array}{ll}
\textrm{n...
...ay}\right\},
\textrm{link clusters}^{C}=\hat{L}^{C}
\normalsize\end{displaymath}



Another iterative alignment approach is proposed in [Tie03]. In this approach, the link $L_{x}^{B}=\left[s_{x_{1}},t_{x_{2}}\right]$ with the highest score in the clue matrix
$\hat{C}_{all}(s_{x_{1}},t_{x_{2}})=max_{s_{i},t_{j}}(C_{all}(s_{i},t_{j}))$ is added to the set of link clusters if it fulfills certain constraints. The top score is removed from the matrix (i.e. set to zero) and the link search is repeated until no more links can be found. This is basically a constrained best-first search. Several constraints are possible. In [Tie03] an adjacency check is suggested, i.e. overlapping links are accepted only if they are adjacent to other links in one and only one existing link cluster. Non-overlapping links are always accepted (i.e. a non-overlapping links creates a new link cluster). Other possible constraints are clue value thresholds, thresholds for clue score differences between adjacent links, or syntactic constraints (e.g. that link clusters may not cross phrase boundaries). Using a best-first search strategy with the adjacency constraint we obtain the following alignment:




\begin{displaymath}
\footnotesize\hat{L}^{B}=\left\{
\begin{array}{ll}
\textrm{n...
...rm{särskilt mycket tålamod} \\
\end{array}\right\}
\normalsize\end{displaymath}



None of the alignment approaches described above produces the manual reference alignment in our example using the given clue matrix. However, simple iterative procedures come very close to the reference and produce acceptable alignments even for multi-word units, which is promising for an automatic clue alignment system. Directional alignment models depend very much on the relation between the source and the target language. One direction usually works better than the other, e.g. an alignment from English to Swedish is better than Swedish to English because in English terms and concepts are often split into several words whereas Swedish tends to contain many compositional compounds. Symmetric approaches to word alignment are certainly more appropriate for general alignment systems than directional ones.


Association measures as clues

Using association measures for estimating alignment clues lead to another complication, the proper transformation of association scores into association probabilities. The definition of estimated clues using arbitrary association measures assumes a linear correlation between the scores of the measure and the probability of the corresponding association. For example, the degree of similarity of two strings by means of LCSR assumed to be linearly correlated with the probability of the two strings being cognates. The correlation factor is expressed in the value of $w_{i}=\lambda_{i}/Z_{i}$ where $Z_{i}$ is used to normalize the association score (i.e. to transform the score into a value between 0 and 1) and $\lambda_{i}$ is used to weight the score according to the correlation between the measure and the corresponding probability.

The normalization factor can be found by setting $Z_{i}=max_{x,y}\left(A_{i}(x,y)\right)$, for instance $Z_{LCSR}=max_{x,y}\left(LCSR(x,y)\right)=1$ (for identical strings $x$ and $y$). However, for measures like point-wise mutual information, such a maximum is not defined for arbitrary arguments $x$ and $y$. In the case of point-wise mutual information it is possible to make $Z_{I}$ dependent on specific items $s$ and $t$, as the maximum of point-wise mutual information is given with $max_{x}(I(x,t))=max_{x}\left(log_{2}\frac{P(x,t)}{P(x)P(t)}\right)=
log_{2}\frac{P(x)}{P(x)P(t)}=-log_{2}P(t)$ and similarly
$max_{y}(I(s,y))=-log_{2}P(s)$ [MS99]. Normalizing point-wise mutual information can then be done by setting $Z_{I,s,t}=
max\left(max_{y}\left(I(s,y)\right),max_{x}\left(I(x,t)\right)\right)=
max(-log_{2}P(s),-log_{2}P(t))$. Other measures like t-scores may even produce negative values that can be interpreted as statistical evidence for items to avoid each other [CGHH91]. A simple way of handling negative scores is to disregard them. However, the value of general t-scores can also be arbitrarily large. The maximum of a t-score given certain items $s$ and $t$ also depends on the estimation of the standard error, which involves the size of the corpus as one parameter. Normalization in these cases is very inefficient because it depends on several parameters and has to be re-calculated for individual word pairs. An approximate solution for normalization is to use a large value $\tilde Z_{i}\gg A_{i}(s,t)$ for highly correlated items $s$ and $t$. This approximation is sufficient as long as the score does not exceed $\tilde Z_{i}$.

Another important issue is the adjustment of an appropriate weighting scheme. Weights have to resemble the importance of specific clues. However, an optimization of weights has not been carried out in the present study.

In [Tie03], word alignment experiments are presented using two clues, which have been derived from association scores: cognate clues using LCSR scores and co-occurrence clues using the Dice coefficient. LCSR and Dice produce values between 0 and 1, thus, simplifying the normalization. The clue weights have been set uniformly to 0.5. One of the advantages of the clue alignment approach is that clues may even refer to overlapping multi-word units. This makes it possible to compute association scores for various words and word groups from the corpus. Two approaches to the identification of MWU candidates have been tested in [Tie03]: Pre-defined chunks using shallow parsers and arbitrary bi- and tri-grams5.3.


Learning clues

As mentioned earlier, clues can be learned from aligned data. Learning clues is based on the assumption that an association clue $C_{j}$ between lexical items is linearly correlated with the likelihood of $f_{t}$ and $f_{s}$ being features of aligned source and target language items. In this way, word alignment relations are represented as generalized clues from aligned training data using clue patterns that define the features that are assumed to carry the relational information.

One way to estimate the correlation between features of aligned items is to use a conditional probability $P(f_{t}\vert f_{s})$, which is the likelihood of the features $f_{t}$ of target language items given the features $f_{s}$ of aligned source language items. This likelihood can be estimated using the frequency $freq(f_{s},f_{t})$ of co-occurring features in aligned training data, the feature frequencies $freq(f_{s})$ and $freq(f_{t})$ in the same data, and a maximum likelihood estimation:


\begin{displaymath}
C_{j}(s,t)=
w_{j}A_{j}(f_{s},f_{t})=
w_{j}P(f_{t}\vert f_{s})\approx
w_{j}\frac{freq(f_{s},f_{t})}{freq(f_{t})}
\end{displaymath} (5.1)

The correlation factor is expressed as the weight $w_{j}$. Features can be any kind of information attached to the lexical items or to any contextual data. A conditional probability is ``directional'', i.e. estimations of $P(f_{t}\vert f_{s})$ may be very different to estimations of $P(f_{s}\vert f_{t})$ especially for features with very different occurrence frequencies. One possibility of a symmetric estimation of the association between features is to use the Dice coefficient for combining both conditional probabilities.


\begin{displaymath}
C_{j}(s,t)=
w_{j}A_{j}(f_{s},f_{t})=
w_{j}\frac{2*P(f_{s}...
...prox
w_{j}\frac{2*freq(f_{s},f_{t})}{freq(f_{s})+freq(f_{t})}
\end{displaymath} (5.2)

The Dice coefficient is in fact the harmonic mean of the two conditional probabilities $P(f_{s}\vert f_{t})$ and $P(f_{t}\vert f_{s})$ as shown in equation 2.

Unfortunately, word aligned training data is not available and therefore, conditional feature probabilities cannot be estimated directly. However, other clues such as association measure clues may be used to create word alignments as described earlier. The idea now is to use automatically aligned data as training data for the estimation of feature clues. In this way, alignment clues can be found in a boot-strapping procedure, starting with a basic alignment and learning new clues from (noisy) alignments that have been produced using previous clues. Such self-learning techniques are known to increase noise. However, experimental results show that the alignment gains from dynamically learned clues, as is described in [Tie03] and section 0.18. The main reason for the success of this method is that previously unused information and contextual dependencies can be integrated in the alignment process. For example, relations between part-of-speech tags can be found, which can be used to generalize the translation relation between words that belong to certain word classes. Translational dependencies between word positions can be learned and syntactic relations can be identified. Many features are possible for the estimation of dynamic clues. Four dynamic clue patterns have been defined in [Tie03]:

POS :
Part-of-speech (POS) tags are used as features. This clue pattern assumes relations between POS tags of corresponding lexical items.
POS coarse :
Some POS tags (e.g. SUC-tags5.4 for Swedish) include detailed morphosyntactic information. POS coarse assumes associations between reduced tag-sets.
phrase :
Chunk labels (representing common phrase types) can be used as a feature. Associations are assumed between certain chunk types that may help to find translation relations between lexical items belonging to these chunks.
position :
The feature in this clue pattern is the word position of a translated item relative to the word position of the original item in the source language segment.

Many more examples than the ones above could be shown. There is a large variety of possible feature sets and many possible combinations among them. Some other examples of dynamic clue patterns are listed in table 3.


Table 3: Dynamic clue patterns.

name pattern
(source and target)
features
lex #text word itself
lexpos #text
pos
word + its part-of-speech tag
postrigram left:pos
pos
right:pos
part-of-speech tag trigram (previous word, current word, next word)
posposition pos
relative position
part-of-speech tag + word position relative to the position of the aligned word
chunktrigram chunk:left:type
chunk:type
chunk:right:type
chunk label trigram (this chunk, previous chunk, next chunk)
chunktrigrampos chunk:left:type
chunk:type
chunk:right:type
pos
chunk label trigram (this chunk, previous chunk, next chunk) + part-of-speech tag


Clue patterns do not have to be symmetric like the ones described here. A clue pattern can also refer to features of different complexity for source and target languages. However, it is important to bear in mind that the design of clue patterns is important for the reliability of clues that have been learned using these patterns. Learning clues assumes that associations between translated words can be generalized as relations between certain features. As mentioned earlier, clues can be misleading, which is certainly true for dynamic clues that may ``over-generalize'' relations, i.e. they may express relations that do not correlate with associations we are looking for. Another crucial assumption is the independence assumption between clues, which is needed for the combination of clues. Overlapping associations indicators violate this assumption, which may influence the approach in a negative way.

Statistical alignment models and clue alignment

Word alignment clues may be derived from any source that ``promises'' to find associations between words and phrases. Alignment approaches used in statistical machine translation are such a source producing, among other things, bilingual lexical translation probabilities. SMT uses refined translation models for estimating lexical probabilities that are directly applicable as a resource in the clue alignment system. SMT alignment models include various kinds of dependencies, which makes them a valuable resource, evident in the experimental results (section 0.18.2). There are several results of SMT alignments, beside the lexical probabilities, that can be applied in clue alignment. SMT itself produces a word aligned bitext suitable for training. Type links can be extracted directly from the aligned corpus and may be applied as lexical resources in clue alignment. Distortion parameters can be used as an additional clue. Even fertility parameters and mappings to the ``empty word'' could be added to improve the clue alignment approach. However, the main resource that has been used in our experiments is the set of lexical translation probabilities, which can be directly included in clue alignment experiments.


Declarative clues

Declarative clues are pre-defined and originate from language sources such as bilingual dictionaries or contain expert knowledge such as pre-defined relations between certain feature pairs. Declarative clues have not been used in the experiments presented in [Tie03]. Further experiments (see section 0.18.3) show that simple declarative clues improve alignment results. An example of a simple declarative clue for aligning English and Swedish texts is the part-of-speech clue in figure 155.5.

Figure 15: Declarative part-of-speech clues.
\begin{figure}\footnotesize\begin{verbatim}...

Similarly, declarative clues can be defined between chunk labels, relative word positions, common function words etc. The following declarative clues have been defined for the experiments in section 0.18:

POS :
Pre-defined relations between part-of-speech labels. An example for English-Swedish can be seen in appendix .4.
POS-MWU :
Relations between part-of-speech label sequences as shown in figure 15
Chunk :
Pre-defined relations between chunk-labels. An English-Swedish example is included in appendix .4.


Negative clues

The clues discussed above are all used as ``positive'' indicators for associations between lexical items. It has already been mentioned before that certain clues may be misleading depending on the reliability of the source from which they have been derived. Some clues may indicate exactly the opposite of positive clues, i.e. high clue values may signal a ``negative'' relation between items such as negative t-scores indicate words which ``avoid'' each other in a corpus. Such negative clues ($C^{-}$) could be utilized in alignment in order to prevent incorrect links. One way would be to use the complement of a negative clue ($1-C^{-}$) in the existing framework. However, this does not correspond to the actual function of a negative example as it does not reduce already existing association scores but only improves scores for items that do not match the example. A proper integration of negative clues would be a change in the clue alignment model in order to allow the reduction of association scores according to the values of negative clues. One could, for example, define the total clue as the joint probability of positive clues and the complement of negative clues. Using the independence assumption for alignment clues the total clue could be computed as the following product: $C_{all}=C^{+}*(1-C^{-})$. However, in this way, negative clues would have a very strong influence on the indicated associations as they affect all scores. Consequently, negative clues have to be chosen very carefully in order to lead the alignment program to the correct decisions. Strong negative clues could be, for example, pre-defined links of words that should absolutely not be aligned. Examples of such links would be pairs of non-related function words, which tend to obtain rather large association scores because they co-occur frequently. Other examples would be word pairs that are very similar but are in fact ``false friends''. The impact of negative clues has not yet been tested and a proper way of integrating them in the system has not yet been found. It would be interesting to investigate this further in the future.


Evaluation metrics

In section 0.6.4, two types of measures for word alignment evaluation have been presented, the $ARCADE$ measures for translation spotting and the $SPLIT$ measures for exhaustive word-to-word alignments. Another measure has been proposed by the author of this thesis in [AMST00]. The measures in this approach (the $PWA$ measures) are tailored toward gold standards which include complex MWU links of sampled words from the bitext. In the following I will review this approach, revise the evaluation measures, and propose refined metrics for word alignment evaluation (the $MWU$ measures).


The PWA measures

For the computation of the PWA measures, the following partiality value Q is calculated for partially correct link proposals5.6 for each reference link of the gold standard.


\begin{displaymath}
Q_{x}=\frac{\vert aligned_{src}^{x}\cap correct_{src}^{x}\ve...
...ax(\vert aligned_{trg}^{x}\vert,\vert correct_{trg}^{x}\vert)}
\end{displaymath}

The set of $aligned_{src}^{x}$ includes all source language words of all proposed links if at least one of them is partially correct with respect to the reference link $x$ from the gold standard. Similarly, $aligned_{trg}^{x}$ refers to all the proposed target language words. $correct_{src}^{x}$ and $correct_{trg}^{x}$ refer to the sets of source and target language words in link $x$ of the gold standard. Using the partiality value $Q$, we can defined the recall and precision metrics as follows:


\begin{displaymath}
R_{pwa}=\frac{\sum_{x=1}^{X}Q_{x}}{\vert correct\vert},
P_{pwa}=\frac{\sum_{x=1}^{X}Q_{x}}{\vert aligned\vert}
\end{displaymath}

The value of $\vert aligned\vert$ refers to the number of links $x$ for which there is at least one link proposal by the system, i.e. the number of links $x$ for which there is a link proposal $n$ with $\vert aligned_{src}^{n}\cap correct_{src}^{x}\vert>0$ or $\vert aligned_{trg}^{n}\cap correct_{trg}^{x}\vert>0$. The value of $\vert correct\vert$ is the size of the gold standard ($X$). The measures above take the correct portions of partially correct links into account. A consequence of this definition is that partially correct links will be scored as completely correct if they cover a MWU link completely and nothing else than words of the MWU link. For example, if the reference link represents the linked MWUs ``United Nations'' and the Swedish translation ``Förenta nationerna'' the following sets of links will be considered to be completely correct with respect to the reference:
(United $\rightarrow$ Förenta, Nations $\rightarrow$ nationerna),
(United Nations $\rightarrow$ Förenta nationerna),
but also (United $\rightarrow$ nationerna, Nations $\rightarrow$ Förenta).
The first two link sets are perfectly acceptable whereas the third set is unlikely to be accepted in isolation of the MWU alignment. This problem is a consequence of the fact that the gold standard does not give any information about how the internal parts of a MWU link should be covered. However, cases like the constructed one above are usually very rare and acceptable with respect to the underlying MWU link.

Another problem with the partiality measure $Q$ is that its value is the same for both precision and recall. Hence, precision and recall are always identical in cases where at least one partially correct link has been proposed for each reference link, i.e. $\vert aligned\vert=\vert correct\vert$.


The MWU measures

In order to capture the difference between precision and recall I propose the following refinements of the PWA measures for precision and recall using MWU links5.7:


\begin{displaymath}
Q^{precision}_{x}=\frac{\vert aligned_{src}^{x}\cap correct_...
...t}
{\vert aligned_{src}^{x}\vert+\vert aligned_{trg}^{x}\vert}
\end{displaymath}


\begin{displaymath}
Q^{recall}_{x}=\frac{\vert aligned_{src}^{x}\cap correct_{sr...
...t}
{\vert correct_{src}^{x}\vert+\vert correct_{trg}^{x}\vert}
\end{displaymath}


\begin{displaymath}
R_{mwu}=\frac{\sum_{x=1}^{X}Q^{recall}_{x}}{\vert correct\ve...
...wu}=\frac{\sum_{x=1}^{X}Q^{precision}_{x}}{\vert aligned\vert}
\end{displaymath}

These measures will be called the MWU measures for word alignment evaluation in the following. Let us review the example from table 1 on page [*] in order to illustrate the differences between the evaluation metrics which have been introduced so far.


Table 4: Word alignment evaluation metrics.
Gold standards alignment
complex alignment MWU splitting proposed links
no one $\rightarrow$ ingen is $\rightarrow$ visar (S) patient $\rightarrow$ tålamod
is patient $\rightarrow$ visar tålamod is $\rightarrow$ tålamod (P) very $\rightarrow$ mycket
very $\rightarrow$ mycket patient $\rightarrow$ visar (P)
patient $\rightarrow$ tålamod (S)
no $\rightarrow$ ingen (S)
one $\rightarrow$ ingen (S)
very $\rightarrow$ mycket (S)
precision recall F
$P_{split}=2/2=1.00$ $R_{split}=2/5=0.40$ $F_{split}\approx0.57$
$AER\approx0.43$
$P_{arcade}=\frac{0+1+1}{3}\approx0.67$ $R_{arcade}=\frac{0+1/2+1}{3}=0.50$ $F_{arcade}\approx0.57$
$P_{pwa}=\frac{0+2/4+2/2}{2}=0.75$ $R_{pwa}=\frac{0+2/4+2/2}{3}=0.50$ $F_{pwa}=0.60$
$P_{mwu}=\frac{0+2/2+2/2}{2}=1.00$ $R_{mwu}=\frac{0+2/4+2/2}{3}=0.50$ $F_{mwu}=0.67$


Table 4 contains different evaluation scores given the links from the sample gold standard5.8 (column 1 and 2) and two link proposals by an imaginary alignment system (column 3). The differences between precision scores are quite large. The example illustrates the dependence of the evaluation on particular metrics, which have been tailored towards specific alignment purposes and styles of gold standards. Note that the precision measure $P_{split}$ is only possible for completely aligned gold standard documents. Otherwise it is impossible to judge how many alignments the system proposes for existing gold standard links. Compare this to a word sampling method for creating gold standards. Using only samples of words makes it impossible to say if a proposed link is to be counted as incorrect with respect to the sampled link or just as another link that is not to be considered in the evaluation. The difference between the PWA measures and the MWU measures can be mainly seen in precision. In our example, the PWA measure ``punishes'' the proposed links because of their partiality in both, precision and recall. However, they are not necessarily wrong but incomplete. The MWU measures capture this fact by a clear difference between precision and recall. In our experiments, we applied word sampling gold standards with complex MWUs. Hence, the MWU measures are a natural choice for evaluation of the alignment performance.


Experiments and results

A large number of experiments were carried out mainly using the corpus data from the PLUG project. Results have been evaluated using the gold standards created in the same project [MA99]. An extensive summary of alignment results, using the greedy word aligner described in section 0.16.2, can be found in [AMST99]5.9. Results from initial experiments with the clue alignment approach are presented in [Tie03].

In this section I will present alignment experiments which were carried out recently on three sub-corpora from the PLUG corpus, the novel ``To Jerusalem and back: A personal account'' by Saul Bellow in English and Swedish (Bellow), the Scania 1995 corpus in Swedish and English (Scania95en), and the Scania 1995 corpus in Swedish and German (Scania95de). These three sub-corpora were chosen for comparing alignment results for different text types (literary and technical text) and for different language pairs (Swedish-English, English-Swedish and Swedish-German). We will concentrate on the clue alignment approach as it allows many interesting combinations of techniques and resources.

First, I will describe the setup of the presented experiments. This includes a brief overview of the gold standards which have been used for evaluation. Secondly, I will present experiments using basic alignment clues. Thirdly, I will discuss the effect of declarative and dynamic clues on alignment performances. Then, I will compare different search strategies and their impact on alignment quality. Finally, I will discuss the overall alignment performance in comparison with previously achieved results.

Experimental setup

The three corpora that have been chosen are fairly small and therefore well suited for extensive experiments using a large number of settings. Some characteristics are presented in table 5.

The clue alignment approach was chosen as the basic methodology. A large number of settings have been explored and results using these settings are presented in the following sections.

The first series of experiments comprises word alignment attempts using so called ``basic clues'' referring to alignment clues that have been derived directly from the bitext using empirical techniques such as association measures and statistical alignment. Two types of basic clues will be distinguished: Base1 clues refers to word associations that have been found using simple association measures based on co-occurrence and string similarity. Base2 clues refers to basic clues, including the translation probabilities that are produced by statistical alignment.

The second series of experiments comprises alignment attempts, in which declarative clues have been added to basic clues from the section before. Three types of declarative clues have been applied as described in section 0.16.3. Different combinations of basic and declarative clues are investigated on the example of aligning one of the three test corpora, the Bellow corpus. Declarative clues have been added to both types of basic clues, base1 and base2.

The third series of experiments comprises word alignment attempts using dynamic clues as described in table 3 in section 0.16.3. Dynamic clues are learned from previously aligned corpora. Aligned training data are produced by the clue aligner using basic and declarative clues. The impact of the quality of training data on dynamic clues is investigated as well as the performance of the aligner when adding learned clues to basic and declarative clues. Experiments have been carried out for all three test corpora. Results are presented and discussed.

In the fourth series of alignment experiments the effect of different search strategies on the alignment performance is investigated. Seven strategies have been introduced in section 0.16.3. Their impact on aligning the Bellow corpus is discussed on the example of three different clue alignment settings.

Finally, the overall performance of the clue aligner is presented for each of the three test corpora in comparison with results which have been yielded by the Uppsala Word Aligner (UWA) and the statistical alignment tool GIZA++5.10. Six types of alignment settings are used to summarize the achieved results: Word alignment using base1 clues, base2 clues, base1 and base2 clues in combination with declarative clues, and, finally, a combination of the latter two with dynamic clues. For each type the best result in terms of F-values has been chosen.


The gold standards

Within the PLUG project, gold standards were created for a majority of sub-corpora of the project corpus. They were produced using a word sampling method and the PLUG Link Annotator (PLA) [MAA02]. 500 randomly sampled words were chosen for texts from three different genres (literary, technical, political documents) and two different language pairs (Swedish-English and Swedish-German). Gold standards include ``fuzzy'' links, ``null'' links, and complex MWU links. They were produced by several annotators using PLA and detailed guidelines [Mer99a]. The average agreement between annotators was measured for the creation of some of the gold standards and yielded about 92% [AMST99]. Figure 16 shows some examples of links in the gold standard of the English-Swedish Bellow corpus5.11.

Figure 16: Links from a gold standard.
\begin{figure}\footnotesize\begin{tabular}{rp{10cm}}
regular link: & call on $\r...
...ung man, och han ser ut att trivas
gott.\\
\end{tabular}\normalsize\end{figure}

The same gold standards have been used throughout the experiments. However, the clue aligner and UWA use different formats. Several links have been lost in the automatic conversion from the old format to the clue aligner format due to tokenization differences. However, the majority of links could be converted without difficulties. Table 5 summarizes the sub-corpora and their corresponding gold standards that have been used to evaluate the experiments presented here.


Table 5: Test corpora and gold standards.
corpus gold standard
name languages type size (words) size converted
Bellow English-Swedish literary 138,000 500 468
Scania95en Swedish-English technical 385,000 500 483
Scania95de Swedish-German technical 337,000 500 468



Basic clues

Basic clues are directly derived from parallel corpora without word aligned training material or external knowledge sources. The following basic clue types have been used in the experiments:

$Dice$:
The Dice coefficient for co-occurring words and multi-word units
(MWUs). The threshold for Dice scores was set to 0.2 and word pairs that co-occur only once were discarded.
$I$:
Point-wise mutual information. A threshold of 4 was used and word pairs with a co-occurrence frequency of 1 were removed.
$t-score$:
The t-test as association measure. The score threshold was set to 1 and the co-occurrence frequency threshold was set to 2 as for Dice and point-wise mutual information scores.
$LCSR$:
String similarity scores using the longest common sub-sequence ratio. The threshold was set to 0.4. Weighted string similarity measures have not been used.
$GIZA$:
Translation probabilities derived from GIZA++. The system was applied using its standard settings (training scheme: model 1 + HMM + model 3 + model 4).
$GIZA^{-1}$:
GIZA++ implements directional statistical alignment models.
$GIZA^{-1}$ refers to the translation probabilities produced when aligning a bitext in the reverse direction.

I have used a very simple method for normalizing and weighting scores. $Dice$ and $LCSR$ scores have been weighted by a factor of 0.05. $I$ and $t-score$ clues have been weighted by 0.005 and 0.01, respectively5.12. The $GIZA$ clues are considered to be more reliable than, for instance, $Dice$ clues because they are estimated with consideration of contextual dependencies. Therefore, a slightly higher weight of 0.1 is used for the $GIZA$ clue. The same applies to the $GIZA^{-1}$ clue. Table 6 illustrates several alignment experiments using different sets of basic clues for the three bitexts.


Table 6: Basic word alignment clues.
clues Bellow Scania95en Scania95de
base1 clues P R F P R F P R F
lcsr 38.0 29.7 33.3 46.4 29.1 35.8 60.4 32.6 42.4
I 64.0 40.2 49.4 52.9 63.4 57.7 53.8 55.5 54.7
t-score 52.3 58.4 55.2 49.9 64.6 56.3 54.4 65.1 59.3
dice 73.7 60.1 66.2 66.8 61.8 64.2 68.7 62.1 65.2
dice+I+t-score 58.3 62.3 60.2 53.1 65.9 58.8 59.4 66.7 62.8
dice+I+t-score+lcsr 59.3 68.0 63.3 55.1 68.5 61.1 62.3 70.6 66.2
base2 clues P R F P R F P R F
giza 78.0 76.5 77.3 81.9 82.0 82.0 78.9 79.0 79.0
giza+lcsr 73.7 78.1 75.8 74.7 79.4 77.0 75.8 77.6 76.7
giza+dice 77.8 79.3 78.5 71.3 75.1 73.1 75.5 79.2 77.3
giza+dice+lcsr 75.7 81.0 78.3 72.4 81.0 76.5 74.9 79.7 77.2
$giza^{-1}$ 79.6 80.3 79.9 81.7 76.4 79.0 79.8 79.7 79.7


A large difference can be observed between the alignment results using different sets of basic clues. Generally, a simple string matching clue is clearly the worst indicator for word alignment even for closely related languages such as Swedish, German and English. GIZA, on the other hand, is the best single clue of our basic clues. Alignments using these probabilities as alignment clues yield the highest precision among all settings for all three bitexts.

Combinations of co-occurrence clues such as the Dice clues, point-wise mutual information clues and t-score clues do not improve the overall performance when measured in terms of F-values. Such combinations generally yield higher recall at the expense of precision. However, for all three corpora, the loss in precision is so high that the overall performance drops significantly for the combination of the three measures in comparison with the single best measure among them (Dice). This can be explained by the fact that closely related clues such as co-occurrence clues violate the independence assumption made when combining them. However, a combination of the probabilistic GIZA clue and the Dice co-occurrence clue produces a slight improvement for the Bellow corpus. The loss in precision is compensated by a clear increase in recall. This is not the case for the technical Scania corpora. Here, the precision decreases considerably without noticeable improvements in recall. The recall rate goes even down significantly in case of the Swedish-English Scania bitext.

Surprisingly, the LCSR clue based on string similarity does not add much to the aligner. On the contrary, performance drops in most of the cases when adding string similarity clues. This was unexpected especially in the case of technical Swedish-German text. A small improvement can be observed when adding LCSR scores to Dice based alignment clues when aligning the Scania bitexts. However, the improvement is very little and does not justify the expensive task of computing string matching clues.

The overall picture of basic clues and their influence on alignment results is very consistent even though the bitexts represent two different genres and include two different language pairs. Therefore, we may conclude that probabilistic clues yield reliable results independent of the text genre and possibly even independent of the language pair under consideration. However, these claims are too strong to be drawn from our data as they include only three related languages and two types of text. The experiments prove that probabilistic clues can be successfully applied to a variety of languages and text types. Furthermore, they imply that string matching techniques do not work well for word alignment at least according to the present investigations. However, adjustments in the weighting scheme may change this behavior.


Declarative clues

Declarative clues are pre-defined relations as described in section 0.16.3, i.e. they are static and independent of the particular bitext. Three declarative clue sets for English and Swedish bitexts have been tested in the present study; part-of-speech clues in form of pairs of part-of-speech tags (pos), a second set of part-of-speech clues based on pairs of tag sequences that match multi-word units (pos-mwu), and a set of chunk label pairs (chunk). The pos-mwu clues are shown in figure 15. The other two clue sets are listed in appendix .4. Declarative clues can be combined with any other set of clues. In this study, the three sets have been combined with some of the basic clues which have been presented in the previous section. Naturally, declarative clues apply only to bitexts that include language pairs matching the defined clue features and which include necessary data such as part-of-speech tags and chunk labels. We concentrate our investigations on the Bellow corpus. The English part of the corpus has been tagged and parsed using the open-source library for natural language processing Grok [Bal02]. The Swedish portion has been tagged using the TnT tagger [Bra00,Meg02] and parsed using Beáta Megyesi`s context-free grammar parser for Swedish [Meg02]. All markup has been done automatically and no corrections have been made. Hence, errors are to be expected in the markup. The declarative clues have been weighted uniformly. Figure 17 illustrates the results produced when combining declarative clues with different sets of basic clues.

Figure 17: Adding declarative clues to base1 clues: pairs of part-of-speech tags (pos), pairs of part-of-speech sequences for multi-word units (pos-mwu), pairs of chunk labels (chunk).

The effect of declarative clues on the alignment performance depends very much on the set of basic clues to which they are added. Surprisingly, combinations with low performances in their basic settings (such as the combinations of three co-occurrence measures and string similarity clues (dice+I+tscore+lcsr)) outperform the basic alignment with the best performance (dice) when combined with declarative clues such as the pos clues. This indicates the importance of an appropriate weighting scheme, which has not been optimized in our experiments. The clue patterns seem to be in a better ``balance'' for the complex clue combinations in comparison with less complex ones. The declarative clues are very strong generalizations of relations between words and phrases and their influence on alignment is probably too strong in cases of less complex alignment clue sets. Similar behavior can be observed when combining declarative clues with base2 clues as shown in figure 18.

Figure 18: Adding declarative clues to base2 clues: pairs of part-of-speech tags (pos), pairs of part-of-speech sequences for multi-word units (pos-mwu), pairs of chunk labels (chunk).

It is also interesting to observe that the performance decreases in most cases where all available clues are combined together. The performance drop is due to lower precision values that are not compensated by their corresponding recall scores. Generally, the pos clues seem to add the most valuable link indications to the word alignment procedure. The performance increase is mainly due to a great improvement in terms of recall. This is the case for both configurations, base1 plus declarative clues and base2 plus declarative clues.


Dynamic clues

A large variety of dynamic clues can be learned from previous alignments (see section 0.16.3, table 3). Furthermore, they can be combined in many ways. In table 7, alignment results for several settings of dynamic clues are presented.


Table 7: Dynamic word alignment clues learned from two different basic clues, Dice and GIZA. Dynamic clues: POS trigram (p3), chunk trigram (c3), chunk trigram+POS (c3p), POS+POS coarse+chunk+position (eacl), wordform+POS (lexp), wordform (lex), POS+position (pp), POS trigram+position (p3p), and combinations of them (lex+p3p+c3p).
Bellow - dynamic clues (F-values)
basic clues p3 c3 c3p eacl lexp lex pp p3p lex+p3p+c3p
dice (66.2) 46.3 45.6 52.2 59.7 63.2 66.8 64.5 61.9 74.7
giza (77.3) 48.3 45.4 58.1 60.2 69.7 73.1 65.9 66.5 80.3


The quality of the training data is decisive for the performance of dynamic clues, i.e. the use of different basic clues for producing aligned material affects the set of dynamic clues that can be learned. The scores in table 7 illustrate the influence of training data on the performance of the word aligner using dynamic clues. Two different basic alignments have been used in order to learn several types of dynamic clues from the Bellow corpus, i.e. alignments using the Dice coefficient (Dice) and alignments using the GIZA clue. As shown in table 6 the Dice clues perform worse than GIZA clues. The difference amounts to more than 11% in terms of F-values (about 66% for Dice clues and more than 77% for the GIZA clues). In other words, both alignments represent very different training sets. As expected, dynamic clues learned from these two link sets reflect the difference in that way, that clues learned from the GIZA alignment generally perform better than clues learned from Dice alignments. However, the performance differences are relatively small when considering the large difference between the two training sets. This shows that valuable alignment clues can be learned even from small and noisy alignments as the ones produced by the the Dice clues. Precision seems to be the main factor for inferring generalized dynamic clues whereas recall is a secondary aspect.

Alignment clues that involve features such as parts-of-speech and phrase types (p3, c3, c3p, eacl, pp, p3p) behave very similar for both experimental settings because they generalize relations between words and phrases on a morphological and syntactic level rather than on the lexical level. These generalizations seem to be quite similar for both training sets if one considers the resulting word alignments. The experiments in table 7 also show that certain generalizations are better suited for word alignment than other ones. The combination of part-of-speech tags and relative positions (pp and p3p) seem to be the best clue patterns on a morphosyntactic level for our test corpus. Other morphosyntactic clues perform much worse on their own (p3, c3, c3p) indicating that they do not capture much useful information for word alignment. However, a combination of very simple morphosyntactic clues as in the eacl setting performs much better than similar clues on their own.

On the other hand, learned clues including lexical features (lex, lexp) perform very differently. For them, the amount of training data (i.e. recall of the base alignment) seems to be more decisive for their alignment quality than for the others. These lexical clues perform significantly better when learned from the giza-alignment than from the dice-alignment.

Finally, a combination of lexical and morphosyntactic clues (lex+p3p+c3p) clearly out-performs all the other dynamic clue settings.


Adding dynamic clues

Table 8 summarizes alignment results for different bitexts and clue sets. Basic clues (as indicated in the first column) were used to create aligned training data for learning dynamic clues listed in the remaining columns. Each column contains the performances of the word aligner when adding the corresponding dynamic clues to the set of basic clues. The top scores are indicated in bold style for each row in the table.


Table 8: Adding dynamic clues to different basic clues (Dice, LCSR, GIZA, all declarative clues (static)). Dynamic clues as in table 7.
Bellow - adding dynamic clues (F-values)
basic clues p3 c3 c3p eacl lexp lex pp p3p lex+p3p+c3p
dice (66.2) 67.3 68.8 68.5 72.7 66.7 67.9 74.1 71.9 74.2
dice+lcsr+static (67.6) 68.1 70.1 70.3 72.9 70.5 71.6 74.5 72.7 77.1
giza (77.3) 75.7 76.8 78.3 79.8 78.4 78.4 80.7 81.2 81.1
giza+lcsr+static (79.1) 76.4 78.3 79.1 79.3 78.7 78.1 82.2 81.9 81.3
Scania95en - adding dynamic clues (F-values)
basic clues p3 c3 c3p eacl lexp lex pp p3p lex+p3p+c3p
dice (64.2) 67.3 69.3 69.3 70.5 66.1 65.5 70.8 70.7 72.7
dice+lcsr+static (67.3) 68.1 69.5 68.9 68.5 68.9 70.0 71.6 71.6 72.6
giza (82.0) 77.8 79.8 80.3 76.7 82.5 82.9 81.4 82.3 81.3
giza+lcsr+static (79.3) 77.7 78.1 79.4 76.6 80.1 80.8 80.6 80.8 83.0
Scania95de - adding dynamic clues (F-values)
basic clues p3 c3 c3p eacl lexp lex pp p3p lex+p3p+c3p
dice (65.2) 67.8 61.1 68.5 73.1 67.2 67.0 68.9 68.8 70.3
dice+lcrs+static (72.8) 72.4 69.8 72.9 72.8 74.2 73.5 73.2 73.8 73.9
giza (79.0) 76.4 70.9 75.4 76.8 78.5 78.0 75.5 78.4 76.5
giza+lcsr+static (78.6) 77.7 76.3 76.7 76.5 79.1 78.9 77.4 79.3 78.2


In general, dynamic clues that do not perform well on their own do not perform well in combination with basic clues either. Another general observation is that basic clues with low alignment qualities are ``easier'' to improve than clues with high alignment qualities. Most of the dynamic clues that have been added to base1 clues (dice, dice+lcsr+static) boost the alignment performance for all three test corpora. On the other hand, only some dynamic clues help to improve alignments based on base2 clues (giza, giza+lcsr+static). However, certain improvements can be observed for all three test corpora when adding dynamic clues even to base2 clues. The only exception is the giza clue alignment of the Swedish-German Scania95 corpus, which did not improve in combination with any dynamic clue. In most cases, dynamic clues that perform well on their own produce the largest improvements when combined with basic and declarative clues. In several cases, the difference between the best score and the second best score is very low and a general conclusion about the quality of specific dynamic clues in comparison with others cannot be drawn from this result. In the present experiments the possibilities of incorporating learned clues for word alignment have been investigated. The results prove that it is possible to improve word alignment results using such clues. An optimized weighting scheme is needed to get the best out of each learned clue when combined with others.


Different search strategies

In section 0.16.3 (pp [*]) several alignment search strategies have been discussed. The clue aligner implements different strategies in order to test their impact on the alignment performance. In figure 19, the alignment results for three clue settings and the Bellow corpus are presented for the seven search strategies described earlier.

Figure 19: Different alignment search strategies. Corpus: Bellow, Swedish-English. Clue types: Giza++ lexicon (giza), Dice, POS + position (pp). Alignment strategies: directional ($L^{D}$), inverse directional ($L^{I}$), bi-directional union ($L^{\cup}$), bi-directional intersection ($L^{\cap}$), bi-directional refined ($L^{R}$), best-first ($L^{B}$) and competitive linking ($L^{C}$).

The figure illustrates the relation between precision and recall when using different algorithms. As expected, the intersection of directional alignment strategies yields the highest precision at the expense of recall, which is generally lower than for the other approaches. Contrary to the intersection, the union of directional links produces alignments with the highest recall values but lower precision than all other search algorithms. Directional alignment strategies generally yield lower F-values than other refined symmetric alignment strategies. The differences between the two alignment directions are surprisingly inconsistent. Competitive linking is somewhat in between the intersection approach and the two symmetric approaches, ``best-first'' and ``refined''. This could also be expected as competitive linking only allows non-overlapping one-to-one word links. The refined bi-directional alignment approach and the constrained best-first approach are almost identical in our examples with a more or less balanced relation between precision and recall.

According to figure 19, alignment strategies can be chosen to suit particular needs. Concluding from our experiments, restrictive methods like the intersection approach or competitive linking should be chosen if results with high precision are required (which is mostly found among one-to-one word links). This is, for example, the case in automatic extraction of bilingual lexicons where noise should be avoided as much as possible. Other strategies should be chosen for applications, which require a comprehensive coverage as, for example, machine translation.


UWA, clue alignment and SMT

In the previous sections, the clue alignment approach has been explored using a variety of different settings and resources. One of the main advantages of this approach is its modularity, which makes it possible to integrate many different alignment resources. In this section we will look at the clue alignment results in comparison with other alignment approaches, i.e. the greedy word aligner implemented in the Uppsala Word Aligner (UWA) and the statistical alignment approach implemented in the GIZA++ toolbox. We will look at the three test corpora which have been used throughout our investigations separately.

We will use the following name conventions to refer to different settings: The best result for a clue alignment using basic clues only, except GIZA clues, are denoted base1. The best result of an alignment with basic clues including the GIZA clues are called base2. Alignments including declarative clues contain the string ``+stat'' in their labels and alignments including dynamic clues include the string ``+dyn''. Identical names do not necessarily refer to the same setting for each of the three corpora. There are usually several settings that fit into the given category. Here, we will always use the best of our results for the given category and each specific corpus, which is not necessarily taken from results presented in previous sections. Note that all clues are weighted as described in the previous sections. An optimization of weights for the combination of clues has not been carried out. Alignment results using GIZA++ are denoted with GIZA++ and refer to standard settings of the GIZA++ system (IBM model 1 + HMM + IBM model 3 + IBM model 4, five iterations for each model). GIZA++ inverse refers to alignments in the opposite translation direction. The best result using the Uppsala Word Aligner is denoted UWA.

Figure 20 illustrates clue alignment results for different settings together with alignment performances of UWA and GIZA++ for the English-Swedish Bellow corpus.

Figure 20: Comparison of alignment settings for the Bellow corpus.

The clue aligner clearly outperforms the simple UWA alignment approach even for basic setups. The main improvement can be found in recall, which is much higher than for the restrictive UWA approach. UWA yields high scores in terms of precision (about 86%) but low recall scores (only 52% in our example) resulting in an F-value of about 65% as indicated in figure 20. The precision of the clue aligner is below 81% for all settings but with corresponding recall values way above 52%. The clue alignment results are generally below the GIZA++ alignments as long as GIZA clues are not included in the alignment procedure. However, using the GIZA++ dictionaries implies a major improvement of the aligner performance and, combined with declarative and dynamic clues, it goes beyond the GIZA++ alignment result. The top result with an F-value of over 83.9% is clearly an improvement compared to the other approaches including GIZA++.

Figure 21 illustrates alignment results for the Swedish-English Scania 1995 corpus.

Figure 21: Comparison of alignment settings for the Swedish-English Scania 1995 corpus.

GIZA++ performs very well on the Swedish-English Scania corpus. Both, precision and recall are high above 80% (86.15% in precision and 85.85% in recall). Swedish and English are two very closely related languages and technical manuals such as the documents in the Scania corpus are very consistent in their terminology. Statistical alignment approaches profit from consistency as it can be seen in the excellent alignment that is produced by GIZA++. Certainly, the alignment direction suits the statistical aligner. GIZA++ links Swedish words to each English word in the corpus, which makes it possible to align several English words to the same Swedish word. The GIZA++ alignment in the other direction does not reach the same performance (only about 80.21%). This is consistent with the results of the alignments of the Bellow corpus. The highest performance there is also obtained when aligning Swedish words to their English counterparts. However, the performance difference between the two GIZA++ alignments is even larger for the Scania corpus than for the Bellow corpus. Similarly to the alignments of the Bellow corpus, GIZA++ outperforms the clue alignment approach when using base1 clues only. Using the GIZA dictionary (base2 clues) improves the clue alignment significantly. However, it does not reach the performance of the statistical alignment even though it touches its result (85.26% for a combination of base2, static and dynamic clues). This may be explained by the importance of distortion and fertility parameters estimated by GIZA++ but not used in the clue aligner. Scania documents contain many short sentences and sentence fragments with a large number of compound terms. Many alignment problems can be solved with the help of positional dependencies and relations between compositional compounds (in Swedish) and non-compositional compounds (in English).

Finally, in figure 22 alignment results for the Swedish-German Scania 1995 corpus are illustrated.

Figure 22: Comparison of alignment settings for the Swedish-German Scania 1995 corpus.

Similarly to the other two bitexts, the clue alignment performance increases when adding clues to the process and outperforms the Uppsala Word Aligner. Basic alignments using association measures (base1 clues) are worse than the alignments produced by GIZA++. However, adding the GIZA clue to the aligner increases the performance beyond the result of the statistical alignment. An interesting fact is that the base2 clue alignment performs better than both GIZA++ alignments. The difference between the two directional alignments performed by GIZA++ is minimal. However, symmetric alignment approaches, such as the clue alignment strategy used here, outperform directional strategies. The relation between Swedish and German is different from the relation between Swedish and English. Compounding is common in both languages (Swedish and German) and, therefore, the impact of the alignment direction is not very great. The quality of the statistical alignment is much lower than for the Swedish-English bitext, probably due to syntactic and morphological differences between Swedish and German, which are larger than between Swedish and English. It would be interesting to investigate these differences in detail by means of alignment examples.

Summary and discussion

This chapter includes one of the main contributions of the thesis. Word alignment techniques have been discussed in detail and their applications to different data collections have been investigated. Two word alignment systems have been presented, the Uppsala Word Aligner (UWA), with its iterative alignment strategy, and the matrix-based Clue Aligner. Both aligners have a modular design and integrate several alignment resources and techniques. UWA (section 0.16.2) implements a knowledge-poor approach using association measures based on co-occurrence and string similarity. It combines empirical data with alignment heuristics in an iterative linking procedure. Multi-word units can be handled in two ways: They can be identified either by means of n-gram statistics prior to the word linking step or dynamically within the linking procedure. The Clue Aligner (section 0.16.3) is a modular word alignment tool that supports the probabilistic combination of various resources. It is based on so-called alignment clues, which indicate translational relations between words and phrases according to their associated features. These features may be any set of linguistic annotations, word position, contextual information, and the words and phrases themselves. Alignment clues can be derived from association statistics (as in UWA), from statistical alignment models, and from aligned training data (dynamic clues). They can also be pre-defined (declarative clues). Dynamic clues can be learned even from noisy training data such as automatically word-aligned bitexts.

Both systems mentioned above have been evaluated using manually produced reference alignments (so-called gold standards). Fine-grained metrics for precision and recall have been defined for this purpose (section 0.17). UWA and the Clue Aligner have been applied to three different bitexts from two different genres (technical and literary texts). They include two language pairs, Swedish-English (in both translation directions) and Swedish-German (only in that direction). Different settings of the Clue Aligner have been tested and results have been compared to UWA alignments and to statistical alignments using the external toolkit GIZA++ (section 0.18). The experiments have demonstrated that the Clue Aligner is superior to UWA for all three test corpora. It has also been shown that the clue alignment approach can be used to improve statistical alignments produced by, for instance, GIZA++. The performance of the Clue Aligner depends on the quality of the alignment clues and on the weighting scheme used for the combination of clues. The optimization of the clue aligner settings and weights remains to be investigated.


next up previous contents index
Next: Applications of extracted bilingual Up: thesis_html Previous: Development of tools   Contents   Index
Jörg Tiedemann 2004-01-03