Next: The Scania Project Up: Automatical Lexicon Extraction from Previous: Methods of Bilingual Lexicon


Lexicon Architecture

Norms, Guidelines and Recommendations

An important question is how to store lexical information. The format should be standardized and applicable for different languages. The possibilities of describing information should be extensive and the structure should be extendable if necessary. The storage of word phrases and multiple links to target language entries should be possible.

Several definitions of lexicon formats exist. At the Istituto di Linguistica Computazionale in Pisa, the Expert Advisory Group on Language Engineering Standards (EAGLES) was founded to collect and evaluate existing definitions and to develop standards for very-large-scale language resources and the corresponding processing tools. Several working groups cover different fields of research, among them a group for computational lexica. The resulting guidelines contain recommendations for language engineering and are available from [EAG96].

In the following sections some lexicon architectures are described briefly.

MULTILEX

The MULTILEX lexicon architecture was designed as a multilingual model. Its architecture is therefore language-independent, but flexible enough to be adapted to the special needs of specific languages. It is described as an SGML Document Type Definition (DTD) to make the architecture portable.

The header of the lexicon contains the language name and, optionally, language-specific information and rule specifications. Language-specific information can be, for instance, the gender definitions of a particular language. Rule specifications describe how stem forms are modified in order to build inflected forms such as plurals.

The body of the lexicon contains the actual entries. These entries are stored in two nodes.

The first node, called the Graphic-Phonologic-Morphological Unit (GPMU), describes the orthographic (standard form), phonetic, and morphological characteristics of a specific language token. An extension for complex word structures makes it possible to store word phrases and complex words. The orthographic description serves as the key word. This means that a new GPMU is created for every inflection with a different spelling.

The second node is called the Lexical Unit (LU) and contains syntactic, semantic, and bilingual transfer information. There are two different approaches to semantic description. One approach uses explanations from existing dictionaries, regimented logically; the other uses metalinguistic constructs based on a set of semantic features [MM93]. Syntactic information is related to single meanings. Specific syntactic structures are described in syntactic blocks. Every block refers to one semantic meaning and may contain attributes with multiple values as well as specific attributes, e.g. to cover the possibility of changing the word order [MM93]. The bilingual transfer information structure contains a pointer to an identifier or a list of identifiers (LUs or GPMUs), the name of the established relation, information about reversibility, and the name of the reverse link if it exists.

GPMUs and LUs are connected by links. Several LUs can be linked to one GPMU, and one LU can be linked to several GPMUs.
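The many-to-many linking between the two node types can be sketched in a few lines of Python. This is only an illustration of the relation described above, not the MULTILEX SGML DTD: the class fields, the identifiers, the `link` helper, and the example entries are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class GPMU:
    """One spelling of a token (hypothetical, heavily reduced field set)."""
    ident: str
    orthography: str
    lu_ids: Set[str] = field(default_factory=set)


@dataclass
class LU:
    """One meaning with its syntactic/semantic/transfer data (reduced)."""
    ident: str
    gloss: str
    gpmu_ids: Set[str] = field(default_factory=set)


def link(gpmu: GPMU, lu: LU) -> None:
    """Record the link on both sides so it can be followed in either direction."""
    gpmu.lu_ids.add(lu.ident)
    lu.gpmu_ids.add(gpmu.ident)


# Two spellings (singular and plural) sharing two meanings.
bank_sg = GPMU("g1", "bank")
bank_pl = GPMU("g2", "banks")
finance = LU("l1", "financial institution")
river = LU("l2", "side of a river")

for g in (bank_sg, bank_pl):
    for l in (finance, river):
        link(g, l)
```

Because every differently spelled inflection gets its own GPMU, the plural form here is a separate node that points to the same two LUs as the singular.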

For further information see [MM93].

GENELEX

GENELEX (GENEric LEXicon) is an abstraction from a French monolingual dictionary model. This model originated in separate developments of monolingual dictionaries for several European languages. In the next step, the GENELEX initiative will look for general convergences in order to define the multilingual dimension [MM93]. The conceptual model is expressed by an entity-relationship formalism and an SGML Document Type Definition (DTD).

The GENELEX methodology was developed to cover the maximum amount of non-redundant linguistic information and to be portable. Derivable information is not stored [MM93].

The architecture is divided into three layers.

The first layer is the morphological layer (called morphological unit - UM) and covers orthographical and phonological information. It contains written and phonetic forms, grammatical categories (part of speech), morphological features (gender, number, person...), information about inflectional behavior, derivations (basis, suffix ...), abbreviations (acronyms, initials ...), usage values (rare, colloquial ...), and information about etymology.

The second layer is the syntactic layer (called syntactic unit - USYN). This layer stores information about syntactic usage (positions, transformational possibilities, optionality, compound characteristics ...).

The third layer is the semantic layer (called semantic unit - USEM). In this layer semantic features are described in the form of a set of values or by means of cross-references (semantic derivation, synonyms, collocation preferences).

Relations between units connect these layers. A morphological unit can be related to zero or more syntactic units, and a syntactic unit can be related to zero or more semantic units. Conversely, every syntactic unit has to be related to exactly one morphological unit, and a semantic unit can be related to one or more syntactic units, all of which have to be related to the same morphological unit.
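The cardinality constraints between the three layers can be made concrete with a small consistency check. The following Python sketch only encodes the relations stated above; the class names, the identifiers, and the function `usem_is_consistent` are assumptions for illustration and are not taken from the GENELEX model itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class USyn:
    """Syntactic unit: must point to exactly one morphological unit."""
    ident: str
    um_id: str


@dataclass
class USem:
    """Semantic unit: must point to one or more syntactic units."""
    ident: str
    usyn_ids: List[str]


def usem_is_consistent(usem: USem, usyns: Dict[str, USyn]) -> bool:
    """A semantic unit needs at least one syntactic unit, and all of its
    syntactic units must belong to the same morphological unit."""
    if not usem.usyn_ids:
        return False
    morph_ids = {usyns[i].um_id for i in usem.usyn_ids}
    return len(morph_ids) == 1


# Hypothetical units: s1 and s2 share one morphological unit, s3 does not.
usyns = {
    "s1": USyn("s1", um_id="m1"),
    "s2": USyn("s2", um_id="m1"),
    "s3": USyn("s3", um_id="m2"),
}
good = USem("sem1", ["s1", "s2"])
bad = USem("sem2", ["s1", "s3"])
empty = USem("sem3", [])
```

Storing the morphological unit as a single field on the syntactic unit enforces the "exactly one" constraint by construction, so only the semantic layer needs an explicit check.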

For further information see [MM93].

COMLEX

COMLEX Syntax is a monolingual English dictionary which was developed at New York University in the framework of the Proteus Project. The lexical entries are organized as feature structures in a Lisp notation. This notation can be transformed into other forms such as Prolog or SGML-marked text. Each entry is a list that consists of a type symbol followed by zero or more keyword-value pairs. Keywords start with a colon, and their values can be atoms, strings, lists of strings, feature-value lists, or lists of feature-value lists. Keywords are defined for identifying the orthography, inflectional forms, features, subcategorizations, and other information.
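To illustrate the shape of such an entry, the following Python sketch reads a Lisp-style list and splits it into a type symbol and its keyword-value pairs. The sample entry and its keyword names are invented for this example and are not copied from the COMLEX distribution; the reader is a minimal one that, as a simplification, maps both atoms and strings to Python strings.

```python
import re

# Tokens: parentheses, double-quoted strings, or bare atoms/keywords.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def read_sexpr(text):
    """Parse one parenthesized expression into nested Python lists."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(parse())
            pos += 1  # skip the closing parenthesis
            return items
        # Strip the quotes from strings; atoms are returned unchanged.
        return tok[1:-1] if tok.startswith('"') else tok

    return parse()


def to_entry(sexpr):
    """Split a parsed list into (type symbol, {keyword: value})."""
    type_symbol, rest = sexpr[0], sexpr[1:]
    return type_symbol, dict(zip(rest[0::2], rest[1::2]))


# Hypothetical entry in the notation described above.
sample = '(NOUN :ORTH "ice" :PLURAL "ices" :FEATURES ((COUNTABLE)))'
entry_type, entry_fields = to_entry(read_sexpr(sample))
```

Once an entry is available as a type symbol plus a dictionary, converting it to another representation such as Prolog terms or SGML markup is a matter of walking this structure.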

A second version of COMLEX Syntax, with improved quality and coverage, was completed in August 1995. Lisp utilities for using COMLEX are also available.

For more detailed information see [GMM94].

MULTRA Lexicon Architecture

MULTRA is a multilingual computer system for translating and writing that was developed at the Department of Linguistics at Uppsala University [ASH93]. It contains a language analyser (UCP), a transfer unit, and a language generator.

Lexicon information in MULTRA is represented by feature structures [Bes93]. These structures are characterized by a list of attribute-value pairs. The attributes represent the features, and their values can be atoms or feature structures. Different features may also share a value. With this formalism, lexicon entries as well as transfer rules can be described using different attribute-value pairs.
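A feature structure of this kind maps naturally onto nested dictionaries, where a shared value is simply one object reachable through two attributes. The Python sketch below is only an analogy under that assumption; the attribute names (`cat`, `agr`, `num`, ...) are invented and do not reflect MULTRA's actual attribute inventory.

```python
def get_path(fs, path):
    """Follow a sequence of attributes through a nested feature structure."""
    for attr in path:
        fs = fs[attr]
    return fs


# The agreement value is one object that two features point to,
# which models "different features share a value".
agreement = {"num": "sg"}

noun_phrase = {
    "cat": "np",
    "det": {"agr": agreement},
    "head": {"lex": "is", "agr": agreement},
}

# Changing the shared value through one path is visible through the other.
get_path(noun_phrase, ["det", "agr"])["num"] = "pl"
head_number = get_path(noun_phrase, ["head", "agr", "num"])
```

Sharing by reference rather than by copy is the design choice that matters here: two equal but separate values would not stay in step when one of them is updated.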

A transfer rule in MULTRA contains four parts: a label, a source record, a corresponding target record, and a transfer record [Bes93]. The label is a unique name for the rule. The source record contains attribute-value equations which describe the source feature structure, and the target record does the same for the target feature structure. The transfer record describes relations between substructures of the source and the target feature structure.
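The four parts of a rule can be sketched as a small data structure together with a function that applies it. This is an illustration under stated assumptions, not a reproduction of MULTRA's rule syntax or its transfer algorithm: the path-based encoding of the equations, the helpers `matches`, `apply_rule` and `reversed_rule`, and the attribute names are all hypothetical. The example pairs Swedish 'is' with English 'ice'.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


@dataclass
class TransferRule:
    label: str                          # unique rule name
    source: Dict[Path, str]             # equations on the source structure
    target: Dict[Path, str]             # equations on the target structure
    transfer: List[Tuple[Path, Path]]   # (source path, target path) relations


def lookup(fs, path):
    for attr in path:
        fs = fs[attr]
    return fs


def set_path(fs, path, value):
    for attr in path[:-1]:
        fs = fs.setdefault(attr, {})
    fs[path[-1]] = value


def matches(equations, fs):
    """True if every attribute-value equation holds in the structure."""
    try:
        return all(lookup(fs, p) == v for p, v in equations.items())
    except (KeyError, TypeError):
        return False


def apply_rule(rule, source_fs):
    """Build the target structure, or return None if the source does not match."""
    if not matches(rule.source, source_fs):
        return None
    target_fs = {}
    for path, value in rule.target.items():
        set_path(target_fs, path, value)
    for src_path, tgt_path in rule.transfer:
        set_path(target_fs, tgt_path, lookup(source_fs, src_path))
    return target_fs


def reversed_rule(rule):
    """Swap source and target so the same rule works in the other direction."""
    flipped = [(t, s) for s, t in rule.transfer]
    return TransferRule(rule.label, rule.target, rule.source, flipped)


rule = TransferRule(
    label="Is",
    source={("lex",): "is", ("cat",): "noun"},
    target={("lex",): "ice", ("cat",): "noun"},
    transfer=[(("num",), ("num",))],
)

english = apply_rule(rule, {"lex": "is", "cat": "noun", "num": "sg"})
swedish = apply_rule(reversed_rule(rule), {"lex": "ice", "cat": "noun", "num": "pl"})
```

Because the source and target records have the same form, reversing a rule in this sketch amounts to swapping the two records and flipping each pair in the transfer record.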

With this formalism, lexical relations as well as syntactic transfer rules can be described. All rules are reversible. Figures 3.1 and 3.2 show two examples taken from [Bes93].


  
Figure 3.1: Example for a MULTRA Transfer Rule
[Figure body not recoverable. Surviving fragments: a rule label beginning "Gärna-lik..." and a description ending "...orm of 'like' plus the infinitive verb complement."]


  
Figure 3.2: Example for a MULTRA Lexical Rule
[Figure body not recoverable. Surviving fragments: the rule label "Is", a "Source" record, and a description ending "...ce between the Swedish word 'is' and the English word 'ice'."]


Jörg Tiedemann
2000-09-07