Extracting multilingual lexicons from parallel corpora
Dan Tufiş and Ana-Maria
Barbu
RACAI-Romanian Academy Center
for Artificial Intelligence
13, "13 Septembrie",
RO-74311, Bucharest, 5, Romania
{tufis, abarbu}@racai.ro
Key
words:
word-alignment, parallel corpora, translation lexicons
1
Introduction
Extracting bilingual dictionaries from corpora can be seen as a very
fine-grained alignment process, where the aligned units are not paragraphs or
sentences but words and phrases. Most approaches to this problem rely on
statistical means to build translation lexicons from bilingual texts, roughly
falling into two categories: the hypotheses testing approach and the
estimating approach. There are pros and cons for each type of approach, some of
them discussed in (Hiemstra 1997).
Our method hardly fits into one of these two categories, but is closer
in spirit to hypotheses testing approach, by first generating a list of
translation equivalence candidates and then extracting, in several steps, the
most probable translation-equivalence pairs. The candidate list is constructed
from the aligned translation units (TU).
2
A baseline and the iterative algorithm
The input data for the extraction algorithm is a sentence-aligned
parallel text (bi-text) with each word in both parts of the bi-text being
lemmatized and tagged. The algorithm is sensitive to tagging, lemmatization and
alignment errors, but they will affect only the recall not the precision. We showed that using a tiered-tagging
approach with combined language models (Tufiş, 2000) and Tufis et al, 2000) one
could construct, from clean training corpora, high accuracy taggers (close to
99%). For many languages lemmatization, following an accurate POS tagging, is
also very accurate and sentence aligners with almost perfect job (at least for
some types of texts) are not rare. Therefore, the prerequisites of our
algorithm are not very difficult to meet.
Based on the alignment, the first step is to compute a list of
translation equivalence candidates
(TEC). This list (TECL) contains several sub-lists, one for each POS
considered in the extraction procedure. The TECs are generated by the Cartesian
product of the sets of tokens (of the same POS) in the two parts of each TU.
Finally, the pairs generated from all TUs are uniquely sorted and attached with
the number of occurrences of the respective association. The core of the
baseline algorithm is a selection procedure, which prunes the TECs considered
to have occurred just by chance. This hypothesis may be tested by different
statistical tests (we used various statistical measures such as chi-square,
dice, mutual information, log-likelihood and left-Fisher). One could use also a
minimal number of occurrences for <TS TT> (usually,
as in our case, 3).
Our iterative
algorithm is also very simple but significantly faster than the baseline. The
algorithm gets as input the aligned parallel corpus and the maximum number of
iterations. At each iteration step, the pairs that pass the selection (see below) will be removed from TECL
so that this list is shortened after each step and eventually may be emptied. From TECL, for each
POS is constructed a contingency table (TBLk) as showed in Table 1. The rows of
the table are indexed by the distinct source lemmas and the columns are indexed
by the distinct target lemmas (of the same POS).
|
|
TT1 |
… |
TTn |
|
|
TS1 |
n11 |
… |
n1n |
n1* |
|
… |
… |
… |
… |
… |
|
TSm |
nm1 |
… |
nmn |
nm* |
|
|
n*1 |
… |
n*n |
n** |
Table 1: TBLk - Contingency table with counts for
TECs at step K
Each cell (i,j) contains the number of occurrences of the <TSi,
TTj> pair. The marginal counts are ni*=
, n*j=
and n**=
.
The equation (EQ1):
expresses the
selection condition and represents the key idea of the extraction algorithm. In
order to select, at step k, the pair <TSi, TTj> as a
translation equivalence pair (TEP), the number of associations TSi has
with TTj must be higher than (or at least equal to) with any other TTp
(p¹j) and vice-versa.
If TSi is translated in more than one way, the rest of translations
will be found in subsequent steps (if frequent enough).
3
Experiments
We conducted experiments on one of the few publicly available
multilingual aligned corpora, namely the "1984" multilingual corpus
(Dimitrova et all 1998) containing 6 translations of the English original. This
corpus was developed within the Multext-East project, published on a CD-ROM
(Erjavec et all 1998) and recently improved within the CONCEDE project (to be
soon released on the CONCEDE's homepage: www.itri.brighton.ac.uk/projects/concede/).
Each monolingual part of the corpus (Bulgarian, Czech, Estonian, Hungarian,
Romanian and Slovene) was tokenised, lemmatised, tagged and sentence aligned to
the English hub.
|
Language |
Bulgarian |
Czech |
English |
Estonian |
Hungarian |
Romanian |
Slovene |
|
No. of distinct words |
15093 |
17659 |
9192 |
16811 |
19250 |
14023 |
16402 |
|
No. of lemmas |
8225 |
8677 |
6871 |
8403 |
9729 |
6626 |
7157 |
|
No. of >2-occ lemmas* |
3350 |
3329 |
2916 |
2876 |
3294 |
3052 |
3189 |
Table 2:The lemmatized monolingual "1984"
overview
* the number of lemmas does not include interjections, particles,
residuals
The number of lemmas in each monolingual part of the multilingual corpus
as well as the number of lemmas that occurred more than twice are shown in
Table 2. For validation purposes we set the step limit of the algorithm to 4.
The evaluation was fully done for Estonian, Hungarian and Romanian and
partially for Slovene (the first step was fully evaluated while from the rest were
evaluated randomly selected pairs). The algorithm provides as output a list of
TEPs and up to 10 examples of translation units where the respective pair
appeared. The figures in Table 3 show the results and the evaluation. The
precision was computed as usual but the recall (Rec*) was computed as the
number of correct TEPs divided by the number of lemmas in the source language
with more than 3 occurrences.
As one can see from the figures in Table 4, the precision is higher than
98% for Romanian and Slovene, almost 97% for Hungarian and more than 96% for
Estonian. The recall (our defined Rec*) ranges from 50.92% (Slovene) to 63.90%
(Estonian). We run the extractor for the Ro-En bi-text without imposing a step
limit. The program stopped after 25 steps with a number of 2765 extracted
pairs, out of which 113 were wrong. The precision decreased to 95,91%, but the
recall significantly improved: 86,89%.
|
Language |
Bg-En Prec/Rec* |
Cz-En Prec/Rec* |
Et-En Prec/Rec* |
Hu-En rec/Rec* |
Ro-En rec/Rec* |
Sl-En Prec/Rec* |
|
Step 1 |
1336 NA/NA |
1399 NA/NA |
1216 99.50/42.07 |
1299 98.61/38.88 |
1394 99.71/42.74 |
1177 99.91/36.87 |
|
Step 2 |
1741 NA/NA |
1886 NA/NA |
1617 97.89/55.04 |
1737 97.63/51.48 |
1867 99.30/52.23 |
1489 99.52/46.47 |
|
Step 3 |
1896 NA/NA |
2085 NA/NA |
1807 96.63/60.84 |
1863 96.99/54.85 |
2067 99.03/54.84 |
1589 99.06/49.63 |
|
Step 4 |
1986 NA/NA |
2188 NA/NA |
1911 96.18/63.90 |
1935 96.89/56.92 |
2182 98.57/56.36 |
1646 98.66/50.92 |
Table 3: The results after 4 iteration steps and
partial evaluation
In an initial version of this algorithm we used a chi-square test (as in
the baseline algorithm) to check the selected TEPs. Since the selection
condition (EQ1) is very powerful, the vast majority of the selected TEPs passed
the chi-square test while many pairs that used to pass the chi-square threshold
did not pass the condition (EQ1) and therefore we eliminated the time consuming
statistical tests.
The extraction program is written in Perl and runs under practically any platform (Perl implementations exists not only for UNIX/LINUX but also for Windows and MACOS). The Table 4 shows the running time (Pentium III/600Mhz with 96 MB RAM) for each bi-text in the "1984" corpus.
|
Language |
Bg-En |
Cz-En |
Et-En |
Hu-En |
Ro-En |
Si-En |
|
|
4 steps |
25 steps |
||||||
|
Time (sec) |
181 |
148 |
139 |
220 |
183 |
415 |
157 |
Table 4: Extraction time (in seconds) for each of the
bilingual lexicons
The 6 bilingual lexicons as well as a 7-languages lexicon are available
at www.racai.ro/~tufis.
A quite similar approach to ours (also Perl-based) is presented in
(Ahrenberg et all 1998) and (Ahrenberg et all 2000). The languages considered
by their experiments are English and Swedish. For a novel of about half length
of Orwell's "1984" their algorithm needed 55 minutes on a Ultrasparc1
with 320 MB RAM and the best results reported are 96.7% precision and 54.6%
recall. For a computer manual containing about 45% more token than our corpus,
their algorithm needed 4,5 hours with the best results being 85,6% precision
and 67,1% recall. Unlike us, they don’t rely on tagging and
lemmatisation, although they use a “morphology” module that
achieves stemming and grouping of inflectional variants. This strategy makes
possible to link low-frequency source expressions belonging to the same suffix
paradigm. The obvious advantage of not using POS categorisation is that their
approach would be able to identify TEPs where the POS of the source token is
different from the POS of the target token. The disadvantage is that search
space is extremely large. An explanation of the much better response time in
our case, besides not using statistical tests (they use t-test), is that the
search space in our case is probably several orders of magnitude smaller.
4 Conclusions and further work
We presented a simple but very effective algorithm for extracting
bilingual lexicons, based on a word to word model of translation equivalence.
In case a language specific tokeniser is able to recognise and "pack"
the compounds in a pre-processing phase, the 1:1 model will obviously discover
multi-words translation equivalencies. The MULTEXT tokenizer for instance
allows for recognition of generic multi-word expressions (dates, literally
expressed numbers) or specific ones based on external resources containing
lists of compounds, proper names and idiomatic expressions. If the compounds
cannot be dealt with in the tokenizing pre-processing phase one may consider
either extending the bilingual lexicon extractor's model to an explicit N:M
paradigm or consider using monolingual tools for recognising the compounds in
both languages. We are currently considering both options. For the first one we
started the implementation of a new language independent tokeniser, including
statistical collocation recognition (we use S. Banerjee's and T. Pedersen's BSP - Bigram
Statistical Package). For the second option, we are carrying out some
preliminary experiments with a version of the program presented in this paper.
Conceptually, the modified version of the program can be seen as receiving the
same text as source and target input file with all the sentence alignments
being 1:1.
Ahrenberg L, Andersson M,
Merkel M, 1998. A Simple Hybrid Aligner
for Generating Lexical Correspondences in Parallel Texts. In Proceedings of
COLING'98, Montreal, pp 29-35.
Ahrenberg L, Andersson M,
Merkel M, 2000. A knowledge-lite approach
to word alignment, in Véronis J (ed), Parallel
Text Processing. Text, Speech and
Language Technology Series, Kluwer Academic Publishers, pp 97-116.
Dimitrova L, Erjavec T, Ide N, Kaalep H, Petkevic
V, Tufiş D, 1998. Multext-East: Parallel and Comparable
Corpora and Lexicons for Six Central and East European Languages in Proceedings
of COLING’98, Montreal, pp
315-319.
Erjavec T, Lawson A, Romary L, 1998. East Meet West: A Compendium of Multilingual
Resources. TELRI-MULTEXT EAST CD-ROM,
1998, ISBN: 3-922641-46-6.
Hiemstra
D, 1997. Deriving a bilingual lexicon for cross language information retrieval.
In Proceedings of Gronics 21-26
Tufiş
D, 2000. Using a Large Set of Eagles-compliant Morpho-Syntactic Descriptors as
a Tagset for Probabilistic Tagging. In Proceedings of LREC2000, Athens, pp.1105-1112.
Tufiş
D, Dienes P, Oravecz C,
Váradi T, 2000. Principled Hidden Tagset Design for Tiered Tagging of
Hungarian. In Proceedings of LREC’2000, Athens, 1421-1426.
Word
count
(excluding tables and references): 1499