In work package 1, a bilingual, sentence-aligned corpus is to be compiled. The corpus shall include the following language pairs: Swedish-English (sv<->en), Swedish-German (sv<->de), and Swedish-Italian (sv<->it).
The tables below describe some characteristics of the current version of the PLUG corpus.
PLUG corpus files
-----------------
all files are encoded in XML using the plugXML.dtd
complete corpus: 2188079 words
sv<->en: 1169165 words
sv<->de: 525278 words
sv<->it: 493636 words
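The three language-pair totals above should add up to the figure given for the complete corpus. A minimal cross-check using POSIX shell arithmetic (the variable name `total` is just an illustration):

```shell
# Language-pair totals as listed above (sv<->en, sv<->de, sv<->it);
# their sum should match the "complete corpus" figure.
total=$((1169165 + 525278 + 493636))
echo "$total"
```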
filename        origin
------------------------------------------------------------------
ensvtacc.xml    manual texts for MS Access
ensvtxl.xml     manual texts for MS Excel
sventscan.xml   collection of truck maintenance manuals from Scania
ensvtscan.xml   collection of truck maintenance manuals from Scania
svdetscan.xml   collection of truck maintenance manuals from Scania
svittscan.xml   collection of truck maintenance manuals from Scania
ensvfbell.xml   'Viking P. for the Jewish Publication Society
                of America' by Saul Bellow
ensvfgord.xml   'A Guest of Honour' by Nadine Gordimer
svitfbio.xml    'En biodlares död' by Lars Gustafsson
svitfkak.xml    'En kakelsättares eftermiddag' by Lars Gustafsson
svenpeu.xml     collection of EU texts (taken from the PEDANT corpus)
svdepeu.xml     collection of EU texts (taken from the PEDANT corpus)
svitpfut.xml    'Future noise policy - European Commission Green Paper'
                EU text (taken from the PEDANT corpus)
svenprf.xml     declarations from the Swedish government
svdeprf.xml     declarations from the Swedish government
political texts: 410408 words
filename        size in words   size in bytes   languages
------------------------------------------------------------------
svenpeu.xml            186111         1584273   sv->en
svdepeu.xml            180312         1657251   sv->de
svenprf.xml              8011           86180   sv->en
svdeprf.xml              7778           90486   sv->de
svitpfut.xml            28196          255892   sv->it
technical texts: 1353740 words
filename        size in words   size in bytes   languages
------------------------------------------------------------------
ensvtacc.xml           163173         1606582   en->sv
ensvtxl.xml            124961         1330559   en->sv
sventscan.xml          187830         3370698   sv->en
ensvtscan.xml          197459         3508625   en->sv
svdetscan.xml          337188         6052332   sv->de
svittscan.xml          343129         7139947   sv->it
fiction: 423931 words
filename        size in words   size in bytes   languages
------------------------------------------------------------------
ensvfbell.xml          132066         1169471   en->sv
ensvfgord.xml          169554         1423741   en->sv
svitfbio.xml            55882          501451   sv->it
svitfkak.xml            66429          629647   sv->it
(all word counts were computed using the following command:
    cat {files} | sed 's/<[^>]*>//g' | wc -w
)
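To get one figure per file rather than a single total, the same pipeline can be wrapped in a loop. This is only a sketch: the function name `count_words` and the tab-separated output are illustrative choices, not part of the original tooling.

```shell
#!/bin/sh
# Count the words in one file after stripping XML tags, using the same
# sed expression as the command above. count_words is a hypothetical
# helper name introduced for this sketch.
count_words() {
    # tr removes the padding some wc implementations put before the number
    sed 's/<[^>]*>//g' "$1" | wc -w | tr -d ' '
}

# Print "filename<TAB>words" for every file given on the command line.
for f in "$@"; do
    printf '%s\t%s\n' "$f" "$(count_words "$f")"
done
```

Saved under a hypothetical name such as `wordcount.sh`, it could be run as `sh wordcount.sh svenpeu.xml svdepeu.xml`. Because sed works line by line, a tag that is split across two lines is not removed and its parts are counted as words.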