|
|
Corpora |
| Corpora and other resources installed on STP
| |
| Swedish:
English:
other languages:
Parallel corpora:
other multilingual corpora: | |
|
|
Press65 |
| /corpora/Press65
The corpus consists of articles collected from morning newspapers in 1965.
| |
|
|
Scarrie |
| /corpora/scarrie
corpora and text collections from the Scarrie project
| |
UNT9596 corpus (23,810,171 tokens), built with the IMS Corpus Workbench from the file unt9596.seg:
cqp -r /corpora/scarrie/cwb/unt
SVD9596 corpus (47,433,729 tokens), built with the IMS Corpus Workbench from the file svd9596.sg:
cqp -r /corpora/scarrie/cwb/svd | |
|
|
SUC |
| /corpora/SUC
Stockholm-Umeå corpus
| |
971018 erik.tjong@ling.uu.se
The directory /corpora/SUC contains the corpus SUC 1.0 (1997). The corpus came on a CD, obtained from Daniel Ridings in August 1997.
Access via CWB:
cqp -r /local/ling/cwb/registry
[no corpus]> SUC;
SUC> | |
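The CQP session above can also be driven from a script. The following is a minimal sketch, assuming that cqp is on the PATH, that it reads commands from standard input when not run interactively, and that the corpus has the usual word attribute. The helper names are illustrative, and the function that actually starts cqp is defined but not called here.

```python
import subprocess

# Registry path as quoted in the note above.
REGISTRY = "/local/ling/cwb/registry"


def cqp_argv(registry=REGISTRY):
    """Command line for CQP with an explicit registry directory (-r)."""
    return ["cqp", "-r", registry]


def cqp_script(corpus, word):
    """CQP commands: activate the corpus, then search for one word form.

    CQP treats the quoted string as a regular expression, so this helper
    is only meant for plain word forms without quotes or metacharacters.
    """
    return f'{corpus};\n[word = "{word}"];\n'


def run_cqp(corpus, word, registry=REGISTRY):
    """Run the query non-interactively and return whatever CQP prints.

    Assumption: cqp accepts the commands on standard input. Undecodable
    bytes are replaced, since the corpus encoding is not documented here.
    """
    result = subprocess.run(
        cqp_argv(registry),
        input=cqp_script(corpus, word),
        capture_output=True,
        text=True,
        errors="replace",
        check=True,
    )
    return result.stdout


# Show the commands that would be sent for the SUC corpus.
print(cqp_script("SUC", "och"), end="")
```

Calling run_cqp("SUC", "och") on a machine with access to the registry should return the matches for the word form, provided the assumptions above hold.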
|
|
BNC |
| /corpora/BNC
BNC ttf-formatted version for the word predictor Prophet used in the Fasty project.
| |
|
|
ICAME - Collection of English Language Corpora |
| /corpora/ICAME
| |
Brown directory added 2003-10-24 (ljo@stp.ling.uu.se)
It contains versions with one sentence per line (.byb).
Coordinators: Knut Hofland, Norwegian Computing Centre for the
Humanities, Bergen
Stig Johansson, University of Oslo
Bergen, December 1991
Publisher/
Distributor: Norwegian Computing Centre for the Humanities,
P.O. Box 53
N-5027 Bergen
Norway
Telephone: +47 5 212954
Telefax: +47 5 322656
Electronic mail: icame@navf-edb-h.uib.no
joerg | |
Introduction

1. ICAME

This disc contains corpora distributed through the International Computer Archive of Modern English (ICAME). ICAME is an organization of linguists and information scientists working with English machine-readable texts. The aim of the organization is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions.

The archive mentioned in the name resides at the Norwegian Computing Centre for the Humanities (NCCH) in Bergen, Norway. This acts as a distribution centre for computerized English-language corpora and corpus-related software.

ICAME publishes the ICAME Journal, which appears once a year, with articles and information on English computer corpora. There is also an electronic information service:
NAVFSERV@NORA.NAVF-EDB-H.UIB.NO (send a message with HELP in the Subject line)

2. The corpora

Each corpus was produced by a different research team, as explained below.

2.1. The Brown Corpus

The Brown Corpus was compiled in the early 1960s at Brown University, USA, under the direction of W. Nelson Francis and Henry Kucera. It contains 500 text samples of some 2,000 words each, representing 15 categories of American English texts printed in 1961. A list of the texts in the Brown Corpus is included on the disc. A full description of the corpus is given in Francis & Kucera (1979).

The Brown Corpus is available in a number of versions, with and without word-class tagging. There are no tagged versions on this disc.

2.2. The LOB Corpus

The Lancaster-Oslo/Bergen (LOB) Corpus was compiled in the 1970s under the direction of Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo. It is a British English counterpart of the Brown Corpus and contains 500 text samples selected from texts printed in Great Britain in 1961. A list of the texts in the LOB Corpus is included on the disc. A full description of the corpus is given in Johansson et al. (1978).

The LOB Corpus exists in a number of versions, with and without word-class tagging. The tagging was done by researchers at Lancaster, Oslo, and Bergen. The principal members of the research teams were:
Lancaster: Geoffrey Leech, Roger Garside, Eric Atwell, Ian Marshall
Oslo/Bergen: Stig Johansson, Knut Hofland, Mette-Cathrine Jahr

A list of the tags used in the corpus is included on the disc. A full description of the tagged corpus is given in Johansson et al. (1986). This disc contains both tagged and untagged versions of the LOB Corpus.

2.3. The Kolhapur Corpus

The Kolhapur Corpus is an Indian English counterpart of the Brown and LOB corpora, compiled under the direction of S. V. Shastri, Shivaji University, Kolhapur. It contains 500 text samples selected from English texts printed in India in 1978. The Kolhapur Corpus contains the same text categories as the British and American counterparts, but the weighting and the internal structure of some of the text categories are somewhat different, due to inherent differences in the Indian situation. A list of the texts in the Kolhapur Corpus is included on the disc. A full description of the corpus is given in Shastri et al. (1986).

2.4. The London-Lund Corpus

The London-Lund Corpus contains 100 spoken English texts of some 5,000 words each, collected and transcribed at the Survey of English Usage, University College London, under the direction of Randolph Quirk, and computerized at the University of Lund, under the direction of Jan Svartvik (13 of the texts were computerized at University College London, under the direction of Sidney Greenbaum). The principal members of the research teams were:
London: Sidney Greenbaum, Andrew Rosta, Akiva Quinn
Lund: Bengt Altenberg, Mats Eeg-Olofsson, Lennart Maansby, Bengt Oreström, Jan Svartvik, Cecilia Thavenius

The texts in the corpus are transcribed orthographically, with detailed prosodic marking. They represent a range of text categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration, etc. For a full description of the corpus, see Svartvik (1990, 1992).

Note that, when the London-Lund Corpus was first made available, it contained 87 texts. The 13 new texts which have since been added are included in all the versions of the corpus found on this disc.

2.5. The Helsinki Corpus of English Texts: Diachronic Part

This corpus was compiled at the University of Helsinki, under the direction of Matti Rissanen. Other members of the research team were:
Old English: Leena Kahlas-Tarkka, Matti Kilpiö, Ilkka Mönkkönen, Aune Österman
Middle English: Inkeri Blomstedt, Juha Hannula, Mailis Järviö, Leena Koskinen, Saara Nevanlinna, Tesma Outakoski, Päivi Pahta, Kirsti Peitsara, Irma Taavitsainen
Early Modern English: Merja Kytö, Anneli Meurman-Solin, Terttu Nevalainen, Helena Raumolin-Brunberg, Ritva Tiusanen

The project secretary was Merja Kytö, and the research assistants who keyed in and proofread texts were: Kirsi Heikkonen, Jussi Klemola, Asta Kuusinen, Tuula Lehtonen, Tom Löfström, Arja Nurmi, Minna Palander, Tiina Selki, Päivi Öhman.

The corpus consists of a selection of texts covering the Old, Middle, and Early Modern English periods, totalling 1.5 million words. Information on source texts and coding conventions is given in Kytö (1991); the principles of compilation and a number of pilot studies will appear in Kytö, Palander and Rissanen (eds.), forthcoming.

3. Versions

The computer editing for the collection was done by Knut Hofland, Norwegian Computing Centre for the Humanities. The following versions of the corpora are included on the disc:
Brown Corpus: Bergen text version I and II, for MS-DOS, Macintosh and Unix. A modified Bergen version II indexed by WordCruncher 4.4 and TACT for MS-DOS and Free Text Browser for Macintosh.
LOB Corpus: Tagged and untagged original text versions, for MS-DOS, Macintosh and Unix. A tagged horizontal version indexed by WordCruncher 4.4 and TACT for MS-DOS and Free Text Browser for Macintosh.
Kolhapur Corpus: Text version for MS-DOS, Macintosh and Unix. A version indexed by WordCruncher 4.4 for MS-DOS.
London-Lund Corpus: Original text version for MS-DOS, Macintosh and Unix. An edited version indexed by WordCruncher 4.4 and TACT for MS-DOS and Free Text Browser for Macintosh.
Helsinki Corpus: Text version for MS-DOS, Macintosh and Unix. 1-file, 3-file and 11-file versions indexed by WordCruncher 4.4 and TACT for MS-DOS.

The WordCruncher versions of the corpora were prepared by:
Brown Corpus: Randall Jones, Brigham Young University
LOB Corpus: Knut Hofland, Norwegian Computing Centre for the Humanities
Kolhapur Corpus: Gerhard Leitner, Free University of Berlin, and Knut Hofland, Norwegian Computing Centre for the Humanities
London-Lund Corpus: Knut Hofland, Norwegian Computing Centre for the Humanities
Helsinki Corpus: Merja Kytö, University of Helsinki

4. Structure of the disc

The disc contains the following subdirectories:
MS-DOS
   GEN_INFO
   TXT
      BROWN1
      BROWN2
      HELSINKI
      LOBUNTAG
      LOBTAGH (horizontal version)
      LOBTAGV (vertical version)
      LONDLUND
      KOLHAPUR
   WC44
      BROWN
      LOBTAG
      LONDLUND
      KOLHAPUR
      HELSINKI
         1-FILE
         3-FILE
         11-FILE
   TACT12
MAC
   GEN_INFO
   TXT
      BROWN1
      BROWN2
      LONDLUND
      LOBUNTAG
      LOBTAGH
      LOBTAGV
      KOLHAPUR
      HELSINKI
   FREETEXT
   HQX
UNIX
   GEN_INFO
   BROWN1
   BROWN2
   LONDLUND
   LOBUNTAG
   LOBTAGH
   LOBTAGV
   KOLHAPUR
   HELSINKI
   HUM (programs)
   BROWSER (programs)
5. Conditions of use

The following conditions govern the use of corpus material distributed through ICAME:
References
joerg | |
|
|
LPC: Lancaster Parsed Corpus |
/corpora/LPC/
Installed only for educational purposes.
Hong Liang Qiao

/corpora/LPC/prolog/
This directory contains the LPC tag format converted to Prolog for the Korpus Lingvistik project of Anna Ekegren and Tomas Englund.
lpc     --- the tags of the LPC sentences converted to Prolog
convert --- conversion program
960514-960607 Erik TKS <erik.tjong@ling.uu.se>
joerg | |
|
|
SUSANNE |
| /corpora/Susanne
| |
| The SUSANNE Corpus
A by-product of the work of creating the SUSANNE annotation scheme was the production of a corpus of English annotated in accordance with the scheme. The SUSANNE Corpus contains annotations of a 130,000-word cross-section of written American English (it is based on a subset of the million-word Brown Corpus). The SUSANNE Corpus is freely available without formalities for use by researchers anywhere (and has been heavily used since the first release was published in 1992). Many gratifying comments have been received from users about the detail and reliability of the annotated Corpus. | |
|
|
NEGRA |
| /corpora/NEGRA
| |
This directory contains version 2.0 of the NEGRA corpus.
The contents of the files are:
negra-corpus.tar.gz   All of the following files are combined in
                      this gzipped tar file. This is the only file
                      that you need to download. In case of problems,
                      please retrieve the single files or
                      contact us.
negra-corpus.export   The complete corpus data in export format
                      (see exformat3.ps).
negra-corpus.cfg      The corpus converted to context-free structures
                      (i.e., all crossing branches are transformed to
                      traces).
negra-corpus.penn     The corpus converted to the Penn Treebank format
                      (a bracketed structure; crossing branches are
                      converted to traces).
                      All left and right round brackets () in the text
                      are replaced by *LRB* and *RRB*.
negra-corpus.sent     The raw sentence data.
negra-corpus.tt       Words and parts-of-speech from the corpus
                      (one token and tag per line).
negra-corpus.lex      A list of words from the corpus, together with
                      frequencies and part-of-speech information.
exformat3.ps          A description of the export format.
stts.asc              A description of the part-of-speech tags.
edges.html            A description of the edge labels.
nodes.html            A description of the node labels.
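As a rough illustration of the one-token-and-tag-per-line layout of negra-corpus.tt, the sketch below reads such lines and counts the tags. It assumes that token and tag are separated by whitespace, which the description above does not specify. The sample lines are invented for the example and are not taken from the corpus. The second helper undoes the *LRB* and *RRB* replacement described for negra-corpus.penn.

```python
from collections import Counter

# Placeholder tokens used for round brackets, as described above.
BRACKETS = {"*LRB*": "(", "*RRB*": ")"}


def parse_tt(lines):
    """Return (token, tag) pairs from lines holding one token and tag each.

    Assumption: the two fields are separated by whitespace. Lines that do
    not split into exactly two fields (e.g. blank lines) are skipped.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def restore_brackets(token):
    """Map *LRB* and *RRB* back to round brackets; leave other tokens alone."""
    return BRACKETS.get(token, token)


# Invented sample in the assumed layout, with STTS-style tags.
sample = ["Der\tART", "Hund\tNN", "schläft\tVVFIN", ".\t$.", ""]

pairs = parse_tt(sample)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs)
print(tag_counts.most_common())
print(restore_brackets("*LRB*"), restore_brackets("Hund"))
```

For the real file, the lines would come from an open file handle instead of the sample list; the file encoding would need to be checked first.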
Please refer to
http://www.coli.uni-sb.de/sfb378/negra-corpus/
for further information on the corpus or contact
Geert-Jan M. Kruijff
Universitaet des Saarlandes
Computerlinguistik
P.O.Box 151150
D-66041 Saarbruecken
Germany
gj@coli.uni-sb.de
| |
|
|
Franska |
| /corpora/Franska
971014
Files in /corpora/Franska
There are two versions of the files: an MS-DOS version (original) in the directory Dos and a Unix version in the directory Unix. | |
|
|
EUROPARL |
| /corpora/EUROPARL
| |
Europarl Release v2 -- Dec 4, 2003
==================================
This is a parallel corpus that was extracted from the
European Parliament web site by Philipp Koehn (USC/ISI).
It is fairly big, 25-30 million words per language pair,
and its main intended use is to aid statistical machine
translation research.
More information can be found at
http://www.isi.edu/~koehn/europarl/
The main difference in this release vs. the first release
in 2002 is that it is larger and it comes with a sentence
aligner that allows the creation of parallel corpora
between any two of the 11 languages. | |
Source
------
http://www3.europarl.eu.int/omk/omnsapir.so/calendar?APP=CRE&LANGUE=EN
Copyright in the Europarl service (c) European Communities
Except where otherwise indicated, reproduction is authorised, provided that the source is acknowledged. | |
|
|
KOMA |
| /corpora/KOMA
Parallel corpora from the KOMA project | |
20040109 joerg@stp.ling.uu.se
KOMA corpus 1.0
2003-06-11
/corpora/KOMA
doc/ bitext documents (LWA style & KOMA XML)
dtd/ liu DTDs
tools/ some scripts
joerg | |
|
|
MATS |
| /corpora/MATS
trilingual parallel corpus from the MATS project | |
|
|
OPUS |
| /corpora/OPUS
OPUS - an open source parallel corpus | |
OPUS v0.2
OpenOffice ....... OpenOffice documentation
PHP .............. PHP manuals
KDE .............. KDE system messages
KDEdoc ........... KDE documentation
src .............. original documents
reg .............. cwb registry
data ............. cwb data
scripts .......... some scripts | |
|
|
PLUG |
| /corpora/PLUG
Parallel corpora from the PLUG project | |
subdirectories:
XML - latest corpus files encoded in PLUG XML
--------------------------------------------------------------------------
all files are encoded in XML using the plugXML.dtd
complete corpus: 1910296 words
sv<->en: 1112159 words
sv<->de: 455008 words
sv<->it: 343129 words
technical texts: 1353740 words
file words bytes languages
------------------------------------------------------------------
ensvtacc.xml 163173 1606582 en->sv
ensvtxl.xml 124961 1330559 en->sv
svdetscan.xml 337188 6052332 sv->de
sventscan.xml 187830 3370698 sv->en
ensvtscan.xml 197459 3508625 en->sv
svittscan.xml 343129 7139976 sv->it
fiction: 301620 words
file words bytes languages
------------------------------------------------------------------
ensvfbell.xml 132066 1169471 en->sv
ensvgord.xml 169554 1423741 en->sv
political texts: 254936 words
file words bytes languages
------------------------------------------------------------------
svdepeu.xml 110042 973813 sv->de
svenpeu.xml 129105 1074560 sv->en
svenprf.xml 8011 86180 sv->en
svdeprf.xml 7778 90486 sv->de | |
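The subtotals can be cross-checked against the per-file word counts. The sketch below hard-codes the counts from the listing above; the genre labels and the way language pairs are grouped (ignoring translation direction) are illustrative choices, not part of the corpus documentation.

```python
from collections import defaultdict

# (file, genre, words, source language, target language), from the listing above.
FILES = [
    ("ensvtacc.xml", "technical", 163173, "en", "sv"),
    ("ensvtxl.xml", "technical", 124961, "en", "sv"),
    ("svdetscan.xml", "technical", 337188, "sv", "de"),
    ("sventscan.xml", "technical", 187830, "sv", "en"),
    ("ensvtscan.xml", "technical", 197459, "en", "sv"),
    ("svittscan.xml", "technical", 343129, "sv", "it"),
    ("ensvfbell.xml", "fiction", 132066, "en", "sv"),
    ("ensvgord.xml", "fiction", 169554, "en", "sv"),
    ("svdepeu.xml", "political", 110042, "sv", "de"),
    ("svenpeu.xml", "political", 129105, "sv", "en"),
    ("svenprf.xml", "political", 8011, "sv", "en"),
    ("svdeprf.xml", "political", 7778, "sv", "de"),
]

by_genre = defaultdict(int)
by_pair = defaultdict(int)
for name, genre, words, src, tgt in FILES:
    by_genre[genre] += words
    # Sort the two language codes so that en->sv and sv->en share one key.
    by_pair[tuple(sorted((src, tgt)))] += words

total = sum(words for _, _, words, _, _ in FILES)

print("complete corpus:", total)
for genre, words in by_genre.items():
    print(f"{genre}: {words}")
for pair, words in by_pair.items():
    print(f"{pair[0]}<->{pair[1]}: {words}")
```

The per-file counts add up to 1910296 words, which is the figure given for the complete corpus.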
|
|
Regeringsförklaringen |
| /corpora/RF
This corpus contains texts from the Swedish government, including the declarations issued by the Swedish prime minister when a new cabinet takes office (regeringsförklaringen). These declarations are issued in five languages: Swedish, English, French, German and Spanish (since 1996). The current size of the corpus is 11,000 words. | |
|
|
Scania |
| /corpora/Scania
Parallel corpora from the Scania project | |
|
restricted access!
| |
/corpora/Scania/Scania1995/
-----------------------------------------------------------------
Language files words bytes
-----------------------------------------------------------------
Swedish 80 172259 7792597
Dutch 80 216424 8072128
English 80 220827 7886082
Finnish 80 148348 7833990
French 80 244239 8156457
German 80 186293 8004331
Italian 80 228631 8127121
Spanish 80 250730 8090916
-----------------------------------------------------------------
total 640 1667751 63963622
----------------------------------------------------------------- | |
/corpora/Scania/Scania1998/
Swedish                     tokens       bytes
---------------------------------------------------
ScaniaOM.txt                88,925     652,013
ScaniaSD.txt               258,964   1,938,396
ScaniaSH.txt               455,946   3,224,298
ScaniaSHa.txt              615,026   4,411,102
ScaniaSP.txt                22,636     155,161
ScaniaTI.txt                95,880     627,067
ScaniaTL.txt                 5,352     197,978
---------------------------------------------------
total number of tokens   1,542,729  11,206,015
total number of types       46,770
===================================================
English                     tokens       bytes
---------------------------------------------------
ScaniaOM.txt               117,397     748,241
ScaniaSD.txt               339,613   2,186,504
ScaniaSH.txt               579,931   3,508,928
ScaniaSP.txt                28,065     177,150
ScaniaTI.txt               109,318     665,032
ScaniaTL.txt                 9,188     197,982
---------------------------------------------------
total number of tokens   1,183,512   7,483,837 | |
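The listing distinguishes tokens (running words) from types (distinct word forms). How the figures above were produced is not documented here, so the following is only a sketch of the distinction, assuming whitespace tokenization and no case folding; the sample sentence is invented.

```python
def token_type_counts(text):
    """Return (number of tokens, number of types) for a text.

    Assumption: tokens are whitespace-separated strings, and two tokens
    count as the same type only if they are identical character by character.
    """
    tokens = text.split()
    return len(tokens), len(set(tokens))


# Invented sample: "the" occurs twice, so there are 5 tokens but 4 types.
n_tokens, n_types = token_type_counts("the truck and the trailer")
print(n_tokens, n_types)
```

A different tokenizer (for example one that splits off punctuation or lowercases the text) would give different counts for the same file.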
/corpora/Scania/Scania1994
/corpora/Scania/Scania1996 | |
|
|
Invandrartidningen |
| /corpora/Invandrartidningen
| |