Institutionen för lingvistik
STP FAQ: Corpora
Corpora and other resources installed on STP
Swedish:
Press65
Scarrie
SUC

English:
BNC
ICAME - Collection of English Language Corpora
LPC: Lancaster Parsed Corpus
SUSANNE

Other languages:
NEGRA (German)
Franska (French)

Parallel corpora:
EUROPARL
KOMA
MATS
OPUS
PLUG
Regeringsförklaringen
Scania

Other multilingual corpora:
Invandrartidningen

Press65
/corpora/Press65

The corpus consists of articles collected from morning newspapers in 1965.

Scarrie
/corpora/scarrie

Corpora and text collections from the Scarrie project.

UNT9596 corpus (23,810,171 tokens), built with the
IMS Corpus Workbench from the file unt9596.seg

cqp -r /corpora/scarrie/cwb/unt


SVD9596 corpus (47,433,729 tokens), built with the
IMS Corpus Workbench from the file svd9596.sg

cqp -r /corpora/scarrie/cwb/svd

SUC
/corpora/SUC

Stockholm-Umeå corpus

971018 erik.tjong@ling.uu.se
 
The directory /corpora/SUC contains the corpus SUC 1.0 (1997).
The corpus came on a CD, obtained from Daniel Ridings in August 1997.


access via CWB:

cqp -r /local/ling/cwb/registry
[no corpus]> SUC;
SUC>
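
The interactive session above can also be scripted. The following is a minimal Python sketch, assuming that the cqp binary is on the PATH and reads commands from standard input. The helper names build_cqp_script and run_cqp and the example query are illustrative only and are not part of the STP installation.

```python
import subprocess

REGISTRY = "/local/ling/cwb/registry"  # registry path quoted in this FAQ entry


def build_cqp_script(corpus, query):
    """Return the text to send to cqp: activate the corpus, then run one query."""
    # Each CQP command ends with a semicolon, as in the "SUC;" line above.
    return f"{corpus};\n{query};\n"


def run_cqp(corpus, query, registry=REGISTRY):
    """Run one query through cqp and return its standard output.

    Assumption: cqp is installed and accepts commands on standard input.
    """
    result = subprocess.run(
        ["cqp", "-r", registry],
        input=build_cqp_script(corpus, query),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage: list every occurrence of the word form "korpus" in SUC.
# print(run_cqp("SUC", '[word = "korpus"]'))
```

The same helper should work for the Scarrie corpora if the registry argument is changed to the directory given in that entry; the corpus names registered there are not stated in this FAQ.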

BNC
/corpora/BNC

A ttf-formatted version of the BNC for the word predictor Prophet, used in the Fasty project.

ICAME - Collection of English Language Corpora
/corpora/ICAME
Brown directory added 2003-10-24 (ljo@stp.ling.uu.se).
It contains versions with one sentence per line (.byb).
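
Files with one sentence per line, such as the .byb versions, can be processed line by line. A minimal Python sketch, assuming that tokens are separated by whitespace; the function name and the file name in the usage comment are hypothetical.

```python
def sentence_stats(lines):
    """Return (number of sentences, number of whitespace-separated tokens)."""
    sentences = 0
    tokens = 0
    for line in lines:
        words = line.split()
        if not words:  # skip empty lines
            continue
        sentences += 1
        tokens += len(words)
    return sentences, tokens


# Hypothetical usage (the file name and the encoding are assumptions):
# with open("example.byb", encoding="latin-1") as handle:
#     print(sentence_stats(handle))
```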

Coordinators: Knut Hofland, Norwegian Computing Centre for the
              Humanities, Bergen
              Stig Johansson, University of Oslo

Bergen, December 1991

Publisher/
Distributor: Norwegian Computing Centre for the Humanities,
             P.O. Box 53
             N-5027 Bergen
             Norway

             Telephone: +47 5 212954
             Telefax:   +47 5 322656
             Electronic mail: icame@navf-edb-h.uib.no

Introduction

1. ICAME

This disc contains corpora distributed through the International Computer Archive of Modern English (ICAME). ICAME is an organization of linguists and information scientists working with English machine-readable texts. The aim of the organisation is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions. The archive mentioned in the name resides at the Norwegian Computing Centre for the Humanities (NCCH) in Bergen, Norway. This acts as a distribution centre for computerized English-language corpora and corpus-related software. ICAME publishes the ICAME Journal, which appears once a year, with articles and information on English computer corpora. There is also an electronic information service: NAVFSERV@NORA.NAVF-EDB-H.UIB.NO (send a message with HELP in the Subject line).

2. The corpora

Each corpus was produced by a different research team, as explained below.

2.1. The Brown Corpus

The Brown Corpus was compiled in the early 1960s at Brown University, USA, under the direction of W. Nelson Francis and Henry Kucera. It contains 500 text samples of some 2,000 words each, representing 15 categories of American English texts printed in 1961. A list of the texts in the Brown Corpus is included on the disc. A full description of the corpus is given in Francis & Kucera (1979). The Brown Corpus is available in a number of versions, with and without word-class tagging. There are no tagged versions on this disc.

2.2. The LOB Corpus

The Lancaster-Oslo/Bergen (LOB) Corpus was compiled in the 1970s under the direction of Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo. It is a British English counterpart of the Brown Corpus and contains 500 text samples selected from texts printed in Great Britain in 1961. A list of the texts in the LOB Corpus is included on the disc. A full description of the corpus is given in Johansson et al. (1978). The LOB Corpus exists in a number of versions, with and without word-class tagging. The tagging was done by researchers at Lancaster, Oslo, and Bergen. The principal members of the research teams were:

Lancaster: Geoffrey Leech, Roger Garside, Eric Atwell, Ian Marshall
Oslo/Bergen: Stig Johansson, Knut Hofland, Mette-Cathrine Jahr

A list of the tags used in the corpus is included on the disc. A full description of the tagged corpus is given in Johansson et al. (1986). This disc contains both tagged and untagged versions of the LOB Corpus.

2.3. The Kolhapur Corpus

The Kolhapur Corpus is an Indian English counterpart of the Brown and LOB corpora, compiled under the direction of S. V. Shastri, Shivaji University, Kolhapur. It contains 500 text samples selected from English texts printed in India in 1978. The Kolhapur Corpus contains the same text categories as the British and American counterparts, but the weighting and the internal structure of some of the text categories are somewhat different, due to inherent differences in the Indian situation. A list of the texts in the Kolhapur Corpus is included on the disc. A full description of the corpus is given in Shastri et al. (1986).

2.4. The London-Lund Corpus

The London-Lund Corpus contains 100 spoken English texts of some 5,000 words each, collected and transcribed at the Survey of English Usage, University College London, under the direction of Randolph Quirk, and computerized at the University of Lund, under the direction of Jan Svartvik (13 of the texts were computerized at University College London, under the direction of Sidney Greenbaum). The principal members of the research teams were:

London: Sidney Greenbaum, Andrew Rosta, Akiva Quinn
Lund: Bengt Altenberg, Mats Eeg-Olofsson, Lennart Maansby, Bengt Oreström, Jan Svartvik, Cecilia Thavenius

The texts in the corpus are transcribed orthographically, with detailed prosodic marking. They represent a range of text categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration, etc. For a full description of the corpus, see Svartvik (1990, 1992). Note that, when the London-Lund Corpus was first made available, it contained 87 texts. The 13 new texts which have since been added are included in all the versions of the corpus found on this disc.

2.5. The Helsinki Corpus of English Texts: Diachronic Part

This corpus was compiled at the University of Helsinki, under the direction of Matti Rissanen. Other members of the research team were:

Old English: Leena Kahlas-Tarkka, Matti Kilpiö, Ilkka Mönkkönen, Aune Österman

Middle English: Inkeri Blomstedt, Juha Hannula, Mailis Järviö, Leena Koskinen, Saara Nevanlinna, Tesma Outakoski, Päivi Pahta, Kirsti Peitsara, Irma Taavitsainen

Early Modern English: Merja Kytö, Anneli Meurman-Solin, Terttu Nevalainen, Helena Raumolin-Brunberg, Ritva Tiusanen

The project secretary was Merja Kytö, and the research assistants who keyed in and proofread texts were:

Kirsi Heikkonen, Jussi Klemola, Asta Kuusinen, Tuula Lehtonen, Tom Löfström, Arja Nurmi, Minna Palander, Tiina Selki, Päivi Öhman.

The corpus consists of a selection of texts covering the Old, Middle, and Early Modern English periods, totalling 1.5 million words. Information on source texts and coding conventions is given in Kytö (1991); the principles of compilation and a number of pilot studies will appear in Kytö, Palander and Rissanen (eds.), forthcoming.

3. Versions

The computer editing for the collection was done by Knut Hofland, Norwegian Computing Centre for the Humanities. The following versions of the corpora are included on the disc:

Brown Corpus:

Bergen text version I and II, for MS-DOS, Macintosh and Unix. A modified Bergen version II indexed by WordCruncher 4.4 and TACT for MS-DOS and Free Text Browser for Macintosh.

LOB Corpus:

Tagged and untagged original text versions, for MS-DOS, Macintosh and Unix. A tagged horizontal version indexed by WordCruncher 4.4 and TACT for MS-DOS and Free Text Browser for Macintosh.

Kolhapur Corpus:

Text version for MS-DOS, Macintosh and Unix. A version indexed by WordCruncher 4.4 for MS-DOS.

London-Lund Corpus:

Original text version for MS-DOS, Macintosh and Unix. An edited version indexed by WordCruncher 4.4 and TACT for MS-DOS and Free Text Browser for Macintosh.

Helsinki Corpus:

Text version for MS-DOS, Macintosh and Unix. 1-file, 3-file and 11-file versions indexed by WordCruncher 4.4 and TACT for MS-DOS.

The WordCruncher versions of the corpora were prepared by:

Brown Corpus: Randall Jones, Brigham Young University
LOB Corpus: Knut Hofland, Norwegian Computing Centre for the Humanities
Kolhapur Corpus: Gerhard Leitner, Free University of Berlin and Knut Hofland, Norwegian Computing Centre for the Humanities
London-Lund Corpus: Knut Hofland, Norwegian Computing Centre for the Humanities
Helsinki Corpus: Merja Kytö, University of Helsinki

4. Structure of the disc

The disc contains the following subdirectories:

   MS-DOS
      GEN_INFO
      TXT
           BROWN1
           BROWN2
           HELSINKI
           LOBUNTAG
           LOBTAGH (horizontal version)
           LOBTAGV (vertical version)
           LONDLUND
           KOLHAPUR
      WC44
           BROWN
           LOBTAG
           LONDLUND
           KOLHAPUR
           HELSINKI
               1-FILE
               3-FILE
               11-FILE
      TACT12
   MAC
      GEN_INFO
      TXT
           BROWN1
           BROWN2
           LONDLUND
           LOBUNTAG
           LOBTAGH
           LOBTAGV
           KOLHAPUR
           HELSINKI
      FREETEXT
           HQX
   UNIX
       GEN_INFO
       BROWN1
       BROWN2
       LONDLUND
       LOBUNTAG
       LOBTAGH
       LOBTAGV
       KOLHAPUR
       HELSINKI
       HUM (programs)
       BROWSER (programs)

5. Conditions of use

The following conditions govern the use of corpus material distributed through ICAME:

a. No copies of corpora, or parts of corpora, are to be distributed outside the institution under any circumstances, without the written permission of ICAME.
b. The corpora, or parts thereof, are to be used for bona fide research of a non-profit nature. Holders of copies of corpora may not reproduce any texts, or parts of texts, for any purpose other than scholarly research without getting the written permission of the individual copyright holders, as listed in the manual for the corpus in question. (For material where there is no known copyright holder, the person(s) who originally prepared the material in computerized form will be regarded as the copyright holder(s).)
c. Commercial publishers and other non-academic organizations wishing to make use of part or all of a corpus or a print-out thereof must obtain permission from all the individual copyright holders involved.
d. Publications making use of the material should include a reference to the relevant corpus (or corpora), giving the name of the corpus and the distributor.

References

  • Francis, W. Nelson and Henry Kucera. 1979. Manual of Information to Accompany a Standard Corpus of Present-Day Edited American English, for Use with Digital Computers. Revised and amplified ed. Providence, Rhode Island: Department of Linguistics, Brown University.
  • Johansson, Stig, Geoffrey Leech, and Helen Goodluck. 1978. Manual of Information to Accompany the Lancaster-Oslo/Bergen Corpus of British English, for Use with Digital Computers. Oslo: Department of English, University of Oslo.
  • Johansson, Stig, Eric Atwell, Roger Garside, and Geoffrey Leech. 1986. The Tagged LOB Corpus: Users' Manual. Bergen: Norwegian Computing Centre for the Humanities.
  • Kytö, Merja (comp.). 1991. Manual to the Diachronic Part of the Helsinki Corpus of English Texts: Coding Conventions and Lists of Source Texts. Helsinki: Department of English, University of Helsinki. [Distributed by the Norwegian Computing Centre for the Humanities, Bergen]
  • Kytö, Merja, Minna Palander and Matti Rissanen (eds.). Forthcoming. The Helsinki Corpus of English Texts: Introduction and Pilot Studies on the Diachronic Part.
  • Shastri, S.V., C.T. Patilkulkarni, and Geeta S. Shastri. 1986. Manual of Information to Accompany the Kolhapur Corpus of Indian English, for Use with Digital Computers. Kolhapur: Department of English, Shivaji University.
  • Svartvik, Jan (ed.). 1990. The London-Lund Corpus of Spoken English: Description and Research. Lund Studies in English 82. Lund: Lund University Press.
  • Svartvik, Jan (ed.). 1992. The London-Lund Corpus of Spoken English: Users' Manual. Lund: Department of English, Lund University. [Distributed by the Norwegian Computing Centre for the Humanities]

LPC: Lancaster Parsed Corpus
/corpora/LPC/
 
Installed only for educational purposes.
  
Hong Liang Qiao


/corpora/LPC/prolog/
 
This directory contains the LPC tag format converted to Prolog for
the Korpus Lingvistik project of Anna Ekegren and Tomas Englund.

lpc     --- the tags of the LPC sentences converted to Prolog
convert --- conversion program
 
960514-960607 Erik TKS <erik.tjong@ling.uu.se>

SUSANNE
/corpora/Susanne
The SUSANNE Corpus

   A by-product of the work of creating the SUSANNE annotation scheme was
   the production of a corpus of English annotated in accordance with the
   scheme. The SUSANNE Corpus contains annotations of a 130,000-word
   cross-section of written American English (it is based on a subset of
   the million-word Brown Corpus). The SUSANNE Corpus is freely available
   without formalities for use by researchers anywhere (and has been
   heavily used since the first release was published in 1992). Many
   gratifying comments have been received from users about the detail and
   reliability of the annotated Corpus.

NEGRA
/corpora/NEGRA
This directory contains version 2.0 of the NEGRA corpus.
The contents of the files are:
                                                                                
negra-corpus.tar.gz     All of the following files are combined in
                        this gzipped tar file. This is the only file
                        that you need to download. In case of problems,
                        please retrieve the individual files or
                        contact us.
                                                                                
negra-corpus.export     The complete corpus data in export format
                        (see exformat3.ps).
negra-corpus.cfg        The corpus converted to context-free structures
                        (i.e., all crossing branches are transformed to
                        traces)
negra-corpus.penn       The corpus converted to the Penn Treebank format,
                        (a bracketed structure, crossing branches are
                        converted to traces).
                        All left and right round brackets () in the text
                        are replaced by *LRB* and *RRB*.
negra-corpus.sent       The raw sentence data.
negra-corpus.tt         Words and parts-of-speech from the corpus
                        (one token and tag per line).
negra-corpus.lex        A list of words from the corpus, together with
                        frequencies and part-of-speech information.
exformat3.ps            A description of the export format.
stts.asc                A description of the part-of-speech tags
edges.html              A description of the edge labels
nodes.html              A description of the node labels
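
A file with one token and tag per line, such as negra-corpus.tt, can be read with a few lines of Python. This is a minimal sketch, assuming that token and tag are separated by whitespace and that blank or malformed lines can be skipped; the function names are hypothetical, and the sample lines in the usage comment are made up for illustration.

```python
from collections import Counter


def read_token_tags(lines):
    """Yield (token, tag) pairs from lines holding one token and one tag each."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:  # skip blank or malformed lines
            continue
        yield fields[0], fields[1]


def tag_frequencies(lines):
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for _, tag in read_token_tags(lines))


# Hypothetical usage with made-up sample lines:
# sample = ["Das\tART", "Haus\tNN", "steht\tVVFIN"]
# print(tag_frequencies(sample).most_common())
```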
                                                                                
Please refer to
        http://www.coli.uni-sb.de/sfb378/negra-corpus/
for further information on the corpus or contact
        Geert-Jan M. Kruijff
        Universitaet des Saarlandes
        Computerlinguistik
        P.O.Box 151150
        D-66041 Saarbruecken
        Germany
        gj@coli.uni-sb.de

Franska
/corpora/Franska

971014 Files in /corpora/Franska
These are the files of the French newspaper corpus obtained from Lars Larsson (Romanska Institutionen). The corpus is to be used by Stina Nylander.

 
There are two versions of the files: an MS-DOS version (the original) in the directory Dos and a Unix version in the directory Unix.

EUROPARL
/corpora/EUROPARL
Europarl Release v2 -- Dec 4, 2003
==================================
 
This is a parallel corpus that was extracted from the
European Parliament web site by Philipp Koehn (USC/ISI).
It is fairly big, 25-30 million words per language pair,
and its main intended use is to aid statistical machine
translation research.
 
More information can be found at
        http://www.isi.edu/~koehn/europarl/
 
The main difference in this release vs. the first release
in 2002 is that it is larger and it comes with a sentence
aligner that allows the creation of parallel corpora
between any two of the 11 languages.
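
With 11 languages there are 55 possible language pairs. A small Python sketch of the enumeration follows; the language codes are an assumption, since this entry only gives the number of languages.

```python
from itertools import combinations

# Assumed ISO 639-1 codes for the 11 languages (not listed in this entry).
LANGUAGES = ["da", "de", "el", "en", "es", "fi", "fr", "it", "nl", "pt", "sv"]

# Every unordered pair of distinct languages: 11 * 10 / 2 = 55.
pairs = list(combinations(LANGUAGES, 2))
print(len(pairs))
```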
 
Source
------
http://www3.europarl.eu.int/omk/omnsapir.so/calendar?APP=CRE&LANGUE=EN
                                                                                
Copyright in the Europarl service
(c) European Communities
Except where otherwise indicated, reproduction is authorised,
provided that the source is acknowledged.

KOMA
/corpora/KOMA

Parallel corpora from the KOMA project
http://www.ida.liu.se/~nlplab/koma/

                                20040109 joerg@stp.ling.uu.se
KOMA corpus 1.0
2003-06-11
 
/corpora/KOMA
 
doc/            bitext documents (LWA style & KOMA XML)
dtd/            liu DTDs
tools/          some scripts

MATS
/corpora/MATS

trilingual parallel corpus from the MATS project
http://stp.ling.uu.se/mats

OPUS
/corpora/OPUS

OPUS - an open source parallel corpus
http://logos.uio.no/opus/

OPUS v0.2
 
OpenOffice ....... OpenOffice documentation
PHP .............. PHP manuals
KDE .............. KDE system messages
KDEdoc ........... KDE documentation
 
src .............. original documents
 
reg .............. cwb registry
data ............. cwb data
scripts .......... some scripts

PLUG
/corpora/PLUG

Parallel corpora from the PLUG project
http://stp.ling.uu.se/plug

subdirectories:

XML     - latest corpus files encoded in PLUG XML

--------------------------------------------------------------------------

all files are encoded in XML using the plugXML.dtd

complete corpus:        1910296 words
sv<->en:                1112159 words
sv<->de:                 455008 words
sv<->it:                 343129 words



technical texts:        1353740 words

file                    words           bytes           languages
------------------------------------------------------------------
ensvtacc.xml            163173          1606582         en->sv
ensvtxl.xml             124961          1330559         en->sv
svdetscan.xml           337188          6052332         sv->de
sventscan.xml           187830          3370698         sv->en
ensvtscan.xml           197459          3508625         en->sv
svittscan.xml           343129          7139976         sv->it


fiction:                301620 words

file                    words           bytes           languages
------------------------------------------------------------------
ensvfbell.xml           132066          1169471         en->sv
ensvgord.xml            169554          1423741         en->sv


political texts:        254936 words

file                    words           bytes           languages
------------------------------------------------------------------
svdepeu.xml             110042           973813         sv->de
svenpeu.xml             129105          1074560         sv->en
svenprf.xml               8011            86180         sv->en
svdeprf.xml               7778            90486         sv->de
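
The per-file word counts can be cross-checked against the complete-corpus and language-pair totals given at the top of this entry. A minimal Python sketch, with the figures copied from the tables; language pairs are counted without regard to translation direction, and the variable names are illustrative.

```python
from collections import defaultdict

# (file, words, source language, target language), copied from the tables.
PLUG_FILES = {
    "technical": [
        ("ensvtacc.xml", 163173, "en", "sv"),
        ("ensvtxl.xml", 124961, "en", "sv"),
        ("svdetscan.xml", 337188, "sv", "de"),
        ("sventscan.xml", 187830, "sv", "en"),
        ("ensvtscan.xml", 197459, "en", "sv"),
        ("svittscan.xml", 343129, "sv", "it"),
    ],
    "fiction": [
        ("ensvfbell.xml", 132066, "en", "sv"),
        ("ensvgord.xml", 169554, "en", "sv"),
    ],
    "political": [
        ("svdepeu.xml", 110042, "sv", "de"),
        ("svenpeu.xml", 129105, "sv", "en"),
        ("svenprf.xml", 8011, "sv", "en"),
        ("svdeprf.xml", 7778, "sv", "de"),
    ],
}

# Words per text type.
by_genre = {
    genre: sum(words for _, words, _, _ in files)
    for genre, files in PLUG_FILES.items()
}

# Words per language pair; en->sv and sv->en both count towards sv<->en.
by_pair = defaultdict(int)
for files in PLUG_FILES.values():
    for _, words, source, target in files:
        by_pair[frozenset((source, target))] += words

total = sum(by_genre.values())
print(by_genre, total)
```

The sums reproduce the complete-corpus figure of 1910296 words and the three language-pair totals.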

Regeringsförklaringen
/corpora/RF

This corpus contains texts from the Swedish government, including the declarations issued by the Swedish prime minister when a new cabinet takes office (regeringsförklaringen). These declarations are issued in five languages: Swedish, English, French, German and Spanish (since 1996). The current size of the corpus is 11,000 words.
http://stp.ling.uu.se/~corpora/rf/

Scania
/corpora/Scania

Parallel corpora from the Scania project
http://stp.ling.uu.se/scania/

restricted access!


/corpora/Scania/Scania1995/

-----------------------------------------------------------------
           Language    files       words     bytes
-----------------------------------------------------------------
           Swedish       80       172259   7792597
           Dutch         80       216424   8072128
           English       80       220827   7886082
           Finnish       80       148348   7833990
           French        80       244239   8156457
           German        80       186293   8004331
           Italian       80       228631   8127121
           Spanish       80       250730   8090916
-----------------------------------------------------------------
           total        640      1667751  63963622
-----------------------------------------------------------------
/corpora/Scania/Scania1998/

 Swedish                     tokens        bytes
---------------------------------------------------
 ScaniaOM.txt                88,925      652,013
 ScaniaSD.txt               258,964    1,938,396
 ScaniaSH.txt               455,946    3,224,298
 ScaniaSHa.txt              615,026    4,411,102
 ScaniaSP.txt                22,636      155,161
 ScaniaTI.txt                95,880      627,067
 ScaniaTL.txt                 5,352      197,978
---------------------------------------------------
 total number of tokens   1,542,729   11,206,015
 total number of types       46,770
===================================================

 English                     tokens        bytes
---------------------------------------------------
 ScaniaOM.txt               117,397       748,241
 ScaniaSD.txt               339,613     2,186,504
 ScaniaSH.txt               579,931     3,508,928
 ScaniaSP.txt                28,065       177,150
 ScaniaTI.txt               109,318       665,032
 ScaniaTL.txt                 9,188       197,982
---------------------------------------------------
 total number of tokens   1,183,512     7,483,837
/corpora/Scania/Scania1994
/corpora/Scania/Scania1996
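
Token and type figures like those in the Scania1998 tables can be computed along the following lines. This is only a sketch, assuming whitespace tokenisation and case-sensitive types; how the figures in this entry were actually produced is not documented here, so results may differ. The function name and the usage comment are hypothetical.

```python
def count_tokens_and_types(lines):
    """Count running tokens and distinct types, splitting on whitespace."""
    tokens = 0
    types = set()
    for line in lines:
        for word in line.split():
            tokens += 1
            types.add(word)
    return tokens, len(types)


# Hypothetical usage (the encoding is an assumption):
# with open("ScaniaOM.txt", encoding="latin-1") as handle:
#     print(count_tokens_and_types(handle))
```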

Invandrartidningen
/corpora/Invandrartidningen
This document is: http://stp.ling.uu.se/cgi-bin/joerg/faq/stp?file=6