International Corpus of English

The International Corpus of English (ICE) is a set of corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.

History

The project began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-three research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989. For most participating countries, the ICE project is stimulating the first systematic investigation of the national variety. To ensure compatibility among the component corpora, each team is following a common corpus design, as well as a common scheme for grammatical annotation.

Description

Each corpus contains one million words in 500 texts of 2,000 words each, following the sampling methodology used for the Brown Corpus. Unlike Brown or the Lancaster-Oslo-Bergen (LOB) Corpus (or indeed mega-corpora such as the British National Corpus), however, ICE corpora derive the majority of their texts from spoken data.

Orthographically transcribed spoken English makes up 60% (600,000 words) of each ICE corpus. The father of the project, Sidney Greenbaum, insisted on the primacy of the spoken word, following Randolph Quirk and Jan Svartvik's collaboration on the original London-Lund Corpus (LLC). This emphasis on word-for-word transcription distinguishes ICE from many other corpora, including those that contain paraphrased records of, for example, parliamentary or legal proceedings.
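The figures quoted above can be checked with a little arithmetic. The following is a minimal sketch in Python, assuming every text has the same nominal length of 2,000 words; the variable names are illustrative and the per-category text counts are derived from the stated percentages, not taken from the ICE documentation.

```python
# Figures stated in the description of the ICE corpus design.
TOTAL_WORDS = 1_000_000   # words per component corpus
TEXT_COUNT = 500          # texts per component corpus
SPOKEN_PERCENT = 60       # share of transcribed spoken English

# Nominal length of one text.
words_per_text = TOTAL_WORDS // TEXT_COUNT

# Split the word total into spoken and written material.
spoken_words = TOTAL_WORDS * SPOKEN_PERCENT // 100
written_words = TOTAL_WORDS - spoken_words

# Assumption: all texts share the same nominal length, so the
# word split translates directly into a split of text counts.
spoken_texts = spoken_words // words_per_text
written_texts = TEXT_COUNT - spoken_texts

print(f"words per text: {words_per_text}")
print(f"spoken:  {spoken_words} words in {spoken_texts} texts")
print(f"written: {written_words} words in {written_texts} texts")
```

Under that assumption, the 60% spoken share corresponds to 300 of the 500 texts, leaving 200 texts of written English.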

The British component of ICE, ICE-GB, is fully parsed using a detailed phrase structure grammar based on Quirk et al.,^[1] and the analyses have been thoroughly checked and completed. The annotation comprises part-of-speech tagging and parsing of the entire corpus. The resulting treebank can be searched and explored with the ICE Corpus Utility Program (ICECUP). More information is in the handbook.^[2]

To ensure compatibility between the individual corpora in ICE, each team is following a common corpus design, as well as a common scheme for grammatical annotation.^[3]

Participants

The current participants are listed below (* = corpus available):

Australia
Cameroon
Canada*
East Africa (Kenya, Malawi, Tanzania)*
Fiji
Ghana
Great Britain* (parsed)
Hong Kong*
India*
Ireland*
Jamaica*
Kenya
Malta
Malaysia
New Zealand*
Nigeria
Pakistan
Philippines*
Sierra Leone
Singapore*
South Africa
Sri Lanka
Trinidad and Tobago
USA

References

[1] Quirk, Randolph, Greenbaum, Sidney, Leech, Geoffrey and Svartvik, Jan (1985). A Comprehensive Grammar of the English Language. London: Longman.
[2] Nelson, Gerald, Wallis, Sean, and Aarts, Bas (2002). Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.
[3] The International Corpus of English website.

External links

The International Corpus of English website


This article is issued from Wikipedia (version of 10/9/2015). The text is available under the Creative Commons Attribution/Share Alike license, but additional terms may apply to the media files.