Croatian Language Corpus

The Croatian Language Corpus (CLC; Croatian: Hrvatski jezični korpus, HJK) is a corpus of Croatian compiled at the Institute of Croatian Language and Linguistics (IHJJ).

Background

The CLC was initially funded from May 2005 as a sub-project of the research program Riznica (Croatian Language Repository) by the Ministry of Science, Education, and Sports of the Republic of Croatia (MZOŠ) (project no. 0212010). In a second development phase, beginning in 2007, the further extension and development of the CLC was embedded within the research program The Croatian Language Repository (CLR), awarded by the MZOŠ (cf. Ćavar and Brozović Rončević, 2012^[1]). Because the CLR is a research program (PI Dunja Brozović Rončević) comprising numerous independent research projects that make use of the CLC, the corpus is developed mainly as a by-product of those projects. Currently, Dunja Brozović Rončević and Damir Ćavar are in charge of the corpus development.

Goals

One of the main goals of the CLC project is to create a publicly available Croatian corpus that is annotated on multiple levels: lemmatized, morphologically segmented and morpho-syntactically annotated, phonemically transcribed and syllabified, and syntactically parsed. While the current version of the corpus provides resources in standard Croatian, several corpora covering different developmental phases of Croatian are being created as well, including digitizations of manuscripts and Croatian dictionaries.

Format and Availability

From the outset, the collected and digitized texts in the CLC have been annotated using the Text Encoding Initiative (TEI) P5 XML standard. Approximately 90 million tokens are currently available in TEI P5 XML format. The corpus can be accessed online via the Philologic^[2] interface (see The ARTFL Project,^[3] Department of Romance Languages and Literatures, The University of Chicago). It is partitioned into various virtual sub-corpora, and custom sub-corpus definitions can be provided on demand.
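As a rough illustration of what working with TEI P5 XML can look like, the following Python sketch reads word tokens and their lemmas from a small TEI document. The sample document, its contents, and the function name read_tokens are invented for this example; the use of w elements with a lemma attribute is an assumption about one possible TEI encoding, not a description of the actual CLC markup.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample, invented for illustration only.
# It does not reproduce any actual CLC file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Hypothetical sample</title></titleStmt>
      <publicationStmt><p>Illustration only</p></publicationStmt>
      <sourceDesc><p>Invented for this example</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        <s><w lemma="dobar">Dobar</w> <w lemma="dan">dan</w><pc>.</pc></s>
      </p>
    </body>
  </text>
</TEI>"""


def read_tokens(xml_text):
    """Return (word form, lemma) pairs for every <w> element in the document."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma")) for w in root.iterfind(".//tei:w", TEI_NS)]


tokens = read_tokens(SAMPLE)
for form, lemma in tokens:
    print(form, lemma)
```

Punctuation is kept in separate pc elements in this sample, so a token count based on w elements alone excludes it; whether the CLC token figures count punctuation is not stated in the source.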

Content

The CLC is assembled from selected Croatian texts covering various functional domains and genres. It includes literature and other written sources from the period in which the standardization of the Croatian language began to take its final shape, i.e. from the second half of the 19th century onward.

The CLC consists of:

fundamental Croatian literature (e.g. novels, short stories, drama, poetry)
non-fiction
scientific publications from various domains and university textbooks
school books
literature translated by outstanding Croatian translators
online journals and newspapers
books from the pre-standardization period of Croatian, adapted to present-day standard Croatian

Cooperation

The realization of the CLC was made possible in cooperation with:

Školska knjiga d.d.
Croatian Academy of Sciences and Arts (HAZU)
Stoljeća hrvatske književnosti, Matica hrvatska

References

External links

Croatian Language Corpus (CLC) website and Philologic interface
(Croatian) Croatian National Corpus, another Croatian corpus by the Institute of Linguistics of the Faculty of Humanities and Social Sciences, University of Zagreb


This article is issued from Wikipedia - version of the 10/9/2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.