The IJS-ELAN Slovene-English Parallel Corpus

Erjavec, Tomaž

doi:10.1075/ijcl.7.1.01erj

Article published in: International Journal of Corpus Linguistics
Vol. 7:1 (2002), pp. 1–20

Tomaž Erjavec | Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia

Published online: 18 October 2002

https://doi.org/10.1075/ijcl.7.1.01erj

The paper presents an annotated parallel Slovene-English corpus developed within the scope of the EU ELAN project. The IJS-ELAN corpus was compiled to be a widely distributable dataset for language engineering and for translation and terminology studies. The corpus contains 1 million words from fifteen recent terminology-rich texts. It is sentence-aligned and word-tagged with context-disambiguated morphosyntactic descriptions and lemmas. These descriptions model simple feature structures whose structure is shared between Slovene and English. The corpus is encoded according to the Guidelines for Text Encoding and Interchange and is freely available on the Web for downloading. Additionally, IJS-ELAN can be accessed via a powerful Web concordancer.
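To make the idea of a morphosyntactic description as a simple feature structure more concrete, the following is a minimal Python sketch. It assumes a compact positional tag string in which the first character names the category and each later character gives one attribute value, with `-` marking an attribute that does not apply. The attribute table, the value codes, and the function name `decode_msd` are hypothetical illustrations and are not taken from the IJS-ELAN tagset itself.

```python
# Hypothetical, heavily simplified attribute table covering nouns only.
# Position 0 of a tag is the category; later positions are attribute values.
# None of these codes are claimed to match the actual IJS-ELAN tagset.
NOUN_ATTRIBUTES = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

CATEGORIES = {"N": ("noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional tag string into a flat feature structure (dict)."""
    if not msd:
        raise ValueError("empty tag")
    category, attributes = CATEGORIES[msd[0]]
    features = {"category": category}
    # Pair each remaining character with the attribute defined for its position.
    for code, (name, values) in zip(msd[1:], attributes):
        if code == "-":
            continue  # attribute not applicable for this language or word
        features[name] = values[code]
    return features


# The same table decodes a richly inflected tag and a sparser one.
slovene_like = decode_msd("Ncmsn")
english_like = decode_msd("Nc-s")
```

Because both tags are decoded against one attribute table, the resulting dictionaries share keys wherever an attribute applies, which is one way to read the claim that the feature structure is shared across the two languages.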

Keywords: parallel corpus, corpus encoding, tagging, concordancing
