Towards better language representation in Natural Language Processing: A multilingual dataset for text-level Grammatical Error Correction

This paper introduces MultiGEC, a dataset for multilingual Grammatical Error Correction (GEC) in twelve European languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. MultiGEC differs from previous GEC datasets in that it covers several underrepresented languages, which we argue should be included in the resources used to train models for Natural Language Processing tasks that, like GEC itself, have implications for Learner Corpus Research and Second Language Acquisition. Beyond its multilingual coverage, the novelty of the MultiGEC dataset is that it consists of full texts (typically learner essays) rather than individual sentences, making it possible to train systems that take a broader context into account. The dataset was built for MultiGEC-2025, the first shared task in multilingual text-level GEC, but it remains accessible after the task's competitive phase, serving as a resource for training new error correction systems and for cross-lingual GEC studies.

Keywords: learner corpora, grammatical error correction, multilingual corpora, Matthew effect, MultiGEC shared task

Article outline

1. Introduction
2. Data
- 2.1 Czech
- 2.2 English
- 2.3 Estonian
- 2.4 German
- 2.5 Greek
- 2.6 Icelandic
- 2.7 Italian
- 2.8 Latvian
- 2.9 Russian
- 2.10 Slovene
- 2.11 Swedish
- 2.12 Ukrainian
3. Conclusions and future outlook
Open data badge and data availability statement
Notes
References

References (38)

Alsufieva, A., Kisselev, O., & Freels, S. (2012). Results 2012: Using flagship data to develop a Russian learner corpus of academic writing. Russian Language Journal, 62(1), 79–105.

Arhar Holdt, Š., Gantar, P., Bon, M., Gapsa, M., Lavrič, P., & Klemen, M. (2023). Dataset for evaluation of Slovene spell- and grammar-checking tools Šolar-Eval 1.0. (Slovenian language resource repository CLARIN.SI). [URL]

Arhar Holdt, Š., & Kosem, I. (2024). Šolar, the developmental corpus of Slovene. Language Resources and Evaluation, 1–27.

Arnardóttir, Þ., Xu, X., Guðmundsdóttir, D., Stefánsdóttir, L., & Ingason, A. (2021). Creating an Error Corpus: Annotation and Applicability. In Proceedings of CLARIN 2021 Annual Conference (pp. 59–63).

Bol, T., de Vaan, M., & van de Rijt, A. (2018). The Matthew effect in science funding. Proceedings of the National Academy of Sciences, 115(19), 4887–4890.

Boyd, A. (2018). Using Wikipedia edits in low resource grammatical error correction. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text (pp. 79–84). Association for Computational Linguistics.

Boyd, A., Hana, J., Nicolas, L., Meurers, D., Wisniewski, K., Abel, A., Schöne, K., Štindlová, B., & Vettori, C. (2014). The MERLIN corpus: Learner language and the CEFR. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14) (pp. 1281–1288). European Language Resources Association (ELRA).

Council of Europe. (2020). Common European Framework of Reference for Languages: Learning, teaching, assessment. Companion volume with new descriptors. Council of Europe Publishing.

Darģis, R., Auziņa, I., Kaija, I., Levāne-Petrova, K., & Pokratniece, K. (2022). LaVA – Latvian Language Learner corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 727–731).

Darģis, R., Auziņa, I., Levāne-Petrova, K., & Kaija, I. (2020). Quality focused approach to a learner corpus development. In Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 392–396).

Davis, C., Caines, A., Andersen, Ø., Taslimipoor, S., Yannakoudakis, H., Yuan, Z., Bryant, C., Rei, M., & Buttery, P. (2024). Prompting open-source and commercial language models for grammatical error correction of English learner text. In Findings of the Association for Computational Linguistics: ACL 2024 (pp. 11952–11967). Association for Computational Linguistics.

Ducel, F., Fort, K., Lejeune, G., & Lepage, Y. (2022). Do we name the languages we study? The #BenderRule in LREC and ACL articles. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 564–573). European Language Resources Association.

Gantar, P., Bon, M., Gapsa, M., & Arhar Holdt, Š. (2023). Šolar-Eval: Evalvacijska množica za strojno popravljanje jezikovnih napak v slovenskih besedilih. Jezik in Slovstvo, 68(4), 89–108.

Glišić, I., & Ingason, A. K. (2022). The Nature of Icelandic as a second language: An insight from the Learner Error Corpus for Icelandic. In Proceedings of the CLARIN Annual Conference (pp. 23–33).

Godfroid, A., & Andringa, S. (2023). Uncovering sampling biases, advancing inclusivity, and rethinking theoretical accounts in Second Language Acquisition: Introduction to the special issue SLA for all? Language Learning, 73(4), 981–1002.

Hammarstedt, M., Schumacher, A., Borin, L., & Forsberg, M. (2022). Sparv 5 user manual (Tech. Rep.). Språkbanken Text.

Ingason, A. K., Stefánsdóttir, L. B., Arnardóttir, Þ., & Xu, X. (2021). Icelandic Error Corpus (IceEC) Version 1.1. (CLARIN-IS).

Ingason, A. K., Stefánsdóttir, L. B., Arnardóttir, Þ., Xu, X., Glišić, I., & Guðmundsdóttir, D. (2022). The Icelandic L2 Error Corpus (IceL2EC) 1.3 (22.10). (CLARIN-IS).

Masciolini, A., Caines, A., De Clercq, O., Kruijsbergen, J., Kurfalı, M., Muñoz Sánchez, R., Volodina, E., & Östling, R. (2025a). The MultiGEC-2025 shared task on multilingual grammatical error correction at NLP4CALL. In R. Muñoz Sánchez, D. Alfter, J. Kallas, & E. Volodina (Eds.), Proceedings of the 14th Workshop on Natural Language Processing for Computer Assisted Language Learning. Tallinn, Estonia: University of Tartu. [URL]

Masciolini, A., Caines, A., De Clercq, O., Kruijsbergen, J., Kurfalı, M., Muñoz Sánchez, R., … Zesch, T. (2025b). An overview of grammatical error correction for the twelve MultiGEC-2025 languages. GU-ISS Forskningsrapporter från Institutionen för svenska språket. Institution for Swedish, Multilingualism, Language Technology; University of Gothenburg. [URL]

Merton, R. K. (1968). The Matthew effect in science: The reward and communication systems of science are considered. Science, 159(3810), 56–63.

Náplava, J., Straka, M., Straková, J., & Rosen, A. (2022). Czech grammar error correction with a large and diverse corpus. Transactions of the Association for Computational Linguistics, 10(1), 452–467.

Nicholls, D., Caines, A., & Buttery, P. (2024). The Write & Improve Corpus 2024: Error-annotated and CEFR-labelled essays by learners of English. Cambridge University Press Assessment.

Palma Gomez, F., & Rozovskaya, A. (2024). Multi-reference benchmarks for Russian grammatical error correction. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1253–1270). Association for Computational Linguistics.

Perc, M. (2014). The Matthew effect in empirical data. Journal of The Royal Society Interface, 11(98), 20140378.

Rosen, A., Hana, J., Hladká, B., Jelínek, T., Škodová, S., & Štindlová, B. (2020). Compiling and annotating a learner corpus for a morphologically rich language — CzeSL, a corpus of non-native Czech. Karolinum, Charles University Press.

Rozovskaya, A., & Roth, D. (2019). Grammar error correction in morphologically rich languages: The case of Russian. Transactions of the Association for Computational Linguistics, 7(1), 1–17.

Rudebeck, L., & Sundberg, G. (2021). SweLL correction annotation guidelines. (Tech. Rep.). GU-ISS Research report series, Department of Swedish, University of Gothenburg.

Sakaguchi, K., Napoles, C., Post, M., & Tetreault, J. (2016). Reassessing the goals of grammatical error correction: Fluency instead of grammaticality. Transactions of the Association for Computational Linguistics, 4(1), 169–182.

Šebesta, K., Bedřichová, Z., Šormová, K., Straňák, P., & Peterek, N. (2014). ROMi 1.0. (LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University).

Šebesta, K., Goláňová, H., Letafková, J., & Jelínková, B. (2016). AKCES 1. (LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University).

Søgaard, A. (2022). Should we ban English NLP for a year? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (pp. 5254–5260). Association for Computational Linguistics.

Syvokon, O., Nahorna, O., Kuchmiichuk, P., & Osidach, N. (2023). UA-GEC: Grammatical error correction and fluency corpus for the Ukrainian language. In Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP) (pp. 96–102). Association for Computational Linguistics.

Syvokon, O., & Romanyshyn, M. (2023). The UNLP 2023 Shared Task on Grammatical Error Correction for Ukrainian. In Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP) (pp. 132–137). Association for Computational Linguistics.

Tantos, A., Amvrazis, N., & Drakonaki, E. (2023). Greek Learner Corpus II (GLCII): Design and development of an online corpus for L2 Greek. Journal of Applied Linguistics, 36(1), 125–150.

Volodina, E., Granstedt, L., Matsson, A., Megyesi, B., Pilán, I., Prentice, J., … Wirén, M. (2019). The SweLL language learner corpus: From design to annotation. Northern European Journal of Language Technology (NEJLT), 6(1), 67–104.

Volodina, E., Granstedt, L., Matsson, A., Megyesi, B., Pilán, I., Prentice, J., … Wirén, M. (2022). SweLL-gold. Språkbanken Text. Distributed via SBX/CLARIN.

Wisniewski, K., Schöne, K., Nicolas, L., Vettori, C., Boyd, A., Meurers, D., … Hana, J. (2013). MERLIN: An online trilingual learner corpus empirically grounding the European Reference Levels in authentic learner data. In International Conference, ICT for Language Learning, 6th edition.