Review article published in: International Journal of Learner Corpus Research
Vol. 11:2 (2025) ► pp.309–335
Corpus report
Towards better language representation in Natural Language Processing
A multilingual dataset for text-level Grammatical Error Correction
Available under the Creative Commons Attribution (CC BY) 4.0 license.
For any use beyond this license, please contact the publisher at rights@benjamins.nl.
Open Access publication of this article was funded through a Transformative Agreement with University of Gothenburg.
Published online: 1 April 2025
https://doi.org/10.1075/ijlcr.24033.mas
Abstract
This paper introduces MultiGEC, a dataset for multilingual Grammatical Error Correction (GEC) in twelve European
languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian. MultiGEC
distinguishes itself from previous GEC datasets in that it covers several underrepresented languages, which we argue should be included in
resources used to train models for Natural Language Processing tasks that, like GEC itself, have implications for Learner Corpus Research and
Second Language Acquisition. Aside from multilingualism, the novelty of the MultiGEC dataset is that it consists of full texts (typically
learner essays) rather than individual sentences, making it possible to train systems that take a broader context into account. The dataset
was built for MultiGEC-2025, the first shared task in multilingual text-level GEC, but it remains accessible after the task's competitive phase,
serving as a resource to train new error correction systems and perform cross-lingual GEC studies.
Article outline
- 1. Introduction
- 2. Data
- 2.1 Czech
- 2.2 English
- 2.3 Estonian
- 2.4 German
- 2.5 Greek
- 2.6 Icelandic
- 2.7 Italian
- 2.8 Latvian
- 2.9 Russian
- 2.10 Slovene
- 2.11 Swedish
- 2.12 Ukrainian
- 3. Conclusions and future outlook
- Open data badge and data availability statement
- Notes
- References
