Using machine learning to automate data annotation in corpus linguistics: A case study with MacBERTh

Fonteyn, Lauren; Manjavacas, Enrique; De Regt, Jaleesa

doi:10.1075/ijcl.22088.fon

Article published in: International Journal of Corpus Linguistics
Vol. 30:3 (2025), pp. 296–315

Lauren Fonteyn | KNAW Meertens Institute

Enrique Manjavacas | Leiden University

Jaleesa De Regt | Leiden University

Available under the Creative Commons Attribution (CC BY) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Open Access publication of this article was funded through a Transformative Agreement with Meertens Institute.

Published online: 19 September 2025

Abstract

A wealth of linguistic data has been annotated by corpus linguists, and this extant annotated data can be used to automatically replicate and apply the linguist’s annotation scheme by means of machine learning models. This paper accompanies the release of documented code notebooks, which allow corpus linguists to use manually categorized examples or ‘training data’ as input for a predictive language model. By means of a case study of Early Modern English -ing forms, we describe how the predictive language model MacBERTh can be used to accurately replicate the manual data classification scheme employed in previous corpus linguistic studies. Additionally, we discuss how manual error analysis and post-correction may help improve the model’s output. By openly releasing the data and code used in this paper, we hope to stimulate the use of machine learning models such as MacBERTh in corpus linguistics.
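As a rough illustration of the evaluation and error-analysis step mentioned in the abstract, the sketch below compares predicted labels against manual annotations and collects the disagreements for post-correction. It is not the code from the released notebooks: the example phrases are invented, the label names (`gerund`, `participle`) are borrowed from the keywords as an assumption about the annotation scheme, the helper functions `per_label_scores` and `disagreements` are hypothetical, and the predictions are hard-coded stand-ins rather than output from MacBERTh.

```python
from collections import Counter

# Hypothetical annotated examples: (phrase, manual label, predicted label).
# The predicted labels are hard-coded stand-ins for a classifier's output.
EXAMPLES = [
    ("the writing of this letter", "gerund", "gerund"),
    ("he sat writing a letter", "participle", "participle"),
    ("by reading the book", "gerund", "participle"),
    ("a man walking home", "participle", "participle"),
]


def per_label_scores(examples):
    """Return {label: (precision, recall)} for manual vs. predicted labels."""
    true_pos = Counter()   # predictions that match the manual label
    predicted = Counter()  # how often each label was predicted
    actual = Counter()     # how often each label was assigned manually
    for _, gold, pred in examples:
        actual[gold] += 1
        predicted[pred] += 1
        if gold == pred:
            true_pos[gold] += 1
    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / actual[label] if actual[label] else 0.0
        scores[label] = (precision, recall)
    return scores


def disagreements(examples):
    """Return the examples whose prediction differs from the manual label."""
    return [ex for ex in examples if ex[1] != ex[2]]


if __name__ == "__main__":
    for label, (p, r) in per_label_scores(EXAMPLES).items():
        print(f"{label}: precision={p:.2f} recall={r:.2f}")
    for phrase, gold, pred in disagreements(EXAMPLES):
        print(f"check: {phrase!r} (manual={gold}, predicted={pred})")
```

Per-label precision and recall are reported separately here because a single accuracy figure can hide systematic confusion between two categories; the list of disagreements is what an annotator would review during post-correction.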

Keywords: machine learning, morphosyntax, gerund, participle, historical corpora

Article outline

1. Introduction
2. Approach
- 2.1 Picking the right model for the job: BERT vs. MacBERTh (and other variants)
- 2.2 Case study: -ing forms in Early Modern English
- 2.3 Data
- 2.4 Automating data classification
3. Results
4. New data, error analysis and post-correction
5. Conclusion
Notes
References
