Automatic dialect classification of the Southern Dutch dialects

Sung, Ho Wang Matthew

doi:10.1075/nb.00043.sun

Article published in: Linguistics in the Netherlands 2025
Edited by Kristel Doreleijers, Remco Knooihuizen and Eva van Lier
[Nota Bene 2:2] 2025
pp. 448–473

Ho Wang Matthew Sung | Leiden University

Available under the Creative Commons Attribution (CC BY) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Open Access publication of this article was funded through a Transformative Agreement with Leiden University.

Published online: 31 October 2025

https://doi.org/10.1075/nb.00043.sun

Abstract

Since the 1980s, computational methods have been applied in dialectology, an approach known as dialectometry. Many of these methods were designed for data from dialect surveys or linguistic atlases, which typically consist of elicited items uttered in isolation. Scholars have since turned to corpus-based approaches to find dialect patterns in more naturalistic speech, which can tell us more about the context and magnitude of the variants used.

Transcriptions of spontaneous speech pose challenges for traditional approaches to automatic dialect classification: the transcriptions are too numerous to go through manually; they are not systematic word lists; and extracting only the frequencies of known features risks overlooking features that have not yet been discovered.

This paper employs topic modelling to automatically detect dialect groups in the southern Dutch dialects. This method is data-driven and can overcome the issues mentioned above. The results show that the southern Dutch dialects can be divided into two to four major groups, which coincide with the traditional classification.
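As a rough illustration of the general idea only, and not the article's actual pipeline, the sketch below factorises a location-by-token count matrix into two topics and assigns each location to its dominant topic. It assumes non-negative matrix factorisation with multiplicative updates as the topic model; the location names, token labels and counts are invented toy values, and `nmf` and `assign_groups` are hypothetical helper names.

```python
import random

# Hypothetical toy data: rows are recording locations, columns are token counts.
LOCATIONS = ["loc_A1", "loc_A2", "loc_B1", "loc_B2"]
TOKENS = ["var_a1", "var_a2", "var_b1", "var_b2", "shared"]
COUNTS = [
    [5, 4, 0, 0, 3],
    [4, 5, 0, 0, 3],
    [0, 0, 5, 4, 3],
    [0, 0, 4, 5, 3],
]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def nmf(V, k, iters=500, seed=0, eps=1e-9):
    """Approximate V (n x m) as W (n x k) times H (k x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return W, H


def assign_groups(V, k, **kwargs):
    """Return, for each row of V, the index of its dominant topic."""
    W, H = nmf(V, k, **kwargs)
    # Fold each topic's total token mass into W so topics are comparable.
    scale = [sum(row) for row in H]
    groups = []
    for row in W:
        weights = [w * s for w, s in zip(row, scale)]
        groups.append(max(range(k), key=weights.__getitem__))
    return groups


groups = assign_groups(COUNTS, k=2)
for loc, g in zip(LOCATIONS, groups):
    print(loc, "-> group", g)
```

Raising `k` to 3 or 4 would correspond to asking for a finer division. Topic indices are arbitrary labels, so only the grouping of locations is meaningful, not the group numbers themselves.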

Keywords: dialectometry, natural language processing, topic modelling, southern Dutch dialects, Gesproken Corpus van de zuidelijk-Nederlandse Dialecten, GCND, dialect classification, Flemish, Brabantic, Limburgish

Article outline

1. Introduction
2. Traditional dialect classification
- 2.1 Dialect classification with characteristic features and isoglosses
- 2.2 Characteristics and problems with the isogloss approach
3. Quantitative approaches to dialect classification
- 3.1 An aggregate perspective to dialect variation
- 3.2 Atlas-based methods in dialectometry
- 3.3 Corpus-based approaches in dialectometry
4. Challenges and research questions
- 4.1 Challenges to corpus-based dialectometry
- 4.2 Research questions
5. Data and methodology
- 5.1 Data
- 5.2 Methodology
  - 5.2.1 Pre-processing of the data
  - 5.2.2 Topic modelling
  - 5.2.3 Evaluation criteria and model comparison
6. Results
- 6.1 Model comparison
- 6.2 Dialect classification
  - 6.2.1 Two-group division
  - 6.2.2 Three-group division
  - 6.2.3 Four-group division
7. Discussion and conclusion
Supplementary material
Acknowledgements
Notes
References

References

Anderwald, Lieselotte & Benedikt Szmrecsanyi. 2009. Corpus linguistics and dialectology. In Anke Lüdeling & Merja Kytö (eds.), Corpus linguistics: An international handbook, vol. 2, 1126–1139.

Barbiers, Sjef, Hans Bennis, Gunther De Vogelaer, Magda Devos & Margreet van der Ham. 2005. Syntactische atlas van de Nederlandse dialecten, vol. 1. Amsterdam: Amsterdam University Press.

Barbiers, Sjef, Johan van der Auwera, Hans Bennis, Eefje Boef, Gunther De Vogelaer & Margreet van der Ham. 2008. Syntactische atlas van de Nederlandse dialecten, vol. 2. Amsterdam: Amsterdam University Press.

Blei, David M. 2012. Probabilistic topic models. Communications of the ACM 55(4). 77–84.

Blei, David M., Andrew Y. Ng & Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3. 993–1022.

Borg, Ingwer & Patrick J. F. Groenen. 2005. Modern multidimensional scaling: Theory and applications. New York: Springer.

Breitbarth, Anne, Melissa Farasyn, Anne-Sophie Ghyselen & Jacques Van Keymeulen. 2018. Het Gesproken Corpus van de zuidelijk-Nederlandse Dialecten. Handelingen Koninklijke Zuid-Nederlandse Maatschappij voor Taal- en Letterkunde en Geschiedenis 72.

Breitbarth, Anne, Melissa Farasyn, Anne-Sophie Ghyselen, Lien Hellebaut, Frederic Lamsens, Katrien Depuydt, Jesse de Does, Jan Niestadt & Koen Mertens. 2024. Gesproken Corpus van de zuidelijk-Nederlandse Dialecten. 1st release October 2024. Available at the Dutch Language Institute: [URL]

Chambers, Jack K. & Peter Trudgill. 1998. Dialectology (2nd edition). Cambridge: Cambridge University Press.

Ghyselen, Anne-Sophie, Jacques Van Keymeulen, Melissa Farasyn, Lien Hellebaut & Anne Breitbarth. 2020. Het transcriptieprotocol van het Gesproken Corpus van de Nederlandse Dialecten (GCND). Bulletin de la Commission Royale de Toponymie & Dialectologie 92. 83–115.

Goebl, Hans. 1984. Dialektometrische Studien: Anhand italoromanischer, raetoromanischer und galloromanischer Sprachmaterialien aus AIS und ALF. (Beihefte zur Zeitschrift für romanische Philologie 191–193). Tübingen: Niemeyer.

Goebl, Hans. 2018. Dialectometry. In Charles Boberg, John Nerbonne & Dominic Watt (eds.), The handbook of dialectology, 123–142. New Jersey: Wiley-Blackwell.

Grootaers, Ludovic & Gesinus Kloeke. 1926. Handleiding bij het Noord- en Zuid-Nederlandsch dialectonderzoek: Met een kaart. ’s-Gravenhage: Martinus Nijhoff.

Grootendorst, Maarten. 2022. BERTopic: neural topic modeling with a class-based tf-idf procedure. arXiv preprint arXiv:2203.05794.

Heeringa, Wilbert. 2004. Measuring dialect pronunciation using Levenshtein distance. (PhD thesis, University of Groningen).

Kuparinen, Olli & Yves Scherrer. 2024. Corpus-based dialectometry with topic models. Journal of Linguistic Geography 12(1). 1–12.

Lameli, Alfred & Andreas Schönberg. 2023. A measure for linguistic coherence in spatial language variation. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), 133–141.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8). 707–710.

Manning, Christopher & Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge: MIT Press.

Nerbonne, John. 2011. Mapping aggregate variation. In Alfred Lameli, Roland Kehrein & Stefan Rabanus (eds.), An international handbook of linguistic variation, vol. 2: Language mapping, 476–501. Berlin, New York: De Gruyter Mouton.

Nerbonne, John & Peter Kleiweg. 2007. Toward a dialectological yardstick. Journal of Quantitative Linguistics 14(2–3). 148–166.

Orton, Harold. 1962. Survey of the English dialects: Introduction. Leeds: E. J. Arnold & Son.

Paatero, Pentti & Unto Tapper. 1994. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2). 111–126.

QGIS Development Team. 2025. QGIS geographic information system. Open Source Geospatial Foundation Project.

Ryckeboer, Hugo. 2013. A West Flemish dialect as a minority language in the north of France. In Frans Hinskens & Johan Taeldeman (eds.), An international handbook of linguistic variation, vol. 3: Dutch, 782–800. Berlin, Boston: De Gruyter Mouton.

Siewert, Janine, Yves Scherrer & Martijn Wieling. 2022. Low Saxon dialect distances at the orthographic and syntactic level. In Nina Tahmasebi, Syrielle Montariol, Andrey Kutuzov, Simon Hengchen, Haim Dubossarsky & Lars Borin (eds.), 3rd International Workshop on Computational Approaches to Historical Language Change (LChange) 2022, 119–124.

Spruit, Marco R., Wilbert Heeringa & John Nerbonne. 2009. Associations among linguistic levels. Lingua 119(11). 1624–1642.

Sung, Ho Wang Matthew & Jelena Prokić. 2024. Identification of dialect typicality and kernels. 12th International Conference on Language Variation in Europe (ICLaVE|12). (Oral Presentation)

Szmrecsanyi, Benedikt. 2013. Grammatical variation in British English dialects: A study in corpus-based dialectometry. Cambridge: Cambridge University Press.

Szmrecsanyi, Benedikt & Lieselotte Anderwald. 2018. Corpus-based approaches to dialect study. In Charles Boberg, John Nerbonne & Dominic Watt (eds.), The handbook of dialectology, 300–313. New Jersey: Wiley-Blackwell.

Taeldeman, Johan. 2001. De regenboog van de Vlaamse dialecten. In Johan Taeldeman, Magda Devos & Johan De Caluwe (eds.), Het taallandschap in Vlaanderen, 49–58. Ghent: Academia Press.

Taeldeman, Johan & Hermann Niebaum. 2013. History and development of Dutch dialect research. In Frans Hinskens & Johan Taeldeman (eds.), An international handbook of linguistic variation, vol. 3: Dutch, 13–35. Berlin, Boston: De Gruyter Mouton.

Trudgill, Peter. 1999. The dialects of England (2nd ed.). Oxford: Blackwell.

Van Keymeulen, Jacques, Anne Breitbarth, Anne-Sophie Ghyselen & Melissa Farasyn. 2020. Transcriptieproject ‘Stemmen uit het verleden’. Transcriptieprotocol. Available at: [URL]

Vanacker, Valeer F. & Georges De Schutter. 1967. Zuidnederlandse dialekten op de band. Taal en Tongval 19. 35–51.

Virpioja, Sami, Peter Smit, Stig-Arne Grönroos & Mikko Kurimo. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor baseline. ([URL])

Weijnen, Antonius A. 1966 (= 1958). Nederlandse dialectkunde. Van Gorcum.

Wiesinger, Peter. 1983. Die Einteilung der deutschen Dialekte. In Werner Besch, Ulrich Knoop, Wolfgang Putschke & Herbert E. Wiegand (eds.), Dialektologie. 2. Halbband, 807–899. De Gruyter.