In:Mathematical Modelling in Linguistics and Text Analysis: Theory and applications
Edited by Adam Pawłowski, Sheila Embleton, Jan Mačutek and Aris Xanthos
[Current Issues in Linguistic Theory 370] 2025
pp. 90–103
Distribution of dependency tags in different text types in Czech
Published online: 13 October 2025
https://doi.org/10.1075/cilt.370.08che
Abstract
The objective of this chapter is to examine how dependency tags are distributed across various genres in
the Czech language. Specifically, our focus is on identifying the similarities and differences in these distributions among
different types of text. This study utilizes data from SYN2020, a large balanced corpus of contemporary written Czech
comprising 100 million words. The findings indicate that the frequencies of dependency tags follow a power-law distribution,
represented by the equation y = ax⁻ᵇ. When analyzing the values of the parameters
a and b, a distinct pattern emerges that differentiates the genres. Genres within
fiction, such as poetry, drama, novels, and short stories, typically exhibit lower values for a and
b, whereas non-fiction genres, including administrative texts and professional literature, demonstrate
higher values. Journalistic texts, like newspapers and leisure magazines, fall in between fiction and non-fiction literature
in terms of these parameter values. Consequently, comparing these parameters appears to be an effective method for conducting
stylometric research.
Keywords: stylometry, syntax, genre, Czech, corpus
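The power-law fit described in the abstract can be sketched as follows. This is a minimal illustration, not the chapter's actual procedure: the tag counts below are invented for demonstration (the study's data come from SYN2020), and the parameters a and b of y = ax⁻ᵇ are estimated here by ordinary least squares on log-transformed values, one common way to fit such a curve.

```python
import math

# Hypothetical rank-frequency counts for dependency tags (illustrative only;
# these are NOT the chapter's SYN2020 figures).
freqs = [5000, 2400, 1500, 1100, 850, 700, 590, 510, 450, 400]
ranks = list(range(1, len(freqs) + 1))

# Fit y = a * x**(-b) via linear regression in log-log space:
# log y = log a - b * log x, so the slope is -b and the intercept is log a.
xs = [math.log(x) for x in ranks]
ys = [math.log(y) for y in freqs]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
b = -slope                              # power-law exponent
a = math.exp(mean_y - slope * mean_x)   # scale parameter

print(f"a = {a:.1f}, b = {b:.3f}")
```

Comparing the fitted (a, b) pairs across text samples is then the kind of genre-level comparison the chapter reports: per the abstract, fiction tends toward lower values of both parameters, non-fiction toward higher ones, with journalism in between.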
Article outline
- 1. Introduction
- 2. Dependency grammar and dependency tags
- 3. Material and methodology
- 4. Results and discussions
- 5. Conclusion
