In:Mathematical Modelling in Linguistics and Text Analysis: Theory and applications
Edited by Adam Pawłowski, Sheila Embleton, Jan Mačutek and Aris Xanthos
[Current Issues in Linguistic Theory 370] 2025
pp. 90–103
Distribution of dependency tags in different text types in Czech
Published online: 13 October 2025
https://doi.org/10.1075/cilt.370.08che
Abstract
The objective of this chapter is to examine how dependency tags are distributed across various genres in
the Czech language. Specifically, our focus is on identifying the similarities and differences in these distributions among
different types of text. This study utilizes data from SYN2020, a large balanced corpus of contemporary written Czech
comprising 100 million words. The findings indicate that the frequencies of dependency tags follow a power-law distribution,
represented by the equation y = ax⁻ᵇ. When analyzing the values of the parameters
a and b, a distinct pattern emerges that differentiates the genres. Genres within
fiction, such as poetry, drama, novels, and short stories, typically exhibit lower values for a and
b, whereas non-fiction genres, including administrative texts and professional literature, demonstrate
higher values. Journalistic texts, like newspapers and leisure magazines, fall in between fiction and non-fiction literature
in terms of these parameter values. Consequently, comparing these parameters appears to be an effective method for conducting
stylometric research.
Keywords: stylometry, syntax, genre, Czech, corpus
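The power-law fit described in the abstract can be sketched as follows. This is a minimal illustration, not the chapter's actual procedure: the tag counts below are invented for demonstration (the study's data come from SYN2020), and the parameters a and b of y = ax⁻ᵇ are estimated here by ordinary least squares on log-transformed values, one common way to fit such a curve.

```python
import math

# Hypothetical rank-frequency counts for dependency tags (illustrative only;
# these are NOT the chapter's SYN2020 figures).
freqs = [5000, 2400, 1500, 1100, 850, 700, 590, 510, 450, 400]
ranks = list(range(1, len(freqs) + 1))

# Fit y = a * x**(-b) via linear regression in log-log space:
# log y = log a - b * log x, so the slope is -b and the intercept is log a.
xs = [math.log(x) for x in ranks]
ys = [math.log(y) for y in freqs]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
b = -slope                              # power-law exponent
a = math.exp(mean_y - slope * mean_x)   # scale parameter

print(f"a = {a:.1f}, b = {b:.3f}")
```

Comparing the fitted (a, b) pairs across text samples is then the kind of genre-level comparison the chapter reports: per the abstract, fiction tends toward lower values of both parameters, non-fiction toward higher ones, with journalism in between.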
Article outline
- 1. Introduction
- 2. Dependency grammar and dependency tags
- 3. Material and methodology
- 4. Results and discussions
- 5. Conclusion
