In: Mathematical Modelling in Linguistics and Text Analysis: Theory and applications
Edited by Adam Pawłowski, Sheila Embleton, Jan Mačutek and Aris Xanthos
[Current Issues in Linguistic Theory 370] 2025
pp. 173–190
CapekDraCor database and some aspects of quantitative linguistic analysis of the Čapek brothers’ plays
Published online: 13 October 2025
https://doi.org/10.1075/cilt.370.15por
Abstract
This chapter is a methodological contribution on the quantitative analysis of plays.
Specifically, we focus on the plays of the Čapek brothers and introduce a new source of data for their linguistic analysis:
the CapekDraCor database. Texts are segmented into the relevant sublayers: (1) text (character dialogues) and
(2) metatext, indexed by type (situational or authorial comments, stage directions). The significance of adequate text
segmentation and its impact on quantitative analysis are examined by comparing selected phenomena (proper names, the keyword robot)
in two corpora with different text processing: the Čapek corpus in the Czech National Corpus and the CapekDraCor database.
We discuss the possibilities of quantitative analysis of plays, treating the literary character as a central element. We
describe the characteristics and specifics of plays as a genre with a multi-layered structure that oscillates between written
and spoken language; this duality is further emphasized by the frequency distribution of word classes within the plays. We
perform an illustrative keyword analysis, complemented by a sentiment analysis of the text and the use of a specificity score,
in the plays The White Disease (Bílá nemoc) and R.U.R.
Article outline
- 1. Introduction and state of the art
- 2. CapekDraCor database
- 3. Possibilities of quantitative analyses of dramas
- 4. Genres of literary works by Karel Čapek
- 5. Characteristics and specifics of dramatic texts
- 6. Content analysis and prominent units (keywords)
- 7. Sentiment analysis of keywords
- 8. Specificity score
- 9. Conclusion
