CapekDraCor database and some aspects of quantitative linguistic analysis of the Čapek brothers’ plays

Pořízka, Petr

doi:10.1075/cilt.370.15por

In: Mathematical Modelling in Linguistics and Text Analysis: Theory and applications
Edited by Adam Pawłowski, Sheila Embleton, Jan Mačutek and Aris Xanthos
[Current Issues in Linguistic Theory 370] 2025
pp. 173–190

Petr Pořízka | Palacký University

Published online: 13 October 2025

Abstract

This chapter is a methodological contribution on the quantitative analysis of plays. We focus on the plays of the Čapek brothers and introduce a new data source for their linguistic analysis: the CapekDraCor database. Texts are segmented into two sublayers: (1) text (character dialogues) and (2) metatext, indexed by type (situational or authorial comments, stage directions). We examine the significance of adequate text segmentation and its impact on quantitative analysis by comparing selected phenomena (proper names, the keyword "robot") in two corpora that process text differently: the Čapek corpus in the Czech National Corpus and the CapekDraCor database. We discuss the possibilities of quantitative analysis of plays, treating the literary character as a central element. We describe the characteristics and specifics of plays as a genre with a multi-layered structure that oscillates between written and spoken language; this duality is further emphasized by the frequency distribution of word classes within the plays. Finally, we perform an illustrative keyword analysis, complemented by a sentiment analysis of the text and the use of a specificity score, for the plays The White Disease (Bílá nemoc) and R.U.R.
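
The effect of segmentation on a simple keyword count can be illustrated with a minimal sketch. It assumes TEI-style markup in which speeches are wrapped in `sp` (with `speaker` and `p` children) and metatext in `stage`; the miniature fragment, the function names `split_layers` and `count_word`, and the resulting counts are invented for illustration and do not reproduce the chapter's data or pipeline.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature fragment, invented for illustration only.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="#alpha">
    <speaker>ALPHA</speaker>
    <p>The robot is ready.</p>
    <stage>A robot enters from the left.</stage>
  </sp>
  <stage>Curtain. The robot remains on stage.</stage>
  <sp who="#beta">
    <speaker>BETA</speaker>
    <p>Which robot?</p>
  </sp>
</body>"""


def split_layers(xml_text):
    """Return (speech, metatext) as two lists of strings.

    Simplification: stage directions nested inside a <p> would still be
    counted as speech here; a real pipeline would need to handle that case.
    """
    root = ET.fromstring(xml_text)
    speech = [
        "".join(p.itertext()).strip()
        for sp in root.iterfind(".//tei:sp", TEI_NS)
        for p in sp.iterfind("tei:p", TEI_NS)
    ]
    metatext = [
        "".join(stage.itertext()).strip()
        for stage in root.iterfind(".//tei:stage", TEI_NS)
    ]
    return speech, metatext


def count_word(texts, word):
    """Count case-insensitive whole-word occurrences of `word` in `texts`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in texts)


speech, metatext = split_layers(SAMPLE)
in_speech = count_word(speech, "robot")
in_metatext = count_word(metatext, "robot")
print(f"dialogue: {in_speech}, metatext: {in_metatext}")
```

In this toy fragment, a count over unsegmented text would mix occurrences from dialogue with those from stage directions, which is the kind of distortion that layer-aware segmentation is meant to avoid.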

Keywords: computational literary studies, drama, CapekDraCor, network analysis, keywords, specificity score

Article outline

1. Introduction and state of the art
2. CapekDraCor database
3. Possibilities of quantitative analyses of dramas
4. Genres of literary works by Karel Čapek
5. Characteristics and specifics of dramatic texts
6. Content analysis and prominent units (keywords)
7. Sentiment analysis of keywords
8. Specificity score
9. Conclusion
Notes
References

References

Čech, Radek. 2015. The development of thematic concentration of text in Karel Čapek’s travel books. Czech and Slovak Linguistic Review 2015(1). 8–21.

Čech, Radek. 2016. Tematická koncentrace textu v češtině. Prague: ÚFAL.

Čermák, František et al. 2007. Capek: korpus pouze vlastních textů Karla Čapka. Prague: ÚČNK. Available at: [URL]

Čermák, František (ed.). 2007. Slovník Karla Čapka. Prague: NLN.

Cvrček, Václav, František Čermák & Michal Křen. 2007. Statistické aspekty jazyka Karla Čapka, zvláště jeho lexikonu. In František Čermák (ed.), Slovník Karla Čapka, 671–690. Prague: NLN.

Davidová Glogarová, Jana, Radek Čech & Jaroslav David. 2013. Tematické charakteristiky textu — kvantitativní analýza publicistiky Jaroslava Durycha, Ladislava Jehličky a Karla Čapka. In Jaroslav David, Radek Čech, Jana Davidová Glogarová, Lucie Radková & Hana Šústková (eds.), Slovo a text v historickém kontextu — perspektivy historickosémantické analýzy jazyka, 62–84. Brno: Host.

DraCor Drama Corpora Project [database]. Available at: [URL]

Fischer, Frank et al. 2019. Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on European Drama. In Proceedings of Digital Humanities 2019: “Complexities” (DH2019), Utrecht University. URL: [URL]

Fořt, Bohumil. 2008. Literární postava. Vývoj a aspekty naratologických zkoumání. Prague: Ústav pro českou literaturu AV ČR.

Kubát, Miroslav. 2016. Kvantitativní analýza žánrů. Ostrava: Ostravská univerzita.

Lafon, Pierre. 1980. Sur la variabilité de la fréquence des formes dans un corpus. Mots 1. 127–165.

Leskovec, Jure, Anand Rajaraman & Jeffrey D. Ullman. 2014. Mining of massive datasets. URL: [URL].

Libovický, Jindřich. 2016. KER — Keyword extractor. [software] Available at: [URL]

Machálek, Tomáš. 2014. KonText — rozhraní pro vyhledávání v korpusech. [software] FF UK, Prague. Available at: [URL]

Mačutek, Ján, Michaela Koščová & Radek Čech. 2016. Lexical compactness across genres in works by Karel Čapek. In Damon Mayaffre, Céline Poudat, Laurent Vanni, Véronique Magri & Peter Follette (eds.), Statistical analysis of textual data, 825–832. Nice: University Nice Sophia Antipolis.

Matlach, Vladimír, Miroslav Kubát & Radek Čech. 2014. QUITA — Quantitative Text Analyzer [software]. Olomouc. Available at: [URL]

MATTR v2.0 (Moving-Average Type-Token Ratio) [software]. Available at: [URL]

NLTK: Natural Language Toolkit [software]. Available at: [URL]

Pořízka, Petr. 2019. On possibilities and methods of analysis of thematic expressions in spoken texts. Jazykovedný časopis 70(2). 469–480.

Pořízka, Petr. 2023a. CapekDraCor: A new contribution to the European Programmable Drama Corpora. Jazykovedný časopis 74(1). 244–253.

Pořízka, Petr. 2023b. The function of proper nouns in quantitative analysis of dramas: A case study of Karel Čapek’s plays. In Urszula Bijak, Paweł Swoboda & Justyna B. Walkowiak (eds.), Onomastics in interaction with other branches of science. Volume 3: General and applied onomastics. Literary onomastics. Chrematonomastics. Reports, 351–379. Cracow: Jagellonian University Press.

Scott, Mike & Christopher Tribble. 2006. Textual patterns. Key words and corpus analysis in language education. Amsterdam: Benjamins.

Textometrie project: TXM (version 0.8.1) [software]. Available at: [URL]

Todorov, Tzvetan. 1977. The poetics of prose. Ithaca, NY: Cornell University Press.