Author and register as sources of variation: A corpus-based study using elicited texts

Cvrček, Václav; Laubeová, Zuzana; Lukeš, David; Poukarová, Petra; Řehořková, Anna; Zasina, Adrian Jan

doi:10.1075/ijcl.19020.cvr

Article published in: International Journal of Corpus Linguistics
Vol. 25:4 (2020), pp. 461–488

Václav Cvrček | Charles University

Zuzana Laubeová | Charles University

David Lukeš | Charles University

Petra Poukarová | Charles University

Anna Řehořková | Charles University

Adrian Jan Zasina | Charles University

Published online: 27 October 2020

Abstract

This paper investigates the contribution of author/idiolect vs. register/type-of-text – as the most salient factors influencing the final shape of a text – towards explaining the variation observed in Czech texts. Since it is almost impossible to explore the effect of these factors on authentic data, we used elicited letters collected in a fully crossed experimental design (a representative sample of 200 authors × four elicitation scenarios serving as a proxy for register variation). The variation encompassed by the elicited texts is analyzed through the lens of a general-purpose multi-dimensional model of Czech. Using triangulation via three established statistical methods and one devised for the purpose of this study, we find that register matters a great deal, explaining 1.5 times as much variation overall as idiolect. This should be taken into account when designing research in sociolinguistics or variation studies in general.
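
To make the idea of comparing the two factors more concrete, the sketch below splits the variation in a fully crossed author × scenario table into an author share, a register share, and a residual, using a plain sums-of-squares (eta-squared style) decomposition. It is only an illustration under stated assumptions: the abstract does not name the statistical methods used, so this is not presented as the authors' procedure, and the data are simulated. The function names, the standard deviations, and the seed are all hypothetical.

```python
import random


def variance_shares(scores):
    """Split total variation into author, register, and residual shares.

    scores[a][r] is one score (e.g. on one dimension of a multi-dimensional
    model) for author a in scenario r; the design is fully crossed with
    exactly one text per cell.
    """
    n_authors = len(scores)
    n_registers = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_authors * n_registers)
    author_means = [sum(row) / n_registers for row in scores]
    register_means = [
        sum(scores[a][r] for a in range(n_authors)) / n_authors
        for r in range(n_registers)
    ]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_author = n_registers * sum((m - grand) ** 2 for m in author_means)
    ss_register = n_authors * sum((m - grand) ** 2 for m in register_means)
    ss_residual = ss_total - ss_author - ss_register
    return {
        "author": ss_author / ss_total,
        "register": ss_register / ss_total,
        "residual": ss_residual / ss_total,
    }


def simulate(n_authors=200, n_registers=4, sd_author=1.0,
             sd_register=1.2, sd_noise=1.0, seed=0):
    """Simulated scores (NOT the study's data): additive effects plus noise."""
    rng = random.Random(seed)
    author_fx = [rng.gauss(0, sd_author) for _ in range(n_authors)]
    register_fx = [rng.gauss(0, sd_register) for _ in range(n_registers)]
    return [[a + r + rng.gauss(0, sd_noise) for r in register_fx]
            for a in author_fx]


shares = variance_shares(simulate())
print(shares)
# Ratio of register share to author share on the simulated data
print(shares["register"] / shares["author"])
```

With only four scenarios, the register share in such a simulation depends heavily on the four effects that happen to be drawn, so the printed ratio should not be read as a reproduction of the 1.5 figure reported in the abstract.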

Keywords: variation, idiolect, register, multi-dimensional analysis, Czech

Article outline

1. Introduction
2. Sources of variation in linguistic research
3. Methods and data
- 3.1 General-purpose model of language variability
- 3.2 CPACT elicited data
4. Results
- 4.1 Effect size
- 4.2 Distances between texts
5. Discussion
6. Conclusion
Notes
References

