Article published in: International Journal of Corpus Linguistics
Vol. 25:4 (2020), pp. 461–488
Author and register as sources of variation
A corpus-based study using elicited texts
Published online: 27 October 2020
https://doi.org/10.1075/ijcl.19020.cvr
Abstract
This paper investigates the contribution of author/idiolect vs. register/type-of-text – as the most salient factors
influencing the final shape of a text – towards explaining the variation observed in Czech texts. Since it is almost impossible to explore
the effect of these factors on authentic data, we used elicited letters collected in a fully crossed experimental design (a representative
sample of 200 authors × four elicitation scenarios serving as a proxy for register variation). The variation encompassed by the elicited
texts is analyzed through the lens of a general-purpose multi-dimensional model of Czech. Using triangulation via three established
statistical methods and one devised for the purpose of this study, we find that register matters a great deal, explaining 1.5 times as much
variation overall as idiolect. This should be taken into account when designing research in sociolinguistics or variation studies in
general.
Keywords: variation, idiolect, register, multi-dimensional analysis, Czech
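As a rough illustration of how variation can be partitioned in a fully crossed design of the kind the abstract describes, the sketch below computes classical eta-squared for two crossed factors with one text per cell. It is not the procedure used in the paper, whose methods are not specified on this page; the function name `eta_squared_crossed` and the toy scores are invented for the example and do not reproduce any reported result.

```python
from statistics import mean


def eta_squared_crossed(scores):
    """Eta-squared for two crossed factors with one observation per cell.

    scores[a][r] is a hypothetical dimension score of the text written by
    author a under elicitation scenario (register) r.
    Returns (eta2_author, eta2_register).
    """
    n_authors = len(scores)
    n_registers = len(scores[0])

    grand = mean(x for row in scores for x in row)
    author_means = [mean(row) for row in scores]
    register_means = [
        mean(row[r] for row in scores) for r in range(n_registers)
    ]

    # Total variation, then the share attributable to each factor.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_author = n_registers * sum((m - grand) ** 2 for m in author_means)
    ss_register = n_authors * sum((m - grand) ** 2 for m in register_means)

    return ss_author / ss_total, ss_register / ss_total


# Toy data invented for illustration only: 3 authors x 4 scenarios.
toy = [
    [1.0, 2.0, 4.0, 5.0],
    [2.0, 3.0, 5.0, 6.0],
    [0.0, 2.0, 3.0, 5.0],
]
eta2_author, eta2_register = eta_squared_crossed(toy)
print(f"author: {eta2_author:.3f}, register: {eta2_register:.3f}")
```

Comparing the two proportions gives a ratio of the kind reported in the abstract. In a real analysis the scores would come from each dimension of the multi-dimensional model, and a measure suited to repeated-measures designs may be preferable to plain eta-squared.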
Article outline
- 1. Introduction
- 2. Sources of variation in linguistic research
- 3. Methods and data
- 3.1 General-purpose model of language variability
- 3.2 CPACT elicited data
- 4. Results
- 4.1 Effect size
- 4.2 Distances between texts
- 5. Discussion
- 6. Conclusion
- Notes
- References
