Article published in: International Journal of Corpus Linguistics
Vol. 28:2 (2023), pp. 202–231
Register variation across text lengths
Evidence from social media
Published online: 23 August 2022
https://doi.org/10.1075/ijcl.20177.lii
Abstract
This paper explores variation in lexico-grammatical register features across text lengths in a large-scale sample of Reddit comments. Very short texts are known to be problematic for many statistical methods, so understanding their nature is important for the corpus-linguistic study of social media, where most contributions are short. I show that the frequencies of linguistic features change with comment length, even between longer comments, although longer texts are often considered similar in statistical terms. Moreover, I classify the variation found between short comments of different lengths into two main patterns, although other patterns can also be found, and there is variation even within these patterns. Furthermore, I interpret the observed differences in terms of register variation. For example, shorter comments appear to be more casual and less edited in terms of their feature makeup, whereas narrative and informational registers seem to favor longer comments.
Keywords: text length, register analysis, social media, Reddit, functional variation
Article outline
- 1. Introduction
- 2. Register, social media, and text length
  - 2.1 Register analysis
  - 2.2 Social media and the problem of text length
  - 2.3 Earlier studies with text length as a variable
- 3. Data and method
  - 3.1 Data
  - 3.2 Method
- 4. Variation across text lengths
  - 4.1 Rounded rise
  - 4.2 Initial peak
  - 4.3 Other patterns
  - 4.4 Lengthwise variation between longer comments
- 5. Conclusions
- Acknowledgements
- Notes
