Register variation across text lengths: Evidence from social media

Liimatta, Aatu

doi:10.1075/ijcl.20177.lii

Article published in: International Journal of Corpus Linguistics
Vol. 28:2 (2023) ► pp.202–231

Aatu Liimatta | University of Helsinki

Published online: 23 August 2022

https://doi.org/10.1075/ijcl.20177.lii

Abstract

This paper explores variation in lexico-grammatical register features across text lengths in a large-scale sample of Reddit comments. Very short texts are known to be problematic for many statistical methods, so understanding their nature is important for the corpus-linguistic study of social media, where most contributions are short. I show that the frequencies of linguistic features change with comment length, even between longer comments, although longer texts are often considered similar in statistical terms. Moreover, I classify the variation found between short comments of different lengths into two main patterns, although other patterns can also be found, and there is variation even within these patterns. Furthermore, I interpret the observed differences in terms of register variation. For example, shorter comments appear to be more casual and less edited in terms of their feature makeup, whereas narrative and informational registers seem to favor longer comments.

Keywords: text length, register analysis, social media, Reddit, functional variation

Article outline

1. Introduction
2. Register, social media, and text length
- 2.1 Register analysis
- 2.2 Social media and the problem of text length
- 2.3 Earlier studies with text length as a variable
3. Data and method
- 3.1 Data
- 3.2 Method
4. Variation across text lengths
- 4.1 Rounded rise
- 4.2 Initial peak
- 4.3 Other patterns
- 4.4 Lengthwise variation between longer comments
5. Conclusions
Acknowledgements
Notes
References

Cited by (6)

Cited by six other publications

Coussé, Evie & Yvonne Adesam

2025. Exploring the language of Swedish social media: A contrastive corpus analysis. Nordic Journal of Linguistics ► pp. 1 ff.

Messerli, Thomas C, Daria Dayter, Sven Leuckert, Aatu Liimatta, Hanna Mahler, Axel Bohmann, Gustavo Kozma & Rafaela Tosin

2025. Digital debating cultures: communicative practices on Reddit. Digital Scholarship in the Humanities 40:1 ► pp. 227 ff.

Wood, Margaret

2024. Linguistic variation in functional types of statutory law. Applied Corpus Linguistics 4:1 ► pp. 100081 ff.

Wang, Jiawei & Zhiying Xin

2023. A novel multi-dimensional analysis of reply, response and rejoinder articles: When discipline meets time. Journal of English for Academic Purposes 65 ► pp. 101286 ff.

Liimatta, Aatu

2022. Do registers have different functions for text length? Register Studies 4:2 ► pp. 263 ff.

Liimatta, Aatu

2024. Text length and short texts. In Challenges in corpus linguistics [Studies in Corpus Linguistics, 118] ► pp. 106 ff.

This list is based on CrossRef data as of 12 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.