In: Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 106–125
Text length and short texts
An overview of the problem
Published online: 19 September 2024
https://doi.org/10.1075/scl.118.07lii
Abstract
Variation in text length is an unavoidable confounder in
quantitative text-analytic corpus-linguistic studies. Texts can be difficult
to compare across text lengths, particularly if many of them are short, due
to the difficulty of calculating meaningful frequencies for the lexical
items and linguistic features of interest. Traditionally, this has been less
of an issue, since texts in many of the genres typically studied in
linguistics have been relatively long. However, the rise of social media has
brought the issue to the forefront. In this chapter, I describe the problem
of text length and short texts together with a number of solutions and
workarounds to this and related problems.
Keywords: text length, normalization, lexical diversity, lengthwise analysis
Article outline
- 1. Introduction
- 2. Background
- 2.1 Text length, corpora, and social media
- 2.2 The importance of text length
- 3. Solutions and workarounds
- 3.1 Manipulation of the data
- 3.1.1 Exclusion
- 3.1.2 Combining
- 3.1.3 Chunking
- 3.2 Computational and statistical approaches
- 3.2.1 Lengthwise analysis
- 3.2.2 Multiple Correspondence Analysis
- 3.2.3 Resampling methods
- 3.3 A related problem: Lexical diversity
- 4. Conclusion
