Reuse of social media data in corpus linguistics

Laitinen, Mikko; Rautionaho, Paula

doi:10.1075/ijcl.24136.lai

Article published in: Reproducibility, Replicability, and Robustness in Corpus Linguistics
Edited by Martin Schweinberger and Michael Haugh
[International Journal of Corpus Linguistics 30:2] 2025
pp. 171–194

Mikko Laitinen | University of Eastern Finland | Center for Data Intensive Sciences and Applications

Paula Rautionaho | University of Eastern Finland

Published online: 12 June 2025

Abstract

The use of very large social media datasets in corpus linguistics has obvious benefits. Such data represent a novel source of evidence when compared with structured digital text corpora. However, there is a clear need to assess critically how the effective reuse of data can be handled, how findings can be reproduced, and how results can be generalized. A relevant question concerns the presentation of data to ensure reproducibility and replicability. This article surveys the state of the art in descriptions of data collection and methodological transparency in 30 studies that used Twitter/X as their data. The empirical section investigates how easy it would be to reproduce a study based on these descriptions. While we concentrate on evidence from one social media application, the discussion proceeds to concrete steps that might be used to improve data management related to the reuse, discovery, and evaluation of social media data in general.

Keywords: social media data, replicability, reproducibility, metadata, research infrastructures

Article outline

1. Introduction
2. Background
3. Current status of reproducible social media data
- 3.1 Data collection
- 3.2 Methods
4. Discussion and conclusions
Acknowledgements
Notes
References

Cited by one other publication

Becker, Laura & Matías Guzmán Naranjo

2025. Authors’ response to “Replication and methodological robustness in quantitative typology”. Linguistic Typology 29:3, pp. 591 ff.