Chapter 16. Automated text analyzers

Shao, Zimeng; Wang, Yuanheng (Arthur); Lu, Xiaofei

doi:10.1075/rmal.15.16sha

In: Digital and Internet-Based Research Methods in Applied Linguistics
Edited by Matt Kessler
[Research Methods in Applied Linguistics 15] 2026
pp. 338–361


Zimeng Shao | The Pennsylvania State University

Yuanheng (Arthur) Wang | University of North Carolina Wilmington

Xiaofei Lu | The Pennsylvania State University

Published online: 5 January 2026

https://doi.org/10.1075/rmal.15.16sha

Abstract

This chapter introduces (web-based) automated text analyzers (ATAs) in applied linguistics research. It begins by briefly surveying key strands of research involving ATAs and outlining three types of text analysis alongside questions commonly addressed by these tools. The core of this chapter presents a conceptual framework for the typology and implementation of ATAs, structured around four continuum dimensions: (1) pre-built corpus platform vs. custom corpus platform, (2) developer-oriented vs. user-oriented, (3) focused vs. versatile, and (4) descriptive vs. interpretive. The framework is illustrated through five practical studies showcasing the application of various types of ATAs in applied linguistics research, including L2SCA, Coh-Metrix, Sketch Engine, #LancsBox, Voyant Tools, and Wmatrix3. The chapter also discusses ethical considerations and methodological challenges associated with ATA use. It concludes by outlining future directions for ATA development and research, including improving annotation accuracy, enhancing qualitative interpretability, and expanding analytical capacities across languages and modalities.

Article outline

1. Introduction
2. Frequently asked research questions
- Exploratory
- Predictive
- Inferential
3. Implementation
- Dimension 1: Pre-built corpus platform vs. custom corpus platform
- Dimension 2: User-oriented vs. developer-oriented tools
- Dimension 3: Focused vs. versatile function
- Dimension 4: Descriptive vs. interpretative focus
4. Example studies
- Kim and Lu (2024b)
- Polio and Yoon (2018)
- Taylor (2021)
- Elmas et al. (2025)
- Breeze (2019)
5. Ethics and research integrity considerations
6. Challenges and issues
7. Future research directions
References

References (36)


Alexopoulou, T., Michel, M., Murakami, A., & Meurers, D. (2017). Task effects on linguistic complexity and accuracy: A large-scale learner corpus analysis employing natural language processing techniques. Language Learning, 67(S1), 180–208.

Anthony, L. (2022). What can corpus software do? In A. O’Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (2nd ed., pp. 103–125). Routledge.

Baker, P. (2023). Using corpora in discourse analysis (2nd ed.). Bloomsbury.

Baker, P., & McEnery, A. (Eds.). (2015). Corpora and discourse studies: Integrating discourse and corpora. Palgrave Macmillan.

Bednarek, M. (2015). Corpus-assisted multimodal discourse analysis of television and film narratives. In P. Baker & A. McEnery (Eds.), Corpora and discourse studies: Integrating discourse and corpora (pp. 63–87). Palgrave Macmillan.

Bengfort, B., Bilbro, R., & Ojeda, T. (2018). Applied text analysis with Python: Enabling language-aware data products with machine learning. O’Reilly Media.

Breeze, R. (2019). Emotion in politics: Affective-discursive practices in UKIP and Labour. Discourse & Society, 30(1), 24–43.

Brezina, V., & Platt, W. (2024). #LancsBox X (Version 5.0.3) [Computer Software]. Lancaster University. [URL]

Buck, A. M., & Ralston, D. F. (2021). I didn’t sign up for your research study: The ethics of using “public” data. Computers and Composition, 61, 102655.

Chen, Y. H., & Baker, P. (2016). Investigating criterial discourse features across second language development: Lexical bundles in rated learner essays, CEFR B1, B2 and C1. Applied Linguistics, 37(6), 849–880.

Choi, J., & Crossley, S. A. (2022). Advances in readability research: Automated readability web app for English. In Proceedings of the 2022 International Conference on Advanced Learning Technologies (pp. 1–5). IEEE.

Crossley, S. A., & Kim, M. (2022). Linguistic features of writing quality and development: A longitudinal approach. The Journal of Writing Analytics, 6(1), 59–93.

Elmas, T., Yılmaz, F., & Gürbüz, N. (2025). “Refugees from Ukraine are called humans”: A corpus-based critical discourse analysis of Turkish tweets about Ukrainian refugees. Media, Culture & Society, 47(1), 75–95.

Flowerdew, J., & Richardson, J. E. (Eds.). (2018). The Routledge handbook of critical discourse studies. Routledge.

Francom, J. (2025). An introduction to quantitative text analysis for linguistics: Reproducible research using R. Taylor & Francis.

Hardie, A. (2012). CQPweb — combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380–409.

Hunt, D., & Harvey, K. (2015). Health communication and corpus linguistics: Using corpus tools to analyse eating disorder discourse online. In P. Baker & A. McEnery (Eds.), Corpora and discourse studies: Integrating discourse and corpora (pp. 134–154). Palgrave Macmillan.

Jin, T., Lu, X., Guo, K., Li, B., Liu, F., Deng, Y., Wu, J., & Chen, G. (2021). Eng-Editor: An online English text evaluation and adaptation system. LanguageData. [URL]

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V. (2014). The Sketch Engine: Ten years on. Lexicography, 1(1), 7–36.

Kim, M., & Lu, X. (2024a). Exploring the potential of using ChatGPT for rhetorical move-step analysis: The impact of prompt refinement, few-shot learning, and fine-tuning. Journal of English for Academic Purposes, 71, 101422.

Kim, M., & Lu, X. (2024b). L2 English speaking syntactic complexity: Data preprocessing issues, reliability of automated analysis, and the effects of proficiency, L1 background, and topic. The Modern Language Journal, 108(1), 270–296.

Kyle, K., Crossley, S., & Verspoor, M. (2021). Measuring longitudinal writing development using indices of syntactic complexity and sophistication. Studies in Second Language Acquisition, 43(4), 781–812.

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474–496.

Lu, X. (2014). Computational methods for corpus annotation and analysis. Springer.

Lu, X. (2021). Directions for future automated analyses of L2 written texts. In The Routledge handbook of second language acquisition and writing (pp. 370–382). Routledge.

Lu, X. (2022). What can corpus software reveal about language development? In A. O’Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (2nd ed., pp. 155–167). Routledge.

Mautner, G. (2022). What can a corpus tell us about discourse? In A. O’Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (2nd ed., pp. 250–262). Routledge.

McNamara, D. S., Graesser, A. C., McCarthy, P., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press.

O’Keeffe, A., & McCarthy, M. J. (Eds.). (2022). The Routledge handbook of corpus linguistics (2nd ed.). Routledge.

Polio, C., & Yoon, H. J. (2018). The reliability and validity of automated tools for examining variation in syntactic complexity across genres. International Journal of Applied Linguistics, 28(1), 165–188.

Potts, A. (2015). Filtering the flood: Semantic tagging as a method of identifying salient discourse topics in a large corpus of Hurricane Katrina reportage. In P. Baker & A. McEnery (Eds.), Corpora and discourse studies: Integrating discourse and corpora (pp. 285–304). Palgrave Macmillan.

Sinclair, S., & Rockwell, G. (2016). Voyant-tools. [URL]

Srinivasa-Desikan, B. (2018). Natural language processing and computational linguistics: A practical guide to text analysis with Python, Gensim, spaCy, and Keras. Packt Publishing.

Taylor, C. (2021). Investigating gendered language through collocation: The case of mock politeness. In J. Angouri & J. Baxter (Eds.), The Routledge handbook of language, gender, and sexuality (pp. 572–586). Routledge.

Taylor, C., & Marchi, A. (Eds.). (2018). Corpus approaches to discourse: A critical review. Routledge.

Zimmer, M. (2010). “But the data is already public”: On the ethics of research in Facebook. Ethics and Information Technology, 12(4), 313–325.
