In: Digital and Internet-Based Research Methods in Applied Linguistics
Edited by Matt Kessler
[Research Methods in Applied Linguistics 15] 2026
pp. 172–191
Chapter 9. Web-based corpora and web-based corpus platforms
Published online: 5 January 2026
https://doi.org/10.1075/rmal.15.09cas
Abstract
Recent years have seen rapid growth in the range of web-based corpora and in the functionality of
web-based corpus platforms available for linguistic research across many of the world's languages, such that these
tools are a key consideration when discussing internet-based research methods. This chapter discusses the use of such
resources in applied linguistics research, adopting a focus on corpora or corpus platforms with web-based interfaces
that enable linguistic analysis, rather than resources which are hosted online primarily for offline use. After
framing this definition, the authors outline commonly used web-based corpora and platforms, as well as research
questions and considerations in corpus selection. Five studies which use prominent tools in innovative ways are
reviewed. Ethical considerations and other challenges are discussed, and the chapter concludes by projecting future
directions in the use of web-based corpora and web-based corpus platforms for research in applied linguistics.
Article outline
- 1. Introduction
- 2. Frequently asked research questions
- 3. Implementation
- 4. Example studies
  - Jiang and Hyland (2022)
  - Baker et al. (2013)
  - Norberg (2016)
  - Liu (2011)
  - Schaeffer-Lacroix (2021)
- 5. Ethics and research integrity considerations
- 6. Challenges and issues
- 7. Future research directions
References (50)
Anthony, L. (2024). AntConc (Version 4.3.1) [Computer Software]. Waseda University. [URL]
Baker, P., Gabrielatos, C., & McEnery, T. (2013). Sketching Muslims: A corpus driven analysis of representations around the word ‘Muslim’ in the British Press 1998–2009. Applied Linguistics, 34(3), 255–278.
Biber, D. (1993). Representativeness in corpus design. Literary and Linguistic Computing, 8(4), 243–257.
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge University Press.
Brezina, V. (2018). Statistics in corpus linguistics: A practical guide. Cambridge University Press.
Callies, M. (2019). Integrating corpus literacy into language teacher education: The case of learner corpora. In S. Götz & J. Mukherjee (Eds.), Learner corpora and language teaching (pp. 245–264). John Benjamins.
Charles, M. (2012). ‘Proper vocabulary and juicy collocations’: EAP students evaluate do-it-yourself corpus-building. English for Specific Purposes, 31(2), 93–102.
Chen, M., & Flowerdew, J. (2018). Introducing data-driven learning to PhD students for research writing purposes: A territory-wide project in Hong Kong. English for Specific Purposes, 50, 97–112.
Crosthwaite, P., & Baisa, V. (2023). Generative AI and the end of corpus-assisted data-driven learning? Not so fast! Applied Corpus Linguistics, 3(3), 100066.
Crosthwaite, P., & Steeples, B. (2024). Data-driven learning with younger learners: Exploring corpus-assisted development of the passive voice for science writing with female secondary school students. Computer Assisted Language Learning, 37(5–6), 1166–1197.
Curry, N., Baker, P., & Brookes, G. (2024). Generative AI for corpus approaches to discourse studies: A critical evaluation of ChatGPT. Applied Corpus Linguistics, 4(1), 100082.
Davies, M. (2008). The Corpus of Contemporary American English (COCA). [URL]
Davies, M. (2021). Language and COVID-19: Design, construction, and use. International Journal of Corpus Linguistics, 26(4), 583–598.
Davies, M., & Chapman, D. (2016). The effect of representativeness and size in historical corpora: An empirical study of changes in lexical frequency. In D. Chapman, C. Moore, & M. Wilcox (Eds.), Studies in the history of the English language VII: Generalizing vs. particularizing methodologies in historical linguistic analysis (pp. 131–151). De Gruyter Mouton.
Farr, F., & Leńko-Szymańska, A. (2024). Corpora in English language teacher education: Research, integration, and resources. TESOL Quarterly, 58(3), 1181–1192.
Forti, L. (2023). Corpus use in Italian language pedagogy: Exploring the effects of data-driven learning. Routledge.
Francis, W., & Kucera, H. (1979). The Brown Corpus. Department of Linguistics, Brown University, Providence, Rhode Island, US.
Geertzen, J., Alexopoulou, T., & Korhonen, A. (2013). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proceedings of the 31st Second Language Research Forum (pp. 240–254). Cascadilla Proceedings Project.
Granger, S., Dupont, M., Meunier, F., Naets, H., & Paquot, M. (2020). The International Corpus of Learner English. Version 3. Presses universitaires de Louvain. [URL]
Hardie, A. (2012). CQPweb — Combining power, flexibility and usability in a corpus analysis tool. International Journal of Corpus Linguistics, 17(3), 380–409.
Ishikawa, S. (2023). The ICNALE Guide: An Introduction to a learner corpus study on Asian learners’ L2 English. Routledge.
Jiang, F., & Hyland, K. (2022). COVID-19 in the news: The first 12 months. International Journal of Applied Linguistics, 31(2), 241–258. [URL]
Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set].
Kaunisto, M., & Schilk, M. (2024). Challenges in corpus linguistics: Rethinking corpus compilation and analysis. John Benjamins.
Kennedy, C., & Miceli, T. (2010). Corpus-assisted creative writing: Introducing intermediate Italian learners to a corpus as a reference resource. Language Learning & Technology, 14(1), 28–44.
Kessler, M. (2024). Do we know our language learners? Investigating students’ and teachers’ technology ownership, access, literacy, and interest in online education. E-Learning and Digital Media.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V. (2014). The Sketch Engine: Ten years on. Lexicography, 1, 7–36.
Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 29(3), 333–347.
Larsson, T., Plonsky, L., Sterling, S., Kytö, M., Yaw, K., & Wood, M. (2023). On the frequency, prevalence, and perceived severity of questionable research practices. Research Methods in Applied Linguistics, 2(3), 100064.
Lee, D., & Swales, J. (2006). A corpus-based EAP course for NNS doctoral students: Moving from available specialized corpora to self-compiled corpora. English for Specific Purposes, 25, 56–75.
Lin, P. (2023). ChatGPT: Friend or foe (to corpus linguists)? Applied Corpus Linguistics, 3(3), 100065.
Liu, D. (2011). Making grammar instruction more empowering: An exploratory case study of corpus use in the learning/teaching of grammar. Research in the Teaching of English, 45(4).
Ma, Q., Tang, J., & Lin, S. (2021). The development of corpus-based language pedagogy for TESOL teachers: A two-step training approach facilitated by online collaboration. Computer Assisted Language Learning, 35(9), 2731–2760.
McEnery, T., & Hardie, A. (2012). Corpus linguistics: Method, theory and practice. Cambridge University Press.
Michel, J-B., Shen, Y., Aiden, A., Veres, A., Gray, M., Brockman, W., The Google Books Team, Pickett, J., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak, M., & Aiden, E. (2010). Quantitative analysis of culture using millions of digitized books. Science, 331(6014).
Mizumoto, A. (2023). Data-driven learning meets generative AI: Introducing the framework of metacognitive resource use. Applied Corpus Linguistics, 3(3), 100074.
Norberg, C. (2016). Naughty boys and sexy girls: The representation of young individuals in a web-based corpus of English. Journal of English Linguistics, 44(4), 291–317.
O’Keeffe, A., McCarthy, M., & Carter, R. (2007). From corpus to classroom: Language use and language teaching. Cambridge University Press.
Pérez-Paredes, P., Mark, G., & O’Keeffe, A. (2025). Corpus Linguistics for Language Learning Research. John Benjamins.
Poole, R., & Micalay-Hurtado, M. A. (2022). A corpus-assisted ecolinguistic analysis of the representations of tree/s and forest/s in US discourse from 1820–2019. Applied Corpus Linguistics, 2(3), 100036.
Rayson, P. (2009). Wmatrix: A web-based corpus processing environment. Computing Department, Lancaster University. [URL]
Schaeffer-Lacroix, E. (2021). Integrating corpus-based audio description tasks into an intermediate-level German course. International Journal of Applied Linguistics, 31(2), 173–192.
Sinclair, J. (2004). Corpus and text — Basic principles. In M. Wynne (Ed.), Developing linguistic corpora: A guide to good practice (pp. 1–16). Oxbow Books.
Subirats, C., & Ortega, M. (2012). Corpus del Español Actual. [URL]
