In: Quantitative Methods in Multilingual Acquisition and Processing
Edited by Gabrielle Klassen and John W. Schwieter
[Research Methods in Applied Linguistics 16] 2026
pp. 28–55
Chapter 3. Using corpora and variationist approaches to make inferences about
language representation
Published online: 26 March 2026
https://doi.org/10.1075/rmal.16.03esc
Abstract
As interest in usage-based approaches to multilingualism
has grown, researchers have increasingly turned to large data sets (i.e.,
corpora) of actual language usage to study linguistic phenomena. Exploring
multilingual or multidialectal corpora can shed important light on
multilingual acquisition and processing, extending our understanding of how
speakers develop and maintain or lose languages over time, how their
languages vary and change, and how languages interact within the mind of the
speaker and within a multilingual community. While English has traditionally
dominated the field of corpus linguistics, there has been a growing
recognition of the need to develop corpora and tools for other languages to
ensure that research findings sufficiently consider linguistic diversity
(O’Keeffe & McCarthy, 2010). While corpus research in recent years has
expanded to include a wider variety of languages, there are still few
resources that guide researchers to engage with these corpora. This chapter
serves as a theoretical and practical resource for researchers interested in
using corpora to study multilingualism, particularly language variation and
acquisition, with special attention paid to the two most widely spoken
languages aside from English: Spanish and Mandarin Chinese. We discuss how
to use corpora to answer questions of bi/multilingual language
representations and linguistic theory, including how to extract patterns in
large corpora (e.g., tokenizing, part of speech tagging, error tagging) and
how to analyze these patterns using logistic regression and mixed-effects
modeling. We review research questions about why and how languages vary when
in contact with other languages, either internally (within a single mind) or
externally (in a community). Lastly, we include guidelines and best
practices for the development and analysis of new and existing corpora.
Article outline
- 3.1 Introduction
- 3.2 Corpus linguistics, variation, and the benefits of quantitative analysis
- 3.2.1 What is a corpus?
- 3.2.2 Corpus types, characteristics, and objectives
- 3.2.3 Variation
- 3.3 How to extract patterns using corpora
- 3.3.1 Corpus analysis
- 3.3.1.1 Tokenizing and part of speech tagging
- 3.3.1.2 Error tagging and other manual annotations
- 3.3.1.3 Off-the-shelf tools
- 3.3.1.4 Logistic regression
- 3.3.1.5 Mixed-effects modeling
- 3.3.1.6 Multi-dimensional framework
- 3.4 Looking ahead
- 3.4.1 Developing new corpora
- 3.4.2 Future directions for analyses of existing corpora
- 3.4.2.1 Phonetic analyses
- 3.4.2.2 Task variables
- 3.4.2.3 Other methodological approaches
- 3.5 Conclusion
References (101)
Ädel, A. (2021). Corpus
compilation. In M. Paquot & S. T. Gries (Eds.), A
practical handbook of corpus
linguistics (pp. 3–24). Springer.
Alfonso Lozano, R. (2010). El
vocalismo del español en el habla
espontánea (Unpublished doctoral
dissertation). Universitat de Barcelona.
Asención-Delaney, Y., Collentine, J. G., Colmenares, J. J., & Urzúa, A. (2022). Training
teachers to use corpus tools in the Spanish language
classroom. Journal of Spanish
Language
Teaching, 9(2), 134–147.
Baayen, R. H., & Linke, M. (2020). Generalized
additive mixed
models. In M. Paquot & S. T. Gries (Eds.), A
practical handbook of corpus
linguistics (pp. 563–591). Springer.
Backus, A. (2021). Usage-based
approaches. In E. Adamou & Y. Matras (Eds.), The
Routledge handbook of language
contact (pp. 110–126). Routledge.
Bayley, R. (2013a). The
quantitative
paradigm. In J. K. Chambers & N. Schilling (Eds.), The
handbook of language variation and change, second
edition (pp. 85–107). Blackwell.
Bennett, G. (2010). Using
corpora in the language learning classroom: Corpus linguistics for
teachers. University of Michigan Press.
Biber, D. (1995). Dimensions
of register variation: A cross-linguistic
comparison. Cambridge University Press.
Biber, D. (2004). Conversation
text types: A multi-dimensional
analysis. In G. Purnelle, C. Fairon, & A. Dister (Eds.), Le
poids des mots: Proceedings of the 7th International Conference on
the Statistical Analysis of Textual
Data (pp. 15–34). Presses universitaires de Louvain. [URL]
Biber, D., Conrad, S., & Reppen, R. (1998). Corpus
linguistics: Investigating language structure and
use. Cambridge University Press.
Bird, S., Klein, E., & Loper, E. (2009). Natural
language processing with Python: Analyzing text with the natural
language toolkit. O’Reilly Media. [URL]
Boersma, P., & Van Heuven, V. (2001). Speak
and unSpeak with PRAAT. Glot
International, 5(9/10), 341–347. [URL]
Bullock, B. E., Serigos, J., Toribio, A. J., & Wendorf, A. (2018). The
challenges and benefits of annotating oral bilingual corpora: The
Spanish in Texas Corpus
Project. Linguistic
Variation, 18(1), 100–119.
Bybee, J. (2006). From
usage to grammar: The mind’s response to
repetition. Language, 82(4), 711–733.
Campillos Llanos, L. (2016). PoS-tagging
a Spanish oral learner
corpus. In M. Alonso-Ramos (Ed.), Spanish
Learner Corpus Research: Current Trends and Future
Perspectives (pp. 89–116). John Benjamins.
Cao, Y., Font-Rotchés, D., & Rius-Escudé, A. (2023). Front
vowels of Spanish: A challenge for Chinese
speakers. Open
Linguistics, 9(1), 20220230.
Carando, A., Minnillo, S., Fernández-Mira, P., Davidson, S., Sagae, K., & Sánchez-Gutiérrez, C. (2023). Writing
development in Spanish as a second and heritage language: A corpus
study on complexity. Journal of
Spanish Language
Teaching, 10(1), 59–71.
Carreira, M., & Hitchens Chik, C. (2018). Differentiated
teaching: A primer for heritage and mixed
classes. In K. Potowski (Ed.), The
Routledge handbook of Spanish as a heritage
language (pp. 359–374). Routledge. [URL].
Carter, P. M., & Merii, K. D. (2023). Spanish-influenced
lexical phenomena in emerging Miami English: Tracking production and
perception. English
World-Wide, 44(2), 219–250.
Carvalho, A. M. (2012). Corpus
del Español en el Sur de Arizona (CESA). [URL]
Chen, H. C., & Han, Q. W. (2020). Designing
and implementing a corpus-based online pronunciation learning
platform for Cantonese learners of
Mandarin. Interactive Learning
Environments, 28(1), 18–31.
Cui, Y., Zhu, J., Yang, L., Fang, X., Chen, X., Wang, Y., & Yang, E. (2022). CTAP
for Chinese: A linguistic complexity feature automatic calculation
Platform. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings
of the Thirteenth Language Resources and Evaluation
Conference (pp. 5525–5538). European Language Resources Association. [URL]
Davies, M. (2016). Corpus
del Español: Two billion words, 21
countries. [URL]
Díaz-Negrillo, A., & Fernández-Domínguez, J. (2006). Error
tagging systems for learner
corpora. Revista Española de
Lingüística
Aplicada, 19, 83–102. [URL]
Díez-Bedmar, M. B. (2021). Error
analysis. In N. Tracy-Ventura & M. Paquot (Eds.), The
Routledge handbook of second language acquisition and
corpora (pp. 90–104). Routledge.
Díez-Ortega, M., & Kyle, K. (2023). Measuring
the development of lexical richness of L2 Spanish: A longitudinal
learner corpus study. Studies in
Second Language
Acquisition, 46(1), 1–31.
Dybkjær, L., & Bernsen, N. O. (2004). Recommendations
for natural interactivity and multimodal annotation
schemes. Proceedings of the LREC
'2004 Workshop on Multimodal
Corpora, Lisbon, Portugal. pp. 5–8. [URL]
Ellis, N. C. (2002). Frequency
effects in language processing: A review with implications for
theories of implicit and explicit language
acquisition. Studies in Second
Language
Acquisition, 24(2), 143–188.
Fernández-Mira, P., Morgan, E., Davidson, S., Yamada, A., Carando, A., Sagae, K., & Sánchez-Gutiérrez, C. H. (2021). Lexical
diversity in an L2 Spanish learner corpus: The effect of
topic-related
variables. International Journal of
Learner Corpus
Research, 7(2), 230–258.
Fromont, R., & Hay, J. (2012). LaBB-CAT:
An annotation
store. In Proceedings
of Australasian Language Technology Association
Workshop (pp. 113–117).
Gardner, D., & Davies, M. (2014). A
new academic vocabulary list. Applied
Linguistics, 35(3), 305–327.
Gilquin, G. (2015). From
design to collection of learner
corpora. In S. Granger (Ed.), The
Cambridge handbook of learner corpus
research (pp. 9–34). Cambridge University Press.
Gonzalez, S., Grama, J., & Travis, C. E. (2020). Comparing
the performance of forced aligners used in sociophonetic
research. Linguistics
Vanguard, 6(1), 20190058.
Graesser, A. C., McNamara, D. S., & Kulikowich, J. M. (2011). Coh-Metrix:
Providing multilevel analyses of text
characteristics. Educational
Researcher, 40(5), 223–234.
Granger, S. (2003). Error-tagged
learner corpora and CALL: A promising
synergy. CALICO
Journal, 465–480.
Gries, S. T. (2022). How
to use statistics in quantitative corpus
analysis. In M. McCarthy & A. O’Keeffe (Eds.), The
Routledge handbook of corpus
linguistics (2nd
ed., pp. 168–181). Routledge.
Gudmestad, A., Edmonds, A., & Metzger, T. (2019). Using
variationism and learner corpus research to investigate grammatical
gender marking in additional language
Spanish. Language
Learning, 69(4), 911–942.
Gut, U. (2012). The
LeaP corpus: A multilingual corpus of
spoken. In T. Schmidt & K. Wörner (Eds.), Multilingual
corpora and multilingual corpus
analysis (pp. 3–23). John Benjamins.
Erker, D., Guy, G. R., Beaman, K. V., Bayley, R., Adli, A., Orozco, R., & Zhang, X. (forthcoming). Subject pronoun variation: A cross-language sociolinguistic study. Cambridge University Press.
Hedlund, G., & Rose, Y. (2020). Phon
3.1 [Computer
Software]. [URL]
Hilpert, M., & Blasi, D. E. (2020). Fixed-effects
regression
modeling. In M. Paquot & S. T. Gries (Eds.), A
practical handbook of corpus
linguistics (pp. 505–533). Springer.
Jin, T., & Lu, X. (2018). A
data-driven approach to text adaptation in teaching material
preparation: Design, implementation, and teacher professional
development. TESOL
Quarterly, 52(2), 457–467.
Kisler, T., Schiel, F., & Sloetjes, H. (2017). Signal
processing via web services: the use case
WebMAUS. Computer Speech and
Language, 45, 326–347.
Kyle, K. (2020). Measuring
lexical
richness. In S. Webb (Ed.), The
Routledge handbook of vocabulary
studies (pp. 454–476). Routledge.
Kyle, K., & Crossley, S. (n.d.). NLP
Tools for the Social
Sciences. Retrieved 1 November
2023 from [URL]
Labov, W. (1972). Language
in the inner city: Studies in the Black English
Vernacular. University of Pennsylvania Press.
Liu, H., & Van Dongen, E. (2013). The
Chinese diaspora. Oxford Bibliographies. [URL].
Lozano, C. (2022). CEDEL2:
Design, compilation and web interface of an online corpus for L2
Spanish acquisition research. Second
Language
Research, 38(4), 965–983.
Lozano, C., & Fernández-Mira, P. (2022). Designing,
compiling and interrogating corpora in L2 Spanish acquisition
research. Journal of Spanish Language
Teaching, 9(2), 190–206.
MacWhinney, B. (1998). Models
of the emergence of language. Annual
Review of
Psychology, 49(1), 199–227.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal
forced aligner: Trainable text-speech alignment using
Kaldi. Interspeech, 2017, 498–502.
McEnery, A., & Xiao, Z. (2004). The
Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and
contrastive language
study. Proceedings of the Fourth
International Conference on Language Resources and Evaluation
(LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA). [URL]
McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based
language studies: An advanced resource
book. Routledge.
Ming, T., & Tao, H. (2008). Developing
a Chinese heritage language corpus: Issues and a preliminary
report. In Chinese
as a heritage language: Fostering rooted world
citizenry (pp. 167–187). University of Hawai’i.
Minnillo, S., Sánchez-Gutiérrez, C., Carando, A., Davidson, S., Mira, P. F., & Sagae, K. (2022). Preterit-imperfect
acquisition in L2 Spanish writing: Moving beyond lexical
aspect. Research in Corpus
Linguistics, 10(1), 156–184.
Mitchell, R., Tracy-Ventura, N., & McManus, K. (2019). Anglophone
students abroad: Identity, social relationships, and language
learning. Routledge.
Nagy, N. (2011). A
multilingual corpus to explore geographic
variation. Rassegna Italiana di
Linguistica
Applicata, 43(1–2), 65–84.
Paquot, M., & Plonsky, L. (2017). Quantitative
research methods and study quality in learner corpus
research. International Journal of
Learner Corpus
Research, 3(1), 61–94.
Paris, D., & Alim, H. S. (2017). Culturally
sustaining pedagogies: Teaching and learning for justice in a
changing world. Teachers College Press.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn:
Machine Learning in Python. Journal
of Machine Learning
Research, 12(85), 2825–2830. [URL]
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., & Veselý, K. (2011). The
Kaldi speech recognition
toolkit. IEEE 2011 Workshop on
Automatic Speech Recognition and
Understanding.
Real Academia Española. Banco de datos (CORPES XXI)
Corpus del Español del Siglo XXI. [URL]
Real Academia Española. Banco de datos (CREA)
Corpus de Referencia del Español actual. [URL]
Real Academia Española. Banco de datos (CORDE)
Corpus Diacrónico del Español. [URL]
Rojo, G., & Palacios, I. M. (2016). Learner
Spanish on computer. The CAES ‘Corpus de Aprendices de Español’
project. In M. Alonso-Ramos (Ed.), Spanish
Learner Corpus
Research (pp. 55–87). John Benjamins.
Rojo, G., Palacios, I., Sampedro Mella, M., & Marsily, A. (2022). Los
corpus de aprendices de español LE/L2: panorama actual y
perspectivas futuras. Journal of
Spanish Language
Teaching, 9(2), 174–189.
Rosenfelder, I., Fruehwald, J., Evanini, K., Seyfarth, S., Gorman, K., Prichard, H., & Yuan, J. (2015). FAVE:
Speaker
fix (v1.2.2). Zenodo.
Sampedro Mella, M. (2021). Los
adverbios locativos de ubicación en el español como lengua
extranjera. In C. Ballestero de Celis & M. Sampedro Mella (Eds.), Aportes
del CAES a la enseñanza del español como lengua
extranjera (pp. 24–50). Universidad de Santiago de Compostela.
Sánchez-Gutiérrez, C., De Cock, B., & Tracy-Ventura, N. (2022). Spanish
corpora and their pedagogical uses: challenges and
opportunities. Journal of Spanish
Language
Teaching, 9(2), 105–115.
Sánchez-Gutiérrez, C. H., Minnillo, S., Mira, P. F., & Hernández, A. (2024). Prompt
response variation in learner corpus research: Implications for data
interpretation. Research Methods in
Applied
Linguistics, 3(3).
Sankoff, D. (1988). Sociolinguistics
and syntactic
variation. In F. J. Newmeyer (Ed.), Linguistics:
The Cambridge
survey (pp. 140–161). Cambridge University Press.
Schäfer, R. (2020). Mixed-effects
regression
modeling. In M. Paquot & S. T. Gries (Eds.), A
Practical Handbook of Corpus
Linguistics (pp. 535–561). Springer.
Sloetjes, H., & Wittenburg, P. (2008, May). Annotation
by Category: ELAN and ISO
DCR. Proceedings of the Sixth
International Conference on Language Resources and Evaluation
(LREC’08). [URL]
SpaCy. (2023). Explosion. [URL]
Sung, Y.-T., Chang, T.-H., Lin, W.-C., Hsieh, K.-S., & Chang, K.-E. (2016). CRIE:
An automated analyzer for Chinese
texts. Behavior Research
Methods, 48(4), 1238–1251.
Tomasello, M. (2000). First
steps toward a usage-based theory of language
acquisition. Cognitive
Linguistics, 11(1–2), 61–82.
Tomasello, M. (2003). Constructing
a language: A usage-based theory of language
acquisition. Harvard University Press.
Tracy-Ventura, N., & Myles, F. (2015). The
importance of task variability in the design of learner corpora for
SLA research. International Journal
of Learner Corpus
Research, 1(1), 58–95.
Weinreich, U., Labov, W., & Herzog, M. (1968). Empirical
foundations for a theory of language
change. In W. Lehmann & Y. Malkiel (Eds.), Directions
for historical
linguistics (pp. 95–188). University of Texas Press. [URL]
Winter, B. (2013). Linear
models and linear mixed effects models in R with linguistic
applications. ArXiv
Preprint, 1308.5499.
Wolfram, W. (1969). A
sociolinguistic study of Detroit Negro
speech. Center for Applied Linguistics.
Wu, B., Xie, Y., Lu, L., Cao, C., & Zhang, J. (2016). The
construction of a Chinese interlanguage
corpus. In 2016
Conference of The Oriental Chapter of International Committee for
Coordination and Standardization of Speech Databases and Assessment
Techniques
(O-COCOSDA) (pp. 183–187).
Wu, C., & Shih, C. (2014). A
design of the spontaneous Chinese learner speech
corpus. 神戸大学国際コミュニケーションセンター.
Xu, H., Jiang, M., Lin, J., & Huang, C.-R. (2022). Light
verb variations and varieties of Mandarin Chinese: Comparable corpus
driven approaches to grammatical
variations. Corpus Linguistics and
Linguistic
Theory, 18(1), 145–173.
Xu, J. (2019). The
corpus approach to the teaching and learning of Chinese as an L1 and
an L2 in
retrospect. In X. Lu & B. Chen (Eds.), Computational
and corpus approaches to Chinese language
learning (pp. 33–53). Springer.
Xu, J. (2015). Corpus-based
Chinese studies: A historical review from the 1920s to the
present. Chinese Language and
Discourse, 6(2), 218–244.
Yamada, A., Davidson, S., Fernández-Mira, P., Carando, A., Sagae, K., & Sánchez-Gutiérrez, C. (2020). COWS-L2H:
A corpus of Spanish learner
writing. Research in Corpus
Linguistics, 8(1), 17–32.
Young, R., & Bayley, R. (1996). Varbrul
analysis for second language acquisition
research. In R. Bayley & D. R. Preston (Eds.), Second
language acquisition and linguistic
variation (pp. 253–306). John Benjamins.
Zhang, J., & Lu, X. (2013). Variability
in Chinese as a foreign language learners’ development of the
Chinese numeral classifier
system. The Modern Language
Journal, 97(S1), 46–60.
Zhang, X. (2023). Language
variation in teacher speech in a dual immersion
preschool. Proceedings of the
Linguistic Society of
America, 8(1), 5474.
