Chapter 3. Using corpora and variationist approaches to make inferences about language representation

Escalante, Chelsea; Minnillo, Sophia; Zhang, Xinye

doi:10.1075/rmal.16.03esc

In: Quantitative Methods in Multilingual Acquisition and Processing
Edited by Gabrielle Klassen and John W. Schwieter
[Research Methods in Applied Linguistics 16] 2026
pp. 28–55


Published online: 26 March 2026

https://doi.org/10.1075/rmal.16.03esc

Abstract

As interest in usage-based approaches to multilingualism has grown, researchers have increasingly turned to large data sets of actual language usage (i.e., corpora) to study linguistic phenomena. Exploring multilingual or multidialectal corpora can shed important light on multilingual acquisition and processing, extending our understanding of how speakers develop and maintain or lose languages over time, how their languages vary and change, and how languages interact within the mind of the speaker and within a multilingual community. Although English has traditionally dominated the field of corpus linguistics, there is growing recognition of the need to develop corpora and tools for other languages to ensure that research findings sufficiently consider linguistic diversity (O’Keeffe & McCarthy, 2010). Corpus research in recent years has expanded to include a wider variety of languages, but there are still few resources that guide researchers in engaging with these corpora. This chapter serves as a theoretical and practical resource for researchers interested in using corpora to study multilingualism, particularly language variation and acquisition, with special attention paid to the two most widely spoken languages aside from English: Spanish and Mandarin Chinese. We discuss how to use corpora to answer questions about bi/multilingual language representations and linguistic theory, including how to extract patterns from large corpora (e.g., tokenizing, part-of-speech tagging, error tagging) and how to analyze these patterns using logistic regression and mixed-effects modeling. We review research questions about why and how languages vary when in contact with other languages, either internally (within a single mind) or externally (in a community). Lastly, we include guidelines and best practices for the development and analysis of new and existing corpora.
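The workflow named in the abstract (tokenize, code a variable, model it with logistic regression) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the utterances are invented rather than taken from any corpus, the speaker groups "A" and "B" are hypothetical, and the helper names `tokenize`, `logit`, and `rate` are not from the chapter. It relies on the fact that, with a single binary predictor, the maximum-likelihood logistic regression estimates equal the empirical log-odds, so no fitting library is needed. A real study would more plausibly use a mixed-effects model with speaker as a random effect, which this sketch does not attempt.

```python
import math
import re
from collections import Counter

# Invented utterances for illustration only; not drawn from any real corpus.
# Each item: (hypothetical speaker group, utterance)
UTTERANCES = [
    ("A", "yo quiero un libro"),
    ("A", "quiero un mapa"),
    ("A", "yo vivo en la ciudad"),
    ("A", "yo hablo mucho"),
    ("B", "hablo con ellos"),
    ("B", "yo trabajo mucho"),
    ("B", "vivo en la ciudad"),
    ("B", "quiero agua"),
]


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


# Code the dependent variable: 1 if the overt pronoun "yo" occurs, else 0.
counts = Counter()
for group, utterance in UTTERANCES:
    overt = int("yo" in tokenize(utterance))
    counts[(group, overt)] += 1


def rate(group):
    """Proportion of a group's utterances that contain overt 'yo'."""
    overt = counts[(group, 1)]
    return overt / (overt + counts[(group, 0)])


# With one binary predictor, the logistic regression estimates have a
# closed form: the intercept is the log-odds for the reference group ("B"),
# and the slope is the difference in log-odds between the two groups.
intercept = logit(rate("B"))
slope = logit(rate("A")) - intercept

print("overt-pronoun rate, group A:", rate("A"))
print("overt-pronoun rate, group B:", rate("B"))
print("intercept (log-odds, group B):", round(intercept, 3))
print("slope (log-odds difference, A vs. B):", round(slope, 3))
```

A positive slope here means only that the invented group A utterances contain overt "yo" more often than the invented group B utterances; no empirical claim is intended.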

Article outline

3.1 Introduction
3.2 Corpus linguistics, variation, and the benefits of quantitative analysis
- 3.2.1 What is a corpus?
- 3.2.2 Corpus types, characteristics, and objectives
- 3.2.3 Variation
3.3 How to extract patterns using corpora
- 3.3.1 Corpus analysis
  - 3.3.1.1 Tokenizing and part of speech tagging
  - 3.3.1.2 Error tagging and other manual annotations
  - 3.3.1.3 Off-the-shelf tools
  - 3.3.1.4 Logistic regression
  - 3.3.1.5 Mixed-effects modeling
  - 3.3.1.6 Multi-dimensional framework
3.4 Looking ahead
- 3.4.1 Developing new corpora
- 3.4.2 Future directions for analyses of existing corpora
  - 3.4.2.1 Phonetic analyses
  - 3.4.2.2 Task variables
  - 3.4.2.3 Other methodological approaches
3.5 Conclusion
References

References

Ädel, A. (2021). Corpus compilation. In M. Paquot & S. T. Gries (Eds.), A practical handbook of corpus linguistics (pp. 3–24). Springer.

Alfonso Lozano, R. (2010). El vocalismo del español en el habla espontánea (Unpublished doctoral dissertation). Universitat de Barcelona.

Asención-Delaney, Y., Collentine, J. G., Colmenares, J. J., & Urzúa, A. (2022). Training teachers to use corpus tools in the Spanish language classroom. Journal of Spanish Language Teaching, 9(2), 134–147.

Baayen, R. H., & Linke, M. (2020). Generalized additive mixed models. In M. Paquot & S. T. Gries (Eds.), A practical handbook of corpus linguistics (pp. 563–591). Springer.

Backus, A. (2021). Usage-based approaches. In E. Adamou & Y. Matras (Eds.), The Routledge handbook of language contact (pp. 110–126). Routledge.

Baker, P. (2006). Using corpora in discourse analysis. Continuum.

Baker, P. (2010). Sociolinguistics and corpus linguistics. Edinburgh University Press.

Bayley, R. (2013a). The quantitative paradigm. In J. K. Chambers & N. Schilling (Eds.), The handbook of language variation and change, second edition (pp. 85–107). Blackwell.

Bennett, G. (2010). Using corpora in the language learning classroom: Corpus linguistics for teachers. University of Michigan Press.

Biber, D. (1995). Dimensions of register variation: A cross-linguistic comparison. Cambridge University Press.

Biber, D. (2004). Conversation text types: A multi-dimensional analysis. In G. Purnelle, C. Fairon, & A. Dister (Eds.), Le poids des mots: Proceedings of the 7th International Conference on the Statistical Analysis of Textual Data (pp. 15–34). Presses universitaires de Louvain. [URL]

Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge University Press.

Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. O’Reilly Media. [URL]

Boersma, P., & Van Heuven, V. (2001). Speak and unSpeak with PRAAT. Glot International, 5(9/10), 341–347. [URL]

Bullock, B. E., Serigos, J., Toribio, A. J., & Wendorf, A. (2018). The challenges and benefits of annotating oral bilingual corpora: The Spanish in Texas Corpus Project. Linguistic Variation, 18(1), 100–119.

Bybee, J. (2006). From usage to grammar: The mind’s response to repetition. Language, 82(4), 711–733.

Campillos Llanos, L. (2016). PoS-tagging a Spanish oral learner corpus. In M. Alonso-Ramos (Ed.), Spanish learner corpus research: Current trends and future perspectives (pp. 89–116). John Benjamins.

Cao, Y., Font-Rotchés, D., & Rius-Escudé, A. (2023). Front vowels of Spanish: A challenge for Chinese speakers. Open Linguistics, 9(1), 20220230.

Carando, A., Minnillo, S., Fernández-Mira, P., Davidson, S., Sagae, K., & Sánchez-Gutiérrez, C. (2023). Writing development in Spanish as a second and heritage language: A corpus study on complexity. Journal of Spanish Language Teaching, 10(1), 59–71.

Carreira, M., & Hitchens Chik, C. (2018). Differentiated teaching: A primer for heritage and mixed classes. In K. Potowski (Ed.), The Routledge handbook of Spanish as a heritage language (pp. 359–374). Routledge. [URL]

Carter, P. M., & Merii, K. D. (2023). Spanish-influenced lexical phenomena in emerging Miami English: Tracking production and perception. English World-Wide, 44(2), 219–250.

Carvalho, A. M. (2012). Corpus del Español en el Sur de Arizona (CESA). [URL]

Chen, H. C., & Han, Q. W. (2020). Designing and implementing a corpus-based online pronunciation learning platform for Cantonese learners of Mandarin. Interactive Learning Environments, 28(1), 18–31.

Crystal, D. (2003). English as a global language (2nd ed.). Cambridge University Press.

Cui, Y., Zhu, J., Yang, L., Fang, X., Chen, X., Wang, Y., & Yang, E. (2022). CTAP for Chinese: A linguistic complexity feature automatic calculation Platform. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference (pp. 5525–5538). European Language Resources Association. [URL]

Davies, M. (2016). Corpus del Español: Two billion words, 21 countries. [URL]

Díaz-Negrillo, A., & Fernández-Domínguez, J. (2006). Error tagging systems for learner corpora. Revista Española de Lingüística Aplicada, 19, 83–102. [URL]

Díez-Bedmar, M. B. (2021). Error analysis. In N. Tracy-Ventura & M. Paquot (Eds.), The Routledge handbook of second language acquisition and corpora (pp. 90–104). Routledge.

Díez-Ortega, M., & Kyle, K. (2023). Measuring the development of lexical richness of L2 Spanish: A longitudinal learner corpus study. Studies in Second Language Acquisition, 46(1), 1–31.

Dybkjær, L., & Bernsen, N. O. (2004). Recommendations for natural interactivity and multimodal annotation schemes. Proceedings of the LREC 2004 Workshop on Multimodal Corpora, Lisbon, Portugal (pp. 5–8). [URL]

Ellis, N. C. (2002). Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition, 24(2), 143–188.

Fernández-Mira, P., Morgan, E., Davidson, S., Yamada, A., Carando, A., Sagae, K., & Sánchez-Gutiérrez, C. H. (2021). Lexical diversity in an L2 Spanish learner corpus: The effect of topic-related variables. International Journal of Learner Corpus Research, 7(2), 230–258.

Fromont, R., & Hay, J. (2012). LaBB-CAT: An annotation store. In Proceedings of the Australasian Language Technology Association Workshop (pp. 113–117).

Gardner, D., & Davies, M. (2014). A new academic vocabulary list. Applied Linguistics, 35(3), 305–327.

Gilquin, G. (2015). From design to collection of learner corpora. In S. Granger (Ed.), The Cambridge handbook of learner corpus research (pp. 9–34). Cambridge University Press.

Gonzalez, S., Grama, J., & Travis, C. E. (2020). Comparing the performance of forced aligners used in sociophonetic research. Linguistics Vanguard, 6(1), 20190058.

Graesser, A. C., McNamara, D. S., & Kulikowich, J. M. (2011). Coh-Metrix: Providing multilevel analyses of text characteristics. Educational Researcher, 40(5), 223–234.

Granger, S. (2003). Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal, 465–480.

Gries, S. T. (2022). How to use statistics in quantitative corpus analysis. In M. McCarthy & A. O’Keeffe (Eds.), The Routledge handbook of corpus linguistics (2nd ed., pp. 168–181). Routledge.

Gudmestad, A., Edmonds, A., & Metzger, T. (2019). Using variationism and learner corpus research to investigate grammatical gender marking in additional language Spanish. Language Learning, 69(4), 911–942.

Gut, U. (2012). The LeaP corpus: A multilingual corpus of spoken. In T. Schmidt & K. Wörner (Eds.), Multilingual corpora and multilingual corpus analysis (pp. 3–23). John Benjamins.

Erker, D., Guy, G. R., Beaman, K. V., Bayley, R., Adli, A., Orozco, R., & Zhang, X. (forthcoming). Subject pronoun variation: A cross-language sociolinguistic study. Cambridge University Press.

Hedlund, G., & Rose, Y. (2020). Phon 3.1 [Computer Software]. [URL]

Hilpert, M., & Blasi, D. E. (2020). Fixed-effects regression modeling. In M. Paquot & S. T. Gries (Eds.), A practical handbook of corpus linguistics (pp. 505–533). Springer.

Huang, C. -R. (2009). Tagged Chinese Gigaword Version 2.0. Linguistic Data Consortium.

Jin, T., & Lu, X. (2018). A data-driven approach to text adaptation in teaching material preparation: Design, implementation, and teacher professional development. TESOL Quarterly, 52(2), 457–467.

Kisler, T., Schiel, F., & Sloetjes, H. (2017). Signal processing via web services: the use case WebMAUS. Computer Speech and Language, 45, 326–347.

Kyle, K. (2020). Measuring lexical richness. In S. Webb (Ed.), The Routledge handbook of vocabulary studies (pp. 454–476). Routledge.

Kyle, K., & Crossley, S. (n.d.). NLP Tools for the Social Sciences. Retrieved 1 November 2023 from [URL]

Labov, W. (1972). Language in the inner city: Studies in the Black English Vernacular. University of Pennsylvania Press.

Liberman, M. Y. (2019). Corpus phonetics. Annual Review of Linguistics, 5(1), 91–107.

Liu, H., & Van Dongen, E. (2013). The Chinese diaspora. Oxford Bibliographies. [URL].

Lozano, C. (2022). CEDEL2: Design, compilation and web interface of an online corpus for L2 Spanish acquisition research. Second Language Research, 38(4), 965–983.

Lozano, C., & Fernández-Mira, P. (2022). Designing, compiling and interrogating corpora in L2 Spanish acquisition research. Journal of Spanish Language Teaching, 9(2), 190–206.

MacWhinney, B. (1998). Models of the emergence of language. Annual Review of Psychology, 49(1), 199–227.

MacWhinney, B. (2000). The CHILDES project. 2: The database. Lawrence Erlbaum Associates.

Maher, J. C. (2017). Multilingualism: A very short introduction. Oxford University Press.

McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal forced aligner: Trainable text-speech alignment using Kaldi. Interspeech, 2017, 498–502.

McEnery, A., & Xiao, Z. (2004). The Lancaster Corpus of Mandarin Chinese: A corpus for monolingual and contrastive language study. Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal. European Language Resources Association (ELRA). [URL]

McEnery, T., Xiao, R., & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. Routledge.

Ming, T., & Tao, H. (2008). Developing a Chinese heritage language corpus: Issues and a preliminary report. In Chinese as a heritage language: Fostering rooted world citizenry (pp. 167–187). University of Hawai’i.

Minnillo, S., Sánchez-Gutiérrez, C., Carando, A., Davidson, S., Mira, P. F., & Sagae, K. (2022). Preterit-imperfect acquisition in L2 Spanish writing: Moving beyond lexical aspect. Research in Corpus Linguistics, 10(1), 156–184.

Mitchell, R., Tracy-Ventura, N., & McManus, K. (2019). Anglophone students abroad: Identity, social relationships, and language learning. Routledge.

Nagy, N. (2011). A multilingual corpus to explore geographic variation. Rassegna Italiana di Linguistica Applicata, 43(1–2), 65–84.

O’Keeffe, A., & McCarthy, M. (Eds.). (2010). The Routledge handbook of corpus linguistics. Routledge.

Padró, L., & Stanilovsky, E. (2012). Freeling 3.0: Towards wider multilinguality. LREC2012.

Paquot, M., & Plonsky, L. (2017). Quantitative research methods and study quality in learner corpus research. International Journal of Learner Corpus Research, 3(1), 61–94.

Paris, D., & Alim, H. S. (2017). Culturally sustaining pedagogies: Teaching and learning for justice in a changing world. Teachers College Press.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830. [URL]

Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., & Veselý, K. (2011). The Kaldi speech recognition toolkit. IEEE 2011 Workshop on Automatic Speech Recognition and Understanding.

Real Academia Española. Banco de datos (CORPES XXI) Corpus del Español del Siglo XXI. [URL]

Real Academia Española. Banco de datos (CREA) Corpus de Referencia del Español Actual. [URL]

Real Academia Española. Banco de datos (CORDE) Corpus Diacrónico del Español. [URL]

Rojo, G., & Palacios, I. M. (2016). Learner Spanish on computer. The CAES ‘Corpus de Aprendices de Español’ project. In M. Alonso-Ramos (Ed.), Spanish Learner Corpus Research (pp. 55–87). John Benjamins.

Rojo, G., Palacios, I., Sampedro Mella, M., & Marsily, A. (2022). Los corpus de aprendices de español LE/L2: panorama actual y perspectivas futuras. Journal of Spanish Language Teaching, 9(2), 174–189.

Rosenfelder, I., Fruehwald, J., Evanini, K., Seyfarth, S., Gorman, K., Prichard, H., & Yuan, J. (2015). FAVE: Speaker fix (v1.2.2). Zenodo.

Sampedro Mella, M. (2021). Los adverbios locativos de ubicación en el español como lengua extranjera. In C. Ballestero de Celis & M. Sampedro Mella (Eds.), Aportes del CAES a la enseñanza del español como lengua extranjera (pp. 24–50). Universidad de Santiago de Compostela.

Sánchez-Gutiérrez, C., De Cock, B., & Tracy-Ventura, N. (2022). Spanish corpora and their pedagogical uses: challenges and opportunities. Journal of Spanish Language Teaching, 9(2), 105–115.

Sánchez-Gutiérrez, C. H., Minnillo, S., Mira, P. F., & Hernández, A. (2024). Prompt response variation in learner corpus research: Implications for data interpretation. Research Methods in Applied Linguistics, 3(3).

Sankoff, D. (1988). Sociolinguistics and syntactic variation. In F. J. Newmeyer (Ed.), Linguistics: The Cambridge survey (pp. 140–161). Cambridge University Press.

Schäfer, R. (2020). Mixed-effects regression modeling. In M. Paquot & S. T. Gries (Eds.), A practical handbook of corpus linguistics (pp. 535–561). Springer.

Sloetjes, H., & Wittenburg, P. (2008, May). Annotation by Category: ELAN and ISO DCR. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). [URL]

SpaCy. (2023). Explosion. [URL]

Sung, Y. -T., Chang, T. -H., Lin, W. -C., Hsieh, K. -S., & Chang, K. -E. (2016). CRIE: An automated analyzer for Chinese texts. Behavior Research Methods, 48(4), 1238–1251.

Tomasello, M. (2000). First steps toward a usage-based theory of language acquisition. Cognitive Linguistics, 11(1–2), 61–82.

Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition. Harvard University Press.

Tracy-Ventura, N., & Myles, F. (2015). The importance of task variability in the design of learner corpora for SLA research. International Journal of Learner Corpus Research, 1(1), 58–95.

Weinreich, U., Labov, W., & Herzog, M. (1968). Empirical foundations for a theory of language change. In W. Lehmann & Y. Malkiel (Eds.), Directions for historical linguistics (pp. 95–188). University of Texas Press. [URL]

Winter, B. (2013). Linear models and linear mixed effects models in R with linguistic applications. ArXiv Preprint, 1308.5499.

Wolfram, W. (1969). A sociolinguistic study of Detroit Negro speech. Center for Applied Linguistics.

Wu, B., Xie, Y., Lu, L., Cao, C., & Zhang, J. (2016). The construction of a Chinese interlanguage corpus. In 2016 Conference of The Oriental Chapter of International Committee for Coordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA) (pp. 183–187).

Wu, C., & Shih, C. (2014). A design of the spontaneous Chinese learner speech corpus. 神戸大学国際コミュニケーションセンター.

Xu, H., Jiang, M., Lin, J., & Huang, C. -R. (2022). Light verb variations and varieties of Mandarin Chinese: Comparable corpus driven approaches to grammatical variations. Corpus Linguistics and Linguistic Theory, 18(1), 145–173.

Xu, J. (2019). The corpus approach to the teaching and learning of Chinese as an L1 and an L2 in retrospect. In X. Lu & B. Chen (Eds.), Computational and corpus approaches to Chinese language learning (pp. 33–53). Springer.

Xu, J. (2015). Corpus-based Chinese studies: A historical review from the 1920s to the present. Chinese Language and Discourse, 6(2), 218–244.

Yamada, A., Davidson, S., Fernández-Mira, P., Carando, A., Sagae, K., & Sánchez-Gutiérrez, C. (2020). COWS-L2H: A corpus of Spanish learner writing. Research in Corpus Linguistics, 8(1), 17–32.

Young, R., & Bayley, R. (1996). Varbrul analysis for second language acquisition research. In R. Bayley & D. R. Preston (Eds.), Second language acquisition and linguistic variation (pp. 253–306). John Benjamins.

Zhang, J., & Lu, X. (2013). Variability in Chinese as a foreign language learners’ development of the Chinese numeral classifier system. The Modern Language Journal, 97(S1), 46–60.

Zhang, X. (2023). Language variation in teacher speech in a dual immersion preschool. Proceedings of the Linguistic Society of America, 8(1), 5474.

Zhang, Z. (2012). A corpus study of variation in written Chinese. Corpus Linguistics and Linguistic Theory, 8(1), 209–240.

Zyzik, E. (2009). The role of input revisited: Nativist versus usage-based models. L2 Journal: An Electronic Refereed Journal for Foreign and Second Language Educators, 1(1).
