On over- and underuse in learner corpus research and multifactoriality in corpus linguistics more generally

Gries, Stefan Th.

doi:10.1075/jsls.00005.gri

Article published in: Journal of Second Language Studies
Vol. 1:2 (2018) ► pp. 277–309

Stefan Th. Gries | University of California, Santa Barbara | Justus Liebig University Giessen

Published online: 27 August 2018

https://doi.org/10.1075/jsls.00005.gri

Abstract

This paper critically discusses how corpus linguistics in general, and learner corpus research in particular, has been dealing with frequency data of all sorts, especially over- and underuse frequencies. On the basis of learner corpus data, I demonstrate the pitfalls of using aggregate data and of the lack of statistical control that unfortunately characterizes much work. In fact, I will demonstrate that monofactorial methods have very little to offer to research on observational data. While this paper is admittedly very didactic and methodological, I think the discussion of the empirical data offered here – a reanalysis of previously published work – shows how misleading many studies potentially are and has far-reaching implications for much of corpus linguistics and learner corpus research. Ideally/maximally, this paper together with Paquot & Plonsky (2017, International Journal of Learner Corpus Research) would lead to a complete revision of how learner corpus linguists use quantitative methods and study over-/underuse; minimally, this paper would stimulate a much-needed discussion of the methodological sophistication that is currently lacking.

Keywords: learner corpora, speaker/file variation, multifactorial analysis, over-/underuse

Article outline

1. Introduction
- 1.1 General introduction
- 1.2 The two goals of this paper
2. A regression-modeling approach to over- and underuse
- 2.1 The data: quite in learner and native-speaker data
- 2.2 A generalized linear model on the Hasselgård and Johansson data
- 2.3 Why a generalized linear model is not enough
  - 2.3.1 Which files to include?
  - 2.3.2 The role of individual speakers
- 2.4 A generalized linear mixed-effects model on the Hasselgård and Johansson data
3. Multifactoriality: What it means for over- and underuse and in general
- 3.1 Why virtually every corpus study, every one, needs to be multifactorial
- 3.2 The role of other predictors
  - 3.2.1 Traditional multifactorial regression modeling
  - 3.2.2 Multifactorial Prediction and Deviation Analysis Using Regressions (or other classifiers)
  - 3.2.3 MuPDAR: A very brief example
4. Concluding remarks
Acknowledgements
Notes
References

References

Aijmer, K. (2002). Modality in advanced Swedish learners’ written interlanguage. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition, and foreign language teaching (pp. 55–76). Amsterdam: John Benjamins.

Altenberg, B. (2002). Using bilingual corpus evidence in learner corpus research. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition, and foreign language teaching (pp. 37–54). Amsterdam: John Benjamins.

Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York, NY: Springer.

Connor, U., Precht, K., & Upton, T. (2002). Business English: Learner data from Belgium, Finland, and the U.S. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition, and foreign language teaching (pp. 175–194). Amsterdam: John Benjamins.

Doğruöz, A. S., & Gries, S. Th. (2012). Spread of on-going changes in an immigrant language: Turkish in the Netherlands. Review of Cognitive Linguistics, 10(2), 401–426.

Fox, J. (2003). Effect displays in R for generalised linear models. Journal of Statistical Software, 8(15), 1–27.

Gilquin, G., & Granger, S. (2011). From EFL to ESL: Evidence from the International Corpus of Learner English. In J. Mukherjee & M. Hundt (Eds.), Exploring second-language varieties of English and learner Englishes: Bridging a paradigm gap (pp. 55–78). Amsterdam: John Benjamins.

Gilquin, G., & Lefer, M.-A. (2017). Exploring word-formation in Learner Corpus Research: A case study on English negative affixes. Paper presented at the Learner Corpus Research conference 2017, Bolzano, Italy.

Gries, S. Th. (2006). Exploring variability within and between corpora: Some methodological considerations. Corpora, 1(2), 109–151.

Gries, S. Th. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437.

Gries, S. Th. (2013). Statistics for linguistics with R (2nd rev. and ext. ed.). Berlin: De Gruyter Mouton.

Gries, S. Th. (2015). The most underused statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora, 10(1), 95–125.

Gries, S. Th., & Adelman, A. S. (2014). Subject realization in Japanese conversation by native and non-native speakers: Exemplifying a new paradigm for learner corpus research. In J. Romero-Trillo (Ed.), Yearbook of corpus linguistics and pragmatics 2014: New empirical and theoretical paradigms (pp. 35–54). Cham: Springer.

Gries, S. Th., & Deshors, S. C. (2014). Using regressions to explore deviations between corpus data and a standard/target: Two suggestions. Corpora, 9(1), 109–136.

Gries, S. Th. (to appear). Priming of syntactic alternations by learners of English: An analysis of sentence-completion and collostructional results.

Gries, S. Th., & Wulff, S. (2009). Psycholinguistic and corpus linguistic evidence for L2 constructions. Annual Review of Cognitive Linguistics, 7(1), 163–186.

Hasselgård, H., & Johansson, S. (2011). Learner corpora and contrastive interlanguage analysis. In F. Meunier, S. De Cock, G. Gilquin, & M. Paquot (Eds.), A taste for corpora: In honour of Sylviane Granger (pp. 33–61). Amsterdam: John Benjamins.

Hawkins, J. A. (1994). A performance theory of order and constituency. Cambridge: Cambridge University Press.

Hyland, K., & Milton, J. (1997). Qualification and certainty in L1 and L2 students’ writing. Journal of Second Language Writing, 6(2), 183–205.

Jaeger, T. F. (2010). Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61(1), 23–62.

Labov, W. (1982). The social stratification of English in New York City. Washington, DC: Center for Applied Linguistics.

Laufer, B., & Waldman, T. (2011). Verb-noun collocations in second language writing: A corpus analysis of learners’ English. Language Learning, 61(2), 647–672.

Neff van Aertselaer, J., & Bunce, C. (2012). The use of small corpora for tracing the development of academic literacies. In F. Meunier, S. De Cock, G. Gilquin, & M. Paquot (Eds.), A taste for corpora: In honour of Sylviane Granger (pp. 63–83). Amsterdam: John Benjamins.

Paquot, M., & Plonsky, L. (2017). Quantitative research methods and study quality in learner corpus research. International Journal of Learner Corpus Research, 3(1), 61–94.

Wulff, S. (2016). A friendly conspiracy of input, L1, and processing demands: that-variation in German and Spanish learner language. In A. Tyler, L. Ortega, H. I. Park, & M. Uno (Eds.), The usage-based study of language learning and multilingualism (pp. 115–136). Washington, DC: Georgetown University Press.

Wulff, S., Lester, N. A., & Martinez-Garcia, M. M. (2014). That-variation in German and Spanish L2 English. Language and Cognition, 6(2), 271–299.

Cited by 37 other publications

Bernaisch, Tobias, Aishath Suad & Aminath Saeed

2025. Particle verbs versus simplex verbs in Maldivian English. World Englishes 44:3 ► pp. 339 ff.

Botha, Werner & Tobias Bernaisch

2025. Social network effects on particle variation among Singapore students. World Englishes 44:1-2 ► pp. 144 ff.

Degenhardt, Julia

2025. Parentheticals in spoken Indian and Sri Lankan English. World Englishes 44:1-2 ► pp. 184 ff.

Dubois, Tanguy, Magali Paquot & Benedikt Szmrecsanyi

2025. Future-time reference in spoken EFL. Journal of Second Language Studies 8:1 ► pp. 89 ff.

Huang, Michelle Zeping, Mariah Chan & Jianwen Liu

2025. Stance and engagement in digital oratory. Journal of Second Language Studies

Thomas, Anita

2025. Researching L2 French input and instructed learning. In Approaches and Methods in French Second Language Acquisition Research [Research Methods in Applied Linguistics, 9], ► pp. 261 ff.

Casal, J. Elliott, Genggeng Zhang, Ghadi Matouq & Hana Alqabba

2024. ‘These results are inconsistent’. Journal of Second Language Studies 7:2 ► pp. 320 ff.

Gries, Stefan Th.

2024. Against level-3-only analyses in corpus linguistics. ICAME Journal 48:1 ► pp. 23 ff.

Gries, Stefan Th.

2025. Corpus Linguistics: Quantitative Methods. In The Encyclopedia of Applied Linguistics, ► pp. 1 ff.

Leuckert, Sven, Claudia Lange, Tobias Bernaisch & Asya Yurchenko

2024. Indian Englishes in the Twenty-First Century.

Paquot, Magali

2024. Learner corpus research: a critical appraisal and roadmap for contributing (more) to SLA research agendas. Corpus Linguistics and Linguistic Theory 20:3 ► pp. 567 ff.

Zhang, Genggeng

2024. Emerging engineering scholars’ stance in citations. Journal of Second Language Studies 7:2 ► pp. 347 ff.

de Baets, Pauline & Gert de Sutter

2023. How do translators select among competing (near-)synonyms in translation?. Target. International Journal of Translation Studies 35:1 ► pp. 1 ff.

Gonzales, Wilkinson Daniel Wong, Mie Hiramoto, Jakob R. E. Leimgruber & Jun Jie Lim

2023. The Corpus of Singapore English Messages (CoSEM). World Englishes 42:2 ► pp. 371 ff.

Pyykönen, Maria

2023. Epistemic stance in written L2 English: The role of task type, L2 proficiency, and authorial style. Applied Corpus Linguistics 3:1 ► pp. 100040 ff.

Bernaisch, Tobias, Stefan Th. Gries & Benedikt Heller

2022. Theoretical models and statistical modelling of linguistic epicentres. World Englishes 41:3 ► pp. 333 ff.

Chen, Jianhua & Xiaopeng Zhang

2022. L2 development of phraseological knowledge via a xu-argument based continuation task: A latent curve modeling approach. System 106 ► pp. 102767 ff.

Paquot, Magali, Dana Gablasova, Vaclav Brezina & Hubert Naets

2022. Phraseological complexity in EFL learners’ spoken production across proficiency levels. In Complexity, Accuracy and Fluency in Learner Corpus Research [Studies in Corpus Linguistics, 104], ► pp. 115 ff.

Staples, Shelley

2022. Review of Durrant, Brenchley & McCallum (2021): Understanding Development and Proficiency in Writing: Quantitative Corpus Linguistic Approaches. International Journal of Learner Corpus Research 8:2 ► pp. 283 ff.

Wisniewski, Katrin

2022. Gesprochene Lernerkorpora des Deutschen: Eine Bestandsaufnahme. Zeitschrift für germanistische Linguistik 50:1 ► pp. 1 ff.

König, Alexander, Jennifer-Carmen Frey & Egon W. Stemle

2021. Exploring Reusability and Reproducibility for a Research Infrastructure for L1 and L2 Learner Corpora. Information 12:5 ► pp. 199 ff.

Sönning, Lukas & Valentin Werner

2021. The replication crisis, scientific revolutions, and linguistics. Linguistics 59:5 ► pp. 1179 ff.

Winter, Bodo & Martine Grice

2021. Independence and generalizability in linguistics. Linguistics 59:5 ► pp. 1251 ff.

Bernaisch, Tobias

2020. Introduction. In Gender in World Englishes, ► pp. 1 ff.

Bernaisch, Tobias

2022. Comparing Generalised Linear Mixed-Effects Models, Generalised Linear Mixed-Effects Model Trees and Random Forests. In Data and Methods in Corpus Linguistics, ► pp. 163 ff.

De Sutter, Gert & Marie-Aude Lefer

2020. On the need for a new research agenda for corpus-based translation studies: a multi-methodological, multifactorial and interdisciplinary approach. Perspectives 28:1 ► pp. 1 ff.

Gries, Stefan Th. & Philip Durrant

2020. Analyzing Co-occurrence Data. In A Practical Handbook of Corpus Linguistics, ► pp. 141 ff.

Myles, Florence

2020. Commentary: An SLA Perspective on Learner Corpus Research. In Learner Corpus Research Meets Second Language Acquisition, ► pp. 258 ff.

Paquot, Magali & Marcus Callies

2020. Promoting methodological expertise, transparency, replication, and cumulative learning. International Journal of Learner Corpus Research 6:2 ► pp. 121 ff.

Paquot, Magali, Hubert Naets & Stefan Th. Gries

2020. Using Syntactic Co-occurrences to Trace Phraseological Complexity Development in Learner Writing: Verb + Object Structures in LONGDALE. In Learner Corpus Research Meets Second Language Acquisition, ► pp. 122 ff.

Schweinberger, Martin

2020. A corpus-based analysis of differences in the use of very for adjective amplification among native speakers and learners of English. International Journal of Learner Corpus Research 6:2 ► pp. 163 ff.

Schweinberger, Martin

2020. How Learner Corpus Research can inform language learning and teaching. Australian Review of Applied Linguistics 43:2 ► pp. 196 ff.

Stratton, James M.

2020. A diachronic analysis of the adjective intensifier well from Early Modern English to Present Day English. Canadian Journal of Linguistics/Revue canadienne de linguistique 65:2 ► pp. 216 ff.

Wulff, Stefanie & Stefan Th. Gries

2019. Particle Placement in Learner Language. Language Learning 69:4 ► pp. 873 ff.

Wulff, Stefanie & Stefan Th. Gries

2020. Exploring Individual Variation in Learner Corpus Research: Methodological Suggestions. In Learner Corpus Research Meets Second Language Acquisition, ► pp. 191 ff.

[no author supplied]

2021. Corpora, Constructions, New Englishes [Studies in Corpus Linguistics, 100].

[no author supplied]

2024. Textbook English [Studies in Corpus Linguistics, 116].

This list is based on CrossRef data as of 13 November 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.