Article published in: Journal of Second Language Studies
Vol. 1:2 (2018), pp. 277–309
On over- and underuse in learner corpus research and multifactoriality in corpus linguistics more generally
Published online: 27 August 2018
https://doi.org/10.1075/jsls.00005.gri
Abstract
This paper critically discusses how corpus linguistics in general, but learner corpus research in particular, has been dealing with
all sorts of frequency data in general, but over- and underuse frequencies in particular. I demonstrate on the basis of learner
corpus data the pitfalls of using aggregate data and lacking statistical control that much work is unfortunately characterized by.
In fact, I will demonstrate that monofactorial methods have very little to offer at all to research on observational data. While
this paper is admittedly very didactic and methodological, I think the discussion of the empirical data offered here – a
reanalysis of previously published work – shows how misleading many studies potentially are and provides far-reaching implications for
much of corpus linguistics and learner corpus research. Ideally/maximally, this paper together with Paquot & Plonsky (2017, Intntl. J. of Learner Corpus Research) would lead to a complete
revision of how learner corpus linguists use quantitative methods and study over-/underuse; minimally, this paper would stimulate
a much-needed discussion of currently lacking methodological sophistication.
Article outline
- 1. Introduction
- 1.1 General introduction
- 1.2 The two goals of this paper
- 2. A regression-modeling approach to over- and underuse
- 2.1 The data: quite in learner and native-speaker data
- 2.2 A generalized linear model on the Hasselgård and Johansson data
- 2.3 Why a generalized linear model is not enough
- 2.3.1 Which files to include?
- 2.3.2 The role of individual speakers
- 2.4 A generalized linear mixed-effects model on the Hasselgård and Johansson data
- 3. Multifactoriality: What it means for over- and underuse and in general
- 3.1 Why virtually every corpus study, every one, needs to be multifactorial
- 3.2 The role of other predictors
- 3.2.1 Traditional multifactorial regression modeling
- 3.2.2 Multifactorial Prediction and Deviation Analysis Using Regressions (or other classifiers)
- 3.2.3 MuPDAR: A very brief example
- 4. Concluding remarks
- Acknowledgements
- Notes
References
Aijmer, K. (2002). Modality in advanced Swedish learners’ written interlanguage. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition, and foreign language teaching (pp. 55–76). Amsterdam: John Benjamins.
Altenberg, B. (2002). Using bilingual corpus evidence in learner corpus research. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition, and foreign language teaching (pp. 37–54). Amsterdam: John Benjamins.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York, NY: Springer.
Connor, U., Precht, K., & Upton, T. (2005). Business English: Learner data from Belgium, Finland, and the U.S. In S. Granger, J. Hung, & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition, and foreign language teaching (pp. 175–194). Amsterdam: John Benjamins.
Doğruöz, A. S., & Gries, S. Th. (2012). Spread of on-going changes in an immigrant language: Turkish in the Netherlands. Review of Cognitive Linguistics, 10(2), 401–426.
Fox, J. (2003). Effect displays in R for generalised linear models. Journal of Statistical Software, 8(15), 1–27.
Gilquin, G., & Granger, S. (2011). From EFL to ESL: Evidence from the International Corpus of Learner English. In J. Mukherjee & M. Hundt (Eds.), Exploring second-language varieties of English and learner Englishes: Bridging a paradigm gap (pp. 55–78). Amsterdam: John Benjamins.
Gilquin, G., & Lefer, M.-A. (2017). Exploring word-formation in Learner Corpus Research: A case study on English negative affixes. Paper presented at the Learner Corpus Research conference 2017, Bolzano, Italy.
Gries, S. Th. (2006). Exploring variability within and between corpora: some methodological considerations. Corpora, 1(2), 109–151.
Gries, S. Th. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437.
Gries, S. Th. (2015). The most underused statistical method in corpus linguistics: Multi-level (and mixed-effects) models. Corpora, 10(1), 95–125.
Gries, S. Th., & Adelman, A. S. (2014). Subject realization in Japanese conversation by native and non-native speakers: Exemplifying a new paradigm for learner corpus research. In J. Romero-Trillo (Ed.), Yearbook of corpus linguistics and pragmatics 2014: New empirical and theoretical paradigms (pp. 35–54). Cham: Springer.
Gries, S. Th., & Deshors, S. C. (2014). Using regressions to explore deviations between corpus data and a standard/target: Two suggestions. Corpora, 9(1), 109–136.
Gries, S. Th. (to appear). Priming of syntactic alternations by learners of English: An analysis of sentence-completion and collostructional results.
Gries, S. Th., & Wulff, S. (2009). Psycholinguistic and corpus linguistic evidence for L2 constructions. Annual Review of Cognitive Linguistics, 7(1), 163–186.
Hasselgård, H., & Johansson, S. (2011). Learner corpora and contrastive interlanguage analysis. In F. Meunier, S. De Cock, G. Gilquin, & M. Paquot (Eds.), A taste for corpora: In honour of Sylviane Granger (pp. 33–61). Amsterdam: John Benjamins.
Hawkins, J. A. (1994). A performance theory of order and constituency. Cambridge: Cambridge University Press.
Hyland, K., & Milton, J. (1997). Qualification and certainty in L1 and L2 students’ writing. Journal of Second Language Writing, 6(2), 183–205.
Jaeger, T. F. (2010). Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61(1), 23–62.
Labov, W. (1982). The social stratification of English in New York City. Washington, DC: Center for Applied Linguistics.
Laufer, B., & Waldman, T. (2011). Verb-noun collocations in second language writing: A corpus analysis of learners’ English. Language Learning, 61(2), 647–672.
Neff van Aertselaer, J., & Bunce, C. (2012). The use of small corpora for tracing the development of academic literacies. In F. Meunier, S. De Cock, G. Gilquin, & M. Paquot (Eds.), A taste for corpora: In honour of Sylviane Granger (pp. 63–83). Amsterdam: John Benjamins.
Paquot, M., & Plonsky, L. (2017). Quantitative research methods and study quality in learner corpus research. International Journal of Learner Corpus Research, 3(1), 61–94.
Wulff, S. (2016). A friendly conspiracy of input, L1, and processing demands: that-variation in German and Spanish learner language. In A. Tyler, L. Ortega, H. I. Park, & M. Uno (Eds.), The usage-based study of language learning and multilingualism (pp. 115–136). Washington, DC: Georgetown University Press.
