Two (more) suggestions for improvement: MuPDAR for corpus-based learner and variety studies

Gries, Stefan Th.

doi:10.1075/scl.105.09gri

In: Broadening the Spectrum of Corpus Linguistics: New approaches to variability and change
Edited by Susanne Flach and Martin Hilpert
[Studies in Corpus Linguistics 105] 2022
pp. 257–283

Stefan Th. Gries | University of California, Santa Barbara | Justus Liebig University Giessen

Published online: 10 November 2022

Abstract

Corpus-based studies of learner language and (especially) English varieties have become more quantitative in nature and increasingly use regression-based methods and classifiers such as classification trees and random forests. One recent development that is becoming more widely used is the MuPDAR (Multifactorial Prediction and Deviation Analysis using Regressions) approach of Gries and Deshors (2014) and Gries and Adelman (2014). This approach attempts to improve on traditional regression- or tree-based approaches by, first, training a model/classifier on the reference speakers (often native speakers in learner corpus studies or British English speakers in variety studies) and, second, using this model/classifier to predict what such a reference speaker would produce in the situation the target speaker (often a non-native speaker or an indigenized-variety speaker) is in. The third step then consists of determining whether or not the target speakers made a canonical choice and exploring that variability with a second regression model or classifier. The present paper is a follow-up to Gries and Deshors (2020) and offers additional answers to a variety of questions that readers and audiences of MuPDAR presentations have been raising for a few years. First, I show how MuPDAR can be extended straightforwardly to alternations that involve more than the typically used binary choices; I do so in a way that also addresses another potential challenge, and I exemplify this with a case study from varieties research. Second, I outline a casewise-similarity approach to predicting what reference speakers would do that avoids frequent regression modeling problems; I exemplify it, and compare it to competing alternatives, with a case study from learner corpus research.
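
The prediction-and-deviation logic summarized in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the abstract does not say how casewise similarity is computed, so the sketch assumes a Gower-style similarity (Gower 1971 appears in the reference list) combined with a majority vote over the k most similar reference cases. The predictor names (`rec_len`, `theme_len`, `rec_animate`), the three outcome labels, the value of `k`, and all data points are invented for illustration only.

```python
from collections import Counter

NUMERIC = ("rec_len", "theme_len")  # hypothetical numeric predictors


def make_case(rec_len, theme_len, rec_animate, choice):
    """Bundle hypothetical predictors (x) with the observed choice (y)."""
    x = {"rec_len": rec_len, "theme_len": theme_len, "rec_animate": rec_animate}
    return {"x": x, "y": choice}


def numeric_ranges(cases):
    """Range (max - min) of each numeric predictor in the reference data."""
    return {
        name: max(c["x"][name] for c in cases) - min(c["x"][name] for c in cases)
        for name in NUMERIC
    }


def gower_similarity(a, b, ranges):
    """Mean per-predictor similarity: range-scaled for numeric predictors,
    exact match (1 or 0) for categorical ones (an assumed, Gower-style measure)."""
    scores = []
    for name, value in a.items():
        if name in ranges:
            span = ranges[name]
            diff = abs(value - b[name]) / span if span else 0.0
            scores.append(max(0.0, 1.0 - diff))  # clamp values outside the range
        else:
            scores.append(1.0 if value == b[name] else 0.0)
    return sum(scores) / len(scores)


def predict_reference_choice(x, reference, ranges, k=3):
    """Majority choice among the k reference cases most similar to x."""
    ranked = sorted(
        reference,
        key=lambda r: gower_similarity(x, r["x"], ranges),
        reverse=True,
    )
    votes = Counter(r["y"] for r in ranked[:k])
    return votes.most_common(1)[0][0]


# Step 1: the reference speakers' cases serve as the "model" (invented data,
# three outcome levels to mimic a non-binary alternation).
reference = [
    make_case(1, 4, "yes", "ditransitive"),
    make_case(1, 5, "yes", "ditransitive"),
    make_case(2, 6, "yes", "ditransitive"),
    make_case(5, 1, "no", "prepositional"),
    make_case(6, 2, "no", "prepositional"),
    make_case(4, 1, "no", "prepositional"),
    make_case(3, 3, "no", "passive"),
    make_case(3, 4, "no", "passive"),
    make_case(4, 4, "no", "passive"),
]
ranges = numeric_ranges(reference)

# Step 2: predict what a reference speaker would choose in each target context.
target = [
    make_case(1, 5, "yes", "ditransitive"),
    make_case(5, 2, "no", "ditransitive"),
    make_case(3, 4, "no", "passive"),
]
predicted = [predict_reference_choice(t["x"], reference, ranges) for t in target]

# Step 3: flag target choices that differ from the predicted reference choice.
deviations = [p != t["y"] for p, t in zip(predicted, target)]
print(list(zip(predicted, deviations)))
```

In a full analysis, the `deviations` flags would then serve as the response variable of the second regression model or classifier mentioned in the abstract; that step is omitted here. Because the vote is taken over outcome labels, the same sketch works unchanged for two, three, or more alternants.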

Keywords: corpus-based alternation research, learner corpus research, variety research, MuPDAR, predictive modeling

Article outline

1. Introduction
- 1.1 General introduction
- 1.2 Motivation of the present paper
2. Case study 1: The dative and voice alternation across varieties
- 2.1 Introduction
- 2.2 MuPDAR: Steps i and ii
- 2.3 MuPDAR: Step iii for a multinomial context
  - 2.3.1 The simple version
  - 2.3.2 The better version
- 2.4 MuPDAR: Step iv for a multinomial context
- 2.5 Interim conclusion
3. Case study 2: The dative alternation by learners
- 3.1 Introduction
- 3.2 The proposed classifier
- 3.3 Results
  - 3.3.1 Prediction accuracies without and with ‘either’ cases
  - 3.3.2 Comparison and validation
4. Concluding remarks
Note
References

References (30)

Baayen, R. Harald & Ramscar, Michael. 2015. Abstraction, storage, and naive discriminative learning. In Handbook of Cognitive Linguistics, Ewa Dąbrowska & Dagmar S. Divjak (eds), 100–120. Berlin: Mouton de Gruyter.

Bernaisch, Tobias, Gries, Stefan Th., & Mukherjee, Joybrato. 2014. The dative alternation in South Asian English(es): Modelling predictors and predicting prototypes. English World-Wide 35(1): 7–31.

Boulesteix, Anne-Laure, Janitza, Silke, Hapfelmeier, Alexander, Van Steen, Kristel & Strobl, Carolin. 2015. Letter to the editor: On the term ‘interaction’ and related phrases in the literature on Random Forests. Briefings in Bioinformatics 16(2): 338–345.

Daelemans, Walter & van den Bosch, Antal. 2005. Memory-Based Language Processing. Cambridge: CUP.

Daelemans, Walter, Zavrel, Jakub, van der Sloot, Ko, & van den Bosch, Antal. 2018. TiMBL: Tilburg Memory-Based Learner. Version 6.4 Reference Guide. ILK Technical Report – ILK 11–01. <[URL]> (4 April 2022).

Deshors, Sandra C. 2020. English as a Lingua Franca: A random forests approach to particle placement in multi-participant interactions. International Journal of Applied Linguistics 30(2): 214–231.

Deshors, Sandra C. To appear. Contextualizing past tenses in L2: Combined effects and interactions in the present perfect vs. simple past alternation. Applied Linguistics.

Deshors, Sandra C. & Gries, Stefan Th. 2016. Profiling verb complementation constructions across New Englishes: A two-step random forests analysis to ing vs. to complements. International Journal of Corpus Linguistics 21(2): 192–218.

Deshors, Sandra C. & Gries, Stefan Th. 2020. Mandative subjunctive vs. should in world Englishes: A new take on an old alternation. Corpora 15(2): 213–241.

Divjak, Dagmar S., Arppe, Antti & Dąbrowska, Ewa. 2016. Machine meets man: Evaluating the psychological reality of corpus-based probabilistic models. Cognitive Linguistics 27(1): 1–34.

Gower, J. C. 1971. A general coefficient of similarity and some of its properties. Biometrics 27(4): 857–871.

Gries, Stefan Th. 2020. On classification trees and random forests in corpus linguistics: Some words of caution and suggestions for improvement. Corpus Linguistics and Linguistic Theory 16(3): 617–647.

Gries, Stefan Th. 2021. Statistics For Linguistics with R, 3rd rev. and ext. edn. Berlin: De Gruyter.

Gries, Stefan Th. & Adelman, Allison S. 2014. Subject realization in Japanese conversation by native and non-native speakers: Exemplifying a new paradigm for learner corpus research. In Yearbook of Corpus Linguistics and Pragmatics 2014: New empirical and theoretical paradigms, Jesús Romero-Trillo (ed.), 35–54. Cham: Springer.

Gries, Stefan Th. & Deshors, Sandra C. 2014. Using regressions to explore deviations between corpus data and a standard/target: Two suggestions. Corpora 9(1): 109–136.

Gries, Stefan Th. & Deshors, Sandra C. 2020. There’s more to alternations than the main diagonal of a 2×2 confusion matrix: Improvements of MuPDAR and other classificatory alternation studies. ICAME Journal 44: 69–96.

Heller, Benedikt, Bernaisch, Tobias, & Gries, Stefan Th. 2017. Empirical perspectives on two potential epicenters: The genitive alternation in Asian Englishes. ICAME Journal 41: 111–144.

Klavan, Jane & Divjak, Dagmar S. 2016. The cognitive plausibility of statistical classification models: Comparing textual and behavioral evidence. Folia Linguistica 50(2): 355–384.

Kolbe-Hanna, Daniela & Baldus, Lina. 2018. The choice between -ing and to complement clauses in English as first, second and foreign language. Paper presented at ICAME 39, University of Tampere.

Kruger, Haidee & De Sutter, Gert. 2018. Alternation in contact and non-contact varieties: Reconceptualising that-omission in translated and non-translated English using the MuPDAR approach. Translation, Cognition & Behavior 1(2): 251–290.

Lester, Nicholas A. 2019. That’s hard: Relativizer use in spontaneous L2 speech. International Journal of Learner Corpus Research 5(1): 1–32.

Milin, Petar, Divjak, Dagmar S., Dimitrijević, Strahinja & Baayen, R. Harald. 2016. Towards cognitively plausible data science in language research. Cognitive Linguistics 27(4): 507–526.

Podani, János. 1999. Extending Gower’s general coefficient of similarity to ordinal characters. Taxon 48: 331–340.

Schweinberger, Martin. 2020. A corpus-based analysis of differences in the use of very for adjective amplification among native speakers and learners of English. International Journal of Learner Corpus Research 6(2): 163–192.

Torgo, Luis. 2011. Data Mining with R: Learning with Case Studies. Boca Raton FL: Chapman & Hall/CRC.

Werner, Valentin, Fuchs, Robert, & Götz, Sandra. 2020. L1 influence vs. universal mechanisms: An SLA-driven corpus study on temporal expression. In Learner Corpora and Second Language Acquisition Research, Bert Le Bruyn & Magali Paquot (eds), 39–66. Cambridge: CUP.

Wright, Marvin N., Ziegler, Andreas, & König, Inke R. 2016. Do little interactions get lost in dark random forests? BMC Bioinformatics 17(145).

Wulff, Stefanie & Gries, Stefan Th. 2015. Prenominal adjective order preferences in Chinese and German L2 English: A multifactorial corpus study. Linguistic Approaches to Bilingualism 5(1): 122–150.

Wulff, Stefanie & Gries, Stefan Th. 2019. Particle placement in learner English: Measuring effects of context, first language, and individual variation. Language Learning 69(4): 873–910.

Wulff, Stefanie & Gries, Stefan Th. 2020. Explaining individual variation in learner corpus research: Some methodological suggestions. In Learner Corpora and Second Language Acquisition Research, Bert Le Bruyn & Magali Paquot (eds), 191–213. Cambridge: CUP.

Cited by (2)

Kocher, Anna. 2025. The subject realization in L2 Spanish by German L1 speakers. Isogloss. Open Journal of Romance Linguistics 11(2): 1 ff.

Granger, Sylviane. 2024. From early to future learner corpus research. International Journal of Learner Corpus Research 10(2): 247 ff.

This list is based on CrossRef data as of 1 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.