Reproducibility, replicability, robustness, and generalizability in corpus linguistics

Flanagan, Joseph

doi:10.1075/ijcl.24113.fla

Article published in: Reproducibility, Replicability, and Robustness in Corpus Linguistics
Edited by Martin Schweinberger and Michael Haugh
[International Journal of Corpus Linguistics 30:2] 2025
pp. 130–149


Joseph Flanagan | University of Helsinki

Published online: 14 February 2025


Abstract

Establishing the credibility of scientific research involves several related but significantly different concerns. One potential problem in surveying different approaches to these concerns is that of terminology, as some of the basic terms used in the discussion — reproducibility, replicability, robustness, and generalizability — are often used in inconsistent or contradictory ways. This paper proposes to resolve such confusion by providing a terminological framework for discussing what kind of confirmation is necessary for a scientific study to be deemed credible. A study is said to be ‘reproducible’ if we can obtain identical results by performing an identical analysis on identical data, ‘replicable’ if we can obtain consistent results using the same analysis on different data, ‘robust’ if we can obtain consistent results from identical data using a different analysis, and ‘generalizable’ if we can obtain consistent results from different data using a different analysis.
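The four definitions above vary along two dimensions: whether the data are the same and whether the analysis is the same. The following is a minimal sketch of that two-by-two scheme, not code from the article; the names `Confirmation` and `classify` are hypothetical and introduced only for illustration.

```python
from enum import Enum


class Confirmation(Enum):
    """The four terms defined in the abstract (enum name is hypothetical)."""
    REPRODUCIBLE = "reproducible"
    REPLICABLE = "replicable"
    ROBUST = "robust"
    GENERALIZABLE = "generalizable"


def classify(same_data: bool, same_analysis: bool) -> Confirmation:
    """Map the two dimensions of the abstract onto its four terms."""
    if same_data and same_analysis:
        # identical data, identical analysis: identical results expected
        return Confirmation.REPRODUCIBLE
    if same_analysis:
        # different data, same analysis: consistent results expected
        return Confirmation.REPLICABLE
    if same_data:
        # identical data, different analysis: consistent results expected
        return Confirmation.ROBUST
    # different data, different analysis: consistent results expected
    return Confirmation.GENERALIZABLE


if __name__ == "__main__":
    for same_data in (True, False):
        for same_analysis in (True, False):
            term = classify(same_data, same_analysis)
            print(f"same data={same_data}, same analysis={same_analysis}: {term.value}")
```

Note that only the reproducible case asks for identical results; the other three ask for consistent results, a distinction the sketch records in comments rather than in code.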

Keywords: reproducibility, replicability, robustness, generalizability, credibility crisis

Article outline

1. Introduction
2. Reproducibility
- 2.1 Computational reproducibility
- 2.2 Analytical reproducibility
  - 2.2.1 Analytic reconstructability
  - 2.2.2 Analytic traceability
3. Replicability
- 3.1 The corpus as sample
- 3.2 What does it mean for a study to replicate?
4. Robustness and generalizability
- 4.1 Robustness
- 4.2 Generalizability
5. Crisis or opportunity?
Notes
References


Cited by (2)

Becker, Laura & Matías Guzmán Naranjo. 2025. Authors’ response to “Replication and methodological robustness in quantitative typology”. Linguistic Typology 29:3, pp. 591 ff.

Frenken, Florian, Stephanie Evert, Gerold Schneider & Stella Neumann. 2025. How stable are multivariate findings about register variation across varieties of English? On the replicability of Geometric Multivariate Analysis. ICAME Journal 49:1, pp. 23 ff.

This list is based on CrossRef data as of 21 November 2025 and may not be complete.