Article published in: Reproducibility, Replicability, and Robustness in Corpus Linguistics
Edited by Martin Schweinberger and Michael Haugh
[International Journal of Corpus Linguistics 30:2] 2025
pp. 130–149
Reproducibility, replicability, robustness, and generalizability in corpus linguistics
Published online: 14 February 2025
https://doi.org/10.1075/ijcl.24113.fla
Abstract
Establishing the credibility of scientific research involves several related but significantly different concerns.
One potential problem in surveying different approaches to these concerns is that of terminology, as some of the basic terms used
in the discussion — reproducibility, replicability, robustness, and generalizability — are often used in inconsistent or
contradictory ways. This paper proposes to resolve such confusion by providing a terminological framework for discussing what kind
of confirmation is necessary for a scientific study to be deemed credible. A study is said to be ‘reproducible’ if we can obtain
identical results by performing an identical analysis on identical data, ‘replicable’ if we can obtain consistent results using
the same analysis on different data, ‘robust’ if we can obtain consistent results from identical data using a different analysis,
and ‘generalizable’ if we can obtain consistent results from different data using a different analysis.
Keywords: reproducibility, replicability, robustness, generalizability, credibility crisis
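The four definitions in the abstract form a two-by-two grid: whether the data are the same, and whether the analysis is the same. A minimal Python sketch of that grid follows. The function name `credibility_term` and its boolean parameters are illustrative assumptions, not part of the article.

```python
# Sketch of the abstract's terminological framework as a lookup table.
# The function and parameter names are hypothetical, chosen for illustration.

def credibility_term(same_data: bool, same_analysis: bool) -> tuple[str, str]:
    """Return (term, kind of result expected) for a confirmation attempt.

    same_data:     True if the attempt reuses the identical data.
    same_analysis: True if the attempt reuses the identical analysis.
    """
    framework = {
        # identical data, identical analysis -> identical results
        (True, True): ("reproducible", "identical results"),
        # different data, same analysis -> consistent results
        (False, True): ("replicable", "consistent results"),
        # identical data, different analysis -> consistent results
        (True, False): ("robust", "consistent results"),
        # different data, different analysis -> consistent results
        (False, False): ("generalizable", "consistent results"),
    }
    return framework[(same_data, same_analysis)]


if __name__ == "__main__":
    for same_data in (True, False):
        for same_analysis in (True, False):
            term, expected = credibility_term(same_data, same_analysis)
            print(f"same data={same_data}, same analysis={same_analysis}: "
                  f"{term} ({expected})")
```

Only the reproducible case asks for identical results. The other three cells ask for consistent results, which is why the abstract treats them as separate concerns.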
Article outline
- 1. Introduction
- 2. Reproducibility
  - 2.1 Computational reproducibility
  - 2.2 Analytical reproducibility
    - 2.2.1 Analytic reconstructability
    - 2.2.2 Analytic traceability
- 3. Replicability
  - 3.1 The corpus as sample
  - 3.2 What does it mean for a study to replicate?
- 4. Robustness and generalizability
  - 4.1 Robustness
  - 4.2 Generalizability
- 5. Crisis or opportunity?
- Notes
