In: Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 89–105
Open Corpus Linguistics – or How to overcome common problems in dealing with corpus data by adopting open research practices
Published online: 19 September 2024
https://doi.org/10.1075/scl.118.06har
Abstract
In recent years, many researchers have called attention
to the fact that research results very often cannot be replicated – a
phenomenon that has been called the replication crisis. The
replication crisis in linguistics is highly relevant to corpus-based
research: Many corpus studies are not directly replicable as the data on
which they are based are not readily available. Especially in English
linguistics, the full versions of many widely used corpora are still behind
paywalls, which means that they are not accessible to parts of the global
research community, and even when parts of the data are freely accessible,
such partial access presents problems for state-of-the-art methods of data analysis. In
this paper, I discuss the challenges that have led to this situation and
address some possible solutions. In particular, I argue for using smaller
but openly available corpora whenever possible and for adopting open
research practices as far as possible even when using commercial
corpora.
Keywords: replicability, open research, accessibility, transparency, representativeness
Article outline
- 1. Introduction
- 2. Revisiting Rissanen’s problems
- 3. Open Corpus Linguistics: Perspectives and challenges
- 4. Conclusion: Open Corpus Linguistics in practice
- Acknowledgements
- Notes
- References
