Open Corpus Linguistics – or How to overcome common problems in dealing with corpus data by adopting open research practices

Hartmann, Stefan

doi:10.1075/scl.118.06har

In: Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 89–105

Stefan Hartmann | Heinrich Heine University Düsseldorf

Published online: 19 September 2024

https://doi.org/10.1075/scl.118.06har

Abstract

In recent years, many researchers have called attention to the fact that research results very often cannot be replicated, a phenomenon that has been called the replication crisis. The replication crisis in linguistics is highly relevant to corpus-based research: many corpus studies are not directly replicable because the data on which they are based are not readily available. Especially in English linguistics, the full versions of many widely used corpora are still behind paywalls, which means that they are not accessible to parts of the global research community. Even when parts of the data are freely accessible, such partial access poses problems for state-of-the-art methods of data analysis. In this paper, I discuss the challenges that have led to this situation and address some possible solutions. In particular, I argue for using smaller but openly available corpora whenever possible and for adopting open research practices as far as possible even when using commercial corpora.

Keywords: replicability, open research, accessibility, transparency, representativeness

Article outline

1. Introduction
2. Revisiting Rissanen’s problems
3. Open Corpus Linguistics: Perspectives and challenges
4. Conclusion: Open Corpus Linguistics in practice
Acknowledgements
Notes
References

Cited by one other publication

Hartmann, Stefan & Tobias Ungerer

2025. Chaos Theory, Shmaos Theory. In Dynamics at the Lexicon-Syntax Interface, pp. 295 ff.

This list is based on CrossRef data as of 1 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.