Engaging with bad (meta)data in historical corpus linguistics

Vartiainen, Turo; Säily, Tanja

doi:10.1075/scl.118.02var

In: Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 9–34

Turo Vartiainen | University of Helsinki

Tanja Säily | University of Helsinki

Published online: 19 September 2024

https://doi.org/10.1075/scl.118.02var

Abstract

In this chapter, we discuss some common pitfalls related to historical data and its use in linguistic analysis. We argue that the “philologist’s dilemma”, as originally proposed by Rissanen (1989), should be reconceptualized to meet the needs of the fast-evolving field of corpus linguistics, where scholars make increasing use of big-data resources and sophisticated statistical modelling. By providing examples of errors and uncertainties related to, for example, corpus metadata, sampling, balance, and OCR accuracy, we argue that corpus linguists should pay increasingly close attention to the sampling and annotation principles employed in the compilation of historical corpora as well as to the quality of the linguistic data. We propose that the principle of “knowing one’s corpus” in terms of its compilation principles has become all the more important in the age of big-data corpora, where it is not feasible for individual researchers, or corpus compilers, to validate their data manually.

Keywords: historical corpus linguistics, metadata, part-of-speech annotation, big data, corpus compilation, sampling

Article outline

1. Introduction
2. POS annotation in diachronic datasets
- 2.1 Accounting for category change
- 2.2 Theoretical choices in the design of the annotation scheme
- 2.3 Annotation tailored to specific research questions
3. Large corpora
- 3.1 Inaccuracies in text sampling
- 3.2 Changes in the balance of subgenres
4. Historical databases
- 4.1 Issues with balance and metadata
- 4.2 OCR errors
  - 4.2.1 Hapax legomena
  - 4.2.2 Historical lexis
5. Discussion and conclusion
Acknowledgements
Notes
References

References (41)

Aarts, Bas. 2007. Syntactic Gradience: The Nature of Grammatical Indeterminacy. Oxford: OUP.

Alatrash, Reem, Schlechtweg, Dominik, Kuhn, Jonas & Schulte im Walde, Sabine. 2020. CCOHA: Clean Corpus of Historical American English. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds), 6958–6966. Marseille: European Language Resources Association.

Baayen, R. Harald. 1993. On frequency, transparency and productivity. In Yearbook of Morphology 1992, Geert Booij & Jaap van Marle (eds), 181–208. Dordrecht: Kluwer.

Biber, Douglas & Finegan, Edward. 1997. Diachronic relations among speech-based and written registers in English. In To Explain the Present: Studies in the Changing English Language in Honour of Matti Rissanen [Mémoires de la Société Néophilologique de Helsinki 52], Terttu Nevalainen & Leena Kahlas-Tarkka (eds), 253–275. Helsinki: Société Néophilologique.

Biber, Douglas & Gray, Bethany. 2011. The historical shift of scientific academic prose in English towards less explicit styles of expression: Writing without verbs. In Researching Specialized Languages [Studies in Corpus Linguistics 47], Vijay Bhatia, Purificación Sánchez Hernández & Pascual Pérez-Paredes (eds), 11–24. Amsterdam: John Benjamins.

Brinton, Laurel & Fee, Margery. 2001. Canadian English. In The Cambridge History of the English Language, Vol. 6: English in North America, John Algeo (ed.), 422–440. Cambridge: CUP.

COHA = Davies, Mark. 2010–. The Corpus of Historical American English: 400 million words, 1810–2009. <[URL]> (10 May 2024).

Denison, David. 1998. Syntax. In The Cambridge History of the English Language, Vol. 4: 1776–1997, Suzanne Romaine (ed.), 92–329. Cambridge: CUP.

Denison, David. 2001. Gradience and linguistic change. In Historical Linguistics 1999: Selected Papers from the 14th International Conference on Historical Linguistics, Vancouver, August 1999 [Current Issues in Linguistic Theory 215], Laurel Brinton (ed.), 119–144. Amsterdam: John Benjamins.

Denison, David. 2013. Parts of speech: Solid citizens or slippery customers? Journal of the British Academy 1: 151–185.

De Smet, Hendrik. 2012. The course of actualization. Language 88(3): 601–633.

ECCO = Eighteenth Century Collections Online. Gale. <[URL]> (19 May 2024).

Hill, Mark & Hengchen, Simon. 2019. Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. Digital Scholarship in the Humanities 34(4): 825–843.

Huddleston, Rodney & Pullum, Geoffrey K. (eds). 2002. The Cambridge Grammar of the English Language. Cambridge: CUP.

Hundt, Marianne & Leech, Geoffrey. 2012. “Small is beautiful”: On the value of standard reference corpora for observing recent grammatical change. In The Oxford Handbook of the History of English [Oxford Handbooks in Linguistics], Terttu Nevalainen & Elizabeth Closs Traugott (eds), 175–188. Oxford: OUP.

Larsson, Tove, Egbert, Jesse & Biber, Douglas. 2022. On the status of statistical reporting versus linguistic description in corpus linguistics: A ten-year perspective. Corpora 17(1): 137–157.

Leech, Geoffrey. 1991. The state of the art in corpus linguistics. In English Corpus Linguistics: Studies in Honour of Jan Svartvik, Karin Aijmer & Bengt Altenberg (eds). London: Longman.

Lenker, Ursula. 2010. Argument and Rhetoric: Adverbial Connectors in the History of English. Berlin: Mouton de Gruyter.

Lenker, Ursula & Meurman-Solin, Anneli (eds). 2007. Connectives in the History of English [Current Issues in Linguistic Theory 283]. Amsterdam: John Benjamins.

Liimatta, Aatu, Marjanen, Jani, Tahko, Tuuli, Tolonen, Mikko & Säily, Tanja. 2023a. Dimensions of incoming economic vocabulary in eighteenth-century Britain. Linguistica 63(1–2): 353–374. Special issue Sociocultural Change and the Development of Vernacular Languages in Early Modern Europe, Oliver Currie (ed.).

Liimatta, Aatu, Ryan, Yann, Säily, Tanja & Tolonen, Mikko. 2023b. Results from rough data? The large-scale study of early modern historiography with multi-dimensional register analysis. In Proceedings of the Digital Humanities in the Nordic Countries 7th Conference, Oslo – Stavanger – Bergen, Norway, March 8–10, 2023 [DHNB Publications 5:1], Annika Rockenberger, Sofie Gilbert, Juliane Tiemann & Elisa Pierfederici (eds), 297–312. Oslo: University of Oslo.

López-Couso, María José & Méndez-Naya, Belén. 2020. Masked by annotation: Minor declarative complementizers in parsed corpora of historical English. Research in Corpus Linguistics 8(2): 133–158.

McEnery, Anthony & Baker, Helen. 2019. Language surrounding poverty in early modern England: A corpus-based investigation of how people living in the seventeenth century perceived the criminalized poor. In From Data to Evidence in English Language Research, Carla Suhr, Terttu Nevalainen & Irma Taavitsainen (eds), 225–257. Leiden: Brill.

Menzel, Katrin, Knappen, Jörg & Teich, Elke. 2021. Generating linguistically relevant metadata for the Royal Society Corpus. Research in Corpus Linguistics 9(1): 1–18. Special issue Challenges of Combining Structured and Unstructured Data in Corpus Development, Tanja Säily & Jukka Tyrkkö (eds).

Öhman, Emily, Säily, Tanja & Laitinen, Mikko. 2019. Towards the inevitable demise of everybody? A multifactorial analysis of one/-body/-man variation in indefinite pronouns in historical American English. Presentation at the 40th Annual Conference of the International Computer Archive of Modern and Medieval English (ICAME 40), Neuchâtel, Switzerland, June. <[URL]> (19 May 2024).

PCEEC = Parsed Corpus of Early English Correspondence, tagged version. 2006. Annotated by Arja Nurmi, Ann Taylor, Anthony Warner, Susan Pintzuk & Terttu Nevalainen. Compiled by the CEEC Project Team. York: University of York and Helsinki: University of Helsinki. Distributed through the Oxford Text Archive.

Peitsara, Kirsti. 1997. The development of reflexive strategies in English. In Grammaticalization at Work: Studies of Long-term Developments in English [Topics in English Linguistics 24], Matti Rissanen, Merja Kytö & Kirsi Heikkonen (eds), 277–370. Berlin: De Gruyter Mouton.

Petré, Peter & Anthonissen, Lynn. 2020. Individuality in complex systems: A constructionist approach. Cognitive Linguistics 31(2): 185–212.

Rayson, Paul, Archer, Dawn, Baron, Alistair, Culpeper, Jonathan & Smith, Nicholas. 2007. Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of Corpus Linguistics 2007, 27–30 July, University of Birmingham, UK, Matthew Davies, Paul Rayson, Susan Hunston & Pernilla Danielsson (eds), article #192. <[URL]> (19 May 2024).

Rayson, Paul, Leech, Geoffrey & Hodges, Mary. 1997. Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics 2(1): 133–152.

Rissanen, Matti. 1989. Three problems connected with the use of diachronic corpora. ICAME Journal 13: 16–19.

Säily, Tanja. 2014. Sociolinguistic Variation in English Derivational Productivity: Studies and Methods in Diachronic Corpus Linguistics [Mémoires de la Société Néophilologique de Helsinki XCIV]. Helsinki: Société Néophilologique.

Säily, Tanja, Nevalainen, Terttu & Siirtola, Harri. 2011. Variation in noun and pronoun frequencies in a sociohistorical corpus of English. Literary and Linguistic Computing 26(2): 167–188.

Säily, Tanja, Vartiainen, Turo & Siirtola, Harri. 2017. Exploring part-of-speech frequencies in a sociohistorical corpus of English. In Exploring Future Paths of Historical Sociolinguistics [Advances in Historical Sociolinguistics 7], Tanja Säily, Arja Nurmi, Minna Palander-Collin & Anita Auer (eds), 23–52. Amsterdam: John Benjamins.

Säily, Tanja & Vartiainen, Turo. Forthcoming. Historical linguistics. In The Bloomsbury Handbook of Corpus Linguistics, Gavin Brookes & Michaela Mahlberg (eds). London: Bloomsbury.

Sinclair, John. 2004. Trust the Text: Language, Corpus and Discourse. New York NY: Routledge.

Szmrecsanyi, Benedikt, Rosenbach, Anette, Bresnan, Joan & Wolk, Christoph. 2014. Culturally conditioned language change? A multi-variate analysis of genitive constructions in ARCHER. In Late Modern English Syntax, Marianne Hundt (ed.), 133–152. Cambridge: CUP.

Taylor, Ann & Santorini, Beatrice. 2006. The Parsed Corpus of Early English Correspondence. York: University of York. <[URL]> (19 May 2024).

Tognini-Bonelli, Elena. 2001. Corpus Linguistics at Work [Studies in Corpus Linguistics 6]. Amsterdam: John Benjamins.

Tolonen, Mikko, Mäkelä, Eetu, Ijaz, Ali & Lahti, Leo. 2021. Corpus linguistics and Eighteenth Century Collections Online (ECCO). Research in Corpus Linguistics 9(1): 19–34.

Vartiainen, Turo. 2021. Trends and recent change in the syntactic distribution of degree modifiers: Implications for a usage-based theory of word classes. Journal of English Linguistics 49(2): 228–251.