In: Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 9–34
Engaging with bad (meta)data in historical corpus linguistics
Published online: 19 September 2024
https://doi.org/10.1075/scl.118.02var
Abstract
In this chapter, we discuss some common pitfalls related
to historical data and its use in linguistic analysis. We argue that the
“philologist’s dilemma”, as originally proposed by Rissanen (1989), should be reconceptualized to meet
the needs of the fast-evolving field of corpus linguistics, where scholars
make increasing use of big-data resources and sophisticated statistical
modelling. By providing examples of errors and uncertainties related to, for
example, corpus metadata, sampling, balance, and OCR accuracy, we argue that
corpus linguists should pay increasingly close attention to the sampling and
annotation principles employed in the compilation of historical corpora as
well as to the quality of the linguistic data. We propose that the principle
of “knowing one’s corpus” in terms of its compilation principles has become
all the more important in the age of big-data corpora, where it is not
feasible for individual researchers, or corpus compilers, to validate their
data manually.
Article outline
- 1. Introduction
- 2. POS annotation in diachronic datasets
- 2.1 Accounting for category change
- 2.2 Theoretical choices in the design of the annotation scheme
- 2.3 Annotation tailored to specific research questions
- 3. Large corpora
- 3.1 Inaccuracies in text sampling
- 3.2 Changes in the balance of subgenres
- 4. Historical databases
- 4.1 Issues with balance and metadata
- 4.2 OCR errors
- 4.2.1 Hapax legomena
- 4.2.2 Historical lexis
- 5. Discussion and conclusion
References (41)
Alatrash, Reem, Schlechtweg, Dominik, Kuhn, Jonas & Schulte im Walde, Sabine. 2020. CCOHA: Clean Corpus of Historical American English. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk & Stelios Piperidis (eds), 6958–6966. Marseille: European Language Resources Association.
Baayen, R. Harald. 1993. On frequency, transparency and productivity. In Yearbook of Morphology 1992, Geert Booij & Jaap van Marle (eds), 181–208. Dordrecht: Kluwer.
Biber, Douglas & Finegan, Edward. 1997. Diachronic relations among speech-based and written registers in English. In To Explain the Present: Studies in the Changing English Language in Honour of Matti Rissanen [Mémoires de la Société Néophilologique de Helsinki 52], Terttu Nevalainen & Leena Kahlas-Tarkka (eds), 253–275. Helsinki: Société Néophilologique.
Biber, Douglas & Gray, Bethany. 2011. The historical shift of scientific academic prose in English towards less explicit styles of expression: Writing without verbs. In Researching Specialized Languages [Studies in Corpus Linguistics 47], Vijay Bhatia, Purificación Sánchez Hernández & Pascual Pérez-Paredes (eds), 11–24. Amsterdam: John Benjamins.
Brinton, Laurel & Fee, Margery. 2001. Canadian English. In The Cambridge History of the English Language, Vol. 6: English in North America, John Algeo (ed.), 422–440. Cambridge: CUP.
COHA = Davies, Mark. 2010–. The Corpus of Historical American English: 400 million words, 1810–2009. <[URL]> (10 May 2024).
Denison, David. 1998. Syntax. In The Cambridge History of the English Language, Vol. 4: 1776–1997, Suzanne Romaine (ed.), 92–329. Cambridge: CUP.
Denison, David. 2001. Gradience and linguistic change. In Historical Linguistics 1999: Selected Papers from the 14th International Conference on Historical Linguistics, Vancouver, August 1999 [Current Issues in Linguistic Theory 215], Laurel Brinton (ed.), 119–144. Amsterdam: John Benjamins.
Denison, David. 2013. Parts of speech: Solid citizens or slippery customers? Journal of the British Academy 1: 151–185.
ECCO = Eighteenth Century Collections Online. Gale. <[URL]> (19 May 2024).
Hill, Mark & Hengchen, Simon. 2019. Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. Digital Scholarship in the Humanities 34(4): 825–843.
Huddleston, Rodney & Pullum, Geoffrey K. (eds). 2002. The Cambridge Grammar of the English Language. Cambridge: CUP.
Hundt, Marianne & Leech, Geoffrey. 2012. “Small is beautiful”: On the value of standard reference corpora for observing recent grammatical change. In The Oxford Handbook of the History of English [Oxford Handbooks in Linguistics], Terttu Nevalainen & Elizabeth Closs Traugott (eds), 175–188. Oxford: OUP.
Larsson, Tove, Egbert, Jesse & Biber, Douglas. 2022. On the status of statistical reporting versus linguistic description in corpus linguistics: A ten-year perspective. Corpora 17(1): 137–157.
Leech, Geoffrey. 1991. The state of the art in corpus linguistics. In English Corpus Linguistics: Studies in Honour of Jan Svartvik, Karin Aijmer & Bengt Altenberg (eds). London: Longman.
Lenker, Ursula. 2010. Argument and Rhetoric: Adverbial Connectors in the History of English. Berlin: Mouton de Gruyter.
Lenker, Ursula & Meurman-Solin, Anneli (eds). 2007. Connectives in the History of English [Current Issues in Linguistic Theory 283]. Amsterdam: John Benjamins.
Liimatta, Aatu, Marjanen, Jani, Tahko, Tuuli, Tolonen, Mikko & Säily, Tanja. 2023a. Dimensions of incoming economic vocabulary in eighteenth-century Britain. Linguistica 63(1–2): 353–374. Special issue Sociocultural Change and the Development of Vernacular Languages in Early Modern Europe, Oliver Currie (ed.).
Liimatta, Aatu, Ryan, Yann, Säily, Tanja & Tolonen, Mikko. 2023b. Results from rough data? The large-scale study of early modern historiography with multi-dimensional register analysis. In Proceedings of the Digital Humanities in the Nordic Countries 7th Conference, Oslo – Stavanger – Bergen, Norway, March 8–10, 2023 [DHNB Publications 5:1], Annika Rockenberger, Sofie Gilbert, Juliane Tiemann & Elisa Pierfederici (eds), 297–312. Oslo: University of Oslo.
López-Couso, María José & Méndez-Naya, Belén. 2020. Masked by annotation: Minor declarative complementizers in parsed corpora of historical English. Research in Corpus Linguistics 8(2): 133–158.
McEnery, Anthony & Baker, Helen. 2019. Language surrounding poverty in early modern England: A corpus-based investigation of how people living in the seventeenth century perceived the criminalized poor. In From Data to Evidence in English Language Research, Carla Suhr, Terttu Nevalainen & Irma Taavitsainen (eds), 225–257. Leiden: Brill.
Menzel, Katrin, Knappen, Jörg & Teich, Elke. 2021. Generating linguistically relevant metadata for the Royal Society Corpus. Research in Corpus Linguistics 9(1): 1–18. Special issue Challenges of Combining Structured and Unstructured Data in Corpus Development, Tanja Säily & Jukka Tyrkkö (eds).
Öhman, Emily, Säily, Tanja & Laitinen, Mikko. 2019. Towards the inevitable demise of everybody? A multifactorial analysis of one/-body/-man variation in indefinite pronouns in historical American English. Presentation at the 40th Annual Conference of the International Computer Archive of Modern and Medieval English (ICAME 40), Neuchâtel, Switzerland, June. <[URL]> (19 May 2024).
PCEEC = Parsed Corpus of Early English Correspondence, tagged version. 2006. Annotated by Arja Nurmi, Ann Taylor, Anthony Warner, Susan Pintzuk & Terttu Nevalainen. Compiled by the CEEC Project Team. York: University of York and Helsinki: University of Helsinki. Distributed through the Oxford Text Archive.
Peitsara, Kirsti. 1997. The development of reflexive strategies in English. In Grammaticalization at Work: Studies of Long-term Developments in English [Topics in English Linguistics 24], Matti Rissanen, Merja Kytö & Kirsi Heikkonen (eds), 277–370. Berlin: De Gruyter Mouton.
Petré, Peter & Anthonissen, Lynn. 2020. Individuality in complex systems: A constructionist approach. Cognitive Linguistics 31(2): 185–212.
Rayson, Paul, Archer, Dawn, Baron, Alistair, Culpeper, Jonathan & Smith, Nicholas. 2007. Tagging the Bard: Evaluating the accuracy of a modern POS tagger on Early Modern English corpora. In Proceedings of Corpus Linguistics 2007, 27–30 July, University of Birmingham, UK, Matthew Davies, Paul Rayson, Susan Hunston & Pernilla Danielsson (eds), article #192. <[URL]> (19 May 2024).
Rayson, Paul, Leech, Geoffrey & Hodges, Mary. 1997. Social differentiation in the use of English vocabulary: Some analyses of the conversational component of the British National Corpus. International Journal of Corpus Linguistics 2(1): 133–152.
Rissanen, Matti. 1989. Three problems connected with the use of diachronic corpora. ICAME Journal 13: 16–19.
Säily, Tanja. 2014. Sociolinguistic Variation in English Derivational Productivity: Studies and Methods in Diachronic Corpus Linguistics [Mémoires de la Société Néophilologique de Helsinki XCIV]. Helsinki: Société Néophilologique.
Säily, Tanja, Nevalainen, Terttu & Siirtola, Harri. 2011. Variation in noun and pronoun frequencies in a sociohistorical corpus of English. Literary and Linguistic Computing 26(2): 167–188.
Säily, Tanja, Vartiainen, Turo & Siirtola, Harri. 2017. Exploring part-of-speech frequencies in a sociohistorical corpus of English. In Exploring Future Paths of Historical Sociolinguistics [Advances in Historical Sociolinguistics 7], Tanja Säily, Arja Nurmi, Minna Palander-Collin & Anita Auer (eds), 23–52. Amsterdam: John Benjamins.
Säily, Tanja & Vartiainen, Turo. Forthcoming. Historical linguistics. In The Bloomsbury Handbook of Corpus Linguistics, Gavin Brookes & Michaela Mahlberg (eds). London: Bloomsbury.
Szmrecsanyi, Benedikt, Rosenbach, Anette, Bresnan, Joan & Wolk, Christoph. 2014. Culturally conditioned language change? A multi-variate analysis of genitive constructions in ARCHER. In Late Modern English Syntax, Marianne Hundt (ed.), 133–152. Cambridge: CUP.
Taylor, Ann & Santorini, Beatrice. 2006. The Parsed Corpus of Early English Correspondence. York: University of York. <[URL]> (19 May 2024).
Tognini-Bonelli, Elena. 2001. Corpus Linguistics at Work [Studies in Corpus Linguistics 6]. Amsterdam: John Benjamins.
