Early newspapers as data for corpus linguistics (and Digital Humanities): Issues in using the British Library Newspapers database as a corpus

Hiltunen, Turo

doi:10.1075/scl.118.05hil

In: Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 68–88

Turo Hiltunen | University of Helsinki

Published online: 19 September 2024

Abstract

The availability of large digital archives has great potential for corpus linguistic research, but their use is not without problems. These problems can often be traced to fundamentally different ideas of what might constitute “good data” in Digital Humanities and in corpus linguistics, leading to different expectations regarding how the data is made available to researchers. This chapter discusses the specific challenges involved in using the British Library Newspapers database for corpus linguistics and considers potential solutions to them. It is argued that, to take full advantage of the database, it is necessary to adopt a flexible approach that enables critical reflection on the digital materials and on how they have been collected, processed, and made available.

Keywords: corpus compilation, Digital Humanities, sampling, representativeness, register

Article outline

1. Introduction
2. Digital text analysis in the humanities
- 2.1 Digital Humanities
- 2.2 Corpus linguistics
- 2.3 Towards a useful synergy
3. Historical newspaper prose and the British Library Newspapers database
- 3.1 Problems with available search tools
- 3.2 Sampling, balance, and representativeness
- 3.3 Registers and subregisters
- 3.4 Optical Character Recognition (OCR)
4. Discussion
Notes
References
