In: Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 68–88
Early newspapers as data for corpus linguistics (and Digital Humanities)
Issues in using the British Library Newspapers database as a corpus
Published online: 19 September 2024
https://doi.org/10.1075/scl.118.05hil
Abstract
The availability of large digital archives has great
potential for corpus linguistic research, but their use is not without
problems. These problems can often be traced to fundamentally different
ideas of what might constitute “good data” in Digital Humanities and in
corpus linguistics, leading to different expectations regarding how the data
is made available to researchers. This chapter discusses the specific
challenges involved in using the British Library Newspapers
database for corpus linguistics and considers potential solutions for them.
It is argued that taking full advantage of the database requires a flexible
approach that enables critical reflection on the digital materials and on
how they have been collected, processed, and made available.
Keywords: corpus compilation, Digital Humanities, sampling, representativeness, register
Article outline
- 1. Introduction
- 2. Digital text analysis in the humanities
- 2.1 Digital Humanities
- 2.2 Corpus linguistics
- 2.3 Towards a useful synergy
- 3. Historical newspaper prose and the British Library Newspapers database
- 3.1 Problems with available search tools
- 3.2 Sampling, balance, and representativeness
- 3.3 Registers and subregisters
- 3.4 Optical Character Recognition (OCR)
- 4. Discussion
