In: Corpora and the Changing Society: Studies in the evolution of English
Edited by Paula Rautionaho, Arja Nurmi and Juhani Klemola
[Studies in Corpus Linguistics 96] 2020
pp. 29–56
Changes in society and language
Charting poverty
Published online: 8 April 2020
https://doi.org/10.1075/scl.96.02sch
Abstract
This study addresses how societal and linguistic changes can be
detected using large historical corpora, in particular EEBO and CLMET3.0, taking the
topics of poverty and the industrial revolution as a case study. The results, based on
a rich array of state-of-the-art statistical approaches (such as kernel density
estimation), show how poverty, the industrial revolution, and urbanization are linked
through, for instance, the associations of war, religion, family, poverty, and
suffering. The study also discusses the importance of data size and cleanness, the
temptations of distant reading, and the need to validate the discovered patterns
through close reading and distant reading in interaction.
Article outline
- 1. Introduction
- 2. Data and pre-processing
- 2.1 The EEBO Collection as sampler corpus
- 2.2 The CLMET3.0 corpus
- 2.3 The pre-processing step of spelling normalization
- 3. Methods
- 3.1 Data-based and data-driven approaches
- 3.2 Document classification
- 3.3 Topic modelling
- 3.4 Conceptual maps
- 4. Results and discussion
- 4.1 Dictionary-based approach
- 4.2 Topic modelling
- 4.2.1 EEBO early vs. EEBO late
- 4.2.2 Adding CLMET3.0 and increasing the number of topics
- 4.3 Conceptual maps
- 5. Conclusions
- Notes
- References
References (43)
Corpora and software
CLMET3.0 = Corpus of Late Modern English Texts, version 3.0. De Smet, Hendrik, Diller, Hans-Jürgen & Tyrkkö, Jukka (comps). <[URL]>
EEBO = Early English Books Online. Davies, Mark (comp.). <[URL]>
Gephi = <[URL]>
Mallet = Machine Learning for LanguagE Toolkit. <[URL]>
Textplot = <[URL]>
Other references
Ananiadou, Sophia, Kell, Douglas B. & Tsujii, Jun-ichi. 2006. Text mining and its potential applications in systems biology. Trends in Biotechnology 24(12): 571–579.
Baroni, Marco & Lenci, Alessandro. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics 36(4): 673–721.
Bartsch, Sabine & Evert, Stefan. 2014. Towards a Firthian notion of collocation. In Vernetzungsstrategien, Zugriffsstrukturen und automatisch ermittelte Angaben in Internetwörterbüchern [OPAL - Online publizierte Arbeiten zur Linguistik 2/2014], Andrea Abel & Lothar Lemnitzer (eds), 48–61. Mannheim: Institut für Deutsche Sprache.
Church, Kenneth. 2000. Empirical estimates of adaptation: The chance of Two Noriegas is closer to p/2 than p². Proceedings of the 17th Conference on Computational Linguistics, 180–186. Stroudsburg, PA: Association for Computational Linguistics.
C. W. 2013. Did living standards improve during the Industrial Revolution? The Economist, September 13, 2013. <[URL]> (30 December 2018).
Daudin, Guillaume, O’Rourke, Kevin H. & Prados de la Escosura, Leandro. 2008. Trade and empire, 1700–1870. Technical Report # 2008–24, OFCE: Centre de recherche en économie et sciences po. <[URL]> (30 December 2018).
Evert, Stefan. 2008. Corpora and collocations. In Corpus Linguistics. An International Handbook, Anke Lüdeling & Merja Kytö (eds), 1212–1248. Berlin: Mouton de Gruyter.
Food and Agriculture Organisation of the United Nations. November 2003. Anti-hunger Programme. A twin-track approach to hunger reduction: priorities for national and international action. <[URL]>.
Firth, John Rupert. 1957. A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis [Special volume of the Philological Society], John Rupert Firth (ed.), 1–32. Oxford: Blackwell.
Glynn, Dylan. 2010. Corpus-driven cognitive semantics. Introduction to the field. In Quantitative Methods in Cognitive Semantics: Corpus-Driven Approaches [Cognitive Linguistics Research 46], Dylan Glynn & Kerstin Fischer (eds), 1–42. Berlin: Mouton de Gruyter.
Gries, Stefan T. 2010. Corpus linguistics and theoretical linguistics: a love-hate relationship? Not necessarily… International Journal of Corpus Linguistics 15(3): 327–343.
Grimmer, Justin & Stewart, Brandon. 2013. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis 21(3): 267–297.
Hatton, Timothy J. & Bray, Bernice E. 2010. Long run trends in the heights of European men, 19th–20th centuries. Economics & Human Biology 8(3): 405–413.
Hilpert, Martin & Gries, Stefan T. 2016. Quantitative approaches to diachronic corpus linguistics. In The Cambridge Handbook of English Historical Linguistics, Merja Kytö & Päivi Pahta (eds), 36–53. Cambridge: CUP.
Jurafsky, Daniel & Martin, James H. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 2nd edn. Upper Saddle River, NJ: Prentice-Hall.
Komlos, John. 1998. Shrinking in a growing economy? The mystery of physical stature during the industrial revolution. Journal of Economic History 58: 779–802.
Michel, Jean-Baptiste, Shen, Yuan Kui, Presser Aiden, Aviva, Veres, Adrian, Gray, Matthew K., The Google Books Team, Pickett, Joseph P., Hoiberg, Dale, Clancy, Dan, Norvig, Peter, Orwant, Jon, Pinker, Steven, Nowak, Martin A. & Lieberman Aiden, Erez. 2011. Quantitative analysis of culture using millions of digitized books. Science 331(6014): 176–182.
Oakes, Michael P. 2014. Literary Detective Work on the Computer [Natural Language Processing 12]. Amsterdam: John Benjamins.
Rayson, Paul. 2008. From key words to key semantic domains. International Journal of Corpus Linguistics 13(4): 519–549.
Sahlgren, Magnus. 2006. The Word-Space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations between Words in High-Dimensional Vector Spaces. PhD dissertation, Stockholm University.
Schneider, Gerold. 2014. Applying Computational Linguistics and Language Models: From Descriptive Linguistics to Text Mining and Psycholinguistics. Cumulative Habilitation, University of Zurich.
Schneider, Gerold. 2018. Differences between Swiss High German and German High German via data-driven methods. In Proceedings of SwissText 2018, Mark Cieliebak, Don Tuggener & Fernando Benites (eds), 6–16. <[URL]> (30 December 2018).
Schneider, Gerold, Pettersson, Eva & Percillier, Michael. 2017. Comparing rule-based and SMT-based spelling normalisation for English historical texts. Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language. <[URL]> (30 December 2018).
Schwartz, H. Andrew & Ungar, Lyle H. 2015. Data-driven content analysis of social media: A systematic overview of automated methods. The ANNALS of the American Academy of Political and Social Science 659(1): 78–94.
Szreter, Simon & Mooney, Graham. 1998. Urbanization, mortality, and the standard of living debate: new estimates of the expectation of life at birth in nineteenth-century British cities. Economic History Review 51(1): 84–112.
Taavitsainen, Irma & Schneider, Gerold. 2019. Scholastic argumentation in Early English medical writing and its afterlife: New corpus evidence. In From Data to Evidence in English Language Research [Language and Computers 83], Carla Suhr, Terttu Nevalainen & Irma Taavitsainen (eds), 191–221. Leiden: Brill.
Tognini-Bonelli, Elena. 2001. Corpus Linguistics at Work [Studies in Corpus Linguistics 6]. Amsterdam: John Benjamins.
Wüest, Bruno, Schneider, Gerold & Amsler, Michael. 2014. Measuring the public accountability of new modes of governance. Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Baltimore, Maryland, 38–43. Stroudsburg, PA: Association for Computational Linguistics.
Cited by (1)
Schneider, Gerold & Reveilhac, Maud. 2023. Colloquialisation, compression and democratisation in British parliamentary debates. In Exploring Language and Society with Big Data [Studies in Corpus Linguistics 111], pp. 336 ff.
This list is based on CrossRef data as of 1 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.
