In: Crossing Boundaries through Corpora: Innovative corpus approaches within and beyond linguistics
Edited by Sarah Buschfeld, Patricia Ronan, Theresa Neumaier, Andreas Weilinghoff and Lisa Westermayer
[Studies in Corpus Linguistics 119] 2024
pp. 62–98
Chapter 4. Digital Dickens
An automated content analysis of Charles Dickens’ novels
Published online: 17 October 2024
https://doi.org/10.1075/scl.119.04sch
Abstract
This investigation employs computational linguistic methods such as document classification, topic
modelling, and distributional semantics to scrutinize eight novels by Charles Dickens, uncovering dimensions of social
criticism, literary realism, and narrative structures. The automated analysis of social criticism yields positive
results, but the study emphasizes that differing associations could be discovered only through the semantic abstraction
that distributional semantics, word embeddings, and topic modelling offer. Literary realism is successfully
traced through detailed descriptions and everyday activities. Plotting plots with computational linguistic methods,
specifically conceptual maps with textplot, shows promise but requires refinement. The study shows that current
methods in content analysis offer new possibilities for literary analysis and digital humanities.
Article outline
- 1. Introduction
- 2. Motivation and background
- 2.1 Distributional semantics
- 2.2 Content analysis
- 2.3 Dickens’ visions and style
- 3. Materials
- 3.1 A corpus of Dickens novels
- 3.2 CLMET 3.0 Corpus
- 3.3 Combined corpus
- 4. Methods
- 4.1 Document classification
- 4.2 Topic modelling
- 4.3 Conceptual maps
- 4.4 Distributional semantics
- 4.5 Comparison and triangulation of methods
- 5. Results
- 5.1 Poverty in Dickens
- 5.1.1 Frequency
- 5.1.2 Document classification
- 5.1.3 Distributional semantics
- 5.1.4 Topic modelling
- 5.2 Literary realism
- 5.2.1 Distributional semantics
- 5.2.2 Topic modelling
- 5.2.2 Conceptual maps
- 5.2.3 Plotting plots
- 6. Conclusion
