An automated content analysis of Charles Dickens’ novels: Chapter 4. Digital Dickens

Schneider, Gerold

doi:10.1075/scl.119.04sch

In: Crossing Boundaries through Corpora: Innovative corpus approaches within and beyond linguistics
Edited by Sarah Buschfeld, Patricia Ronan, Theresa Neumaier, Andreas Weilinghoff and Lisa Westermayer
[Studies in Corpus Linguistics 119] 2024
pp. 62–98

Gerold Schneider | University of Zurich

Published online: 17 October 2024

Abstract

This investigation applies computational linguistic methods (document classification, topic modelling, and distributional semantics) to eight novels by Charles Dickens in order to uncover dimensions of social criticism, literary realism, and narrative structure. The automated analysis of social criticism yields positive results, but the study emphasizes that the differing associations could be discovered only through the semantic abstraction that distributional semantics, word embeddings, and topic modelling provide. Literary realism is successfully traced through detailed descriptions and everyday activities. Plotting plots with computational linguistic methods, specifically conceptual maps built with textplot, shows promise but requires refinement. The study shows that current methods in content analysis offer new possibilities for literary analysis and digital humanities.

Keywords: computational linguistics, digital humanities, Charles Dickens, document classification, topic modelling, distributional semantics, conceptual maps

Article outline

1. Introduction
2. Motivation and background
- 2.1 Distributional semantics
- 2.2 Content analysis
- 2.3 Dickens’ visions and style
3. Materials
- 3.1 A corpus of Dickens novels
- 3.2 CLMET 3.0 Corpus
- 3.3 Combined corpus
4. Methods
- 4.1 Document classification
- 4.2 Topic modelling
- 4.3 Conceptual maps
- 4.4 Distributional semantics
- 4.5 Comparison and triangulation of methods
5. Results
- 5.1 Poverty in Dickens
  - 5.1.1 Frequency
  - 5.1.2 Document classification
  - 5.1.3 Distributional semantics
  - 5.1.4 Topic modelling
- 5.2 Literary realism
  - 5.2.1 Distributional semantics
  - 5.2.2 Topic modelling
  - 5.2.3 Conceptual maps
  - 5.2.4 Plotting plots
6. Conclusion
Acknowledgements
Notes
References
