Chapter 3. Medical topics and style from 1500 to 2018: A corpus-driven exploration

Schneider, Gerold

doi:10.1075/pbns.330.03sch

In: Corpus Pragmatic Studies on the History of Medical Discourse
Edited by Turo Hiltunen and Irma Taavitsainen
[Pragmatics & Beyond New Series 330] 2022
pp. 49–78

Gerold Schneider | University of Zurich

Published online: 1 July 2022

https://doi.org/10.1075/pbns.330.03sch

Abstract

This chapter investigates changes in medical topics, style and language across 500 years, from 1500 to 2018. To do so, we employ data-driven methods of Computational Linguistics and Digital Humanities: document classification, topic modelling, and automatically constructed conceptual maps. We trace changes from traditional thinking in the scholastic period to empirical methods, professionalised medicine, and finally the increasing importance of data, statistics and clinical studies, away from symptom-centred medicine. We conclude that medical discourse has undergone radical changes and that data-driven methods reflect these changes and offer an unprecedented overview. We also critically discuss shortcomings of our data and methods.

Keywords: data-driven approaches, machine learning, collocations, Topic Modelling, history of medicine, Digital Humanities, conceptual maps, Kernel Density Estimation, automated content analysis, English medical discourse, language and health, culturomics

Article outline

1. Introduction
2. Motivation
- 2.1 Systematic comparison of all lexical features
- 2.2 Advanced computational methods
- 2.3 Sampling and representativeness
3. Materials
- 3.1 CEEM
- 3.2 ARCHER Medical
- 3.3 HIMERA
- 3.4 PubMed Excerpt
- 3.5 Overview of the complete data of our investigation
- 3.6 Limitations of the data
4. Methods
- 4.1 Data preparation
- 4.2 Supervised document classification
- 4.3 Unsupervised topic modelling
- 4.4 Unsupervised Conceptual Maps with Kernel Density Estimation
5. Results
- 5.1 Results of supervised document classification
- 5.2 Results of unsupervised topic modelling
- 5.3 Results of Unsupervised Conceptual Maps with Kernel Density Estimation
6. Conclusion and future prospects
Acknowledgements
Notes
References
