The Coronavirus Corpus: Design, construction, and use

Davies, Mark

doi:10.1075/ijcl.21044.dav

Article published in: Language and Covid-19
Edited by Michaela Mahlberg and Gavin Brookes
[International Journal of Corpus Linguistics 26:4] 2021
► pp. 583–598

Short paper

Mark Davies | Brigham Young University

Published online: 3 May 2021

Abstract

This paper discusses the creation and use of the Coronavirus Corpus, which currently (March 2021) contains 900 million words and will probably reach about one billion words by May–June 2021. The Coronavirus Corpus is a subset of the NOW Corpus (News on the Web), which currently contains about 12.1 billion words and grows by about two billion words each year. Both corpora are updated every night, adding about 6–10 million words to NOW and 2–3 million words to the Coronavirus Corpus. The Coronavirus Corpus lets users see the frequency of words and phrases over time (even by individual day) and find all words that are more frequent in one time period than in another. Users can also see the collocates of words and phrases, and compare collocates across periods to see what is being said about particular topics over time.
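The kinds of query described in the abstract (frequency by day or year, and collocates of a word) can be illustrated with a minimal sketch. This is not the corpus's actual interface or implementation: the `ARTICLES` toy data, the whitespace tokenizer, and the function names `per_million_by_period` and `collocates` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy data: (ISO date, text) pairs standing in for dated news articles.
ARTICLES = [
    ("2020-03-01", "the virus spread and the lockdown began"),
    ("2020-03-01", "lockdown rules announced"),
    ("2021-03-01", "the vaccine rollout continues after lockdown"),
    ("2021-03-01", "vaccine supply improves"),
]


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption, not the corpus's own)."""
    return text.lower().split()


def per_million_by_period(articles, word, key=lambda date: date):
    """Frequency of `word` per million tokens, grouped by key(date).

    The default key groups by individual day; key=lambda d: d[:4] groups by year.
    """
    hits = Counter()
    totals = Counter()
    for date, text in articles:
        tokens = tokenize(text)
        period = key(date)
        totals[period] += len(tokens)
        hits[period] += tokens.count(word)
    return {p: hits[p] * 1_000_000 / totals[p] for p in totals}


def collocates(articles, node, span=4):
    """Count the words occurring within `span` tokens on either side of `node`."""
    counts = Counter()
    for _, text in articles:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == node:
                window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
                counts.update(window)
    return counts


by_day = per_million_by_period(ARTICLES, "lockdown")
by_year = per_million_by_period(ARTICLES, "lockdown", key=lambda d: d[:4])
near_vaccine = collocates(ARTICLES, "vaccine", span=2)
```

Comparing the per-period values for the same word is the simplest form of the "more frequent in one time period than in another" query; normalizing per million tokens keeps periods of different sizes comparable.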

Keywords: corpus design, NOW corpus, text archive, Coronavirus, COVID-19

Article outline

1. Introduction
2. Creating and using the NOW Corpus
3. Creating the Coronavirus Corpus
- 3.1 Virtual Corpora in NOW
- 3.2 A stand-alone Coronavirus Corpus
4. Using the Coronavirus Corpus
5. Conclusion
Note
References

References

Davies, M. (2015). Corpora: An introduction. In D. Biber & R. Reppen (Eds.), Cambridge Handbook of English Corpus Linguistics (pp. 11–31). Cambridge University Press.

Davies, M. (2017). Using large online corpora to examine lexical, semantic, and cultural variation in different dialects and time periods. In E. Friginal (Ed.), Studies in Corpus-Based Sociolinguistics (pp. 19–82). Routledge.

Davies, M. (2018). Corpus-based studies of lexical and semantic variation: The importance of both corpus size and corpus design. In C. Suhr, T. Nevalainen, & I. Taavitsainen (Eds.), From Data to Evidence in English Language Research (pp. 34–55). Brill.

Davies, M., & Fuchs, R. (2015). Expanding horizons in the study of World Englishes with the 1.9 billion word Global Web-Based English Corpus (GloWbE). English World-Wide, 36(1), 1–28.

Cited by 26 other publications

Sardinha, Tony Berber & Shannon Fitzsimmons-Doolan

2026. Lexical Multidimensional Analysis.

Wang, Xiaodong, Wanting Li, Lihe Chen & Lixin He

2026. Multilingual Parallel Corpus Construction and Translation Quality Optimization Based on Transformer Algorithm. In Proceedings of International Conference on Recent Innovations in Computing [Lecture Notes in Electrical Engineering, 1487], ► pp. 397 ff.

Dong, Shuai, Jihua Dong & Louisa Buckingham

2025. Directives in Academia and Media During the Uncertainty. International Journal of Applied Linguistics 35:4 ► pp. 2317 ff.

Lei, Hong

2025. Navigating health communication in China: a corpus-based critical discourse analysis of COVID-19 news from 2020 to 2023. Humanities and Social Sciences Communications 12:1

Nesi, Hilary

2025. ESP and Corpus Studies. In The Handbook of English for Specific Purposes, ► pp. 469 ff.

Roberts, Seán G., Kateryna Krykoniuk, Michael Handford, Yue Zhou, Jianzhong Wu & Chien-fei Chen

2025. The energy trilemma COP-out: accessibility is under-reported in international English-language media coverage of United Nations Climate Change Conferences. Energy Research & Social Science 127 ► pp. 104275 ff.

Song, Yuping & Ge Shan

2025. 2025 3rd International Conference on Data Science and Information System (ICDSIS), ► pp. 1 ff.

van Rooy, Bertus

2025. Corpora of Englishes in the Outer Circle. In The Wiley Blackwell Encyclopedia of World Englishes, ► pp. 1 ff.

van Rooy, Bertus

2025. Corpora of Englishes in the Inner Circle. In The Wiley Blackwell Encyclopedia of World Englishes, ► pp. 1 ff.

Li, Chunyao

2024. 2024 IEEE 2nd International Conference on Control, Electronics and Computer Technology (ICCECT), ► pp. 981 ff.

Moreno-Ortiz, Antonio

2024. Introduction. In Making Sense of Large Social Media Corpora, ► pp. 1 ff.

Moreno-Ortiz, Antonio

2024. COVID-19 Corpora. In Making Sense of Large Social Media Corpora, ► pp. 19 ff.

Oakey, David & Benet Vincent

2024. Introductory editorial synthesis paper: Corpus linguistics and the language of COVID-19: Applications and outcomes. Applied Corpus Linguistics 4:3 ► pp. 100110 ff.

Peters, Joachim, Maria Heckel, Eva Breindl & Christoph Ostgathe

2024. What Does “Palliative” Mean? Sentiment, Knowledge, and Public Perception Concerning Palliative Care on the Internet since the COVID-19 Pandemic. Palliative Medicine Reports 5:1

Rossiter, Timothy & Averil Coxhead

2024. Technical vocabulary in government spoken communications: The team of five million in bubbles, PPE and CBACs. International Journal of Applied Linguistics 34:4 ► pp. 1556 ff.

Yusufali, Hussein, Roger K. Moore & Stefan Goetze

2024. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), ► pp. 12016 ff.

Afzaal, Muhammad & Xiangtao Du

2023. Syntactic complexity in translated eHealth discourse of COVID-19: a comparable parallel corpus approach. Asia Pacific Translation and Intercultural Studies 10:1 ► pp. 3 ff.

Alkhammash, Reem

2023. Bibliometric, network, and thematic mapping analyses of metaphor and discourse in COVID-19 publications from 2020 to 2022. Frontiers in Psychology 13

Bondi, Marina & Jessica Jane Nocella

2023. Boosting Booster Trust: Negotiating a Jungle of Misinformation. Lingue Culture Mediazioni - Languages Cultures Mediation (LCM Journal) 10:2

Dong, Jihua, Shuai Dong & Louisa Buckingham

2023. A discourse dynamics exploration of terminology for Covid-19 in professional and public discourse. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 29:2 ► pp. 224 ff.

Jones, Christian, David Oakey & Kay L. O'Halloran

2023. “I will say the picture of the background is not related to the words”: using corpus linguistics and focus groups to reveal how speakers of English as an additional language perceive the effectiveness of the phraseology and imagery in UK public health tweets during COVID-19. Applied Corpus Linguistics 3:2 ► pp. 100053 ff.

Moreno-Ortiz, Antonio & María García-Gámez

2023. Strategies for the Analysis of Large Social Media Corpora: Sampling and Keyword Extraction Methods. Corpus Pragmatics 7:3 ► pp. 241 ff.

Spicksley, Kathryn & Emma Franklin

2023. Who works on the ‘frontline’? comparing constructions of ‘frontline’ work before and during the COVID-19 pandemic. Applied Corpus Linguistics 3:3 ► pp. 100059 ff.

Chen, Mei-Hua

2022. Process-Oriented Corpus Pedagogy to Promote EFL Learner Awareness of Lexical Knowledge. In Emerging Concepts in Technology-Enhanced Language Teaching and Learning [Advances in Educational Technologies and Instructional Design], ► pp. 275 ff.

Jiang, Feng Kevin & Ken Hyland

2022. COVID‐19 in the news: The first 12 months. International Journal of Applied Linguistics 32:2 ► pp. 241 ff.

Dong, Jihua, Louisa Buckingham & Hao Wu

2021. A discourse dynamics exploration of attitudinal responses towards COVID-19 in academia and media. International Journal of Corpus Linguistics 26:4 ► pp. 532 ff.

This list is based on CrossRef data as of 12 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.