Article published in: Language and Covid-19
Edited by Michaela Mahlberg and Gavin Brookes
International Journal of Corpus Linguistics 26:4 (2021), pp. 583–598
Short paper
The Coronavirus Corpus
Design, construction, and use
Published online: 3 May 2021
https://doi.org/10.1075/ijcl.21044.dav
Abstract
This paper discusses the creation and use of the Coronavirus Corpus, which currently (March 2021) contains 900 million words and will probably reach about one billion words by May–June 2021. The Coronavirus Corpus is a subset of the NOW Corpus (News on the Web), which currently contains about 12.1 billion words and grows by about two billion words each year. Both corpora are updated every night: NOW gains about 6–10 million words and the Coronavirus Corpus about 2–3 million words. The Coronavirus Corpus lets users see the frequency of words and phrases over time (down to the individual day) and find all words that are more frequent in one time period than in another. Users can also view the collocates of words and phrases and compare those collocates across time periods to see what is being said about particular topics.
Keywords: corpus design, NOW corpus, text archive, Coronavirus, COVID-19
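The period comparison described in the abstract can be illustrated with a minimal sketch. It is not the corpus's own implementation, which this page does not describe: the function names `per_million` and `compare_periods`, the add-one smoothing, the `min_count` threshold, and the toy token lists are all assumptions made for illustration.

```python
from collections import Counter


def per_million(count, total_words):
    """Normalize a raw count to a rate per million words."""
    return count / total_words * 1_000_000


def compare_periods(tokens_a, tokens_b, min_count=2):
    """Rank words from period A by how much more frequent they are than in period B.

    Hypothetical helper: compares per-million rates, with add-one smoothing
    on period B so that words absent from B do not cause division by zero.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    ratios = {}
    for word, count in freq_a.items():
        if count < min_count:  # skip words too rare in A to compare reliably
            continue
        rate_a = per_million(count, size_a)
        rate_b = per_million(freq_b[word] + 1, size_b)
        ratios[word] = rate_a / rate_b
    # Highest ratio first: words most characteristic of period A
    return sorted(ratios, key=ratios.get, reverse=True)


# Toy data standing in for two time slices of a news corpus (invented)
march = "lockdown lockdown lockdown masks virus virus".split()
june = "reopening reopening masks masks virus virus".split()
print(compare_periods(march, june))
```

Normalizing to a per-million rate matters here because the time slices of a nightly updated corpus differ in size, so raw counts from two periods are not directly comparable.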
Article outline
- 1. Introduction
- 2. Creating and using the NOW Corpus
- 3. Creating the Coronavirus Corpus
- 3.1 Virtual Corpora in NOW
- 3.2 A stand-alone Coronavirus Corpus
- 4. Using the Coronavirus Corpus
- 5. Conclusion
- Note
- References
