The influence of the benchmark corpus on keyword analysis

Pojanapunya, Punjaporn; Watson Todd, Richard

doi:10.1075/rs.19017.poj

Article published in: Register Studies
Vol. 3:1 (2021) ► pp. 88–114

Punjaporn Pojanapunya | King Mongkut’s University of Technology Thonburi

Richard Watson Todd | King Mongkut’s University of Technology Thonburi

Published online: 3 June 2021

Abstract

The growing popularity of keyword analysis as an applied linguistics methodology has not been matched by an increase in the rigour with which the method is applied. While several studies have investigated the impact of choices made at certain stages of the keyword analysis process, the impact of the choice of benchmark corpus has largely been overlooked. In this paper, we compare a target corpus with several benchmark corpora and show that the keywords generated are different. We also show that certain characteristics of the keyword list and of the keywords themselves vary in relatively predictable ways depending on the benchmark corpus. These variations have implications for the choice of benchmark corpus and how the results of a keyword analysis should be interpreted. Analyzing the keywords from a comparison with a large general corpus or the keyword lists from multiple comparisons may be most appropriate for register studies.

Keywords: keyword analysis, reference corpus, aboutness, register

Article outline

1. Introduction
- 1.1 Uses of keyword analysis
- 1.2 Conducting a keyword analysis
- 1.3 Research into the effects of different benchmark corpora
2. Methodology
- 2.1 The target corpus
- 2.2 The benchmark corpora
- 2.3 Generating the keyword lists
- 2.4 Analysing the keyword lists and the keywords
3. Results
- 3.1 Does the benchmark corpus influence the keywords?
- 3.2 How does the benchmark corpus influence the keyword lists as a whole?
- 3.3 How does the benchmark corpus influence the keywords?
4. Discussion
References

Cited by (9)

Cvrček, Václav & Martina Berrocal

2025. Sibling-texts keyword analysis: exploring topic and register keywords. Digital Scholarship in the Humanities 40:3 ► pp. 762 ff.

Palayon, Raymund & Irish Mae Dalona

2025. The Language of Praise and Worship: A Corpus Analysis of Register Variation in Christian Songs. Southeastern Philippines Journal of Research and Development 30:2 ► pp. 119 ff.

Elsoufy, Ayman Mohamed

2024. Media bias through collocations: a corpus-based study of Egyptian and Ethiopian news coverage of the Grand Ethiopian Renaissance Dam. Humanities and Social Sciences Communications 11:1

Nickels, Lindsay C, Trisha L Marshall, Ezra Edgerton, Patrick W Brady, Philip A Hagedorn & James J Lee

2024. Defining diagnostic uncertainty as a discourse type: A transdisciplinary approach to analysing clinical narratives of Electronic Health Records. Applied Linguistics 45:1 ► pp. 134 ff.

Bozward, David, Matthew Rogers-Draycott, Kelly Smith, Mokuba Mave, Vic Curtis, Chinthaka Aluthgama-Baduge, Rob Moon & Nigel Adams

2023. Exploring the outcomes of enterprise and entrepreneurship education in UK HEIs: An Excellence Framework perspective. Industry and Higher Education 37:3 ► pp. 345 ff.

Kyröläinen, Aki-Juhani & Veronika Laippala

2023. Predictive keywords: Using machine learning to explain document characteristics. Frontiers in Artificial Intelligence 5

Palayon, Raymund & David Perrodin

2023. "I am Christ": Keyness Analyses of Christ Claimants' Sermons. Southeastern Philippines Journal of Research and Development 28:1 ► pp. 69 ff.

Palayon, Raymund T., Richard Watson Todd & Sompatu Vungthong

2022. From the temple of life to the temple of death: keyness analyses of the transitions of a cult. Corpora 17:3 ► pp. 331 ff.

Palayon, Raymund T., Richard Watson Todd & Sompatu Vungthong

2022. Distinguishing the Language of Destructive Cults from the Language of Mainstream Religion: Corpus Analyses of Sermons. rEFLections 29:1 ► pp. 20 ff.

This list is based on CrossRef data as of 30 November 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.