Article published in: Register Studies
Vol. 1:2 (2019), pp. 209–242
The reference corpus matters
Comparing the effect of different reference corpora on keyword analysis
Published online: 25 September 2019
https://doi.org/10.1075/rs.18001.gel
Abstract
This study investigates the effect that reference corpora of different registers have on the content of keyword
lists. The study focusses on two target corpora and the keyword lists generated for each when using three distinct reference
corpora. The two target corpora consist of published research by faculty at two PhD-granting programs in applied linguistics in
North America. The reference corpora comprise published research in applied linguistics, newspaper and magazine articles, and
fiction texts, respectively. The findings suggest that while common keywords representing each target corpus emerge regardless of
the reference corpus used in the analysis, there are also substantial differences. Primarily, using a reference corpus of the same
sub-register as the target corpus better highlights content unique to each target corpus while using a reference corpus of a
different register better uncovers words that reflect the register that the target corpora represent. Implications for conducting
keyword analysis are discussed.
Keywords: keyword analysis, register, reference corpus, target corpus
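
The effect described in the abstract can be illustrated with a minimal sketch. The abstract does not say which keyness statistic the study uses, so log-likelihood (G2), a common choice in keyword analysis, is an assumption here. The function names and the three tiny token lists are hypothetical and invented for illustration only; they are not data from the study.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 keyness for a word occurring a times in a target corpus of
    c tokens and b times in a reference corpus of d tokens."""
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(target_tokens, reference_tokens, top_n=5):
    """Return (word, G2) pairs for words relatively more frequent in the
    target corpus than in the reference corpus, highest G2 first."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference[word]  # Counter returns 0 for unseen words
        if a / c > b / d:  # keep positive keywords only
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Invented toy corpora (assumption, for illustration only).
target = ["corpus", "corpus", "the", "learner"]
same_register_ref = ["corpus", "corpus", "the", "the"]
fiction_ref = ["the", "the", "the", "storm"]

same_kw = [word for word, _ in keywords(target, same_register_ref)]
fiction_kw = [word for word, _ in keywords(target, fiction_ref)]
print(same_kw)     # ['learner']
print(fiction_kw)  # ['corpus', 'learner']
```

In this toy setup, the same-register reference cancels out the shared register word "corpus" and leaves only the content specific to the target, while the fiction reference lets the register word surface as a keyword. Real analyses would also need frequency thresholds and dispersion checks, which this sketch omits.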
Article outline
- 1. Introduction
- 1.1 Reference corpus and register
- 2. Methods
- 2.1 Corpora
- 2.2 Conducting keyword analysis
- 2.3 Dispersion
- 3. Case studies
- 3.1 Case study 1: Keywords generated against a reference corpus of the same sub-register
- 3.2 Case study 2: Keywords generated against reference corpora of different sub-registers
- 4. Conclusion
- Acknowledgments
Cited by (11)
Cvrček, Václav & Martina Berrocal
Yu, Wenyuan & NaRi Shin
Hashimoto, Brett & Kyra Nelson
Trnavac, Radoslava & Encarnacion Hidalgo Tenorio
Wang, Zheng
Kyröläinen, Aki-Juhani & Veronika Laippala
Rowson, Tatiana S., Sylvia Jaworska & Iwona Gibas
강, 소미, 하연 장 & 주연 장
Karpenko-Seccombe, Tatyana
Pojanapunya, Punjaporn & Richard Watson Todd
This list is based on CrossRef data as of 30 November 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.
