In: Challenges in Corpus Linguistics: Rethinking corpus compilation and analysis
Edited by Mark Kaunisto and Marco Schilk
[Studies in Corpus Linguistics 118] 2024
pp. 35–54
Named entities as potentially problematic items in corpora
Published online: 19 September 2024
https://doi.org/10.1075/scl.118.03kau
Abstract
This chapter discusses problems in the interpretation of corpus data
that arise from insufficient annotation of named entities. Many corpora
still do not allow users to set up queries that exclude items appearing
in names, even when such exclusion is needed to improve the precision of
the searches. Through an examination of case studies in major
English-language corpora, the chapter highlights the need to post-process
search results carefully, as irrelevant occurrences of named entities may
pose challenges in the analysis of word frequencies and their
collocational behaviour. The chapter calls for more detailed annotation of
named entities in large linguistic corpora that are already available and
stresses the importance of close inspection of the search hits.
Keywords: named entities, proper names, annotation, corpus linguistics
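The kind of post-processing the abstract calls for can be illustrated with a minimal sketch. It is not the chapter's own procedure: the capitalisation heuristic, the function names `looks_like_name` and `split_hits`, and the example sentences are all assumptions made for illustration. The idea is only to separate concordance lines where the search word is probably an ordinary word from lines where it may be part of a name, so that the latter can be inspected by hand.

```python
import string


def looks_like_name(tokens, i):
    """Heuristic guess (an assumption): is tokens[i] part of a named entity?

    A capitalised hit is treated as suspicious when it is not sentence-initial,
    or when the following word is capitalised as well.
    """
    word = tokens[i].strip(string.punctuation)
    if not word[:1].isupper():
        return False
    sentence_initial = i == 0 or tokens[i - 1][-1:] in ".!?"
    next_capitalised = (
        i + 1 < len(tokens)
        and tokens[i + 1].strip(string.punctuation)[:1].isupper()
    )
    return (not sentence_initial) or next_capitalised


def split_hits(lines, keyword):
    """Sort concordance lines into likely common-word uses and lines to review."""
    likely_common, needs_review = [], []
    for line in lines:
        tokens = line.split()
        positions = [
            i for i, tok in enumerate(tokens)
            if tok.strip(string.punctuation).lower() == keyword.lower()
        ]
        if not positions:
            continue  # the keyword does not occur in this line
        if any(looks_like_name(tokens, i) for i in positions):
            needs_review.append(line)
        else:
            likely_common.append(line)
    return likely_common, needs_review


# Invented example lines, not taken from any corpus.
hits = [
    "The samurai drew his sword.",
    "Samurai Jack first aired in 2001.",
    "He admired the Samurai Shodown series.",
    "Samurai were members of a warrior class.",
]
likely_common, needs_review = split_hits(hits, "samurai")
```

A heuristic of this kind only narrows down the lines that need attention. It will miss names written in lower case and will flag capitalised words in titles or headings, so manual inspection of the flagged hits remains necessary.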
Article outline
- 1. Introduction
- 2. Background
- 2.1 The concepts of proper nouns and proper names
- 2.2 Annotation of named entities
- 3. Case studies
- 3.1 Common nouns used as (parts of) proper nouns: Lifespan and samurai
- 3.2 Near-synonymous adjectives in named entities: Limited/restricted, royal/regal and fantastic/fabulous
- 4. Discussion and conclusion
- Notes
- References
