Article published in: The Mental Lexicon
Vol. 10:1 (2015), pp. 152–163
Issues with the capture-recapture measure of vocabulary size
Published online: 1 May 2015
https://doi.org/10.1075/ml.10.1.06nel
This short paper discusses shortcomings of the capture-recapture (CR) method of estimating vocabulary size (Meara & Olmos Alcoy, 2010; Williams, Segalowitz & Leclair, 2014). When sampling from a population generated by a power-law process (e.g., a Zipf distribution), the probability that any given member is selected depends on its frequency rank: members at the top of the ranking (e.g., 1st, 2nd, 3rd) are much more likely to be selected than members further down (e.g., 100th, 1000th). As a result, repeated samples tend to draw on the same limited group of high-frequency words. The CR measure, however, assumes that every member is equally likely to be selected (a uniform distribution), and so it drastically underestimates the size of the vocabulary when applied to power-law data. Work with simulated data shows ways in which the degree of underestimation may be lessened. Applying these methods to real data shows effects parallel to those in the simulations.
Keywords: measurement, corpora, vocabulary, Zipf’s law
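The underestimation described in the abstract can be illustrated with a minimal simulation sketch. It is not the paper's own procedure (the reference list suggests the original work used Mathematica); the vocabulary size, sample size, number of trials, the 1/rank weighting, and the function names `chapman_estimate`, `draw_sample` and `simulate` are all assumptions chosen for illustration. The estimator is Chapman's bias-corrected form of the Petersen capture-recapture estimate, used here mainly because it avoids division by zero when two samples share no words.

```python
import random


def chapman_estimate(sample_a, sample_b):
    """Chapman's bias-corrected Petersen estimate of population size.

    Each sample is a list of word indices (tokens); the estimate is
    computed over the distinct words (types) found in each sample.
    """
    types_a, types_b = set(sample_a), set(sample_b)
    overlap = len(types_a & types_b)
    return (len(types_a) + 1) * (len(types_b) + 1) / (overlap + 1) - 1


def draw_sample(rng, weights, n_tokens):
    """Draw n_tokens word indices with replacement, proportional to weights."""
    return rng.choices(range(len(weights)), weights=weights, k=n_tokens)


def simulate(vocab_size=5000, n_tokens=500, trials=100, seed=0):
    """Return the mean estimate under uniform and Zipf-like sampling.

    All parameter values are illustrative assumptions, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    schemes = {
        "uniform": [1.0] * vocab_size,
        # Zipf-like weights: selection probability proportional to 1 / rank.
        "zipf": [1.0 / rank for rank in range(1, vocab_size + 1)],
    }
    results = {}
    for name, weights in schemes.items():
        estimates = [
            chapman_estimate(
                draw_sample(rng, weights, n_tokens),
                draw_sample(rng, weights, n_tokens),
            )
            for _ in range(trials)
        ]
        results[name] = sum(estimates) / trials
    return results


if __name__ == "__main__":
    for scheme, mean_estimate in simulate().items():
        print(f"{scheme}: mean estimated vocabulary size = {mean_estimate:.0f}")
```

Under these assumptions the uniform condition should recover a value close to the true vocabulary size, while the Zipf-like condition should fall well below it, because both samples are dominated by the same high-frequency words and the overlap between them is inflated.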
References (13)
Chapman, D.G. (1951). Some properties of the hypergeometric distribution with applications to zoological sample censuses. University of California Publications in Statistics, 1(7), 131–160. Full text available through the Hathi Trust Digital Library at [URL].
Cofer, C.N., & Shevitz, R. (1952). Word-association as a function of word-frequency. The American Journal of Psychology, 65(1), 75–79.
Cook, S.W., & Skinner, B.F. (1939). Some factors influencing the distribution of associated words. The Psychological Record, 3(1), 178–184.
Howes, D. (1957). On the relation between the probability of a word as an association and in general linguistic usage. The Journal of Abnormal and Social Psychology, 54(1), 75.
Jensen, A.L. (1981). Sample sizes for single mark and single recapture experiments. Transactions of the American Fisheries Society, 110(3), 455–458.
LOCNESS: The Louvain Corpus of Native English Essays. Centre for English Corpus Linguistics CECL, Université catholique de Louvain, Belgium.
Meara, P.M., & Olmos Alcoy, J.C. (2010). Words as species: An alternative approach to estimating productive vocabulary size. Reading in a Foreign Language, 22(1), 222.
Nelson, D.L., McEvoy, C.L., & Schreiber, T.A. (1998). The University of South Florida word association, rhyme, and word fragment norms. [URL]
Robson, D.S., & Regier, H.A. (1964). Sample size in Petersen mark–recapture experiments. Transactions of the American Fisheries Society, 93(3), 215–226.
Steyvers, M., & Tenenbaum, J.B. (2005). The large‐scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29(1), 41–78.
Williams, J., Segalowitz, N., & Leclair, T. (2014). Estimating second language productive vocabulary size: A capture-recapture approach. The Mental Lexicon, 9(1), 23–47.
Wolfram Research Inc. (2014). Mathematica 10.0. Champaign, IL.
