In: Computational Phraseology
Edited by Gloria Corpas Pastor and Jean-Pierre Colson
[IVITRA Research in Linguistics and Literature 24] 2020
pp. 189–206
Statistical significance for measures of collocation strength
Published online: 8 May 2020
https://doi.org/10.1075/ivitra.24.10oak
Abstract
Of the commonly used measures of lexical association or
collocation strength, only some directly relate to statistical significance:
the t-score, chi-squared, log-likelihood, the z-score and Fisher’s exact
test. We describe each of these tests, and also describe a computer
simulation by which we can derive confidence limits, and hence the
statistical significance, of any measure of lexical association which is
derived from the contingency table. We illustrate this approach using
pointwise mutual information (PMI). We also describe how the Poisson
distribution enables us to find the statistical significance of the raw
frequency with which a collocation is found. We compare all these methods
using collocates of “take”, namely “take up”, “take place”, “take advantage”
and “take stock”.
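
As a rough illustration of the kinds of calculation the abstract refers to, the following minimal sketch computes several association measures from a 2×2 contingency table, a Poisson tail probability for the raw co-occurrence frequency, and a simulated confidence interval for PMI. Everything here is an assumption made for illustration rather than the chapter's own procedure: the counts are hypothetical, the formulas are common textbook forms that may differ from the variants used in the chapter, the simulation is one plausible scheme (multinomial resampling of the observed table), and the function names are invented. Fisher's exact test is omitted for brevity.

```python
import math
import random


def association_measures(o11, o12, o21, o22):
    """Common association measures for a 2x2 contingency table.

    o11: node and collocate together, o12: node without collocate,
    o21: collocate without node, o22: neither. Requires o11 > 0.
    """
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22          # row totals
    c1, c2 = o11 + o21, o12 + o22          # column totals
    observed = [o11, o12, o21, o22]
    # Expected counts under independence of node and collocate.
    expected = [r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n]
    e11 = expected[0]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    g2 = 2 * sum(o * math.log(o / e)
                 for o, e in zip(observed, expected) if o > 0)
    return {
        "chi2": chi2,                             # Pearson's X^2
        "g2": g2,                                 # log-likelihood G^2
        "z": (o11 - e11) / math.sqrt(e11),        # simplified z-score
        "t": (o11 - e11) / math.sqrt(o11),        # t-score
        "pmi": math.log2(o11 / e11),              # pointwise mutual information
    }


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation.

    Adequate for small lam only; very small tails lose precision.
    """
    term = math.exp(-lam)                  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)              # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def simulated_pmi_interval(o11, o12, o21, o22, runs=200, level=0.95, seed=0):
    """Percentile interval for PMI from multinomial resampling (assumed scheme)."""
    rng = random.Random(seed)
    cells = [o11, o12, o21, o22]
    n = sum(cells)
    pmis = []
    for _ in range(runs):
        draws = rng.choices(range(4), weights=cells, k=n)
        c11, c12, c21, _ = (draws.count(i) for i in range(4))
        if c11 == 0:
            continue                       # PMI undefined for a zero count
        pmis.append(math.log2(c11 * n / ((c11 + c12) * (c11 + c21))))
    pmis.sort()
    lower = pmis[int((1 - level) / 2 * len(pmis))]
    upper = pmis[int((1 + level) / 2 * len(pmis)) - 1]
    return lower, upper


# Hypothetical counts, not taken from the chapter.
table = (30, 70, 170, 9730)
print(association_measures(*table))
print(poisson_upper_tail(30, 2.0))
print(simulated_pmi_interval(*table))
```

Resampling whole corpora this way becomes slow for realistic corpus sizes; the small table above is chosen only to keep the sketch quick to run.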
Article outline
- 1. Introduction
- 2. The chi-squared test (X²)
- 3. The log-likelihood test (G²)
- 4. Fisher’s exact test
- 5. The z-score
- 6. The t-test
- 7. Pointwise mutual information
- 8. Computer simulations to estimate statistical significance
- 9. The Poisson distribution
- 10. Confidence limits of the mean and standard deviation
- 11. Experimental comparison of measures
- 12. Conclusion
- Notes
- References
