Statistical significance for measures of collocation strength

Oakes, Michael P.

doi:10.1075/ivitra.24.10oak

In: Computational Phraseology
Edited by Gloria Corpas Pastor and Jean-Pierre Colson
[IVITRA Research in Linguistics and Literature 24] 2020
pp. 189–206

Michael P. Oakes | University of Wolverhampton | Michael.Oakes@wlv.ac.uk

Published online: 8 May 2020

https://doi.org/10.1075/ivitra.24.10oak

Abstract

Of the commonly used measures of lexical association or collocation strength, only some relate directly to statistical significance: the t-score, chi-squared, log-likelihood, the z-score and Fisher’s exact test. We describe each of these tests, and also a computer simulation from which we can derive confidence limits, and hence the statistical significance, of any measure of lexical association that is derived from the contingency table. We illustrate this approach using pointwise mutual information (PMI). We also describe how the Poisson distribution enables us to find the statistical significance of the raw frequency with which a collocation is found. We compare all of these methods using four collocations of “take”: “take up”, “take place”, “take advantage” and “take stock”.
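As an illustration of the contingency-table measures named in the abstract, the following minimal Python sketch computes chi-squared, log-likelihood (G²), a z-score, a t-score, PMI and a one-sided Fisher's exact p-value from a 2×2 table. The formulas are common textbook forms and may differ in detail from those used in the chapter; the function names and the cell layout (o11 = both words together, o12 = first word without the second, o21 = second word without the first, o22 = neither) are assumptions made here for illustration.

```python
import math

def expected_counts(o11, o12, o21, o22):
    """Expected cell counts if the two words occurred independently."""
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    return (row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n)

def association_measures(o11, o12, o21, o22):
    """Association scores for one word pair; assumes o11 > 0 and non-zero marginals."""
    observed = (o11, o12, o21, o22)
    expected = expected_counts(*observed)
    e11 = expected[0]
    return {
        # Pearson chi-squared over all four cells
        "chi2": sum((o - e) ** 2 / e for o, e in zip(observed, expected)),
        # Log-likelihood ratio; cells with zero observations contribute nothing
        "g2": 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0),
        # Simplified z-score: deviation scaled by the expected count
        "z": (o11 - e11) / math.sqrt(e11),
        # Common t-score approximation: deviation scaled by the observed count
        "t": (o11 - e11) / math.sqrt(o11),
        # Pointwise mutual information in bits
        "pmi": math.log2(o11 / e11),
    }

def fisher_one_sided(o11, o12, o21, o22):
    """P(co-occurrence count >= o11) under the hypergeometric null with fixed marginals."""
    n = o11 + o12 + o21 + o22
    row1, col1 = o11 + o12, o11 + o21
    tail = sum(math.comb(col1, k) * math.comb(n - col1, row1 - k)
               for k in range(o11, min(row1, col1) + 1))
    return tail / math.comb(n, row1)

# Hypothetical counts, not taken from the chapter's data
scores = association_measures(30, 970, 1970, 997030)
```

Chi-squared and G² are usually compared against the chi-squared distribution with one degree of freedom, whereas PMI has no associated significance test of its own, which is the gap the simulation approach is meant to fill.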

Keywords: collocation strength, statistical significance, Monte Carlo methods, Poisson distribution

Article outline

1. Introduction
2. The chi-squared test (X²)
3. The log-likelihood test (G²)
4. Fisher’s exact test
5. The z-score
6. The t-test
7. Pointwise mutual information
8. Computer simulations to estimate statistical significance
9. The Poisson distribution
10. Confidence limits of the mean and standard deviation
11. Experimental comparison of measures
12. Conclusion
Notes
References
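
The simulation and Poisson topics listed in the outline can be sketched as follows. This page does not give the details of the chapter's simulation, so the resampling scheme here is an assumption: each simulated corpus is drawn from a multinomial distribution whose cell probabilities are the observed proportions, and percentile confidence limits are read off the sorted PMI values. The function names, the default number of trials and the example counts are all hypothetical.

```python
import math
import random
from collections import Counter

def pmi(o11, o12, o21, o22):
    """Pointwise mutual information (bits) from a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    e11 = (o11 + o12) * (o11 + o21) / n
    return math.log2(o11 / e11)

def simulate_pmi_interval(table, trials=1000, level=0.95, seed=0):
    """Percentile confidence limits for PMI by resampling the four cells.

    Assumed scheme: multinomial resampling with the observed proportions.
    Samples in which the pair never co-occurs are skipped, because PMI is
    undefined there; this slightly biases the limits when o11 is small.
    """
    rng = random.Random(seed)
    n = sum(table)
    values = []
    for _ in range(trials):
        counts = Counter(rng.choices(range(4), weights=table, k=n))
        if counts[0] == 0:
            continue
        values.append(pmi(counts[0], counts[1], counts[2], counts[3]))
    values.sort()
    cut = int((1 - level) / 2 * len(values))
    return values[cut], values[len(values) - 1 - cut]

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): chance of seeing k or more co-occurrences."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)

# Hypothetical counts, not taken from the chapter's data
table = (20, 80, 180, 9720)
low, high = simulate_pmi_interval(table, trials=300)
p_raw = poisson_upper_tail(20, 2.0)  # 20 observed against 2 expected
```

If the lower confidence limit stays above zero, the PMI can be read as significantly greater than the value expected under independence at the chosen level. The Poisson tail treats the expected co-occurrence count as the rate parameter, which is a common approximation for rare events in a large corpus.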

References

Agresti, A. (2002). Categorical Data Analysis. Second Edition. Hoboken, NJ: John Wiley.

Altman, D. G. (1991). Practical Statistics for Medical Research. Boca Raton, FL: Chapman and Hall/CRC.

Berry-Rogghe, G. L. M. (1973). The Computation of Collocations and their Relevance in Lexical Studies. In A. J. Aitken, R. Bailey, & N. Hamilton-Smith (Eds.), The Computer and Literary Studies (pp. 103–112). Edinburgh: Edinburgh University Press.

Church, K., Gale, W., Hanks, P., & Hindle, D. (1991). In U. Zernik (Ed.), Exploiting Online Resources to Build a Lexicon (pp. 115–164). Hillsdale, NJ: Lawrence Erlbaum.

Church, K., & Hanks, P. (1989). Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 26–29 June 1989, Vancouver (pp. 76–83).

Hanks, P., El-Maarouf, I., & Oakes, M. (2017). In M. Sailer, & S. Markantonatou (Eds.), MWE: Insights from a Multi-lingual Perspective. Berlin: Language Science Press.

Koehler, K. (1986). Goodness-of-fit tests for log-linear models in sparse contingency tables. Journal of the American Statistical Association, 81, 483–493.

Koehler, K., & Larntz, K. (1980). An empirical investigation of goodness-of-fit statistics for sparse multinomials. Journal of the American Statistical Association, 75, 336–344.

Larntz, K. (1978). Small-sample comparison of exact levels for chi-squared goodness-of-fit statistics. Journal of the American Statistical Association, 73, 253–263.

Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: Massachusetts Institute of Technology.

Moore, R. C. (2004). On Log-Likelihood ratios and the significance of rare events. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004) (pp. 333–340). Barcelona, Spain.

Pecina, P. (2008). Lexical Association Measures: Collocation Extraction. (PhD Thesis, Charles University in Prague).

Seretan, V. (2011). Syntax-based Collocation Extraction. Berlin: Springer.

Cited by (1)

Watson Todd, Richard, Chanen Munkong, Passanan Assavarak & Punjaporn Pojanapunya. 2025. Authentic learning for soft skills development and environmental sustainability. London Review of Education 23:1

This list is based on CrossRef data as of 12 November 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.