Multi-unit association measures: Moving beyond pairs of words

Dunn, Jonathan

doi:10.1075/ijcl.16098.dun

Article published in: International Journal of Corpus Linguistics
Vol. 23:2 (2018) ► pp. 183–215

Jonathan Dunn | University of Canterbury

Published online: 5 October 2018

https://doi.org/10.1075/ijcl.16098.dun

Abstract

This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type of representation. Multi-unit measures face an additional segmentation problem: once the implicit length constraint of pairwise measures is abandoned, association measures must also identify the borders of meaningful sequences. This paper takes a vector-based approach to the segmentation problem by using 18 unique measures to describe different aspects of multi-unit association. An examination of these measures across eight languages shows that they are stable across languages and that each provides a unique rank of associated sequences. Taken together, these measures expand corpus-based approaches to association by generalizing across varying lengths and types of representation.

Keywords: association strength, multi-unit association, sequences, ΔP, collocations
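
As a rough illustration of the pairwise ΔP measure that the abstract builds on, the sketch below computes both directions of ΔP for an adjacent pair from a 2x2 contingency table: ΔP(Y|X) = a/(a+b) - c/(c+d) and ΔP(X|Y) = a/(a+c) - b/(b+d). It then summarizes the left-to-right values across a longer sequence by mean, sum, and minimum. The summaries are only one plausible reading of the section names Mean ΔP, Sum ΔP, and Minimum ΔP in the outline; the article's exact formulations are not reproduced on this page. The function names (`contingency`, `delta_p`, `sequence_summaries`) and the toy token list are hypothetical.

```python
from collections import Counter


def contingency(x, y, bigrams):
    """Return the 2x2 counts (a, b, c, d) for the adjacent pair (x, y)."""
    total = sum(bigrams.values())
    a = bigrams[(x, y)]                                   # x followed by y
    x_first = sum(n for (left, _), n in bigrams.items() if left == x)
    y_second = sum(n for (_, right), n in bigrams.items() if right == y)
    b = x_first - a                                       # x followed by not-y
    c = y_second - a                                      # not-x followed by y
    d = total - a - b - c                                 # neither
    return a, b, c, d


def ratio(num, den):
    """Divide, returning 0.0 for an empty row or column (an assumption)."""
    return num / den if den else 0.0


def delta_p(x, y, bigrams):
    """Return (left-to-right, right-to-left) Delta-P for the pair (x, y)."""
    a, b, c, d = contingency(x, y, bigrams)
    left_to_right = ratio(a, a + b) - ratio(c, c + d)     # Delta-P(y | x)
    right_to_left = ratio(a, a + c) - ratio(b, b + d)     # Delta-P(x | y)
    return left_to_right, right_to_left


def sequence_summaries(sequence, bigrams):
    """Summarize left-to-right Delta-P over adjacent pairs (needs >= 2 units).

    Assumption: mean, sum, and minimum over adjacent pairs, as one reading
    of the measure names; not the article's confirmed definitions.
    """
    lr = [delta_p(x, y, bigrams)[0] for x, y in zip(sequence, sequence[1:])]
    return {"mean": sum(lr) / len(lr), "sum": sum(lr), "min": min(lr)}


# Toy corpus, purely illustrative.
tokens = "a b a b c b".split()
bigrams = Counter(zip(tokens, tokens[1:]))
print(delta_p("a", "b", bigrams))
print(sequence_summaries(["a", "b", "c"], bigrams))
```

Because the summaries accept a sequence of any length, the same functions apply to pairs and to longer sequences. This mirrors, in a simplified way, the move away from an implicit two-unit length constraint.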

Article outline

1. Introduction
2. Direction of association and sequence length
3. Data and methodology
4. Analysis: Formulating multi-unit association measures
- 4.1 Mean ΔP and Sum ΔP
- 4.2 Minimum ΔP
- 4.3 Reduced ΔP
- 4.4 Divided ΔP
- 4.5 End-point ΔP
- 4.6 Changed ΔP
- 4.7 Summarizing the association measures
5. Discussion: Empirical analysis of association measures
- 5.1 Relations between directions and measures
- 5.2 Stability across languages and representation types
6. Using the measures together
7. Conclusions
Acknowledgements
References

Cited by (6)

Li, Jingjie & Wenjie Hu

2025. Identification of sentence stems characteristic of Chinese learner English writing. Heliyon 11:3 ► pp. e37166 ff.

Dunn, Jonathan

2022. Natural Language Processing for Corpus Linguistics.

Dunn, Jonathan

2022. Exposure and emergence in usage-based grammar: computational experiments in 35 languages. Cognitive Linguistics 33:4 ► pp. 659 ff.

Dunn, Jonathan

2024. Computational Construction Grammar.

Gries, Stefan Th.

2022. Multi-word units (and tokenization more generally): a multi-dimensional and largely information-theoretic approach. Lexis 19.

Murakami, Akira & Nick C. Ellis

2022. Effects of Availability, Contingency, and Formulaicity on the Accuracy of English Grammatical Morphemes in Second Language Writing. Language Learning 72:4 ► pp. 899 ff.

This list is based on CrossRef data as of 12 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.