Article published in: International Journal of Corpus Linguistics
Vol. 23:2 (2018), pp. 183–215
Multi-unit association measures
Moving beyond pairs of words
Published online: 5 October 2018
https://doi.org/10.1075/ijcl.16098.dun
Abstract
This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type of representation. Multi-unit measures face an additional segmentation problem: once the implicit length constraint of pairwise measures is abandoned, association measures must also identify the borders of meaningful sequences. This paper takes a vector-based approach to the segmentation problem by using 18 unique measures to describe different aspects of multi-unit association. An examination of these measures across eight languages shows that they are stable across languages and that each provides a unique rank of associated sequences. Taken together, these measures expand corpus-based approaches to association by generalizing across varying lengths and types of representation.
Keywords: association strength, multi-unit association, sequences, ΔP, collocations
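As a rough illustration of the pairwise starting point described in the abstract, the sketch below computes ΔP in both directions for adjacent word pairs, using the standard definition ΔP(outcome | cue) = P(outcome | cue) − P(outcome | ¬cue). It then aggregates the pairwise values over a longer sequence by mean, sum, and minimum. The function names (`pairwise_delta_p`, `aggregate`), the labels "forward" and "backward", and the toy corpus are assumptions made for illustration. The aggregation step is only a guess based on the section titles "Mean ΔP and Sum ΔP" and "Minimum ΔP"; the paper's exact formulations are given in the full text and may differ.

```python
from collections import Counter


def pairwise_delta_p(tokens):
    """Return {(w1, w2): (forward, backward)} for every adjacent pair.

    forward  = P(w2 | w1) - P(w2 | not w1)   (left word is the cue)
    backward = P(w1 | w2) - P(w1 | not w2)   (right word is the cue)
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    left = Counter(tokens[:-1])   # how often each word fills the first slot
    right = Counter(tokens[1:])   # how often each word fills the second slot
    n = len(tokens) - 1           # total number of adjacent pairs
    scores = {}
    for (w1, w2), a in bigrams.items():
        b = left[w1] - a          # w1 followed by some other word
        c = right[w2] - a         # w2 preceded by some other word
        d = n - a - b - c         # pairs containing neither w1 first nor w2 second
        forward = a / (a + b) - (c / (c + d) if c + d else 0.0)
        backward = a / (a + c) - (b / (b + d) if b + d else 0.0)
        scores[(w1, w2)] = (forward, backward)
    return scores


def aggregate(sequence, scores, direction=0):
    """Assumed aggregation: combine pairwise ΔP over the adjacent pairs of a sequence.

    direction=0 uses the forward values, direction=1 the backward values.
    """
    values = [scores[pair][direction] for pair in zip(sequence, sequence[1:])]
    return {
        "mean": sum(values) / len(values),
        "sum": sum(values),
        "min": min(values),
    }


# Toy corpus (hypothetical data, not from the paper)
tokens = "a b c a b d a b c".split()
scores = pairwise_delta_p(tokens)
result = aggregate(("a", "b", "c"), scores, direction=0)
print(scores[("a", "b")])  # (1.0, 1.0): "a" and "b" always occur together
print(result)
```

In this toy corpus the pair ("b", "c") scores lower in the forward direction (about 0.67) than in the backward direction (about 0.83), because "b" is sometimes followed by "d" while "c" is always preceded by "b". This asymmetry is what makes a directional measure informative.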
Article outline
- 1. Introduction
- 2. Direction of association and sequence length
- 3. Data and methodology
- 4. Analysis: Formulating multi-unit association measures
  - 4.1 Mean ΔP and Sum ΔP
  - 4.2 Minimum ΔP
  - 4.3 Reduced ΔP
  - 4.4 Divided ΔP
  - 4.5 End-point ΔP
  - 4.6 Changed ΔP
  - 4.7 Summarizing the association measures
- 5. Discussion: Empirical analysis of association measures
  - 5.1 Relations between directions and measures
  - 5.2 Stability across languages and representation types
- 6. Using the measures together
- 7. Conclusions
- Acknowledgements
- References
