Article published in: International Journal of Corpus Linguistics
Vol. 27:2 (2022), pp. 191–219
A multi-dimensional comparison of the effectiveness and efficiency of association measures in collocation extraction
Published online: 10 May 2022
https://doi.org/10.1075/ijcl.19111.den
Abstract
Because collocations are ubiquitous and important in language use and learning, how to identify them effectively
and efficiently has been a topic of interest. Although some studies have evaluated many of the existing
association measures (AMs) used in the automatic identification of collocations, the results so far have been inconsistent and
unclear because of various limitations of those studies. Hence, this study conducts a multi-dimensional evaluation of the
effectiveness and efficiency of seven major AMs in identifying three types of collocations across five genres and seven
corpora of different sizes. The results indicate that while a few AMs, such as Log Likelihood Ratio and Cubic Mutual Information
(MI3), are consistently more effective and efficient than the other five AMs examined, no single AM may be
adequate for identifying different types of collocations across different genres and corpus sizes. Research implications
are also discussed.
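For readers unfamiliar with the two AMs named above, the following is a minimal sketch of how Log Likelihood Ratio and MI3 are commonly computed for a word pair from a 2x2 contingency table. It assumes the widely used formulations (MI3 as the base-2 logarithm of the cubed observed co-occurrence frequency over the expected frequency; log-likelihood as twice the sum of observed times log observed over expected); the article's exact implementation is not given here and may differ. The function and parameter names (`o11`, `f1`, `f2`, `n`) are illustrative, not taken from the article.

```python
import math


def contingency(o11, f1, f2, n):
    """Build observed and expected cells of the 2x2 table for a word pair.

    o11: co-occurrence frequency of word 1 and word 2 (assumed input)
    f1:  total frequency of word 1
    f2:  total frequency of word 2
    n:   total number of pair tokens in the corpus
    """
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    rows = (f1, n - f1)
    cols = (f2, n - f2)
    # Expected frequency of each cell under independence: row * column / n.
    expected = [r * c / n for r in rows for c in cols]
    return observed, expected


def mi3(o11, f1, f2, n):
    """Cubic Mutual Information: log2(O11^3 / E11)."""
    e11 = f1 * f2 / n
    return math.log2(o11 ** 3 / e11)


def log_likelihood(o11, f1, f2, n):
    """Log-likelihood ratio: 2 * sum(O * ln(O / E)) over the four cells."""
    observed, expected = contingency(o11, f1, f2, n)
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
```

Under these definitions, a pair that co-occurs exactly as often as chance predicts receives a log-likelihood score of zero, while MI3 still rewards it for raw frequency because of the cubed numerator. This difference in sensitivity to frequency is one reason different AMs can rank the same candidate pairs differently.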
Article outline
- 1. Introduction
- 2. Background and rationale: Key issues regarding collocation definition/identification
- 2.1 Definition and types of collocations
- 2.2 Existing AMs and studies on the effectiveness and efficiency of AMs
- 3. Methodology
- 3.1 AMs and factors included for evaluation and comparison
- 3.2 Corpora used
- 3.3 Tools and procedures used for data analysis and AM evaluation/comparison
- 4. Results and discussion
- 4.1 Results for Research Question 1: Variations among AMs in the general corpus
- 4.2 Results for Research Question 2: Effects of genres
- 4.3 Results for Research Question 3: Effects of collocation types
- 4.4 Results for Research Question 4: Effects of text length
- 4.5 Summary discussion
- 5. Conclusions
- Acknowledgements
- Note
