Article published in: International Journal of Corpus Linguistics
Vol. 27:3 (2022) ► pp.349–379
Use words, not constructions!
A new perspective on the unit of analysis in collostructional analysis
Published online: 25 May 2022
https://doi.org/10.1075/ijcl.20072.pro
Abstract
The aim of collostructional analysis or, more precisely, simple collexeme analysis, is to quantify the statistical association between a construction c and a lexeme l that occurs in a particular slot of the construction. The analysis is based on 2×2 contingency tables that ought to represent a cross-classification of the units of analysis. So far, the units of analysis have been identified either as all constructions in the corpus or all instances of a class C of constructions to which construction c belongs. In practice, it is often not possible or feasible to identify these constructions. Therefore, the sample size is typically approximated by heuristic estimates. The bottom-right cell of the contingency table is most affected by these approximations. I suggest that the units of analysis be defined on the word level, instead, as the class W of word forms that satisfy the restrictions on the collexeme slot of c.
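As an illustration of the 2×2 table described in the abstract, the following minimal sketch derives the four cells from four frequencies and computes one common association measure. It rests on assumptions that are not taken from the article: the function names `contingency_table` and `log_likelihood` are hypothetical, the log-likelihood statistic G² is chosen here only as an example measure, and all counts are invented. The sample size `n` stands for the number of units of analysis, however they are defined.

```python
import math
from typing import NamedTuple


class Table(NamedTuple):
    a: int  # lexeme l in the collexeme slot of construction c
    b: int  # other lexemes in the slot of c
    c: int  # lexeme l outside of c
    d: int  # neither l nor c (the bottom-right cell)


def contingency_table(f_lc: int, f_c: int, f_l: int, n: int) -> Table:
    """Build the 2x2 table from the joint frequency of l and c (f_lc),
    the frequency of c (f_c), the frequency of l (f_l), and the
    sample size n (the total number of units of analysis)."""
    table = Table(
        a=f_lc,
        b=f_c - f_lc,
        c=f_l - f_lc,
        d=n - f_c - f_l + f_lc,  # the only cell that depends on n
    )
    if min(table) < 0:
        raise ValueError("inconsistent frequencies")
    return table


def log_likelihood(t: Table) -> float:
    """Log-likelihood statistic G2 = 2 * sum(O * ln(O / E)),
    with expected values E derived from the table margins."""
    n = sum(t)
    rows = (t.a + t.b, t.c + t.d)
    cols = (t.a + t.c, t.b + t.d)
    observed = ((t.a, t.b), (t.c, t.d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a cell with O = 0 contributes nothing
                e = rows[i] * cols[j] / n
                total += o * math.log(o / e)
    return 2 * total


# Invented counts: the same construction and lexeme frequencies,
# but two different estimates of the sample size.
large = contingency_table(f_lc=30, f_c=100, f_l=500, n=1_000_000)
small = contingency_table(f_lc=30, f_c=100, f_l=500, n=100_000)
print(large, log_likelihood(large))
print(small, log_likelihood(small))
```

With these invented counts, only the bottom-right cell `d` differs between the two tables, yet the association score changes with it. This is the sensitivity to the choice of sample size that the abstract points to.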
Article outline
- 1. Introduction
- 1.1 Analysis of cooccurrence data with contingency tables
- 1.2 Collostructional analysis
- 2. The unit of analysis in simple collexeme analysis
- 2.1 Constructions as the unit of analysis
- 2.2 Approximating the sample size
- 2.3 Problems arising from approximating the sample size
- 2.4 Practical impact of the approximations
- 3. Suggested solution
- 4. Discussion
- 4.1 Methodological advantages
- 4.2 Accidental application of word-based simple collexeme analysis
- 4.3 Change of interpretative perspective
- 5. Word-based vs. heuristic simple collexeme analysis – Case studies
- 5.1 The [N waiting to happen] construction
- 5.2 The [X think nothing of V-ing] construction
- 6. Conclusion
- Notes
- References
