In: Applications of Pattern-driven Methods in Corpus Linguistics
Edited by Joanna Kopaczyk and Jukka Tyrkkö
[Studies in Corpus Linguistics 82] 2018
► pp. 15–56
Chapter 2. From lexical bundles to surprisal and language models
Measuring the idiom principle in native and learner language
Published online: 13 March 2018
https://doi.org/10.1075/scl.82.02sch
Abstract
We exploit the information-theoretic measure of surprisal to analyse the formulaicity of lexical sequences. We first show the prevalence of individual lexical bundles. We then argue that surprisal, as an information-theoretic measure of lexical bundleness, formulaicity and non-creativity, is an appropriate measure of the idiom principle, because it expresses reader expectations and text entropy. As strong and gradient formulaic, idiomatic and selectional preferences prevail on all levels, we argue for the abstraction step from individual bundles to measures of bundleness. We use surprisal to analyse differences between genres of native language use and learner language at different levels: (a) spoken and written genres of native language (L1); (b) spoken and written learner language (L2), across selected written genres; (c) learner language as compared with native language (L1). We thus test Pawley and Syder’s (1983) hypothesis that native speakers know best how to play the tug-of-war between formulaicity (Sinclair’s idiom principle) and expressiveness (Sinclair’s open-choice principle). This balance can be measured with Levy and Jaeger’s (2007) uniform information density (UID), a principle of minimizing comprehension difficulty. Our goal of abstracting away from word sequences also leads us to language models as models of processing, first in the form of a part-of-speech tagger, then in the form of a syntactic parser. While our hypotheses are largely confirmed, we also observe that advanced learners bundle most, and that scientific language may show lower surprisal than spoken language.
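As a rough illustration of the surprisal measure named in the abstract, the sketch below computes per-word surprisal, -log2 P(word | previous word), from a toy bigram model. It is not the chapter's method: the bigram order, the add-one smoothing, the toy corpus and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def train_bigram_model(tokens):
    """Collect context counts, bigram counts and the vocabulary of a toy corpus."""
    contexts = Counter(tokens[:-1])             # how often each word precedes another
    bigrams = Counter(zip(tokens, tokens[1:]))  # counts of adjacent word pairs
    vocab = set(tokens)
    return contexts, bigrams, vocab


def surprisal(prev, word, model):
    """Surprisal in bits, -log2 P(word | prev), with add-one smoothing (an assumption)."""
    contexts, bigrams, vocab = model
    prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))
    return -math.log2(prob)


def mean_surprisal(tokens, model):
    """Average surprisal over all adjacent word pairs of a sequence."""
    pairs = list(zip(tokens, tokens[1:]))
    return sum(surprisal(p, w, model) for p, w in pairs) / len(pairs)


# Hypothetical toy corpus; a real study would train on a large reference corpus.
corpus = "at the end of the day at the end of the week".split()
model = train_bigram_model(corpus)

formulaic = "at the end of the day".split()   # sequence attested in the corpus
scrambled = "day the of end week at".split()  # same kind of words, unattested order

print(mean_surprisal(formulaic, model))
print(mean_surprisal(scrambled, model))
```

Under these assumptions the attested, bundle-like sequence receives a lower mean surprisal than the scrambled one, which is the intuition behind using surprisal as a gradient measure of bundleness.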
Article outline
- 1. Introduction
- 2. Related research
- 3. Materials
- 4. From frequencies to collocations
- 4.1 Frequency as measure of lexical bundleness
- 4.2 Collocation measures: O/E and T-score
- 4.2.1 Method
- 4.2.2 Results
- 5. Surprisal as a measure of bundleness
- 5.1 Method
- 5.2 Results
- 5.3 Bundleness of spoken L2 compared to corrected L2
- 5.4 Bundleness of written L2 compared to L1
- 6. Collocations as non-adjacent relations in a syntactic frame
- 7. Part-of-Speech tagging model
- 7.1 Method
- 7.2 Results
- 8. Parser as a language processing model
- 8.1 Method
- 8.2 Parser performance
- 8.3 Parser model fit
- 9. Conclusions and outlook
References (62)
Altenberg, Bengt & Tapper, Marie. 1998. The use of adverbial connectors in advanced Swedish learners’ written English. In Learner English on Computer, Sylviane Granger (ed.), 80–93. London: Addison Wesley Longman.
Aston, Guy & Burnard, Lou. 1998. The BNC Handbook. Exploring the British National Corpus with SARA. Edinburgh: EUP.
Bartsch, Sabine & Evert, Stefan. 2014. Towards a Firthian notion of collocation. In Vernetzungsstrategien, Zugriffsstrukturen und automatisch ermittelte Angaben in Internetwörterbüchern [OPAL – Online publizierte Arbeiten zur Linguistik 2/2014], Andrea Abel & Lothar Lemnitzer (eds), 48–61. Mannheim: Institut für Deutsche Sprache.
Biber, Douglas. 2003. Compressed noun-phrase structures in newspaper discourse: The competing demands of popularization vs. economy. In New Media Language, Jean Aitchison & Diana Lewis (eds), 169–181. London: Routledge.
Biber, Douglas. 2009. A corpus-driven approach to formulaic language in English: Multi-word patterns in speech and writing. International Journal of Corpus Linguistics 14(3): 275–311.
Biber, Douglas & Barbieri, Federica. 2007. Lexical bundles in university spoken and written registers. English for Specific Purposes 26: 263–286.
Biber, Douglas, Conrad, Susan & Cortes, Viviana. 2004. If you look at…: Lexical bundles in university teaching and textbooks. Applied Linguistics 25: 371–405.
Biber, Douglas, Johansson, Stig, Leech, Geoffrey, Conrad, Susan & Finegan, Edward. 1999. Longman Grammar of Spoken and Written English. London: Longman.
Bonk, William J. 2000. Testing ESL learners’ knowledge of collocations. Urbana IL: Clearinghouse. <[URL]>
Conrad, Susan & Biber, Douglas. 2004. The frequency and use of lexical bundles in conversation and academic prose. Lexicographica 20: 56–71.
Cheng, Winnie, Greaves, Chris, Sinclair, John McH. & Warren, Martin. 2009. Uncovering the extent of the phraseological tendency: Towards a systematic analysis of concgrams. Applied Linguistics 30(2): 236–252.
Ellis, Nick C. 2002. Frequency effects in language processing. Studies in Second Language Acquisition 24(2): 143–188.
Ellis, Nick C., Frey, Eric & Jalkanen, Isaac. 2009. The psycholinguistic reality of collocation and semantic prosody (1): Lexical access. In Exploring the Lexis-Grammar Interface [Studies in Corpus Linguistics 35], Ute Römer & Rainer Schulze (eds), 89–114. Amsterdam: John Benjamins.
Ellis, Nick C. & Frey, Eric. 2009. The psycholinguistic reality of collocation and semantic prosody (2): Affective priming. Formulaic Language 2: 473–497.
Ellis, Nick C., Simpson Vlach, Rita & Maynard, Carson. 2008. Formulaic language in native and second language speakers: Psycholinguistics, corpus linguistics, and TESOL. Tesol Quarterly 42(3): 375–396.
Erman, Britt & Warren, Beatrice. 2000. The idiom principle and the open choice principle. TEXT 20(1): 29–62.
Erman, Britt. 2009. Formulaic language from a learner perspective: What the learner needs to know. In Formulaic Language, Vol. II: Acquisition, Loss, Psychological Reality, and Functional Explanations [Typological Studies in Language 83], Roberta Corrigan, Edith A. Moravcsik, Hamid Ouali & Kathleen M. Wheatley (eds), 323–346. Amsterdam: John Benjamins.
Evert, Stefan. 2009. Corpora and collocations. In Corpus Linguistics. An International Handbook, Anke Lüdeling & Merja Kytö (eds), 1212–1248. Berlin: Mouton de Gruyter.
Frank, Stefan L. & Bod, Rens. 2011. Insensitivity of the human sentence-processing system to the hierarchical structure. Psychological Science 22(6): 829–834.
Frank, Stefan L., Fernandez Monsalve, Irene, Thompson, Robin L. & Vigliocco, Gabriella. 2013. Reading-time data for evaluating broad-coverage models of English sentence processing. Behavior Research Methods 45: 1182–1190.
Fossum, Victoria & Levy, Roger. 2012. Sequential vs. hierarchical models of human incremental sentence processing. In Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012), Montreal, Canada, Roger Levy & David Reitter (eds), 61–69. Montreal: Association for Computational Linguistics.
Gildea, Daniel. 2001. Corpus variation and parser performance. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing (EMNLP), 167–202, Pittsburgh, PA.
Gries, Stefan T. 2013. 50-something years of work on collocations: What is or should be next…. International Journal of Corpus Linguistics 18(1): 137–166. Special issue Current Issues in Phraseology, Sebastian Hoffmann, Bettina Fischer-Starcke & Andrea Sand (eds).
Gries, Stefan T. 2010. Useful statistics for corpus linguistics. In A Mosaic of Corpus Linguistics: Selected Approaches, Aquilino Sánchez & Moisés Almela (eds), 269–291. Frankfurt: Peter Lang.
Granger, Sylviane. 2009. Prefabricated patterns in advanced EFL writing: Collocations and formulae. In Phraseology: Theory, Analysis, and Applications, Anthony P. Cowie (ed.), 185–204. Tokyo: Kurosio.
Granger, Sylviane & Tyson, Stephanie. 1996. Connector usage in the English essay writing of native and non-native EFL speakers of English. World Englishes 15(1): 17–27.
Izumi, Emi, Uchimoto, Kiyotaka & Isahara, Hitoshi. 2005. Error annotation for corpus of Japanese learner English. Proceedings of the Sixth International Workshop on Linguistically Interpreted Corpora (LINC 2005). <[URL]>
Ishikawa, Shin. 2009. Vocabulary in interlanguage: A study on corpus of English essays written by Asian university students (CEEAUS). In Phraseology, Corpus Linguistics and Lexicography: Papers from Phraseology 2009 in Japan, Katsumasa Yagi & Takaaki Kanzaki (eds), 87–100. Nishinomiya: Kwansei Gakuin University Press.
Kennedy, Chris & Thorp, Dilys. 2007. A corpus investigation of linguistic responses to an IELTS Academic Writing task. In IELTS Collected Papers: Research in Speaking and Writing Assessment, Linda Taylor & Peter Falvey (eds), 316–378. Cambridge: CUP.
Kopaczyk, Joanna. 2012. Applications of the lexical bundles method in historical corpus research. In Corpus Data across Languages and Disciplines, Piotr Pezik (ed.), 83–95. Frankfurt: Peter Lang.
Keller, Frank. 2003. A probabilistic parser as a model of global processing difficulty. In Proceedings of the 25th Annual Conference of the Cognitive Science Society, Richard Alterman & David Kirsh (eds), 646–651. Boston MA: Cognitive Science Society.
Keller, Frank. 2010. Cognitively plausible models of human language processing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: Short Papers, 11–16 July, 60–67. Uppsala: Uppsala University.
Lee, David Y. W. 2001. Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle. Language Learning and Technology 5(3): 37–72.
Leech, Geoffrey. 2000. Grammars of spoken English: New outcomes of corpus-oriented research. Language Learning 50(4): 675–724.
Lehmann, Hans Martin & Schneider, Gerold. 2011. A large-scale investigation of verb-attached prepositional phrases. In Studies in Variation, Contacts and Change in English, Vol. 6: Methodological and Historical Dimensions of Corpus Linguistics, Sebastian Hoffmann, Paul Rayson & Geoffrey Leech (eds). Helsinki: Varieng.
Levy, Roger & Jaeger, T. Florian. 2007. Speakers optimize information density through syntactic reduction. In Advances in Neural Information Processing Systems (NIPS) 19, Bernhard Schölkopf, John Platt & Thomas Hofmann (eds), 849–856. Cambridge MA: The MIT Press.
Jaeger, T. Florian. 2010. Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology 61(1): 23–62.
Lorenz, Gunter R. 1999. Adjective Intensification – Learners Versus Native Speakers. A Corpus Study of Argumentative Writing. Amsterdam: Rodopi.
Malvern, David D., Richards, Brian J., Chipere, Ngoni & Durán, Pilar. 2004. Lexical Diversity and Language Development. Houndmills: Palgrave MacMillan.
Marcus, Mitch, Santorini, Beatrice & Marcinkiewicz, Mary Ann. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19: 313–330.
McEnery, Tony, Xiao, Richard & Tono, Yukio. 2006. Corpus-based Language Studies: An Advanced Resource Book [Routledge Applied Linguistics Series]. London: Routledge.
Millar, Neil. 2011. The processing of malformed learner collocations. Applied Linguistics 32(2): 129–148.
Nesselhauf, Nadja. 2003. The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics 24(2): 223–242.
Ng, Hwee Tou, Wu, Siew Mei, Briscoe, Ted, Hadiwinoto, Christian, Hendy Susanto, Raymond & Bryant, Christopher (eds). 2014. Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task. Baltimore MD: Association for Computational Linguistics.
NICT. 2012. Japanese Learner English Corpus (JLE, Version 4.1, 2012). <[URL]>
Ohlrogge, Aaron. 2009. Formulaic expressions in intermediate EFL writing assessment. In Formulaic Language, Vol. II: Acquisition, Loss, Psychological Reality, and Functional Explanations [Typological Studies in Language 83], Roberta Corrigan, Edith A. Moravcsik, Hamid Ouali & Kathleen M. Wheatley (eds), 375–386. Amsterdam: John Benjamins.
Pawley, Andrew & Hodgetts Syder, Frances. 1983. Two puzzles for linguistic theory: Native-like selection and native-like fluency. In Language and Communication, Jack C. Richards & Richard W. Schmidt (eds), 191–226. London: Longman.
Pecina, Pavel. 2009. Lexical Association Measures: Collocation Extraction [Studies in Computational and Theoretical Linguistics 4]. Prague: Institute of Formal and Applied Linguistics, Charles University in Prague.
Read, John & Nation, Paul. 2006. An investigation of the lexical dimension of the IELTS speaking test. In IELTS Research Reports, Vol. 6, Petronella McGovern & Steve Walsh (eds). IELTS Australia and British Council. <[URL]>
Ronan, Patricia & Schneider, Gerold. 2015. Determining light verb constructions in contemporary British and Irish English. International Journal of Corpus Linguistics 20(3): 326–354.
Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of International Conference on New Methods in Language Processing. Manchester.
Schneider, Gerold. 2008. Hybrid Long-Distance Functional Dependency Parsing. PhD dissertation, University of Zurich.
Shannon, Claude E. 1951. Prediction and entropy of printed English. The Bell System Technical Journal 30: 50–64.
Sinclair, John McH. & Mauranen, Anna. 2006. Linear Unit Grammar: Integrating Speech and Writing [Studies in Corpus Linguistics 25]. Amsterdam: John Benjamins.
Siyanova-Chanturia, Anna & Martinez, Ron. 2014. The Idiom Principle revisited. Applied Linguistics 36(5): 549–569.
Cited by (2)
Drury, Brett & Samuel Morais Drury
Schneider, Gerold
2022. Syntactic changes in verbal clauses and noun phrases from 1500 onwards. In English Historical Linguistics [Current Issues in Linguistic Theory, 358], ► pp. 163 ff.
This list is based on CrossRef data as of 1 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.
