Article published in: Lingvisticæ Investigationes
Vol. 41:2 (2018), pp. 240–268
A corpus-based study of the automatic extraction and validation of V-N Italian oral academic collocations
Published online: 4 February 2019
https://doi.org/10.1075/li.00022.pep
Abstract
This study describes the outcomes of a POS-based method for the automatic extraction of V-N Italian oral academic collocations from an annotated corpus.
A frequency-based statistical measure is applied to automatically extract the collocations from the POS-tagged corpus.
The results reveal that frequency alone is not sufficient to measure the degree of association that connects the two elements of a word pair.
To identify the collocations that are genuinely attested in Italian, the data were further evaluated by 50 native speakers of Italian.
The results indicate that these combinations are tightly linked to their context of usage.
Thus, native speakers should be exposed to these phrasal contexts to activate their mechanisms of explicit reflection and assess the degree of collocativity of these combinations.
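As a rough illustration of the kind of pipeline the abstract describes, the sketch below pulls verb-noun pairs from a POS-tagged toy corpus and ranks them by raw frequency. It is not the study's implementation: the sentences, the tag names (`VERB`, `NOUN`, `DET`), the three-token window, and the helper `extract_vn_pairs` are all assumptions made for the example. Pointwise mutual information (PMI) is added only as a contrast to show why frequency alone can fail to separate pairs; the abstract does not say which association measure, if any, the authors would prefer.

```python
from collections import Counter
from math import log2

# Hypothetical toy corpus: each sentence is a list of (lemma, POS tag) pairs.
# The data and the tagset are assumptions for illustration only.
TAGGED = [
    [("porre", "VERB"), ("una", "DET"), ("domanda", "NOUN")],
    [("porre", "VERB"), ("la", "DET"), ("domanda", "NOUN")],
    [("fare", "VERB"), ("una", "DET"), ("domanda", "NOUN")],
    [("fare", "VERB"), ("un", "DET"), ("esempio", "NOUN")],
    [("fare", "VERB"), ("una", "DET"), ("cosa", "NOUN")],
    [("fare", "VERB"), ("la", "DET"), ("cosa", "NOUN")],
]


def extract_vn_pairs(sentence, window=3):
    """Yield (verb, noun) for each verb and the nearest noun within `window` tokens after it."""
    for i, (lemma, tag) in enumerate(sentence):
        if tag != "VERB":
            continue
        for other, other_tag in sentence[i + 1 : i + 1 + window]:
            if other_tag == "NOUN":
                yield (lemma, other)
                break  # keep only the closest noun


# Raw frequency of each extracted V-N pair.
pair_freq = Counter(p for s in TAGGED for p in extract_vn_pairs(s))

# Marginal counts of verbs and nouns over the extracted pairs.
verb_freq = Counter()
noun_freq = Counter()
for (verb, noun), freq in pair_freq.items():
    verb_freq[verb] += freq
    noun_freq[noun] += freq
total_pairs = sum(pair_freq.values())


def pmi(verb, noun):
    """Pointwise mutual information of a V-N pair, estimated over the extracted pairs."""
    joint = pair_freq[(verb, noun)]
    return log2(joint * total_pairs / (verb_freq[verb] * noun_freq[noun]))


for (verb, noun), freq in pair_freq.most_common():
    print(f"{verb} + {noun}: freq={freq}, pmi={pmi(verb, noun):.3f}")
```

In this toy data, "porre + domanda" and "fare + cosa" have the same frequency, yet PMI rates the first pair as more strongly associated because "fare" combines with many different nouns. With real corpus counts, PMI is known to overrate rare pairs, so it is one possible complement to frequency rather than a replacement for the native-speaker validation described above.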
Article outline
- Introduction
- 1. Towards a definition of “collocation”
- 1.1 Collocations in applied linguistics
- 2. Data and methodology
- 2.1 Collecting data for structuring the ASIC corpus
- 2.2 Extracting and filtering collocations from the ASIC corpus
- 2.3 Validation of the extracted academic Italian collocation list
- 2.3.1 Results of the crowdsourcing experiment
- 2.3.2 Double validation of the data
- Discussion and conclusions
- Acknowledgements
- Notes
- References
