Article published in: Revista Española de Lingüística Aplicada/Spanish Journal of Applied Linguistics
Vol. 34:2 (2021), pp. 435–463
Automatic lexical collocate extraction for corpus-based ontology building and refinement
A FunGramKB case study of the THEFT conceptual scenario
Published online: 15 December 2021
https://doi.org/10.1075/resla.19030.fer
Abstract
Traditional corpus-based methods rely on the manual inspection and extraction of lexical collocates in the study of selection preferences, a costly, labor-intensive, and time-consuming task. Automatic methods for lexical collocate extraction are therefore needed to handle this task and the sheer size of the corpora available. To leverage the Sketch Engine platform and its built-in corpora, we propose a working prototype of a command-line tool, the Lexical Collocate Extractor (LeCoExt), which mines lexical collocates of all types of verbs according to their syntactic constituents and Collocate Frequency Score (CFS). This might be the first tool that performs comprehensive corpus-based studies of the selection preferences of individual verbs or groups of verbs by exploiting the capabilities offered by Sketch Engine. It might facilitate the extraction of rich lexico-semantic knowledge from diverse corpora in a few seconds and with a single click. We test its performance for ontology building and refinement, taking as a starting point the detailed analysis of stealing verbs carried out by Fernández-Martínez and Faber (2020). We show how the proposed tool is used to extract conceptual-cognitive knowledge from the THEFT scenario and implement it in the FunGramKB Core Ontology through the creation and modification of theft-related conceptual units.
Spanish abstract (Resumen)
Automatic extraction of lexical collocations for corpus-based ontology building and improvement: a case study of the THEFT conceptual scenario in FunGramKB
Traditional corpus-based methods rely on the manual inspection and extraction of lexical collocations to study selection preferences, which is a very costly task that demands a great deal of work and time. Designing automatic methods for extracting lexical collocations becomes necessary to address this task and the immensity of the available corpora. In order to take advantage of the Sketch Engine platform and its corpora, we propose a working prototype of a command-line tool for extracting lexical collocations, the Lexical Collocate Extractor (LeCoExt), which extracts lexical collocations of all types of verbs according to their syntactic constituents and their Collocate Frequency Score (CFS). This could be the first tool for carrying out exhaustive corpus-based studies of the selection preferences of individual verbs or groups of verbs by exploiting the capabilities offered by Sketch Engine. The tool could make it easier to extract rich lexico-semantic knowledge from diverse corpora in a few seconds and with a single click. We test its performance for ontology building and improvement, starting from a previous detailed analysis of stealing verbs in Fernández-Martínez and Faber (2020). We show how the proposed tool is used to extract conceptual-cognitive knowledge from the THEFT scenario and implement it in the FunGramKB Core Ontology through the creation and modification of conceptual units related to theft.
Article outline
- 1. Introduction
- 2. Theoretical background
- 2.1 Stealing verbs: A lexico-semantic perspective
- 2.2 FunGramKB: Definition, scope and architecture
- 2.3 The FunGramKB ontology
- 3. Methodology
- 3.1 Presenting the lexical collocate extractor (LeCoExt) tool
- 3.2 FunGramKB conceptual categorization and specifications
- 4. Results and discussion
- 4.1 Extraction of semantic knowledge
- 4.2 Implementation and refinement of findings into the FunGramKB Core Ontology
- 4.3 Limitations in this contribution
- 5. Conclusion
- Notes
References (44)
Asaro, C., Biasiotti, M. A., Guidotti, P., Papini, M., Sagri, M. T., & Tiscornia, D. (2003). A domain ontology: Italian crime ontology. In Proceedings of the ICAIL 2003 Workshop on Legal Ontologies & Web based legal information management, 1–7.
Berman, R. (1982). On the Nature of ‘Oblique’ Objects in Bitransitive Constructions. Lingua, 56(2), 101–125.
Boas, H. (2013). Frame Semantics and Translation. In A. Rojo & I. Ibarretxe-Antuñano (Eds.), Cognitive Linguistics and Translation (pp. 125–158). Berlin/New York: Mouton de Gruyter.
British National Corpus, version 3 (BNC XML Edition). (2007). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. Available at [URL] [last accessed 15 May 2019]
Bušta, J., & Herman, O. (2017). JSI Newsfeed Corpus. In The 9th International Corpus Linguistics Conference, University of Birmingham, 25–28 July 2017.
Dux, R. (2018). Frames, Verbs, and Constructions: German Constructions with Verbs of Stealing. In A. Ziem & H. Boas (Eds.), Approaching German Syntax from a Constructionist Perspective (pp. 367–405). Berlin/New York: Mouton de Gruyter.
Faber, P., & Mairal-Usón, R. (1999). Constructing a Lexicon of English Verbs. Berlin: Mouton de Gruyter.
(2018). A Conceptually-Oriented Approach to Semantic Composition in RRG. In R. D. Van Valin (Ed.), The Cambridge Handbook of Role and Reference Grammar. Cambridge: Cambridge University Press.
Felices-Lago, Á. (2014). The emergence of axiology as a key parameter in modern linguistics. In G. Thompson & L. Alba-Juez (Eds.), Evaluation in Context (pp. 27–46). John Benjamins.
(2015). Foundational considerations for the development of the Globalcrimeterm subontology: A research project based on FunGramKB. Onomázein, 31(1), 127–144.
(2016). The Process of Constructing Ontological Meaning Based on Criminal Law Verbs. Círculo de Lingüística Aplicada a la Comunicación, 65(1), 109–148.
Fernández-Martínez, N. J., & Faber, P. (2020). Who stole what from whom? A corpus-based, cross-linguistic study of English and Spanish verbs of stealing. Languages in Contrast, 20(1), 107–140.
Fillmore, C., & Baker, C. (2010). A Frames Approach to Semantic Analysis. In B. Heine & H. Narrog (Eds.), The Oxford Handbook of Linguistic Analysis (pp. 313–340). New York: Oxford University Press.
Gangemi, A., Sagri, M., & Tiscornia, D. (2005). A Constructive Framework for Legal Ontologies. In V. R. Benjamins et al. (Eds.), Law and the Semantic Web (pp. 97–124). Berlin: Springer.
Goldberg, A. (2010). Verbs, Constructions and Semantic Frames. In M. Rappaport-Hovav, E. Doron & I. Sichel (Eds.), Syntax, Lexical Semantics and Event Structure (pp. 39–58). Oxford: Oxford University Press.
Jakubíček, M., Kilgarriff, A., McCarthy, D., & Rychlý, P. (2010). Fast Syntactic Searching in Very Large Corpora for Many Languages. PACLIC, 741–747.
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013). The TenTen Corpus Family. Seventh International Corpus Linguistics Conference CL, 125–127.
Jiménez-Briones, R., & Luzondo-Oyón, A. (2011). Building Ontological Meaning in a Lexico-conceptual Knowledge Base. Onomázein, 23(1), 11–40.
Kilgarriff, A., Kovář, V., Krek, S., Srdanović, I., & Tiberius, C. (2010). A Quantitative Evaluation of Word Sketches. Proceedings of the 14th EURALEX International Congress, 372–379.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., & Suchomel, V. (2014). The Sketch Engine: Ten Years on. Lexicography, 1(1), 7–36. Available at [URL] [last accessed 28 December 2018]
Leary, R., Vandenberghe, W., & Zeleznikow, J. (2004). Towards a financial fraud ontology: a legal modelling approach, ICAIL 2003 Workshop on Legal Ontologies & Web based legal information management, 1–33.
Lenci, A. et al. (2000). SIMPLE: A general framework for the development of multilingual lexicon. International Journal of Lexicography, 13(4), 249–263.
Masolo, C. et al. (2003). WonderWeb Deliverable D18: Ontology Library. Laboratory for Applied Ontology, ISTC-CNR.
McCarthy, D., Kilgarriff, A., Jakubíček, M., & Reddy, S. (2015). Semantic Word Sketches. Corpus Linguistics (CL2015), 1–5.
Miller, G., & Fellbaum, C. (2007). WordNet Then and Now. Language Resources and Evaluation, 41(2), 209–214. Available at [URL] [last accessed 17 May 2019]
Niles, I., & Pease, A. (2001). Towards a standard Upper Ontology. In Proceedings of the Second International Conference on Formal Ontology in Information Systems. Ogunquit. Available at [URL] [last accessed 10 January 2019]
Pedersen, B. S., & Keson, B. (1999). SIMPLE–Semantic information for multifunctional plurilingual lexica: some examples of Danish concrete nouns. Proceedings of the SIGLEX-99 Workshop. Maryland. Available at [URL] [last accessed 15 January 2019]
Periñán-Pascual, C. (2012). En defensa del procesamiento del lenguaje natural fundamentado en la lingüística teórica. Onomázein, 26(1), 13–48.
(2013). A knowledge-engineering approach to the cognitive categorization of lexical meaning. VIAL – Vigo International Journal of Applied Linguistics, 10(1), 85–104.
Periñán-Pascual, C., & Arcas-Túnez, F. (2004). Meaning postulates in a lexico-conceptual knowledge base. 15th International Workshop on Databases and Expert Systems Applications, IEEE, Los Alamitos (California), 38–42.
(2005). Microconceptual-Knowledge Spreading in FunGramKB. Proceedings of the 9th IASTED International Conference on Artificial Intelligence and Soft Computing. Anaheim-Calgary-Zurich: ACTA Press, 239–244.
(2010a). The architecture of FunGramKB. Proceedings of the 7th International Conference on Language Resources and Evaluation, European Language Resources Association (ELRA), 2667–2674.
Periñán-Pascual, C., & Mairal-Usón, R. (2009). Bringing Role and Reference Grammar to Natural Language Understanding. Procesamiento del Lenguaje Natural, 43(1), 265–273.
Ruiz-de-Mendoza Ibáñez, F., & Mairal-Usón, R. (2009). Constructing meaning: a brief overview of the Lexical Constructional Model. In Mario Brdar (Ed.), Converging and diverging tendencies in Cognitive Linguistics. Amsterdam/Philadelphia: John Benjamins.
Ruppenhofer, J., Boas, H., & Baker, C. (2017). FrameNet. In P. Fuertes-Olivera (Ed.), The Routledge Handbook of Lexicography (pp. 383–398). New York: Routledge.
Rychlý, P. (2008). A Lexicographer-Friendly Association Score. Proceedings of the 2nd Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN, 21, 6–9.
Sartor, G., Casanovas, P., Biasiotti, M. A., & Fernández-Barrera, M. (Eds.) (2011). Approaches to Legal Ontologies: Theories, Domains, Methodologies. Berlin: Springer.
Thorgren, S. (2005). Transaction Verbs: A Lexical and Semantic Analysis of Rob and Steal. Reports from the Department of Language and Culture, 31, 1–44.
Valente, A. (2005). Types and roles of legal ontologies. In R. Benjamins, P. Casanovas, J. Breuker & A. Gangemi (Eds.), Law and the Semantic Web (pp. 65–76). Berlin: Springer.
