Article published in: The Wealth and Breadth of Construction-Based Research
Edited by Timothy Colleman, Frank Brisard, Astrid De Wit, Renata Enghels, Nikos Koutsoukos, Tanja Mortelmans and María Sol Sansiñena
[Belgian Journal of Linguistics 34] 2020
pp. 66–78
Let’s get into it
Using contextualized embeddings as retrieval tools
Published online: 28 May 2021
https://doi.org/10.1075/bjl.00035.fon
Abstract
This squib briefly explores how contextualized embeddings, a type of compressed token-based semantic vector,
can be used as semantic retrieval and annotation tools for corpus-based research into constructions. Focusing on embeddings created by the
Bidirectional Encoder Representations from Transformers model, better known as ‘BERT’, the squib demonstrates how contextualized embeddings
can help counter two retrieval inefficiency scenarios that may arise with purely form-based corpus queries. In the first scenario,
the form-based query yields a large number of hits, among which a reasonable number of relevant examples can be labeled and used as training
input for a sense disambiguation classifier. In the second scenario, the contextualized embeddings of exemplary tokens are used to retrieve
further relevant examples from a large, unlabeled dataset. As a case study, the squib focuses on the into-interest construction (e.g. I’m so
into you).
Keywords: distributional semantics, BERT, corpus linguistics, data retrieval, prepositions
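
By way of illustration (the article itself contains no code), the sketch below shows one possible implementation of the two scenarios described in the abstract, assuming the Hugging Face transformers library with a bert-base-uncased checkpoint and scikit-learn. The embed_token helper, the example sentences, and the toy sense labels are hypothetical, not taken from the article.

```python
# Minimal sketch of both retrieval scenarios, assuming Hugging Face
# `transformers` and scikit-learn; all sentences and labels are invented.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_token(sentence: str, target: str = "into") -> np.ndarray:
    """Return the contextualized (last-layer) BERT embedding of the
    first occurrence of `target` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(target)].numpy()

# Scenario 1: train a sense-disambiguation classifier on labeled hits
# (1 = into-interest, 0 = other uses of "into"; toy labels).
labeled = [("I'm so into you.", 1), ("He's really into jazz.", 1),
           ("She walked into the room.", 0), ("It turned into a mess.", 0)]
X = np.stack([embed_token(s) for s, _ in labeled])
y = [lab for _, lab in labeled]
clf = LogisticRegression().fit(X, y)

# Scenario 2: rank unlabeled hits by cosine similarity to an exemplar token.
exemplar = embed_token("I'm so into you.")
unlabeled = ["They are very into each other.", "The car crashed into a tree."]
vecs = np.stack([embed_token(s) for s in unlabeled])
sims = vecs @ exemplar / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(exemplar))
for sent, sim in sorted(zip(unlabeled, sims), key=lambda p: -p[1]):
    print(f"{sim:.3f}  {sent}")
```

In this kind of setup, the classifier from the first scenario can be applied to the remaining unlabeled hits, while the exemplar-based ranking of the second scenario can be scaled to large corpora with an approximate nearest-neighbor index rather than the brute-force cosine comparison shown here.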
Article outline
- 1. Introduction
- 2. Vector-based distributional semantic models
- 3. The challenge: Finding into-interest
- 4. A solution: BERT as a disambiguation tool
- 5. BERT as an exemplar-based retrieval tool
- 6. Conclusion
- Notes
- References
