How direct is the link between words and images?

Shahmohammadi, Hassan; Heitmeier, Maria; Shafaei-Bajestan, Elnaz; Lensch, Hendrik P. A.; Baayen, R. Harald

doi:10.1075/ml.22010.sha

Article published In: The Mental Lexicon
Vol. 18:3 (2023) ► pp.472–511

Hassan Shahmohammadi | University of Tübingen

Maria Heitmeier | University of Tübingen

Elnaz Shafaei-Bajestan | University of Tübingen

Hendrik P. A. Lensch | University of Tübingen

R. Harald Baayen | University of Tübingen

Available under the Creative Commons Attribution (CC BY) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Published online: 11 January 2024

https://doi.org/10.1075/ml.22010.sha

Abstract

Günther et al. (2022) investigated the relationship between words and images and concluded that a direct link between words and embodied experience is possible. In their study, participants were presented with a target noun and a pair of images, one chosen by their model and another chosen randomly. Participants were asked to select the image that best matched the target noun. Building upon their work, we addressed the following questions. 1. Apart from utilizing visually embodied simulation, what other strategies might subjects have used? How much does this setup rely on visual information? Can it be solved using textual representations? 2. Do current visually grounded embeddings explain subjects’ selection behavior better than textual embeddings? 3. Does visual grounding improve the representations of both concrete and abstract words? To this end, we designed novel experiments based on pre-trained word embeddings. Our experiments reveal that subjects’ selection behavior is explained to a large extent by text-based embeddings and word-based similarities. Visually grounded embeddings offered modest advantages over textual embeddings in certain cases. These findings indicate that the experiment by Günther et al. (2022) may not be well suited for tapping into the perceptual experience of participants, and the extent to which it measures visually grounded knowledge is unclear.

Keywords: visual grounding, word embeddings, grounded cognition

Article outline

1. Introduction
2. Methodology
- 2.1 Materials from Günther et al. (2022)
- 2.2 Model from Shahmohammadi et al. (2023)
- 2.3 Procedure
3. Results
- 3.1 Q1: Can we model participant behaviour without assuming participants generate mental images?
  - 3.1.1 Max models
  - 3.1.2 GAM models
- 3.2 Q2: Is participants’ behaviour best accounted for by purely textual or multimodal word embeddings?
- 3.3 Q3: Does the indirect grounding of abstract words afford a better understanding of the experimental results reported by GPVM?
4. Discussion and conclusion
Acknowledgements
Notes
References

References (87)

Abdou, M., Kulmizev, A., Hershcovich, D., Frank, S., Pavlick, E., and Søgaard, A. (2021). Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 109–132, Stroudsburg, PA, USA. Association for Computational Linguistics.

Anderson, A. J., Bruni, E., Lopopolo, A., Poesio, M., and Baroni, M. (2015). Reading visually embodied meaning from the brain: Visually grounded computational models decode visual-object mental imagery induced by written text. NeuroImage, 120:309–322.

Anschütz, M., Lozano, D. M., and Groh, G. (2023). This is not correct! negation-aware evaluation of language generation systems.

Baroni, M. (2016). Grounding distributional semantics in the visual world. Language and Linguistics Compass, 10(1):3–13.

Barsalou, L. W. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22(4).

Barsalou, L. W. (2003). Abstraction in perceptual symbol systems. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 358(1435).

Barsalou, L. W. (2008). Grounded Cognition. Annual Review of Psychology, 59(1).

Barsalou, L. W. (2010). Grounded cognition: Past, present, and future. Topics in cognitive science, 2(4):716–724.

Barsalou, L. W., Santos, A., Simmons, W. K., and Wilson, C. D. (2008). Language and simulation in conceptual processing. In Symbols and Embodiment: Debates on meaning and cognition. Oxford University Press.

Bordes, P., Zablocki, E., Soulier, L., Piwowarski, B., and Gallinari, P. (2019). Incorporating visual semantics into sentence representations within a grounded space. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 696–707, Hong Kong, China. Association for Computational Linguistics.

Bruni, E., Tran, N.-K., and Baroni, M. (2014). Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.

Brysbaert, M., Warriner, A. B., and Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3):904–911.

Buchanan, E. M., Valentine, K. D., and Maxwell, N. P. (2019). English semantic feature production norms: An extended database of 4436 concepts. Behavior Research Methods, 51(4).

Bulat, L., Clark, S., and Shutova, E. (2017). Speaking, Seeing, Understanding: Correlating semantic models with conceptual representation in the brain. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA. Association for Computational Linguistics.

Castelhano, M. S. and Rayner, K. (2008). Eye movements during reading, visual search, and scene perception: An overview. Cognitive and cultural influences on eye movements, 2175:3–33.

Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Return of the Devil in the Details: Delving Deep into Convolutional Nets. arXiv preprint arXiv:1405.3531.

Chrupała, G., Kádár, Á., and Alishahi, A. (2015). Learning language through pictures. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 112–118, Beijing, China. Association for Computational Linguistics.

Collell Talleda, G., Zhang, T., and Moens, M.-F. (2017). Imagined visual representations as multimodal embeddings. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), pages 4378–4384. AAAI.

Cree, G. S. and McRae, K. (2003). Analyzing the factors underlying the structure and computation of the meaning of chipmunk, cherry, chisel, cheese, and cello (and many other such concrete nouns). Journal of Experimental Psychology: General, 132(2).

Cronin, D. A., Hall, E. H., Goold, J. E., Hayes, T. R., and Henderson, J. M. (2020). Eye movements in real-world scene photographs: General characteristics and effects of viewing task. Frontiers in Psychology, 10:2915.

De Deyne, S., Navarro, D. J., Collell, G., and Perfors, A. (2021). Visual and Affective Multimodal Models of Word Meaning in Language and Mind. Cognitive Science, 45(1).

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE.

Dolan, R. J. (2002). Emotion, cognition, and behavior. Science, 298(5596):1191–1194.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and Ruppin, E. (2001). Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414.

Gerz, D., Vulić, I., Hill, F., Reichart, R., and Korhonen, A. (2016). SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2173–2182, Austin, Texas. Association for Computational Linguistics.

Goldstone, R. L. (1995). Effects of Categorization on Color Perception. Psychological Science, 6(5).

Grondin, R., Lupker, S. J., and McRae, K. (2009). Shared features dominate semantic richness effects for concrete concepts. Journal of Memory and Language, 60(1):1–19.

Günther, F., Petilli, M. A., Vergallito, A., and Marelli, M. (2022). Images of the unseen: extrapolating visual representations for abstract and concrete words in a data-driven computational model. Psychological Research.

Günther, F., Rinaldi, L., and Marelli, M. (2019). Vector-Space Models of Semantic Representation From a Cognitive Perspective: A Discussion of Common Misconceptions. Perspectives on Psychological Science, 14(6):1006–1033.

Halawi, G., Dror, G., Gabrilovich, E., and Koren, Y. (2012). Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1406–1414.

Harnad, S. (1990). The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1–3):335–346.

Harris, Z. S. (1954). Distributional Structure. WORD, 10(2–3).

Hasegawa, M., Kobayashi, T., and Hayashi, Y. (2017). Incorporating visual features into word embeddings: A bimodal autoencoder-based approach. In IWCS 2017 – 12th International Conference on Computational Semantics – Short papers.

Hill, F., Reichart, R., and Korhonen, A. (2015). Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8):1735–1780.

Hoffman, D. (2019). The case against reality: Why evolution hid the truth from our eyes. WW Norton & Company.

Hollenstein, N., de la Torre, A., Langer, N., and Zhang, C. (2019). CogniVal: A Framework for Cognitive Word Embedding Evaluation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Stroudsburg, PA, USA. Association for Computational Linguistics.

Howell, S. R., Jankowicz, D., and Becker, S. (2005). A model of grounded language acquisition: Sensorimotor features improve lexical and grammatical learning. Journal of Memory and Language, 53(2):258–276.

Kant, I., Guyer, P., and Wood, A. W. (1781/1999). Critique of pure reason. Cambridge University Press.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Kiela, D. and Bottou, L. (2014). Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 36–45, Doha, Qatar. Association for Computational Linguistics.

Kiela, D., Bulat, L., and Clark, S. (2015). Grounding semantics in olfactory perception. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 231–236.

Kiela, D. and Clark, S. (2015). Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA. Association for Computational Linguistics.

Kiela, D., Conneau, A., Jabri, A., and Nickel, M. (2018). Learning visually grounded sentence representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 408–418, New Orleans, Louisiana. Association for Computational Linguistics.

Kiros, J., Chan, W., and Hinton, G. (2018). Illustrative language understanding: Large-scale visual grounding with image search. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 922–933, Melbourne, Australia. Association for Computational Linguistics.

Lakoff, G. (1987). Women, Fire, and Dangerous Things. University of Chicago Press.

Lakoff, G. and Johnson, M. (1980). The metaphorical structure of the human conceptual system. Cognitive science, 4(2):195–208.

Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2).

Langacker, R. W. (1999). A view from cognitive linguistics. Behavioral and Brain Sciences, 22(4).

Lazaridou, A., Chrupała, G., Fernández, R., and Baroni, M. (2016). Multimodal Semantic Learning from Child-Directed Input. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Stroudsburg, PA, USA. Association for Computational Linguistics.

Lazaridou, A., Marelli, M., and Baroni, M. (2017). Multimodal Word Meaning Induction From Minimal Exposure to Natural Text. Cognitive Science, 41.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer.

Louwerse, M. and Connell, L. (2011). A Taste of Words: Linguistic Context and Perceptual Simulation Predict the Modality of Words. Cognitive Science, 35(2):381–398.

Louwerse, M. M. (2011). Symbol interdependency in symbolic and embodied cognition. Topics in Cognitive Science, 3(2):273–302.

Louwerse, M. M. and Zwaan, R. A. (2009). Language Encodes Geographical Information. Cognitive Science, 33(1):51–73.

Luong, T., Socher, R., and Manning, C. (2013). Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 104–113, Sofia, Bulgaria. Association for Computational Linguistics.

Lynott, D., Connell, L., Brysbaert, M., Brand, J., and Carney, J. (2020). The Lancaster Sensorimotor Norms: multidimensional measures of perceptual and action strength for 40,000 English words. Behavior Research Methods, 52(3).

Mandera, P., Keuleers, E., and Brysbaert, M. (2017). Explaining human performance in psycholinguistic tasks with models of semantic similarity based on prediction and counting: A review and empirical validation. Journal of Memory and Language, 92.

Marelli, M. and Amenta, S. (2018). A database of orthography-semantics consistency (OSC) estimates for 15,017 English words. Behavior Research Methods, 50:1482–1495.

Martin, A. (2007). The Representation of Object Concepts in the Brain. Annual Review of Psychology, 58(1):25–45.

McRae, K., Cree, G. S., Seidenberg, M. S., and Mcnorgan, C. (2005). Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods, 37(4).

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mkrtychian, N., Blagovechtchenski, E., Kurmakaeva, D., Gnedykh, D., Kostromina, S., and Shtyrov, Y. (2019). Concrete vs. Abstract Semantics: From Mental Representations to Functional Brain Mapping. Frontiers in Human Neuroscience, 13(August):267.

Montefinese, M. (2019). Semantic representation of abstract and concrete words: A minireview of neural evidence. Journal of Neurophysiology, 121(5):1585–1587.

Park, J. and Myaeng, S.-h. (2017a). A computational study on word meanings and their distributed representations via polymodal embedding. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 214–223, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Park, J. and Myaeng, S.-h. (2017b). A computational study on word meanings and their distributed representations via polymodal embedding. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 214–223.

Pennington, J., Socher, R., and Manning, C. (2014). Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA, USA. Association for Computational Linguistics.

Pezzelle, S., Takmaz, E., and Fernández, R. (2021). Word representation learning in multimodal pre-trained transformers: An intrinsic evaluation. Transactions of the Association for Computational Linguistics, 9:1563–1579.

Rotaru, A. S. and Vigliocco, G. (2020a). Constructing semantic models from words, images, and emojis. Cognitive science, 44(4):e12830.

Rotaru, A. S. and Vigliocco, G. (2020b). Constructing Semantic Models From Words, Images, and Emojis. Cognitive Science, 44(4):e12830.

Rozenkrants, B., Olofsson, J. K., and Polich, J. (2008). Affective visual event-related potentials: arousal, valence, and repetition effects for normal and distorted pictures. International Journal of Psychophysiology, 67(2):114–123.

Shahmohammadi, H., Heitmeier, M., Shafaei-Bajestan, E., Lensch, H., and Baayen, H. (2023). Language with vision: a study on grounded word and sentence embeddings. Behavior Research Methods, accepted for publication.

Shahmohammadi, H., Lensch, H. P. A., and Baayen, R. H. (2021). Learning zero-shot multifaceted visually grounded word embeddings via multi-task training. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 158–170, Online. Association for Computational Linguistics.

Silberer, C. and Lapata, M. (2014). Learning grounded meaning representations with autoencoders. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 721–732, Baltimore, Maryland. Association for Computational Linguistics.

Simmons, W. K., Martin, A., and Barsalou, L. W. (2005). Pictures of Appetizing Foods Activate Gustatory Cortices for Taste and Reward. Cerebral Cortex, 15(10):1602–1608.

Solomon, K. O. and Barsalou, L. W. (2001). Representing Properties Locally. Cognitive Psychology, 43(2):129–169.

Solomon, K. O. and Barsalou, L. W. (2004). Perceptual simulation in property verification. Memory & Cognition, 32(2):244–259.

Tan, H. and Bansal, M. (2020). Vokenization: Improving language understanding with contextualized, visual-grounded supervision. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2066–2080, Online. Association for Computational Linguistics.

Tan, M. and Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR.

Utsumi, A. (2022). A test of indirect grounding of abstract concepts using multimodal distributional semantics. Frontiers in Psychology, 13.

Vigliocco, G., Ponari, M., and Norbury, C. (2018). Learning and processing abstract words and concepts: Insights from typical and atypical development. Topics in cognitive science, 10(3):533–549.

Wang, B., Wang, A., Chen, F., Wang, Y., and Kuo, C.-C. J. (2019). Evaluating word embedding models: Methods and experimental results. APSIPA Transactions on Signal and Information Processing, 8.

Westbury, C. (2014). You Can’t Drink a Word: Lexical and Individual Emotionality Affect Subjective Familiarity Judgments. Journal of Psycholinguistic Research, 43(5).

Westbury, C. and Hollis, G. (2019). Wriggly, squiffy, lummox, and boobs: What makes some words funny? Journal of Experimental Psychology: General, 148(1).

Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B), 73(1):3–36.

Yun, T., Sun, C., and Pavlick, E. (2021). Does vision-and-language pretraining improve lexical grounding? In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4357–4366, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Zwaan, R. A. and Madden, C. J. (2005). Embodied Sentence Comprehension. In Grounding Cognition. Cambridge University Press.