Learning visually grounded words and syntax of natural spoken language

Roy, Deb

doi:10.1075/eoc.4.1.04roy

Article published in: The Evolution of Grounded Communication
Edited by Luc Steels
[Evolution of Communication 4:1] 2001
pp. 33–56

Published online: 29 April 2002

https://doi.org/10.1075/eoc.4.1.04roy

Properties of the physical world have shaped human evolutionary design and given rise to physically grounded mental representations. These grounded representations provide the foundation for higher-level cognitive processes, including language. Most natural language processing machines to date lack grounding. This paper advocates the creation of physically grounded language learning machines as a path toward scalable systems that can conceptualize and communicate about the world in human-like ways. As steps in this direction, two experimental language acquisition systems are presented.

The first system, CELL, is able to learn acoustic word forms and associated shape and color categories from fluent untranscribed speech paired with video camera images. In evaluations, CELL has successfully learned from spontaneous infant-directed speech. A version of CELL has been implemented in a robotic embodiment which can verbally interact with human partners.
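As a rough illustration of learning word-to-category associations from paired speech and vision, the sketch below scores each word against each visual category by how informative their co-occurrence is. It is a toy under stated assumptions, not CELL's algorithm: text tokens stand in for acoustic word forms, the labels `RED`, `BLUE`, `ROUND`, and `CYLINDER` stand in for learned shape and color categories, and the names `association` and `learn_lexicon` are hypothetical.

```python
import math
from collections import Counter

def association(pairs, word, category):
    """Mutual information (bits) between 'word is spoken' and 'category is seen'.

    Returns 0.0 unless the word and category co-occur more often than chance,
    so that only positive associations are scored.
    """
    n = len(pairs)
    joint = Counter((word in words, category in cats) for words, cats in pairs)
    p_w = (joint[(True, True)] + joint[(True, False)]) / n
    p_c = (joint[(True, True)] + joint[(False, True)]) / n
    if joint[(True, True)] / n <= p_w * p_c:
        return 0.0
    mi = 0.0
    for (w, c), count in joint.items():
        p_wc = count / n
        marg_w = p_w if w else 1 - p_w
        marg_c = p_c if c else 1 - p_c
        mi += p_wc * math.log2(p_wc / (marg_w * marg_c))
    return mi

def learn_lexicon(pairs):
    """For each visual category, keep the word with the highest association."""
    vocabulary = sorted(set().union(*(words for words, _ in pairs)))
    categories = sorted(set().union(*(cats for _, cats in pairs)))
    return {
        cat: max(vocabulary, key=lambda w: association(pairs, w, cat))
        for cat in categories
    }

# Hypothetical toy data: (words heard, categories seen) for four scenes.
observations = [
    ({"look", "the", "red", "ball"}, {"RED", "ROUND"}),
    ({"look", "a", "red", "cup"}, {"RED", "CYLINDER"}),
    ({"look", "a", "blue", "ball"}, {"BLUE", "ROUND"}),
    ({"look", "the", "blue", "cup"}, {"BLUE", "CYLINDER"}),
]

lexicon = learn_lexicon(observations)
```

In this toy data the function words ("look", "the", "a") are spread evenly across scenes, so they score zero against every category, while each content word lines up with exactly one category.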

The second system, DESCRIBER, acquires a visually grounded model of natural language which it uses to generate spoken descriptions of objects in visual scenes. Input to DESCRIBER’s learning algorithm consists of computer-generated scenes paired with natural language descriptions produced by a human teacher. DESCRIBER learns a three-level language model which encodes syntactic and semantic properties of phrases, word classes, and words. The system learns from a simple ‘show-and-tell’ procedure, and once trained, is able to generate semantically appropriate, contextualized, and syntactically well-formed descriptions of objects in novel scenes.
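The three levels named above (phrases, word classes, words) can be pictured with a small hand-written generator. This is only a sketch under stated assumptions: DESCRIBER learns its model from examples, whereas here the tables `LEXICON` and `PHRASES`, the feature names, and the function `describe` are all hypothetical and filled in by hand. "Contextualized" is approximated by choosing the shortest phrase that separates the target from the other objects in the scene.

```python
# Word level: each word is grounded in one visual feature value.
# Word-class level: the outer keys group words by the feature they describe.
LEXICON = {
    "color": {"RED": "red", "GREEN": "green"},
    "shape": {"SQUARE": "square", "CIRCLE": "circle"},
}

# Phrase level: templates are ordered sequences of word classes,
# listed from shortest to longest.
PHRASES = [("shape",), ("color", "shape")]

def describe(target, scene):
    """Return the shortest templated phrase that singles out `target` in `scene`."""
    others = [obj for obj in scene if obj is not target]
    for template in PHRASES:
        # A template is ambiguous if some other object shares every feature it names.
        ambiguous = any(
            all(other[f] == target[f] for f in template) for other in others
        )
        if not ambiguous:
            break
    # If every template is ambiguous, the longest one is used as a fallback.
    return " ".join(["the"] + [LEXICON[f][target[f]] for f in template])

# Hypothetical scene with three objects.
red_square = {"color": "RED", "shape": "SQUARE"}
green_square = {"color": "GREEN", "shape": "SQUARE"}
red_circle = {"color": "RED", "shape": "CIRCLE"}
scene = [red_square, green_square, red_circle]

print(describe(red_circle, scene))   # shape alone is enough
print(describe(red_square, scene))   # color is needed to rule out the green square
```

The template order does the work here: a longer phrase is produced only when a shorter one would leave the listener unable to pick out the object.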
