In: Eye-tracking in Interaction: Studies on the role of eye gaze in dialogue
Edited by Geert Brône and Bert Oben
[Advances in Interaction Studies 10] 2018
pp. 139–168
Chapter 7. Gaze and face-to-face interaction
From multimodal data to behavioral models
Published online: 13 November 2018
https://doi.org/10.1075/ais.10.07bai
Abstract
This chapter presents experimental and modeling work on the gaze patterns that interlocutors exchange during situated, task-directed, two-way face-to-face interactions. We show that these gaze patterns (including blinking rate) are significantly influenced by the cognitive states of the interlocutors (speaking, listening, thinking, etc.), their respective roles in the conversation (e.g. instruction giver, respondent), and their social relationship (e.g. colleague, supervisor).
This chapter provides insights into the (micro-)coordination of gaze with other components of attention management, as well as methodologies for capturing and modeling the behavioral regularities observed in experimental data. Particular emphasis is placed on statistical models, which can learn behaviors in a data-driven way.
We introduce several statistical models of multimodal behavior that can be trained on such signals and then generate behaviors given perceptual cues. We compare the performance and properties of models that explicitly represent the temporal structure of the studied signals and relate them to internal cognitive states. In particular, we study Semi-Hidden Markov Models and Dynamic Bayesian Networks and compare them to classifiers without sequential models (Support Vector Machines and Decision Trees).
We will further show that the gaze of conversational agents (virtual talking heads, speaking robots) may have a strong impact on communication efficiency. One of the conclusions we draw from these experiments is that multimodal behavioral models able to generate co-verbal gaze patterns of interactive avatars should be designed with great care in order not to increase the cognitive load of human partners. Experiments involving an impoverished or irrelevant control of the gaze of artificial agents (virtual talking heads and humanoid robots) have demonstrated its negative impact on communication (Garau, Slater, Bee, & Sasse, 2001).
Article outline
- 1. Introduction
- 2. Interactive gaze
- 2.1 Eyes in the visual scene
- 2.2 Conversational gaze
- 2.3 Mutual gaze patterns
- 3. Learning & generating gaze patterns
- 3.1 Grounding gaze patterns
- 3.2 Learning joint behaviors
- 3.3 A sample interactive game
- 3.4 Learning joint behaviors with dynamic Bayesian networks
- 3.5 Adapting joint behaviors
- 3.6 Effective gaze tracking and generation
- 4. Active gaze estimation from images and videos: Gaze patterns and interaction models
- 5. Easing gaze reading
- 5.1 Eye appearance
- 5.2 Estimating gaze direction of avatars
- 6. Future trends
- Acknowledgments
- Note
- References
References (111)
Al Moubayed, S., Edlund, J., & Beskow, J. (2012). Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections. ACM Transactions on Interactive Intelligent Systems, 1(2), article 11 (25 pages).
Al Moubayed, S., Skantze, G., & Beskow, J. (2012). Lip-reading: Furhat audiovisual intelligibility of a back-projected animated face. Intelligent Virtual Agents – Lecture Notes in Computer Science, 7502, 196–203.
Albrecht, I., Haber, J., & Seidel, H.-P. (2002). Automatic Generation of Non-Verbal Facial Expressions from Speech. In J. Vince & R. Earnshaw (Eds.), Advances in Modelling, Animation and Rendering (pp. 283–293). Springer London.
Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38(4), 419–439.
Alnajar, F., Gevers, T., Valenti, R., & Ghebreab, S. (2013). Calibration-free gaze estimation using human gaze patterns (pp. 137–144). Presented at the Computer Vision (ICCV), 2013 IEEE International Conference on, Sydney, Australia: IEEE.
Bailly, G., Elisei, F., Raidt, S., Casari, A., & Picot, A. (2006). Embodied conversational agents: computing and rendering realistic gaze patterns. In Pacific Rim Conference on Multimedia Processing (Vol. LNCS 4261, pp. 9–18). Hangzhou, China.
Bailly, G., Elisei, F., & Sauze, M. (2015). Beaming the gaze of a humanoid robot. In Human-Robot Interaction (HRI) (pp. 47–48). Portland, OR.
Bailly, G., Raidt, S., & Elisei, F. (2010). Gaze, conversational agents and face-to-face communication. Speech Communication – Special Issue on Speech and Face-to-Face Communication, 52(3), 598–612.
Barisic, I., Timmermans, B., Pfeiffer, U., Bente, G., Vogeley, K., & Schilbach, L. (2013). Using dual eyetracking to investigate real-time social interactions. Proceedings from SIGCHI Conference on Human Factors in Computing Systems.
Baron-Cohen, S., Jolliffe, T., Mortimore, C., & Robertson, M. (1997). Another advanced test of theory of mind: evidence from very high functioning adults with autism or Asperger syndrome. Journal of Child Psychology and Psychiatry, 38(7), 813–822.
Bengio, Y., & Frasconi, P. (1996). Input-output HMMs for sequence processing. IEEE Transactions on Neural Networks, 7(5), 1231–1249.
Benoît, C., Grice, M., & Hazan, V. (1996). The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences. Speech Communication, 18, 381–392.
Bindemann, M., Burton, A. M., Hooge, I. C., Jenkins, R., & de Haan, E. F. (2005). Faces retain attention. Psychonomic Bulletin & Review, 12(6), 1048–1053.
Boker, S. M., Cohn, J. F., Theobald, B.-J., Matthews, I., Brick, T. R., & Spies, J. R. (2009). Effects of damping head movement and facial expression in dyadic conversation using real-time facial expression tracking and synthesized avatars. Philosophical Transactions of the Royal Society – Biological Sciences, 364(1535), 3485–3495.
Bolt, R. A. (1980). Put-that-there: Voice and gesture at the graphics interface. ACM SIGGRAPH Computer Graphics, 14, 262–270.
Borji, A., Sihite, D. N., & Itti, L. (2013). Quantitative Analysis of Human-Model Agreement in Visual Saliency Modeling: A Comparative Study. Image Processing, IEEE Transactions on, 22(1), 55–69.
Brône, G., & Oben, B. (2015). InSight Interaction: a multimodal and multifocal dialogue corpus. Language Resources and Evaluation, 49(1), 195–214.
Buchan, J. N., Paré, M., & Munhall, K. G. (2007). Spatial statistics of gaze fixations during dynamic face processing. Social Neuroscience, 2(1), 1–13.
Carletta, J., Hill, R. L., Nicol, C., Taylor, T., de Ruiter, J. P., & Bard, E. G. (2010). Eyetracking for two-person tasks with manipulation of a virtual world. Behavior Research Methods, 42(1), 254–265.
Clark, H. H. (2003). Pointing and placing. In S. Kita (Ed.), Pointing: Where Language, Culture, and Cognition Meet (pp. 243–268). New York: Lawrence Erlbaum Associates Publishers.
Cooper, G. F., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4), 309–347.
Coutrot, A., & Guyader, N. (2014). How saliency, faces, and sound influence gaze in dynamic social scenes. Journal of Vision, 14(8), 5.
Coutrot, A., Guyader, N., Ionescu, G., & Caplier, A. (2012). Influence of soundtrack on eye movements during video exploration. Journal of Eye Movement Research, 5(4), 2.
Cuijpers, R. H., & van der Pol, D. (2013). Region of eye contact of humanoid Nao robot is similar to that of a human. In G. Herrmann, M. J. Pearson, A. Lenz, P. Bremner, A. Spiers, & U. Leonards (Eds.), Social Robotics (Vol. 8239, pp. 280–289). Springer International Publishing.
Cummins, F. (2012). Gaze and blinking in dyadic conversation: A study in coordinated behaviour among individuals. Language and Cognitive Processes, 27(10), 1525–1549.
Dale, R., Fusaroli, R., Duran, N., & Richardson, D. C. (2013). The self-organization of human interaction. Psychology of Learning and Motivation, 59, 43–95.
Delaunay, F., Greeff, J., & Belpaeme, T. (2010). A study of a retro-projected robotic face and its effectiveness for gaze reading by humans. In ACM/IEEE International Conference on Human-Robot Interaction (HRI) (pp. 39–44). Osaka, Japan.
Donat, R., Bouillaut, L., Aknin, P., & Leray, P. (2008). Reliability analysis using graphical duration models (pp. 795–800). Presented at the Availability, Reliability and Security, 2008. ARES 08. Third International Conference on, IEEE.
Duffner, S., & Garcia, C. (2015). Visual Focus of Attention estimation with unsupervised incremental learning. IEEE Transactions on Circuits and Systems for Video Technology, to appear.
Elisei, F., Bailly, G., & Casari, A. (2007). Towards eyegaze-aware analysis and synthesis of audiovisual speech. In Auditory-visual Speech Processing (pp. 120–125). Hilvarenbeek, The Netherlands.
Ferreira, J. F., Lobo, J., Bessiere, P., Castelo-Branco, M., & Dias, J. (2013). A Bayesian framework for active artificial perception. IEEE Transactions on Cybernetics, 43(2), 699–711.
Foerster, F., Bailly, G., & Elisei, F. (2015). Impact of iris size and eyelids coupling on the estimation of the gaze direction of a robotic talking head by human viewers. In Humanoids (pp. 148–153). Seoul, Korea.
Foulsham, T., Walker, E., & Kingstone, A. (2011). The where, what and when of gaze allocation in the lab and the natural environment. Vision Research, 51(17), 1920–1931.
Funes Mora, K. A., & Odobez, J.-M. (2014). Geometric generative gaze estimation (G3E) for remote RGB-D cameras (pp. 1773–1780). Presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH: IEEE.
Fusaroli, R., & Tylén, K. (2016). Investigating conversational dynamics: Interactive alignment, Interpersonal synergy, and collective task performance. Cognitive Science, 40(1), 145–171.
Garau, M., Slater, M., Bee, S., & Sasse, M. A. (2001). The impact of eye gaze on communication using humanoid avatars. In SIGCHI conference on Human factors in computing systems (pp. 309–316). Seattle, WA.
Goferman, S., Zelnik-Manor, L., & Tal, A. (2012). Context-aware saliency detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(10), 1915–1926.
Gregory, R. (1997). Eye and Brain: The Psychology of Seeing. Princeton, NJ: Princeton University Press.
Gu, E., & Badler, N. I. (2006). Visual attention and eye gaze during multiparty conversations with distractions (pp. 193–204). Presented at the Intelligent Virtual Agents, Springer.
Hanes, D. A., & McCollum, G. (2006). Variables contributing to the coordination of rapid eye/head gaze shifts. Biological Cybernetics, 94, 300–324.
Henderson, J. M., Malcolm, G. L., & Schandl, C. (2009). Searching in the dark: Cognitive relevance drives attention in real-world scenes. Psychonomic Bulletin & Review, 16(5), 850–856.
Hietanen, J. K. (1999). Does your gaze direction and head orientation shift my visual attention? Neuroreport, 10(16), 3443–3447.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Huang, C.-M., & Mutlu, B. (2014). Learning-based Modeling of Multimodal Behaviors for Humanlike Robots. In Proceedings of the 2014 ACM/IEEE International Conference on Human-robot Interaction (pp. 57–64). New York, NY, USA: ACM.
Ishii, R., Otsuka, K., Kumano, S., & Yamato, J. (2014). Analysis and modeling of next speaking start timing based on gaze behavior in multi-party meetings. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 694–698). Florence, Italy.
Itti, L., Dhavale, N., & Pighin, F. (2003). Realistic avatar eye and head animation using a neurobiological model of visual attention. In SPIE 48th Annual International Symposium on Optical Science and Technology (Vol. 5200, pp. 64–78). Bellingham, WA.
Itti, L., Dhavale, N., & Pighin, F. (2006). Photorealistic attention-based gaze animation. In IEEE International Conference on Multimedia and Expo (pp. 521–524). Toronto, Canada.
Jensen, F., Lauritzen, S., & Olesen, K. (1990). Bayesian updating in recursive graphical models by local computations. Computational Statistics Quarterly, 4(1), 269–282.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks (pp. 1725–1732). Presented at the Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, IEEE.
Kobayashi, H., & Kohshima, S. (2001). Unique morphology of the human eye and its adaptive meaning: comparative studies on external morphology of the primate eye. Journal of Human Evolution, 40(5), 419–435.
Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques – Adaptive Computation and Machine Learning. Boston, MA: MIT Press.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing (NIPS). Lake Tahoe, NV.
Laidlaw, K. E. W., Foulsham, T., Kuhn, G., & Kingstone, A. (2011). Social attention to a live person is critically different than looking at a videotaped person. PNAS, 108, 5548–5553.
Lakin, J., Jefferis, V., Cheng, C., & Chartrand, T. (2003). The chameleon effect as social glue: evidence for the evolutionary significance of nonconscious mimicry. Nonverbal Behavior, 27(3), 145–162.
Langton, S. R. H. (2000). The mutual influence of gaze and head orientation in the analysis of social attention direction. Quarterly Journal of Experimental Psychology, 53A(3), 825–845.
Langton, S. R., Honeyman, H., & Tessler, E. (2004). The influence of head contour and nose angle on the perception of eye-gaze direction. Perception & Psychophysics, 66(5), 752–771.
Lansing, C. R., & McConkie, G. W. (1999). Attention to facial regions in segmental and prosodic visual speech perception tasks. Journal of Speech, Language, and Hearing Research, 42(3), 526–539.
Lee, S. P., Badler, J. B., & Badler, N. (2002). Eyes alive. ACM Transactions on Graphics, 21(3), 637–644.
Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10(8), 707–710.
Li, J., Tian, Y., & Huang, T. (2014). Visual saliency with statistical priors. International Journal of Computer Vision, 107(3), 239–253.
Liang, S., Fuhrman, S., Somogyi, R., & others. (1998). Reveal, a general reverse engineering algorithm for inference of genetic network architectures. In Pacific symposium on biocomputing (Vol. 3, pp. 18–29).
Marschner, L., Pannasch, S., Schulz, J., & Graupner, S.-T. (2015). Social communication with virtual agents: The effects of body and gaze direction on attention and emotional responding in human observers. International Journal of Psychophysiology, 97(2), 85–92.
McNeill, D. (1992). Hand and Mind. What Gestures Reveal about Thought. Chicago: Chicago University Press.
Mihoub, A., Bailly, G., & Wolf, C. (2014). Modelling perception-action loops: comparing sequential models with frame-based classifiers. In Human-Agent Interaction (HAI) (pp. 309–314). Tsukuba, Japan.
(2015). Learning multimodal behavioral models for face-to-face social interaction. Journal on Multimodal User Interfaces, 9(3), 195–210.
Mihoub, A., Bailly, G., Wolf, C., & Elisei, F. (2016). Graphical models for social behavior modeling in face-to-face interaction. Pattern Recognition Letters, 74, 82–89.
Murphy, K. (2002). Dynamic Bayesian networks: representation, inference and learning (PhD Thesis). UC Berkeley, Computer Science Division, Berkeley, CA.
Mutlu, B., Kanda, T., Forlizzi, J., Hodgins, J., & Ishiguro, H. (2012). Conversational gaze mechanisms for humanlike robots. ACM Transactions on Interactive Intelligent Systems (TiiS), 1(2), 12.
Neverova, N., Wolf, C., Taylor, G. W., & Nebout, F. (2016). ModDrop: adaptive multi-modal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 38(8), 1692–1706.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning (pp. 689–696). Presented at the International conference on machine learning (ICML), Bellevue, WA.
Nguyen, D.-A., Bailly, G., & Elisei, F. (2016). Conducting neuropsychological tests with a humanoid robot: design and evaluation. In IEEE International Conference on Cognitive Infocommunications – CogInfoCom (pp. 337–342). Wroclaw, Poland.
Onuki, T., Ishinoda, T., Kobayashi, Y., & Kuno, Y. (2013). Designing robot eyes for gaze communication. In IEEE Korea-Japan Joint Workshop on Frontiers of Computer Vision (FCV) (pp. 97–102). Fukuoka, Japan.
Otsuka, K. (2011). Multimodal Conversation Scene Analysis for Understanding People’s Communicative Behaviors in Face-to-Face Meetings. In International Conference on Human-Computer Interaction (HCI) (Vol. 12, pp. 171–179). Orlando, FL.
Otsuka, K., Takemae, Y., & Yamato, J. (2005). A probabilistic inference of multiparty-conversation structure based on Markov-switching models of gaze patterns, head directions, and utterances. In International Conference on Multimodal Interfaces (ICMI) (pp. 191–198). Seattle, WA.
Oyekoya, O., Steed, A., & Steptoe, W. (2010). Eyelid kinematics for virtual characters. Computer Animation and Virtual Worlds, 21(3–4), 161–171.
Pelachaud, C., & Bilvi, M. (2003). Modelling gaze behavior for conversational agents. In International Working Conference on Intelligent Virtual Agents (Vol. LNAI 2792). Kloster Irsee, Germany.
Pentland, A. S. (2004). Social dynamics: Signals and behavior. Presented at the International Conference on Developmental Learning, La Jolla, CA.
Picot, A., Bailly, G., Elisei, F., & Raidt, S. (2007). Scrutinizing natural scenes: controlling the gaze of an embodied conversational agent. In International Conference on Intelligent Virtual Agents (IVA) (pp. 272–282). Paris, France.
Raidt, S., Bailly, G., & Elisei, F. (2007). Mutual gaze during face-to-face interaction. In Auditory-visual Speech Processing. Hilvarenbeek, The Netherlands. Paper P23, 6 pages.
Richardson, D. C., Dale, R., & Kirkham, N. Z. (2007). The art of conversation is coordination: Common ground and the coupling of eye movements during dialogue. Psychological Science, 18(5), 407–413.
Richardson, D. C., Dale, R., & Shockley, K. (2008). Synchrony and swing in conversation: coordination, temporal dynamics, and communication. In I. Wachsmuth, M. Lenzen, & G. Knoblich (Eds.), Embodied Communication (pp. 75–93). Oxford, UK: Oxford University Press.
Risko, E. F., Laidlaw, K. E. W., Freeth, M., Foulsham, T., & Kingstone, A. (2012). Social attention with real versus reel stimuli: toward an empirical approach to concerns about ecological validity. Frontiers in Human Neuroscience, 6, 143.
Risko, E. F., Richardson, D. C., & Kingstone, A. (2016). Breaking the Fourth Wall of Cognitive Science: Real-World Social Attention and the Dual Function of Gaze. Current Directions in Psychological Science, 25(1), 70–74.
Ruhland, K., Andrist, S., Badler, J., Peters, C., Badler, N., Gleicher, M., & McDonnell, R. (2014). Look me in the eyes: A survey of eye and gaze animation for virtual agents and artificial systems (pp. 69–91). Presented at the Eurographics State-of-the-Art Report.
Sak, H., Vinyals, O., Heigold, G., Senior, A., McDermott, E., Monga, R., & Mao, M. (2014). Sequence discriminative distributed training of long short-term memory recurrent neural networks. Entropy, 15(16), 17–18.
Schauerte, B., & Stiefelhagen, R. (2014). “Look at this!” learning to guide visual saliency in human-robot interaction (pp. 995–1002). Presented at the Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, IEEE.
Schmidt, R., Morr, S., Fitzpatrick, P., & Richardson, M. J. (2012). Measuring the dynamics of interactional synchrony. Journal of Nonverbal Behavior, 36(4), 263–279.
Senju, A., & Hasegawa, T. (2005). Direct gaze captures visuospatial attention. Visual Cognition, 12, 127–144.
Sheikhi, S., & Odobez, J.-M. (2014). Combining dynamic head pose-gaze mapping with the robot conversational state for attention recognition in human-robot interactions. Pattern Recognition Letters.
Sugano, Y., Matsushita, Y., & Sato, Y. (2013). Appearance-based gaze estimation using visual saliency. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(2), 329–341.
Sugano, Y., Matsushita, Y., & Sato, Y. (2014). Learning-by-synthesis for appearance-based 3D gaze estimation (pp. 1821–1828). Presented at the Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, IEEE.
Sun, Y. (2003). Hierarchical object-based visual attention for machine vision (Thesis). Institute of Perception, Action and Behaviour, University of Edinburgh, Edinburgh, UK.
Teufel, C., Alexis, D. M., Clayton, N. S., & Davis, G. (2010). Mental-state attribution drives rapid, reflexive gaze following. Attention, Perception, & Psychophysics, 72(3), 695–705.
Tomasello, M., Hare, B., Lehmann, H., & Call, J. (2007). Reliance on head versus eyes in the gaze following of great apes and human infants: the cooperative eye hypothesis. Journal of Human Evolution, 52, 314–320.
Trabelsi, G., Leray, P., Ben Ayed, M., & Alimi, A. M. (2013). Benchmarking dynamic Bayesian network structure learning algorithms (pp. 1–6). Presented at the Modeling, Simulation and Applied Optimization (ICMSAO), 2013 5th International Conference on, IEEE.
Trutoiu, L. C., Carter, E. J., Matthews, I., & Hodgins, J. K. (2011). Modeling and animating eye blinks. ACM Transactions on Applied Perception (TAP), 8(3), 1–17.
Valenti, R., & Gevers, T. (2012). Accurate eye center location through invariant isocentric patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(9), 1785–1798.
Van der Burg, E., Olivers, C. N., Bronkhorst, A. W., & Theeuwes, J. (2009). Poke and pop: Tactile–visual synchrony increases visual saliency. Neuroscience Letters, 450(1), 60–64.
Vatikiotis-Bateson, E., Eigsti, I.-M., Yano, S., & Munhall, K. G. (1998). Eye movement of perceivers during audiovisual speech perception. Perception & Psychophysics, 60, 926–940.
Vertegaal, R., Slagter, R., van der Veer, G., & Nijholt, A. (2001). Eye gaze patterns in conversations: There is more to conversational agents than meets the eyes. In Conference on Human Factors in Computing Systems (pp. 301–308). Seattle, WA: ACM Press New York, NY, USA.
Vinayagamoorthy, V., Garau, M., Steed, A., & Slater, M. (2004). An eye gaze model for dyadic interaction in an immersive virtual environment: Practice and experience. The Computer Graphics Forum, 23(1), 1–11.
