Article in: Group Dynamics in Human–Robot Interaction
Edited by Alessandra Sciutti, Dario Pasquali, Giulia Belgiovine and Linda Lastrico
[Interaction Studies 26:3] 2025
pp. 422–476
Evaluating multi-party interactions with social robots using large language models and multi-modal systems
This content is being prepared for publication; it may be subject to change.
Abstract
Managing conversational interactions with groups of people remains an open challenge in human-robot interaction,
requiring a multi-modal combination of sensory inputs/outputs and dialogue systems. In this paper, we present the development of
an integrated multi-modal system that connects a Large Language Model (LLM) with a social robot's perception and action modules
to manage situated multi-party interactions. We describe and discuss the exploratory results of a system-wide performance
evaluation via a within-subjects user study in which 27 unique pairs of participants interacted with a social robot under two
conditions: a multi-party-capable system and a baseline system with only single-party capabilities. Participants interacted with
the two systems in a combination of task-based and open-ended scenarios, for a total of 108 interactions with each of the two
systems. Our evaluation showed a slight preference for the Multi-Party system and more balanced interactions overall, and it
highlighted both the potential of and the open challenges in integrating LLM capabilities into robotic conversational systems.
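To make the architecture described above more concrete, the sketch below illustrates one minimal way a dialogue controller could fuse a social scene state (who is present, who spoke, who is gazing at the robot) into a speaker-labelled prompt for an LLM, so that addressee detection and multi-party turn-taking become explicit. This is a hypothetical sketch of the general technique under our own assumptions, not the authors' implementation: every name in it (Person, SceneState, robot_is_addressed, build_prompt) is illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Person:
    person_id: str
    looking_at_robot: bool = False  # hypothetically fed by a gaze-estimation stream


@dataclass
class SceneState:
    # Aggregated "social scene state": who is present and who spoke last.
    people: Dict[str, Person] = field(default_factory=dict)
    last_speaker: Optional[str] = None


def robot_is_addressed(speaker_id: str, scene: SceneState) -> bool:
    # Naive addressee-detection heuristic: treat the robot as the addressee
    # only when the current speaker is gazing at it.
    person = scene.people.get(speaker_id)
    return bool(person and person.looking_at_robot)


def build_prompt(history: List[Tuple[str, str]], scene: SceneState) -> str:
    # Render a speaker-labelled transcript so the LLM can reason about
    # "who says what to whom" before generating the robot's next turn.
    lines = [f"Participants: {', '.join(scene.people)}."]
    for speaker_id, text in history:
        lines.append(f"{speaker_id}: {text}")
    lines.append("robot:")
    return "\n".join(lines)


if __name__ == "__main__":
    scene = SceneState(people={
        "person_A": Person("person_A", looking_at_robot=True),
        "person_B": Person("person_B"),
    })
    history = [("person_A", "Which task should we do first?")]
    if robot_is_addressed("person_A", scene):
        print(build_prompt(history, scene))  # prompt that would go to the LLM
```

In a deployed system, the gaze flag and speaker identities would come from the visual and audio perception streams, and the robot would stay silent when it is not the addressee; the sketch only shows the fusion step, not the full pipeline.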
Article outline
- 1. Introduction
- 2. Related work
- 2.1 Multi-party human-robot interactions
- 2.2 Large Language Models for multi-party conversations
- 2.2.1 Large Language Models for evaluation of dialogue tasks
- 2.3 Large Language Models for HRI
- 3. System overview
- 3.1 Robot platform
- 3.2 Visual perception stream
- 3.2.1 Person detection and tracking
- 3.2.2 Gaze estimation
- 3.3 Audio perception stream
- 3.3.1 Speaker identification
- 3.3.2 Speech recognition
- 3.4 Social scene state
- 3.4.1 ROS4HRI interface
- 3.5 Non-verbal behavior
- 3.5.1 Interaction controller
- 3.5.2 Gaze behaviour
- 3.6 Verbal behavior
- 3.6.1 Dialogue controller
- 3.6.2 LLM interface
- 3.6.3 Tablet interface
- 4. User evaluation
- 4.1 Procedure
- 4.2 Participants
- 5. Results
- 5.1 System performance
- 5.1.1 Turn-taking
- 5.1.2 Addressee detection
- 5.1.3 Addressee selection
- 5.2 Component-level performance
- 5.2.1 Person tracking
- 5.2.2 Gaze estimation and face tracking
- 5.2.3 Speaker identification and speech recognition
- 5.2.4 LLM interface
- 5.3 Automated dialogue evaluation
- 5.4 User evaluation
- 5.4.1 Explicit preference
- 5.4.2 Multi-party features
- Group task completion
- Robot turn-taking behaviour
- Robot listening and understanding
- Robot movements
- 5.4.3 Speech user interface service quality (SUISQ)
- 5.4.4 Robot anthropomorphism (HRIES)
- 6. Discussion
- 6.1 Multi-party interaction context and features
- 6.2 Users' perception of the robot
- 6.3 Limitations
- 6.4 Future steps
- 7. Conclusion
- Notes
- Appendix
- A1. Multi-party system prompt
- A2. Single-party system prompt
- A3. Dialogue quality prompt
- A4. Multi-party-goal dialogue quality prompt
- A5. Participant instructions
- A5.1 Open scenario
- A5.2 Task-oriented scenarios
- A5.3 Pictograms
- A6. Example dialogues