Article in: Group Dynamics in Human–Robot Interaction
Edited by Alessandra Sciutti, Dario Pasquali, Giulia Belgiovine and Linda Lastrico
[Interaction Studies 26:3] 2025
pp. 422–476
Evaluating multi-party interactions with social robots using large language models and multi-modal systems
This content is being prepared for publication; it may be subject to change.
Abstract
Managing conversational interactions with groups of people remains an open challenge in human-robot interaction,
requiring a multi-modal combination of sensory inputs/outputs and dialogue systems. In this paper, we present the development of
an integrated multi-modal system that connects a Large Language Model (LLM) with a social robot's perception and action modules
to manage situated multi-party interactions. We describe and discuss the exploratory results of a system-wide performance
evaluation via a within-subjects user study in which 27 unique pairs of participants interacted with a social robot under two
conditions: a multi-party-capable system and a baseline system with only single-party capabilities. Participants interacted with
the two systems in a combination of task-based and open-ended scenarios, for a total of 108 interactions with each of the two
systems. Our evaluation showed a slight preference for the Multi-Party system and more balanced interactions overall, and it
highlighted both the potential of and the open challenges in integrating LLM capabilities into robotic conversational systems.
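To make the architecture described above more concrete, the sketch below illustrates one minimal way a dialogue controller could fuse a social scene state (who is present, who spoke, who is gazing at the robot) into a speaker-labelled prompt for an LLM, so that addressee detection and multi-party turn-taking become explicit. This is a hypothetical sketch of the general technique under our own assumptions, not the authors' implementation: every name in it (Person, SceneState, robot_is_addressed, build_prompt) is illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Person:
    person_id: str
    looking_at_robot: bool = False  # hypothetically fed by a gaze-estimation stream


@dataclass
class SceneState:
    # Aggregated "social scene state": who is present and who spoke last.
    people: Dict[str, Person] = field(default_factory=dict)
    last_speaker: Optional[str] = None


def robot_is_addressed(speaker_id: str, scene: SceneState) -> bool:
    # Naive addressee-detection heuristic: treat the robot as the addressee
    # only when the current speaker is gazing at it.
    person = scene.people.get(speaker_id)
    return bool(person and person.looking_at_robot)


def build_prompt(history: List[Tuple[str, str]], scene: SceneState) -> str:
    # Render a speaker-labelled transcript so the LLM can reason about
    # "who says what to whom" before generating the robot's next turn.
    lines = [f"Participants: {', '.join(scene.people)}."]
    for speaker_id, text in history:
        lines.append(f"{speaker_id}: {text}")
    lines.append("robot:")
    return "\n".join(lines)


if __name__ == "__main__":
    scene = SceneState(people={
        "person_A": Person("person_A", looking_at_robot=True),
        "person_B": Person("person_B"),
    })
    history = [("person_A", "Which task should we do first?")]
    if robot_is_addressed("person_A", scene):
        print(build_prompt(history, scene))  # prompt that would go to the LLM
```

In a deployed system, the gaze flag and speaker identities would come from the visual and audio perception streams, and the robot would stay silent when it is not the addressee; the sketch only shows the fusion step, not the full pipeline.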
Article outline
- 1. Introduction
- 2. Related work
- 2.1 Multi-party human-robot interactions
- 2.2 Large Language Models for multi-party conversations
- 2.2.1 Large Language Models for evaluation of dialogue tasks
- 2.3 Large Language Models for HRI
- 3. System overview
- 3.1 Robot platform
- 3.2 Visual perception stream
- 3.2.1 Person detection and tracking
- 3.2.2 Gaze estimation
- 3.3 Audio perception stream
- 3.3.1 Speaker identification
- 3.3.2 Speech recognition
- 3.4 Social scene state
- 3.4.1 ROS4HRI interface
- 3.5 Non-verbal behavior
- 3.5.1 Interaction controller
- 3.5.2 Gaze behaviour
- 3.6 Verbal behavior
- 3.6.1 Dialogue controller
- 3.6.2 LLM interface
- 3.6.3 Tablet interface
- 4. User evaluation
- 4.1 Procedure
- 4.2 Participants
- 5. Results
- 5.1 System performance
- 5.1.1 Turn-taking
- 5.1.2 Addressee detection
- 5.1.3 Addressee selection
- 5.2 Component-level performance
- 5.2.1 Person tracking
- 5.2.2 Gaze estimation and face tracking
- 5.2.3 Speaker identification and speech recognition
- 5.2.4 LLM interface
- 5.3 Automated dialogue evaluation
- 5.4 User evaluation
- 5.4.1 Explicit preference
- 5.4.2 Multi-party features
- Group task completion
- Robot turn-taking behaviour
- Robot listening and understanding
- Robot movements
- 5.4.3 Speech user interface service quality (SUISQ)
- 5.4.4 Robot anthropomorphism (HRIES)
- 6. Discussion
- 6.1 Multi-party interaction context and features
- 6.2 Users' perception of the robot
- 6.3 Limitations
- 6.4 Future steps
- 7. Conclusion
- Notes
- Appendix
- A1. Multi-party system prompt
- A2. Single-party system prompt
- A3. Dialogue quality prompt
- A4. Multi-party-goal dialogue quality prompt
- A5. Participant instructions
- A5.1 Open scenario
- A5.2 Task-oriented scenarios
- A5.3 Pictograms
- A6. Example dialogues