Chapter 7. Videoconference interpreting goes multimodal: Some insights and a tentative proposal

Zhang, Xiaojun; Corpas Pastor, Gloria; Zhang, Jing

doi:10.1075/ivitra.37.07zha

In: Interpreting Technologies – Current and Future Trends
Edited by Gloria Corpas Pastor and Bart Defrancq
[IVITRA Research in Linguistics and Literature 37] 2023
pp. 169–194

Xiaojun Zhang | Xi’an Jiaotong-Liverpool University, China | Xiaojun.Zhang01@xjtlu.edu.cn

Gloria Corpas Pastor | IUITLM, University of Malaga, Spain | gcorpas@uma.es

Jing Zhang | Sichuan Normal University, China | jingzhang@sicnu.edu.cn

Published online: 9 October 2023

https://doi.org/10.1075/ivitra.37.07zha

Abstract

Recent years have witnessed an unprecedented surge in distance modalities of interpreting (remote, videoconference, etc.), a trend that has been particularly noticeable since the onset of the COVID-19 pandemic. Most scholarly research has explored the implications and applications of video technology for interpreting, along with its potential advantages and shortcomings. By contrast, this paper analyses the multimodal nature of videoconference interpreting (VCI) and the research opportunities it offers. Inspired by human bimodal perception and multi-sensory integration, our proposal is situated within the subfield of meeting content analysis, which offers a convenient way to help interpreters prepare for a given meeting and to provide a better user experience. Our main aim is to compile a core list of key features and resources that may inform the future development of VCI technology and multilingual conference support applications.

Keywords: automatic speech recognition (ASR), artificial intelligence (AI), multimodality, conversation dynamics, natural language processing (NLP), neural machine translation (NMT), non-verbal communication, speaker diarisation, summarisation, keyword spotting (KWS), videoconference interpreting (VCI)

Article outline

1. Introduction
2. Multimedia and multimodal human interaction
- 2.1 Meeting content analysis
- 2.2 Multimodal human interaction analysis
3. Technologies for meeting data capture
- 3.1 Microphone arrays
- 3.2 Speech recognition and speaker diarisation
  - 3.2.1 Speech recognition
  - 3.2.2 Speaker diarisation
- 3.3 Keyword spotting
- 3.4 Summarisation
- 3.5 Computerised translation
- 3.6 Face and gesture recognition
- 3.7 Conversation dynamics
4. Some suggestions and concluding remarks
Notes
References

Cited by (3)

Chmiel, Agnieszka. 2025. Sztuczna inteligencja w kabinie. Tłumaczenie symultaniczne w kontekście rozwoju technologii. Przekładaniec 50, pp. 42 ff.

Zou, Deyan, Huahui Zhang, Ying Zhao & Piao Xu. 2025. Unleashing the potential: how ChatGPT improves gisting skills in student interpreters. The Interpreter and Translator Trainer, pp. 1 ff.

Fan, Damien Chiaming. 2024. Conference interpreters’ technology readiness and perception of digital technologies. Interpreting. International Journal of Research and Practice in Interpreting 26 (2), pp. 178 ff.

This list is based on CrossRef data as of 12 November 2025 and may not be complete. The sources presented here have been supplied by the respective publishers, to whom any errors should be reported.
