From manual to machine: Evaluating automated ear–voice span measurement in simultaneous interpreting

Guo, Meng; Han, Lili

doi:10.1075/intp.00100.guo

Article published in: Interpreting
Vol. 26:1 (2024), pp. 24–54

Meng Guo | Macao Polytechnic University

Lili Han | Macao Polytechnic University

Published online: 15 January 2024

https://doi.org/10.1075/intp.00100.guo

Abstract

This study introduces a groundbreaking automated methodology for measuring ear–voice span (EVS) in simultaneous interpreting (SI). Traditionally, assessing EVS – a critical temporal metric in SI – has been hampered by labour-intensive and time-consuming manual methods that are prone to inconsistency. To overcome these challenges, our research harnesses state-of-the-art natural language processing (NLP) technologies, including automatic speech recognition (ASR), sentence boundary detection (SBD) and cross-lingual alignment, to automate EVS measurement. We deployed a comprehensive array of NLP models and evaluated the automated pipelines on a 20-hour English-to-Portuguese SI corpus which featured 57 varied audio pairings. The findings are encouraging: the most effective model combination achieved a median EVS error of less than 0.1 seconds across the corpus. Moreover, the automated pipelines exhibited a high level of accuracy, strong correlation and substantial agreement with manual measurements when assessing median EVS for individual audio pairs. Despite these satisfactory results, certain challenges persist with some NLP models, indicating clear avenues for future research. This study not only introduces a groundbreaking approach to large-scale EVS measurement but also propels the automation of process analysis in Interpreting Studies.

Keywords: ear–voice span, simultaneous interpreting, automatic speech recognition, sentence boundary detection, cross-lingual alignment
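The comparison described in the abstract, median EVS from an automated pipeline set against median EVS from manual annotation, can be illustrated with a minimal sketch. It assumes that EVS for one aligned sentence pair is the interpreter's onset time minus the speaker's onset time; the abstract does not spell out this definition. The function names and the onset values are hypothetical and are not taken from the study.

```python
from statistics import median


def evs_values(aligned_onsets):
    """EVS in seconds for each aligned (source_onset, target_onset) pair.

    Assumption: EVS is measured onset to onset on aligned sentence pairs.
    """
    return [target - source for source, target in aligned_onsets]


def median_evs(aligned_onsets):
    """Median EVS in seconds for one audio pair."""
    return median(evs_values(aligned_onsets))


# Hypothetical onset times in seconds for one audio pair (illustration only).
manual = [(0.0, 2.1), (5.0, 7.9), (11.0, 13.4)]     # human annotation
automated = [(0.1, 2.3), (5.0, 7.8), (11.2, 13.5)]  # ASR + SBD + alignment output

manual_median = median_evs(manual)
auto_median = median_evs(automated)

# Absolute difference between the automated and the manual median EVS.
median_error = abs(auto_median - manual_median)

print(f"manual median EVS:    {manual_median:.2f} s")
print(f"automated median EVS: {auto_median:.2f} s")
print(f"median EVS error:     {median_error:.2f} s")
```

In a real pipeline the onset pairs would come from ASR timestamps, sentence boundary detection and cross-lingual alignment rather than from hand-typed lists, and the error would be aggregated over all audio pairs in the corpus.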

Article outline

Introduction
1. Ear–voice span measurement in simultaneous interpreting
- 1.1 Methods and tools for ear–voice span measurement
- 1.2 Statistical techniques in ear–voice span measurement
- 1.3 Innovations in ear–voice span measurement
2. Natural language processing technologies for ear–voice span measurement
- 2.1 Automatic speech recognition models
- 2.2 Sentence boundary detection models
- 2.3 Cross-lingual alignment models
3. Data collection and preparation
- 3.1 Compilation of the simultaneous interpreting corpus focused on ear–voice span
- 3.2 Stratified corpus sampling for manual validation
4. Methodology
- 4.1 Automated pipeline for ear–voice span measurement
- 4.2 Manual annotation of ear–voice span
- 4.3 Manual validation of pipeline components of natural language processing
- 4.4 Data-preprocessing and -analysis techniques
5. Results
- 5.1 Comparative analysis of manual and automated ear–voice span measurement approaches
- 5.2 Evaluation of automatic speech recognition, sentence boundary detection and cross-lingual alignment
6. Discussion
- 6.1 Performance of automated pipelines for ear–voice span measurement
- 6.2 Performance of pipeline components
- 6.3 Implications and limitations
7. Conclusion
Acknowledgements
Notes
References

Cited by (1)

Liu, Zhengyuan, Xinqi Yu, Wing Chung Hu, Yunxiao Ma, Ruiming Wang & Haoyun Zhang (2025). Praditor: A DBSCAN-based automation for speech onset detection. Behavior Research Methods 57:9.

This list is based on CrossRef data as of 12 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.