Article published in: Human-centeredness in Translation: Advancing Translation Studies in a human-centered AI era
Guest-edited by Miguel A. Jiménez-Crespo
[InContext 5:1] 2025
► pp. 116–145
An empirical study on GenAI use in speech difficulty evaluation
Toward a human-centered application of AI in interpreting education
Available under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) 4.0 license.
For any use beyond this license, please contact the publisher at rights@benjamins.nl.
Published online: 31 May 2025
https://doi.org/10.54754/incontext.v5i1.104
Abstract
This study examines the use of Artificial Intelligence Generated Content (AIGC) tools for assessing speech difficulty in interpreter training. Twenty-five students were invited to interpret three materials consecutively from English into Chinese and then rate the difficulty of those speeches, while ChatGPT was provided with the transcripts and durations of the speeches. The students’ evaluations were compared with ChatGPT’s within a standardized framework, the Speech Difficulty Index (SDI). Statistical analyses, specifically one-sample t-tests and one-sample Wilcoxon signed-rank tests, were conducted to determine whether the students’ and ChatGPT’s assessments differed significantly. For the total scores, the results indicate a consensus between students and ChatGPT on the difficulty of a moderately challenging speech; divergences, however, were observed for the other two speeches, classified as more and less difficult. Further comparison of the scores on the three component dimensions shows that students’ evaluations can differ from ChatGPT’s on “Subject Matter,” while there is no significant difference in the scores for “Speed of Delivery.” For “Density and Style,” the trend is consistent with that of the total scores. A follow-up interview presents students’ perspectives on evaluating speech difficulty, showing that they formed judgements according to their subjective perceptions. Given ChatGPT’s capacity to analyze delivery speed and minimize subjective bias, the integration of AIGC tools in educational settings is recommended. Moreover, interpreter trainers should note the divergence between students’ subjective perceptions and objective evaluations of speech difficulty, and balance the two to compensate for AIGC tools’ insensitivity to subjective factors.
Providing AIGC tools with reliable frameworks for speech difficulty evaluation could refine material selection, better aligning materials with learners’ proficiency levels and thereby optimizing the educational outcomes of interpreter training. Based on the findings and limitations of this study, several promising directions for future research are proposed.
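The core comparison described above, testing whether a group of student ratings differs from a single ChatGPT rating, can be sketched as a one-sample t statistic. The ratings below are hypothetical illustrative values, not data from the study, and this is a minimal stdlib sketch rather than the authors’ analysis code:

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu):
    """One-sample t statistic: distance of the sample mean from mu,
    in standard-error units. |t| beyond the two-tailed critical value
    for df = n - 1 indicates a significant difference."""
    n = len(sample)
    return (mean(sample) - mu) / (stdev(sample) / math.sqrt(n))

# Hypothetical SDI ratings from eight students and one ChatGPT rating
# (illustrative values only, not data from the study).
student_ratings = [3.0, 4.0, 4.0, 3.0, 4.5, 4.0, 3.0, 4.0]
chatgpt_rating = 3.5

t = one_sample_t(student_ratings, chatgpt_rating)
print(f"t = {t:.3f}")  # t ≈ 0.893; compare |t| with 2.365 (df = 7, α = .05)
```

With |t| well below the critical value, these hypothetical ratings would show no significant divergence; the Wilcoxon signed-rank test used in the study is the non-parametric counterpart applied to the same differences.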
Keywords: generative AI (AIGC), interpreter education, speech difficulty evaluation, Speech Difficulty Index (SDI), interpreting
