Article published in: Interpreting
Vol. 25:1 (2023), pp. 109–143
Automatic assessment of spoken-language interpreting based on machine-translation evaluation metrics
A multi-scenario exploratory study
Published online: 4 March 2022
https://doi.org/10.1075/intp.00076.lu
Abstract
Automated metrics for machine translation (MT) such as BLEU are customarily used because they are quick to compute and sufficiently valid to be useful in MT assessment. Whereas the instantaneity and reliability of such metrics are made possible by automatic computation based on predetermined algorithms, their validity is primarily dependent on a strong correlation with human assessments. Despite the popularity of such metrics in MT, little research has been conducted to explore their usefulness in the automatic assessment of human translation or interpreting. In the present study, we therefore seek to provide an initial insight into the way MT metrics would function in assessing spoken-language interpreting by human interpreters. Specifically, we selected five representative metrics – BLEU, NIST, METEOR, TER and BERT – to evaluate 56 bidirectional consecutive English–Chinese interpretations produced by 28 student interpreters of varying abilities. We correlated the automated metric scores with the scores assigned by different types of raters using different scoring methods (i.e., multiple assessment scenarios). The major finding is that BLEU, NIST, and METEOR had moderate-to-strong correlations with the human-assigned scores across the assessment scenarios, especially for the English-to-Chinese direction. Finally, we discuss the possibility and caveats of using MT metrics in assessing human interpreting.
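To make the workflow described in the abstract concrete, the short Python sketch below scores hypothetical interpreting transcripts against a reference translation with sentence-level BLEU via NLTK (Loper & Bird 2002, listed in the references below) and then correlates the metric scores with human ratings, mirroring the validity check the study performs. It is a minimal illustration, not the authors' pipeline: the transcripts, the human scores, the smoothing method and the use of Spearman's rho are all assumptions made for the example.

# Minimal sketch: metric-versus-human validity check on interpreting transcripts.
# Assumes NLTK and SciPy are installed; all data below are invented placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# One tokenized reference rendition and three hypothetical student transcripts.
reference = ["the delegates adopted the resolution after a lengthy debate".split()]
candidates = [
    "the delegates adopted the resolution after long debate".split(),
    "delegates passed a resolution following much discussion".split(),
    "the resolution was discussed".split(),
]
human_scores = [8.5, 7.0, 3.0]  # e.g., holistic ratings on a 10-point scale

# Smoothed sentence-level BLEU: short transcripts often have zero
# higher-order n-gram matches, which unsmoothed BLEU scores as 0.
smooth = SmoothingFunction().method1
bleu = [sentence_bleu(reference, c, smoothing_function=smooth) for c in candidates]

# Validity check: how closely does the automated metric track the human raters?
rho, p = spearmanr(bleu, human_scores)
print([round(b, 3) for b in bleu])
print(f"Spearman's rho = {rho:.2f} (p = {p:.3f})")

The same skeleton extends to the study's other metrics by swapping the scoring function, e.g. NIST (also available in NLTK), METEOR, TER, or a BERT-based metric computed with the Transformers library (Wolf et al. 2020); for Chinese output, tokenization (e.g. character-level segmentation) would need to be handled before scoring.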
Article outline
- 1. Introduction
- 2. Human versus automatic assessment of interpreting
- 2.1 Human assessment: Rater types and scoring methods
- 2.2 Automatic assessment: Evaluation metrics of machine translation
- 2.2.1 An overview of evaluation metrics
- 2.2.2 An introduction to BLEU, NIST, METEOR, TER and BERT
- 2.3 Use of automated metrics in translation assessment
- 3. Research questions
- 4. Method
- 4.1 Interpreting samples
- 4.2 Human raters
- 4.3 Scoring methods
- 4.4 Analysis of the human-assigned scores
- 4.5 Computation of evaluation metrics
- 4.6 Data analysis
- 5. Results
- 5.1 Inter-metric and inter-scenario correlation
- 5.2 Overall correlations between metric scores and human-assigned scores
- 5.3 Correlations based on metrics, rater types and scoring methods
- 6. Discussion
- 7. Conclusion
- Notes
- References

References
Banerjee, S. & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72. [URL]
Callison-Burch, C., Osborne, M. & Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, 249–256. [URL]
Chen, J., Yang, H-B. & Han, C. (2021). Holistic versus analytic scoring of spoken-language interpreting: A multi-perspectival comparative analysis. Manuscript submitted for publication.
Christodoulides, G. & Lenglet, C. (2014). Prosodic correlates of perceived quality and fluency in simultaneous interpreting. In N. Campbell, D. Gibbon & D. Hirst (Eds.), Proceedings of the 7th Speech Prosody Conference, 1002–1006. [URL].
Chung, H-Y. (2020). Automatic evaluation of human translation: BLEU vs. METEOR. Lebende Sprachen 65 (1), 181–205.
Coughlin, D. (2003). Correlating automated and human assessments of machine translation quality. [URL]
Devlin, J., Chang, M-W., Lee, K. & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–4186. [URL]
Doddington, G. (2002). Automatic evaluation of machine translation quality using N-gram co-occurrence statistics. Proceedings of the Second International Conference on Human Language Technology Research, 138–145.
Ginther, A., Dimova, S. & Yang, R. (2010). Conceptual and empirical relationships between temporal measures of fluency and oral English proficiency with implications for automated scoring. Language Testing 27 (3), 379–399.
Han, C. (2015). Investigating rater severity/leniency in interpreter performance testing: A multifaceted Rasch measurement approach. Interpreting 17 (2), 255–283.
Han, C. (2018). Using rating scales to assess interpretation: Practices, problems and prospects. Interpreting 20 (1), 59–95.
Han, C. (2022a). Interpreting testing and assessment: A state-of-the-art review. Language Testing 39 (1), 30–55.
Han, C. (2022b). Assessing spoken-language interpreting: The method of comparative judgement. Interpreting 24 (1), xx–xx.
Han, C. & Lu, X-L. (2021a). Interpreting quality assessment re-imagined: The synergy between human and machine scoring. Interpreting and Society 1 (1), 70–90.
Han, C. & Lu, X-L. (2021b). Can automated machine translation evaluation metrics be used to assess students’ interpretation in the language learning classroom? Computer Assisted Language Learning, 1–24.
Han, C. & Xiao, X-Y. (2021). A comparative judgment approach to assessing Chinese Sign Language interpreting. Language Testing, 1–24.
Han, C., Hu, J. & Deng, Y. (forthcoming). Effects of language background and directionality on raters’ assessments of spoken-language interpreting: An exploratory experimental study. Revista Española de Lingüística Aplicada.
Han, C., Chen, S-J., Fu, R-B. & Fan, Q. (2020). Modeling the relationship between utterance fluency and raters’ perceived fluency of consecutive interpreting. Interpreting 22 (2), 211–237.
International School of Linguists (2020). Diploma in Public Service Interpreting learner handbook. London, UK. [URL]
Le, N-T., Lecouteux, B. & Besacier, L. (2018). Automatic quality estimation for speech translation using joint ASR and MT features. Machine Translation 32 (4), 325–351.
Lee, J. (2008). Rating scales for interpreting performance assessment. The Interpreter and Translator Trainer 2 (2), 165–184.
Lee, S-B. (2019). Holistic assessment of consecutive interpretation: How interpreter trainers rate student performances. Interpreting 21 (2), 245–269.
Liu, M-H. (2013). Design and analysis of Taiwan’s interpretation certification examination. In: D. Tsagari & R. van Deemter (Eds.), Assessment issues in language translation and interpreting. Frankfurt: Peter Lang, 163–178.
Liu, Y-M. (2021). Exploring a corpus-based approach to assessing interpreting quality. In: J. Chen & C. Han (Eds.), Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 159–178.
Loper, E. & Bird, S. (2002). NLTK: The Natural Language Toolkit. Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, 63–70.
Mathur, N., Wei, J., Freitag, M., Ma, Q-S. & Bojar, O. (2020). Results of the WMT20 metrics shared task. Proceedings of the Fifth Conference on Machine Translation, 688–725. [URL]
NAATI (2019). Certified interpreter test assessment rubrics. [URL]
Ouyang, L-W., Lv, Q-X. & Liang, J-Y. (2021). Coh-Metrix model-based automatic assessment of interpreting quality. In: J. Chen & C. Han (Eds.), Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 179–200.
Papineni, K., Roukos, S., Ward, T. & Zhu, W-J. (2002). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. [URL]
Reiter, E. (2018). A structured review of the validity of BLEU. Computational Linguistics 44 (3), 393–401.
Sellam, T., Das, D. & Parikh, A. P. (2020). BLEURT: Learning robust metrics for text generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7881–7892. [URL].
Setton, R. & Dawrant, A. (2016). Conference interpreting: A trainer’s guide. Amsterdam & Philadelphia: John Benjamins.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L. & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, 223–231. [URL]
Stewart, C., Vogler, N., Hu, J-J., Boyd-Graber, J. & Neubig, G. (2018). Automatic estimation of simultaneous interpreter performance. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. [URL]
Su, W. (2019). Exploring native English teachers’ and native Chinese teachers’ assessment of interpreting. Language and Education 33 (6), 577–594.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C-W., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q. & Rush, A. (2020). Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. [URL].
Wu, S. C. (2010). Assessing simultaneous interpreting: A study on test reliability and examiners’ assessment behavior. [URL]
Wu, Z-W. (2021). Chasing the unicorn? The feasibility of automatic assessment of interpreting fluency. In: J. Chen & C. Han (Eds.), Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 143–158.
Yang, L-Y. (2015). An exploratory study of fluency in English output of Chinese consecutive interpreting learners. Journal of Zhejiang International Studies University (1), 60–68.
Yu, W-T. & van Heuven, V. J. (2017). Predicting judged fluency of consecutive interpreting from acoustic measures: Potential for automatic assessment and pedagogic implications. Interpreting 19 (1), 47–68.
Zhang, M. (2013). Contrasting automated and human scoring of essays. R&D Connections 21. [URL]
