Article published in: Interpreting
Vol. 28:1 (2026), pp. 58–90
Applying n-gram-based evaluation metrics to assess human interpreting
A battery of replications with internal meta‑analysis
Published online: 20 November 2025
https://doi.org/10.1075/intp.00127.han
Abstract
A number of recent studies have employed n-gram-based machine-translation evaluation metrics such as BLEU to automatically assess human interpreting. A major limitation of this
research lies in the non-probabilistic sampling of a limited number of renditions. Consequently, the correlation coefficients
calculated between machine and human assessments, which serve as a proxy for machine–human parity, lack generalizability. Against
this background, we conducted a battery of replications of Han and Lu (2023) in order
to evaluate the efficacy of three n-gram-based automated metrics (BLEU, NIST and METEOR) in the assessment of
interpreting. Our replications are based on a self-curated corpus of 1,695 interpretations across different
modes and directions of interpreting, based on various source speeches. Following the replications, we also conducted a four-level
meta-analysis to produce an overall estimate of the machine–human correlation and to identify potential moderators. Our main
findings are that the replication success rate for BLEU was above 95%, followed by NIST (at about 70%) and METEOR (at about 40%);
the overall machine–human correlation was rs = .638; and the three significant moderators identified
were the direction of interpreting, the reliability of human scoring and the type of automated metrics. Our study has
methodological and practical implications for conducting interpreting research and assessment.
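To illustrate the kind of machine scoring the abstract describes, the sketch below computes a sentence-level BLEU score with NLTK (Loper & Bird 2002) and correlates machine scores with human ratings via Spearman's rho. This is a minimal illustration, not the authors' actual pipeline; the example sentences, the machine scores and the human ratings are invented.

```python
# Minimal sketch of n-gram-based machine scoring of an interpretation.
# The sentences and the ratings below are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

reference = "the delegates adopted the resolution without a vote".split()
candidate = "the delegates passed the resolution without voting".split()

# Smoothing avoids a zero score when some n-gram order has no matches.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)

# Correlate machine scores with human ratings for five hypothetical
# renditions (Spearman's rho as a proxy for machine-human parity).
machine_scores = [0.42, 0.31, 0.55, 0.18, 0.47]
human_ratings = [3.5, 2.0, 4.0, 1.5, 3.0]
rho, p_value = spearmanr(machine_scores, human_ratings)

print(f"sentence BLEU = {bleu:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

NIST (`nltk.translate.nist_score`) and METEOR (`nltk.translate.meteor_score`, which additionally requires WordNet data) can be plugged into the same workflow.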
Article outline
- 1. Introduction
- 2. Literature review
- 2.1 Approaches to automatic assessment of machine translation: An overview
- 2.2 Exploratory research on automatic assessment of interpreting
- 2.3 A battery of replications
- 2.4 Internal meta-analysis
- 2.5 Research questions
- 3. Method
- 3.1 The target study for replication
- 3.2 Overview of the database used for replication
- 3.3 A battery of replications
- 3.4 Computation of n-gram-based evaluation metrics
- 3.5 Data analysis
- 3.5.1 Correlation analysis
- 3.5.2 Replication analysis
- 3.5.3 Internal meta-analysis
- 4. Results
- 4.1 Replication analysis
- 4.2 Meta-analysis of the machine–human correlation
- 4.3 Moderator analysis
- 5. Discussion
- 5.1 Success rate of replication
- 5.2 Meta-analytic approach to replication
- 5.3 Methodological implications
- 5.4 Practical implications
- 6. Conclusion
- Notes
- Supplemental material
- Notes
- Supplemental material
References
Anderson, S. F. & Maxwell, S. E. (2016). There’s
more than one way to conduct a replication study: Beyond statistical
significance. Psychological
Methods 21 (1), 1–12.
Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., Fiedler, S., Funder, D. C., Kliegl, R., Nosek, B. A., Perugini, M., Roberts, B. W., Schmitt, M., van Aken, M. A. G., Weber, H. & Wicherts, J. M. (2013). Recommendations
for increasing replicability in psychology. European Journal of
Personality 27 (2), 108–119.
Banerjee, S. & Lavie, A. (2005). METEOR:
An automatic metric for MT evaluation with improved correlation with human
judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for
Machine Translation and/or Summarization, 65–72. [URL]
Bonett, D. G. (2012). Replication-extension
studies. Current Directions in Psychological
Science 21 (6), 409–412.
Borenstein, M., Hedges, L. V., Higgins, J. P. T. & Rothstein, H. R. (2009). Introduction
to meta-analysis. Chichester: John Wiley & Sons.
Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., Grange, J. A., Perugini, M., Spies, J. R. & van’t Veer, A. (2014). The
Replication Recipe: What makes for a convincing replication? Journal of Experimental Social
Psychology 50, 217–224.
Braver, S. L., Thoemmes, F. J. & Rosenthal, R. (2014). Continuously
cumulating meta-analysis and replicability. Perspectives on Psychological
Science 9 (3), 333–342.
Cheung, M. W-L. (2019). A guide to conducting a
meta-analysis with non-independent effect sizes. Neuropsychology
Review 29, 387–396.
Cochran, W. G. (1950). The
comparison of percentages in matched
samples. Biometrika 37 (3/4), 256–266.
Coughlin, D. (2003). Correlating
automated and human assessments of machine translation quality. Proceedings of Machine
Translation Summit IX: Papers. [URL]
Cumming, G. (2012). Understanding
the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.
Doddington, G. (2002). Automatic
evaluation of machine translation quality using N-gram co-occurrence statistics. Proceedings of
the Second International Conference on Human Language Technology
Research, 138–145.
Easley, R. W., Madden, C. S. & Dunn, M. G. (2000). Conducting
marketing science: The role of replication in the research process. Journal of Business
Research 48 (1), 83–92.
Frankenberg-Garcia, A. (2022). Can
a corpus-driven lexical analysis of human and machine translation unveil discourse features that set them
apart? Target 34 (2), 278–308.
Ghiselli, S. (2022). Working
memory tasks in interpreting studies: A meta-analysis. Translation, Cognition &
Behavior 5 (1), 50–83.
Gile, D. (1990). Scientific
research vs. personal theories in the investigation of
interpretation. In L. Gran & C. Taylor (Eds.), Aspects
of applied and experimental research on conference
interpretation. Udine: Campanotto, 28–41.
Goh, J. X., Hall, J. A. & Rosenthal, R. (2016). Mini
meta-analysis of your own studies: Some arguments on why and a primer on how. Social and
Personality Psychology
Compass 10 (10), 535–549.
Han, C. & Lu, X-L. (2021). Interpreting
quality assessment re-imagined: The synergy between human and machine scoring. Interpreting and
Society 1 (1), 70–90.
Han, C. & Lu, X-L. (2023). Can
automated machine translation evaluation metrics be used to assess students’ interpretation in the language learning
classroom? Computer Assisted Language
Learning 36 (5/6), 1064–1087.
Han, C. & Lu, X-L. (2025). Beyond
BLEU: Repurposing neural-based metrics to assess interlingual interpreting in tertiary-level language learning
settings. Research Methods in Applied
Linguistics 4 (1), 100184.
Han, C. & Wang, Y-Q. (2025). Conducting
replication in translation and interpreting studies: Stakeholders’ perceptions, practices, and
expectations. Target 37 (3), 444–484.
Han, C. & Yang, L-Y. (2023). Relating
utterance fluency to perceived fluency of interpreting: A partial replication and a mini
meta-analysis. Translation and Interpreting
Studies 18 (3), 421–447.
Higgins, J. P. T. & Thompson, S. G. (2002). Quantifying
heterogeneity in a meta-analysis. Statistics in
Medicine 21 (11), 1539–1558.
Hoeppner, S. (2019). A
note on replication analysis. International Review of Law and
Economics 59, 98–102.
Liu, M-H. (2016). Putting
the horse before the cart: Righting the experimental approach in interpreting
studies. In C. Bendazzoli & C. Monacelli (Eds.), Addressing
methodological challenges in interpreting studies research. Newcastle upon Tyne: Cambridge Scholars, 87–105.
Liu, Y-B. & Zhang, W. (2022). Exploring
the predictive validity of an interpreting aptitude test battery: An approximate
replication. Interpreting 24 (2), 279–308.
Loper, E. & Bird, S. (2002). NLTK:
The Natural Language Toolkit. Proceedings of the ACL-02 Workshop on Effective Tools and
Methodologies for Teaching Natural Language Processing and Computational
Linguistics, 63–70.
López-López, J. A., Page, M. J., Lipsey, M. W. & Higgins, J. P. T. (2018). Dealing
with effect size multiplicity in systematic reviews and meta-analyses. Research Synthesis
Methods 9 (3), 336–351.
Lu, X-L. & Han, C. (2023). Automatic
assessment of spoken-language interpreting based on machine-translation evaluation metrics: A multi-scenario exploratory
study. Interpreting 25 (1), 109–143.
McShane, B. B. & Böckenholt, U. (2017). Single-paper
meta-analysis: Benefits for study summary, theory testing, and replicability. Journal of
Consumer
Research 43 (6), 1048–1063.
Mellinger, C. D. & Hanson, T. A. (2017). Quantitative
research methods in translation and interpreting
studies. Abingdon: Routledge.
Mellinger, C. D. & Hanson, T. A. (2019). Meta-analyses
of simultaneous interpreting and working
memory. Interpreting 21 (2), 165–195.
Mellinger, C. D. & Hanson, T. A. (2020). Meta-analysis
and replication in interpreting
studies. Interpreting 22 (1), 140–149.
Olalla-Soler, C. (2020). Practices
and attitudes toward replication in empirical translation and interpreting
studies. Target 32 (1), 3–36.
Open Science
Collaboration (2015). Estimating the reproducibility of psychological
science. Science 349 (6251).
Papineni, K., Roukos, S., Ward, T. & Zhu, W-J. (2002). BLEU:
A method for automatic evaluation of machine translation. Proceedings of the 40th Annual
Meeting of the Association for Computational
Linguistics, 311–318. [URL]
Pöchhacker, F. (2011). Replication
in research on quality in conference interpreting. T&I
Review 1, 35–57.
Rosenthal, R. (1997). Some
issues in the replication of social science research. Labour
Economics 4 (2), 121–123.
Schenker, N. & Gentleman, J. F. (2001). On
judging the significance of differences by examining the overlap between confidence
intervals. The American
Statistician 55, 182–186.
Valentine, J. C., Biglan, A., Boruch, R. F., Castro, F. G., Collins, L. M., Flay, B. R., Kellam, S., Mościcki, E. K. & Schinke, S. P. (2011). Replication
in prevention science. Prevention
Science 12, 103–117.
Van den Noortgate, W., López-López, J. A., Marín-Martínez, F. & Sánchez-Meca, J. (2013). Three-level
meta-analysis of dependent effect sizes. Behavior Research
Methods 45, 576–594.
Viechtbauer, W. (2010). Conducting
meta-analyses in R with the metafor package. Journal of Statistical
Software 36 (3), 1–48.
