Article published in: Interpreting
Vol. 28:1 (2026), pp. 58–90
Applying n-gram-based evaluation metrics to assess human interpreting
A battery of replications with internal meta‑analysis
Published online: 20 November 2025
https://doi.org/10.1075/intp.00127.han
Abstract
A number of recent studies have employed n-gram-based machine-translation evaluation metrics such as BLEU to automatically assess human interpreting. A major limitation of this
research lies in the non-probabilistic sampling of a limited number of renditions. Consequently, the correlation coefficients
calculated between machine and human assessments, which serve as a proxy for machine–human parity, lack generalizability. Against
this background, we conducted a battery of replications of Han and Lu (2023) in order
to evaluate the efficacy of three n-gram-based automated metrics (BLEU, NIST and METEOR) in the assessment of
interpreting. Our replications are based on a self-curated corpus of 1,695 interpretations across different
modes and directions of interpreting, based on various source speeches. Following the replications, we also conducted a four-level
meta-analysis to produce an overall estimate of the machine–human correlation and to identify potential moderators. Our main
findings are that the replication success rate for BLEU was above 95%, followed by NIST (at about 70%) and METEOR (at about 40%);
the overall machine–human correlation was rs = .638; and the three significant moderators identified
were the direction of interpreting, the reliability of human scoring and the type of automated metrics. Our study has
methodological and practical implications for conducting interpreting research and assessment.
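To illustrate the kind of machine scoring the abstract describes, the sketch below computes a sentence-level BLEU score with NLTK (Loper & Bird 2002) and correlates machine scores with human ratings via Spearman's rho. This is a minimal illustration, not the authors' actual pipeline; the example sentences, the machine scores and the human ratings are invented.

```python
# Minimal sketch of n-gram-based machine scoring of an interpretation.
# The sentences and the ratings below are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

reference = "the delegates adopted the resolution without a vote".split()
candidate = "the delegates passed the resolution without voting".split()

# Smoothing avoids a zero score when some n-gram order has no matches.
bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)

# Correlate machine scores with human ratings for five hypothetical
# renditions (Spearman's rho as a proxy for machine-human parity).
machine_scores = [0.42, 0.31, 0.55, 0.18, 0.47]
human_ratings = [3.5, 2.0, 4.0, 1.5, 3.0]
rho, p_value = spearmanr(machine_scores, human_ratings)

print(f"sentence BLEU = {bleu:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

NIST (`nltk.translate.nist_score`) and METEOR (`nltk.translate.meteor_score`, which additionally requires WordNet data) can be plugged into the same workflow.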
Article outline
- 1. Introduction
- 2. Literature review
- 2.1 Approaches to automatic assessment of machine translation: An overview
- 2.2 Exploratory research on automatic assessment of interpreting
- 2.3 A battery of replications
- 2.4 Internal meta-analysis
- 2.5 Research questions
- 3. Method
- 3.1 The target study for replication
- 3.2 Overview of the database used for replication
- 3.3 A battery of replications
- 3.4 Computation of n-gram-based evaluation metrics
- 3.5 Data analysis
- 3.5.1 Correlation analysis
- 3.5.2 Replication analysis
- 3.5.3 Internal meta-analysis
- 4. Results
- 4.1 Replication analysis
- 4.2 Meta-analysis of the machine–human correlation
- 4.3 Moderator analysis
- 5. Discussion
- 5.1 Success rate of replication
- 5.2 Meta-analytic approach to replication
- 5.3 Methodological implications
- 5.4 Practical implications
- 6. Conclusion
- Notes
- Supplemental material
- Notes
- Supplemental material
References
Anderson, S. F. & Maxwell, S. E. (2016). There’s
more than one way to conduct a replication study: Beyond statistical
significance. Psychological
Methods 21 (1), 1–12.
Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., Fiedler, S., Funder, D. C., Kliegl, R., Nosek, B. A., Perugini, M., Roberts, B. W., Schmitt, M., van Aken, M. A. G., Weber, H. & Wicherts, J. M. (2013). Recommendations
for increasing replicability in psychology. European Journal of
Personality 27 (2), 108–119.
Banerjee, S. & Lavie, A. (2005). METEOR:
An automatic metric for MT evaluation with improved correlation with human
judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for
Machine Translation and/or Summarization, 65–72. [URL]
Bonett, D. G. (2012). Replication-extension
studies. Current Directions in Psychological
Science 21 (6), 409–412.
Borenstein, M., Hedges, L. V., Higgins, J. P. T. & Rothstein, H. R. (2009). Introduction
to meta-analysis. Chichester: John Wiley & Sons.
Brandt, M. J., IJzerman, H., Dijksterhuis, A., Farach, F. J., Geller, J., Giner-Sorolla, R., Grange, J. A., Perugini, M., Spies, J. R. & van’t Veer, A. (2014). The
Replication Recipe: What makes for a convincing replication? Journal of Experimental Social
Psychology 50, 217–224.
Braver, S. L., Thoemmes, F. J. & Rosenthal, R. (2014). Continuously
cumulating meta-analysis and replicability. Perspectives on Psychological
Science 9 (3), 333–342.
Cheung, M. W-L. (2019). A guide to conducting a
meta-analysis with non-independent effect sizes. Neuropsychology
Review 29, 387–396.
Cochran, W. G. (1950). The
comparison of percentages in matched
samples. Biometrika 37 (3/4), 256–266.
Coughlin, D. (2003). Correlating
automated and human assessments of machine translation quality. Proceedings of Machine
Translation Summit IX: Papers. [URL]
Cumming, G. (2012). Understanding
the new statistics: Effect sizes, confidence intervals, and meta-analysis. New York: Routledge.
Doddington, G. (2002). Automatic
evaluation of machine translation quality using N-gram co-occurrence statistics. Proceedings of
the Second International Conference on Human Language Technology
Research, 138–145.
Easley, R. W., Madden, C. S. & Dunn, M. G. (2000). Conducting
marketing science: The role of replication in the research process. Journal of Business
Research 48 (1), 83–92.
Frankenberg-Garcia, A. (2022). Can
a corpus-driven lexical analysis of human and machine translation unveil discourse features that set them
apart? Target 34 (2), 278–308.
Ghiselli, S. (2022). Working
memory tasks in interpreting studies: A meta-analysis. Translation, Cognition &
Behavior 5 (1), 50–83.
Gile, D. (1990). Scientific
research vs. personal theories in the investigation of
interpretation. In L. Gran & C. Taylor (Eds.), Aspects
of applied and experimental research on conference
interpretation. Udine: Campanotto, 28–41.
Goh, J. X., Hall, J. A. & Rosenthal, R. (2016). Mini
meta-analysis of your own studies: Some arguments on why and a primer on how. Social and
Personality Psychology
Compass 10 (10), 535–549.
Han, C. & Lu, X-L. (2021). Interpreting
quality assessment re-imagined: The synergy between human and machine scoring. Interpreting and
Society 1 (1), 70–90.
Han, C. & Lu, X-L. (2023). Can
automated machine translation evaluation metrics be used to assess students’ interpretation in the language learning
classroom? Computer Assisted Language
Learning 36 (5/6), 1064–1087.
Han, C. & Lu, X-L. (2025). Beyond
BLEU: Repurposing neural-based metrics to assess interlingual interpreting in tertiary-level language learning
settings. Research Methods in Applied
Linguistics 4 (1), 100184.
Han, C. & Wang, Y-Q. (2025). Conducting
replication in translation and interpreting studies: Stakeholders’ perceptions, practices, and
expectations. Target 37 (3), 444–484.
Han, C. & Yang, L-Y. (2023). Relating
utterance fluency to perceived fluency of interpreting: A partial replication and a mini
meta-analysis. Translation and Interpreting
Studies 18 (3), 421–447.
Higgins, J. P. T. & Thompson, S. G. (2002). Quantifying
heterogeneity in a meta-analysis. Statistics in
Medicine 21 (11), 1539–1558.
Hoeppner, S. (2019). A
note on replication analysis. International Review of Law and
Economics 59, 98–102.
Liu, M-H. (2016). Putting
the horse before the cart: Righting the experimental approach in interpreting
studies. In C. Bendazzoli & C. Monacelli (Eds.), Addressing
methodological challenges in interpreting studies research. Newcastle upon Tyne: Cambridge Scholars, 87–105.
Liu, Y-B. & Zhang, W. (2022). Exploring
the predictive validity of an interpreting aptitude test battery: An approximate
replication. Interpreting 24 (2), 279–308.
Loper, E. & Bird, S. (2002). NLTK:
The Natural Language Toolkit. Proceedings of the ACL-02 Workshop on Effective Tools and
Methodologies for Teaching Natural Language Processing and Computational
Linguistics, 63–70.
López-López, J. A., Page, M. J., Lipsey, M. W. & Higgins, J. P. T. (2018). Dealing
with effect size multiplicity in systematic reviews and meta-analyses. Research Synthesis
Methods 9 (3), 336–351.
Lu, X-L. & Han, C. (2023). Automatic
assessment of spoken-language interpreting based on machine-translation evaluation metrics: A multi-scenario exploratory
study. Interpreting 25 (1), 109–143.
McShane, B. B. & Böckenholt, U. (2017). Single-paper
meta-analysis: Benefits for study summary, theory testing, and replicability. Journal of
Consumer
Research 43 (6), 1048–1063.
Mellinger, C. D. & Hanson, T. A. (2017). Quantitative
research methods in translation and interpreting
studies. Abingdon: Routledge.
Mellinger, C. D. & Hanson, T. A. (2019). Meta-analyses
of simultaneous interpreting and working
memory. Interpreting 21 (2), 165–195.
Mellinger, C. D. & Hanson, T. A. (2020). Meta-analysis
and replication in interpreting
studies. Interpreting 22 (1), 140–149.
Olalla-Soler, C. (2020). Practices
and attitudes toward replication in empirical translation and interpreting
studies. Target 32 (1), 3–36.
Open Science
Collaboration (2015). Estimating the reproducibility of psychological
science. Science 349 (6251).
Papineni, K., Roukos, S., Ward, T. & Zhu, W-J. (2002). BLEU:
A method for automatic evaluation of machine translation. Proceedings of the 40th Annual
Meeting of the Association for Computational
Linguistics, 311–318. [URL]
Pöchhacker, F. (2011). Replication
in research on quality in conference interpreting. T&I
Review 1, 35–57.
Rosenthal, R. (1997). Some
issues in the replication of social science research. Labour
Economics 4 (2), 121–123.
Schenker, N. & Gentleman, J. F. (2001). On
judging the significance of differences by examining the overlap between confidence
intervals. The American
Statistician 55, 182–186.
Valentine, J. C., Biglan, A., Boruch, R. F., Castro, F. G., Collins, L. M., Flay, B. R., Kellam, S., Mościcki, E. K. & Schinke, S. P. (2011). Replication
in prevention science. Prevention
Science 12, 103–117.
Van den Noortgate, W., López-López, J. A., Marín-Martínez, F. & Sánchez-Meca, J. (2013). Three-level
meta-analysis of dependent effect sizes. Behavior Research
Methods 45, 576–594.
Viechtbauer, W. (2010). Conducting
meta-analyses in R with the metafor package. Journal of Statistical
Software 36 (3), 1–48.
