Article published in: Translation and Interpreting Studies
Vol. 19:3 (2024), pp. 405–431
Examining reliability in translation quality assessment
An empirical study of paper- and computer-based scoring
Published online: 13 May 2025
https://doi.org/10.1075/tis.23035.sun
Abstract
This study examines the differences between paper- and computer-based translation quality assessment, focusing on
score reliability, variability, scoring speed, and raters’ preferences. In a within-subjects design, 27 raters used a
holistic scoring method to assess 29 translations presented in both handwritten and word-processed formats. The findings reveal
comparable translation quality ratings across both modes, with paper-based scoring showing greater inter-rater disagreement and
being affected by handwriting legibility. Paper-based scoring was generally faster, though computer-based scoring demonstrated
less variability in inter-rater reliability. Raters nonetheless preferred paper-based scoring, citing its perceived
speed, greater flexibility in annotating, and lower eye strain. The study highlights the importance of comprehensive rater training and
calibration to mitigate biases and non-uniform severity, as well as the adoption of detailed scoring rubrics to ensure consistent
assessment across modes. The article offers insights into refining computer-based scoring systems, including improvements
to annotation functionality and ergonomics.
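The abstract's comparison of inter-rater reliability across scoring modes can be made concrete with a small worked example. The sketch below is a minimal illustration, not the study's analysis code: it assumes ICC(2,1) (two-way random effects, absolute agreement, single rater) as the reliability index, which the abstract does not specify, and it simulates placeholder score matrices shaped like the study's design (29 translations × 27 raters) purely to show the computation.

```python
# A minimal sketch (not the authors' analysis) of comparing inter-rater
# reliability across scoring modes via ICC(2,1), computed from a two-way
# ANOVA decomposition. All data below are simulated placeholders.
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is an (n_translations, n_raters) matrix, one holistic
    score per cell.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)  # per-translation means
    col_means = scores.mean(axis=0)  # per-rater means

    # Two-way ANOVA sums of squares and mean squares
    ss_rows = k * np.sum((row_means - grand) ** 2)   # translations (targets)
    ss_cols = n * np.sum((col_means - grand) ** 2)   # raters
    ss_total = np.sum((scores - grand) ** 2)
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual

    # Shrout & Fleiss ICC(2,1)
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))

# Simulated scores: 29 translations, 27 raters; the paper-based matrix is
# given larger rater noise to mirror the reported greater inter-rater
# disagreement in that mode (illustrative values only).
rng = np.random.default_rng(0)
true_quality = rng.uniform(50, 90, size=(29, 1))
paper = true_quality + rng.normal(0, 6, size=(29, 27))
screen = true_quality + rng.normal(0, 4, size=(29, 27))

print(f"paper-based    ICC(2,1): {icc_2_1(paper):.3f}")
print(f"computer-based ICC(2,1): {icc_2_1(screen):.3f}")
```

Because the simulated paper-based matrix carries more rater noise relative to the variance in true translation quality, its ICC comes out lower, which is the pattern of results the abstract describes; the absolute values here carry no empirical meaning.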
Article outline
- Introduction
- Review of related work
- Comparability of scores across modes
- Handwriting legibility effects
- Efficiency in paper-based versus computer-based scoring
- Raters’ preferences and attitudes toward scoring modes
- Method
- Participants and materials
- Experiment procedure
- Data analysis
- Results and discussion
- Within-mode inter-rater reliability
- Inter-rater agreement and consistency within each mode
- Variability in rater severity within each mode
- Cross-mode comparisons of translation scores
- Cross-mode reliability and agreement
- Score differences and limits of agreement
- Implications for high-stakes decision-making
- Correlational analysis between handwriting legibility and paper-based scoring
- Comparison of scoring speed and performance across scoring modes
- Raters’ preferences and experiences with two modes
- Conclusion
- Acknowledgments