Article published in: Translation and Interpreting Studies
Vol. 19:3 (2024), pp. 405–431
Examining reliability in translation quality assessment
An empirical study of paper- and computer-based scoring
Published online: 13 May 2025
https://doi.org/10.1075/tis.23035.sun
Abstract
This study examines the differences between paper- and computer-based translation quality assessment, focusing on
score reliability, variability, scoring speed, and raters’ preferences. In a within-subjects design, 27 raters used a
holistic scoring method to assess 29 translations presented in both handwritten and word-processed formats. The findings reveal
comparable translation quality ratings across both modes, with paper-based scoring showing greater inter-rater disagreement and
being affected by handwriting legibility. Paper-based scoring was generally faster, though computer-based scoring demonstrated
less variability in inter-rater reliability. Raters nonetheless preferred paper-based scoring, citing its perceived
speed, greater flexibility in annotating, and lower eye strain. The study highlights the importance of comprehensive rater training and
calibration to mitigate biases and non-uniform severity, as well as the adoption of detailed scoring rubrics to ensure consistent
assessment across modes. The article offers insights into refining computer-based scoring systems, including improvements
to annotation functionality and ergonomics.
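The abstract's comparison of inter-rater reliability across scoring modes can be made concrete with a small worked example. The sketch below is a minimal illustration, not the study's analysis code: it assumes ICC(2,1) (two-way random effects, absolute agreement, single rater) as the reliability index, which the abstract does not specify, and it simulates placeholder score matrices shaped like the study's design (29 translations × 27 raters) purely to show the computation.

```python
# A minimal sketch (not the authors' analysis) of comparing inter-rater
# reliability across scoring modes via ICC(2,1), computed from a two-way
# ANOVA decomposition. All data below are simulated placeholders.
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is an (n_translations, n_raters) matrix, one holistic
    score per cell.
    """
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)  # per-translation means
    col_means = scores.mean(axis=0)  # per-rater means

    # Two-way ANOVA sums of squares and mean squares
    ss_rows = k * np.sum((row_means - grand) ** 2)   # translations (targets)
    ss_cols = n * np.sum((col_means - grand) ** 2)   # raters
    ss_total = np.sum((scores - grand) ** 2)
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual

    # Shrout & Fleiss ICC(2,1)
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))

# Simulated scores: 29 translations, 27 raters; the paper-based matrix is
# given larger rater noise to mirror the reported greater inter-rater
# disagreement in that mode (illustrative values only).
rng = np.random.default_rng(0)
true_quality = rng.uniform(50, 90, size=(29, 1))
paper = true_quality + rng.normal(0, 6, size=(29, 27))
screen = true_quality + rng.normal(0, 4, size=(29, 27))

print(f"paper-based    ICC(2,1): {icc_2_1(paper):.3f}")
print(f"computer-based ICC(2,1): {icc_2_1(screen):.3f}")
```

Because the simulated paper-based matrix carries more rater noise relative to the variance in true translation quality, its ICC comes out lower, which is the pattern of results the abstract describes; the absolute values here carry no empirical meaning.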
Article outline
- Introduction
- Review of related work
- Comparability of scores across modes
- Handwriting legibility effects
- Efficiency in paper-based versus computer-based scoring
- Raters’ preferences and attitudes toward scoring modes
- Method
- Participants and materials
- Experiment procedure
- Data analysis
- Results and discussion
- Within-mode inter-rater reliability
- Inter-rater agreement and consistency within each mode
- Variability in rater severity within each mode
- Cross-mode comparisons of translation scores
- Cross-mode reliability and agreement
- Score differences and limits of agreement
- Implications for high-stakes decision-making
- Correlational analysis between handwriting legibility and paper-based scoring
- Comparison of scoring speed and performance across scoring modes
- Raters’ preferences and experiences with two modes
- Conclusion
- Acknowledgments