Investigating rater severity/leniency in interpreter performance testing: A multifaceted Rasch measurement approach

Han, Chao

doi:10.1075/intp.17.2.05han

Article published in: Interpreting
Vol. 17:2 (2015) ► pp.255–283


Chao Han | Macquarie University

Published online: 3 September 2015

https://doi.org/10.1075/intp.17.2.05han

Rater-mediated performance assessment (RMPA) is a critical component of interpreter certification testing systems worldwide. Given the acknowledged rater variability in RMPA and the high-stakes nature of certification testing, it is crucial to ensure rater reliability in interpreter certification performance testing (ICPT). However, a review of current ICPT practice indicates that rigorous research on rater reliability is lacking. Against this background, the present study reports on the use of multifaceted Rasch measurement (MFRM) to identify the degree of severity/leniency in different raters’ assessments of simultaneous interpretations (SIs) by 32 interpreters in an experimental setting. Nine raters specifically trained for the purpose were asked to evaluate four English-to-Chinese SIs by each of the interpreters, using three 8-point rating scales (information content, fluency, expression). The source texts differed in speed and in the speaker’s accent (native vs non-native). Rater-generated scores were then subjected to MFRM analysis, using the FACETS program. The following general trends emerged: 1) homogeneity statistics showed that not all raters were equally severe overall; and 2) bias analyses showed that a relatively large proportion of the raters had significantly biased interactions with the interpreters and the assessment criteria. Implications for practical rating arrangements in ICPT, and for rater training, are discussed.
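As a rough illustration of what a severity/leniency estimate means in a many-facet Rasch model, the sketch below computes rating-category probabilities for one interpreter under a lenient and a severe rater. It assumes a generic rating-scale formulation, in which the log-odds of category k over category k-1 equal interpreter ability minus rater severity minus criterion difficulty minus a threshold. The function names and all numeric values are hypothetical; they are not taken from the study or from FACETS output, and the model specification actually used in the article is not given in this abstract.

```python
import math


def category_probabilities(ability, severity, difficulty, thresholds):
    """Return P(score = k) for k = 0..len(thresholds) under a rating-scale MFRM.

    All arguments are in logits. This is a generic textbook form,
    assumed here for illustration only.
    """
    # Log-numerator of category k is the running sum of
    # (ability - severity - difficulty - tau_j) for j = 1..k; category 0 is 0.
    log_numerators = [0.0]
    for tau in thresholds:
        step = ability - severity - difficulty - tau
        log_numerators.append(log_numerators[-1] + step)

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_score(probabilities):
    """Expected category index (0-based) given category probabilities."""
    return sum(k * p for k, p in enumerate(probabilities))


# Hypothetical values: an 8-point scale has 7 thresholds between categories.
thresholds = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
lenient = category_probabilities(0.5, -1.0, 0.0, thresholds)  # lenient rater
severe = category_probabilities(0.5, 1.0, 0.0, thresholds)    # severe rater

print("expected score, lenient rater:", round(expected_score(lenient), 2))
print("expected score, severe rater: ", round(expected_score(severe), 2))
```

Under these assumptions the same interpreter, rated on the same criterion, receives a lower expected score from the more severe rater, which is the kind of rater effect that severity estimates are meant to quantify.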

Keywords: performance testing, multifaceted Rasch measurement, rater severity/leniency, interpreter certification, rater training, rater variability

References (68)

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika 43 (4), 561–573.

Angelelli, C. (2009). Using a rubric to assess translation ability: Defining the construct. In C. Angelelli & H.E. Jacobson (Eds.), Testing and assessment in translation and interpreting studies. Amsterdam: John Benjamins, 13–47.

Arjona-Tseng, E. (1993). A psychometric approach to the selection of translation and interpreting students in Taiwan. Perspectives 1 (1), 91–104.

Arocha, I.S. & Joyce, L. (2013). Patient safety, professionalization, and reimbursement as primary drivers for National Medical Interpreter Certification in the United States. Translation & Interpreting 5 (1), 127–142.

Bond, T.G. & Fox, C.M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). London: Lawrence Erlbaum.

Bonk, W.J. & Ockey, G.J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing 20 (1), 89–110.

Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing 12 (1), 1–15.

Campbell, S. & Hale, S. (2003). Translation and interpreting assessment in the context of educational measurement. In G. Anderman & M. Rogers (Eds.), Translation today: Trends and perspectives. Clevedon: Multilingual Matters, 205–224.

Certification Commission for Healthcare Interpreters (2010). Job task analysis study and results. [URL] (accessed 22 May 2015).

Certification Commission for Healthcare Interpreters (2011). Technical Report on the Development and Pilot Testing of the CCHI Examinations. [URL] (accessed 22 May 2015).

Certification Commission for Healthcare Interpreters (2012). Technical Report on the Development and Pilot Testing of the Certified Healthcare Interpreter™ (CHI™) Examination for Arabic and Mandarin. [URL] (accessed 22 May 2015).

Certification Commission for Healthcare Interpreters (2014). Candidate’s Examination Handbook. [URL] (accessed 22 May 2015).

Clifford, A. (2005). Putting the exam to the test: Psychometric validation and interpreter certification. Interpreting 7 (1), 97–131.

Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly 2 (3), 197–221.

Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing 25 (2), 155–185.

Eckes, T. (2011). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Frankfurt am Main: Peter Lang.

Elder, C., Barkhuizen, G., Knoch, U. & von Randow, J. (2007). Evaluating rater responses to an online training program for L2 writing assessment. Language Testing 24 (1), 37–64.

Feng, J.Z. (2005). 论口译测试的规范化. [Towards the standardization of interpretation testing]. 外语研究, 891, 54–58.

Feuerle, L. (2013). Testing interpreters: Developing, administering, and scoring court interpreter certification exams. Translation & Interpreting 5 (1), 80–93.

Fox, C. & Jones, J. (1998). Uses of Rasch modeling in counseling psychology research. Journal of Counseling Psychology 45 (1), 30–45.

Gile, D. (1995). Basic concepts and models for interpreter and translator training. Amsterdam: John Benjamins.

Green, R. (2013). Statistical analysis for language testers. Basingstoke: Palgrave Macmillan.

Hale, S., Garcia, I., Hlavac, J., Kim, M., Lai, M., Turner, B. & Slatyer, H. (2012). Development of a conceptual overview for a new model for NAATI standards, testing and assessment. Sydney, Australia. [URL] (accessed 22 May 2015).

Han, C. & Mehdi, R. (2015). The effects of speech rate and accent on interpreter performance quality: A mixed-methods replication study. Manuscript submitted for publication.

Henning, G. (1992). Dimensionality and construct validity of language tests. Language Testing 9 (1), 1–11.

Hlavac, J. (2013). A cross-national overview of translator and interpreter certification procedures. Translation & Interpreting 5 (1), 32–65.

IoL Educational Trust. (2010). Diploma in Public Service Interpreting: Handbook for candidates. London, UK. [URL] (accessed 22 May 2015).

Jacobs, E.A., Lauderdale, D.S., Meltzer, D., Shorey, J.M., Levinson, W. & Thisted, R.A. (2001). Impact of interpreter services on delivery of health care to limited-English-proficient patients. Journal of General Internal Medicine 16 (7), 468–474.

Knoch, U. (2011). Investigating the effectiveness of individualized feedback to rating behavior – a longitudinal study. Language Testing 28 (2), 179–200.

Kondo-Brown, K. (2002). A FACETS analysis of rater bias in measuring Japanese second language writing performance. Language Testing 19 (1), 3–31.

Linacre, J.M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions 16 (2), 878.

Linacre, J.M. (2013). A user’s guide to FACETS: Program manual 3.71.2. [URL] (accessed 22 May 2015).

Liu, M. (2013). Design and analysis of Taiwan’s interpretation certification examination. In D. Tsagari & R. van Deemter (Eds.), Assessment issues in language translation and interpreting. Frankfurt: Peter Lang, 163–178.

Lu, M., Liu, C. & Gong, X.F. (2007). 全国翻译专业资格(水平)考试英语口译试题命制一致性研究报告. [How to maintain consistency in CATTI’s interpretation tests: A research report]. 中国翻译, 51, 57–61.

Lumley, T. & McNamara, T.F. (1995). Rater characteristics and rater bias: implications for training. Language Testing 12 (1), 54–71.

Lunz, M.E. & Stahl, J.A. (1990). Judge consistency and severity across grading periods. Evaluation and the Health Professions 13 (4), 425–444.

Lynch, B.K. & McNamara, T.F. (1998). Using G-theory and many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing 15 (2), 158–180.

Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika 47 (2), 149–174.

McNamara, T.F. (1996). Measuring second language performance. London: Longman.

McNamara, T.F. & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing 29 (4), 555–576.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed.). New York: American Council on Education and Macmillan, 13–103.

Mortensen, D. (1998). Establishing a scheme for interpreter certification: The Norwegian experience. [URL] (accessed 22 May 2015).

Mortensen, D. (2001). Measuring quality in interpreting: A report on the Norwegian Interpreter Certification Examination (NICE). Oslo, Norway. [URL] (accessed 22 May 2015).

Myford, C.M. & Wolfe, E.W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement 4 (4), 386–422.

National Accreditation Authority for Translators and Interpreters. (2013). INT Project discussion paper. [URL] (accessed 22 May 2015).

National Association of the Deaf (2014). NAD and RID releases the NAD-RID National Interpreter Certification (NIC) Credential Validity, Reliability, & Candidate Performance Report. [URL] (accessed 22 May 2015).

National Board of Certification for Medical Interpreters (2014). The National Board of Certification for Medical Interpreters: Certified Medical Interpreter candidate handbook. [URL] (accessed 22 May 2015).

National Center for State Courts (2013). Federal Court Interpreter Certification Examination for Spanish/English: Examinee handbook. [URL] (accessed 22 May 2015).

Office of China Accreditation Tests for Translators and Interpreters. (2005). 二级口译英语同声传译类考试大纲. 外文出版社. [Syllabus of CATTI Level-two Simultaneous Interpreting Test]. Beijing: Foreign Languages Press.

PSI Services LLC (2010). Development and validation of oral and written examinations for medical interpreter certification: Technical report. Burbank, California, USA. [URL] (accessed 22 May 2015).

PSI Services LLC (2013). Development and validation of oral examinations for Medical Interpreter Certification: Mandarin, Russian, Cantonese, Korean, and Vietnamese forms. [URL] (accessed 22 May 2015).

Roat, C.E. (2006). Certification of health care interpreters in the United States: A primer, a status report and considerations for national certification. Los Angeles, CA. [URL] (accessed 22 May 2015).

Russell, D. & Malcolm, K. (2009). Assessing ASL–English interpreters: The Canadian model of national certification. In C.V. Angelelli & H.E. Jacobson (Eds.), Testing and assessment in translation and interpreting studies: A call for dialogue between research and practice. Amsterdam: John Benjamins, 331–376.

Schaefer, E. (2008). Rater bias pattern in an EFL writing assessment. Language Testing 25 (4), 465–493.

Schumacker, R.E. (1999). Many-facet Rasch analysis with crossed, nested and mixed designs. Journal of Outcome Measurement 3 (4), 323–338.

South African Translators’ Institute (2007a). Guidelines: SASL interpreter accreditation testing. [URL] (accessed 22 May 2015).

South African Translators’ Institute (2007b). Guidelines: Simultaneous interpreter accreditation testing. [URL] (accessed 22 May 2015).

Stansfield, C.W. & Hewitt, W. (2005). Examining the predictive validity of cut scores on a screening test for court interpreters. Language Testing 22 (2), 1–25.

Sudweeks, R., Reeve, S. & Bradshaw, W.S. (2005). A comparison of generalizability theory and many-facet Rasch measurement in an analysis of college sophomore writing. Assessing Writing 91, 239–261.

Turner, B., Lai, M. & Huang, N. (2010). Error deduction and descriptors – a comparison of two methods of translation test assessment. Translation & Interpreting 2 (1), 11–23.

Upshur, J.A. & Turner, C.E. (1999). Systematic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing 16 (1), 82–111.

Vermeiren, H., Gucht, J.V. & De Bontridder, L. (2009). Standards as critical success factors in assessments: Certifying social interpreters in Flanders, Belgium. In C.V. Angelelli & H.E. Jacobson (Eds.), Testing and assessment in translation and interpreting studies: A call for dialogue between research and practice. Amsterdam: John Benjamins, 291–330.

Weigle, S.C. (1994). Effects of training on raters of ESL compositions. Language Testing 11 (2), 197–223.

Weigle, S.C. (1998). Using FACETS to model rater training effects. Language Testing 15 (2), 263–287.

Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing 10 (3), 305–319.

Wu, S.C. (2010). Assessing simultaneous interpreting: A study on test reliability and examiners’ assessment behavior. PhD thesis, Newcastle University.

Youdelman, M. (2013). The development of certification for healthcare interpreters in the United States. Translation & Interpreting 5 (1), 114–126.

Yu, D.R. (2005). T&I labor market in China. Sydney, Australia. [URL] (accessed 22 May 2015).

Cited by (52)


Mei, Huan & Shanshan Yang

2026. Empowering Metacognitive Interpreter Training: How Standards‐Based Self‐Assessment Affects Metacognitive Strategy Use and Interpreting Performance. International Journal of Applied Linguistics 36:1 ► pp. 793 ff.

Yokouchi, Yuichiro, Kuangzhe Xu, Shuichi Takaki & Haruhiko Mitsunaga

2026. Modeling Rater Severity Drift in Single-Rater Performance Assessments Using Bayesian Hierarchical Methods. In Behavioural and Social Computing [Lecture Notes in Computer Science, 16431], ► pp. 435 ff.

Zhang, Xueni, Binghan Zheng, Rui Wang & Haoshen He

2026. Is interpreter advantage a gift or an effect of training? Cognitive changes and interpreting acquisition at the early stage of training. Bilingualism: Language and Cognition 29:1 ► pp. 223 ff.

Chen, Qionglu & Wei Su

2025. Comparing students’ reception of AI-based video feedback and written feedback: A Q methodological study. Innovations in Education and Teaching International ► pp. 1 ff.

Chen, Shirong & Chao Han

2025. The Role of Annotating in Peer Assessment of Interpreting: A Quasi-experimental Investigation. The Asia-Pacific Education Researcher 34:6 ► pp. 2109 ff.

Fan, Jiashun, Pingping Hu & Zhuxuan Zhao

2025. Comparing the cognitive structures of peer and teacher feedback in an interpreting course: An epistemic network analysis approach. Innovations in Education and Teaching International ► pp. 1 ff.

Han, Chao, Mengting Jiang & Qionglu Chen

2025. Rubricizing the assessment practice: A systematic review and meta-analysis of rubrics in rater-mediated assessment of language interpreting. Language Testing

Li, Xiaodong, Jie Yuan & Ying Xu

2025. Investigating the Decision-making Style of Translation Raters in Large-scale Language Tests. SAGE Open 15:3

Li, Yang, Xini Liao & Jia Jia

2025. Effects of raters’ professional backgrounds on assessing interpreting quality: An exploratory mixed-methods investigation into rater behavior. System 133 ► pp. 103772 ff.

Liu, Jing & Wei Su

2025. Using screen-cast to explore language learners’ engagement with teacher oral feedback across time. Language Teaching Research

Zhang, Qiuya & Youping Jing

2025. The impact of interpreting students’ gestures and speech content on speech fluency of consecutive interpreting. Frontiers in Psychology 16

Zhuang, Yajin, Liwen Chen & Wei Lun Wong

2025. Lost in translation: Decoding the errors in consecutive interpreting by Chinese EFL learners. PLOS One 20:12 ► pp. e0337758 ff.

Guo, Wei, Xun Guo, Junkang Huang & Sha Tian

2024. Modeling listeners’ perceptions of quality in consecutive interpreting: a case study of a technology interpreting event. Humanities and Social Sciences Communications 11:1

Kim, Sangki & Eunseok Ro

2024. Offering an olive branch: a study of dissenting rater's practices for resolving placement discrepancies. Linguistics and Education 80 ► pp. 101271 ff.

Liu, Yiguang & Junying Liang

2024. Multidimensional comparison of Chinese-English interpreting outputs from human and machine: Implications for interpreting education in the machine-translation age. Linguistics and Education 80 ► pp. 101273 ff.

Wind, Stefanie A. & Yuan Ge

2024. Detecting Rater Bias in Mixed-Format Assessments. Measurement: Interdisciplinary Research and Perspectives 22:1 ► pp. 20 ff.

Chen, Sijia & Jan-Louis Kruger

2023. The effectiveness of computer-assisted interpreting. Translation and Interpreting Studies 18:3 ► pp. 399 ff.

Chen, Sijia & Jan-Louis Kruger

2024. A computer-assisted consecutive interpreting workflow: training and evaluation. The Interpreter and Translator Trainer 18:3 ► pp. 380 ff.

Han, Chao & Xiaoqi Shang

2023. An item-based, Rasch-calibrated approach to assessing translation quality. Target. International Journal of Translation Studies 35:1 ► pp. 63 ff.

Liu, Jun, Meng Sun, Zile Liu & Yanhua Xu

2023. Pre-Service Teachers’ Instructional Innovation Capabilities: A Many-Faceted Rasch Model Analysis. Sage Open 13:4

Lu, Xiaolei & Chao Han

2023. Automatic assessment of spoken-language interpreting based on machine-translation evaluation metrics. Interpreting. International Journal of Research and Practice in Interpreting 25:1 ► pp. 109 ff.

Song, Shuxian & Dechao Li

2023. Aptitude for interpreting: the predictive value of cognitive fluency. The Interpreter and Translator Trainer 17:1 ► pp. 155 ff.

Zhao, Nan

2023. A validation study of a consecutive interpreting test using many-facet Rasch analysis. Frontiers in Communication 7

Chen, Jing, Huabo Yang & Chao Han

2022. Holistic versus analytic scoring of spoken-language interpreting: a multi-perspectival comparative analysis. The Interpreter and Translator Trainer 16:4 ► pp. 558 ff.

Chen, Sijia

2022. The process and product of note-taking and consecutive interpreting: empirical data from professionals and students. Perspectives 30:2 ► pp. 258 ff.

Sawyer, David B.

2022. Review of Chen & Han (2021): Testing and assessment of interpreting: Recent developments in China. Interpreting. International Journal of Research and Practice in Interpreting 24:1 ► pp. 155 ff.

Chen, Hua, Ying Wang & T. Pascal Brown

2021. The effects of topic familiarity on information completeness, fluency, and target language quality of student interpreters in Chinese–English consecutive interpreting. Across Languages and Cultures 22:2 ► pp. 176 ff.

Chen, Jing & Chao Han

2021. Testing and Assessment of Interpreting in China: An Overview. In Testing and Assessment of Interpreting [New Frontiers in Translation Studies], ► pp. 1 ff.

Han, Chao & Kerui An

2021. Using unfilled pauses to measure (dis)fluency in English-Chinese consecutive interpreting: in search of an optimal pause threshold(s). Perspectives 29:6 ► pp. 917 ff.

Han, Chao & Xiaolei Lu

2021. Interpreting quality assessment re-imagined: The synergy between human and machine scoring. Interpreting and Society 1:1 ► pp. 70 ff.

Han, Chao & Xiaolei Lu

2023. Can automated machine translation evaluation metrics be used to assess students’ interpretation in the language learning classroom? Computer Assisted Language Learning 36:5-6 ► pp. 1064 ff.

Han, Chao & Xiaolei Lu

2025. Beyond BLEU: Repurposing neural-based metrics to assess interlingual interpreting in tertiary-level language learning settings. Research Methods in Applied Linguistics 4:1 ► pp. 100184 ff.

Han, Chao, Rui Xiao & Wei Su

2021. Assessing the fidelity of consecutive interpreting. Interpreting. International Journal of Research and Practice in Interpreting 23:2 ► pp. 245 ff.

Han, Chao & Xiao Zhao

2021. Accuracy of peer ratings on the quality of spoken-language interpreting. Assessment & Evaluation in Higher Education 46:8 ► pp. 1299 ff.

Lamprianou, Iasonas, Dina Tsagari & Nansia Kyriakou

2021. The longitudinal stability of rating characteristics in an EFL examination: Methodological and substantive considerations. Language Testing 38:2 ► pp. 273 ff.

Lamprianou, Iasonas, Dina Tsagari & Nansia Kyriakou

2023. Experienced but detached from reality: Theorizing and operationalizing the relationship between experience and rater effects. Assessing Writing 56 ► pp. 100713 ff.

Liu, Yanmeng

2021. Exploring a Corpus-Based Approach to Assessing Interpreting Quality. In Testing and Assessment of Interpreting [New Frontiers in Translation Studies], ► pp. 159 ff.

Han, Chao & Qin Fan

2020. Using self-assessment as a formative assessment tool in an English-Chinese interpreting course: student views and perceptions of its utility. Perspectives 28:1 ► pp. 109 ff.

Lee, Sang-Bin

2019. Holistic assessment of consecutive interpretation. Interpreting. International Journal of Research and Practice in Interpreting 21:2 ► pp. 245 ff.

Abdel Latif, Muhammad M. M.

2018. Towards a typology of pedagogy-oriented translation and interpreting research. The Interpreter and Translator Trainer 12:3 ► pp. 322 ff.

Abdel Latif, Muhammad M. M.

2020. Translation and Interpreting Assessment Research. In Translator and Interpreter Education Research [New Frontiers in Translation Studies], ► pp. 61 ff.

Han, Chao

2018. Using rating scales to assess interpretation. Interpreting. International Journal of Research and Practice in Interpreting 20:1 ► pp. 63 ff.

Han, Chao

2018. A longitudinal quantitative investigation into the concurrent validity of self and peer assessment applied to English-Chinese bi-directional interpretation in an undergraduate interpreting course. Studies in Educational Evaluation 58 ► pp. 187 ff.

Han, Chao

2019. A generalizability theory study of optimal measurement design for a summative assessment of English/Chinese consecutive interpreting. Language Testing 36:3 ► pp. 419 ff.

Han, Chao

2019. Conceptualizing and Operationalizing a Formative Assessment Model for English-Chinese Consecutive Interpreting. In Quality Assurance and Assessment Practices in Translation and Interpreting [Advances in Linguistics and Communication Studies], ► pp. 89 ff.

Han, Chao

2021. Analytic rubric scoring versus comparative judgment: a comparison of two approaches to assessing spoken-language interpreting. Meta 66:2 ► pp. 337 ff.

Han, Chao

2021. Detecting and Measuring Rater Effects in Interpreting Assessment: A Methodological Comparison of Classical Test Theory, Generalizability Theory, and Many-Facet Rasch Measurement. In Testing and Assessment of Interpreting [New Frontiers in Translation Studies], ► pp. 85 ff.

Han, Chao

2022. Assessing spoken-language interpreting. Interpreting. International Journal of Research and Practice in Interpreting 24:1 ► pp. 59 ff.

Han, Chao

2022. Interpreting testing and assessment: A state-of-the-art review. Language Testing 39:1 ► pp. 30 ff.

Shang, Xiaoqi

2017. Conference interpreting: a trainer’s guide. Perspectives 25:4 ► pp. 682 ff.

Shang, Xiaoqi

2021. Developing a Weighting Scheme for Assessing Chinese-to-English Interpreting: Evidence from Native English-Speaking Raters. In Testing and Assessment of Interpreting [New Frontiers in Translation Studies], ► pp. 45 ff.

Han, Chao & Helen Slatyer

2016. Test validation in interpreter certification performance testing. Interpreting. International Journal of Research and Practice in Interpreting 18:2 ► pp. 225 ff.

This list is based on CrossRef data as of 12 March 2026. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.