Article published in: Interpreting
Vol. 17:2 (2015), pp. 255–283
Investigating rater severity/leniency in interpreter performance testing
A multifaceted Rasch measurement approach
Published online: 3 September 2015
https://doi.org/10.1075/intp.17.2.05han
Rater-mediated performance assessment (RMPA) is a critical component of interpreter certification testing systems worldwide. Given
the acknowledged rater variability in RMPA and the high-stakes nature of certification testing, it is crucial to ensure rater
reliability in interpreter certification performance testing (ICPT). However, a review of current ICPT practice indicates that
rigorous research on rater reliability is lacking. Against this background, the present study reports on the use of multifaceted Rasch
measurement (MFRM) to identify the degree of severity/leniency in different raters’ assessments of simultaneous interpretations
(SIs) by 32 interpreters in an experimental setting. Nine raters specifically trained for the purpose were asked to evaluate four
English-to-Chinese SIs by each of the interpreters, using three 8-point rating scales (information content, fluency, expression).
The source texts differed in speed and in the speaker’s accent (native vs non-native). Rater-generated scores were then subjected
to MFRM analysis, using the FACETS program. The following general trends emerged: 1) homogeneity statistics showed that not all
raters were equally severe overall; and 2) bias analyses showed that a relatively large proportion of the raters had significantly
biased interactions with the interpreters and the assessment criteria. Implications for practical rating arrangements in ICPT, and
for rater training, are discussed.
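
The abstract does not give the exact model specification. As a minimal sketch of what "rater severity" means in MFRM, the Python below assumes the common rating-scale form of the many-facet Rasch model, in which the log-odds of receiving score k rather than k-1 equal interpreter ability minus rater severity minus criterion difficulty minus a threshold for step k. The function names, the parameter values, and the 0-7 indexing of the eight score categories are illustrative assumptions, not estimates or settings from the study or from the FACETS program.

```python
import math


def category_probabilities(ability, rater_severity, criterion_difficulty, thresholds):
    """Score-category probabilities under a rating-scale many-facet Rasch model.

    All arguments are in logits. `thresholds` holds one value per step between
    adjacent categories, so 7 thresholds describe an 8-point scale (indexed 0-7).
    """
    logit = ability - rater_severity - criterion_difficulty
    # Cumulative sums of (logit - threshold) give each category's log-numerator;
    # the lowest category has log-numerator 0 by convention.
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + (logit - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(ability, rater_severity, criterion_difficulty, thresholds):
    """Model-expected score (on the 0-7 index) for one rating."""
    probs = category_probabilities(
        ability, rater_severity, criterion_difficulty, thresholds
    )
    return sum(k * p for k, p in enumerate(probs))


# Hypothetical values: 7 evenly spaced thresholds for an 8-point scale.
THRESHOLDS = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]

# Same interpreter and criterion, rated by a lenient and a severe rater.
lenient = expected_score(0.5, -0.8, 0.0, THRESHOLDS)
severe = expected_score(0.5, 0.8, 0.0, THRESHOLDS)
print(f"lenient rater: {lenient:.2f}, severe rater: {severe:.2f}")
```

Under these assumptions, the same performance receives a lower expected score from the more severe rater. That gap is the kind of difference the homogeneity statistics mentioned above are meant to detect.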
