Article published in: Australian Review of Applied Linguistics: Online-First Articles
Difficulty level of EFL test designed by pre‑service teachers
A corpus analysis
Published online: 13 March 2026
https://doi.org/10.1075/aral.25009.rud
Abstract
Since 2024, English as a Foreign Language (EFL) teaching and assessment in Indonesian primary and secondary
schools have targeted B1 proficiency on the Common European Framework of Reference (CEFR). However, studies on aligning teacher
training with CEFR-based assessment design are rare. Consequently, teacher training institutions, which had previously paid little
attention to the issue, were not prepared to integrate this target into assessment design courses. To fill the gap, this study
leverages corpus analysis to align test difficulty with the targeted CEFR level. It investigates formative test items created by
28 pre-service teachers (PTs) in a Designing Assessment Course, scrutinizing the alignment of CEFR vocabulary and difficulty
levels in the developed items. Using two corpus analysis tools, 26,487 tokens from receptive skill tests were compared with
5,354 tokens from the CEFR. The results showed that both test types were dominated by very easy (A1) and easy (A2) levels, with
limited representation of medium (B1), difficult (B2), and very difficult (C1 and C2) items. Listening items contained 68.51%
CEFR-aligned vocabulary, mostly A1 (55.98%); similarly, reading items contained 68.19% CEFR-aligned vocabulary, with A1
dominating (51.70%). These findings suggest that the test items do not fully align with B1 proficiency. The predominance of very
easy and easy items limits the tests’ effectiveness in assessing students’ achievement and higher-level language skills, which in
turn may weaken test validity. The findings urge education institutions to integrate corpus literacy into assessment design. The
test analysis in this study was a relatively simple procedure, yet significant for investigating difficulty level, and it can be
replicated for assessment design in EFL classrooms and research settings.
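The procedure the abstract describes — matching test-item tokens against a CEFR-levelled wordlist and reporting the share of vocabulary at each level — can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the study used dedicated corpus tools, and the tiny `cefr_levels` wordlist and sample tokens below are hypothetical placeholders for a full A1–C2 vocabulary list and a real test transcript.

```python
from collections import Counter

# Hypothetical CEFR wordlist: lemma -> level. A real analysis would load
# a full published A1-C2 vocabulary profile instead of this toy mapping.
cefr_levels = {
    "dog": "A1", "school": "A1", "travel": "A2", "journey": "B1",
    "assess": "B2", "notion": "C1",
}

def profile(tokens):
    """Return the percentage of tokens at each CEFR level (plus off-list)."""
    counts = Counter(cefr_levels.get(t.lower(), "off-list") for t in tokens)
    total = sum(counts.values())
    return {level: round(100 * n / total, 2) for level, n in counts.items()}

# Toy "test item" text, already tokenized.
tokens = ["The", "dog", "goes", "to", "school", "on", "a", "journey"]
print(profile(tokens))  # e.g. {'off-list': 62.5, 'A1': 25.0, 'B1': 12.5}
```

A profile dominated by A1/A2 entries, as in the study's results, would indicate that the item's vocabulary sits below the targeted B1 level.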
Keywords: assessment, corpus analysis, difficulty level, CEFR, test items
Article outline
- 1. Introduction
- 2. Literature review
- 2.1 Formative assessment
- 2.2 Difficulty level
- 2.3 Vocabulary threshold
- 2.4 Corpus analysis
- 2.4.1 Corpus features
- 2.4.2 Predicting difficulty through corpus analysis
- 3. Method
- 3.1 Research data
- 3.2 Word classification
- 3.3 Data preparation
- 3.3.1 Data cleaning
- 3.3.2 Word elimination
- 3.3.3 Difficulty classification
- 3.4 Word analysis
- 4. Results
- 4.1 CEFR levels of PTs’ test items
- 4.2 The difficulty levels
- 5. Discussion
- 5.1 Discrepancy between intended and actual difficulty levels
- 5.2 Quest for test validity
- 6. Conclusions and implications
- Limitations
- Acknowledgements
- AI disclosure