Article published in: Australian Review of Applied Linguistics: Online-First Articles
Difficulty level of EFL test designed by pre‑service teachers
A corpus analysis
Published online: 13 March 2026
https://doi.org/10.1075/aral.25009.rud
Abstract
Since 2024, English as a Foreign Language (EFL) teaching and assessment in Indonesian primary and secondary
schools have targeted B1 proficiency on the Common European Framework of Reference (CEFR). However, studies on aligning teacher
training with CEFR-based assessment design are rare. Consequently, teacher training institutions, which had previously paid little
attention to the issue, were not prepared to integrate this target into assessment design courses. To fill the gap, this study
leverages corpus analysis to align test difficulty with the targeted CEFR level. It investigates formative test items created by
28 pre-service teachers (PTs) in a Designing Assessment Course, scrutinizing the alignment of CEFR vocabulary and difficulty
levels in the developed items. Using two corpus analysis tools, 26,487 tokens from receptive skill tests were compared with
5,354 tokens from the CEFR. The results showed that both test types were dominated by very easy (A1) and easy (A2) levels, with
limited representation of medium (B1), difficult (B2), and very difficult (C1 and C2) items. Listening items contained 68.51%
CEFR-aligned vocabulary, mostly A1 (55.98%); similarly, reading items contained 68.19% CEFR-aligned vocabulary, with A1
dominating (51.70%). These findings suggest that the test items do not fully align with B1 proficiency. The predominance of very
easy and easy items limits the tests’ effectiveness in assessing students’ achievement and higher-level language skills, which in
turn may weaken test validity. The findings urge education institutions to integrate corpus literacy into assessment design. The
test analysis in this study was a relatively simple procedure, yet significant for investigating difficulty level, and it can be
replicated for assessment design in EFL classrooms and research settings.
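The procedure the abstract describes — matching test-item tokens against a CEFR-levelled wordlist and reporting the share of vocabulary at each level — can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the study used dedicated corpus tools, and the tiny `cefr_levels` wordlist and sample tokens below are hypothetical placeholders for a full A1–C2 vocabulary list and a real test transcript.

```python
from collections import Counter

# Hypothetical CEFR wordlist: lemma -> level. A real analysis would load
# a full published A1-C2 vocabulary profile instead of this toy mapping.
cefr_levels = {
    "dog": "A1", "school": "A1", "travel": "A2", "journey": "B1",
    "assess": "B2", "notion": "C1",
}

def profile(tokens):
    """Return the percentage of tokens at each CEFR level (plus off-list)."""
    counts = Counter(cefr_levels.get(t.lower(), "off-list") for t in tokens)
    total = sum(counts.values())
    return {level: round(100 * n / total, 2) for level, n in counts.items()}

# Toy "test item" text, already tokenized.
tokens = ["The", "dog", "goes", "to", "school", "on", "a", "journey"]
print(profile(tokens))  # e.g. {'off-list': 62.5, 'A1': 25.0, 'B1': 12.5}
```

A profile dominated by A1/A2 entries, as in the study's results, would indicate that the item's vocabulary sits below the targeted B1 level.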
Keywords: assessment, corpus analysis, difficulty level, CEFR, test items
Article outline
- 1. Introduction
- 2. Literature review
- 2.1 Formative assessment
- 2.2 Difficulty level
- 2.3 Vocabulary threshold
- 2.4 Corpus analysis
- 2.4.1 Corpus features
- 2.4.2 Predicting difficulty through corpus analysis
- 3. Method
- 3.1 Research data
- 3.2 Word classification
- 3.3 Data preparation
- 3.3.1 Data cleaning
- 3.3.2 Word elimination
- 3.3.3 Difficulty classification
- 3.4 Word analysis
- 4. Results
- 4.1 CEFR levels of PTs’ test items
- 4.2 The difficulty levels
- 5. Discussion
- 5.1 Discrepancy between intended and actual difficulty levels
- 5.2 Quest for test validity
- 6. Conclusions and implications
- Limitations
- Acknowledgements
- AI disclosure