Article published in: Graded Resources for Second and Foreign Language Learning
Edited by David Alfter and Thomas François
[ITL - International Journal of Applied Linguistics 175:1] 2024
pp. 25–45
Mapping of American English vocabulary by grade levels
Published online: 26 February 2024
https://doi.org/10.1075/itl.22025.flo
Abstract
We describe a large-scale effort to map English-language vocabulary to U.S. school grade levels. Our motivation is
to rapidly expand graded vocabulary resources for work with native English speakers in the USA, while taking into account
school-related influences rather than relying solely on corpus-frequency approaches. We report on the initial data
collection effort, which mapped about 22K word forms. We compare this mapping with other recent vocabulary mapping
efforts, such as age-of-acquisition norms. We then describe efforts to expand this resource automatically, using linguistically
motivated variables and corpus-based methods. Our current resource maps more than 126K English word forms to U.S. school grade
levels. We also compare a subset of our L1-mapped data to English L2 vocabulary levels, as expressed on the CEFR scale, and find
considerable overlap in the order of vocabulary learning in L1 and L2 English.
Keywords: vocabulary, grade levels, graded resources, lexical progression, word difficulty
Article outline
- Introduction
- Related work
- Method
- Data Collection
- Comparing VXGL and AoA
- Prediction
- Associative Estimate of Grade Level
- Results
- Comparison with CEFR mapping
- Discussion
- Conclusion
- Notes
- References
