Article published in: International Journal of Learner Corpus Research: Online-First Articles
SEEFLEX
The Corpus of Secondary English as a Foreign Language (EFL) Exams
Published online: 21 August 2025
https://doi.org/10.1075/ijlcr.24027.pau
Abstract
This report presents the Corpus of Secondary School English as a Foreign Language (EFL) Exams
(SEEFLEX). In Germany, upper secondary school EFL exams feature recurring tasks targeting diverse text types. The
SEEFLEX was developed to investigate how students complete these tasks linguistically and whether they meet
the curricular requirements. The corpus contains data from 575 transcribed authentic curriculum-based examinations (1,979 texts,
~625,000 words). The metadata include standardized receptive vocabulary assessments, a cognition scale, the participants’ reading
habits, social background, and their language experience and proficiency. Extensive XML mark-up was added to investigate the
influence of, inter alia, source material, structural text features, and selected language mistakes. An online repository provides
full-text access as well as ample additional resources, including an interactive Shiny application to investigate register
variation in the corpus.
Article outline
- 1. Introduction
- 2. Data collection
- 3. Ethical considerations
- 4. Corpus description
- 4.1 Corpus size
- 4.2 Learner metadata
- 4.2.1 Language background
- 4.2.2 Vocabulary tests
- 4.2.3 The “Need for Cognition”
- 4.2.4 Reading habits
- 4.3 Written data
- 4.3.1 Text, situational and task-related metadata
- 4.3.2 Tasks
- 4.3.3 Genres and registers
- 5. Data processing
- 5.1 Transcription and digitizing
- 5.2 Part-of-speech tagging
- 5.3 Mark-up
- 5.3.1 Structural mark-up
- 5.3.2 Language mark-up
- 5.3.3 Content mark-up
- 6. Additional resources
- 6.1 Data pipeline
- 6.2 CQPweb
- 6.3 Shiny applications
- 7. Conclusion
- Acknowledgements
- Open data badge and data availability statement
- Notes