Article published in: International Journal of Learner Corpus Research
Vol. 1:1 (2015), pp. 96–129
Exploring big educational learner corpora for SLA research
Perspectives on relative clauses
Published online: 23 March 2015
https://doi.org/10.1075/ijlcr.1.1.04ale
We consider the opportunities presented by big educational learner corpora for Second Language Acquisition (SLA) research. In particular, we focus on the EF Cambridge Open Language Database (EFCAMDAT), an open access database of student writings submitted to Englishtown, the online school of EF Education First. EFCAMDAT stands out for its size (33 million words, 85 thousand learners) and its range of 128 writing tasks covering all CEFR levels, with data from learners of various nationalities. We discuss methodological issues arising from analyzing big data resources generated in educational contexts and argue that Natural Language Processing (NLP) is essential for the automated processing of such datasets. As a case study, we follow the developmental trajectory of relative clauses, a construction that necessitates deeper syntactic analysis. We consider specific issues that can affect the developmental trajectory, including task effects, formulaic language and national language effects.
