Refining and modifying the EFCAMDAT: Lessons from creating a new corpus from an existing large-scale English learner language database

Shatz, Itamar

doi:10.1075/ijlcr.20009.sha

Article published in: International Journal of Learner Corpus Research
Vol. 6:2 (2020), pp. 220–236

Corpus Report

Itamar Shatz | University of Cambridge

Published online: 10 December 2020

Abstract

This report outlines the development of a new corpus, which was created by refining and modifying the largest open-access L2 English learner database – the EFCAMDAT. The extensive data-curation process, which can inform the development and use of other corpora, included procedures such as converting the database from XML to a tabular format, and removing problematic markup tags and non-English texts. The final dataset contains two corresponding samples, written by similar learners in response to different prompts, which represents a unique research opportunity when it comes to analyzing task effects and conducting replication studies. Overall, the resulting corpus contains ~406,000 texts in the first sample and ~317,000 texts in the second sample, written by learners representing diverse L1s and a large range of L2 proficiency levels.

Keywords: data curation, corpus cleaning, English as a second language, EFCAMDAT

Article outline

1. Introduction
2. Preparing the new corpus
- 2.1 Selecting the sample
- 2.2 Format: Converting from XML to a tabular format
- 2.3 Content: Analyzing and removing texts
  - 2.3.1 Texts with problematic markup tags
  - 2.3.2 Ultra-short texts
  - 2.3.3 Non-English texts
  - 2.3.4 Duplicate texts
  - 2.3.5 Outlier texts based on wordcount
- 2.4 Structure: Classifying texts based on prompt
3. Discussion and conclusion
Acknowledgements
References
