Article published in: International Journal of Learner Corpus Research
Vol. 6:2 (2020) ► pp. 220–236
Corpus Report
Refining and modifying the EFCAMDAT
Lessons from creating a new corpus from an existing large-scale English learner language database
Published online: 10 December 2020
https://doi.org/10.1075/ijlcr.20009.sha
Abstract
This report outlines the development of a new corpus, created by refining and modifying the EFCAMDAT, the largest open-access L2 English learner database. The extensive data-curation process, which can inform the development and use of other corpora, included converting the database from XML to a tabular format and removing problematic markup tags and non-English texts. The final dataset contains two corresponding samples, written by similar learners in response to different prompts, which offers a unique opportunity for analyzing task effects and conducting replication studies. Overall, the resulting corpus contains ~406,000 texts in the first sample and ~317,000 texts in the second sample, written by learners representing diverse L1s and a wide range of L2 proficiency levels.
Keywords: data curation, corpus cleaning, English as a second language, EFCAMDAT
Article outline
- 1. Introduction
- 2. Preparing the new corpus
- 2.1 Selecting the sample
- 2.2 Format: Converting from XML to a tabular format
- 2.3 Content: Analyzing and removing texts
- 2.3.1 Texts with problematic markup tags
- 2.3.2 Ultra-short texts
- 2.3.3 Non-English texts
- 2.3.4 Duplicate texts
- 2.3.5 Outlier texts based on wordcount
- 2.4 Structure: Classifying texts based on prompt
- 3. Discussion and conclusion
- Acknowledgements
- References
