In: Automatic Treatment and Analysis of Learner Corpus Data
Edited by Ana Díaz-Negrillo, Nicolas Ballier and Paul Thompson
[Studies in Corpus Linguistics 59] 2013
pp. 169–204
Criterial feature extraction using parallel learner corpora and machine learning
Published online: 18 December 2013
https://doi.org/10.1075/scl.59.11ton
This study reports on a new approach to semi-automatic error annotation and criterial feature extraction from learner corpora. Parallel learner corpora, sets of original learner writings paired with their proofread counterparts, were processed using edit distance to automatically identify surface taxonomy errors, which were then statistically analysed to yield language features that are criterial for a particular proficiency level. Two case studies report on different statistical and machine learning techniques: a clustering technique called variability-based neighbour clustering and an ensemble learning method called random forest. The results show that using edit distance over parallel learner corpora is a promising direction for annotating a large quantity of learner data with minimal manual annotation work, and both techniques were found to be effective in identifying criterial features from learner corpora. Some theoretical and methodological issues are discussed for further research.
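The edit-distance step described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: it assumes word-level tokens, unit costs for every operation, and a simple mapping of edit operations onto three surface taxonomy labels (omission, addition, misformation). The function name `align_errors` and the example sentences are hypothetical.

```python
from typing import List, Tuple

# (error_type, learner_token, corrected_token); an empty string marks a missing side
Edit = Tuple[str, str, str]


def align_errors(original: List[str], corrected: List[str]) -> List[Edit]:
    """Align a learner sentence with its proofread version using
    token-level Levenshtein distance and label each edit.

    Assumed label mapping (illustrative only):
      substitution     -> "misformation"
      extra learner token   -> "addition"
      missing learner token -> "omission"
    """
    n, m = len(original), len(corrected)

    # dist[i][j] = edit distance between original[:i] and corrected[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if original[i - 1] == corrected[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # learner token has no counterpart
                dist[i][j - 1] + 1,         # corrected token has no counterpart
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back from the bottom-right cell to recover one optimal alignment.
    edits: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and original[i - 1] == corrected[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1  # tokens match, no error recorded
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            edits.append(("misformation", original[i - 1], corrected[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            edits.append(("addition", original[i - 1], ""))
            i -= 1
        else:
            edits.append(("omission", "", corrected[j - 1]))
            j -= 1
    edits.reverse()
    return edits


if __name__ == "__main__":
    learner = "She has cat".split()
    proofread = "She has a cat".split()
    for edit in align_errors(learner, proofread):
        print(edit)
```

Counts of such labelled edits, aggregated per learner text, are the kind of input that a clustering or random forest step could then take as features. Several alignments can share the same minimum distance, so the labels depend on the tie-breaking order in the backtrace; misordering errors are not detected by this sketch.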
