Criterial feature extraction using parallel learner corpora and machine learning

Tono, Yukio

doi:10.1075/scl.59.11ton

In: Automatic Treatment and Analysis of Learner Corpus Data
Edited by Ana Díaz-Negrillo, Nicolas Ballier and Paul Thompson
[Studies in Corpus Linguistics 59] 2013
pp. 169–204



Published online: 18 December 2013


This study reports on a new approach to semi-automatic error annotation and criterial feature extraction from learner corpora. Parallel learner corpora, that is, sets of original learner writings paired with their proofread counterparts, were processed using edit distance to automatically identify surface taxonomy errors. These errors were then analysed statistically to extract language features that are criterial for a particular proficiency level. Two case studies report on different statistical and machine learning techniques: a clustering technique called variability-based neighbour clustering and an ensemble learning method called random forest. The results show that using edit distance over parallel learner corpora is a promising way to annotate a large quantity of learner data with minimal manual annotation work, and that both techniques were effective in identifying criterial features from learner corpora. Some theoretical and methodological issues are discussed for further research.
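The edit-distance step described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation. It assumes whitespace tokenisation, unit costs for every edit, and a simple mapping from edit operations to surface taxonomy labels: a word deleted from the learner text counts as an addition error, a word inserted by the proofreader as an omission error, and a substituted word as a misformation. The function name `align_errors` and the example sentences are hypothetical. Misordering errors are not detected here, since plain Levenshtein distance has no transposition operation.

```python
from typing import List, Tuple

def align_errors(original: str, corrected: str) -> List[Tuple[str, str, str]]:
    """Return (error_type, learner_word, corrected_word) triples.

    Assumption: word-level Levenshtein alignment with unit costs.
    """
    src = original.split()   # learner's sentence
    tgt = corrected.split()  # proofread sentence
    n, m = len(src), len(tgt)

    # dist[i][j] = edit distance between src[:i] and tgt[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a learner word
                dist[i][j - 1] + 1,         # insert a corrected word
                dist[i - 1][j - 1] + cost,  # match or substitute
            )

    # Trace back from the bottom-right cell to recover the edit operations.
    errors: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i - 1] == tgt[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1  # words match, no error
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            errors.append(("misformation", src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            errors.append(("addition", src[i - 1], ""))
            i -= 1
        else:
            errors.append(("omission", "", tgt[j - 1]))
            j -= 1
    errors.reverse()  # report errors in sentence order
    return errors

if __name__ == "__main__":
    print(align_errors("She is student", "She is a student"))
```

Counts of such labels per proficiency level could then serve as input to a clustering or classification step, although the feature set actually used in the chapter is not specified in the abstract.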
