Automated annotation of learner English: An evaluation of software tools

Picoral, Adriana; Staples, Shelley; Reppen, Randi

doi:10.1075/ijlcr.20003.pic

Article published in: Natural language processing for learner corpus research
Edited by Kristopher Kyle
[International Journal of Learner Corpus Research 7:1] 2021
► pp. 17–52

Adriana Picoral | University of Arizona

Shelley Staples | University of Arizona

Randi Reppen | Northern Arizona University

Published online: 1 March 2021

https://doi.org/10.1075/ijlcr.20003.pic

Abstract

This paper explores the use of natural language processing (NLP) tools and their utility for learner language analyses through a comparison of automatic linguistic annotation against a gold standard produced by humans. While a number of automated annotation tools for English are currently available, little research has examined the accuracy of these tools when annotating learner data. We compare the performance of three linguistic annotation tools (a tagger and two parsers) on academic writing in English produced by learners (both L1 and L2 English speakers). We focus on lexico-grammatical patterns, including both phrasal and clausal features, since these are frequently investigated in applied linguistics studies. Our results report both precision and recall of annotation output for argumentative texts in English across four L1s: Arabic, Chinese, English, and Korean. We close with a discussion of the benefits and drawbacks of using automatic tools to annotate learner language.
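
As an illustration of the evaluation measures named in the abstract, the following is a minimal sketch of how precision and recall for one annotated feature might be computed against a human gold standard. It is not the authors' procedure: the function name, the label names (`ATTR_ADJ`, `NN_NN`, `O`), and the toy data are hypothetical, and the sketch assumes the tool output has already been aligned token-for-token with the gold labels.

```python
from typing import Sequence, Tuple


def precision_recall(
    gold: Sequence[str], predicted: Sequence[str], label: str
) -> Tuple[float, float]:
    """Token-level precision and recall for one label.

    Assumes `gold` and `predicted` are already aligned token-for-token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    # True positives: the tool and the human annotator both assign the label.
    tp = sum(1 for g, p in pairs if g == label and p == label)
    # False positives: the tool assigns the label, the annotator does not.
    fp = sum(1 for g, p in pairs if g != label and p == label)
    # False negatives: the annotator assigns the label, the tool misses it.
    fn = sum(1 for g, p in pairs if g == label and p != label)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: one label per token, "O" meaning "no feature".
gold = ["ATTR_ADJ", "O", "NN_NN", "O", "ATTR_ADJ", "O"]
pred = ["ATTR_ADJ", "ATTR_ADJ", "NN_NN", "O", "O", "O"]

adj_precision, adj_recall = precision_recall(gold, pred, "ATTR_ADJ")
print(f"ATTR_ADJ precision={adj_precision:.2f} recall={adj_recall:.2f}")
```

Reporting the two measures separately distinguishes a tool that over-applies a label (low precision) from one that misses instances of it (low recall), which a single accuracy figure would hide.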

Keywords: learner NLP, automated annotation, learner English, writing research

Article outline

1. Introduction
2. Lexico-grammatical patterns in L2 English academic writing
- 2.1 Performance of automated annotation
3. Tool performance evaluation
- 3.1 Methods
  - 3.1.1 English academic writing corpus
  - 3.1.2 Choice of automatic annotation tools
  - 3.1.3 Gold standard labels
  - 3.1.4 Feature extraction
    - 3.1.4.1 Attributive adjectives
    - 3.1.4.2 Noun-noun sequences
    - 3.1.4.3 Relative clause
    - 3.1.4.4 Complement clause
  - 3.1.5 Output alignment
  - 3.1.6 Analysis
- 3.2 Results
  - 3.2.1 Phrasal features
  - 3.2.2 Clausal features
4. Discussion and conclusion
Acknowledgements
References

Cited by 15 other publications

Anitha, D., A. M. Abirami, N. Sharmila, A. Shrishanmathi & Rajiv Ratn Shah

2026. Machine Learning Approaches for Tamil POS Tagging and Dependency Parsing. In Speech and Language Technologies for Low-Resource Languages [Communications in Computer and Information Science, 2656], ► pp. 251 ff.

Jain, Rituraj

2025. Sentiment and Emotion Analysis in Video Interviews Using Deep Learning Techniques. In Emerging Technologies for Recruitment Strategy and Practice, ► pp. 147 ff.

Mahmoudi-Dehaki, Mohsen & Nasim Nasr-Esfahani

2025. Automated vs. manual linguistic annotation for assessing pragmatic competence in English classes. Research Methods in Applied Linguistics 4:3 ► pp. 100253 ff.

Granger, Sylviane

2024. From early to future learner corpus research. International Journal of Learner Corpus Research 10:2 ► pp. 247 ff.

Kyle, Kristopher & Masaki Eguchi

2024. Evaluating NLP models with written and spoken L2 samples. Research Methods in Applied Linguistics 3:2 ► pp. 100120 ff.

Le Foll, Elen & Muhammad Shakir

2024. The Multi-Feature Tagger of English (MFTE): Rationale, description and evaluation. Research in Corpus Linguistics 13:2 ► pp. 63 ff.

Minnillo, Sophia, Claudia Sánchez-Gutiérrez, Ana Ruiz-Alonso-Bartol, Emily Morgan & Carmen González Gómez

2024. Predictors of accuracy in L2 Spanish preterit-imperfect production. International Journal of Learner Corpus Research 10:2 ► pp. 301 ff.

Spina, Stefania, Irene Fioravanti, Luciana Forti & Fabio Zanda

2024. The CELI corpus: Design and linguistic annotation of a new online learner corpus. Second Language Research 40:2 ► pp. 457 ff.

Lan, Ge, Xiaofei Pan, Yachao Sun & Yuan Lu

2023. Part of speech tagging of grammatical features related to L2 Chinese development: A case analysis of Stanza in the L2 writing context. Frontiers in Psychology 14

Oushiro, Livia

2023. Computational Resources for Handling Sociolinguistic Corpora. In The Handbook of Usage‐Based Linguistics, ► pp. 415 ff.

Larsson, Tove, Shelley Staples & Jesse Egbert

2022. Teaching, learning, and researching with corpora. Applied Corpus Linguistics 2:3 ► pp. 100025 ff.

McCallum, Lee & Philip Durrant

2022. Shaping Writing Grades,

Naismith, Ben, Na-Rae Han & Alan Juffs

2022. The University of Pittsburgh English Language Institute Corpus (PELIC). International Journal of Learner Corpus Research 8:1 ► pp. 121 ff.

Staples, Shelley & Karin Puga

2022. Integrating fluency and prosody into multidimensional analysis. International Journal of Learner Corpus Research 8:2 ► pp. 190 ff.

[no author supplied]

2024. Textbook English [Studies in Corpus Linguistics, 116],

This list is based on CrossRef data as of 12 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.