Article published in: Natural language processing for learner corpus research
Edited by Kristopher Kyle
International Journal of Learner Corpus Research 7:1 (2021), pp. 17–52
Automated annotation of learner English
An evaluation of software tools
Published online: 1 March 2021
https://doi.org/10.1075/ijlcr.20003.pic
Abstract
This paper explores the use of natural language processing (NLP) tools and their utility for learner language analyses
through a comparison of automatic linguistic annotation against a gold standard produced by humans. Although a number of automated
annotation tools for English are currently available, little research has examined the accuracy of these tools when annotating learner data.
We compare the performance of three linguistic annotation tools (a tagger and two parsers) on academic writing in English produced by
learners (both L1 and L2 English speakers). We focus on lexico-grammatical patterns, including both phrasal and clausal features, since
these are frequently investigated in applied linguistics studies. Our results report both precision and recall of annotation output for
argumentative texts in English across four L1s: Arabic, Chinese, English, and Korean. We close with a discussion of the benefits and
drawbacks of using automatic tools to annotate learner language.
Keywords: learner NLP, automated annotation, learner English, writing research
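To make the evaluation measures named in the abstract concrete, the following is a minimal sketch of how precision and recall can be computed for tool output against a gold standard. It is not the authors' implementation: the function name `precision_recall`, the representation of each annotation as a `(token_index, feature_label)` pair, and the sample data are all assumptions made for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: one annotation = (token index, feature label).
Annotation = Tuple[int, str]


def precision_recall(
    gold: Set[Annotation], predicted: Set[Annotation]
) -> Tuple[float, float]:
    """Return (precision, recall) of predicted annotations against gold ones."""
    # A true positive is an annotation found in both the tool output and the gold standard.
    true_positives = len(gold & predicted)
    # Precision: share of the tool's annotations that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the gold annotations that the tool found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented sample data, for illustration only.
gold = {
    (2, "attributive_adjective"),
    (5, "noun_noun"),
    (9, "relative_clause"),
}
predicted = {
    (2, "attributive_adjective"),
    (5, "noun_noun"),
    (7, "noun_noun"),
    (12, "complement_clause"),
}

p, r = precision_recall(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this invented sample the tool recovers two of the three gold annotations and proposes two spurious ones, so precision and recall diverge. Reporting both measures makes that difference visible.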
Article outline
- 1. Introduction
- 2. Lexico-grammatical patterns in L2 English academic writing
- 2.1 Performance of automated annotation
- 3. Tool performance evaluation
- 3.1 Methods
- 3.1.1 English academic writing corpus
- 3.1.2 Choice of automatic annotation tools
- 3.1.3 Gold standard labels
- 3.1.4 Feature extraction
- 3.1.4.1 Attributive adjectives
- 3.1.4.2 Noun-noun sequences
- 3.1.4.3 Relative clause
- 3.1.4.4 Complement clause
- 3.1.5 Output alignment
- 3.1.6 Analysis
- 3.2 Results
- 3.2.1 Phrasal features
- 3.2.2 Clausal features
- 4. Discussion and conclusion
- Acknowledgements
- References
