Article published in: International Journal of Corpus Linguistics
Vol. 26:2 (2021) ► pp.187–218
Subcategorization frame identification for learner English
Published online: 8 December 2020
https://doi.org/10.1075/ijcl.18097.hua
Abstract
As large-scale learner corpora become increasingly available, it is vital that natural language processing (NLP)
technology be developed to provide the rich linguistic annotations necessary for second language (L2) research. We present a system for
automatically analyzing subcategorization frames (SCFs) for learner English. SCFs link lexis with morphosyntax, shedding light on the
interplay between lexical and structural information in learner language. SCFs are also crucial to the study of a wide range of
phenomena, including individual verbs, verb classes and varying syntactic structures. To illustrate the usefulness of our system for learner
corpus research and second language acquisition (SLA), we investigate how L2 learners diversify their use of SCFs in text and how this
diversity changes with L2 proficiency.
Article outline
- 1. Introduction
- 2. Subcategorization frames and their automatic identification
- 3. A SCF identification system for learner English
  - 3.1 Data
  - 3.2 Method
  - 3.3 Training and evaluation
    - 3.3.1 Accuracy
    - 3.3.2 Error analysis
      - i. Distinction between arguments and adjuncts
      - ii. Prepositional attachment
- 4. Case study: SCF diversity and L2 proficiency
  - 4.1 Design of SCF diversity metrics
    - 4.1.1 Basic design
      - i. Repetition
      - ii. Evenness
      - iii. Dispersion
      - iv. Disparity
    - 4.1.2 Control for text length
  - 4.2 Data selection and statistical analysis method
  - 4.3 Results
- 5. Conclusion
- Notes
