Article published in: International Journal of Learner Corpus Research: Online-First Articles
Automatic discourse segmentation of L1 and L2 spoken English transcripts
Available under the Creative Commons Attribution (CC BY) 4.0 license.
For any use beyond this license, please contact the publisher at rights@benjamins.nl.
Open Access publication of this article was funded through a Transformative Agreement with Radboud University Nijmegen.
Published online: 7 October 2025
https://doi.org/10.1075/ijlcr.24023.yan
Abstract
Natural language processing (NLP) tools, primarily trained on L1 written English, have achieved remarkable
performance but are rarely applied to L2 learner data. This study leverages a rule-based segmenter to automatically segment spoken
English discourse by both L1 speakers and learners, presenting novel preparatory data-cleaning steps that combine a
state-of-the-art disfluency detector and additional rules to improve segmentation performance. In three successive segmentation
tests on data from the Louvain Corpus of Native English Conversation (LOCNEC; De Cock, 2004) and the Louvain
International Database of Spoken English Interlanguage (LINDSEI; Gilquin et al., 2010), we achieve an enhanced
segmentation performance that is similar for
both the L1 and L2 data (.84). Our approach highlights the effectiveness of leveraging existing NLP tools to process disfluent L2
spoken transcripts, facilitating automatic discourse analysis in Learner Corpus Research (LCR). The code for executing our
pipeline is publicly available for future research.
Article outline
- 1. Introduction
- 1.1 Segmentation principles
- 1.2 Existing segmenters
- 2. Implementation
- 2.1 Corpora
- 2.2 Segmentation Test 1
- 2.2.1 Procedure
- 2.2.2 Results and discussion
- 2.3 Segmentation Test 2 with disfluency removal
- 2.3.1 Methodological overview
- 2.3.2 Procedure
- 2.3.3 Results and discussion
- 2.4 Segmentation Test 3 with hand-crafted rules
- 3. Conclusion
- Open code badge
- Notes
References
Bach, N., & Huang, F. (2019). Noisy
BiLSTM-based models for disfluency detection. Proceedings of Interspeech
2019, 4230–4234.
Bhat, S., & Yoon, S. Y. (2015). Automatic
assessment of syntactic complexity for spontaneous speech scoring. Speech
Communication, 67, 42–57.
Biber, D., Gray, B., & Staples, S. (2016). Predicting
patterns of grammatical complexity across language exam task types and proficiency
levels. Applied
Linguistics, 37(5), 639–668.
Caines, A., & Buttery, P. (2014). The
effect of disfluencies and learner errors on the parsing of spoken learner
language. In Y. Goldberg, Y. Marton, I. Rehbein, Y. Versley, Ö. Çetinoğlu, & J. Tetreault (Eds.), Proceedings
of the first joint workshop on statistical parsing of morphologically rich languages and syntactic analysis of non-canonical
languages (pp. 74–81). Dublin City University. Retrieved from [URL]
Carlson, L., Okurowski, M. E., & Marcu, D. (2002). RST
discourse treebank. Linguistic Data Consortium.
Chambers, L., & Ingham, K. (2011). The
BULATS online speaking test. Research
Notes, 43, 21–25. Retrieved from [URL]
Charniak, E., & Johnson, M. (2001). Edit
detection and parsing for transcribed speech. Second Meeting of the North American Chapter of the Association for
Computational Linguistics. NAACL 2001. Retrieved from [URL]
Chen, M., & Zechner, K. (2011). Computing
and evaluating syntactic complexity features for automated scoring of spontaneous non-native
speech. Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language
Technologies (pp. 722–731). Association for Computational Linguistics.
Cieri, C., Graff, D., Kimball, O., Miller, D., & Walker, K. (2004). Fisher
English training speech part 1 transcripts LDC2004T19. Linguistic Data Consortium.
Cresti, E. (1995). Speech
act units and informational units. In E. Fava (Ed.), Speech
acts and linguistic research (pp. 89–107). Proceedings
of the Workshop, Center for Cognitive Science of New York at Buffalo.
De Cock, S. (2004). Preferred
sequences of words in NS and NNS speech. Belgian Journal of English Language and Literatures
(BELL), New Series, 2, 225–246.
Dong, Q., Wang, F., Yang, Z., Chen, W., Xu, S., & Xu, B. (2019). Adapting
translation models for transcript disfluency detection. Proceedings of the AAAI Conference on
Artificial
Intelligence, 33(01), 6351–6358.
Feng, V. W., & Hirst, G. (2014). Two-pass
discourse segmentation with pairing and global features. CoRR,
abs/1407.8215. Retrieved from [URL]
Foster, P., Tonkyn, A., & Wigglesworth, G. (2000). Measuring
spoken language: A unit for all reasons. Applied
linguistics, 21(3), 354–375.
Gilquin, G., De Cock, S., & Granger, S. (2010). The
Louvain International Database of Spoken English Interlanguage: Handbook and CD-ROM. Presses universitaires de Louvain.
Godfrey, J. J., Holliman, E. C., & McDaniel, J. (1992). SWITCHBOARD:
Telephone speech corpus for research and development. Proceedings of the 1992 IEEE
International Conference on Acoustics, Speech, and Signal Processing
(ICASSP-92) (Vol. 1, pp. 517–520). IEEE.
Guhr, O., Schumann, A.-K., Bahrmann, F., & Böhme, H. J. (2021). FullStop:
Multilingual deep models for punctuation prediction. Proceedings of the Swiss Text Analytics
Conference 2021. CEUR Workshop Proceedings. Retrieved from [URL]
Himmelmann, N. P. (2006). The
challenges of segmenting spoken language. In J. Gippert, N. P. Himmelmann, & U. Mosel (Eds.), Essentials
of language
documentation (pp. 253–274). Mouton De Gruyter.
Hirschberg, J., & Litman, D. (1993). Empirical
studies on the disambiguation of cue phrases. Computational
Linguistics, 19(3), 501–530.
Hoek, J., Evers-Vermeul, J., & Sanders, T. J. M. (2018). Segmenting
discourse: Incorporating interpretation into segmentation? Corpus Linguistics and Linguistic
Theory, 14(2), 357–386.
Honnibal, M., & Johnson, M. (2014). Joint
incremental disfluency detection and dependency parsing. Transactions of the Association for
Computational Linguistics, 2, 131–142.
Hough, J., & Schlangen, D. (2015). Recurrent
neural networks for incremental disfluency detection. Proceedings of Interspeech
2015, 849–853.
Izumi, E., Uchimoto, K., & Isahara, H. (2004). The
NICT JLE Corpus: Exploiting the language learners’ speech database for research and
education. The International Journal of the Computer, the Internet and
Management, 12, 119–125.
Johnson, M., & Charniak, E. (2004). A
TAG-based noisy channel model of speech repairs. Proceedings of the 42nd Annual Meeting on
Association for Computational
Linguistics (pp. 33–39). Association for Computational Linguistics.
Joty, S., Carenini, G., & Ng, R. T. (2015). Codra:
A novel discriminative framework for rhetorical analysis. Computational
Linguistics, 41(3), 385–435.
Kahane, S., Caron, B., Strickland, E., & Gerdes, K. (2021). Annotation
guidelines of UD and SUD treebanks for spoken corpora: a
proposal. In D. Dakota, K. Evang, & S. Kübler (Eds.), Proceedings
of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest
2021) (pp. 35–47). Association for Computational Linguistics.
Knill, K. M., Gales, M. J., Manakul, P. P., & Caines, A. P. (2019). Automatic
grammatical error detection of non-native spoken learner English. ICASSP 2019–2019 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP) (pp. 8127–8131). IEEE.
Kyle, K., Eguchi, M., Miller, A., & Sither, T. (2022). A
dependency treebank of spoken second language English. Proceedings of the 17th Workshop on
Innovative Use of NLP for Building Educational Applications (BEA
2022) (pp. 39–45). Association for Computational Linguistics.
Kyle, K., & Eguchi, M. (2024). Evaluating
NLP models with written and spoken L2 samples. Research Methods in Applied
Linguistics, 3(2), 100120.
Le Thanh, H., Abeysinghe, G., & Huyck, C. (2004). Automated
discourse segmentation by syntactic information and cue phrases. Proceedings of the IASTED
International Conference on Artificial Intelligence and Applications (AIA 2004), Innsbruck,
Austria (pp. 411–415). IASTED.
Lou, P. J., & Johnson, M. (2017). Disfluency
detection using a noisy channel model and a deep neural language model. Proceedings of the
55th Annual Meeting of the Association for Computational Linguistics, Volume 2: Short
Papers (pp. 547–553). Association for Computational Linguistics.
Lou, P. J., & Johnson, M. (2020). Improving
disfluency detection by self-training a self-attentive model. Proceedings of the 58th Annual
Meeting of the Association for Computational
Linguistics (pp. 3754–3763). Association for Computational Linguistics.
Lu, Y., Gales, M. J. F., Knill, K. M., Manakul, P., & Wang, Y. (2019). Disfluency
detection for spoken learner English. Proceedings of the 8th ISCA Workshop on Speech and
Language Technology in Education (SLaTE 2019) (pp. 74–78).
Lu, Y., Gales, M. J. F., & Wang, Y. (2020). Spoken
language ‘grammatical error correction’. Proceedings of Interspeech
2020, 3840–3844.
Mann, W., & Thompson, S. (1988). Rhetorical
Structure Theory: Toward a functional theory of text organization. Text — Interdisciplinary
Journal for the Study of
Discourse, 8(3), 243–281.
Meurers, D. (2015). Learner
corpora and natural language processing. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The
Cambridge handbook of learner corpus
research (pp. 537–566). Cambridge University Press.
Moore, R., Caines, A., Graham, C., & Buttery, P. (2015). Incremental
dependency parsing and disfluency detection in spoken learner
English. In P. Král & V. Matoušek (Eds.), Text,
Speech, and Dialogue:
TSD 2015 (Vol. 9302, pp. 470–479). Springer.
Oberländer, L., & Klinger, R. (2020). Token
sequence labelling vs. clause classification for English emotion stimulus
detection. Proceedings of the Ninth Joint Conference on Lexical and Computational
Semantics (pp. 58–70). Association for Computational Linguistics.
Ostendorf, M., & Hahn, S. (2013). A
sequential repetition model for improved disfluency detection. Proceedings of Interspeech
2013, 2624–2628.
Passonneau, R. J., & Litman, D. (1997). Discourse
segmentation by human and automated means. Computational
Linguistics, 23(1), 103–139.
Pietrandrea, P., Kahane, S., Lacheret, A., & Sabio, F. (2014). The
notion of sentence and other discourse units in corpus
annotation. In T. Raso & H. Mello (Eds.), Spoken
corpora and linguistic
studies (pp. 331–364). John Benjamins.
Polanyi, L. (1988). A
formal model of the structure of discourse. Journal of
Pragmatics, 12(5–6), 601–638.
Qian, X., & Liu, Y. (2013). Disfluency
detection using multi-step stacked learning. Proceedings of the 2013 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies (pp. 820–825). NAACL.
Rocholl, J., Zayats, V., Walker, D., Murad, N., Schneider, A., & Liebling, D. (2021). Disfluency
detection with unlabeled data and small BERT models. Proceedings of Interspeech
2021, 766–770.
Römer, U., Roberson, A., O’Donnell, M. B., & Ellis, N. C. (2014). Linking
learner corpus and experimental data in studying second language learners’ knowledge of verb-argument
constructions. ICAME
Journal, 38(1), 115–135.
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A
simplest systematics for the organization of turn-taking for
conversation. Language, 50(4), 696–735.
Sanders, T., & Wijk, C. (1996). PISA
— A procedure for analyzing the structure of explanatory texts. Text &
Talk, 16(1), 91–132.
Schilperoord, J., & Verhagen, A. (1998). Conceptual
dependency and the clausal structure of discourse. In J. Koenig (Ed.), Discourse
and cognition: bridging the
gap (pp. 141–163). CSLI Publications.
Shriberg, E. E. (1994). Preliminaries
to a theory of speech disfluencies [Unpublished doctoral
dissertation]. University of California at Berkeley.
Skidmore, L. (2022). Incremental
disfluency detection for spoken learner English (Doctoral
dissertation). University of Sheffield.
Skidmore, L., & Moore, R. (2022). Incremental disfluency detection for spoken learner English. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022) (pp. 272–278). Association for Computational Linguistics.
Soricut, R., & Marcu, D. (2003). Sentence
level discourse parsing using syntactic and lexical information. Proceedings of the 2003
Human Language Technology Conference of the North American Chapter of the Association for Computational
Linguistics (pp. 228–235). NAACL 2003.
Stede, M. (2012). Small
discourse units and coherence relations. In G. Hirst (Ed.), Discourse
processing (pp. 79–127). Springer International Publishing.
Stede, M. (2020). Automatic
argumentation mining and the role of stance and sentiment. Journal of Argumentation in
Context, 9(1), 19–41.
Subba, R., & Di Eugenio, B. (2007). Automatic
discourse segmentation using neural networks. Proceedings of the 11th Workshop on the Semantics
and Pragmatics of
Dialogue (pp. 189–190). SEMDIAL.
Tofiloski, M., Brooke, J., & Taboada, M. (2009). A
syntactic and lexical-based discourse segmenter. In K.-Y. Su, J. Su, J. Wiebe, & H. Li (Eds.), Proceedings
of the ACL-IJCNLP 2009 Conference Short
Papers (pp. 77–80). Association for Computational Linguistics.
Van Enschot, R., Spooren, W., van den Bosch, A., Burgers, C., Degand, L., Evers-Vermeul, J., … & Maes, A. (2024). Taming
our wild data: On intercoder reliability in discourse research. Dutch Journal of Applied
Linguistics, 13, 1–24.
Van Hest, E., Poulisse, N., & Bongaerts, T. (1997). Self-repair
in L1 and L2 production: an overview. International Journal of Applied
Linguistics, 117(1), 85–115.
Wang, Y., Li, S., & Yang, J. (2018). Toward
fast and accurate neural discourse segmentation. Proceedings of the 2018 Conference on
Empirical Methods in Natural Language
Processing (pp. 962–967). Association
for Computational Linguistics.
Wierszycka, J. (2013). Phrasal
verbs in learner English: a semantic approach. A study based on a POS-tagged spoken corpus of learner
English. Research in Corpus
Linguistics, 1, 81–93.
Wu, S., Zhang, D., Zhou, M., & Zhao, T. (2015). Efficient
disfluency detection with transition-based parsing. Proceedings of the 53rd Annual Meeting of
the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing,
Volume 1: Long Papers (pp. 495–503). Association for Computational Linguistics.
Yu, J., Zhang, L., Wu, S., & Zhang, B. (2017). Rhythm
and disfluency: Interactions in Chinese L2 English speech. 2017 20th Conference of the Oriental
Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment
(O-COCOSDA), 1–6.
Zayats, V., Ostendorf, M., & Hajishirzi, H. (2016). Disfluency
detection using a bidirectional LSTM. Proceedings of
Interspeech 2016, 2523–2527.