Article published in: Journal of Second Language Pronunciation
Vol. 11:2 (2025) ► pp.213–239
Exploring automatic speech recognition for corrective and confirmative pronunciation feedback
Published online: 1 April 2025
https://doi.org/10.1075/jslp.24035.joh
Abstract
Given that second language pronunciation errors are typically variable, learners would benefit from feedback that
both flags errors (corrective feedback) and confirms correct pronunciation (confirmative feedback). We investigated
the transcription accuracy of Google Translate (GT) automatic speech recognition (ASR) to determine its capacity to
provide such feedback, based on Quebec francophone recordings of correctly and incorrectly realized English
th-initial, h-initial, and vowel-initial items in predictable and unpredictable sentence contexts. Recordings from
male and female speakers were used to test for possible gender bias. In predictable contexts, transcription accuracy
rates were higher for correct than for incorrect pronunciations; in unpredictable contexts, rates for both correct
and incorrect pronunciations fell midway between these two values. GT ASR is thus better suited to providing
confirmative feedback in predictable contexts and corrective feedback in unpredictable contexts. Regardless of
context, accuracy was considerably higher on errors yielding real-word rather than nonword output. Contrary to the
anticipated pattern, female speakers were transcribed more accurately than male speakers.
Article outline
- 1. Introduction
- 2. Background
- 2.1 Theoretical framework
- 2.2 ASR for pronunciation feedback
- 2.3 The current study: Research question and hypotheses
- 3. Method
- 3.1 Data collection material and procedure
- 3.1.1 Phase I — Predictable contexts
- 3.1.2 Phase II — Unpredictable contexts
- 3.2 Data analysis
- 4. Results
- 4.1 Phase I — Predictable contexts
- 4.2 Phase II — Unpredictable contexts
- 5. Discussion
- 6. Pedagogical implications
- 7. Limitations
- 8. Conclusion
- Acknowledgements
- Acknowledgements
