Article published in: Journal of Second Language Pronunciation
Vol. 11:2 (2025) ► pp.213–239
Exploring automatic speech recognition for corrective and confirmative pronunciation feedback
Published online: 1 April 2025
https://doi.org/10.1075/jslp.24035.joh
Abstract
Given that second language pronunciation errors are typically variable, learners would benefit from feedback that
both flags errors (corrective feedback) and confirms correct pronunciation (confirmative feedback). We investigated
the transcription accuracy of Google Translate (GT) automatic speech recognition (ASR) to determine its capacity to
provide such feedback, based on Quebec francophone recordings of correctly and incorrectly realized English
th-initial, h-initial, and vowel-initial items in predictable and unpredictable sentence contexts. Recordings from
male and female speakers were used to test for possible gender bias. In predictable contexts, transcription accuracy
rates were higher for correct than for incorrect pronunciations; in unpredictable contexts, rates for both correct
and incorrect pronunciations fell midway between these two values. GT ASR is thus better suited to providing
confirmative feedback in predictable contexts and corrective feedback in unpredictable contexts. Regardless of
context, accuracy was considerably higher on errors yielding real-word rather than nonword output. Contrary to the
anticipated pattern, female speakers were transcribed more accurately than male speakers.
Article outline
- 1. Introduction
- 2. Background
- 2.1 Theoretical framework
- 2.2 ASR for pronunciation feedback
- 2.3 The current study: Research question and hypotheses
- 3. Method
- 3.1 Data collection material and procedure
- 3.1.1 Phase I — Predictable contexts
- 3.1.2 Phase II — Unpredictable contexts
- 3.2 Data analysis
- 4. Results
- 4.1 Phase I — Predictable contexts
- 4.2 Phase II — Unpredictable contexts
- 5. Discussion
- 6. Pedagogical implications
- 7. Limitations
- 8. Conclusion
- Acknowledgements
- Acknowledgements
