Article published in: Journal of Second Language Pronunciation
Vol. 11:3 (2025), pp. 394–422
Assessing the efficacy of word error rate as a proxy for pronunciation quality
A comparative study of ASR systems and human evaluations among young EFL learners
Published online: 7 July 2025
https://doi.org/10.1075/jslp.25012.won
Abstract
This study examines the validity of word error rate (WER) as a proxy for pronunciation quality in English as a foreign language (EFL) contexts. Human ratings of comprehensibility and accentedness were compared with WER and automated pronunciation scores from six automatic speech recognition (ASR) systems (Kaldi, wav2vec 2.0, HuBERT, Whisper Base, Whisper Large-v3, and Microsoft Azure) using 190 read-aloud recordings by Korean elementary learners. With respect to pronunciation scoring, Azure’s phoneme-level accuracy scores correlated moderately with human judgments, while Kaldi’s goodness of pronunciation (GOP) scores showed no meaningful association. Analysis of WER revealed a critical trade-off between ASR accuracy and perceptual sensitivity: high-performing systems such as Whisper Large-v3 and Azure produced near-zero WERs that correlated only weakly with human ratings. In contrast, mid-performing systems such as Whisper Base and HuBERT showed stronger correlations, indicating that moderate WER values may better reflect pronunciation variation. These results underscore the limitations of WER in advanced ASR systems and the need for perceptually grounded, interpretable metrics.
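For readers unfamiliar with the metric, WER is conventionally defined as the number of word-level substitutions, deletions, and insertions needed to turn the ASR transcript into the reference text, divided by the number of reference words. The following is a minimal illustrative sketch of that conventional definition, not the study's own scoring pipeline; the function name, the lowercase-and-split normalization, and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Conventional WER: word-level edit distance / number of reference words.

    Normalization here (lowercasing, whitespace split) is an assumption;
    real evaluations often also strip punctuation and normalize numerals.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical read-aloud prompt and two hypothetical ASR transcripts.
prompt = "the cat sat on the mat"
perfect = word_error_rate(prompt, "the cat sat on the mat")  # 0.0
partial = word_error_rate(prompt, "the cat sit on mat")      # 2 errors / 6 words
```

A transcript that matches the prompt scores 0 however accented the recording was, which is consistent with the abstract's observation that systems producing near-zero WERs leave little variation to correlate with human ratings.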
Article outline
- 1. Introduction
- 2. Literature review
- 2.1 Comprehensibility and accentedness
- 2.2 The evolution of automatic speech recognition technology
- 2.3 Word error rate in pronunciation assessment
- 3. Method
- 3.1 Dataset
- 3.2 Raters and rating criteria
- 3.3 ASR algorithms used
- 3.4 Procedures
- 3.5 Statistical analysis
- 4. Results
- 4.1 Comparison of human ratings and ASR-based ratings
- 4.2 Relationship between pronunciation scores and WER
- 5. Discussion
- 6. Conclusion
- Declaration of generative AI and AI-assisted technologies in the writing process
- Data availability statement
- Acknowledgements
