Article published in: Chinese as a Second Language (漢語教學研究—美國中文教師學會學報)
Vol. 60:2 (2025) ► pp.79–108
Assessing the accuracy of Chinese speech-to-text tools for Chinese as foreign language learners
Published online: 8 April 2025
https://doi.org/10.1075/csl.24013.fen
Abstract
This article examines the effectiveness of four Chinese Speech-to-Text (CSTT) tools in transcribing the speech of Chinese as a Foreign Language (CFL) learners across the ACTFL proficiency levels. The results indicate notable differences in transcription accuracy. Among the CSTT tools, ChatGPT 3.5 proves the most accurate, followed by WeChat and Baidu IME, while iOS IME shows the lowest performance. Except for iOS IME, the tools achieve 100% accuracy at the Distinguished and Superior levels, where speech closely approximates native fluency. ChatGPT 3.5 excels from the Novice to the Distinguished level but occasionally overcorrects Novice-level CFL learners’ erroneous speech. WeChat performs robustly above the Novice level, while Baidu IME is best at the Advanced level and above. Conversely, iOS IME displays significant limitations at all levels. This study offers new perspectives on “good pronunciation” and on the debate over handwriting versus typing Chinese characters for CFL learners.
Chinese abstract (translation)
This article examines the effectiveness of four Chinese Speech-to-Text (CSTT) tools in transcribing the speech of Chinese as a Foreign Language (CFL) learners at different ACTFL proficiency levels. The results show notable differences in transcription accuracy. ChatGPT 3.5 is the most accurate, followed by WeChat and Baidu IME, while Apple’s iOS IME performs worst. Except for iOS IME, the tools reach 100% accuracy at the ACTFL Distinguished and Superior levels, because speech at these two levels approaches native fluency. ChatGPT 3.5 transcribes very well from the Novice through Distinguished levels, but occasionally overcorrects Novice learners’ erroneous speech. WeChat performs consistently above the Novice level, while Baidu IME works best at the Advanced level and above. Conversely, iOS IME shows significant limitations at all levels. The study also offers new perspectives on the definition of “good pronunciation” and on the debate over whether CFL learners should handwrite or type Chinese characters.
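Comparing transcription accuracy across tools and proficiency levels, as the abstract describes, presumes a way to score each transcription against a reference text. The paper's exact scoring procedure (Section 5.3) is not reproduced on this page; the sketch below shows one common metric for character-based languages, character error rate (CER), computed via edit distance. All function names and the example sentences are illustrative assumptions, not the study's materials.

```python
def levenshtein(ref: str, hyp: str) -> int:
    # Classic dynamic-programming edit distance over characters:
    # the minimum number of insertions, deletions, and substitutions
    # needed to turn the hypothesis into the reference.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    # Character error rate: edit operations per reference character.
    # Accuracy can then be reported as 1 - CER (floored at 0).
    return levenshtein(ref, hyp) / max(len(ref), 1)

# One substituted character out of four reference characters:
print(cer("我喜欢猫", "我喜欢狗"))  # → 0.25
```

A perfect transcription yields CER 0.0 (i.e., the "100% accuracy" the abstract reports at the Superior and Distinguished levels); higher values indicate more transcription errors per character.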
Article outline
- 1. Introduction
- 2. Literature review
- 3. Four CSTT tools
- 4. Research questions
- 5. Methodology
- 5.1 Data collection
- 5.2 Data analysis procedure
- 5.3 Quantifying and qualifying the accuracy of CSTT tools
- 6. Findings
- 6.1 Superior and distinguished level
- 6.2 Advanced level
- 6.3 Intermediate level
- 6.4 Novice level
- 7. Discussion
- 7.1 Performance variation of CSTT tools across proficiency levels
- 7.2 ChatGPT’s transcription accuracy and correction capabilities
- 7.3 WeChat and Baidu IME’s efficacy and educational implications
- 8. Pedagogical implications and further studies
- 8.1 Integrating AI-assisted CSTT in Chinese language education
- 8.2 Reevaluating the emphasis on pronunciation accuracy and the debate over handwriting Chinese characters through CSTT
- 8.3 Further study
- 9. Conclusion