Article published in: Chinese as a Second Language (漢語教學研究—美國中文教師學會學報)
Vol. 60:2 (2025) ► pp.79–108
Assessing the accuracy of Chinese speech-to-text tools for Chinese as foreign language learners
Published online: 8 April 2025
https://doi.org/10.1075/csl.24013.fen
Abstract
This article examines the effectiveness of four Chinese Speech-to-Text (CSTT) tools in transcribing the speech of Chinese as a Foreign Language (CFL) learners across the ACTFL proficiency levels. The results indicate notable differences in transcription accuracy. Among the CSTT tools, ChatGPT 3.5 proves the most accurate, followed by WeChat and Baidu IME, while iOS IME shows the lowest performance. Except for iOS IME, the tools achieve 100% accuracy at the Distinguished and Superior levels, where speech closely approximates native fluency. ChatGPT 3.5 excels from the Novice to the Distinguished level but occasionally overcorrects Novice-level CFL learners’ erroneous speech. WeChat performs robustly above the Novice level, while Baidu IME is best at the Advanced level and above. Conversely, iOS IME displays significant limitations at all levels. This study offers new perspectives on “good pronunciation” and on the debate over handwriting versus typing Chinese characters for CFL learners.
Chinese abstract (translation)
This article examines the effectiveness of four Chinese Speech-to-Text (CSTT) tools in transcribing the speech of Chinese as a Foreign Language (CFL) learners at different ACTFL proficiency levels. The results show notable differences in transcription accuracy. ChatGPT 3.5 is the most accurate, followed by WeChat and Baidu IME, while Apple’s iOS IME performs worst. Except for iOS IME, the tools reach 100% accuracy at the ACTFL Distinguished and Superior levels, because speech at these two levels approaches native fluency. ChatGPT 3.5 transcribes very well from the Novice through Distinguished levels, but occasionally overcorrects Novice learners’ erroneous speech. WeChat performs consistently above the Novice level, while Baidu IME works best at the Advanced level and above. Conversely, iOS IME shows significant limitations at all levels. The study also offers new perspectives on the definition of “good pronunciation” and on the debate over whether CFL learners should handwrite or type Chinese characters.
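Comparing transcription accuracy across tools and proficiency levels, as the abstract describes, presumes a way to score each transcription against a reference text. The paper's exact scoring procedure (Section 5.3) is not reproduced on this page; the sketch below shows one common metric for character-based languages, character error rate (CER), computed via edit distance. All function names and the example sentences are illustrative assumptions, not the study's materials.

```python
def levenshtein(ref: str, hyp: str) -> int:
    # Classic dynamic-programming edit distance over characters:
    # the minimum number of insertions, deletions, and substitutions
    # needed to turn the hypothesis into the reference.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    # Character error rate: edit operations per reference character.
    # Accuracy can then be reported as 1 - CER (floored at 0).
    return levenshtein(ref, hyp) / max(len(ref), 1)

# One substituted character out of four reference characters:
print(cer("我喜欢猫", "我喜欢狗"))  # → 0.25
```

A perfect transcription yields CER 0.0 (i.e., the "100% accuracy" the abstract reports at the Superior and Distinguished levels); higher values indicate more transcription errors per character.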
Article outline
- 1. Introduction
- 2. Literature review
- 3. Four CSTT tools
- 4. Research questions
- 5. Methodology
- 5.1 Data collection
- 5.2 Data analysis procedure
- 5.3 Quantifying and qualifying the accuracy of CSTT tools
- 6. Findings
- 6.1 Superior and distinguished level
- 6.2 Advanced level
- 6.3 Intermediate level
- 6.4 Novice level
- 7. Discussion
- 7.1 Performance variation of CSTT tools across proficiency levels
- 7.2 ChatGPT’s transcription accuracy and correction capabilities
- 7.3 WeChat and Baidu IME’s efficacy and educational implications
- 8. Pedagogical implications and further studies
- 8.1 Integrating AI-assisted CSTT in Chinese language education
- 8.2 Reevaluating the emphasis on pronunciation accuracy and the debate over handwriting Chinese characters through CSTT
- 8.3 Further study
- 9. Conclusion