Article published in: Journal of Second Language Pronunciation
Vol. 11:1 (2025), pp. 46–75
Exploring large language models for L2 metaphonological awareness tutoring
Published online: 2 June 2025
https://doi.org/10.1075/jslp.24030.lod
Abstract
This is the first observational study to evaluate the feasibility of implementing large language models (LLMs) for
second language (L2) metaphonological awareness training. A custom implementation of GPT-4 acting as a homework tutor was piloted
in an English phonetics and phonology course for first-year university students. Two novel homework assignments were designed to
leverage the LLM’s strengths and explore its weaknesses. Analysis of learner interaction logs, homework reflections, and survey
data revealed that most learners perceived the AI Tutor as helpful for its personalised explanations. However, the overall
sentiment was mixed due to the LLM’s propensity for confabulation. Despite these challenges, the pilot demonstrated the potential
for LLMs to engage learners in active and self-regulated learning. Recommendations for future directions include designing
LLM-based learning environments, promoting AI literacy among educators and learners, and experimentally researching long-term
effects of AI tutors on learning outcomes.
Keywords: CAPT, artificial intelligence, AI, L2 pronunciation.
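The abstract describes a custom implementation of GPT-4 acting as a homework tutor. As a purely illustrative aside, a minimal sketch of how such a tutor might be configured in the OpenAI chat format is shown below; the system prompt wording, function name, and parameter choices are hypothetical assumptions for illustration, not the authors' actual implementation.

```python
# Hypothetical sketch of a custom GPT-4 homework tutor configuration.
# The prompt text, helper name, and settings are illustrative only.

TUTOR_SYSTEM_PROMPT = (
    "You are a patient tutor for a first-year university course in "
    "English phonetics and phonology. Explain concepts such as "
    "allophones, word stress, and transcription step by step, and "
    "ask the learner to attempt an answer before revealing it."
)

def build_tutor_request(learner_message, history=None, model="gpt-4"):
    """Assemble a chat-completion payload for an AI Tutor turn."""
    messages = [{"role": "system", "content": TUTOR_SYSTEM_PROMPT}]
    messages.extend(history or [])  # prior turns from the interaction log
    messages.append({"role": "user", "content": learner_message})
    return {"model": model, "messages": messages, "temperature": 0.3}

# Example learner query, as might appear in an interaction log
request = build_tutor_request(
    "Why is the vowel in 'bead' longer than in 'beat'?"
)
```

In a deployment like the one piloted here, the returned payload would be sent to the chat-completions endpoint, with each exchange appended to `history` so the tutor can personalise its explanations across turns.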
Article outline
- 1. Introduction
- 1.1 Prior use of AI in second language pronunciation acquisition
- 1.2 Potential of large language models as tutors
- 1.3 Research questions
- 2. Study design
- 2.1 Course profile
- 2.2 Participants
- 2.3 AI tutor design
- 2.4 Pilot deployment
- 2.5 Activity design
- 2.6 Data collection
- 2.7 Data analysis
- 3. Results
- 3.1 General usage patterns
- 3.2 Homework activity 1
- 3.3 Homework activity 2
- 3.4 Post-pilot survey
- 4. Discussion
- 4.1 RQ1: Implementation feasibility
- 4.2 RQ2: Accuracy and helpfulness
- 4.3 RQ3: Active and self-regulated learning
- 4.4 Limitations
- 4.5 Future directions
- 4.5.1 Custom implementations
- 4.5.2 Raising teacher and learner awareness
- 4.5.3 Educational research
- 5. Conclusion
- Acknowledgements