Article published in: English World-Wide
Vol. 46:3 (2025), pp. 274–298
The YouTube corpus of Singapore English podcasts
Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.
For any use beyond this license, please contact the publisher at rights@benjamins.nl.
This article was made Open Access under a CC BY-NC 4.0 license through payment of an APC by or on behalf of the authors.
Published online: 9 September 2025
https://doi.org/10.1075/eww.25018.coa
Abstract
Recent advances in streaming protocols and automatic speech recognition (ASR) have enabled large-scale spoken
language corpora, yet research on Singapore English remains constrained by small or text-based datasets. The YouTube Corpus of
Singapore English Podcasts (YCSEP) addresses this gap with 620 hours of transcribed, diarized speech from over 1,300 podcast
episodes by Singapore-based content creators. YCSEP supports the empirical analysis of phonetics, morphosyntax, and discourse,
enabling the study of low-frequency features like discourse particles and reduplication. The dataset reflects informal,
spontaneous speech from diverse speakers and facilitates investigation into nativization and endonormative stabilization processes
in postcolonial English. Built using a pipeline of yt-dlp, WhisperX, and Pyannote, YCSEP offers robust empirical grounding for
linguistic features such as verb complementation and modality. It also contributes to broader theoretical discussions on areal
norms and construction grammar in World Englishes.
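To illustrate how a transcription and diarization pipeline of the kind named above can yield speaker-attributed text, the following minimal sketch labels each transcript segment with the speaker whose diarization turns overlap it most in time. It is an illustrative assumption, not the authors' code and not the API of WhisperX or Pyannote: the function names, the dictionary layout (`start`, `end`, `text`, `speaker`), and the toy data are all hypothetical.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turns overlap it most.

    segments: dicts with 'start', 'end', 'text' (hypothetical ASR output layout)
    turns:    dicts with 'start', 'end', 'speaker' (hypothetical diarization layout)
    """
    labeled = []
    for seg in segments:
        # Sum the overlap per speaker, since a speaker may have several turns.
        totals = defaultdict(float)
        for turn in turns:
            totals[turn["speaker"]] += overlap(
                seg["start"], seg["end"], turn["start"], turn["end"]
            )
        # Leave the segment unlabeled if no turn overlaps it at all.
        if totals and max(totals.values()) > 0:
            best = max(totals, key=totals.get)
        else:
            best = None
        labeled.append({**seg, "speaker": best})
    return labeled


# Hypothetical toy data, invented for illustration only.
segments = [
    {"start": 0.0, "end": 2.5, "text": "first utterance"},
    {"start": 2.6, "end": 5.0, "text": "second utterance"},
]
turns = [
    {"start": 0.0, "end": 2.4, "speaker": "SPEAKER_00"},
    {"start": 2.4, "end": 5.2, "speaker": "SPEAKER_01"},
]

labeled = assign_speakers(segments, turns)
for seg in labeled:
    print(seg["speaker"], seg["text"])
```

Summing overlap per speaker rather than taking the single longest turn keeps the assignment stable when a diarizer splits one speaker's contribution into several short turns.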
Article outline
- 1. Introduction
- 2. Existing SgE resources
- 2.1 Written and mixed corpora
- 2.2 Speech corpora
- 2.3 Evaluation of existing corpora and databases
- 3. Corpus creation: Data and methods
- 3.1 Podcasts included in the corpus
- 3.2 Data harvesting and corpus creation
- 4. Corpus analysis of SgE features
- 4.1 Features associated with SgE
- 4.2 Other features
- 5. Potential future use cases
- 5.1 Features associated with spoken language which are uncommon in writing
- 5.2 Phonetic features
- 5.3 Social variation in SgE
- 6. Caveats, summary, and outlook
- 6.1 Limitations and caveats
- 6.2 Summary and outlook
- Acknowledgments
- Notes
