Article published in: Journal of Second Language Pronunciation
Vol. 5:2 (2019), pp. 294–323
Developing and validating a methodology for crowdsourcing L2 speech ratings in Amazon Mechanical Turk
Published online: 17 September 2019
https://doi.org/10.1075/jslp.18016.nag
Abstract
Researchers have increasingly turned to Amazon Mechanical Turk (AMT) to crowdsource speech data, predominantly in
English. Although AMT and similar platforms are well positioned to enhance the state of the art in L2 research, it is unclear if
crowdsourced L2 speech ratings are reliable, particularly in languages other than English. The present study describes the
development and deployment of an AMT task to crowdsource comprehensibility, fluency, and accentedness ratings for L2 Spanish
speech samples. Fifty-four AMT workers who were native Spanish speakers from 11 countries participated in the ratings. Intraclass
correlation coefficients were used to estimate group-level interrater reliability, and Rasch analyses were undertaken to examine
individual differences in rater severity and fit. Excellent reliability was observed for the comprehensibility and fluency
ratings, but indices were slightly lower for accentedness, leading to recommendations to improve the task for future data
collection.
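As a rough illustration of the group-level reliability index mentioned in the abstract, the sketch below computes intraclass correlation coefficients from a complete samples-by-raters table. It assumes the two-way random-effects forms ICC(2,1) and ICC(2,k) from Shrout and Fleiss (1979); the abstract does not state which ICC form the study used, so this choice is an assumption. The function name `icc_two_way_random` is hypothetical, and the numbers are the worked example from Shrout and Fleiss, not ratings from this study.

```python
from statistics import mean


def icc_two_way_random(ratings):
    """Return (ICC(2,1), ICC(2,k)) for a complete samples x raters table.

    ratings: list of rows, one row per speech sample, one column per rater.
    Assumes every rater scored every sample (no missing cells).
    """
    n = len(ratings)     # number of rated samples (targets)
    k = len(ratings[0])  # number of raters

    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares: samples (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-samples mean square
    msc = ss_cols / (k - 1)               # between-raters mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square

    # Single-rater and average-of-k-raters reliability (absolute agreement).
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Worked example from Shrout and Fleiss (1979): 6 targets rated by 4 judges.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

single, average = icc_two_way_random(example)
print(round(single, 2), round(average, 2))  # prints: 0.29 0.62
```

The average-measures value is the one that speaks to group-level reliability, since it describes the consistency of the mean rating across all raters rather than that of any single rater.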
Article outline
- 1. Introduction
- 2. Background
- 2.1 Online and laboratory L2 speech ratings
- 2.2 Demographics of AMT workers
- 2.3 The current study
- 3. Method
- 3.1 Speech samples
- 3.2 Development and deployment of the AMT HITs
- 3.3 AMT workers
- 4. Results
- 4.1 Attention checks and near-native control samples
- 4.2 Reliability
- 4.3 Rasch modeling
- 4.3.1 Calibration of speakers, raters, time, and tasks
- 4.3.2 Rater fit
- 4.3.3 Rating scale use and structure
- 4.4 Rater characteristics: Age, gender, and education
- 5. Discussion
- 5.1 Data quality and reliability
- 5.2 Rater background
- 5.3 Recommendations for improving the AMT task
- 6. Conclusion
- Notes
- References
