Developing and validating a methodology for crowdsourcing L2 speech ratings in Amazon Mechanical Turk

Nagle, Charles

doi:10.1075/jslp.18016.nag

Article published in: Journal of Second Language Pronunciation
Vol. 5:2 (2019) ► pp. 294–323

Charles Nagle | Iowa State University

Published online: 17 September 2019

https://doi.org/10.1075/jslp.18016.nag

Abstract

Researchers have increasingly turned to Amazon Mechanical Turk (AMT) to crowdsource speech data, predominantly in English. Although AMT and similar platforms are well positioned to enhance the state of the art in L2 research, it is unclear if crowdsourced L2 speech ratings are reliable, particularly in languages other than English. The present study describes the development and deployment of an AMT task to crowdsource comprehensibility, fluency, and accentedness ratings for L2 Spanish speech samples. Fifty-four AMT workers who were native Spanish speakers from 11 countries participated in the ratings. Intraclass correlation coefficients were used to estimate group-level interrater reliability, and Rasch analyses were undertaken to examine individual differences in rater severity and fit. Excellent reliability was observed for the comprehensibility and fluency ratings, but indices were slightly lower for accentedness, leading to recommendations to improve the task for future data collection.

Keywords: research methods, speech ratings, Spanish, reliability, many-facet Rasch measurement
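
The group-level interrater reliability estimate mentioned in the abstract can be illustrated with a short sketch. The abstract does not say which intraclass correlation form was used, so this example assumes the two-way random-effects, absolute-agreement coefficients for a complete samples-by-raters matrix. The function name `icc_two_way_random` and the small ratings matrix are illustrative and are not taken from the study.

```python
def icc_two_way_random(ratings):
    """Return (single-rater ICC, average-rater ICC) for a complete matrix.

    Assumption: two-way random effects, absolute agreement; rows are rated
    targets (e.g. speech samples) and columns are raters.
    """
    n = len(ratings)      # number of rated targets
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)               # between-targets mean square
    jms = ss_cols / (k - 1)               # between-raters mean square
    ems = ss_err / ((n - 1) * (k - 1))    # residual mean square

    single = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    average = (bms - ems) / (bms + (jms - ems) / n)
    return single, average


# Hypothetical ratings: 6 targets (rows) rated by 4 raters (columns).
EXAMPLE_RATINGS = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

single, average = icc_two_way_random(EXAMPLE_RATINGS)
print(round(single, 2), round(average, 2))  # prints 0.29 0.62
```

The average-rater coefficient is the one that reflects the reliability of a pooled group of raters, which is why it is higher than the single-rater value for the same matrix.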

Article outline

1. Introduction
2. Background
- 2.1 Online and laboratory L2 speech ratings
- 2.2 Demographics of AMT workers
- 2.3 The current study
3. Method
- 3.1 Speech samples
- 3.2 Development and deployment of the AMT HITs
- 3.3 AMT workers
4. Results
- 4.1 Attention checks and near-native control samples
- 4.2 Reliability
- 4.3 Rasch modeling
  - 4.3.1 Calibration of speakers, raters, time, and tasks
  - 4.3.2 Rater fit
  - 4.3.3 Rating scale use and structure
- 4.4 Rater characteristics: Age, gender, and education
5. Discussion
- 5.1 Data quality and reliability
- 5.2 Rater background
- 5.3 Recommendations for improving the AMT task
6. Conclusion
Notes
References

Cited by (24)

Jia, Ruirui, Ekaterina Sudina & Kejun Du

2025. Do data collection methods matter for self-reported L2 individual differences questionnaires? In-person vs crowdsourced data. Research Methods in Applied Linguistics 4:3 ► pp. 100235 ff.

Tymbay, Alexey

2025. Non-native (Czech and Russian L1) auditor assessments of some English suprasegmental features: Prominence and pitch accents. Speech Communication 173 ► pp. 103281 ff.

Ghaffarvand-Mokari, Payam

2024. Effects of listeners’ dialectal variation on comprehensibility and accentedness judgements of second language speech. Journal of Second Language Pronunciation 10:1 ► pp. 35 ff.

Kim, Kathy Minhye, Xiaoyi Liu, Daniel R. Isbell & Xiaobin Chen

2024. A comparison of lab- and web-based elicited imitation: Insights from explicit-implicit L2 grammar knowledge and L2 proficiency. Studies in Second Language Acquisition 46:3 ► pp. 946 ff.

Levis, John M.

2024. Key issues in L2 pronunciation research. Journal of Second Language Pronunciation 10:3 ► pp. 293 ff.

Sonsaat-Hegelheimer, Sinem & Şebnem Kurt

2024. The impact of generative AI-powered chatbots on L2 comprehensibility. Journal of Second Language Pronunciation 10:3 ► pp. 339 ff.

Tekin, Oguzhan & Pavel Trofimovich

2024. Local residents’ attitudes toward and contact with international students: a perspective from Montreal, Quebec. Frontiers in Psychology 15

Tsunemoto, Aki & Pavel Trofimovich

2024. Coherence and comprehensibility in second language speakers’ academic speaking performance. Studies in Second Language Acquisition 46:3 ► pp. 795 ff.

Dalman, Mohammadreza & Okim Kang

2023. Validity evidence: Undergraduate students’ perceptions of TOEFL iBT high score spoken responses. International Journal of Listening 37:2 ► pp. 113 ff.

Gallant, Jordan

2023. Typed transcription as a simultaneous measure of foreign-accent comprehensibility and intelligibility: An online replication study. Research Methods in Applied Linguistics 2:2 ► pp. 100055 ff.

Nagle, Charlie, Pavel Trofimovich, Oguzhan Tekin & Kim McDonough

2023. Framing second language comprehensibility: Do interlocutors’ ratings predict their perceived communicative experience? Applied Psycholinguistics 44:1 ► pp. 131 ff.

Olson, Daniel J.

2023. Measuring bilingual language dominance: An examination of the reliability of the Bilingual Language Profile. Language Testing 40:3 ► pp. 521 ff.

Olson, Daniel J.

2024. The Bilingual Code-Switching Profile (BCSP). Linguistic Approaches to Bilingualism 14:3 ► pp. 400 ff.

Olson, Daniel J.

2024. Bilingual language experience and code-switching acceptability judgments: A constructive replication of the work by Stadthagen-González et al. (2019), Balam et al. (2020), and Stadthagen-González et al. (2018). International Journal of Bilingualism

Tsunemoto, Aki, Mark McAndrews, Pavel Trofimovich & Eric Friginal

2023. Listener perceptions of customer service agents’ performance. Journal of Second Language Pronunciation 9:2 ► pp. 234 ff.

Tsunemoto, Aki, Pavel Trofimovich & Sara Kennedy

2023. Pre-service teachers’ beliefs about second language pronunciation teaching, their experience, and speech assessments. Language Teaching Research 27:1 ► pp. 115 ff.

Tsunemoto, Aki, Pavel Trofimovich, Josée Blanchet, Juliane Bertrand & Sara Kennedy

2022. Effects of benchmarking and peer‐assessment on French learners' self‐assessments of accentedness, comprehensibility, and fluency. Foreign Language Annals 55:1 ► pp. 135 ff.

Huensch, Amanda & Charlie Nagle

2021. The Effect of Speaker Proficiency on Intelligibility, Comprehensibility, and Accentedness in L2 Spanish: A Conceptual Replication and Extension of Munro and Derwing (1995a). Language Learning 71:3 ► pp. 626 ff.

Nagle, Charles L.

2021. Assessing the state of the art in longitudinal L2 pronunciation research. Journal of Second Language Pronunciation 7:2 ► pp. 154 ff.

Nagle, Charles L. & Ivana Rehman

2021. Doing L2 speech research online: Why and how to collect online ratings data. Studies in Second Language Acquisition 43:4 ► pp. 916 ff.

Saito, Kazuya, Yui Suzukida, Mai Tran & Adam Tierney

2021. Domain‐General Auditory Processing Partially Explains Second Language Speech Learning in Classroom Settings: A Review and Generalization Study. Language Learning 71:3 ► pp. 669 ff.

Kobayashi, Aozora, Ian Wilson & D. Roy

2020. Using deep learning to classify English native pronunciation level from acoustic information. SHS Web of Conferences 77 ► pp. 02004 ff.

Nagle, Charles L. & Amanda Huensch

2020. Expanding the scope of L2 intelligibility research. Journal of Second Language Pronunciation 6:3 ► pp. 329 ff.

Nagle, Charles L. & Amanda Huensch

2022. Expanding the scope of L2 intelligibility research. In The Evolution of Pronunciation Teaching and Research [Benjamins Current Topics, 121], ► pp. 51 ff.

This list is based on CrossRef data as of 13 November 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.