Inter-rater reliability in Learner Corpus Research: Insights from a collaborative study on adverb placement

Larsson, Tove; Paquot, Magali; Plonsky, Luke

doi:10.1075/ijlcr.20001.lar

Article published in: International Journal of Learner Corpus Research
Vol. 6:2 (2020), pp. 237–251

Materials & Methods Report

Tove Larsson | Uppsala University

Magali Paquot | FNRS | UCLouvain

Luke Plonsky | Northern Arizona University

Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Published online: 10 December 2020

Abstract

In Learner Corpus Research (LCR), manual coding and annotation of linguistic features are a common source of error. To estimate the amount of error present in a coded dataset, coefficients of inter-rater reliability are used. However, despite the importance of reliability and internal consistency for validity and, by extension, study quality, interpretability and generalizability, it is surprisingly uncommon for studies in the field of LCR to report such reliability coefficients. In this Methods Report, we use a recent collaborative research project to illustrate the pertinence of considering inter-rater reliability. In doing so, we hope to initiate methodological discussion on instrument design, piloting and evaluation. We also suggest some ways forward to encourage increased transparency in reporting practices.

Keywords: inter-rater reliability, coding errors, reporting practices, study quality, Fleiss’ kappa
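
As an illustration of the kind of coefficient the abstract refers to, the following is a minimal sketch of Fleiss' kappa, the measure named in the keywords, for several coders assigning one nominal category to each item. It is not the authors' procedure or data: the function name `fleiss_kappa`, the input layout (one list of labels per item), and the example labels `initial`, `medial` and `final` are assumptions made for this sketch only.

```python
from collections import Counter
import math


def fleiss_kappa(ratings):
    """Fleiss' kappa for nominal codes.

    `ratings` is a list of items; each item is a list with one category
    label per rater. Every item must have the same number of raters (>= 2).
    Returns NaN when chance agreement is 1 (only one category ever used).
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("at least one item is required")
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    # Per-item category counts, and overall category totals.
    per_item = [Counter(item) for item in ratings]
    totals = Counter()
    for counts in per_item:
        totals.update(counts)

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    pairs = n_raters * (n_raters - 1)
    p_items = [
        (sum(c * c for c in counts.values()) - n_raters) / pairs
        for counts in per_item
    ]
    p_observed = sum(p_items) / n_items

    # Expected (chance) agreement from the overall category proportions.
    n_ratings = n_items * n_raters
    p_expected = sum((c / n_ratings) ** 2 for c in totals.values())

    if math.isclose(p_expected, 1.0):
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical codes from three coders for four items (illustrative only).
example = [
    ["initial", "initial", "initial"],
    ["initial", "initial", "medial"],
    ["initial", "medial", "medial"],
    ["medial", "medial", "medial"],
]
kappa_example = fleiss_kappa(example)
print(round(kappa_example, 3))
```

In this toy input the coders agree fully on two items and split two-to-one on the other two, so observed agreement is 2/3 against a chance level of 1/2. Established implementations exist in statistical packages and would normally be preferred over a hand-written function in a real study.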

Article outline

1. Introduction
2. Working towards increased reliability in a study on adverb placement
- 2.1 The coding scheme
- 2.2 Piloting the coding scheme and estimating inter-rater reliability
- 2.3 Revising the coding scheme
- 2.4 From a single-coder to a double-coder approach
3. Conclusion and ways forward
Note
Acknowledgements
Notes
References

Cited by 17 other publications

Larsson, Tove, Marcus Callies, Tülay Dixon, Hilde Hasselgård, Nicole Hober, Natalia Judith Laso, Sanne van Vuuren, Isabel Verdaguer & Magali Paquot

2025. Adverb placement in L1 and L2 spoken production. International Journal of Corpus Linguistics 30:1 ► pp. 79 ff.

Song, Yingming & Jiajin Xu

2025. Variation in phrase frame structure and function in argumentative writing by EFL learners across different L1 backgrounds. International Journal of Applied Linguistics 35:1 ► pp. 380 ff.

Chong, Sin Wang & Luke Plonsky

2024. A typology of secondary research in Applied Linguistics. Applied Linguistics Review 15:4 ► pp. 1569 ff.

Demir, Nur Yağmur, Ryan Bartholomew & Tove Larsson

2024. “I’m on retreat and will respond to messages after 7/6”. Register Studies 6:2 ► pp. 175 ff.

Kim, Minjin, Xixin Qiu & Yuanheng (Arthur) Wang

2024. Interrater agreement in genre analysis: A methodological review and a comparison of three measures. Research Methods in Applied Linguistics 3:1 ► pp. 100097 ff.

Listanti, Andrea & Jacopo Torregrossa

2024. The development of postverbal subjects in L2 Italian: A multifactorial corpus analysis. Applied Psycholinguistics 45:1 ► pp. 180 ff.

Minnillo, Sophia, Claudia Sánchez-Gutiérrez, Ana Ruiz-Alonso-Bartol, Emily Morgan & Carmen González Gómez

2024. Predictors of accuracy in L2 Spanish preterit-imperfect production. International Journal of Learner Corpus Research 10:2 ► pp. 301 ff.

Paquot, Magali

2024. Learner corpus research: a critical appraisal and roadmap for contributing (more) to SLA research agendas. Corpus Linguistics and Linguistic Theory 20:3 ► pp. 567 ff.

Rosemeyer, Malte

2024. Data-driven identification of situated meanings in corpus data using Latent Class Analysis. Open Linguistics 10:1

Hober, Nicole, Tülay Dixon & Tove Larsson

2023. Towards increased reliability and transparency in projects with manual linguistic coding. Corpora 18:2 ► pp. 245 ff.

Love, Robbie & Anna-Brita Stenstrom

2023. Corpus-pragmatic perspectives on the contemporary weakening of fuck: The case of teenage British English conversation. Journal of Pragmatics 216 ► pp. 167 ff.

Rygg, Kristin & Stine Hulleberg Johansen

2023. When the Norwegian ‘politeness marker’ vennligst becomes impolite. Journal of Politeness Research 19:2 ► pp. 439 ff.

Hoffmann, Tim

2022. Measuring lexical accuracy. In Complexity, Accuracy and Fluency in Learner Corpus Research [Studies in Corpus Linguistics, 104], ► pp. 159 ff.

Kim, YouJin & Laura Gurzynski-Weiss

2022. Contributing to the advancement of the field:. In Research methods in instructed second language acquisition [Research Methods in Applied Linguistics, 3], ► pp. 355 ff.

Larsson, Tove, Randi Reppen & Tülay Dixon

2022. A phraseological study of highlighting strategies in novice and expert writing. Journal of English for Academic Purposes 60 ► pp. 101179 ff.

Vetchinnikova, Svetlana, Alena Konina, Nitin Williams, Nina Mikušová & Anna Mauranen

2022. Perceptual chunking of spontaneous speech: Validating a new method with non-native listeners. Research Methods in Applied Linguistics 1:2 ► pp. 100012 ff.

Larsson, Tove, Luke Plonsky & Gregory R. Hancock

2021. On the benefits of structural equation modeling for corpus linguists. Corpus Linguistics and Linguistic Theory 17:3 ► pp. 683 ff.

This list is based on CrossRef data as of 12 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.