Compiling computer-mediated spoken language corpora: Key issues and recommendations

Diemer, Stefan; Brunner, Marie-Louise; Schmidt, Selina

doi:10.1075/ijcl.21.3.03die

Article published in: Compilation, transcription, markup and annotation of spoken corpora
Edited by John M. Kirk and Gisle Andersen
[International Journal of Corpus Linguistics 21:3] 2016
pp. 348–371

Stefan Diemer | Saarland University

Marie-Louise Brunner

Selina Schmidt

Published online: 29 September 2016

https://doi.org/10.1075/ijcl.21.3.03die

This paper discusses key issues in the compilation of spoken language corpora in a computer-mediated communication (CMC) environment, using data from the Corpus of Academic Spoken English (CASE), a corpus of Skype conversations currently being compiled at Saarland University, Germany, in cooperation with European and US partners. Based on first findings, Skype is presented as a suitable tool for collecting informal spoken data. In addition, new recommendations concerning data compilation and transcription are put forward to supplement existing best practice as presented in Wynne (2005). We recommend the preservation of multimodal features during anonymisation, and the addition of annotation elements already at the transcription stage, particularly CMC-related discourse features, English as a Lingua Franca (ELF) features (e.g. non-standard language and code-switching), as well as the inclusion of prosodic, paralinguistic, and non-verbal annotation. Additionally, we propose a layered corpus design in order to allow researchers to focus on specific annotation features.

Keywords: Computer-mediated communication (CMC), data compilation and transcription, spoken language corpora, best practice

References (37)

Adolphs, S., & Carter, R. (2013). Spoken Corpus Linguistics. From Monomodal to Multimodal. London: Routledge.

Biber, D. (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press.

Brunner, M.-L. (2015). Negotiating Conversation Starts in the Corpus of Academic Spoken English (Unpublished MA thesis). Universität des Saarlandes, Saarbrücken, Germany.

ECAMM – Call Recorder for Mac. (2013). [Computer software]. Retrieved from [URL] (last accessed March 2016).

CASE – Corpus of Academic Spoken English. (Forthcoming S. Diemer, M.-L. Brunner, C. Collet & S. Schmidt). Saarbrücken: Saarland University (Coordination) / Sofia: St Kliment Ohridski University / Forlì: University of Bologna-Forlì / Santiago: University of Santiago de Compostela / Helsinki: Helsinki University & Hanken School of Economics / Birmingham: Birmingham City University / Växjö: Linnaeus University / Louvain-la-Neuve: Université catholique de Louvain / Lyon: Université Lumière Lyon 2 / Boise: Boise State University. Retrieved from [URL] (last accessed March 2016).

Chafe, W. (2007). The Importance of not Being Earnest: The Feeling behind Laughter and Humor. Amsterdam: John Benjamins.

CLAWS Part-of-Speech Tagger for English. (1994–2016). [Computer software]. Retrieved from [URL] (last accessed March 2016).

Conrad, S., & Mauranen, A. (2003). The corpus of English as lingua franca in academic settings. TESOL Quarterly, 37(3), 513–527.

Dressler, R.A., & Kreuz, R.J. (2000). Transcribing oral discourse: A survey and a model system. Discourse Processes, 29(1), 25–36.

Edwards, J.A. (1993). Principles and contrasting systems of discourse transcription. In J.A. Edwards & M.D. Lampert (Eds.), Talking Data: Transcription and Coding in Discourse Research (pp. 3–32). Hillsdale: Lawrence Erlbaum Associates.

ELFA – The Corpus of English as a Lingua Franca in Academic Settings. (2008). A. Mauranen (Director). Retrieved from [URL] (last accessed February 2015).

Firth, A. (1996). The discursive accomplishment of normality: On ‘lingua franca’ English and conversation analysis. Journal of Pragmatics, 26(2), 237–259.

Gee, M. (2014). CASE XML Conversion Tool [Computer software]. Retrieved from [URL] (last accessed November 2015).

Geluykens, R. (1993). Topic introduction in English conversation. Transactions of the Philological Society, 91(2), 181–214.

Gibbon, D., Moore, R., & Winski, R. (1998). Handbook of Standards and Resources for Spoken Language Systems 1: Spoken Language Systems and Corpus Design. Berlin, Germany: Mouton de Gruyter.

Glenn, P. (2003). Laughter in Interaction. Cambridge: Cambridge University Press.

Howarth, P.A. (1996). Phraseology in English Academic Writing: Some Implications for Language Learning and Dictionary Making. Tübingen: Niemeyer.

ICE Corpus annotation guidelines. (2009). Retrieved from [URL] (last accessed March 2016).

IFA Dialog Video Corpus. (2008). Retrieved from [URL] (last accessed March 2016).

Jefferson, G., Sacks, H., & Schegloff, E.A. (1987). Notes on laughter in the pursuit of intimacy. In G. Button & J.R.E. Lee (Eds.), Talk and Social Organisation (pp. 152–205). Clevedon: Multilingual Matters.

Jenkins, J., Modiano, M., & Seidlhofer, B. (2001). Euro-English. English Today, 17(4), 13–19.

Leech, G., Myers, G., & Thomas, J. (Eds.) (1995). Spoken English on Computer. Harlow: Longman.

Leech, G. (2005). Adding linguistic annotation. In M. Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice (pp. 17–29). Oxford: Oxbow Books.

Mair, C. (Ed.) (2003). The Politics of English as a World Language. Amsterdam: Rodopi.

Meierkord, C. (1996). Englisch als Medium der interkulturellen Kommunikation. Untersuchungen zum non-native-/non-native Speaker-Diskurs. Frankfurt am Main: Peter Lang.

Nelson, G. (2002). ICE mark-up manual for spoken texts. Retrieved from [URL] (last accessed 31 March 2016).

Sauer, S., & Lüdeling, A. (2016). Flexible Multi-Layer Spoken Dialogue Corpora. International Journal of Corpus Linguistics (this volume).

Schegloff, E.A. (1968). Sequencing in conversational openings. American Anthropologist, 70(6), 1075–1095.

Schmidt, S. (2015). Laughter in computer-mediated communication: A means of creating rapport in first-contact situations (Unpublished MA dissertation). Universität des Saarlandes, Saarbrücken, Germany.

Schmidt, S., Brunner, M.-L., & Diemer, S. (2014). CASE: Corpus of Academic Spoken English: Transcription Conventions. Retrieved from [URL] (last accessed March 2016).

Sinclair, J. (1995). From theory to practice. In G. Leech, G. Myers & J. Thomas (Eds.), Spoken English on Computer (pp. 99–112). Harlow: Longman.

Spencer-Oatey, H. (2002). Managing rapport in talk: Using rapport sensitive incidents to explore the motivational concerns underlying the management of relations. Journal of Pragmatics, 34(5), 529–545.

Supertintin – Skype Video Call Recorder. (2013). [Computer software]. Retrieved from [URL] (last accessed March 2016).

Tannen, D. (1989). Talking Voices: Repetition, Dialog, and Imagery in Conversational Discourse. Cambridge: Cambridge University Press.

Thompson, P. (2005). Spoken language corpora. In M. Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice (pp. 59–70). Oxford: Oxbow Books.

VOICE – The Vienna-Oxford International Corpus of English (Version 2.0 XML). (2013). B. Seidlhofer (Director). Vienna: University of Vienna. Retrieved from [URL] (last accessed March 2016).

Wynne, M. (Ed.) (2005). Developing Linguistic Corpora: A Guide to Good Practice. Oxford: Oxbow Books. Retrieved from [URL] (last accessed March 2016).

Cited by (12)

Fitzgerald, Christopher & Dawn Knight

2025. Corpus Analysis of Multimodality. In The Encyclopedia of Applied Linguistics, pp. 1 ff.

Gonzales, Wilkinson Daniel Wong, Mie Hiramoto, Jakob R. E. Leimgruber & Jun Jie Lim

2023. The Corpus of Singapore English Messages (CoSEM). World Englishes 42:2, pp. 371 ff.

Rühlemann, Christoph & Alexander Ptak

2023. Reaching beneath the tip of the iceberg: A guide to the Freiburg Multimodal Interaction Corpus. Open Linguistics 9:1

Fernández Polo, Francisco Javier

2021. Backchannels in video-mediated ELF conversations: a case study. Journal of English as a Lingua Franca 10:1, pp. 113 ff.

Põldvere, Nele, Johan Frid, Victoria Johansson & Carita Paradis

2021. Challenges of releasing audio material for spoken data: The case of the London-Lund Corpus 2. Research in Corpus Linguistics 9:1, pp. 35 ff.

Põldvere, Nele, Victoria Johansson & Carita Paradis

2021. The London–Lund Corpus 2: design, challenges and innovations. English Language and Linguistics 25:3, pp. 459 ff.

Bosso, Rino

2020. Exploring the Pragmatics of Computer-Mediated English as a Lingua Franca Communication. In Language Change, pp. 291 ff.

Brunner, Marie-Louise & Stefan Diemer

2018. “You are struggling forwards, and you don’t know, and then you … you do code-switching…” – Code-switching in ELF Skype conversations. Journal of English as a Lingua Franca 7:1, pp. 59 ff.

Steen, Francis F., Anders Hougaard, Jungseock Joo, Inés Olza, Cristóbal Pagán Cánovas, Anna Pleshakova, Soumya Ray, Peter Uhrig, Javier Valenzuela, Jacek Woźny & Mark Turner

2018. Toward an infrastructure for data-driven multimodal communication research. Linguistics Vanguard 4:1

Kok, Kasper I.

2017. Functional and temporal relations between spoken and gestured components of language. International Journal of Corpus Linguistics 22:1, pp. 1 ff.

[no author supplied]

2022. QUEST: Guidelines and Specifications for the Assessment of Audiovisual, Annotated Language Data [Working Papers in Corpus Linguistics and Digital Technologies: Analyses and Methodology, 8]

[no author supplied]

2022. List of Example Stand-alone Corpus Description Articles. In Designing and Evaluating Language Corpora, pp. 224 ff.

This list is based on CrossRef data as of 12 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.