Article published in: Compilation, transcription, markup and annotation of spoken corpora
Edited by John M. Kirk and Gisle Andersen
[International Journal of Corpus Linguistics 21:3] 2016
► pp. 348–371
Compiling computer-mediated spoken language corpora
Key issues and recommendations
Published online: 29 September 2016
https://doi.org/10.1075/ijcl.21.3.03die
This paper discusses key issues in the compilation of spoken language corpora in a computer-mediated communication (CMC) environment, using data from the Corpus of Academic Spoken English (CASE), a corpus of Skype conversations currently being compiled at Saarland University, Germany, in cooperation with European and US partners. Based on first findings, Skype is presented as a suitable tool for collecting informal spoken data. In addition, new recommendations concerning data compilation and transcription are put forward to supplement existing best practice as presented in Wynne (2005). We recommend the preservation of multimodal features during anonymisation, and the addition of annotation elements already at the transcription stage, particularly CMC-related discourse features, English as a Lingua Franca (ELF) features (e.g. non-standard language and code-switching), as well as the inclusion of prosodic, paralinguistic, and non-verbal annotation. Additionally, we propose a layered corpus design in order to allow researchers to focus on specific annotation features.
References (37)
Adolphs, S., & Carter, R. (2013). Spoken Corpus Linguistics. From Monomodal to Multimodal. London: Routledge.
Brunner, M.-L. (2015). Negotiating Conversation Starts in the Corpus of Academic Spoken English (Unpublished MA thesis). Universität des Saarlandes, Saarbrücken, Germany.
ECAMM – Call Recorder for Mac. (2013). [Computer software]. Retrieved from [URL] (last accessed March 2016).
CASE – Corpus of Academic Spoken English. (Forthcoming S. Diemer, M.-L. Brunner, C. Collet & S. Schmidt). Saarbrücken: Saarland University (Coordination) / Sofia: St Kliment Ohridski University / Forlì: University of Bologna-Forlì / Santiago: University of Santiago de Compostela / Helsinki: Helsinki University & Hanken School of Economics / Birmingham: Birmingham City University / Växjö: Linnaeus University / Louvain-la-Neuve: Université catholique de Louvain / Lyon: Université Lumière Lyon 2 / Boise: Boise State University. Retrieved from [URL] (last accessed March 2016).
Chafe, W. (2007). The Importance of not Being Earnest: The Feeling behind Laughter and Humor. Amsterdam: John Benjamins.
CLAWS Part-of-Speech Tagger for English. (1994–2016). [Computer software]. Retrieved from [URL] (last accessed March 2016).
Conrad, S., & Mauranen, A. (2003). The corpus of English as lingua franca in academic settings. TESOL Quarterly, 37(3), 513–527.
Dressler, R.A., & Kreuz, R.J. (2000). Transcribing oral discourse: A survey and a model system. Discourse Processes, 29(1), 25–36.
Edwards, J.A. (1993). Principles and contrasting systems of discourse transcription. In J.A. Edwards & M.D. Lampert (Eds.), Talking Data: Transcription and Coding in Discourse Research (pp. 3–32). Hillsdale: Lawrence Erlbaum Associates.
ELFA – The Corpus of English as a Lingua Franca in Academic Settings. (2008). A. Mauranen (Director). Retrieved from [URL] (last accessed February 2015).
Firth, A. (1996). The discursive accomplishment of normality: On ‘lingua franca’ English and conversation analysis. Journal of Pragmatics, 26(2), 237–259.
Gee, M. (2014). CASE XML Conversion Tool [Computer software]. Retrieved from [URL] (last accessed November 2015).
Geluykens, R. (1993). Topic introduction in English conversation. Transactions of the Philological Society, 91(2), 181–214.
Gibbon, D., Moore, R., & Winski, R. (1998). Handbook of Standards and Resources for Spoken Language Systems 1: Spoken Language Systems and Corpus Design. Berlin, Germany: Mouton de Gruyter.
Howarth, P.A. (1996). Phraseology in English Academic Writing: Some Implications for Language Learning and Dictionary Making. Tübingen: Niemeyer.
ICE Corpus annotation guidelines. (2009). Retrieved from [URL] (last accessed March 2016).
IFA Dialog Video Corpus. (2008). Retrieved from [URL] (last accessed March 2016).
Jefferson, G., Sacks, H., & Schegloff, E.A. (1987). Notes on laughter in the pursuit of intimacy. In G. Button & J.R.E. Lee (Eds.), Talk and Social Organisation (pp. 152–205). Clevedon: Multilingual Matters.
Leech, G. (2005). Adding linguistic annotation. In M. Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice (pp. 17–29). Oxford: Oxbow Books.
Meierkord, C. (1996). Englisch als Medium der interkulturellen Kommunikation. Untersuchungen zum non-native-/non-native Speaker-Diskurs. Frankfurt am Main: Peter Lang.
Nelson, G. (2002). ICE mark-up manual for spoken texts. Retrieved from [URL] (last accessed 31 March 2016).
Sauer, S., & Lüdeling, A. (2016). Flexible Multi-Layer Spoken Dialogue Corpora. International Journal of Corpus Linguistics (this volume).
Schegloff, E.A. (1968). Sequencing in conversational openings. American Anthropologist, 70(6), 1075–1095.
Schmidt, S. (2015). Laughter in computer-mediated communication: A means of creating rapport in first-contact situations (Unpublished MA dissertation). Universität des Saarlandes, Saarbrücken, Germany.
Schmidt, S., Brunner, M.-L., & Diemer, S. (2014). CASE: Corpus of Academic Spoken English: Transcription Conventions. Retrieved from [URL] (last accessed March 2016).
Sinclair, J. (1995). From theory to practice. In G. Leech, G. Myers & J. Thomas (Eds.), Spoken English on Computer (pp. 99–112). Harlow: Longman.
Spencer-Oatey, H. (2002). Managing rapport in talk: Using rapport sensitive incidents to explore the motivational concerns underlying the management of relations. Journal of Pragmatics, 34(5), 529–545.
Supertintin – Skype Video Call Recorder. (2013). [Computer software]. Retrieved from [URL] (last accessed March 2016).
Tannen, D. (1989). Talking Voices: Repetition, Dialog, and Imagery in Conversational Discourse. Cambridge: Cambridge University Press.
Thompson, P. (2005). Spoken language corpora. In M. Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice (pp. 59–70). Oxford: Oxbow Books.
VOICE – The Vienna-Oxford International Corpus of English (Version 2.0 XML) (2013). B. Seidlhofer (Director). Vienna: University of Vienna. Retrieved from [URL] (last accessed March 2016).
Wynne, M. (Ed.) (2005). Developing Linguistic Corpora: A Guide to Good Practice. Oxford: Oxbow Books. Retrieved from [URL] (last accessed March 2016).
Cited by (12)
Fitzgerald, Christopher & Dawn Knight
Gonzales, Wilkinson Daniel Wong, Mie Hiramoto, Jakob R. E. Leimgruber & Jun Jie Lim
Rühlemann, Christoph & Alexander Ptak
Fernández Polo, Francisco Javier
Põldvere, Nele, Johan Frid, Victoria Johansson & Carita Paradis
Põldvere, Nele, Victoria Johansson & Carita Paradis
Bosso, Rino
Brunner, Marie-Louise & Stefan Diemer
Steen, Francis F., Anders Hougaard, Jungseock Joo, Inés Olza, Cristóbal Pagán Cánovas, Anna Pleshakova, Soumya Ray, Peter Uhrig, Javier Valenzuela, Jacek Woźny & Mark Turner
Kok, Kasper I.
2017. Functional and temporal relations between spoken and gestured components of language. International Journal of Corpus Linguistics 22:1 ► pp. 1 ff.
[no author supplied]
This list is based on CrossRef data as of 12 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.
