Article published in: Compilation, transcription, markup and annotation of spoken corpora
Edited by John M. Kirk and Gisle Andersen
[International Journal of Corpus Linguistics 21:3] 2016
pp. 323–347
Semi-lexical features in corpus transcription
Consistency, comparability, standardisation
Published online: 29 September 2016
https://doi.org/10.1075/ijcl.21.3.02and
A particular challenge in corpus compilation is how to transcribe, in orthographic form, units that are not part of any standardised vocabulary. The problematic categories include voiced pauses, minimal response signals, interjections, certain discourse markers, phonologically reduced forms, colloquialisms and dialect forms. Such semi-lexical features are usually represented by regular phonemic-graphemic correspondences but are nevertheless often handled inconsistently. This paper reviews a number of existing transcription guidelines and assesses whether their recommendations are detailed enough to secure a consistent transcription of the categories mentioned. The paper also assesses to what extent the transcription of semi-lexical features is consistent within and across two spoken corpora. On the basis of a cross-corpus comparison of the Bergen Corpus of London Teenage Language (COLT) and the London English Corpus (LEC), the paper provides specific recommendations for corpus transcription.
