Article published in: Compiling and analysing the Spoken British National Corpus 2014
Edited by Tony McEnery, Robbie Love and Vaclav Brezina
[International Journal of Corpus Linguistics 22:3] 2017
pp. 319–344
The Spoken BNC2014
Designing and building a spoken corpus of everyday conversations
Available under the Creative Commons Attribution (CC BY) 4.0 license.
For any use beyond this license, please contact the publisher at rights@benjamins.nl.
Published online: 23 November 2017
https://doi.org/10.1075/ijcl.22.3.02lov
Abstract
This paper introduces the Spoken British National Corpus 2014, an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK, recorded in the years 2012–2016. After showing that a survey of the recent history of corpora of spoken British English justifies the compilation of this new corpus, we describe the main stages of the Spoken BNC2014’s creation: design, data and metadata collection, transcription, XML encoding, and annotation. In doing so we aim to (i) encourage users of the corpus to approach the data with sensitivity to the many methodological issues we identified and attempted to overcome while compiling the Spoken BNC2014, and (ii) inform (future) compilers of spoken corpora of the innovations we implemented to attempt to make the construction of corpora representing spontaneous speech in informal contexts more tractable, both logistically and practically, than in the past.
Keywords: Spoken BNC2014, transcription, corpus construction, spoken corpora
Article outline
- 1. Introduction
- 2. Similar existing corpora – why do we need a new one?
- 2.1 The Spoken British National Corpus 1994
- 2.2 Other British English corpora containing spoken conversational data
- 2.3 Justification for the Spoken BNC2014
- 3. Corpus design and data collection
- 3.1 Opportunistic data collection
- 3.2 Recruitment of participants and audio recording
- 3.3 Metadata categories in the Spoken BNC2014
- 3.3.1 Name
- 3.3.2 Age
- 3.3.3 Gender
- 3.3.4 Accent/dialect
- 3.3.5 Occupation
- 3.3.6 Other metadata categories
- 4. Transcribing the Spoken BNC2014
- 4.1 Developing the transcription scheme
- 4.2 Speaker identification
- 4.3 Converting the transcripts
- 5. Conclusion
- Notes
- References
