Article published in: Corpus approaches to telecinematic language
Edited by Monika Bednarek, Valentin Werner and Marcia Veirano Pinto
[International Journal of Corpus Linguistics 26:1] 2021
pp. 10–37
The TV and Movies corpora
Design, construction, and use
Published online: 17 November 2020
https://doi.org/10.1075/ijcl.00035.dav
Abstract
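This paper discusses the creation and use of the TV Corpus (subtitles from 75,000 episodes, 325 million words, 6
English-speaking countries, 1950s–2010s) and the Movies Corpus (subtitles from 25,000 movies, 200 million words, 6 English-speaking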
countries, 1930s–2010s), which are available at English-Corpora.org. The corpora compare
well to the BNC-Conversation data in terms of informality, lexis, phraseology, and syntax. But at 525 million words in total size, they are
more than 30 times as large as BNC-Conversation (both BNC1994 and BNC2014 combined), which means that they can be used to look at a wide
range of linguistic phenomena. The TV and Movies corpora also allow useful comparisons of very informal language across time (containing
texts from the 1930s and later for the movies, and from the 1950s onwards for TV shows) and between dialects of English (such as British and
American English).
Keywords: TV, movies, diachronic, dialects, speech
Article outline
- 1. Introduction
- 2. Rationale for the TV and Movies corpora
- 3. Creating the TV and Movies corpora
- 4. Using metadata to create “Virtual Corpora”
- 5. Informal nature of the language in the TV and Movies corpora
- 6. Dialectal and historical variation in English
- 6.1 Dialectal differences
- 6.2 Change over time
- 7. Conclusion
- Note
- References