The TV and Movies corpora: Design, construction, and use

Davies, Mark

doi:10.1075/ijcl.00035.dav

Article published in: Corpus approaches to telecinematic language
Edited by Monika Bednarek, Valentin Werner and Marcia Veirano Pinto
[International Journal of Corpus Linguistics 26:1] 2021
► pp. 10–37

Mark Davies | Brigham Young University

Published online: 17 November 2020

https://doi.org/10.1075/ijcl.00035.dav

Abstract

This paper discusses the creation and use of the TV Corpus (subtitles from 75,000 episodes, 325 million words, 6 English-speaking countries, 1950s–2010s) and the Movies Corpus (subtitles from 25,000 movies, 200 million words, 6 English-speaking countries, 1930s–2010s), which are available at English-Corpora.org. The corpora compare well to the BNC-Conversation data in terms of informality, lexis, phraseology, and syntax. But at 525 million words in total size, they are more than 30 times as large as BNC-Conversation (both BNC1994 and BNC2014 combined), which means that they can be used to look at a wide range of linguistic phenomena. The TV and Movies corpora also allow useful comparisons of very informal language across time (containing texts from the 1930s and later for the movies, and from the 1950s onwards for TV shows) and between dialects of English (such as British and American English).

Keywords: TV, movies, diachronic, dialects, speech

Article outline

1. Introduction
2. Rationale for the TV and Movies corpora
3. Creating the TV and Movies corpora
4. Using metadata to create “Virtual Corpora”
5. Informal nature of the language in the TV and Movies corpora
6. Dialectal and historical variation in English
- 6.1 Dialectal differences
- 6.2 Change over time
7. Conclusion
Note
References

References (32)

Baker, P. (2009). The BE06 corpus of British English and recent language change. International Journal of Corpus Linguistics, 14(3), 312–337.

Baker, P. (2011). Times may change but we’ll always have money: A corpus driven examination of vocabulary change in four diachronic corpora. Journal of English Linguistics, 39(1), 65–88.

Bednarek, M. (2018). Language and Television Series: A Linguistic Approach to TV Dialogue. Cambridge University Press.

Bednarek, M. (2019). Creating Dialogue for TV: Screenwriters Talk Television. Routledge.

Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman Grammar of Spoken and Written English. Longman.

BNC Consortium. (2007). British National Corpus (version 3, BNC XML ed.). [URL]

Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.

Brysbaert, M., Mandera, P., & Keuleers, E. (2018). The word frequency effect in word processing: An updated review. Current Directions in Psychological Science, 27(1), 45–50.

Canavan, A., & Zipperlen, G. (1996). CALLFRIEND American English-Non-Southern Dialect (LDC96S46). Linguistic Data Consortium [URL].

Canavan, A., Graff, D., & Zipperlen, G. (1997). CALLHOME American English Speech (LDC97S42). Linguistic Data Consortium [URL].

Davies, M. (2009). The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14(2), 159–190.

Davies, M. (2011). The Corpus of Contemporary American English as the first reliable monitor corpus of English. Literary and Linguistic Computing, 25(4), 447–465.

Davies, M. (2012). Expanding horizons in historical linguistics with the 400 million word Corpus of Historical American English. Corpora, 7(2), 121–157.

Davies, M. (2015). Corpora: An introduction. In D. Biber & R. Reppen (Eds.), Cambridge Handbook of English Corpus Linguistics (pp. 11–31). Cambridge University Press.

Davies, M. (2017). Using large online corpora to examine lexical, semantic, and cultural variation in different dialects and time periods. In E. Friginal (Ed.), Studies in Corpus-Based Sociolinguistics (pp. 19–82). Routledge.

Davies, M. (2018). Corpus-based studies of lexical and semantic variation: The importance of both corpus size and corpus design. In C. Suhr, T. Nevalainen & I. Taavitsainen (Eds.), From Data to Evidence in English Language Research (pp. 34–55). Brill.

Forchini, P. (2012). Movie Language Revisited: Evidence from Multi-Dimensional Analysis and Corpora. Peter Lang.

Greenbaum, S. (1996). Comparing English Worldwide: The International Corpus of English. Clarendon Press.

Godfrey, J. J., & Holliman, E. (1993). Switchboard-1 Release 2 (LDC97S62). Linguistic Data Consortium. [URL]

Van Heuven, W., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67(6), 1176–1190.

Levshina, N. (2017). Online film subtitles as a corpus: An n-gram approach. Corpora, 12(3), 311–338.

Lison, P., & Tiedemann, J. (2016). OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). European Language Resources Association (ELRA). [URL]

Love, R. (2020). Overcoming Challenges in Corpus Construction: The Spoken British National Corpus 2014. Routledge.

Love, R., Dembry, C., Hardie, A., Brezina, V., & McEnery, T. (2017). The Spoken BNC2014: Designing and building a spoken corpus of everyday conversations. International Journal of Corpus Linguistics, 22(3), 319–344.

Lugea, J. (2019). The intralingual subtitling of The Wire: Changes of style and substance. Journal of Applied Linguistics and Professional Practice, 12(1), 23–49.

Piazza, R., Bednarek, M., & Rossi, F. (Eds.) (2011). Telecinematic Discourse: Approaches to the Language of Films and Television Series. John Benjamins.

Quaglio, P. (2009). Television Dialogue: The Sitcom Friends vs. Natural Conversation. John Benjamins.

Rayson, P., & Garside, R. (1998). The CLAWS web tagger. ICAME Journal, 22(4), 121–123.

Simpson, R., Briggs, L., Ovens, J., & Swales, J. (2002). The Michigan Corpus of Academic Spoken English. The Regents of the University of Michigan.

Tiedemann, J. (2016). OPUS – parallel corpora for everyone. Baltic Journal of Modern Computing, 4(2), 384.

Veirano Pinto, M. (2014). Dimensions of variation in North American movies. In T. Berber Sardinha & M. Veirano Pinto (Eds.), Multi-dimensional Analysis, 25 Years on: A Tribute to Douglas Biber (pp. 109–146). John Benjamins.

Veirano Pinto, M. (2018). Variation in movies and television programs: The impact of corpus sampling. In V. Werner (Ed.), The Language of Pop Culture (pp. 139–161). Routledge.

Cited by (32)

Basile, Carmelo Alessandro, Agnès Celle & Cameron Morin

2025. Cognitive approaches to variation and change in the English modal domain: introduction. English Language and Linguistics 29:3 ► pp. 444 ff.

Bednarek, Monika & Tracey-Anne Cameron

2025. Aboriginal English, culture, racism and colonization: Television dialogue as a means of creating and enhancing visibility. Australian Journal of Linguistics 45:2 ► pp. 169 ff.

Crosthwaite, Peter & Martin Schweinberger

2025. Corpora and instructed second language acquisition. In Technology and Instructed Second Language Acquisition [Language Learning & Language Teaching, 63], ► pp. 115 ff.

Goutsos, Dionysis

2025. Language change and (im)politeness in film discourse. Journal of Language and Pop Culture 1:2 ► pp. 177 ff.

Latouche, Lucie, Samantha Laporte & Ilse Depraetere

2025. Hedged performatives in spoken American English: recent change and variation in their use. English Language and Linguistics 29:3 ► pp. 527 ff.

van Rooy, Bertus

2025. Corpora of Englishes in the Inner Circle. In The Wiley Blackwell Encyclopedia of World Englishes, ► pp. 1 ff.

Werner, Valentin, Mie Hiramoto & Paul Flanagan

2025. Language and pop culture. Journal of Language and Pop Culture 1:1 ► pp. 1 ff.

Zhang, Jinyi, Haowei Li, Dashuai Deng, Yanshu Wang & Tadahiro Matsumoto

2025. PhraseBT: A phrase-level back-translation data augmentation method for neural machine translation. Neurocomputing 652 ► pp. 130832 ff.

Castro, Adrián

2024. Telecinematic stylistics: Language and style in fantasy TV series. Language and Literature: International Journal of Stylistics 33:1 ► pp. 3 ff.

Leedham, Maria

2024. Depictions of social workers and other caring professionals on television. Journal of Social Work 24:5 ► pp. 664 ff.

Li, Haowei, Jinyi Zhang, Ye Tian & Tadahiro Matsumoto

2024. 2024 2nd International Conference on Signal Processing and Intelligent Computing (SPIC), ► pp. 980 ff.

Bednarek, Monika

2023. Corpus linguistics and television series: A personal reflection. TV/Series 22

Flesch, Marie

2023. “Dude” and “Dudette”, “Bro” and “Sis”: A Diachronic Study of Four Address Terms in the TV Corpus. Anglica. An International Journal of English Studies :32/2 ► pp. 23 ff.

Hirota, Tomoharu & Laurel J. Brinton

2023. “You betcha I’m a ’Merican”. International Journal of Corpus Linguistics 28:4 ► pp. 528 ff.

Jucker, Andreas H. & Daniela Landert

2023. The diachrony of im/politeness in American and British movies (1930–2019). Journal of Pragmatics 209 ► pp. 123 ff.

Landert, Daniela, Tanja Säily & Mika Hämäläinen

2023. TV series as disseminators of emerging vocabulary: Non-codified expressions in the TV Corpus. ICAME Journal 47:1 ► pp. 63 ff.

Viollain, Cécile

2023. What TV series “do” to phonology and vice-versa – or should TV series be used as phonological corpora?. TV/Series 22

Yusufali, Hussein, Stefan Goetze & Roger K. Moore

2023. Bridging the Communication Rate Gap: Enhancing Text Input for Augmentative and Alternative Communication (AAC). In HCI International 2023 – Late Breaking Papers [Lecture Notes in Computer Science, 14055], ► pp. 434 ff.

Zhang, Jinyi, Ye Tian, Jiannan Mao, Mei Han, Feng Wen, Cong Guo, Zhonghui Gao & Tadahiro Matsumoto

2023. WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation. Electronics 12:5 ► pp. 1140 ff.

Ha, Hung Tan

2022. Vocabulary Demands of Informal Spoken English Revisited: What Does It Take to Understand Movies, TV Programs, and Soap Operas?. Frontiers in Psychology 13

Ha, Hung Tan

2022. Lexical Profile of Newspapers Revisited: A Corpus-Based Analysis. Frontiers in Psychology 13

López-Rodríguez, Clara Inés

2022. Emotion at the end of life: Semantic annotation and key domains in a pilot study audiovisual corpus. Lingua 277 ► pp. 103401 ff.

Montero Perez, Maribel

2022. Second or foreign language learning through watching audio-visual input and the role of on-screen text. Language Teaching 55:2 ► pp. 163 ff.

Gentile, Federico Pio

2021. The Motive ‘Whydunit’ Television Hybrid. In Corpora, Corpses and Corps, ► pp. 177 ff.

Gentile, Federico Pio

2021. The Linguistic and Cultural Environment of Canadian Television. In Corpora, Corpses and Corps, ► pp. 71 ff.

Gentile, Federico Pio

2021. The 19-2 Anglified Police Procedural Noir. In Corpora, Corpses and Corps, ► pp. 241 ff.

Gentile, Federico Pio

2021. The Research Methodology. In Corpora, Corpses and Corps, ► pp. 15 ff.

Keselman, Iosif & Yulia Yakovleva

2021. Short Teacher Responses in the EFL Classroom: A Corpus-Approach Assessment. Journal of Language and Education 7:2 ► pp. 175 ff.

Werner, Valentin

2021. A register approach toward pop lyrics in EFL education. In Corpus-based approaches to register variation [Studies in Corpus Linguistics, 103], ► pp. 209 ff.

[no author supplied]

2022. List of Example Stand-alone Corpus Description Articles. In Designing and Evaluating Language Corpora, ► pp. 224 ff.

[no author supplied]

2023. Language and Characterisation in Television Series [Studies in Corpus Linguistics, 106],

[no author supplied]

2025. Decoding Movie Language through MDA and the Grammar of Graphics [Studies in Corpus Linguistics, 124],

This list is based on CrossRef data as of 12 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.