A many-sided, multi-purpose corpus of EU parliament proceedings: Building EPTIC

Ferraresi, Adriano; Bernardini, Silvia

doi:10.1075/scl.90.08fer

In: Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 123–139

Adriano Ferraresi | University of Bologna

Silvia Bernardini | University of Bologna

Published online: 20 March 2019

https://doi.org/10.1075/scl.90.08fer

This chapter describes the steps involved in the construction of EPTIC, an intermodal corpus of European Parliament speeches. Despite its limited size, this corpus has features that justify its labour-intensive building process, in particular its multiple alignments. The text-to-text alignments allow users to compare interpretations and translations of source speeches and their written-up reports, while text-to-video alignments allow them to access the multimedia components from concordance lines. To illustrate the potential of EPTIC, a case study is presented of English loan words in original, translated and interpreted Italian and French. Results suggest that borrowing is more likely to occur in translated Italian than in any of the other corpus components.

Keywords: intermodal corpora, text-to-text alignment, text-to-video alignment, corpus annotation, loan words
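The multiple alignments described in the abstract can be pictured with a minimal sketch. It assumes one record per aligned segment, holding the four text components plus a video offset, and shows how a concordance hit could carry its timestamp so that a user can jump to the video. All class, field and function names here are hypothetical and the sentences are invented toy data; none of this reflects EPTIC's actual encoding or query tools.

```python
import re
from dataclasses import dataclass


@dataclass
class AlignedSegment:
    """One aligned unit. Field names are hypothetical, not EPTIC's schema."""
    source_spoken: str       # transcript of the original speech
    source_written: str      # written-up report of the same speech
    target_interpreted: str  # transcript of the interpretation
    target_translated: str   # translation of the written-up report
    video_start: float       # offset in seconds into the speech video


def concordance(segments, component, keyword):
    """Return (video_start, text) for segments whose chosen component
    contains `keyword` as a whole word (case-insensitive)."""
    hits = []
    for seg in segments:
        text = getattr(seg, component)
        tokens = re.findall(r"\w+", text.lower())
        if keyword.lower() in tokens:
            hits.append((seg.video_start, text))
    return hits


def per_thousand(hit_count, token_count):
    """Normalised frequency, so components of different sizes are comparable."""
    return 1000 * hit_count / token_count


# Invented toy data, for illustration only.
segments = [
    AlignedSegment(
        source_spoken="thank you president",
        source_written="Thank you, President.",
        target_interpreted="grazie presidente",
        target_translated="Grazie, Presidente.",
        video_start=0.0,
    ),
    AlignedSegment(
        source_spoken="we need a new roadmap",
        source_written="We need a new roadmap.",
        target_interpreted="ci serve una nuova tabella di marcia",
        target_translated="Ci serve una nuova roadmap.",
        video_start=12.5,
    ),
]

translated_hits = concordance(segments, "target_translated", "roadmap")
interpreted_hits = concordance(segments, "target_interpreted", "roadmap")
```

Keeping the video offset on the same record as the text components is one way to let a single query serve both comparisons: the same hit can be read against its parallel components and traced back to the recording.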

Article outline

1. Introduction: Why another corpus of European Parliament speeches?
2. What EPTIC looks like
- 2.1 One corpus, fourteen subcorpora
- 2.2 Practical details: Size and availability
3. Building EPTIC
- 3.1 Selecting and obtaining raw corpus materials
- 3.2 Transcribing the oral data
- 3.3 Adding metadata
- 3.4 Performing text-to-text alignment
- 3.5 Performing text-to-video alignment
- 3.6 POS-tagging, lemmatization and indexing
4. An example: English loan words in Italian and French
5. Conclusion: Teaming up
Acknowledgement
Notes
References

References (21)

Baker, Mona. 1995. Corpora in translation studies: An overview and some suggestions for future research. Target 7(2): 223–243.

Bernardini, Silvia, Collard, Camille, Ferraresi, Adriano, Russo, Mariachiara & Defrancq, Bart. 2018. Building interpreting and intermodal corpora: A how-to for a formidable task. In Making Way in Corpus-based Interpreting Studies, Mariachiara Russo, Claudio Bendazzoli & Bart Defrancq (eds), 21–42. Singapore: Springer.

Bogaards, Paul. 2008. On ne parle pas franglais: La langue française face à l'anglais. Brussels: De Boeck/Duculot.

Burnard, Lou. 2004. Metadata for corpus work. In Developing Linguistic Corpora: A Guide to Good Practice, Martin Wynne (ed.). <[URL]> (30 June 2017).

Chesterman, Andrew. 2004. Hypotheses about translation universals. In Claims, Changes and Challenges in Translation Studies [Benjamins Translation Library 50], Gyde Hansen, Kirsten Malmkjaer & Daniel Gile (eds), 1–13. Amsterdam: John Benjamins.

Codrea-Rado, Anna. 2014. European parliament has 24 official languages, but MEPs prefer English. The Guardian. <[URL]> (30 October 2017).

Evert, Stefan & the CWB Development Team. 2016. The IMS Open Corpus Workbench (CWB) Corpus Encoding Tutorial. CWB Version 3.4: <[URL]> (30 October 2017).

Frankenberg-Garcia, Ana & Santos, Diana. 2003. Introducing COMPARA: The Portuguese–English parallel corpus. In Corpora in Translator Education, Federico Zanettin, Silvia Bernardini & Dominic Stewart (eds), 71–87. Manchester: St. Jerome.

Granger, Sylviane. 2010. Comparable and translation corpora in cross-linguistic research. Design, analysis and applications. Journal of Shanghai Jiaotong University 2: 14–21.

Johansson, Stig. 1998. On the role of corpora in cross-linguistic research. In Corpora and Cross-linguistic Research, Stig Johansson & Signe Oksefjell (eds), 3–24. Amsterdam: Rodopi.

Koehn, Philipp. 2005. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit X, 79–86. Phuket, Thailand.

Motschenbacher, Heiko. 2013. New Perspectives on English as a European Lingua Franca. Amsterdam: John Benjamins.

Niemants, Natacha. 2015. Transcription. In The Routledge Encyclopedia of Interpreting Studies, Franz Pöchhacker (ed), 421–422. London: Routledge.

Nisioi, Sergiu, Rabinovich, Ella, Dinu, Liviu P. & Wintner, Shuly. 2016. A corpus of native, non-native and translated texts. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), 4197–4201.

Pietrandrea, Paola, Kahane, Sylvain, Lacheret-Dujour, Anne & Sabio, Frédéric. 2014. The notion of sentence and other discourse units in corpus annotation. In Spoken Corpora and Linguistic Studies [Studies in Corpus Linguistics 61], Tommaso Raso & Heliana Mello (eds), 331–364. Amsterdam: John Benjamins.

Rychlý, Pavel. 2007. Manatee/Bonito – A modular corpus manager. In 1st Workshop on Recent Advances in Slavonic Natural Language Processing, 65–70. Masaryk University, Brno.

Shlesinger, Miriam. 2009. Towards a definition of interpretese: An intermodal, corpus-based study. In Efforts and Models in Interpreting and Translation Research: A Tribute to Daniel Gile [Benjamins Translation Library 80], Gyde Hansen, Andrew Chesterman & Heidrun Gerzymisch-Arbogast (eds), 237–253. Amsterdam: John Benjamins.

Toury, Gideon. 1995. Descriptive Translation Studies – and Beyond [Benjamins Translation Library 4]. Amsterdam: John Benjamins.

Varga, Dániel, Németh, László, Halácsy, Péter, Kornai, András, Trón, Viktor & Nagy, Viktor. 2005. Parallel corpora for medium density languages. In Proceedings of the RANLP 2005, 590–596.

Vondřička, Pavel. 2014. Aligning parallel texts with InterText. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), 1875–1879.

Zanettin, Federico. 2012. Translation-driven Corpora: Corpus Resources for Descriptive and Applied Translation Studies. Abingdon: Taylor & Francis.

Cited by (4)

Pérez-Paredes, Pascual & Carlos Ordoñana-Guillamón. 2025. Types of Corpora in Data-Driven Learning. In The Palgrave Encyclopedia of Computer-Assisted Language Learning, pp. 1 ff.

Kajzer-Wietrzny, Marta. 2022. An intermodal approach to cohesion in constrained and unconstrained language. Target. International Journal of Translation Studies 34:1, pp. 130 ff.

Bendazzoli, Claudio, Michela Bertozzi & Mariachiara Russo. 2020. Du texte aux ressources multimodales : faire avancer la recherche en interprétation à partir d’un corpus déjà existant. Meta 65:1, pp. 211 ff.

Ferraresi, Adriano, Silvia Bernardini, Maja Miličević Petrović & Marie-Aude Lefer. 2019. Simplified or not Simplified? The Different Guises of Mediated English at the European Parliament. Meta 63:3, pp. 717 ff.

This list is based on CrossRef data as of 1 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.