In: Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 103–121
Corpus PaGeS
A multifunctional resource for language learning, translation and cross-linguistic research
Published online: 20 March 2019
https://doi.org/10.1075/scl.90.07dov
This chapter presents the bilingual parallel corpus PaGeS, compiled by the SpatiAlEs research group at the University of Santiago de Compostela. PaGeS currently comprises nearly 20 million tokens and consists of texts originally written in German or Spanish together with their corresponding translations into the other language, as well as a small portion of German and Spanish translations from third languages. The chapter introduces the main characteristics of the corpus, focusing on its design and compilation. It first explains the criteria for text selection and the details of text pre-processing, automatic alignment and manual review. It then addresses the search and display features and describes the server architecture and indexing process. Finally, it briefly discusses the planned development of the corpus.
Keywords: parallel corpora, corpus alignment, corpus visualization, Spanish, German
Article outline
- 1. Introduction
- 2. Components and content
- 3. Text preprocessing, textual mark-up and metadata
- 4. Alignment
- 5. Search and display features
- 6. Server architecture and publishing data
- 7. Summary and outlook
- Acknowledgement
- Notes
- References
