A multifunctional resource for language learning, translation and cross-linguistic research: Corpus PaGeS

Doval, Irene; Fernández Lanza, Santiago; Jiménez Juliá, Tomás; Liste Lamas, Elsa; Lübke, Barbara

doi:10.1075/scl.90.07dov

In: Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 103–121

Irene Doval | University of Santiago de Compostela

Santiago Fernández Lanza | University of Santiago de Compostela

Tomás Jiménez Juliá | University of Santiago de Compostela

Elsa Liste Lamas | University of Santiago de Compostela

Barbara Lübke | University of Santiago de Compostela

Published online: 20 March 2019

https://doi.org/10.1075/scl.90.07dov

This chapter presents the bilingual parallel corpus PaGeS, compiled by the research group SpatiAlEs at the University of Santiago de Compostela. PaGeS currently comprises nearly 20 million tokens and consists of texts originally written in German and in Spanish together with their translations into the other language, as well as a small portion of German and Spanish translations from third languages. The chapter introduces the main characteristics of the corpus, focusing on its design and compilation. It first explains the text selection criteria and the details of text pre-processing, automatic alignment and manual review. It then addresses the search and display features and describes the server architecture and indexing process. Finally, it briefly discusses the planned development of the corpus.

Keywords: parallel corpora, corpus alignment, corpus visualization, Spanish, German

Article outline

1. Introduction
2. Components and content
3. Text preprocessing, textual mark-up and metadata
4. Alignment
5. Search and display features
6. Server architecture and publishing data
7. Summary and outlook
Acknowledgement
Notes
References
