Innovations in parallel corpus alignment and retrieval

Volk, Martin

doi:10.1075/scl.90.05vol

In: Parallel Corpora for Contrastive and Translation Studies: New resources and applications
Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 79–90

Martin Volk | University of Zurich

Published online: 20 March 2019

https://doi.org/10.1075/scl.90.05vol

In this chapter, we give an overview of parallel corpus annotation, alignment and retrieval. We present standard annotation methods such as part-of-speech tagging, lemmatization and dependency parsing, and we also introduce language-specific methods, for example for handling split verbs and truncated compounds in German. Our corpus annotation includes the identification of code-switching within sentences as a special case of language identification. We argue for careful sentence and word alignment of parallel corpora, and we explain how word alignment serves as the basis for a wide range of applications, from translation variant ranking to lemma disambiguation.

Keywords: multiparallel corpora, corpus annotation, word alignment, corpus retrieval

Article outline

1. Introduction
2. Corpus annotations
- 2.1 General corpus annotation
- 2.2 Exploiting parallel corpora for annotation
- 2.3 Language-specific corpus annotation
3. Aligning parallel corpora
4. Retrieval from parallel corpora
5. Conclusion
Acknowledgments
Note
References

References (20)

Aepli, Noëmi & Volk, Martin. 2013. Reconstructing complete lemmas for incomplete German compounds. In Proceedings of The International Conference of the German Society for Computational Linguistics and Language Technology (GSCL), Iryna Gurevych, Chris Biemann & Torsten Zesch (eds), 1–13. Darmstadt: Springer.

Augustinus, Liesbeth, Vandeghinste, Vincent & Vanallemeersch, Tom. 2016. Poly-GrETEL: Cross-lingual example-based querying of syntactic constructions. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), 3549–3554. European Language Resources Association (ELRA).

Ebling, Sarah, Sennrich, Rico, Klaper, David & Volk, Martin. 2011. Digging for names in the mountains: Combined person name recognition and reference resolution for German alpine texts. In Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011 [Lecture Notes in Computer Science Vol. 8387], Zygmunt Vetulani & Joseph Mariani (eds), 189–200. Cham: Springer.

Göhring, Anne & Volk, Martin. 2011. The Text+Berg corpus: An alpine French-German parallel resource. In Proceedings of Traitement Automatique des Langues Naturelles (TALN 2011), Montpellier, 27 June–1 July 2011.

Graën, Johannes, Batinic, Dolores & Volk, Martin. 2014. Cleaning the Europarl corpus for linguistic applications. In Proceedings of KONVENS, 222–227. Hildesheim.

Junczys-Dowmunt, Marcin, Pouliquen, Bruno & Mazenc, Christophe. 2016. COPPA v2.0: Corpus of parallel patent applications. Building large parallel corpora with GNU Make. In Proceedings of the Workshop on Challenges in the Management of Large Corpora at LREC, 15–19. Portorož, Slovenia.

Koehn, Philipp. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X, 79–86. Phuket.

Lison, Pierre & Tiedemann, Jörg. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), 923–929. Portorož, Slovenia.

McDonald, Ryan & Nivre, Joakim. 2011. Analyzing and integrating dependency parsers. Computational Linguistics 37(1): 197–230.

Meurer, Paul. 2012. INESS-Search: A search system for LFG (and other) treebanks. In Proceedings of LFG12 Conference, Miriam Butt & Tracy H. King (eds). Stanford, CA: CSLI Publications.

Petrov, Slav, Das, Dipanjan & McDonald, Ryan. 2012. A universal part-of-speech tagset. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), 2089–2096. Istanbul.

Rios, Annette, Göhring, Anne & Volk, Martin. 2012. Parallel treebanking Spanish–Quechua: How and how well do they align? Linguistic Issues in Language Technology 7(1): 1–19.

Sennrich, Rico & Volk, Martin. 2011. Iterative, MT-based sentence alignment of parallel texts. In Proceedings of the 18th International Nordic Conference of Computational Linguistics (Nodalida), 175–182. Riga.

Steinberger, Ralf, Pouliquen, Bruno, Widiger, Anna, Ignat, Carmelia, Erjavec, Tomaz, Tufis, Dan & Varga, Daniel. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of LREC, 2142–2147. Genoa.

Volk, Martin & Clematide, Simon. 2014. Detecting code-switching in a multilingual alpine heritage corpus. In Proceedings of the First Workshop on Computational Approaches to Code Switching, 24–33. Doha, Qatar.

Volk, Martin, Graën, Johannes & Callegaro, Elena. 2014. Innovations in parallel corpus search tools. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), 3172–3178. Reykjavik.

Volk, Martin, Amrhein, Chantal, Aepli, Noëmi, Müller, Mathias & Ströbel, Phillip. 2016a. Building a parallel corpus on the world’s oldest banking magazine. In Proceedings of KONVENS, 288–296. Bochum.

Volk, Martin, Clematide, Simon, Graën, Johannes & Ströbel, Phillip. 2016b. Bi-particle adverbs, PoS-tagging and the recognition of German separable prefix verbs. In Proceedings of KONVENS, 297–305. Bochum.

Volk, Martin, Marek, Torsten & Samuelsson, Yvonne. 2011. Building and querying parallel treebanks. Translation: Computation, Corpora, Cognition (Special Issue on Parallel Corpora: Annotation, Exploitation and Evaluation) 1(1): 7–28.

Ziemski, Michał, Junczys-Dowmunt, Marcin & Pouliquen, Bruno. 2016. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), 3530–3534. Portorož, Slovenia.