A filter for syntactically incomparable parallel sentences

Kroon, Martin; Barbiers, Sjef; Odijk, Jan; van der Pas, Stéphanie

doi:10.1075/avt.00029.kro

Article published in: Linguistics in the Netherlands 2019
Edited by Janine Berns and Elena Tribushinina
[Linguistics in the Netherlands 36] 2019
pp. 147–161

Part II: Selected papers presented at the Dutch Annual Linguistics Day of 2019

Martin Kroon | Leiden University Centre for Linguistics

Sjef Barbiers | Leiden University Centre for Linguistics

Jan Odijk | Universiteit Utrecht, UIL-OTS

Stéphanie van der Pas | Mathematical Institute, Leiden University

Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Published online: 5 November 2019

https://doi.org/10.1075/avt.00029.kro

Abstract

Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve “free” translations. In this paper we explore four possible filters: the Damerau-Levenshtein distance between POS-tags, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable across language pairs, while the combination filter achieves the best results.
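To make the first two filters concrete, the following is a minimal Python sketch, not the authors' implementation. It computes the restricted Damerau-Levenshtein (optimal string alignment) distance over POS-tag sequences and a sentence-length ratio. The function names `osa_distance` and `length_ratio`, the use of token counts for length, and the shorter-over-longer orientation of the ratio are assumptions made for illustration; the abstract does not specify these details.

```python
from typing import Sequence


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of
    adjacent items, here applied to POS tags rather than characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent tags.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def length_ratio(a: Sequence[str], b: Sequence[str]) -> float:
    """Shorter length divided by longer length (assumed orientation).

    Returns 0.0 when either sequence is empty.
    """
    if not a or not b:
        return 0.0
    return min(len(a), len(b)) / max(len(a), len(b))


# Hypothetical tag sequences for a sentence pair (illustrative only).
source_tags = ["PRON", "VERB", "DET", "NOUN"]
target_tags = ["PRON", "VERB", "NOUN", "DET"]
print(osa_distance(source_tags, target_tags))
print(length_ratio(source_tags, target_tags))
```

A sentence pair would then be kept or discarded by comparing such a score against a threshold; how that threshold is chosen is a separate question not covered by this sketch.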

Keywords: filter, parallel corpus, syntactic comparability, dependency parses

Article outline

1. Introduction
2. Syntactic comparability
3. Data
4. Filters
- 4.1 Levenshtein distance on POS-tags
- 4.2 Sentence-length ratio
- 4.3 Graph edit distance on dependency trees
- 4.4 Combination filter
- 4.5 Automatically setting a threshold
5. Evaluation of the filters
6. Results
7. Discussion
8. Conclusion
Acknowledgements
Notes
References
