Article published in: Linguistics in the Netherlands 2019
Edited by Janine Berns and Elena Tribushinina
[Linguistics in the Netherlands 36] 2019, pp. 147–161
Part II: Selected papers presented at the Dutch Annual Linguistics Day
of 2019
A filter for syntactically incomparable parallel sentences
Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.
For any use beyond this license, please contact the publisher at rights@benjamins.nl.
Published online: 5 November 2019
https://doi.org/10.1075/avt.00029.kro
Abstract
Massive automatic comparison of languages in parallel corpora
will greatly speed up and enhance comparative syntactic research. Automatically
extracting and mining syntactic differences from parallel corpora requires a
pre-processing step that filters out sentence pairs that cannot be compared
syntactically, for example because they involve “free” translations. In this
paper we explore four possible filters: the Damerau-Levenshtein distance between
POS-tags, the sentence-length ratio, the graph-edit distance between dependency
parses, and a combination of the three in a logistic regression model. Results
suggest that the dependency-parse filter is the most stable across language
pairs, while the combination filter achieves the best results.
Keywords: filter, parallel corpus, syntactic comparability, dependency parses
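The first two filters named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, and the definition of the length ratio as shorter over longer are all assumptions made for illustration.

```python
from typing import Sequence


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions, and swaps of two adjacent
    items. Here the items are POS tags rather than characters.
    (Assumption: the paper may use a different variant or normalization.)
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i items of a
    for j in range(cols):
        d[0][j] = j  # insert all j items of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent tags counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def length_ratio(a: Sequence[str], b: Sequence[str]) -> float:
    """Length of the shorter sequence divided by the longer one.

    (Assumption: hypothetical definition; 1.0 means equal length.)
    """
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


# Hypothetical POS-tag sequences for a sentence pair.
source_tags = ["DET", "ADJ", "NOUN", "VERB"]
target_tags = ["DET", "NOUN", "ADJ", "VERB"]

pos_distance = osa_distance(source_tags, target_tags)
ratio = length_ratio(source_tags, target_tags)
```

In this example the two sequences differ only in the order of the adjective and the noun, so the swap is counted as one edit instead of two substitutions. A filter would then compare such scores against a threshold, which this sketch leaves open.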
Article outline
- 1. Introduction
- 2. Syntactic comparability
- 3. Data
- 4. Filters
- 4.1 Levenshtein distance on POS-tags
- 4.2 Sentence-length ratio
- 4.3 Graph edit distance on dependency trees
- 4.4 Combination filter
- 4.5 Automatically setting a threshold
- 5. Evaluation of the filters
- 6. Results
- 7. Discussion
- 8. Conclusion
- Acknowledgements
- Notes
- References
