In: Multiword Units in Machine Translation and Translation Technology
Edited by Ruslan Mitkov, Johanna Monti, Gloria Corpas Pastor and Violeta Seretan
[Current Issues in Linguistic Theory 341] 2018
pp. 147–162
Dutch compound splitting for bilingual terminology extraction
Published online: 20 July 2018
https://doi.org/10.1075/cilt.341.07mac
Abstract
As compounds pose a problem for applications that rely on precise word alignments, we developed a state-of-the-art
compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation
techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists.
As compounds are not always translated compositionally, we developed a novel methodology for word alignment. We train
the word alignment models twice: a first time on the original data set and a second time on the data set in which the
compounds are split into their component parts.
Experiments show that the compound splitter combined with the novel word alignment technique considerably improves
bilingual terminology extraction results.
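
To make the first idea above concrete, here is a minimal sketch of a frequency-based splitter that interpolates an in-domain and an out-of-domain frequency list. It is not the chapter's implementation: the abstract only states that corpus frequencies and linguistic knowledge are used. The geometric-mean scoring, the linear interpolation weight, the set of linking elements, the minimum part length, and all counts in the toy lists are assumptions made for illustration.

```python
from math import prod

# Toy frequency lists: every count below is invented for illustration only.
OUT_OF_DOMAIN = {"brandstof": 120, "brand": 900, "stof": 700,
                 "pomp": 300, "verkeer": 500, "bord": 400}
IN_DOMAIN = {"brandstof": 40, "pomp": 25, "brandstofpomp": 2}

LINKERS = ("", "s", "en", "e")  # assumed set of Dutch linking elements
MIN_PART = 3                    # assumed minimum length of a compound part


def combined_freq(word, weight):
    """Interpolate in-domain and out-of-domain relative frequencies (assumed scheme)."""
    in_rel = IN_DOMAIN.get(word, 0) / sum(IN_DOMAIN.values())
    out_rel = OUT_OF_DOMAIN.get(word, 0) / sum(OUT_OF_DOMAIN.values())
    return weight * in_rel + (1 - weight) * out_rel


def candidate_splits(word, weight):
    """Yield the unsplit word first, then splits whose non-final parts are known."""
    yield [word]
    for i in range(MIN_PART, len(word) - MIN_PART + 1):
        head = word[:i]
        for linker in LINKERS:
            if not head.endswith(linker):
                continue
            # Strip the linking element (if any) to recover the stem.
            stem = head[:len(head) - len(linker)]
            if len(stem) < MIN_PART or combined_freq(stem, weight) == 0:
                continue
            for rest in candidate_splits(word[i:], weight):
                yield [stem] + rest


def score(parts, weight):
    """Geometric mean of the part frequencies; zero if any part is unknown."""
    freqs = [combined_freq(p, weight) for p in parts]
    return prod(freqs) ** (1 / len(parts))


def split_compound(word, weight=0.5):
    """Return the best-scoring candidate; ties go to the earliest one,
    so a word with no known parts stays unsplit."""
    return max(candidate_splits(word, weight),
               key=lambda parts: score(parts, weight))


print(split_compound("brandstofpomp"))
print(split_compound("verkeersbord"))
```

On this toy data, setting `weight=0.0` (out-of-domain counts only) makes the sketch over-split `brandstofpomp` into `brand`, `stof` and `pomp`, whereas the default interpolation keeps `brandstof` intact. That contrast is an artefact of the invented counts, but it hints at why adding in-domain frequencies can matter.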
Article outline
- 1. Introduction
- 2. Dutch compound splitter
- 2.1 Domain adaptation
- 2.2 Data Sets and Experiments
- 3. Impact on word alignment
- 3.1 Data sets and experiments
- 4. Impact on terminology extraction
- 4.1 Experiments
- 5. Conclusion
