Chapter 1. Multi-word units in neural machine translation: Why the tip of the iceberg remains problematic

Colson, Jean-Pierre

doi:10.1075/cilt.366.01col

In: Recent Advances in Multiword Units in Machine Translation and Translation Technology
Edited by Johanna Monti, Gloria Corpas Pastor, Ruslan Mitkov and Carlos Manuel Hidalgo-Ternero
[Current Issues in Linguistic Theory 366] 2024
pp. 2–17

Jean-Pierre Colson | University of Louvain

Published online: 7 November 2024

Abstract

Neural machine translation (NMT) has recently made significant progress in improving the quality of the texts it produces. New features of NMT include the fluidity of translations and the successful handling of multi-word units. In this paper we first report the results of an automated evaluation of the percentage of phraseology in the translations produced by Google Translate and DeepL. A corpus-based approach makes it possible to estimate that both NMT systems succeed in producing an average percentage of phraseology that is quite reasonable and sometimes even higher than in natural language production by native speakers. However, a closer look at some problematic cases shows that the ability of NMT systems to treat phraseological units can be deceptive, as they are often unable to cope with contextual complexity and low-frequency idioms.

Keywords: phraseology, neural machine translation, deep learning, idioms, transformer architecture

Article outline

1. Introduction: Lingering doubts about neural machine translation
2. Are texts produced by NMT rich in phraseology? An experiment
3. Looking closer at problematic examples for NMT
4. Fine-tuning NMT for phraseology: An experiment
5. Conclusion
Notes
References
Appendix
