Evaluer, diagnostiquer et analyser la traduction automatique neuronale

Yvon, François

doi:10.1075/forum.00023.yvo

Article published in: Intelligences pour la traduction. IA et interculturel : actions et interactions.
Edited by Ludovica Maggi and Sarah Bordes
[FORUM 20:2] 2022
pp. 315–332

François Yvon | Université Paris-Saclay, CNRS, LISN

Article language: French

Published online: 12 January 2023

https://doi.org/10.1075/forum.00023.yvo

Résumé

Les outils de traduction automatique (TA) neuronale ont fait des progrès sensibles, qui les rendent utilisables pour un nombre croissant de domaines et de couples de langues. Cette évolution majeure des technologies de traduction invite à revisiter les méthodes de mesure de la qualité de la traduction, en particulier des mesures dites automatiques, qui jouent un rôle fondamental pour orienter les nouveaux développements de ces systèmes. Dans cet article, nous dressons un état des lieux des méthodes utilisées dans le cycle de développement des outils de traduction automatique, depuis les évaluations purement quantitatives jusqu’aux méthodologies récemment proposées pour analyser et diagnostiquer le fonctionnement de ces “boîtes noires” neuronales.

Mots-clés: Traduction automatique neuronale, Evaluation de la traduction automatique, Métriques pour la traduction automatique

Abstract

Neural machine translation (MT) technologies have made significant progress, making them useful for an increasing number of domains and language pairs. These major developments in translation technologies invite us to revisit our methods for measuring translation quality, in particular the so-called “automatic metrics”, which play a fundamental role in guiding the new developments of MT systems. In this work, we review the methods used in the development cycle of machine translation tools, from purely quantitative evaluations to recently proposed methodologies aiming to analyse and diagnose the functioning of these neural “black boxes”.

Keywords: Neural Machine Translation, Machine Translation Evaluation, Machine Translation Metrics

Article outline

1. Introduction
2. La TA Neuronale : Principes et concepts
- 2.1 Traduire par apprentissage
- 2.2 TAN : Traduction automatique numérique
- 2.3 Configurer A_θ : Le choix des méta-paramètres
3. Métriques automatiques : Le rôle des références humaines
- 3.1 Les évaluations globales
- 3.2 Évaluer sans référence
4. À la recherche des failles de la TA
- 4.1 Des bancs d’essais spécialisés
- 4.2 Évaluation par des manipulations linguistiques
  - 4.2.1 Dans la phrase source
  - 4.2.2 Dans la phrase cible
5. Sous le capot, le moteur (de traduction)
- 5.1 Analyse des représentations (sondes linguistiques)
6. Conclusion
Remarques
Bibliographie
