In: Corpora in Translation and Contrastive Research in the Digital Age: Recent advances and explorations
Edited by Julia Lavid-López, Carmen Maíz-Arévalo and Juan Rafael Zamorano-Mansilla
[Benjamins Translation Library 158] 2021
pp. 101–124
Chapter 4. Semantic textual similarity based on deep learning
Can it improve matching and retrieval for Translation Memory tools?
Published online: 8 December 2021
https://doi.org/10.1075/btl.158.04ran
Abstract
This study proposes an original methodology to underpin the operation of new-generation Translation Memory (TM) systems, in which the translations to be retrieved from the TM database are matched not on the basis of Levenshtein (edit) distance but by employing Natural Language Processing (NLP) and Deep Learning (DL) techniques. Three DL sentence encoders were used to retrieve TM matches for English-Spanish sentence pairs from the DGT-TM dataset. Each sentence encoder was compared with Okapi, which uses edit distance to retrieve the best match. The automatic evaluation shows the benefit of DL technology for TM matching and holds promise for the implementation of the TM tool itself, which is our next project.
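The contrast drawn in the abstract, edit-distance matching versus matching on sentence embeddings, can be illustrated with a minimal sketch. Everything below is an illustrative assumption rather than the chapter's or Okapi's actual implementation: the function names, the whitespace tokenisation, the normalisation by the longer segment, and the `encode` callable, which stands in for any sentence encoder (for example, one of the three named in the outline).

```python
from math import sqrt
from typing import Callable, Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(s: str, t: str) -> float:
    """Word-level fuzzy-match score in [0, 1] (assumed normalisation)."""
    a, b = s.lower().split(), t.lower().split()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def embedding_score(encode: Callable[[str], Sequence[float]]):
    """Build a scoring function from a (hypothetical) sentence encoder."""
    return lambda s, t: cosine(encode(s), encode(t))


def best_match(query: str, segments: Sequence[str], score) -> tuple:
    """Return (index, score) of the TM segment that scores highest."""
    scores = [score(query, seg) for seg in segments]
    best = max(range(len(segments)), key=scores.__getitem__)
    return best, scores[best]
```

With this shape, the same `best_match` call serves both strategies: pass `edit_similarity` for the edit-distance baseline, or `embedding_score(encode)` with a real encoder to rank TM segments by the similarity of their sentence vectors.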
Article outline
- 1. Introduction
- 2. Methodology
- 2.1 InferSent
- 2.2 Universal sentence encoder
- 2.3 Sentence BERT
- 3. Dataset and experiments
- 4. Evaluation and results
- 5. Analysis of typical errors
- 6. Conclusion
- Acknowledgements
- Notes
- References
