Chapter 12. Extracting translation equivalents from a near-parallel corpus of news texts: Experiments with multilingual sentence embeddings and large language models

Pęzik, Piotr; Grabowski, Łukasz

doi:10.1075/scl.126.12pez

In: Multilingual Corpus Research: Advances and challenges
Edited by Noelia Ramón and María Pérez Blanco
[Studies in Corpus Linguistics 126] 2026
pp. 296–314

Piotr Pęzik | University of Łódź

Łukasz Grabowski | University of Opole

Published online: 20 February 2026

https://doi.org/10.1075/scl.126.12pez

Abstract

This chapter has two major goals. We report our work on the extension of the English-Polish parallel corpus Paralela (Pęzik, 2016) and present the results of experiments with multilingual sentence embeddings and large language models (LLMs) for bilingual phraseology extraction from the newly obtained data. Our evaluation, conducted using the J48 classifier, which is Weka’s implementation of the C4.5 decision tree algorithm (Quinlan, 1993), showed that with circa 0.84 precision and 0.65 recall we can obtain a combinatorial dictionary with thousands of recurrent phrasal equivalents from the extended near-parallel news headlines corpus. The results are valuable as they may help translation researchers or translator trainers assess the usefulness of AI-assisted tools, notably LLMs, for bilingual phraseology extraction. The findings also provide insights into the degree of repetition or lexical variety in potential translations.

Keywords: parallel corpus, translation equivalence, phraseology extraction, multilingual sentence embeddings

Article outline

1. Introduction
2. Towards the extension of the Paralela corpus
3. Bilingual phraseology extraction
4. Conclusions
Acknowledgements
Notes
References

References (62)

Artetxe, M., & Schwenk, H. (2019a). Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv preprint. arXiv:1812.10464.

Artetxe, M., & Schwenk, H. (2019b). Margin-based parallel corpus mining with multilingual sentence embeddings. In A. Korhonen, D. Traum, & L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3197–3203). Association for Computational Linguistics. Retrieved on 30 December 2020 from [URL].

Bernardini, S., Ferraresi, A., Garcea, F., & Rodriguez Blanco, N. (2023). Corpus approaches to news translation: Can we do better than comparable? In M. Kajzer-Wietrzny & A. Chmiel (Eds.), UCCTS 2023: Book of abstracts (pp. 21–24). Retrieved on 28 October 2023 from [URL].

Bernardini, S., & Ferraresi, A. (2024). Corpus approaches to news translation: We can do better than comparable! Across Languages and Cultures, 25(2), 198–215.

Biel, Ł. (2017). Enhancing the communicative dimension of legal translation: Comparable corpora in the research-informed classroom. The Interpreter and Translator Trainer, 11(4), 316–336.

Biel, Ł., & Koźbiał, D. (2020). How do translators handle (near-)synonymous legal terms? A mixed-genre parallel corpus study into the variation of EU English-Polish competition law terminology. Estudios de Traducción, 10, 69–90.

Čermák, F., & Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427.

Chen, P. (2024). The impact of generative AI on the role of translators and its implications for translation education. Education Insights, 1(2), 24–33.

Church, K. (2025). Comparable corpora: Opportunities for new research directions. arXiv preprint. arXiv:2501.14721v1.

Cowie, A. (1998). Phraseology: Theory, analysis and applications. Clarendon Press.

Fantinuoli, C. (2023). Towards AI-enhanced computer-assisted interpreting. In G. Corpas Pastor & B. Defrancq (Eds.), Interpreting technologies — Current and future trends (pp. 46–71). John Benjamins.

Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2022). Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long papers (pp. 878–891). Association for Computational Linguistics. Retrieved on 20 October 2023 from [URL].

Frank, E., Hall, M., Witten, I., & Pal, C. (2016). The WEKA Workbench. Online appendix for Data mining: Practical machine learning tools and techniques (4th ed.). Morgan Kaufmann. Retrieved on 5 February 2025 from [URL].

Fu, L., & Liu, L. (2024). What are the differences? A comparative study of generative artificial intelligence translation and human translation of scientific texts. Humanities and Social Sciences Communications, 11, 1236.

Galtung, J., & Ruge, M. H. (1965). The structure of foreign news: The presentation of the Congo, Cuba and Cyprus crises in four Norwegian newspapers. Journal of Peace Research, 2(1), 64–90.

Gete, H., & Etchegoyhen, T. (2022). Making the most of comparable corpora in neural machine translation: A case study. Language Resources & Evaluation, 56, 943–971.

Grabowski, Ł. (2018). On identification of bilingual lexical bundles for translation purposes. The case of an English-Polish comparable corpus of patient information leaflets. In R. Mitkov, J. Monti, G. Corpas Pastor, & V. Seretan (Eds.), Multiword units in machine translation and translation technology (pp. 182–199). John Benjamins.

Grabowski, Ł. (2022). Provoke or encourage improvements? On semantic prosody in English-to-Polish translation. Perspectives: Studies in Translation Theory and Practice, 30, 120–136.

Granger, S., & Lefer, M.-A. (2022). Corpus-based translation and interpreting studies: A forward-looking review. In S. Granger & M.-A. Lefer (Eds.), Extending the scope of corpus-based translation studies (pp. 13–41). Bloomsbury.

Guo, M., Shen, Q., Yang, Y., Ge, H., Cer, D., Abrego, G., Stevens, K., Constant, N., Sung, Y.-H., Strope, B., & Kurzweil, R. (2018). Effective parallel corpus mining using bilingual sentence embeddings. In O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham et al. (Eds.), Proceedings of the Third Conference on Machine Translation (WMT), Volume 1: Research papers (pp. 165–176). Association for Computational Linguistics.

Hareide, L. (2019). Comparable parallel corpora. A critical review of current practices in corpus-based translation studies. In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 19–38). John Benjamins.

Hjarvard, S. (2004). The globalization of language. How the media contribute to the spread of English and the emergence of medialects. Nordicom Review, 25(1), 75–97.

Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.

Hothorn, T., & Zeileis, A. (2015). partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16, 3905–3909.

Hothorn, T., Seibold, H., & Zeileis, A. (2024). Package ‘partykit’: A toolkit for recursive partytioning. Retrieved on 23 November 2020 from [URL].

Jantunen, J. (2002). Comparable corpora in translation studies: Strengths and limitations. SKY Journal of Linguistics, 5, 105–117.

Johansson, S. (2007). Seeing through multilingual corpora. On the use of corpora in contrastive studies. John Benjamins.

Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 535–547.

Kajzer-Wietrzny, M., Ivaska, I., & Ferraresi, A. (2021). ‘Lost’ in interpreting and ‘found’ in translation: Using an intermodal, multidirectional parallel corpus to investigate the rendition of numbers. Perspectives: Studies in Translation Theory and Practice, 29(4), 469–488.

Klimaszewski, M., & Wróblewska, A. (2021). COMBO: State-of-the-art morphosyntactic analysis. arXiv preprint. arXiv:2109.05361.

Kruk, M., & Kałużna, A. (2024). Investigating the role of AI tools in enhancing translation skills, emotional experiences, and motivation in L2 learning. European Journal of Education, e12859.

Lapshinova-Koltunski, E. (2022). Detecting normalisation and shining-through in novice and professional. In S. Granger & M.-A. Lefer (Eds.), Extending the scope of corpus-based translation studies (pp. 182–206). Bloomsbury.

Lai, G., Dai, Z., & Yang, Y. (2020). Unsupervised parallel corpus mining on web data. arXiv preprint. arXiv:2009.08595.

Lefer, M.-A. (2020). Parallel corpora. In M. Paquot & S. T. Gries (Eds.), Practical handbook of corpus linguistics (pp. 257–282). Springer.

Lewandowska-Tomaszczyk, B., & Pęzik, P. (2018). Parallel and comparable language corpora, cluster equivalence and translator education. In Society and languages in the third millennium — Communication. Education. Translation (pp. 131–142). RUDN University.

López Arroyo, B., & Roberts, R. (2017). Genre and register in comparable corpora: An English/Spanish contrastive analysis. Meta, 62(1), 114–136.

Macanovic, A., & Przepiorka, W. (2024). Mapping individuals’ internal states from online posts. Behavior Research Methods, 56, 2782–2803.

Marco, J. (2019). Living with parallel corpora. In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 39–56). John Benjamins.

Mastropierro, L. (2020). The translation of reporting verbs in Italian: The case of the Harry Potter series. International Journal of Corpus Linguistics, 25(3), 241–269.

Mastropierro, L., & Grabowski, Ł. (2024). Repeated reporting verbs in English novels and their Italian and Polish translations: A preliminary multifactorial study. Across Languages and Cultures, 25(2), 310–330.

Mikhailov, M., & Cooper, R. (2016). Corpus linguistics for translation and contrastive studies. A guide for research. Routledge.

Nádvorníková, O. (2024). French, Polish and Czech converbs: A contrastive corpus-based study. Languages in Contrast, 24(2), 197–225.

Philip, G. (2009). Arriving at equivalence: Making a case for comparable general reference corpora in translation studies. In A. Beeby, P. Rodríguez-Inés, & P. Sánchez-Gijón (Eds.), Corpus use and translating: Corpus use for learning to translate and learning corpus use to translate (pp. 59–73). John Benjamins.

Pęzik, P. (2014). Graph-based analysis of collocational profiles. In V. Jesenšek & P. Grzybek (Eds.), Phraseologie im Wörterbuch und Korpus (pp. 227–243). Filozofska fakulteta.

Pęzik, P. (2016). Exploring phraseological equivalence with Paralela. In E. Gruszczyńska & A. Leńko-Szymańska (Eds.), Polish-language parallel corpora (pp. 67–81). Instytut Lingwistyki Stosowanej UW.

Pęzik, P. (2018). Facets of prefabrication. Perspectives on modelling and detecting phraseological units. Wydawnictwo Uniwersytetu Łódzkiego.

Pęzik, P. (2020). Budowa i zastosowania korpusu monitorującego MoncoPL. Forum Lingwistyczne, 7, 133–150.

Pęzik, P. (2021). Exploring the valency of collocational chains. In A. Trklja & Ł. Grabowski (Eds.), Formulaic language: Theories and methods (pp. 53–78).

Quinlan, R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.

Rabadán, R., & Izquierdo, M. (2013). A corpus-based analysis of English affixal negation translated into Spanish. In K. Aijmer & B. Altenberg (Eds.), Advances in corpus-based contrastive linguistics: Studies in honour of Stig Johansson (pp. 57–82). John Benjamins.

Ramón, N. (2023). Exploring near-synonyms through translation corpora. A case study on ‘begin’ and ‘start’ in the English-Spanish parallel corpus P-ACTRES. In M. Izquierdo & Z. Sanz-Villar (Eds.), Corpus use in cross-linguistic research: Paving the way for teaching, translation and professional communication (pp. 91–107). John Benjamins.

Sanjurjo-González, H., & Izquierdo, M. (2019). P-ACTRES 2.0: A parallel corpus for cross-linguistic research. In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 215–231). John Benjamins.

Schwenk, H., & Douze, M. (2017). Learning joint multilingual sentence representations with neural machine translation. arXiv preprint. arXiv:1704.04154.

Sharoff, S., Rapp, R., & Zweigenbaum, P. (2023a). Building comparable corpora. In Building and using comparable corpora for multilingual natural language processing (pp. 17–37). Springer.

Sharoff, S., Rapp, R., & Zweigenbaum, P. (2023b). Other applications of comparable corpora. In Building and using comparable corpora for multilingual natural language processing (pp. 117–128). Springer.

Tannenbaum, P. H. (1953). The effect of headlines on the interpretation of news stories. Journalism Quarterly, 30(2), 189–197.

Teich, E. (2003). Cross-linguistic variation in system and text: A methodology for the investigation of translations and comparable texts. Mouton de Gruyter.

Tiedemann, J. (2011). News from OPUS: A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, & R. Mitkov (Eds.), Recent advances in natural language processing (Vol. V, pp. 237–248). John Benjamins.

Vintar, Š. (2001). Using parallel corpora for translation-oriented term extraction. Babel, 47(2), 121–132.

Xia, G. (2020). A comparable-corpus-based study of informal features in academic writing by English and Chinese scholars across disciplines. Ibérica, 39, 119–140.

Yuxiu, Y. (2024). Application of translation technology based on AI in translation teaching. Systems and Soft Computing, 6, 200072.

Zanettin, F. (2014). Corpora in translation. In J. House (Ed.), Translation: A multidisciplinary approach (pp. 178–199). Palgrave Macmillan.
