In: Multilingual Corpus Research: Advances and challenges
Edited by Noelia Ramón and María Pérez Blanco
[Studies in Corpus Linguistics 126] 2026
pp. 296–314
Chapter 12. Extracting translation equivalents from a near-parallel corpus of news texts
Experiments with multilingual sentence embeddings and large language models
Published online: 20 February 2026
https://doi.org/10.1075/scl.126.12pez
Abstract
This chapter has two major goals. We report our work on the extension of the English-Polish parallel corpus Paralela (Pęzik, 2016) and present the results of experiments with multilingual sentence embeddings and large
language models (LLMs) for bilingual phraseology extraction from the newly obtained data. Our evaluation, conducted using the J48
classifier, which is Weka’s implementation of the C4.5 decision tree algorithm (Quinlan, 1993),
showed that with circa 0.84 precision and 0.65 recall we can obtain a combinatorial dictionary with thousands of recurrent phrasal
equivalents from the extended near-parallel news headlines corpus. The results are valuable as they may help translation researchers or
translator trainers assess the usefulness of AI-assisted tools, notably LLMs, for bilingual phraseology extraction. The findings also
provide insights into the degree of repetition or lexical variety in potential translations.
Article outline
- 1. Introduction
- 2. Towards the extension of the Paralela corpus
- 3. Bilingual phraseology extraction
- 4. Conclusions
- Acknowledgements
- Notes
- References
References (62)
Artetxe, M., & Schwenk, H. (2019a). Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv preprint. arXiv: 1812.10464.
Artetxe, M., & Schwenk, H. (2019b). Margin-based parallel corpus mining with multilingual sentence embeddings. In A. Korhonen, D. Traum, & L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3197–3203). Association for Computational Linguistics. Retrieved on 30 December 2020 from [URL].
Bernardini, S., Ferraresi, A., Garcea, F., & Rodriguez Blanco, N. (2023). Corpus approaches to news translation: Can we do better than comparable? In M. Kajzer-Wietrzny & A. Chmiel (Eds.), UCCTS 2023— Book of abstracts (pp. 21–24). Retrieved on 28 October 2023 from [URL]
Bernardini, S., & Ferraresi, A. (2024). Corpus approaches to news translation: We can do better than comparable! Across Languages and Cultures, 25(2), 198–215.
Biel, Ł. (2017). Enhancing the communicative dimension of legal translation: Comparable corpora in the research-informed
classroom. The Interpreter and Translator Trainer, 11(4), 316–336.
Biel, Ł., & Koźbiał, D. (2020). How do translators handle (near-)synonymous legal terms? A mixed-genre parallel corpus study into the variation of EU
English-Polish competition law terminology. Estudios de Traducción, 10, 69–90.
Čermák, F., & Rosen, A. (2012). The case of InterCorp, a multilingual parallel corpus. International Journal of Corpus Linguistics, 17(3), 411–427.
Chen, P. (2024). The impact of generative AI on the role of translators and its implications for translation education. Education Insights, 1(2), 24–33.
Church, K. (2025). Comparable corpora: Opportunities for new research directions. arXiv preprint. arXiv: 2501.14721v1
Fantinuoli, C. (2023). Towards AI-enhanced computer-assisted interpreting. In G. Corpas Pastor & B. Defrancq (Eds.), Interpreting technologies — Current and future trends (pp. 46–71). John Benjamins.
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2022). Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long papers (pp. 878–891). Association for Computational Linguistics. Retrieved on 20 October 2023 from [URL].
Frank, E., Hall, M., Witten, I., & Pal, C. (2016). The WEKA Workbench. Online appendix for Data mining: Practical machine learning tools and techniques (4th ed.). Morgan Kaufmann. Retrieved on 5 February 2025 from [URL]
Fu, L., & Liu, L. (2024). What are the differences? A comparative study of generative artificial intelligence translation and human translation of
scientific texts. Humanities and Social Sciences Communication, 11, 1236.
Galtung, J., & Ruge, M. H. (1965). The structure of foreign news: The presentation of the Congo, Cuba and Cyprus crises in four Norwegian
newspapers. Journal of Peace Research, 2(1), 64–90.
Gete, H., & Etchegoyhen, T. (2022). Making the most of comparable corpora in neural machine translation: A case study. Language Resources & Evaluation, 56, 943–971.
Grabowski, Ł. (2018). On identification of bilingual lexical bundles for translation purposes. The case of an English-Polish comparable corpus
of patient information leaflets. In R. Mitkov, J. Monti, G. Corpas Pastor, & V. Seretan (Eds.), Multiword units in machine translation and translation technology (pp. 182–199). John Benjamins.
(2022). Provoke or encourage improvements? On semantic prosody in English-to-Polish translation. Perspectives: Studies in Translation Theory and Practice, 30, 120–136.
Granger, S., & Lefer, M.-A. (2022). Corpus-based translation and interpreting studies: A forward-looking review. In S. Granger & M.-A. Lefer (Eds.), Extending the scope of corpus-based translation studies (pp. 13–41). Bloomsbury.
Guo, M., Shen, Q., Yang, Y., Ge, H., Cer, D., Abrego, G., Stevens, K., Constant, N., Sung, Y.-H., Strope, B., & Kurzweil, R. (2018). Effective parallel corpus mining using bilingual sentence embeddings. In O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham et al., Proceedings of the Third Conference on Machine Translation (WMT), Volume 1: Research papers (pp. 165–176). Association for Computational Linguistics.
Hareide, L. (2019). Comparable parallel corpora. A critical review of current practices in corpus-based translation studies. In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 19–38). John Benjamins.
Hjarvard, S. (2004). The globalization of language. How the media contribute to the spread of English and the emergence of
medialects. Nordicom Review, 25(1), 75–97.
Hothorn, T., Hornik, K., & Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.
Hothorn, T., & Zeileis, A. (2015). partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16, 3905–3909.
Hothorn, T., Seibold, H., & Zeileis, A. (2024). Package ‘partykit’: A toolkit for recursive partytioning. Retrieved on 23 November 2020 from [URL]
Jantunen, J. (2002). Comparable corpora in translation studies: Strengths and limitations. SKY Journal of Linguistics, 5, 105–117.
Johansson, S. (2007). Seeing through multilingual corpora. On the use of corpora in contrastive studies. John Benjamins.
Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3), 535–547.
Kajzer-Wietrzny, M., Ivaska, I., & Ferraresi, A. (2021). ‘Lost’ in interpreting and ‘found’ in translation: Using an intermodal, multidirectional parallel corpus to investigate
the rendition of numbers. Perspectives: Studies in Translation Theory and Practice, 29(4), 469–488.
Klimaszewski, M., & Wróblewska, A. (2021). COMBO: State-of-the-art morphosyntactic analysis. arXiv preprint, arXiv:2109.05361.
Kruk, M., & Kałużna, A. (2024). Investigating the role of AI tools in enhancing translation skills, emotional experiences, and motivation in L2
learning. European Journal of Education, e12859.
Lapshinova-Koltunski, E. (2022). Detecting normalisation and shining-through in novice and professional. In S. Granger & M.-A. Lefer (Eds.), Extending the scope of corpus-based translation studies (pp. 182–206). Bloomsbury.
Lai, G., Dai, Z., & Yang, Y. (2020). Unsupervised parallel corpus mining on web data. arXiv preprint. arXiv: 2009.08595
Lefer, M.-A. (2020). Parallel corpora. In M. Paquot & S. T. Gries (Eds.), Practical handbook of corpus linguistics (pp. 257–282). Springer.
Lewandowska-Tomaszczyk, B., & Pęzik, P. (2018). Parallel and comparable language corpora, cluster equivalence and translator education. In Society and languages in the third millennium — Communication. Education. Translation (pp. 131–142). RUDN University.
López Arroyo, B., & Roberts, R. (2017). Genre and register in comparable corpora: An English/Spanish contrastive analysis. Meta, 62(1), 114–136.
Macanovic, A., & Przepiorka, W. (2024). Mapping individuals’ internal states from online posts. Behavior Research Methods, 56, 2782–2803.
Marco, J. (2019). Living with parallel corpora. In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 39–56). John Benjamins.
Mastropierro, L. (2020). The translation of reporting verbs in Italian: The case of the Harry Potter series. International Journal of Corpus Linguistics, 25(3), 241–269.
Mastropierro, L., & Grabowski, Ł. (2024). Repeated reporting verbs in English novels and their Italian and Polish translations: A preliminary multifactorial
study. Across Languages and Cultures, 25(2), 310–330.
Mikhailov, M., & Cooper, R. (2016). Corpus linguistics for translation and contrastive studies. A guide for research. Routledge.
Nádvorníková, O. (2024). French, Polish and Czech converbs: A contrastive corpus-based study. Languages in Contrast, 24(2), 197–225.
Philip, G. (2009). Arriving at equivalence: Making a case for comparable general reference corpora in translation studies. In A. Beeby, P. Rodríguez-Inés, & P. Sánchez-Gijón (Eds.), Corpus use and translating: Corpus use for learning to translate and learning corpus use to translate (pp. 59–73). John Benjamins.
Pęzik, P. (2014). Graph-based analysis of collocational profiles. In V. Jesenšek & P. Grzybek (Eds.), Phraseologie im Wörterbuch und Korpus (pp. 227–243). Filozofska fakulteta.
(2016). Exploring phraseological equivalence with Paralela. In E. Gruszczyńska & A. Leńko-Szymańska (Eds.), Polish-language parallel corpora (pp. 67–81). Instytut Lingwistyki Stosowanej UW.
(2018). Facets of prefabrication. Perspectives on modelling and detecting phraseological units. Wydawnictwo Uniwersytetu Łódzkiego.
(2021). Exploring the valency of collocational chains. In A. Trklja & Ł. Grabowski (Eds.), Formulaic language: Theories and methods (pp. 53–78).
Rabadán, R., & Izquierdo, M. (2013). A corpus-based analysis of English affixal negation translated into Spanish. In K. Aijmer & B. Altenberg (Eds.), Advances in corpus-based contrastive linguistics: Studies in honour of Stig Johansson (pp. 57–82). John Benjamins.
Ramón, N. (2023). Exploring near-synonyms through translation corpora. A case study on ‘begin’ and ‘start’ in the English-Spanish parallel
corpus PACTRES. In M. Izquierdo & Z. Sanz-Villar (Eds.), Corpus use in cross-linguistic research: Paving the way for teaching, translation and professional communication (pp. 91–107). John Benjamins.
Sanjurjo-González, H., & Izquierdo, M. (2019). P-ACTRES 2.0: A parallel corpus for cross-linguistic research. In I. Doval & M. T. Sánchez Nieto (Eds.), Parallel corpora for contrastive and translation studies: New resources and applications (pp. 215–231). John Benjamins.
Schwenk, H., & Douze, M. (2017). Learning joint multilingual sentence representations with neural machine translation. arXiv preprint. arXiv:1704.04154.
Sharoff, S., Rapp, R., & Zweigenbaum, P. (2023a). Building comparable corpora. In Building and using comparable corpora for multilingual natural language processing (pp. 17–37). Springer.
(2023b). Other applications of comparable corpora. In Building and using comparable corpora for multilingual natural language processing (pp. 117–128). Springer.
Tannenbaum, P. H. (1953). The effect of headlines on the interpretation of news stories. Journalism Quarterly, 30(2), 189–197.
Teich, E. (2003). Cross-linguistic variation in system and text: A methodology for the investigation of translations and comparable texts. Mouton de Gruyter.
Tiedemann, J. (2011). News from OPUS— A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, & R. Mitkov (Eds.), Recent advances in natural language processing (Vol. V, pp. 237–248). John Benjamins.
Vintar, Š. (2001). Using parallel corpora for translation-oriented term extraction. Babel, 47(2), 121–132.
Xia, G. (2020). A comparable-corpus-based study of informal features in academic writing by English and Chinese scholars across
disciplines. Ibérica, 39, 119–140.
