TermEnsembler: An ensemble learning approach to bilingual term extraction and alignment

Repar, Andraž; Podpečan, Vid; Vavpetič, Anže; Lavrač, Nada; Pollak, Senja

doi:10.1075/term.00029.rep

Article published in: Terminology
Vol. 25:1 (2019), pp. 93–120

Andraž Repar | Jožef Stefan Institute

Vid Podpečan | Jožef Stefan Institute

Anže Vavpetič | Jožef Stefan Institute

Nada Lavrač | Jožef Stefan Institute

Senja Pollak | Jožef Stefan Institute

Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Published online: 24 July 2019

Abstract

This paper describes TermEnsembler, a bilingual term extraction and alignment system that uses a novel ensemble learning approach to bilingual term alignment. Processing starts with monolingual term extraction from a language industry standard file type containing aligned English and Slovenian texts. The two resulting term lists are then aligned automatically by an ensemble of seven bilingual alignment methods, which are first executed separately and then merged using weights learned with an evolutionary algorithm. In the experiments, the weights were learned on one domain and tested on two other domains. When evaluated on the top 400 aligned term pairs, the precision of term alignment exceeds 96%, while the share of correctly aligned multi-word unit terms exceeds 30%.

Keywords: bilingual terminology alignment, terminology extraction, ensemble learning, evolutionary algorithm
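
The merging step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes each alignment method returns a score per candidate term pair, combines them by a weighted sum, and evaluates precision on the top-k pairs. The function names, the toy scores, the gold set, and the simple mutate-and-keep-best loop (a stand-in for the paper's evolutionary algorithm) are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical toy scores from two alignment methods (the paper uses seven).
SCORES = {
    "dice": {("corpus", "korpus"): 0.9, ("language", "jezik"): 0.7,
             ("corpus", "jezik"): 0.3},
    "cofreq": {("corpus", "jezik"): 0.9, ("language", "jezik"): 0.3,
               ("corpus", "korpus"): 0.2},
}
# Hypothetical gold standard of correct English-Slovenian term pairs.
GOLD = {("corpus", "korpus"), ("language", "jezik")}


def merge_ranked_lists(method_scores, weights):
    """Rank candidate pairs by the weighted sum of their per-method scores."""
    combined = defaultdict(float)
    for method, scores in method_scores.items():
        for pair, score in scores.items():
            combined[pair] += weights[method] * score
    # Highest combined score first; ties broken alphabetically for determinism.
    return sorted(combined, key=lambda pair: (-combined[pair], pair))


def precision_at_k(ranking, gold, k):
    """Fraction of the top-k ranked pairs that appear in the gold standard."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for pair in top if pair in gold) / len(top)


def evolve_weights(method_scores, gold, k, generations=50, seed=0):
    """Assumed stand-in for evolutionary weighting: mutate, keep if not worse."""
    rng = random.Random(seed)
    best = {method: 1.0 for method in method_scores}
    best_fit = precision_at_k(merge_ranked_lists(method_scores, best), gold, k)
    for _ in range(generations):
        # Gaussian mutation of every weight, clipped at zero.
        child = {m: max(0.0, w + rng.gauss(0, 0.3)) for m, w in best.items()}
        fit = precision_at_k(merge_ranked_lists(method_scores, child), gold, k)
        if fit >= best_fit:
            best, best_fit = child, fit
    return best, best_fit


if __name__ == "__main__":
    equal = {"dice": 1.0, "cofreq": 1.0}
    print(precision_at_k(merge_ranked_lists(SCORES, equal), GOLD, 2))
    print(evolve_weights(SCORES, GOLD, k=2))
```

With equal weights the incorrect pair ranks first in this toy data, so down-weighting the weaker method improves precision on the top two pairs. In the paper's setting the weights would be learned on one domain and then applied unchanged to others.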

Article outline

1. Introduction
2. Related work
- 2.1 Monolingual term extraction
- 2.2 Bilingual term extraction and alignment
3. TermEnsembler system and methodology
- 3.1 System overview
- 3.2 Monolingual term extraction: LUIZ-CF++ upgrade of LUIZ-CF
- 3.3 Bilingual term alignment: A novel ensemble learning approach
  - 3.3.1 Individual bilingual term alignment algorithms
    - Co-frequency
    - Dice
    - Mutual information
    - BI-LUIZ+
    - Novel Phrase-Table-Based Alignment (PTBA) approaches PTBA-1, PTBA-2 and PTBA-3
  - 3.3.2 Final term pair ranking by ensemble-based weighting of separate lists of term pairs
  - 3.3.3 Evolutionary weighting of term alignment algorithms
4. Experiments and results
- 4.1 Experimental setting
- 4.2 Data
- 4.3 Experimental comparison of individual bilingual term alignment components
  - 4.3.1 Precision of individual term alignment components
  - 4.3.2 Single vs. multi-word unit terms
- 4.4 Results of the TermEnsembler’s bilingual term alignment approach
  - 4.4.1 Optimizing for optimal precision
  - 4.4.2 Optimizing for a compromise between optimal precision and number of correct multi-word unit term pairs
  - 4.4.3 Recall of the TermEnsembler system
- 4.5 Qualitative analysis of errors
5. Conclusions and future work
Acknowledgements
Notes
References

Cited by (5)

Xu, Kang, Yifan Feng, Qiandi Li, Zhenjiang Dong & Jianxiang Wei

2025. Survey on terminology extraction from texts. Journal of Big Data 12:1

Ivanović, Tanja, Ranka Stanković, Branislava Šandrih Todorović & Cvetana Krstev

2022. Corpus-based bilingual terminology extraction in the power engineering domain. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 28:2, pp. 228 ff.

Tran, Hanh Thi Hong, Matej Martinc, Antoine Doucet & Senja Pollak

2022. Can Cross-Domain Term Extraction Benefit from Cross-lingual Transfer? In Discovery Science [Lecture Notes in Computer Science, 13601], pp. 363 ff.

Tran, Hanh Thi Hong, Matej Martinc, Andraz Pelicon, Antoine Doucet & Senja Pollak

2022. Ensembling Transformers for Cross-domain Automatic Term Extraction. In From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries [Lecture Notes in Computer Science, 13636], pp. 90 ff.

Amjadian, Ehsan, Nicholas Prayogo, Serena McDonnell, Cathal Smyth & Muhammad Rizwan Abid

2021. 2021 IEEE Aerospace Conference (50100), pp. 1 ff.

This list is based on CrossRef data as of 5 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.