Tagging terms in text: A supervised sequential labelling approach to automatic term extraction

Rigouts Terryn, Ayla; Hoste, Véronique; Lefever, Els

doi:10.1075/term.21010.rig

Article published in: Terminology
Vol. 28:1 (2022) ► pp. 157–189

Ayla Rigouts Terryn | Ghent University

Véronique Hoste | Ghent University

Els Lefever | Ghent University

Published online: 10 January 2022

Abstract

As with many tasks in natural language processing, automatic term extraction (ATE) is increasingly approached as a machine learning problem. So far, most machine learning approaches to ATE broadly follow the traditional hybrid methodology, by first extracting a list of unique candidate terms, and classifying these candidates based on the predicted probability that they are valid terms. However, with the rise of neural networks and word embeddings, the next development in ATE might be towards sequential approaches, i.e., classifying each occurrence of each token within its original context. To test the validity of such approaches for ATE, two sequential methodologies were developed, evaluated, and compared: one feature-based conditional random fields classifier and one embedding-based recurrent neural network. An additional comparison was added with a machine learning interpretation of the traditional approach. All systems were trained and evaluated on identical data in multiple languages and domains to identify their respective strengths and weaknesses. The sequential methodologies were proven to be valid approaches to ATE, and the neural network even outperformed the more traditional approach. Interestingly, a combination of multiple approaches can outperform all of them separately, showing new ways to push the state-of-the-art in ATE.
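The contrast between the two task formulations can be sketched in a few lines of Python. This is only an illustration: the abstract does not state which labelling scheme the systems use, so the IOB scheme (B = first token of a term, I = later token of a term, O = outside any term), the example sentence, and the helper names `spans_to_iob` and `iob_to_terms` are all assumptions made here. In the sequential view every token occurrence receives its own label; the traditional view works on a list of unique candidate terms, which can be recovered from the labels.

```python
from typing import List, Tuple


def spans_to_iob(tokens: List[str], term_spans: List[Tuple[int, int]]) -> List[str]:
    """Turn term spans (start, end-exclusive token indices) into one IOB tag per token."""
    tags = ["O"] * len(tokens)
    for start, end in term_spans:
        tags[start] = "B"            # first token of the term
        for i in range(start + 1, end):
            tags[i] = "I"            # remaining tokens of the same term
    return tags


def iob_to_terms(tokens: List[str], tags: List[str]) -> List[str]:
    """Collapse a tagged sequence into a sorted list of unique, lower-cased terms."""
    terms = set()
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:              # a new term starts right after another one
                terms.add(" ".join(current).lower())
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:                        # "O", or a stray "I" with no open term
            if current:
                terms.add(" ".join(current).lower())
            current = []
    if current:                      # term that runs to the end of the sentence
        terms.add(" ".join(current).lower())
    return sorted(terms)


# Hypothetical example sentence with hand-picked term spans.
tokens = ["Heart", "failure", "is", "treated", "with", "ACE", "inhibitors",
          "when", "heart", "failure", "worsens"]
spans = [(0, 2), (5, 7), (8, 10)]

tags = spans_to_iob(tokens, spans)   # sequential view: one label per token occurrence
terms = iob_to_terms(tokens, tags)   # list view: unique terms, occurrences merged
print(list(zip(tokens, tags)))
print(terms)
```

Both occurrences of "heart failure" are labelled separately in the sequence but merge into a single entry in the term list, which is the information a list-based evaluation would see.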

Keywords: terminology, automatic term extraction, sequential labelling

Article outline

1. Introduction
2. Related research
- 2.1 Machine learning approaches
- 2.2 Evaluation
- 2.3 Features
- 2.4 Sequential approaches
3. Data
4. System description
- 4.1 CRFSuite feature-based sequential ATE
- 4.2 FlairNLP neural, embedding-based sequential ATE
- 4.3 HAMLET machine learning approach to traditional hybrid ATE
5. Experiments and results
- 5.1 Experimental setup
- 5.2 CRF results
- 5.3 RNN results
6. Analyses and discussion of results
- 6.1 Choice of experiments and motivation
- 6.2 Results per corpus
- 6.3 Sequential, neural approach vs. traditional, feature-based approach
- 6.4 Complementarity of results
7. RNN error analysis
8. Conclusion
Notes
References

Cited by (8)

Bolshakova, Elena I. & Vladislav V. Semak

2026. An Experimental Study on Cross-Domain Transformer-Based Term Recognition for Russian. In Data Analytics and Management in Data Intensive Domains [Communications in Computer and Information Science, 2641], ► pp. 127 ff.

Al-Thubaity, Abdulmohsen

2025. A Novel Dataset for Arabic Domain Specific Term Extraction and Comparative Evaluation of BERT-Based Models for Arabic Term Extraction. ACM Transactions on Asian and Low-Resource Language Information Processing 24:9 ► pp. 1 ff.

Tran, Hanh Thi-Hong, Carlos-Emiliano González-Gallardo, Antoine Doucet & Senja Pollak

2025. LlamATE. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 31:1 ► pp. 5 ff.

Wissik, Tanja

2025. Impact of automatic term extraction on terminology work. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication 31:1 ► pp. 110 ff.

Zhang, Peng, MengChen Zou & HongXia Du

2025. 2025 8th International Conference on Advanced Algorithms and Control Engineering (ICAACE), ► pp. 2138 ff.

Carbajo Coronado, Blanca & Antonio Moreno Sandoval

2024. Financial concepts extraction and lexical simplification in Spanish. Revista Electrónica de Lingüística Aplicada 22:1 ► pp. 164 ff.

Delaunay, Julien, Hanh Thi Hong Tran, Carlos-Emiliano González-Gallardo, Georgeta Bordea, Mathilde Ducos, Nicolas Sidere, Antoine Doucet, Senja Pollak & Olivier De Viron

2024. CoastTerm: A Corpus for Multidisciplinary Term Extraction in Coastal Scientific Literature. In Text, Speech, and Dialogue [Lecture Notes in Computer Science, 15048], ► pp. 97 ff.

Lefever, Els & Ayla Rigouts Terryn

2024. Computational Terminology. In New Advances in Translation Technology [New Frontiers in Translation Studies], ► pp. 141 ff.

This list is based on CrossRef data as of 6 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.