Improving term candidates selection using terminological tokens

Vàzquez, Mercè; Oliver, Antoni

doi:10.1075/term.00016.vaz

Article published in: Computational terminology and filtering of terminological information
Edited by Patrick Drouin, Natalia Grabar, Thierry Hamon, Kyo Kageura and Koichi Takeuchi
[Terminology 24:1] 2018
► pp. 122–147

Mercè Vàzquez | Universitat Oberta de Catalunya

Antoni Oliver | Universitat Oberta de Catalunya

Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Published online: 31 May 2018

https://doi.org/10.1075/term.00016.vaz

Abstract

Terms identified in domain-specific corpora by computational methods have to be validated manually by specialists, which is a highly time-consuming activity. To reduce this effort and improve term candidate selection, we implemented the Token Slot Recognition method, a filtering method based on terminological tokens that is used to rank term candidates extracted from domain-specific corpora. This paper presents the implementation of this term candidate filtering method in linguistic and statistical approaches to automatic term extraction, using several domain-specific corpora in different languages. We observed that the filtering method improves term candidate selection: it ranks more terms at the top of the term candidate list than raw frequency does, and for statistical term extraction the improvement is between 15% and 25% in both precision and recall. Our analyses further revealed a reduction in the number of term candidates to be validated manually by specialists. In conclusion, the Token Slot Recognition filtering method significantly reduces the number of term candidates extracted automatically from domain-specific corpora, so specialists can validate them more easily and quickly.

Keywords: automatic term extraction, terminology extraction, domain-specific corpora, terminological tokens, TSR filtering method, TBXTools, term candidates, terminological units
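The ranking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation or the TBXTools API: it assumes, for illustration only, that "terminological tokens" are the tokens of a short list of already known terms, recorded by the slot (first, middle, last) they occupy, and that a candidate scores one point per token found in the matching slot. All function names and the toy data are hypothetical. The sketch then compares this ranking with raw frequency using precision and recall over the top of the list.

```python
def token_slots(known_terms):
    """Collect the tokens seen in first, middle and last position of known terms.

    Assumption: terms are whitespace-tokenised, lower-cased strings.
    """
    slots = {"first": set(), "middle": set(), "last": set()}
    for term in known_terms:
        tokens = term.split()
        if not tokens:
            continue
        slots["first"].add(tokens[0])
        slots["last"].add(tokens[-1])
        slots["middle"].update(tokens[1:-1])
    return slots


def slot_score(candidate, slots):
    """Count how many tokens of the candidate occupy a slot seen in known terms."""
    tokens = candidate.split()
    if not tokens:
        return 0
    score = int(tokens[0] in slots["first"]) + int(tokens[-1] in slots["last"])
    score += sum(1 for t in tokens[1:-1] if t in slots["middle"])
    return score


def rank_by_slots(candidates, slots):
    """Sort candidates by slot score, then frequency, then alphabetically."""
    return sorted(candidates, key=lambda c: (-slot_score(c, slots), -candidates[c], c))


def rank_by_frequency(candidates):
    """Baseline: sort candidates by raw frequency, then alphabetically."""
    return sorted(candidates, key=lambda c: (-candidates[c], c))


def precision_recall_at(ranked, gold, n):
    """Precision and recall of the top-n candidates against a gold term set."""
    top = ranked[:n]
    hits = sum(1 for c in top if c in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy data (invented for this example, not taken from the paper).
known_terms = ["interest rate", "exchange rate", "monetary policy"]
candidates = {"following year": 12, "other hand": 9, "inflation rate": 5, "monetary union": 4}
gold = {"inflation rate", "monetary union"}

slots = token_slots(known_terms)
slot_ranked = rank_by_slots(candidates, slots)
freq_ranked = rank_by_frequency(candidates)

print(precision_recall_at(freq_ranked, gold, 2))
print(precision_recall_at(slot_ranked, gold, 2))
```

On this toy input the frequency baseline places the two general-language bigrams first, while the slot-based ranking moves the two gold terms to the top, which is the kind of effect the abstract reports. The scoring rule and tie-breaking order are design choices of this sketch and may differ from the published method.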

Article outline

1. Introduction
2. Background
3. Materials and methods
4. Results and discussion
- 4.1 Experimental settings
- 4.2 Term extraction procedure
- 4.3 Results and evaluation
  - Results for JRC Economics English
    - Statistical term extraction
    - Linguistic term extraction
  - Results for JRC Economics Spanish
  - Results for JRC Economics French
  - Results for IULA Economics Spanish
  - Results for IULA Health Spanish
  - Results for TERMCAT Social Services Spanish
  - Results for TERMCAT Social Services Catalan
- 4.4 Discussion
5. Conclusions and future work
References

Cited by (3)

Lei, Lei, Yaochen Deng & Dilin Liu

2023. Examining research topics with a dependency-based noun phrase extraction method: a case in accounting. Library Hi Tech 41:2 ► pp. 570 ff.

Kister, Laurence & Evelyne Jacquey

2022. Identification d’occurrences de candidats termes dans des articles scientifiques. Corela 20-1

Martín-Chozas, Patricia, Karen Vázquez-Flores, Pablo Calleja, Elena Montiel-Ponsoda, Víctor Rodríguez-Doncel, Julia Bosque-Gil, Milan Dojchinovski & Philipp Cimiano

2022. TermitUp: Generation and enrichment of linked terminologies. Semantic Web 13:6 ► pp. 967 ff.

This list is based on CrossRef data as of 5 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.