Article published in: Computational terminology and filtering of terminological information
Edited by Patrick Drouin, Natalia Grabar, Thierry Hamon, Kyo Kageura and Koichi Takeuchi
[Terminology 24:1] 2018
pp. 122–147
Improving term candidates selection using terminological tokens
Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.
For any use beyond this license, please contact the publisher at rights@benjamins.nl.
Published online: 31 May 2018
https://doi.org/10.1075/term.00016.vaz
Abstract
Identifying reliable terms in domain-specific corpora with computational
methods still requires manual validation by specialists, which is highly
time-consuming. To reduce this effort and improve term candidate selection, we
implemented the Token Slot Recognition method, a filtering method based on
terminological tokens that is used to rank term candidates extracted from
domain-specific corpora. This paper presents the implementation of this term
candidate filtering method in both linguistic and statistical approaches to
automatic term extraction, using several domain-specific corpora in different
languages. We observed that the filtering method outperforms raw frequency in
term candidate selection, ranking a higher number of terms at the top of the
term candidate list; for statistical term extraction, the improvement is
between 15% and 25% in both precision and recall. Our analyses further revealed
a reduction in the number of term candidates that specialists have to validate
manually. In conclusion, the Token Slot Recognition filtering method
significantly reduces the number of term candidates extracted automatically
from domain-specific corpora, so that specialists can validate them more easily
and quickly.
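The comparison described in the abstract, ranking term candidates and measuring precision and recall near the top of the list, can be illustrated with a minimal Python sketch. This is not the authors' Token Slot Recognition implementation: the overlap-based scoring in `rank_by_token_overlap`, the function names, and the toy data are all assumptions made for illustration only.

```python
def precision_recall_at_n(ranked, reference_terms, n):
    """Precision and recall over the first n candidates of a ranked list."""
    top = ranked[:n]
    reference = set(reference_terms)
    hits = sum(1 for cand in top if cand in reference)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


def rank_by_frequency(freqs):
    """Baseline: order candidates by raw corpus frequency (highest first)."""
    return sorted(freqs, key=lambda c: (-freqs[c], c))


def rank_by_token_overlap(freqs, term_tokens):
    """Hypothetical token-based ranking (an assumption, not the paper's method):
    candidates whose tokens appear in a set of known terminological tokens
    are moved up; frequency only breaks ties."""
    def score(candidate):
        tokens = candidate.split()
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if t in term_tokens) / len(tokens)

    return sorted(freqs, key=lambda c: (-score(c), -freqs[c], c))


# Toy data, invented for illustration.
freqs = {
    "of the": 50,
    "in order": 30,
    "last year": 20,
    "interest rate": 12,
    "exchange rate": 9,
    "monetary policy": 7,
}
reference = {"interest rate", "exchange rate", "monetary policy"}
term_tokens = {"interest", "rate", "policy"}  # e.g. tokens from an existing glossary

baseline = precision_recall_at_n(rank_by_frequency(freqs), reference, 3)
filtered = precision_recall_at_n(
    rank_by_token_overlap(freqs, term_tokens), reference, 3
)
print("raw frequency:", baseline)
print("token-based:  ", filtered)
```

On this toy input the frequency baseline fills the top three positions with non-terms, while the token-based ordering places the three reference terms first. The figures reported in the abstract come from the authors' own corpora and method, not from a sketch like this one.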
Article outline
- 1. Introduction
- 2. Background
- 3. Materials and methods
- 4. Results and discussion
  - 4.1 Experimental settings
  - 4.2 Term extraction procedure
  - 4.3 Results and evaluation
    - Results for JRC Economics English
      - Statistical term extraction
      - Linguistic term extraction
    - Results for JRC Economics Spanish
    - Results for JRC Economics French
    - Results for IULA Economics Spanish
    - Results for IULA Health Spanish
    - Results for TERMCAT Social Services Spanish
    - Results for TERMCAT Social Services Catalan
  - 4.4 Discussion
- 5. Conclusions and future work
