Automatic acquisition and classification of terminology using a tagged corpus in the molecular biology domain

Collier, Nigel; Nobata, Chikashi; Tsujii, Junichi

doi:10.1075/term.7.2.07col

Article published in: Terminology
Vol. 7:2 (2001), pp. 239–257

Published online: 22 April 2002

This article describes our work to identify and classify terms in the domain of molecular biology according to examples that have been marked up by a domain expert in a corpus of abstracts taken from a controlled search of the Medline database. Automatic acquisition of biomedical term lists has so far been slow due to high variability in both the terms and their classification scheme, which we attribute to the diversity of research disciplines involved. Nevertheless, the explosive growth in online molecular biology literature makes a persuasive case for automating many tasks. These include the acquisition of records for gene-product databases such as SwissProt, which are currently updated by human experts, a task that is both time-consuming and often highly idiosyncratic. In this article we report results from a tool based on a hidden Markov model for extracting and classifying terms that can be used as a key component in an information extraction system. We discuss the results in light of lexical, syntactic and semantic properties of terms that were revealed by our study.

Keywords: information extraction, molecular biology, named entity
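
The abstract mentions a tool based on a hidden Markov model for extracting and classifying terms. As a rough illustration of how such a tagger can assign a class label to each token, here is a minimal Viterbi decoding sketch. It is not the authors' model: the label set, the `shape` feature function, and every probability below are assumptions made up for this example, whereas a real system would estimate its parameters from the expert-tagged corpus.

```python
import math

# Hypothetical label set: two term classes plus "O" for non-term tokens.
STATES = ["PROTEIN", "DNA", "O"]


def shape(token):
    """Map a token to a coarse orthographic feature (an assumed feature set)."""
    if any(c.isdigit() for c in token) and any(c.isalpha() for c in token):
        return "alnum"
    if token.isupper():
        return "caps"
    if token.endswith("gene") or token.endswith("promoter"):
        return "dna_cue"
    return "lower"


# Made-up probabilities for illustration only (not taken from the article).
START = {"PROTEIN": 0.2, "DNA": 0.1, "O": 0.7}
TRANS = {
    "PROTEIN": {"PROTEIN": 0.5, "DNA": 0.1, "O": 0.4},
    "DNA": {"PROTEIN": 0.1, "DNA": 0.5, "O": 0.4},
    "O": {"PROTEIN": 0.2, "DNA": 0.1, "O": 0.7},
}
EMIT = {
    "PROTEIN": {"alnum": 0.5, "caps": 0.3, "dna_cue": 0.05, "lower": 0.15},
    "DNA": {"alnum": 0.3, "caps": 0.1, "dna_cue": 0.4, "lower": 0.2},
    "O": {"alnum": 0.02, "caps": 0.03, "dna_cue": 0.05, "lower": 0.9},
}


def viterbi(tokens):
    """Return the most probable label sequence for the tokens under the toy HMM."""
    if not tokens:
        return []
    obs = [shape(t) for t in tokens]
    # Each column maps a state to (best log-probability so far, backpointer).
    columns = [
        {s: (math.log(START[s]) + math.log(EMIT[s][obs[0]]), None) for s in STATES}
    ]
    for o in obs[1:]:
        prev = columns[-1]
        column = {}
        for s in STATES:
            # Pick the predecessor state that maximises the path score into s.
            score, back = max(
                (prev[p][0] + math.log(TRANS[p][s]), p) for p in STATES
            )
            column[s] = (score + math.log(EMIT[s][o]), back)
        columns.append(column)
    # Backtrack from the best final state to recover the full path.
    state = max(STATES, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi(["the", "IL2", "gene", "is", "expressed"]))
```

Working in log space avoids numerical underflow on longer sentences. The same decoding loop would apply unchanged if the hand-set tables were replaced by counts estimated from annotated abstracts.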
