Article published in: Terminology
Vol. 28:2 (2022), pp. 299–327
Automatic medical term extraction from Vietnamese clinical texts
Published online: 9 June 2022
https://doi.org/10.1075/term.20037.vo
Abstract
In this paper, we propose the first method for automatic discovery and extraction of Vietnamese medical terms from
clinical texts. The method combines linguistic filtering, based on open patterns that we define, with nested term extraction and
statistical ranking using C-value. It requires no annotated corpora, external data resources, parameter
settings, or restriction on term length. Besides being tailored to Vietnamese medical terms, the method is novel in using
Pointwise Mutual Information to split nested terms and a disjunctive acceptance condition to extract them. Evaluated on real
Vietnamese electronic medical records, it achieves a precision of about 74% and a recall of about 92%, and it remains stably
effective on small datasets. It outperforms previous methods in the same category, that is, those that use neither annotated
corpora nor external data resources. Our method and empirical analysis can lay a foundation for further research and development
in Vietnamese medical term discovery and extraction.
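The two statistical measures named in the abstract can be illustrated with a minimal sketch. This page does not give the paper's exact formulas, so the code assumes the textbook definitions: PMI as the base-2 log ratio of joint to independent probability, and C-value in the form introduced by Frantzi, Ananiadou, and Mima (2000). The function names, the toy frequencies, and the use of log2(|a| + 1) as the length weight (so that single-word terms do not score zero) are assumptions for illustration. The paper's PMI-based splitting and its disjunctive acceptance condition are not reproduced here.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise Mutual Information (base 2) estimated from raw counts.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with each probability
    estimated as count / total.
    """
    return math.log2((count_xy * total) / (count_x * count_y))


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence in `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def c_value(candidates):
    """Rank candidate terms by C-value, highest first.

    candidates: dict mapping a term (tuple of words) to its corpus frequency.
    Assumed variant: the length weight is log2(|a| + 1), not log2(|a|).
    """
    scores = {}
    for term, freq in candidates.items():
        # Frequencies of longer candidates in which `term` is nested
        nesting = [f for other, f in candidates.items()
                   if len(other) > len(term) and contains(other, term)]
        weight = math.log2(len(term) + 1)
        if nesting:
            # Discount occurrences that are explained by longer terms
            scores[term] = weight * (freq - sum(nesting) / len(nesting))
        else:
            scores[term] = weight * freq
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up frequencies: "viêm phổi" (pneumonia) nested in a longer candidate
toy = {("viêm", "phổi"): 10, ("viêm", "phổi", "cấp"): 4}
for term, score in c_value(toy):
    print(" ".join(term), round(score, 3))

print(pmi(count_xy=4, count_x=8, count_y=8, total=64))  # 2.0
```

In this toy input the shorter term still ranks first, because only 4 of its 10 occurrences are accounted for by the longer candidate that contains it.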
Article outline
- 1. Introduction
- 2. Related works
  - 2.1 Linguistics-based
  - 2.2 Statistics-based
  - 2.3 Machine learning-based
  - 2.4 Hybrid
- 3. The proposed method
  - 3.1 Method overview
  - 3.2 Preprocessing
  - 3.3 Linguistics-based candidate term extraction
    - Part-of-Speech tagging
    - Open pattern-based term extraction
    - PMI-based nested term extraction
    - Stop word-based filtering
  - 3.4 Statistics-based term ranking
- 4. Empirical evaluation
  - 4.1 Data descriptions
  - 4.2 Experiment settings and results
    - Self-evaluation
    - Comparative evaluation
- 5. Conclusions
