Automatic medical term extraction from Vietnamese clinical texts

Vo, Chau; Cao, Tru; Truong, Ngoc; Ngo, Trung; Bui, Dai

doi:10.1075/term.20037.vo

Article published in: Terminology
Vol. 28:2 (2022), pp. 299–327

Chau Vo | Ho Chi Minh City University of Technology, Vietnam National University Ho Chi Minh City

Tru Cao | The University of Texas Health Science Center at Houston

Ngoc Truong | FPT University

Trung Ngo | Tokyo University of Agriculture and Technology

Dai Bui | Unit Corporation

Published online: 9 June 2022

https://doi.org/10.1075/term.20037.vo

Abstract

In this paper, we propose the first method for automatic Vietnamese medical term discovery and extraction from clinical texts. The method combines linguistic filtering based on our defined open patterns with nested term extraction and statistical ranking using C-value. It requires no annotated corpora, external data resources, or parameter settings, and it imposes no restriction on term length. Besides its specialization in handling Vietnamese medical terms, another novelty is that it uses Pointwise Mutual Information to split nested terms and a disjunctive acceptance condition to extract them. Evaluated on real Vietnamese electronic medical records, it achieves a precision of about 74% and a recall of about 92%, and it proves stably effective on small datasets. It outperforms previous methods in the same category, that is, methods that use neither annotated corpora nor external data resources. Our method and empirical evaluation analysis can lay a foundation for further research and development in Vietnamese medical term discovery and extraction.

Keywords: automatic term extraction, electronic medical record, open linguistic pattern, pointwise mutual information, statistical ranking
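
The abstract names two standard measures, Pointwise Mutual Information and C-value. As a rough illustration only, the sketch below implements their textbook formulations (Church and Hanks 1989 for PMI; Frantzi, Ananiadou, and Mima 2000 for C-value). It is not the authors' implementation: the function names, the tuple-of-tokens representation, and the toy English counts are all assumptions made for this example, and the paper's splitting rule and disjunctive acceptance condition are not reproduced here.

```python
import math


def pmi(pair_count, x_count, y_count, total):
    """Base-2 pointwise mutual information of two units x and y.

    All arguments are raw counts from the same sample of size `total`.
    """
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


def contains(longer, shorter):
    """True if tuple `shorter` occurs as a contiguous run inside `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def c_values(freq):
    """Textbook C-value for each candidate term.

    `freq` maps a candidate (tuple of tokens) to its corpus frequency.
    For a nested candidate, the mean frequency of the longer candidates
    that contain it is subtracted before weighting by log2 of its length.
    Note that this classic form scores single-token candidates as 0.
    """
    scores = {}
    for a, f_a in freq.items():
        nesting = [f_b for b, f_b in freq.items()
                   if len(b) > len(a) and contains(b, a)]
        adjusted = f_a - sum(nesting) / len(nesting) if nesting else f_a
        scores[a] = math.log2(len(a)) * adjusted
    return scores


# Toy counts invented for illustration; they are not from the paper.
toy_freq = {
    ("acute", "renal", "failure"): 4,
    ("renal", "failure"): 10,
}
toy_scores = c_values(toy_freq)
toy_pmi = pmi(pair_count=8, x_count=16, y_count=8, total=64)
```

With these toy counts, the nested candidate `("renal", "failure")` is credited only for the occurrences not explained by the longer term that contains it, which is the intuition behind ranking nested terms with C-value.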

Article outline

1. Introduction
2. Related works
- 2.1 Linguistics-based
- 2.2 Statistics-based
- 2.3 Machine learning-based
- 2.4 Hybrid
3. The proposed method
- 3.1 Method overview
- 3.2 Preprocessing
- 3.3 Linguistics-based candidate term extraction
  - Part-of-Speech tagging
  - Open pattern-based term extraction
  - PMI-based nested term extraction
  - Stop word-based filtering
- 3.4 Statistics-based term ranking
4. Empirical evaluation
- 4.1 Data descriptions
- 4.2 Experiment settings and results
  - Self-evaluation
  - Comparative evaluation
5. Conclusions
References

Cited by (1)

Al-Thubaity, Abdulmohsen. 2025. “A Novel Dataset for Arabic Domain Specific Term Extraction and Comparative Evaluation of BERT-Based Models for Arabic Term Extraction.” ACM Transactions on Asian and Low-Resource Language Information Processing 24 (9): 1 ff.

This list is based on CrossRef data as of 6 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.