Word segmentation granularity in Korean

Park, Jungyeul; Kim, Mija

doi:10.1075/kl.00008.par

Article published in: Korean Linguistics
Vol. 20:1 (2024), pp. 82–112

Jungyeul Park | The University of British Columbia

Mija Kim | Kyung Hee University

Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Published online: 30 May 2024

https://doi.org/10.1075/kl.00008.par

Abstract

This paper describes word segmentation granularity in Korean language processing. Between the eojeol, a word delimited by blank spaces, and a full sequence of morphemes, Korean admits multiple levels of word segmentation granularity. Several of these levels have been proposed and used for specific language processing and corpus annotation tasks, because agglutinative languages such as Korean have a one-to-one mapping between functional morphemes and syntactic categories. We therefore analyze the different granularity levels and present examples of Korean language processing systems for future reference. Interestingly, the granularity that separates only functional morphemes, including case markers and verbal endings, while keeping other suffixes for morphological derivation attached yields the best performance for phrase structure parsing. This contradicts the previous best practice for Korean language processing, in which separating all morphemes has been the de facto standard for various applications.

Keywords: word segmentation granularity, morphological segmentation, agglutinative language, evaluation
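
The granularity levels named in the abstract and in the outline (Levels 1 to 5) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article: the toy sentence 학생들이 책을 읽었다. ("The students read the book."), its hand-made morpheme annotation, the `Morpheme` class, and the `is_cut` and `segment` helpers. In particular, the sketch assumes that Level 4 detaches a run of verbal endings as a single unit, and it simply concatenates morpheme forms, so contracted surface forms would need extra handling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str
    kind: str  # one of: "stem", "suffix", "case", "ending", "symbol"


# Hand-annotated toy sentence (hypothetical example, not from the article).
# Each inner list is one eojeol, i.e. one blank-space-delimited unit.
SENTENCE = [
    [Morpheme("학생", "stem"), Morpheme("들", "suffix"), Morpheme("이", "case")],
    [Morpheme("책", "stem"), Morpheme("을", "case")],
    [Morpheme("읽", "stem"), Morpheme("었", "ending"),
     Morpheme("다", "ending"), Morpheme(".", "symbol")],
]


def is_cut(left: Morpheme, right: Morpheme, level: int) -> bool:
    """Decide whether the boundary between two adjacent morphemes is split."""
    if level >= 5:  # Level 5: separate all morphemes
        return True
    kinds = (left.kind, right.kind)
    if level >= 2 and "symbol" in kinds:  # Level 2: detach symbols
        return True
    if level >= 3 and "case" in kinds:  # Level 3: detach case markers
        return True
    # Level 4 (assumption): cut only where a run of verbal endings starts
    # or stops, so consecutive endings stay together as one token.
    if level >= 4 and (left.kind == "ending") != (right.kind == "ending"):
        return True
    return False  # Level 1: eojeols are left intact


def segment(sentence: list[list[Morpheme]], level: int) -> list[str]:
    """Return the token sequence of a sentence at the given granularity level."""
    tokens = []
    for eojeol in sentence:
        current = eojeol[0].form
        for left, right in zip(eojeol, eojeol[1:]):
            if is_cut(left, right, level):
                tokens.append(current)
                current = right.form
            else:
                current += right.form
        tokens.append(current)
    return tokens


if __name__ == "__main__":
    for level in range(1, 6):
        print(f"Level {level}:", " ".join(segment(SENTENCE, level)))
```

Under these assumptions, the plural suffix 들 stays attached to its noun at every level below 5, which mirrors the abstract's distinction between functional morphemes and other suffixes.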

Article outline

1. Introduction
2. Previous work
3. Definition of segmentation granularity
- 3.1 Level 1: Eojeols
- 3.2 Level 2: Separating words and symbols
- 3.3 Level 3: Separating case markers
- 3.4 Level 4: Separating verbal endings
- 3.5 Level 5: Separating all morphemes
- 3.6 Discussion
4. Diagnostic analysis
- 4.1 Language processing tasks
  - Word segmentation, morphological analysis and POS tagging
  - Syntactic parsing
  - Machine translation
- 4.2 Results and discussion
Conclusion
Acknowledgement
Notes
References
