NWJC2Vec: Word embedding dataset from ‘NINJAL Web Japanese Corpus’

Asahara, Masayuki

doi:10.1075/term.00011.asa

Article published in: Computational terminology and filtering of terminological information
Edited by Patrick Drouin, Natalia Grabar, Thierry Hamon, Kyo Kageura and Koichi Takeuchi
[Terminology 24:1] 2018
► pp. 7–22

Masayuki Asahara | National Institute for Japanese Language and Linguistics

Published online: 31 May 2018

Abstract

In this paper, we present NWJC2Vec, a word embedding dataset constructed from the ‘NINJAL Web Japanese Corpus’ (NWJC). NWJC is a Web-crawled text corpus that contains 25.8 billion tokens. We construct two versions of the dataset: one based on surface forms, and the other based on the complete morpheme information provided by UniDic, a lexicon for the Japanese morphological analyser MeCab. We evaluate the dataset by comparing it with the ‘Word List by Semantic Principles’ (Bunrui Goihyo).

Keywords: word embedding, web corpus, thesaurus, Japanese language
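The comparison between embeddings and a thesaurus described in the abstract can be illustrated with a minimal sketch. It is not the paper's evaluation procedure: the vectors, the romanised words, and the category labels below are all hypothetical toy values (real NWJC2Vec vectors and WLSP category numbers are not reproduced here), and the helper names `cosine`, `nearest_neighbours`, and `category_agreement` are assumptions introduced for illustration. The idea is only to show how one might check whether a word's nearest neighbour in embedding space falls into the same thesaurus category.

```python
import math

# Hypothetical 3-dimensional toy vectors (romanised Japanese words).
# Real NWJC2Vec vectors are much higher-dimensional; these values are invented.
TOY_VECTORS = {
    "inu": [0.9, 0.1, 0.0],      # dog
    "neko": [0.8, 0.2, 0.1],     # cat
    "hashiru": [0.1, 0.9, 0.2],  # to run
    "aruku": [0.0, 0.8, 0.3],    # to walk
}

# Hypothetical category labels standing in for thesaurus (WLSP-style) categories.
TOY_CATEGORIES = {
    "inu": "animal",
    "neko": "animal",
    "hashiru": "motion",
    "aruku": "motion",
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, vectors, k=1):
    """Return the k words most similar to `word`, excluding the word itself."""
    query = vectors[word]
    scored = [
        (cosine(query, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]


def category_agreement(vectors, categories, k=1):
    """Fraction of (word, neighbour) pairs that share a category label."""
    hits = 0
    total = 0
    for word in vectors:
        for neighbour in nearest_neighbours(word, vectors, k):
            total += 1
            hits += categories[word] == categories[neighbour]
    return hits / total


if __name__ == "__main__":
    print(nearest_neighbours("inu", TOY_VECTORS))
    print(category_agreement(TOY_VECTORS, TOY_CATEGORIES))
```

With real data, the toy dictionaries would be replaced by vectors loaded from the distributed dataset and by category assignments taken from the thesaurus; how the paper handles words with several categories, or words absent from the thesaurus, is not specified here.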

Article outline

1. Introduction
2. Related work
3. NWJC
4. Modelling parameters
5. Evaluation using WLSP
- 5.1 WLSP
- 5.2 Evaluation methodology
  - Agreement between lexemes
  - Agreement between syntactic and semantic categories (finest level, article)
  - Agreement between syntactic and semantic categories (second level, section)
  - Agreement between syntactic and semantic categories (top level)
  - Agreement between syntactic categories
  - Disagreement
- 5.3 Evaluation using similarity measure buckets
  - Agreement between lexemes
  - Disagreement
  - Agreement between syntactic and semantic categories (finest level)
  - Agreement of syntactic and semantic categories (top level)
- 5.4 Evaluation using similarity rank
  - Agreement between syntactic and semantic categories (finest level)
  - Agreement between syntactic categories
  - Disagreement
- 5.5 Discussion
6. Conclusion
Acknowledgements
References

Cited by (6)

Nie, Xiaozhe, Zhijie Xu, Jianqin Zhang & Yu Tian

2023. Attention-Based Personalized Compatibility Learning for Fashion Matching. Applied Sciences 13:17 ► pp. 9638 ff.

Omura, Mai, Aya Wakasa & Masayuki Asahara

2023. Universal Dependencies for Japanese Based on Long-Unit Words by NINJAL. Journal of Natural Language Processing 30:1 ► pp. 4 ff.

Kato, Sachi, Masayuki Asahara, Nanami Moriyama, Asami Ogiwara & Makoto Yamazaki

2021. Opposite Information Annotation on ‘Word List by Semantic Principles’. Journal of Natural Language Processing 28:1 ► pp. 60 ff.

Ko, Daiki & Koichi Takeuchi

2020. Evaluation of Embedded Vectors for Lexemes and Synsets Toward Expansion of Japanese WordNet. In Computational Linguistics [Communications in Computer and Information Science, 1215], ► pp. 79 ff.

Asahara, Masayuki

2019. Surprisal through Word Embeddings. Journal of Natural Language Processing 26:3 ► pp. 635 ff.

Yoneda, Yoshiki, Yu Suzuki & Akiyo Nadamoto

2019. 2019 International Conference on Data Mining Workshops (ICDMW), ► pp. 441 ff.

This list is based on CrossRef data as of 5 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.