The Process of Designing a Multidisciplinary Monolingual Sample Corpus

Dash, N.S.

doi:10.1075/ijcl.5.2.05das

Article published in: International Journal of Corpus Linguistics
Vol. 5:2 (2000), pp. 179–197

N.S. Dash | Indian Statistical Institute

Published online: 30 May 2001

https://doi.org/10.1075/ijcl.5.2.05das

This paper discusses an approach to developing a sample corpus of printed Bangla, one of the national languages of India and the only national language of Bangladesh. The corpus is built from data collected from various published documents. The paper highlights issues related to corpus generation, data-file preparation, language analysis, and processing, as well as the corpus's potential applications in different areas of pure and applied linguistics. It also reports statistical studies on the corpus along with some interpretation of the results, and points out the difficulties one may face during corpus generation.

Keywords: dictionary, word forms, concordance, NLP, machine translation, graphic symbol, corpus, diacritic, data-file
