Analysis of English text genre classification based on dependency types

Wang, Yaqin

doi:10.1075/cilt.356.17wan

In: Language and Text: Data, models, information and applications
Edited by Adam Pawłowski, Jan Mačutek, Sheila Embleton and George Mikros
[Current Issues in Linguistic Theory 356] 2021
pp. 257–270


Yaqin Wang | Guangdong University of Foreign Studies

Published online: 22 December 2021

https://doi.org/10.1075/cilt.356.17wan

Abstract

The present study explores whether dependency types can serve as a distinctive feature vector for classifying English text genres. Three methods, namely principal component analysis, hierarchical clustering, and random forest, were employed to investigate how well texts group by genre. The results show that dependency type is an effective measure for distinguishing text genres, especially spoken from written genres.
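As a rough illustration of the idea in the abstract, the sketch below represents each text as a vector of relative dependency-type frequencies and groups the texts with a simple average-linkage hierarchical clustering. Everything specific here is an assumption made for illustration: the four toy texts, their labels (Stanford-style names), the Euclidean distance, the linkage rule, and the helper names `type_vector` and `agglomerate`. The chapter's actual treebank, label inventory, and settings are not given on this page. Principal component analysis and random forest would operate on the same kind of vectors but are omitted to keep the example short.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data: each text is reduced to the list of dependency
# labels found in its parse. The labels and counts are invented.
TEXTS = {
    "spoken_1": ["nsubj", "root", "discourse", "nsubj", "root", "advmod"],
    "spoken_2": ["discourse", "nsubj", "root", "nsubj", "root", "dobj"],
    "written_1": ["det", "amod", "nsubj", "root", "prep", "pobj", "det", "amod"],
    "written_2": ["det", "nsubj", "root", "prep", "det", "amod", "pobj", "prep"],
}


def type_vector(labels, vocabulary):
    """Relative frequency of each dependency type in one text."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[t] / total for t in vocabulary]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(vectors[a], vectors[b]) for a in c1 for b in c2]
    return sum(dists) / len(dists)


def agglomerate(vectors, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = min(
            pairs,
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Shared vocabulary of dependency types across all texts.
vocabulary = sorted({label for labels in TEXTS.values() for label in labels})
vectors = {name: type_vector(labels, vocabulary) for name, labels in TEXTS.items()}
clusters = sorted(agglomerate(vectors, n_clusters=2))
print(clusters)
```

On this invented data the two spoken texts and the two written texts end up in separate clusters, but only because the toy labels were chosen to differ; it is not evidence for the chapter's findings.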

Keywords: dependency type, genre classification, principal component analysis, hierarchical clustering, random forest, spoken English, written English

Article outline

1. Introduction
2. Treebank establishment
3. Methods
- 3.1 Principal component analysis
- 3.2 Hierarchical cluster analysis
- 3.3 Random forest
4. Results and discussion
- 4.1 PCA
- 4.2 Text clustering
- 4.3 Random forest
5. Conclusions
Acknowledgements
Notes
References
Appendix
