Genre annotation for the Web: Text-external and text-internal perspectives

Sharoff, Serge

doi:10.1075/rs.19015.sha

Article published in: Register Studies
Vol. 3:1 (2021), pp. 1–32

Serge Sharoff | University of Leeds

Published online: 3 June 2021

https://doi.org/10.1075/rs.19015.sha

Abstract

This paper describes a digital curation study aimed at comparing the composition of large Web corpora, such as enTenTen, ukWac or ruWac, by means of automatic text classification. First, the paper presents a Deep Learning model suitable for classifying texts from large Web corpora using a small number of communicative functions, such as Argumentation or Reporting. Second, it describes the results of applying the automatic classification model to these corpora and compares their composition. Finally, the paper introduces a framework for interpreting the results of automatic genre classification using linguistic features. The framework can help in comparing general reference corpora obtained from the Web and in comparing corpora across languages.

Keywords: automatic genre identification, deep learning, interpreting neural networks

Article outline

1. Introduction
- 1.1 Text-external communicative functions
- 1.2 Text-internal linguistic features
2. Automatic genre identification
- 2.1 Text classification model
- 2.2 Datasets for training
- 2.3 Prediction accuracy
3. Comparing large Web corpora
4. Communicative functions vs linguistic features
- 4.1 Detection of linguistic features
- 4.2 Mapping linguistic features to functions
- 4.3 Linguistic features across languages
5. Related studies on computational analysis of genres
6. Conclusions and further work
Notes
References

Cited by (8)

Cited by eight other publications

Kuzman, Taja & Nikola Ljubešić. 2025. Automatic genre identification: a survey. Language Resources and Evaluation 59:1, pp. 537 ff.

Boumechaal, Souad & Serge Sharoff. 2024. Attitudes, communicative functions, and lexicogrammatical features of anti-vaccine discourse on Telegram. Applied Corpus Linguistics 4:2, pp. 100095 ff.

Erten-Johansson, Selcen, Valtteri Skantsi, Sampo Pyysalo & Veronika Laippala. 2024. Linguistic variation beyond the Indo-European web. Register Studies 6:1, pp. 60 ff.

Sharoff, Serge & Nenad Ivanović. 2024. Sociolinguistic Variation in Slavic Languages. In The Cambridge Handbook of Slavic Linguistics, pp. 559 ff.

Kuzman, Taja, Igor Mozetič & Nikola Ljubešić. 2023. Automatic Genre Identification for Robust Enrichment of Massive Text Collections: Investigation of Classification Methods in the Era of Large Language Models. Machine Learning and Knowledge Extraction 5:3, pp. 1149 ff.

Repo, Liina, Brett Hashimoto & Veronika Laippala. 2023. In search of founding era registers: automatic modeling of registers from the corpus of Founding Era American English. Digital Scholarship in the Humanities 38:4, pp. 1659 ff.

Sharoff, Serge Aleksandrovich. 2022. What neural networks know about linguistic complexity. Russian Journal of Linguistics 26:2, pp. 371 ff.

Solberg, Winton U. 1992. The Early Years of the Jewish Presence at the University of Illinois. Religion and American Culture: A Journal of Interpretation 2:2, pp. 215 ff.

This list is based on CrossRef data as of 30 November 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.