Article published in: Register Studies
Vol. 3:1 (2021), pp. 1–32
Genre annotation for the Web
Text-external and text-internal perspectives
Published online: 3 June 2021
https://doi.org/10.1075/rs.19015.sha
Abstract
This paper describes a digital curation study aimed at comparing the composition of large Web corpora, such as
enTenTen, ukWaC or ruWac, by means of automatic text classification. First, the paper presents a Deep Learning model suitable for
classifying texts from large Web corpora using a small number of communicative functions, such as Argumentation or Reporting.
Second, it describes the results of applying the automatic classification model to these corpora and compares their composition.
Finally, the paper introduces a framework for interpreting the results of automatic genre classification using linguistic
features. The framework can help in comparing general reference corpora obtained from the Web and in comparing corpora across
languages.
Article outline
- 1. Introduction
- 1.1 Text-external communicative functions
- 1.2 Text-internal linguistic features
- 2. Automatic genre identification
- 2.1 Text classification model
- 2.2 Datasets for training
- 2.3 Prediction accuracy
- 3. Comparing large Web corpora
- 4. Communicative functions vs linguistic features
- 4.1 Detection of linguistic features
- 4.2 Mapping linguistic features to functions
- 4.3 Linguistic features across languages
- 5. Related studies on computational analysis of genres
- 6. Conclusions and further work
- Notes
- References