Register in computational language research

Argamon, Shlomo Engelson

doi:10.1075/rs.18015.arg

Article published in: Register Studies
Vol. 1:1 (2019) ► pp. 100–135

Shlomo Engelson Argamon | Illinois Institute of Technology

Published online: 26 April 2019

Abstract

Shlomo Argamon is Professor of Computer Science and Director of the Master of Data Science Program at the Illinois Institute of Technology (USA). In this article, he reflects on the current and potential relationship between register and the field of computational linguistics. He applies his expertise in computational linguistics and machine learning to a variety of problems in natural language processing. These include stylistic variation, forensic linguistics, authorship attribution, and biomedical informatics. He is particularly interested in the linguistic structures used by speakers and writers, including linguistic choices that are influenced by social variables such as age, gender, and register, as well as linguistic choices that are unique or distinctive to the style of individual authors. Argamon has been a pioneer in computational linguistics and NLP research in his efforts to account for and explore register variation. His computational linguistic research on register draws inspiration from Systemic Functional Linguistics, Biber’s multi-dimensional approach to register variation, as well as his own extensive experience accounting for variation within and across text types and authors. Argamon has applied computational methods to text classification and description across registers – including blogs, academic disciplines, and news writing – as well as the interaction between register and other social variables, such as age and gender. His cutting-edge research in these areas is certain to have a lasting impact on the future of computational linguistics and NLP.

Keywords: computational linguistics, natural language processing, style, stylistics, text classification

Article outline

Introduction
1. How is register conceptualized in computational language research?
2. How does register relate to the research goals within computational language research?
- 2.1 Aims of research in computational register analysis
  - 2.1.1 Aims of register classification
  - 2.1.2 Aims of multi-dimensional analysis
- 2.2 Aims of research in computational register synthesis
3. What are the major methodological approaches that are used to analyze or account for register in computational language research?
- 3.1 Methods of register analysis research
  - 3.1.1 Stylistic text features
  - 3.1.2 Methods of register classification
  - 3.1.3 Methods of multi-dimensional analysis
- 3.2 Methods of register synthesis research
4. What does a typical register study look like in computational language research?
- 4.1 Classification analysis: Multidisciplinary scientific texts
- 4.2 Multi-dimensional analysis: Abusive language
- 4.3 Text generation: Customized medical information
5. What are the most promising areas of future research on register in computational language research?
References

Cited by 12 other publications

Poucke, Margo Van

2025. Appraising Feedback Stance in Higher Education: A Corpus-Assisted Discourse Study of Student and Academic Perceptions, Perspectives and Preferences. Corpus Pragmatics 9:3 ► pp. 337 ff.

Skantsi, Valtteri & Veronika Laippala

2025. Analyzing the unrestricted web: The Finnish corpus of online registers. Nordic Journal of Linguistics 48:1 ► pp. 1 ff.

Repo, Liina, Brett Hashimoto & Veronika Laippala

2023. In search of founding era registers: automatic modeling of registers from the corpus of Founding Era American English. Digital Scholarship in the Humanities 38:4 ► pp. 1659 ff.

Chaves, Ana Paula, Jesse Egbert, Toby Hocking, Eck Doerry & Marco Aurelio Gerosa

2022. Chatbots Language Design: The Influence of Language Variation on User Experience with Tourist Assistant Chatbots. ACM Transactions on Computer-Human Interaction 29:2 ► pp. 1 ff.

Chaves, Ana Paula & Marco Aurelio Gerosa

2022. The Impact of Chatbot Linguistic Register on User Perceptions: A Replication Study. In Chatbot Research and Design [Lecture Notes in Computer Science, 13171], ► pp. 143 ff.

Marko, Karoline, Margit Reitbauer & Georg Pickl

2022. Same person, different platform. Register Studies 4:2 ► pp. 202 ff.

Mendhakar, Akshay

2022. Linguistic Profiling of Text Genres: An Exploration of Fictional vs. Non-Fictional Texts. Information 13:8 ► pp. 357 ff.

Degaetano-Ortlieb, Stefania, Tanja Säily & Yuri Bizzoni

2021. Registerial Adaptation vs. Innovation Across Situational Contexts: 18th Century Women in Transition. Frontiers in Artificial Intelligence 4

Laippala, Veronika, Jesse Egbert, Douglas Biber & Aki-Juhani Kyröläinen

2021. Exploring the role of lexis and grammar for the stable identification of register in an unrestricted corpus of web documents. Language Resources and Evaluation 55:3 ► pp. 757 ff.

Pérez-Guerra, Javier

2021. Theme as a proxy for register categorization. In Corpus-based approaches to register variation [Studies in Corpus Linguistics, 103], ► pp. 85 ff.

Bizzoni, Yuri, Stefania Degaetano-Ortlieb, Peter Fankhauser & Elke Teich

2020. Linguistic Variation and Change in 250 Years of English Scientific Writing: A Data-Driven Approach. Frontiers in Artificial Intelligence 3

[no author supplied]

2024. Textbook English [Studies in Corpus Linguistics, 116],

This list is based on CrossRef data as of 30 November 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.