A confirmatory technique for comparisons in corpus linguistics: Logistic regression

Speelman, Dirk

doi:10.1075/hcp.43.18spe

In: Corpus Methods for Semantics: Quantitative studies in polysemy and synonymy
Edited by Dylan Glynn and Justyna A. Robinson
[Human Cognitive Processing 43] 2014
pp. 487–533

Dirk Speelman | University of Leuven

Published online: 6 November 2014

This text offers an introduction to binary logistic regression, a confirmatory technique for statistically modelling the effect of one or several predictors on a binary response variable. It explains why logistic regression is exceptionally well suited for the comparison of near-synonyms in corpus data: the technique allows the researcher to identify the different factors that have an impact on the choice between near-synonyms, and to tease apart their respective effects. Moreover, the technique is well suited to deal with the type of unbalanced data sets that are typical of corpus linguistics. First, we describe the contexts in which logistic regression is applicable and give examples of the types of research questions for which it is an appropriate tool. Next, we explain why and how logistic regression analysis differs from linear regression analysis, and we illustrate how the output of a logistic regression analysis can be interpreted, using the study of an alternation pattern in Dutch as our example. The R code used in the case study is explained in detail, and a URL is given from which the R code and data sets can be downloaded. Finally, suggestions for further reading are given.
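As a rough illustration of the kind of model the abstract describes, the following minimal sketch fits a binary logistic regression with a single binary predictor. It is written in Python rather than the R used in the chapter, it is not the chapter's code, and the data are invented: the predictor (`spoken`), the response (choice of a hypothetical "variant A"), and all counts are assumptions made purely for illustration. The two groups are deliberately unequal in size, and the model is fitted by plain gradient ascent on the log-likelihood.

```python
import math

def inv_logit(eta):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Estimate intercept b0 and slope b1 by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = inv_logit(b0 + b1 * x)  # predicted probability of variant A
            g0 += y - p                 # gradient with respect to b0
            g1 += (y - p) * x           # gradient with respect to b1
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Invented, unbalanced toy data (hypothetical):
#   written (spoken = 0): 30 tokens, 6 of them variant A
#   spoken  (spoken = 1): 10 tokens, 6 of them variant A
spoken = [0] * 30 + [1] * 10
variant_a = [1] * 6 + [0] * 24 + [1] * 6 + [0] * 4

b0, b1 = fit_logistic(spoken, variant_a)

# The model is linear on the log-odds scale, so exp(b1) is an odds ratio:
# the odds of variant A in spoken data divided by the odds in written data.
print("intercept (log-odds, written):", round(b0, 3))
print("slope (log odds ratio):", round(b1, 3))
print("odds ratio spoken vs. written:", round(math.exp(b1), 3))
print("P(variant A | written):", round(inv_logit(b0), 3))
print("P(variant A | spoken):", round(inv_logit(b0 + b1), 3))
```

With a single binary predictor, the fitted probabilities simply reproduce the observed proportions in each group, and the exponentiated slope equals the observed odds ratio, here (6/4)/(6/24) = 6. The unequal group sizes do not affect that estimate. With several predictors the same log-odds reading applies to each coefficient, holding the other predictors constant.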

Keywords: confirmatory statistics, outcome prediction, statistical modelling
