Incorporating structural topic modeling into short text analysis

doi:10.1075/consl.22026.wan

Article published in: Concentric

Vol. 49:1 (2023), pp. 96–139

Po-Ya Angela Wang | National Taiwan University

Shu-Kai Hsieh | National Taiwan University

Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Published online: 25 May 2023

Abstract

The past few decades have seen the rapid development of topic modeling. So far, research has been more concerned with determining the ideal number of topics or meaningful topic clustering words than with applying topic modeling techniques to evaluate linguistic theories. This study proposes the Structural Topic Model (STM)-led framework to facilitate the interpretation of topic modeling results and standardize text analysis. STM encompasses various model training mechanisms, thereby requiring systematic designs to properly combine language studies. “Structural” in STM refers to the inclusion of metadata structure. Unlike the corpus-based keyness approach, STM can capture contextual cues and meta-information for the interpretation of topical results. Besides, STM can make cross-corpora comparisons via topical contrast, a challenging task for corpus-driven related models such as the Biterm Topic Model (BTM). Stylistic variations in song lyrics are taken as an illustration to show how to use the suggested framework to delve into the linguistic theory proposed by . The topical model and iterable model in the proposed paradigm can clarify how pronouns affect style distinction. We believe the proposed STM-led framework can shed light on text analysis by conducting a reproducible cross-corpora comparison on short texts.

Keywords: structural topic modeling, biterm topic model, Chinese lyrics, corpus linguistics, keyness
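
For readers unfamiliar with the corpus-based keyness approach that the abstract contrasts with STM, the following is a minimal sketch of one common keyness statistic, log-likelihood (Dunning 1993). It is an illustration only and is not taken from the article: the function name `log_likelihood_keyness` and the toy token lists are assumptions, and the article may use a different metric. The sketch compares word frequencies between a target corpus and a reference corpus and uses no contextual or metadata information, which is the limitation the abstract attributes to this approach.

```python
import math
from collections import Counter


def log_likelihood_keyness(target_tokens, reference_tokens):
    """Return {word: G2}, the log-likelihood keyness of each word.

    G2 = 2 * (a * ln(a / E_t) + b * ln(b / E_r)), where a and b are the
    observed frequencies in the target and reference corpora, and E_t and
    E_r are the frequencies expected if the word were equally likely in both.
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n_t = sum(target.values())      # size of the target corpus
    n_r = sum(reference.values())   # size of the reference corpus
    scores = {}
    for word in set(target) | set(reference):
        a, b = target[word], reference[word]
        e_t = n_t * (a + b) / (n_t + n_r)
        e_r = n_r * (a + b) / (n_t + n_r)
        g2 = 0.0
        # A zero observed frequency contributes nothing to the sum.
        if a:
            g2 += a * math.log(a / e_t)
        if b:
            g2 += b * math.log(b / e_r)
        scores[word] = 2 * g2
    return scores


# Toy data (hypothetical): two tiny "corpora" of four tokens each.
target = "you love you you".split()
reference = "rain love rain you".split()
scores = log_likelihood_keyness(target, reference)
for word, g2 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{g2:.3f}")
```

A higher score means the word's frequency differs more strongly between the two corpora. The score itself does not say which corpus overuses the word, so in practice it is read together with the relative frequencies.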

Article outline

1. Introduction
2. Literature review
- 2.1 Lyrics and linguistics
- 2.2 Corpus-based approaches
- 2.3 Corpus-driven approaches
  - 2.3.1 Topic modeling
  - 2.3.2 Model evaluation
3. A proposed STM-led analytics framework
4. Lyrics analytics as a case study
- 4.1 Data pre-processing and exploration
- 4.2 Iterable assessment
  - 4.2.1 Linguistic supervision of model selection
  - 4.2.2 Topical quality model (TQ model)
  - 4.2.3 Iterable assessment model (IA model)
- 4.3 Generalization
5. Conclusion
Acknowledgements
Notes
References

References (81)

Aarts, F. G. A. M. 1971. On the distribution of noun-phrase types in English clause structure. Lingua 26.3:281–293.

Abuzayed, Abeer, and Hend Al-Khalifa. 2021. BERT for Arabic topic modeling: An experimental study on BERTopic technique. Procedia Computer Science 189:191–194.

Akella, Revanth, and Teng-Sheng Moh. 2019. Mood classification with lyrics and ConvNet. Proceedings of the 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), ed. by M. A. Wani, 511–514. Los Alamitos, CA: IEEE Computer Society.

Angelov, Dimo. 2020. Top2vec: Distributed Representations of Topics. Retrieved January 14th, 2023, from [URL]

Aranda, Ana M., Kathrin Sele, Helen Etchanchu, Jonne Y. Guyt, and Eero Vaara. 2021. From big data to rich theory: Integrating critical discourse analysis with structural topic modeling. European Management Review 18:197–214.

Arifah, Khadijah. 2016. Figurative Language Analysis in Five John Legend’s Song. Doctoral dissertation, Maulana Malik Ibrahim State Islamic University, Malang, Indonesia.

Arora, Sanjeev, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. Proceedings of the 30th International Conference on Machine Learning, ed. by Sanjoy Dasgupta and David McAllester, 280–288. Atlanta, GA: JMLR.org.

Baratè, Adriano, Luca A. Ludovico, and Enrica Santucci. 2013. A semantics-driven approach to lyrics segmentation. Proceedings of the 2013 8th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), ed. by Randall Bilof, 73–79. Los Alamitos, CA: IEEE Computer Society.

Barradas, Gonçalo T., and Laura S. Sakka. 2021. When words matter: A cross-cultural perspective on lyrics and their relationship to musical emotions. Psychology of Music 50.2:650–669.

Besson, Mireille, Frederique Faita, Isabelle Peretz, A-M. Bonnel, and Jean Requin. 1998. Singing in the brain: Independence of lyrics and tunes. Psychological Science 9.6:494–498.

Bischof, Jonathan, and Edoardo M. Airoldi. 2012. Summarizing topical content with word frequency and exclusivity. Proceedings of the 29th International Conference on Machine Learning (ICML-12), ed. by John Langford and Joelle Pineau, 201–208. Madison, WI: Omnipress.

Blei, David M. 2012. Probabilistic topic models. Communications of the ACM 55.4:77–84.

Blei, David M., and John D. Lafferty. 2007. A correlated topic model of Science. The Annals of Applied Statistics 1.1:17–35.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research 3:993–1022.

Chang, Jonathan, Sean Gerrish, Chong Wang, Jordan Boyd-Graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. Advances in Neural Information Processing Systems 32:288–296.

Chen, Stanley F., and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language 13.4:359–394.

Chen, Xieling, Di Zou, Gary Cheng, and Haoran Xie. 2020. Detecting latent topics and trends in educational technologies over four decades using structural topic modeling: A retrospective of all volumes of Computers & Education. Computers & Education 151:103855.

Damerau, Fred J. 1993. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing & Management 29.4:433–447.

Devi, Maibam Debina, and Navanath Saharia. 2020. Exploiting topic modelling to classify sentiment from lyrics. Proceedings of the 2nd International Conference on Machine Learning, Image Processing, Network Security and Data Sciences (MIND 2020), ed. by Arup Bhattacharjee, Samir Kr. Borgohain, Badal Soni, Gyanendra Verma and Xiao-Zhi Gao, 411–423. Singapore: Springer.

Dewi, Erniyanti Nur Fatahhela, Didin Nuruddin Hidayat, and Alek Alek. 2020. Investigating figurative language in “Lose You to Love Me” song lyric. Loquen: English Studies Journal 13.1:6–16.

Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19.1:61–74.

Ebeling, Régis, Carlos Abel Córdova Sáenz, Jeferson Campos Nobre, and Karin Becker. 2021. The effect of political polarization on social distance stances in the Brazilian COVID-19 scenario. Journal of Information and Data Management 12.1:86–108.

Eckstein, Lars. 2010. Reading Song Lyrics. Leiden: Brill.

Eisenstein, Jacob, Amr Ahmed, and Eric P. Xing. 2011. Sparse additive generative models of text. Proceedings of the 28th International Conference on Machine Learning (ICML-11), ed. by Lise Getoor and Tobias Scheffer, 1041–1048. Madison, WI: Omnipress.

Gabrielatos, Costas. 2018. Keyness analysis: Nature, metrics and techniques. Corpus Approaches to Discourse: A Critical Review, ed. by Charlotte Taylor and Anna Marchi, 225–258. London: Routledge.

Grootendorst, Maarten. 2022. BERTopic: Neural Topic Modeling with a Class-based TF-IDF Procedure. Retrieved May 7th, 2022, from [URL]

Hofmann, Thomas. 1999. Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ed. by Fredric Gey, Marti Hearst and Richard Tong, 50–57. New York, NY: Association for Computing Machinery.

Hong, Liangjie, and Brian D. Davison. 2010. Empirical study of topic modeling in Twitter. Proceedings of the 1st Workshop on Social Media Analytics, ed. by Prem Melville, Jure Leskovec and Foster Provost, 80–88. New York, NY: Association for Computing Machinery.

Hoover, David L. 2007. Corpus stylistics, stylometry, and the styles of Henry James. Style 41.2:174–203.

Kilgarriff, Adam. 1997. Using word frequency lists to measure corpus homogeneity and similarity between corpora. Proceedings of the 5th ACL Workshop on Very Large Corpora, ed. by Joe Zhou and Kenneth Church, 231–245. Beijing and Hong Kong: Tsinghua University and The Hong Kong University of Science and Technology.

Kilgarriff, Adam. 2005. Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory 1.2:263–276.

Kreyer, Rolf, and Joybrato Mukherjee. 2007. The style of pop song lyrics: A corpus-linguistic pilot study. Anglia. Journal of English Philology 125.1:31–58.

Laoh, Enrico, Isti Surjandari, and Limisgy Ramadhina Febirautami. 2018. Indonesians’ song lyrics topic modelling using latent Dirichlet allocation. Proceedings of the 2018 5th International Conference on Information Science and Control Engineering (ICISCE), ed. by Shaozi Li, Ying Dai and Yun Cheng, 270–274. Los Alamitos, CA: IEEE Computer Society.

Leech, Geoffrey, and Roger Fallon. 1992. Computer corpora-what do they tell us about culture? ICAME Journal 16:29–50.

Li, Peng-Hsuan, Tsu-Jui Fu, and Wei-Yun Ma. 2020. Why attention? Analyze BiLSTM deficiency and its remedies in the case of NER. Proceedings of the AAAI Conference on Artificial Intelligence, ed. by Vincent Conitzer and Fei Sha, 8236–8244. Palo Alto, CA: AAAI Press.

Lindstedt, Nathan C. 2019. Structural topic modeling for social scientists: A brief case study with social movement studies literature, 2005–2017. Social Currents 6.4:307–318.

Mimno, David M., and Andrew McCallum. 2008. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI 2008), ed. by David McAllester and Petri Myllymaki, 411–418. Arlington, VA: AUAI Press.

Mimno, David M., Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, ed. by Regina Barzilay and Mark Johnson, 262–272. Edinburgh, Scotland, UK: Association for Computational Linguistics.

Nahajec, Lisa. 2019. Song lyrics and the disruption of pragmatic processing: An analysis of linguistic negation in 10CC’s ‘I’m Not in Love’. Language and Literature 28.1:23–40.

Narayan, Ashwin, Bonnie Berger, and Hyunghoon Cho. 2021. Assessing single-cell transcriptomic variability through density-preserving data visualization. Nature Biotechnology 39:765–774.

Newman, David, Youn Noh, Edmund Talley, Sarvnaz Karimi, and Timothy Baldwin. 2010. Evaluating topic models for digital libraries. Proceedings of the 10th Annual Joint Conference on Digital Libraries, ed. by Jane Hunter, 215–224. New York, NY: Association for Computing Machinery.

North, Adrian C., Amanda E. Krause, and David Ritchie. 2020. The relationship between pop music and lyrics: A computerized content analysis of the United Kingdom’s weekly top five singles, 1999–2013. Psychology of Music 49.4:735–758.

Pennebaker, James W. 2013. The Secret Life of Pronouns: What Our Words Say About Us. London: Bloomsbury Publishing.

Petrie, Keith J., James W. Pennebaker, and Borge Sivertsen. 2008. Things we said today: A linguistic analysis of the Beatles. Psychology of Aesthetics, Creativity, and the Arts 2.4:197–202.

Pettijohn, Terry F., and Donald F. Sacco Jr. 2009. The language of lyrics: An analysis of popular Billboard songs across conditions of social and economic threat. Journal of Language and Social Psychology 28.3:297–311.

Pojanapunya, Punjaporn, and Richard Watson Todd. 2018. Log-likelihood and odds ratio: Keyness statistics for different purposes of keyword analysis. Corpus Linguistics and Linguistic Theory 14.1:133–167.

Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), ed. by Jian Su, Kevin Duh and Xavier Carreras, 2383–2392. Stroudsburg, PA: Association for Computational Linguistics.

Rayson, Paul. 2019. Corpus analysis of key words. The Concise Encyclopedia of Applied Linguistics, ed. by Carol Ann Chapelle, 320–326. Oxford: John Wiley & Sons.

Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. 2016. Navigating the local modes of big data: The case of topic models. Computational Social Science: Discovery and Prediction, ed. by R. Michael Alvarez, 51–97. New York: Cambridge University Press.

Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. 2019. stm: An R package for structural topic models. Journal of Statistical Software 91.2:1–40.

Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, and Edoardo M. Airoldi. 2013. The structural topic model and applied social science. Advances in Neural Information Processing Systems Workshop on Topic Models: Computation, Application, and Evaluation 4:1–20.

Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. 2014. Structural topic models for open-ended survey responses. American Journal of Political Science 58.4:1064–1082.

Röder, Michael, Andreas Both, and Alexander Hinneburg. 2015. Exploring the space of topic coherence measures. Proceedings of the 8th ACM International Conference on Web Search and Data Mining, ed. by Xueqi Cheng and Hang Li, 399–408. New York, NY: Association for Computing Machinery.

Sasaki, Shoto, Kazuyoshi Yoshii, Tomoyasu Nakano, Masataka Goto, and Shigeo Morishima. 2014. LyricsRadar: A lyrics retrieval system based on latent topics of lyrics. Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), ed. by Hsin-Min Wang, Yi-Hsuan Yang and Jin Ha Lee, 585–590. Taipei: International Society for Music Information Retrieval.

Schedl, Markus. 2019. Deep learning in music recommendation systems. Frontiers in Applied Mathematics and Statistics 5:44.

Schweinberger, Martin, Michael Haugh, and Sam Hames. 2021. Analysing discourse around COVID-19 in the Australian Twittersphere: A real-time corpus-based analysis. Big Data & Society 8.1:1–17.

Setiawati, Wilya, and Maryani Maryani. 2018. An analysis of figurative language in Taylor Swift’s song lyrics. PROJECT (Professional Journal of English Education) 1.3:261–268.

Shahmohammadi, Hassan, MirHossein Dezfoulian, and Muharram Mansoorizadeh. 2021. Paraphrase detection using LSTM networks and handcrafted features. Multimedia Tools and Applications 80.4:6479–6492.

Sharma, Hardik, Shelly Gupta, Yukti Sharma, and Archana Purwar. 2020. A new model for emotion prediction in music. Proceedings of the 2020 6th International Conference on Signal Processing and Communication (ICSC), ed. by Jitendra Mohan and Abhinav Gupta, 156–161. Los Alamitos, CA: IEEE Computer Society.

Snyder, Robin M. 2015. An introduction to topic modeling as an unsupervised machine learning way to organize text information. Paper presented at the Annual Meeting of the Association Supporting Computer Users in Education (ASCUE), Myrtle Beach, SC.

Sophiadi, Angelina. 2014. The song remains the same… or not? A pragmatic approach to the lyrics of rock music. Major Trends in Theoretical and Applied Linguistics, vol. 2, ed. by Nikolaos Lavidas, Thomaï Alexiou and Areti-Maria Sougari, 125–142. London: De Gruyter Open Poland.

Sterckx, Lucas. 2014. Topic Detection in a Million Songs. Doctoral dissertation, Ghent University, Ghent, Belgium.

Taddy, Matt. 2012. On estimation and selection for topic models. Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, ed. by Neil D. Lawrence and Mark Girolami, 1184–1193. Retrieved May 27th, 2022, from [URL]

Tegge, Friederike. 2017. The lexical coverage of popular songs in English language teaching. System 67:87–98.

Trenquier, Henri. 2018. Improving Semantic Quality of Topic Models for Forensic Investigation. Doctoral dissertation, University of Amsterdam, Amsterdam, Netherlands.

Varnum, Michael E. W., Jaimie Arona Krems, Colin Morris, Alexandra Wormley, and Igor Grossmann. 2021. Why are song lyrics becoming simpler? A time series analysis of lyrical complexity in six decades of American popular music. PLOS ONE 16.1:0244576.

Wallach, Hanna Megan. 2006. Topic modeling: beyond bag-of-words. Proceedings of the 23rd International Conference on Machine learning, ed. by William W. Cohen and Andrew Moore, 977–984. New York, NY: Association for Computing Machinery.

Wallach, Hanna Megan. 2008. Structured Topic Models for Language. Doctoral dissertation, University of Cambridge, Cambridge, UK.

Wallach, Hanna Megan, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. Proceedings of the 26th Annual International Conference on Machine Learning, ed. by Andrea Danyluk, 1105–1112. New York, NY: Association for Computing Machinery.

Wang, Jie, and Xinyan Zhao. 2019. Theme-Aware Generation Model for Chinese Lyrics. Retrieved September 20th, 2022, from [URL]

Watanabe, Kento, Matsubayashi Yuichiroh, Inui Kentaro, Nakano Tomoyasu, Fukayama Satoru, and Goto Masataka. 2017. Lyrisys: An interactive support system for writing lyrics based on topic transition. Proceedings of the 22nd International Conference on Intelligent User Interfaces, ed. by George A. Papadopoulos and Tsvi Kuflik, 559–563. New York, NY: Association for Computing Machinery.

Weng, Jianshu, Ee-Peng Lim, Jing Jiang, and Qi He. 2010. Twitterrank: Finding topic-sensitive influential twitterers. Proceedings of the 3rd ACM International Conference on Web Search and Data Mining, ed. by Brian D. Davison and Torsten Suel, 261–270. New York, NY: Association for Computing Machinery.

Werner, Valentin. 2021. Catchy and conversational? A register analysis of pop lyrics. Corpora 16.2:237–270.

Whissell, Cynthia. 1996. Traditional and emotional stylometric analysis of the songs of Beatles Paul McCartney and John Lennon. Computers and the Humanities 30.3:257–265.

Wright, David. 2014. Stylistics Versus Statistics: A Corpus Linguistic Approach to Combining Techniques in Forensic Authorship Analysis Using Enron Emails. Doctoral dissertation, University of Leeds, Leeds, England.

Xia, Xiaoling, Xin Gu, and Qinyang Lu. 2019. Research on the model of lyric emotion algorithm. Journal of Physics: Conference Series 1213:042004.

Yan, Xiaohui, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2013. A biterm topic model for short texts. Proceedings of the 22nd International Conference on World Wide Web, ed. by Daniel Schwabe, Virgílio Almeida and Hartmut Glaser, 1445–1456. New York, NY: Association for Computing Machinery.

Yao, Liang, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI-19), ed. by The Association for the Advancement of Artificial Intelligence, 7370–7377. Palo Alto, CA: AAAI Press.

Zhang, Lei, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8.4:1253.

Zhang, Liang, Keli Xiao, Hengshu Zhu, Chuanren Liu, Jingyuan Yang, and Bo Jin. 2018. CADEN: A context-aware deep embedding network for financial opinions mining. Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM), ed. by Lisa O’Conner, 757–766. Los Alamitos, CA: IEEE Computer Society.

Zhao, Wayne Xin, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. 2011. Comparing Twitter and traditional media using topic models. Advances in Information Retrieval: Proceedings of the 33rd European Conference on IR Research, ed. by Paul Clough, Colum Foley, Cathal Gurrin, Gareth J. F. Jones, Wessel Kraaij, Hyowon Lee and Vanessa Murdock, 338–349. Heidelberg: Springer Berlin.

Cited by (3)

Britt, Brian C, Matthew S VanDyke, Jameson L Hayes & Kate A Brauman

2025. ‘You were elected to lead the living, not the dead’: a computational analysis of social media discourse about Nigerian farmer-herder conflicts. Environmental Research: Water 1:1, pp. 015004 ff.

Coelho, Raquel & Alden McCollum

2025. Illuminating socially distributed identity resources in student writing through artificial intelligence to support the design of culturally informed learning experiences. Discourse Processes, pp. 1 ff.

Salami, Olawale & Temitayo Matthew Fagbola

2025. Topic modelling and sentiment analysis for public opinion mining of the #BBNaija reality TV show: a critical analysis. Social Network Analysis and Mining 15:1

This list is based on CrossRef data as of 8 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.