Detecting contrast patterns in newspaper articles by combining discourse analysis and text mining

Senja Pollak, Roel Coesemans, Walter Daelemans and Nada Lavrač

Text mining aims to construct classification models and to find interesting patterns in large text collections. This paper investigates the utility of applying these techniques to media analysis, more specifically to support discourse analysis of news reports about the 2007 Kenyan elections and post-election crisis in local (Kenyan) and Western (British and US) newspapers. It illustrates how text mining methods can assist discourse analysis by finding contrast patterns that provide evidence for ideological differences between local and international press coverage. Our experiments indicate that the most significant differences pertain to the interpretive frame of the news events: whereas the newspapers from the UK and the US focus on ethnicity in their coverage, the Kenyan press concentrates on sociopolitical aspects.
