In: Investigating Wikipedia: Linguistic corpus building, exploration and analysis
Edited by Céline Poudat, Harald Lüngen and Laura Herzberg
[Studies in Corpus Linguistics 121] 2024
pp. 12–44
Chapter 1. Building a comparable corpus of online discussions on Wikipedia
The EFG WikiCorpus
Published online: 31 October 2024
https://doi.org/10.1075/scl.121.01hod
Abstract
This chapter presents the EFG WikiCorpus, a corpus composed of all the talk pages dedicated to (co)writing an article in the English, French and German Wikipedias. It explains the place of talk pages in Wikipedia and describes the basic structure of a talk page before detailing how the EFG WikiCorpus was built, from the Wikipedia archives to a TEI resource encoded according to the TEI CMC-core schema. The chapter concludes with a quantitative overview of the EFG WikiCorpus and of the EFG WikiDemoCorpus, a derived subcorpus used for qualitative analyses in several contributions to this volume.
Keywords: Wikipedia talk pages, corpus building, TEI CMC-core
Article outline
- 1. Introduction
- 2. Wikipedia talk pages: Wikipedia’s backstage
- 2.1 The main characteristics of Wikipedia talk pages
- 2.2 The basic structure of a Wikipedia talk page (tp)
- 2.3 Talk page encoding: The TEI CMC-core schema
- 3. Building the EFG WikiCorpus
- 3.1 Searching for relevant content in the Wikipedia archives and the wikiCode
- 3.2 Extracting talk pages and TEI encoding metadata
- 3.3 Parsing the wikiCode and TEI CMC-core encoding
- 3.3.1 The global content structure of a talk page
- 3.3.2 Structuring and encoding the threads into posts
- 3.3.3 Templates and special features
- 4. The resulting EFG WikiCorpus
- 4.1 Quantitative overview of the talk page content
- 4.2 Metadata overview and multilingual alignments
- 4.3 Brief linguistic overview
- 4.4 The EFG WikiDemoCorpus (WDC): A derived subcorpus for more qualitative analyses
- 5. Conclusion
