In: Diachronic Corpora, Genre, and Language Change
Edited by Richard J. Whitt
[Studies in Corpus Linguistics 85] 2018
pp. 41–64
Diachronic collocations, genre, and DiaCollo
Published online: 8 November 2018
https://doi.org/10.1075/scl.85.03jur
Abstract
This chapter presents the formal basis for diachronic collocation profiling as implemented in
the open-source software tool “DiaCollo” and sketches some potential applications to multi-genre diachronic corpora.
Explicitly developed for the efficient extraction, comparison, and interactive visualization of collocations from a
diachronic text corpus, DiaCollo is suitable for processing collocation pairs whose association strength depends on
extralinguistic features such as the date of occurrence or text genre. By tracking changes in a word’s typical
collocates over time, DiaCollo can help to provide a clearer picture of diachronic changes in the word’s usage,
especially those related to semantic shift or discourse environment. Use of the flexible DDC search engine back-end
allows user queries to make explicit reference to genre and other document-level metadata, enabling, for example,
independent genre-local profiles or cross-genre comparisons. In addition to traditional static tabular display
formats, a web-service plugin offers a number of intuitive interactive online visualizations of diachronic profile
data for immediate inspection.
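The idea of a date-dependent collocation profile can be illustrated with a minimal sketch. It is not DiaCollo's implementation: the toy corpus, the function name `collocation_profile`, the sentence-level co-occurrence window, and the use of the log-Dice association score are all assumptions made for illustration. The sketch groups sentences into fixed-size date slices, optionally filters by genre metadata, and ranks the target word's collocates within each slice.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (year, genre, tokens of one sentence).
CORPUS = [
    (1905, "news", ["old", "man", "walks"]),
    (1908, "news", ["old", "man", "sleeps"]),
    (1912, "fiction", ["young", "man", "runs"]),
    (1915, "news", ["young", "man", "walks"]),
    (1918, "news", ["young", "woman", "runs"]),
]


def collocation_profile(corpus, target, slice_size=10, genre=None):
    """Rank the collocates of `target` separately for each date slice.

    Co-occurrence is counted per sentence (an assumption of this sketch);
    association strength is the log-Dice score 14 + log2(2*f12 / (f1 + f2)).
    """
    word_freq = defaultdict(Counter)  # slice -> word -> number of sentences
    pair_freq = defaultdict(Counter)  # slice -> collocate -> joint count with target

    for year, doc_genre, tokens in corpus:
        if genre is not None and doc_genre != genre:
            continue  # genre-local profile: skip other genres
        epoch = year - year % slice_size  # e.g. 1908 -> 1900
        words = set(tokens)
        word_freq[epoch].update(words)
        if target in words:
            pair_freq[epoch].update(words - {target})

    profile = {}
    for epoch, pairs in pair_freq.items():
        f1 = word_freq[epoch][target]
        scores = {
            w: 14 + math.log2(2 * f12 / (f1 + word_freq[epoch][w]))
            for w, f12 in pairs.items()
        }
        # Strongest collocates first; ties broken alphabetically.
        profile[epoch] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return profile


all_genres = collocation_profile(CORPUS, "man")
news_only = collocation_profile(CORPUS, "man", genre="news")

for epoch in sorted(all_genres):
    print(epoch, all_genres[epoch][:2])
```

In this toy data the strongest collocate of "man" shifts from "old" in the 1900 slice to "young" in the 1910 slice, and restricting the query to the "news" genre changes the ranking again. This is the kind of contrast that a diachronic, genre-aware profile is meant to surface.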
Article outline
- 1. Introduction
- 2. Related work
- 3. Implementation
- 3.1 Overview
- 3.2 Corpus data
- 3.3 Co-occurrence frequencies
- 3.3.1 Native co-occurrence relation
- 3.3.2 Term × document matrix co-occurrence relation
- 3.3.3 DDC co-occurrence relation
- 3.4 Scoring and pruning
- 3.5 Comparisons
- 3.6 Output & visualization
- 4. Examples
- 4.1 Adjectival attribution: What makes a “man”?
- 4.2 Pronominal adverbs and deictic locality
- 5. Conclusion
